Real-Time Occlusion-Aware Mixed Reality through Motion Capture Reconstruction and Instance Segmentation

Creator
Martin Rønning
Completion

Realistic occlusion in mixed reality (MR) broadcast applications requires per-pixel depth ordering between real performers and inserted virtual objects, information that is unavailable from standard broadcast video alone. This work proposes a real-time MR pipeline that addresses this problem by combining motion capture reconstruction (MCR) as an explicit depth source with frame-wise instance segmentation for per-pixel masking, implemented within a Unity environment. A spatially-sorted projection mapping algorithm associates AI-generated segmentation masks with their corresponding reconstructed performers, enabling depth-ordered composition via a custom shader. A synthetic fallback mechanism based on reconstructed mesh outlines handles frames where live segmentation fails. The pipeline is evaluated across five experimental runs and two sequences of varying scene complexity, using a three-phase benchmarking framework covering intersection detection, depth classification, and visual fidelity. MCR-based depth achieved near-perfect depth classification accuracy (>99%), compared to 68-80% for monocular depth estimation, confirming its feasibility for broadcast MR occlusion. The primary real-time configuration (Live+MCR) achieved a median pipeline latency of ~35ms, narrowly missing the 33.3ms threshold for 30 FPS, with optimization pathways identified. Additionally, a comparative evaluation of RF-DETR and YOLOv11 for real-time instance segmentation demonstrates that transformer-based architectures are viable for live MR inference, with RF-DETR-M achieving 8.94ms latency at competitive mask fidelity.
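The core compositing idea described above can be illustrated with a minimal sketch. This is a hypothetical NumPy reconstruction, not the thesis's Unity shader: it assumes the MCR stage yields a per-pixel performer depth map, the renderer yields a virtual-layer depth map, and the segmentation stage yields a boolean performer mask. The virtual layer is shown only where it lies in front of the nearest masked performer; a full implementation would also need scene (background) depth.

```python
import numpy as np

def composite_occlusion(frame, virtual_rgb, performer_depth, virtual_depth, mask):
    """Per-pixel depth-ordered compositing (illustrative sketch).

    frame           HxWx3  real broadcast frame
    virtual_rgb     HxWx3  rendered virtual-object layer
    performer_depth HxW    explicit depth from MCR for performer pixels
    virtual_depth   HxW    depth of the virtual layer (np.inf where empty)
    mask            HxW    boolean instance-segmentation mask of performers
    """
    # Outside the performer mask there is no explicit occluder depth,
    # so treat it as infinitely far (virtual content always shows there).
    occluder_depth = np.where(mask, performer_depth, np.inf)
    # Virtual content is visible only where it is closer than the occluder.
    virtual_in_front = virtual_depth < occluder_depth
    return np.where(virtual_in_front[..., None], virtual_rgb, frame)
```

The same depth comparison is what the three-phase benchmark's "depth classification" phase evaluates: for each real/virtual intersection pixel, which layer is nearer.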

Research questions investigated are:
* Can the combination of MCR and real-time instance segmentation produce precise, real-time explicit depth classification, and thereby explicit occlusion compositing, for occlusion-aware MR broadcast footage?
* How do CNN-based (YOLOv11) and transformer-based (RF-DETR) architectures compare in instance segmentation quality and real-time performance for multi-person broadcast footage?
* How does MCR-derived explicit depth compare to monocular depth estimation (Video Depth Anything) as a source of depth information for occlusion compositing in MR applications?
* How closely does real-time instance segmentation approach the segmentation quality achievable by a high-precision offline model (SAM3)?

Full thesis:


Short demo:

Sequence 1 (dynamic scene with flicker examples)

Sequence 2 (more stable scene but with more occlusions)