The MindSight Pipeline¶
MindSight processes each video frame through four sequential stages, connected by a shared FrameContext that carries data between them. This page explains the architecture at a conceptual level and points you to deeper references.
Overview¶
Every frame passes through the same four-stage pipeline:
- Object Detection -- find objects and persons in the frame.
- Gaze Estimation -- estimate where each person is looking and test for intersections.
- Phenomena Detection -- identify higher-level social behaviors from gaze data.
- Data Collection -- log results to CSV, accumulate heatmaps, and compose dashboards.
Each stage reads from and writes to the FrameContext, a per-frame data bus that keeps stages decoupled.
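As a mental model, the data bus can be sketched as a small dict-like class. This is an illustrative sketch only; the class and method names here are assumptions, not MindSight's actual implementation:

```python
class FrameContext:
    """Per-frame key/value store shared by all pipeline stages (sketch)."""

    def __init__(self, frame_index):
        self.frame_index = frame_index
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def get(self, key, default=None):
        return self._data.get(key, default)


# An early stage writes, a later stage reads -- no direct coupling:
ctx = FrameContext(frame_index=0)
ctx["persons"] = [{"bbox": (10, 20, 110, 220)}]
persons = ctx.get("persons", [])
```

Because stages only touch keys, you can swap or disable a stage without changing any other stage's code.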
Pipeline Architecture¶
```mermaid
flowchart LR
    subgraph Stage1["Stage 1: Object Detection"]
        OD[YOLO / YOLOE]
    end
    subgraph Stage2["Stage 2: Gaze Estimation"]
        GE[RetinaFace + Gaze Backend]
    end
    subgraph Stage3["Stage 3: Phenomena Detection"]
        PD[Enabled Trackers]
    end
    subgraph Stage4["Stage 4: Data Collection"]
        DC[CSV / Heatmap / Dashboard]
    end
    FC[(FrameContext)]
    Stage1 -- "objects, persons" --> FC
    FC -- "persons" --> Stage2
    Stage2 -- "persons_gaze, hits, hit_events" --> FC
    FC -- "hits, persons_gaze" --> Stage3
    Stage3 -- "tracker outputs" --> FC
    FC -- "all keys" --> Stage4
```
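The stage-to-stage flow in the diagram reduces to a simple loop. The sketch below uses stand-in stage functions (the real stage interfaces are not shown here) to illustrate how a shared context threads through all four stages:

```python
def run_pipeline(frames, stages):
    """Run each frame through every stage via a shared context dict (sketch)."""
    results = []
    for i, frame in enumerate(frames):
        ctx = {"frame_index": i, "frame": frame}
        for stage in stages:  # e.g. [detect, gaze, phenomena, collect]
            stage(ctx)        # each stage reads and writes ctx keys
        results.append(ctx)
    return results


# Stand-in stages that only touch their own keys:
def detect(ctx):    ctx["objects"] = []
def gaze(ctx):      ctx["hits"] = []
def phenomena(ctx): ctx["phenomena"] = []
def collect(ctx):   ctx["logged"] = True


out = run_pipeline(frames=[None, None], stages=[detect, gaze, phenomena, collect])
```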
FrameContext: The Data Bus¶
A FrameContext is created for every frame. It behaves like a dictionary: each pipeline stage reads the keys it needs and writes its results back. This design means stages have no direct dependencies on each other -- they only depend on the data contract defined by FrameContext keys.
Key examples:
| Key | Written by | Consumed by |
|---|---|---|
| `objects` | Object Detection | Gaze Estimation, Data Collection |
| `persons` | Object Detection | Gaze Estimation |
| `persons_gaze` | Gaze Estimation | Phenomena Detection, Data Collection |
| `hits` | Gaze Estimation | Phenomena Detection, Data Collection |
| `hit_events` | Gaze Estimation | Phenomena Detection |
Under the hood
See developer/frame-context.md for the full FrameContext key reference and typing information.
Stage 1: Object Detection¶
YOLO (or YOLOE for open-vocabulary detection) runs on each frame to produce bounding boxes for objects and persons. Detections are written to the FrameContext as objects and persons.
An ObjectPersistenceCache handles short-term occlusions by retaining recently-seen objects for a configurable number of frames, preventing flickering when objects are momentarily hidden.
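The persistence idea can be sketched in a few lines. The class and field names below are assumptions for illustration, not MindSight's actual API:

```python
class ObjectPersistenceCache:
    """Keep recently-seen objects alive for a grace window of frames (sketch)."""

    def __init__(self, max_missing_frames=5):
        self.max_missing = max_missing_frames
        self._last_seen = {}  # object id -> (frame_index, detection)

    def update(self, frame_index, detections):
        """Merge fresh detections with cached ones still within the window."""
        for det in detections:
            self._last_seen[det["id"]] = (frame_index, det)
        return [
            det for (seen_at, det) in self._last_seen.values()
            if frame_index - seen_at <= self.max_missing
        ]
```

An object that vanishes for a frame or two keeps appearing in the output, so overlays and trackers do not flicker.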
More details
See user-guide/object-detection.md for configuration options, model selection, and persistence tuning.
Stage 2: Gaze Estimation¶
This stage has three sub-steps:
- Face detection -- RetinaFace (via `uniface`) locates faces within person bounding boxes.
- Gaze inference -- The selected backend (MobileOne, Gazelle, L2CS, or UniGaze) estimates pitch and yaw angles for each face.
- Ray-object intersection -- A gaze ray is constructed from the face center using the estimated angles and tested against all object bounding boxes. Intersections are recorded as `hits`.
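A minimal 2-D version of the ray-object test can be sketched by sampling points along the gaze ray; the actual intersection math in MindSight may differ (e.g. an analytic ray-box test):

```python
import math


def ray_hits_box(origin, yaw_rad, box, max_len=2000.0, step=5.0):
    """Walk along the gaze ray and report whether it enters the box (sketch)."""
    ox, oy = origin
    dx, dy = math.cos(yaw_rad), math.sin(yaw_rad)
    x1, y1, x2, y2 = box
    t = 0.0
    while t <= max_len:
        px, py = ox + t * dx, oy + t * dy
        if x1 <= px <= x2 and y1 <= py <= y2:
            return True
        t += step
    return False
```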
Additional features applied at this stage:
- Smoothing -- Exponential moving average reduces jitter in gaze angles.
- Lock-on -- Once a gaze ray hits an object, the hit is sustained for a configurable grace period to handle brief look-aways.
- Snap -- Gaze rays within a threshold of an object edge are snapped to the object center.
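The smoothing step is a standard exponential moving average. A minimal sketch follows; the `alpha` value is an illustrative assumption, not MindSight's default:

```python
def ema(prev, new, alpha=0.3):
    """Blend the new raw angle with the previous smoothed angle (sketch)."""
    return alpha * new + (1 - alpha) * prev


# A step change in the raw angle converges gradually instead of jumping:
smoothed = 0.0
for raw in [10.0, 10.0, 10.0]:
    smoothed = ema(smoothed, raw)
```

Lower `alpha` means heavier smoothing and more lag; higher `alpha` tracks the raw signal more closely but keeps more jitter.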
More details
See user-guide/gaze-estimation.md for backend selection, smoothing parameters, and intersection logic.
Stage 3: Phenomena Detection¶
Enabled phenomenon trackers receive per-frame data and detect social gaze behaviors. Each tracker is an independent module that reads from the FrameContext and writes its own output keys.
Built-in phenomena include:
- Joint Attention -- Two or more people looking at the same object.
- Mutual Gaze -- Two people looking at each other.
- Gaze Following -- One person shifts gaze to match another's target.
- Gaze Leadership -- One person consistently leads gaze shifts.
- Gaze Aversion -- A person breaks eye contact.
- Social Referencing -- A person looks at another after encountering a stimulus.
- Attention Span -- Duration a person fixates on a single target.
- Scanpath -- Sequence of gaze targets over time.
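To make the tracker idea concrete, here is a sketch of the simplest phenomenon, Joint Attention, as a pure function over per-frame hit data. The `(person_id, object_id)` pair shape is an assumption for illustration:

```python
from collections import defaultdict


def joint_attention(hits):
    """Return objects looked at by two or more people in one frame (sketch).

    hits: list of (person_id, object_id) pairs.
    """
    lookers = defaultdict(set)
    for person, obj in hits:
        lookers[obj].add(person)
    return {obj: people for obj, people in lookers.items() if len(people) >= 2}
```

Real trackers additionally carry state across frames (e.g. minimum-duration thresholds), but each one stays an independent module reading from and writing to the FrameContext.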
More details
See user-guide/phenomena-overview.md for enabling/disabling trackers and configuring their parameters.
Stage 4: Data Collection¶
The final stage reads from the FrameContext and produces output:
- CSV logging -- Per-frame rows with gaze angles, hit objects, and active phenomena.
- Heatmap accumulation -- Spatial attention maps accumulated across frames.
- Dashboard composition -- An overlay combining the annotated frame, heatmap, and statistics panels.
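Heatmap accumulation amounts to incrementing a spatial grid cell for each gaze hit point. The grid size and coordinate handling below are illustrative assumptions:

```python
def accumulate(heatmap, hit_points, width, height):
    """Increment grid cells for each (x, y) hit in pixel space (sketch)."""
    rows, cols = len(heatmap), len(heatmap[0])
    for x, y in hit_points:
        r = min(int(y / height * rows), rows - 1)
        c = min(int(x / width * cols), cols - 1)
        heatmap[r][c] += 1
    return heatmap


grid = [[0] * 4 for _ in range(4)]
accumulate(grid, [(10, 10), (90, 90)], width=100, height=100)
```

Summing counts frame after frame is what turns instantaneous hits into a long-run spatial attention map.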
More details
See user-guide/data-output.md for output file formats, heatmap configuration, and dashboard layout options.
Frame Skipping¶
MindSight provides two frame-skipping options to improve throughput on long videos:
--skip-frames N¶
Runs the full pipeline only every N-th frame. On intermediate frames, the most recent FrameContext is reused so that overlays and data output remain continuous without re-running detection and gaze inference.
--skip-phenomena N¶
Runs phenomena trackers only every N-th frame, while object detection and gaze estimation still execute every frame. Useful when phenomena detection is expensive but you need full-resolution gaze data.
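The two options can be sketched together in one loop. The context reuse for `--skip-frames` and the modulo check for `--skip-phenomena` mirror the descriptions above; the actual CLI wiring is not shown:

```python
def process(frames, skip_frames=1, skip_phenomena=1):
    """Sketch of both skip modes.

    skip_frames: run the full pipeline only on every N-th frame,
                 reusing the most recent context in between.
    skip_phenomena: run phenomena trackers only on every N-th frame.
    """
    last_ctx, out = None, []
    for i, _frame in enumerate(frames):
        if i % skip_frames == 0:
            last_ctx = {
                "frame_index": i,
                "detected": True,
                "phenomena_ran": i % skip_phenomena == 0,
            }
        out.append(last_ctx)  # intermediate frames reuse the last context
    return out
```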
Warning
High skip values reduce temporal resolution for phenomena that depend on frame-to-frame transitions (e.g., Gaze Following, Gaze Leadership). Start with small values and verify output quality.
Performance Modes¶
MindSight includes several flags to trade output richness for speed:
| Flag | Effect |
|---|---|
| `--fast` | Enables all speed optimizations (combines the flags below). |
| `--lite-overlay` | Draws only bounding boxes and gaze rays; skips text labels and statistics. |
| `--no-dashboard` | Disables the dashboard composition step entirely. |
| `--profile` | Prints per-stage timing after each frame for performance diagnosis. |
Under the hood
--profile writes a profile.csv alongside the output video, which you can load in a spreadsheet to identify bottlenecks.