The MindSight Pipeline

MindSight processes each video frame through four sequential stages, connected by a shared FrameContext that carries data between them. This page explains the architecture at a conceptual level and points you to deeper references.

Overview

Every frame passes through the same four-stage pipeline:

Object Detection -- find objects and persons in the frame.
Gaze Estimation -- estimate where each person is looking and test for intersections.
Phenomena Detection -- identify higher-level social behaviors from gaze data.
Data Collection -- log results to CSV, accumulate heatmaps, and compose dashboards.

Each stage reads from and writes to the FrameContext, a per-frame data bus that keeps stages decoupled.

Pipeline Architecture

flowchart LR
    subgraph Stage1["Stage 1: Object Detection"]
        OD[YOLO / YOLOE]
    end

    subgraph Stage2["Stage 2: Gaze Estimation"]
        GE[RetinaFace + Gaze Backend]
    end

    subgraph Stage3["Stage 3: Phenomena Detection"]
        PD[Enabled Trackers]
    end

    subgraph Stage4["Stage 4: Data Collection"]
        DC[CSV / Heatmap / Dashboard]
    end

    FC[(FrameContext)]

    Stage1 -- "objects, persons" --> FC
    FC -- "objects, persons" --> Stage2
    Stage2 -- "persons_gaze, hits, hit_events" --> FC
    FC -- "persons_gaze, hits, hit_events" --> Stage3
    Stage3 -- "tracker outputs" --> FC
    FC -- "all keys" --> Stage4

FrameContext: The Data Bus

A FrameContext is created for every frame. It behaves like a dictionary: each pipeline stage reads the keys it needs and writes its results back. This design means stages have no direct dependencies on each other -- they only depend on the data contract defined by FrameContext keys.

Key examples:

Key	Written by	Consumed by
`objects`	Object Detection	Gaze Estimation, Data Collection
`persons`	Object Detection	Gaze Estimation
`persons_gaze`	Gaze Estimation	Phenomena Detection, Data Collection
`hits`	Gaze Estimation	Phenomena Detection, Data Collection
`hit_events`	Gaze Estimation	Phenomena Detection
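
The contract in the table can be sketched in a few lines. This is an illustration only, not MindSight's actual class: the `publish` method, the `written_by` map, and the placeholder detections are hypothetical, and the real key types are documented in the developer reference.

```python
from __future__ import annotations

from collections import UserDict
from typing import Any, Callable


class FrameContext(UserDict):
    """Hypothetical stand-in: a dict that also records which stage wrote each key."""

    def __init__(self, frame_index: int) -> None:
        super().__init__()
        self.frame_index = frame_index
        self.written_by: dict[str, str] = {}

    def publish(self, stage: str, **values: Any) -> None:
        # Store each value and remember its producer.
        for key, value in values.items():
            self.data[key] = value
            self.written_by[key] = stage


def object_detection(ctx: FrameContext) -> None:
    # Placeholder detections; a real stage would run a detector here.
    ctx.publish(
        "object_detection",
        objects=[{"label": "cup", "box": (100, 100, 160, 180)}],
        persons=[{"id": 0, "box": (300, 50, 420, 400)}],
    )


def gaze_estimation(ctx: FrameContext) -> None:
    # Reads only the keys it needs; it never calls the detection stage directly.
    persons, objects = ctx["persons"], ctx["objects"]
    ctx.publish(
        "gaze_estimation",
        persons_gaze=[{"id": p["id"], "pitch": 0.0, "yaw": 0.0} for p in persons],
        hits=[(persons[0]["id"], objects[0]["label"])],
        hit_events=[],
    )


STAGES: list[Callable[[FrameContext], None]] = [object_detection, gaze_estimation]

ctx = FrameContext(frame_index=0)
for stage in STAGES:
    stage(ctx)

print(ctx.written_by)
```

Because each stage touches only the context, you could swap a stage for another implementation as long as it publishes the same keys.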

Under the hood

See developer/frame-context.md for the full FrameContext key reference and typing information.

Stage 1: Object Detection

YOLO (or YOLOE for open-vocabulary detection) runs on each frame to produce bounding boxes for objects and persons. Detections are written to the FrameContext as objects and persons.

An ObjectPersistenceCache handles short-term occlusions by retaining recently seen objects for a configurable number of frames, which prevents flickering when objects are momentarily hidden.
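
The retention idea can be sketched as follows. This is not the actual ObjectPersistenceCache: the class name, the `max_missing_frames` parameter, and the assumption that every detection carries a stable `track_id` are hypothetical. The real cache may match objects across frames differently, for example by label or box overlap.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Detection:
    track_id: int  # assumed stable identifier across frames
    label: str
    box: tuple[float, float, float, float]


class PersistenceCacheSketch:
    """Keep objects alive for a few frames after they stop being detected."""

    def __init__(self, max_missing_frames: int = 5) -> None:
        self.max_missing_frames = max_missing_frames
        # track_id -> (last known detection, consecutive frames missing)
        self._entries: dict[int, tuple[Detection, int]] = {}

    def update(self, detections: list[Detection]) -> list[Detection]:
        seen = {d.track_id for d in detections}
        # Age entries that were not detected this frame; drop the stale ones.
        for track_id in list(self._entries):
            if track_id not in seen:
                det, missing = self._entries[track_id]
                if missing + 1 > self.max_missing_frames:
                    del self._entries[track_id]
                else:
                    self._entries[track_id] = (det, missing + 1)
        # Refresh entries that were detected this frame.
        for det in detections:
            self._entries[det.track_id] = (det, 0)
        return [det for det, _ in self._entries.values()]


cache = PersistenceCacheSketch(max_missing_frames=2)
cup = Detection(track_id=1, label="cup", box=(0, 0, 10, 10))
counts = [len(cache.update(frame)) for frame in ([cup], [], [], [])]
print(counts)
```

In this sketch the cup survives two missed frames and is dropped on the third.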

More details

See user-guide/object-detection.md for configuration options, model selection, and persistence tuning.

Stage 2: Gaze Estimation

This stage has three sub-steps:

Face detection -- RetinaFace (via uniface) locates faces within person bounding boxes.
Gaze inference -- The selected backend (MobileOne, Gazelle, L2CS, or UniGaze) estimates pitch and yaw angles for each face.
Ray-object intersection -- A gaze ray is constructed from the face center using the estimated angles and tested against all object bounding boxes. Intersections are recorded as hits.
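
The intersection step can be sketched as a 2D ray against an axis-aligned box (the slab test). The function names are hypothetical, and the mapping from pitch/yaw to an image-plane direction is an assumed convention; the backends may use a different sign or axis convention.

```python
from __future__ import annotations

import math


def gaze_direction(pitch: float, yaw: float) -> tuple[float, float]:
    """Project pitch/yaw (radians) to an image-plane direction (assumed convention)."""
    dx = -math.sin(yaw) * math.cos(pitch)
    dy = -math.sin(pitch)
    return dx, dy


def ray_hits_box(
    origin: tuple[float, float],
    direction: tuple[float, float],
    box: tuple[float, float, float, float],
    max_t: float = math.inf,
) -> float | None:
    """Slab test. Return the ray parameter t at entry into the box, or None on a miss."""
    x1, y1, x2, y2 = box
    t_near, t_far = 0.0, max_t
    slabs = ((origin[0], direction[0], x1, x2), (origin[1], direction[1], y1, y2))
    for o, d, lo, hi in slabs:
        if abs(d) < 1e-12:
            # Ray is parallel to this axis: it must already lie within the slab.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near


face_center = (100.0, 100.0)
direction = gaze_direction(pitch=0.0, yaw=-math.pi / 2)  # points along +x here
hit_t = ray_hits_box(face_center, direction, (200, 80, 260, 120))
miss_t = ray_hits_box(face_center, direction, (200, 150, 260, 200))
print(hit_t, miss_t)
```

When several boxes are hit, taking the one with the smallest t gives the nearest object along the ray. Note that under this convention a gaze aimed straight at the camera projects to a zero-length direction, so it can only "hit" a box that contains the face center.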

Additional features applied at this stage:

Smoothing -- Exponential moving average reduces jitter in gaze angles.
Lock-on -- Once a gaze ray hits an object, the hit is sustained for a configurable grace period to handle brief look-aways.
Snap -- Gaze rays within a threshold of an object edge are snapped to the object center.
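
Two of these features can be sketched briefly. The class names, the `alpha` weight, and measuring the grace period in frames are assumptions made for illustration; MindSight's actual parameters are described in the gaze estimation guide.

```python
from __future__ import annotations


class EmaSmoother:
    """Exponential moving average over (pitch, yaw); higher alpha follows new data faster."""

    def __init__(self, alpha: float = 0.5) -> None:
        self.alpha = alpha
        self._state: tuple[float, float] | None = None

    def update(self, pitch: float, yaw: float) -> tuple[float, float]:
        if self._state is None:
            # The first sample initializes the average.
            self._state = (pitch, yaw)
        else:
            a = self.alpha
            prev_pitch, prev_yaw = self._state
            self._state = (
                a * pitch + (1 - a) * prev_pitch,
                a * yaw + (1 - a) * prev_yaw,
            )
        return self._state


class LockOn:
    """Sustain the last hit for up to grace_frames consecutive misses."""

    def __init__(self, grace_frames: int = 3) -> None:
        self.grace_frames = grace_frames
        self._target: str | None = None
        self._misses = 0

    def update(self, raw_hit: str | None) -> str | None:
        if raw_hit is not None:
            self._target, self._misses = raw_hit, 0
        elif self._target is not None:
            self._misses += 1
            if self._misses > self.grace_frames:
                self._target = None
        return self._target


smoother = EmaSmoother(alpha=0.5)
smoother.update(0.0, 0.0)
smoothed = smoother.update(1.0, 2.0)

lock = LockOn(grace_frames=2)
sustained = [lock.update(hit) for hit in ["cup", None, None, None]]
print(smoothed, sustained)
```

The smoother averages angles directly, which is reasonable for the limited range of pitch and yaw but would need wraparound handling for full-circle angles.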

More details

See user-guide/gaze-estimation.md for backend selection, smoothing parameters, and intersection logic.

Stage 3: Phenomena Detection

Enabled phenomenon trackers receive per-frame data and detect social gaze behaviors. Each tracker is an independent module that reads from the FrameContext and writes its own output keys.

Built-in phenomena include:

Joint Attention -- Two or more people looking at the same object.
Mutual Gaze -- Two people looking at each other.
Gaze Following -- One person shifts gaze to match another's target.
Gaze Leadership -- One person consistently leads gaze shifts.
Gaze Aversion -- A person breaks eye contact.
Social Referencing -- A person looks at another after encountering a stimulus.
Attention Span -- Duration a person fixates on a single target.
Scanpath -- Sequence of gaze targets over time.
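
As an illustration of the first two definitions, a single-frame check might look like the sketch below. The hit format (pairs of person id and target) and the function names are hypothetical; the built-in trackers also reason over time, which this sketch ignores.

```python
from __future__ import annotations

from collections import defaultdict


def joint_attention(hits: list[tuple[int, str]]) -> dict[str, set[int]]:
    """Return objects that two or more persons are looking at, with their viewers."""
    viewers: dict[str, set[int]] = defaultdict(set)
    for person_id, target in hits:
        viewers[target].add(person_id)
    return {target: ids for target, ids in viewers.items() if len(ids) >= 2}


def mutual_gaze(person_hits: list[tuple[int, int]]) -> set[tuple[int, int]]:
    """Return pairs (a, b) with a < b where a looks at b and b looks at a."""
    looking = set(person_hits)
    return {
        (a, b)
        for a, b in looking
        if a < b and (b, a) in looking
    }


object_hits = [(0, "cup"), (1, "cup"), (2, "book")]
person_hits = [(0, 1), (1, 0), (2, 0)]

shared = joint_attention(object_hits)
pairs = mutual_gaze(person_hits)
print(shared, pairs)
```

Here persons 0 and 1 share attention on the cup and also look at each other, while person 2 looks at person 0 without reciprocation.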

More details

See user-guide/phenomena-overview.md for enabling/disabling trackers and configuring their parameters.

Stage 4: Data Collection

The final stage reads from the FrameContext and produces output:

CSV logging -- Per-frame rows with gaze angles, hit objects, and active phenomena.
Heatmap accumulation -- Spatial attention maps accumulated across frames.
Dashboard composition -- An overlay combining the annotated frame, heatmap, and statistics panels.
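
The first two outputs can be sketched with the standard library alone. The column names, the grid cell size, and the class name are hypothetical; the actual file formats are described in the data output guide.

```python
from __future__ import annotations

import csv
import io
from typing import TextIO

FIELDNAMES = ["frame", "person_id", "pitch", "yaw", "hit_object"]  # assumed columns


def write_rows(rows: list[dict], stream: TextIO) -> None:
    """Write one CSV row per person per frame."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


class HeatmapAccumulator:
    """Count gaze points per grid cell across frames."""

    def __init__(self, width: int, height: int, cell: int = 32) -> None:
        self.cell = cell
        self.cols = -(-width // cell)  # ceiling division
        self.rows = -(-height // cell)
        self.grid = [[0] * self.cols for _ in range(self.rows)]

    def add(self, x: float, y: float) -> None:
        col, row = int(x // self.cell), int(y // self.cell)
        # Ignore points that fall outside the frame.
        if 0 <= row < self.rows and 0 <= col < self.cols:
            self.grid[row][col] += 1


buffer = io.StringIO()
write_rows(
    [{"frame": 0, "person_id": 0, "pitch": 0.1, "yaw": -0.2, "hit_object": "cup"}],
    buffer,
)

heatmap = HeatmapAccumulator(width=64, height=64, cell=32)
for point in [(10, 10), (10, 10), (40, 10), (100, 100)]:
    heatmap.add(*point)

print(buffer.getvalue())
print(heatmap.grid)
```

A coarse grid keeps accumulation cheap; a renderer can later scale and colorize the counts for display.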

More details

See user-guide/data-output.md for output file formats, heatmap configuration, and dashboard layout options.

Frame Skipping

MindSight provides two frame-skipping options to improve throughput on long videos:

`--skip-frames N`

Runs the full pipeline only every N-th frame. On intermediate frames, the most recent FrameContext is reused so that overlays and data output remain continuous without re-running detection and gaze inference.

`--skip-phenomena N`

Runs phenomena trackers only every N-th frame, while object detection and gaze estimation still execute every frame. Useful when phenomena detection is expensive but you need full-resolution gaze data.
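
The scheduling behind both options can be sketched as a simple modulo check. The `schedule` function is hypothetical, and two details are assumptions: that frame 0 always runs, and that phenomena trackers never run on a frame where detection and gaze were skipped.

```python
from __future__ import annotations


def schedule(
    frame_index: int, skip_frames: int = 1, skip_phenomena: int = 1
) -> tuple[bool, bool]:
    """Return (run_detection_and_gaze, run_phenomena) for a frame."""
    if skip_frames < 1 or skip_phenomena < 1:
        raise ValueError("skip values must be >= 1")
    run_detection = frame_index % skip_frames == 0
    # Assumption: trackers need fresh gaze data, so they follow the detection schedule.
    run_phenomena = run_detection and frame_index % skip_phenomena == 0
    return run_detection, run_phenomena


# With skip_frames=3, frames 1 and 2 reuse the context produced on frame 0.
last_ctx = None
reused_from = []
for i in range(6):
    run_detection, _ = schedule(i, skip_frames=3)
    if run_detection:
        last_ctx = {"frame": i}  # stand-in for running the full pipeline
    reused_from.append(last_ctx["frame"])

print(reused_from)
```

With `skip_phenomena=2` and no frame skipping, the same function runs detection and gaze on every frame but the trackers only on even frames.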

Warning

High skip values reduce temporal resolution for phenomena that depend on frame-to-frame transitions (e.g., Gaze Following, Gaze Leadership). Start with small values and verify output quality.

Performance Modes

MindSight includes several flags that trade output richness for speed, plus `--profile` for diagnosing where time is spent:

Flag	Effect
`--fast`	Enables all speed optimizations (combines the flags below).
`--lite-overlay`	Draws only bounding boxes and gaze rays; skips text labels and statistics.
`--no-dashboard`	Disables the dashboard composition step entirely.
`--profile`	Prints per-stage timing after each frame for performance diagnosis.

Under the hood

--profile writes a profile.csv alongside the output video, which you can load in a spreadsheet to identify bottlenecks.
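
To see what per-stage timing involves, here is a minimal sketch built on `time.perf_counter`. The `StageTimer` class and the column layout (one row per frame, one column per stage, in milliseconds) are hypothetical and may not match the actual profile.csv format.

```python
from __future__ import annotations

import csv
import io
import time
from contextlib import contextmanager


class StageTimer:
    """Collect per-stage wall-clock timings, one row per frame."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self._current: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time in milliseconds even if the stage raises.
            self._current[name] = (time.perf_counter() - start) * 1000.0

    def end_frame(self, frame_index: int) -> None:
        self.rows.append({"frame": frame_index, **self._current})
        self._current = {}


timer = StageTimer()
for frame_index in range(2):
    with timer.stage("object_detection"):
        sum(range(10_000))  # stand-in for real work
    with timer.stage("gaze_estimation"):
        sum(range(1_000))
    timer.end_frame(frame_index)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(timer.rows[0]))
writer.writeheader()
writer.writerows(timer.rows)
print(out.getvalue())
```

Averaging each column over many frames is usually enough to show which stage dominates.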
