Gaze Estimation & Intersection¶

Overview¶

MindSight detects faces with RetinaFace, estimates a gaze direction for each face, constructs 2D rays, and intersects them with object bounding boxes to determine what each person is looking at.

The gaze sub-pipeline proceeds through the following stages:

graph LR
    A[Face Detection] --> B[Eye Landmark Extraction]
    B --> C[Gaze Estimation]
    C --> D[Temporal Smoothing]
    D --> E[Ray Construction]
    E --> F[Tip Snapping]
    F --> G[Lock-on]
    G --> H[Ray-BBox Intersection]

Each stage is configurable through command-line flags described in the sections below.

Gaze Backends¶

MindSight supports multiple gaze estimation backends. Select one by providing the appropriate model flag.

Backends are auto-discovered from GazeTracking/Backends/ via the backend registry. Any conforming backend placed in that directory is available automatically without code changes.

Backend	Trigger	Mode	Notes
MGaze (default)	`--mgaze-model <path>`	per-face	ONNX or PyTorch inference, auto-selected by file extension (`.onnx` or `.pt`). Fastest option for real-time use.
L2CS-Net	`--l2cs-model <weights>`	per-face	~3x more accurate than MGaze on MPIIGaze (3.92 deg MAE)
UniGaze	`--unigaze-model <variant>`	per-face	Best cross-dataset accuracy (~9.4 deg on Gaze360). Non-commercial license
Gazelle	`--gazelle-model <ckpt>`	scene-level	DINOv2 backbone; processes all faces in one forward pass; outputs heatmap

MGaze¶

MGaze is the default gaze estimation backend. It supports two inference modes, auto-detected from the model file extension:

ONNX mode (.onnx): Fastest. Hardware acceleration is auto-selected: CoreML on Apple Silicon Macs, CUDA on NVIDIA GPUs, CPU elsewhere.
PyTorch mode (.pt): For custom-trained models. Requires --mgaze-arch to specify the architecture (e.g., resnet18, resnet50). Slightly slower than ONNX due to lack of graph optimisation.

The default shipped model is mobileone_s0_gaze.onnx. Pass any weights file to --mgaze-model and the correct inference mode is selected automatically.

L2CS-Net¶

A higher-accuracy alternative that uses dual classification heads (one for pitch, one for yaw) to bin gaze angles into discrete classes, then refines with soft expectation. Approximately 3x more accurate than MGaze on the MPIIGaze benchmark (3.92 deg mean angular error). Heavier compute cost.

UniGaze¶

A unified gaze estimation model built on a ViT backbone with MAE pre-training, trained across multiple datasets. Achieves the best cross-dataset generalisation (~9.4 deg on Gaze360). Released under a non-commercial license -- check the license before deploying in production.

Gazelle¶

A scene-level model built on DINOv2. Instead of cropping individual faces, Gazelle processes the full frame and outputs a gaze heatmap for every detected face in a single forward pass. Best for multi-person scenes where per-face cropping is a bottleneck.

Ray Parameters¶

These flags control the geometry of the gaze ray drawn from each face.

--ray-length (float, default 1.0): Multiplier on ray length, expressed as a multiple of the detected face width. A value of 2.0 draws a ray twice as long as the face is wide.
--conf-ray: When enabled, scales the ray length by the gaze confidence score. High-confidence gazes produce longer rays; uncertain gazes produce shorter ones.
--gaze-cone (float, default 0.0): Replaces the single ray with a vision cone of the specified angle in degrees. A value of 0.0 disables the cone and uses a standard ray.

Adaptive Ray (Snapping)¶

Adaptive ray mode adjusts the ray endpoint toward nearby objects, simulating the tendency of gaze to land on salient targets.

--adaptive-ray <mode>

Modes¶

Mode	Behaviour
`off` (default)	No snapping. Ray follows raw gaze direction.
`extend`	Freely extends the ray toward the nearest qualifying object.
`snap`	Locks the ray endpoint to the centre of the nearest qualifying object.

Snap Distance¶

--snap-dist 150.0

The snap radius in pixels. Objects beyond this distance from the ray tip are not considered for snapping.

--snap-bbox-scale <fraction>

Adds a fraction of the object's bounding box diagonal to the snap radius. Larger objects become easier to snap to.

Weighted Scoring¶

When multiple objects fall within the snap radius, a weighted score determines the winner:

Weight Flag	Default	Factor
`--snap-w-dist`	1.0	Inverse distance from ray tip to object centre
`--snap-w-size`	0.0	Object bounding box area
`--snap-w-intersect`	0.0	Whether the raw ray already intersects the object

Hysteresis¶

--snap-switch-frames 8

The number of consecutive frames a new target must win the scoring before the snap actually switches to it. Prevents rapid flickering between nearby objects.

Gaze Lock-on¶

Lock-on detects sustained fixation on a single object and visually confirms it.

--gaze-lock

Parameters¶

--dwell-frames (int, default 15): Number of consecutive frames a gaze must remain on the same target before lock-on activates.
--lock-dist (int, default 100): Pixel radius around the target centre. The gaze ray tip must stay within this radius for dwell counting to continue.

Visual Feedback¶

A dwell arc appears around the face dot, filling progressively as dwell frames accumulate.
When lock-on activates, a "LOCKED" label is drawn on the target object.

Smoothing & Re-ID¶

Temporal Smoothing¶

Gaze direction is smoothed with an exponential moving average (EMA). The smoothing alpha is defined in constants.py. Lower alpha values produce smoother but more sluggish tracking; higher values are more responsive but noisier.

Face Re-ID¶

GazeSmootherReID tracks faces across frames using a combination of position proximity and colour histogram similarity. This allows the smoother to maintain per-identity state even when face detection IDs are not stable across frames.

--reid-grace-seconds (float, default 1.0): How long (in seconds) a lost face track remains in the re-ID buffer before being discarded.
--reid-max-dist (int, default 200): Maximum pixel distance between a new detection and a buffered track for re-ID matching to be considered.

Intersection Detection¶

Ray-BBox Intersection¶

MindSight uses the Liang-Barsky algorithm to test whether a gaze ray intersects an object's axis-aligned bounding box. When a hit is detected, the object is marked as a gaze target for that frame.

Cone-AABB Intersection¶

When --gaze-cone is enabled (value > 0), intersection testing switches to cone-AABB. The vision cone is tested against each object's bounding box to determine overlap.

Parameters¶

--hit-conf-gate (float, default 0.0): Minimum face detection confidence required for a gaze hit to be registered. Faces below this threshold are still drawn but their intersections are ignored.
--detect-extend (float, default 0.0): Extends detection by N pixels past the visual ray endpoint. Useful when the visual ray appears to just miss an object that the person is plausibly looking at.

Forward Gaze¶

--forward-gaze-threshold 5.0

When both pitch and yaw angles are below this threshold (in degrees), the gaze is classified as "looking at the camera." This is used by downstream phenomena such as mutual gaze and gaze aversion. Default: 5.0.

Parameter Reference¶

Flag	Type	Default	Description
`--mgaze-model`	str	None	Path to MGaze model (`.onnx` or `.pt`). Inference mode auto-detected from extension.
`--mgaze-arch`	str	None	Architecture name (required for `.pt` models only)
`--l2cs-model`	str	None	Path to L2CS-Net weights
`--unigaze-model`	str	None	UniGaze model variant
`--gazelle-model`	str	None	Path to Gazelle checkpoint
`--ray-length`	float	`1.0`	Ray length multiplier (x face width)
`--conf-ray`	flag	off	Scale ray length by gaze confidence
`--gaze-cone`	float	`0.0`	Vision cone angle in degrees (0 = ray)
`--adaptive-ray`	str	`off`	Snap mode: `off`, `extend`, `snap`
`--snap-dist`	float	`150.0`	Snap radius in pixels
`--snap-bbox-scale`	float	`0.0`	Fraction of bbox diagonal added to snap radius
`--snap-w-dist`	float	`1.0`	Snap weight: inverse distance
`--snap-w-size`	float	`0.0`	Snap weight: object area
`--snap-w-intersect`	float	`0.0`	Snap weight: raw intersection
`--snap-switch-frames`	int	`8`	Hysteresis frames before switching snap target
`--gaze-lock`	flag	off	Enable fixation lock-on
`--dwell-frames`	int	`15`	Frames of sustained gaze before lock activates
`--lock-dist`	int	`100`	Lock detection radius in pixels
`--reid-grace-seconds`	float	`1.0`	Re-ID buffer retention time in seconds
`--reid-max-dist`	int	`200`	Max pixel distance for re-ID matching
`--hit-conf-gate`	float	`0.0`	Minimum face confidence for gaze hits
`--detect-extend`	float	`0.0`	Extend detection past visual ray (pixels)
`--forward-gaze-threshold`	float	`5.0`	Pitch/yaw threshold for forward gaze (degrees)

Under the hood

For implementation details, see developer/gaze-processing-module.md.