Gaze Plugin Tutorial¶
See also: Phenomena Plugin Tutorial | Object Detection Plugin Tutorial | Data Collection Plugin Tutorial
This tutorial covers all three gaze plugin patterns by walking through real backends that ship with MindSight:
- Part A — Per-face mode: The MGaze backend (
GazeTracking/Backends/MGaze/MGaze_Tracking.py), which crops each face and estimates pitch/yaw angles independently. - Part B — Scene-level mode: The Gazelle backend (
Plugins/GazeTracking/Gazelle/gazelle_backend.py), which processes the full frame and all faces in a single DINOv2 forward pass. - Part C — Composite / processing augmentation: The GazelleSnap backend (
Plugins/GazeTracking/GazelleSnap/gazelle_snap_backend.py), which wraps an existing per-face backend and augments it with periodic Gazelle heatmap-based ray blending.
Per-Face vs Scene-Level: When to Use Which¶
| Aspect | Per-face (mode="per_face") |
Scene-level (mode="scene") |
|---|---|---|
| Core method | estimate(face_bgr) → (pitch, yaw, conf) |
estimate_frame(frame, bboxes) → [(xy, conf)] |
| Input | Single cropped face image | Full frame + all face bounding boxes |
| Gaze format | Pitch/yaw angles (radians) | Pixel coordinates in the original frame |
| GPU passes | One per face | One for all faces |
| Ray construction | Handled by run_pitchyaw_pipeline |
Handled by the gaze coordinator's default scene handler |
| Best for | Lightweight models, CPU, ONNX inference | Heavy models, GPU batch processing, heatmap outputs |
Choose your mode based on what your model produces. If it outputs pitch/yaw angles from a face crop, use per-face. If it takes the full scene and outputs gaze target coordinates, use scene-level.
Part A: Per-Face Backend (MGaze)¶
The MGaze plugin is MindSight's default gaze backend. It supports both ONNX and PyTorch inference, wrapping the vendored gaze-estimation library. It demonstrates the per-face pattern where estimate() receives a single cropped face and returns pitch/yaw angles.
Source: GazeTracking/Backends/MGaze/MGaze_Tracking.py
A1. File Structure¶
GazeTracking/Backends/MGaze/
├── __init__.py
├── MGaze_Tracking.py # PLUGIN_CLASS = MGazePlugin
├── MGaze_Config.py # DEFAULT_ONNX_MODEL, ARCH_CHOICES, DATA_CONFIG
└── gaze-estimation/ # Vendored gaze-estimation library
├── weights/
│ └── mobileone_s0_gaze.onnx # Default shipped model
├── models/
│ ├── resnet.py
│ ├── mobilenet.py
│ └── mobileone.py
├── onnx_inference.py # GazeEstimationONNX base class
└── config.py
Note
MGaze lives under GazeTracking/Backends/ (not Plugins/GazeTracking/). The gaze registry discovers both locations — built-in backends from GazeTracking/Backends/ and external plugins from Plugins/GazeTracking/.
A2. Configuration Module¶
MGaze_Config.py centralises model paths and dataset parameters:
DEFAULT_ONNX_MODEL = str(
Path(__file__).parent / "gaze-estimation" / "weights" / "mobileone_s0_gaze.onnx"
)
ARCH_CHOICES = [
"resnet18", "resnet34", "resnet50", "mobilenetv2",
"mobileone_s0", "mobileone_s1", "mobileone_s2",
"mobileone_s3", "mobileone_s4",
]
DATA_CONFIG = {
"gaze360": {"bins": 90, "binwidth": 4, "angle": 180},
"mpiigaze": {"bins": 28, "binwidth": 3, "angle": 42},
}
The DATA_CONFIG controls bin-based regression: gaze direction is predicted as a probability distribution over bins discrete bins, each binwidth degrees wide, spanning ±angle degrees.
A3. The Estimation Engines¶
MGaze wraps two interchangeable estimation engines behind the same estimate(face_bgr) interface.
PyTorch Engine¶
class GazeEstimationTorch:
def __init__(self, weight_path, arch, dataset="gaze360", device="auto"):
cfg = DATA_CONFIG[dataset]
self._bins, self._binwidth, self._angle = cfg["bins"], cfg["binwidth"], cfg["angle"]
self.device = resolve_device(device)
self.idx_tensor = torch.arange(self._bins, dtype=torch.float32, device=self.device)
model = utils_gaze.helpers.get_model(arch, self._bins, inference_mode=True)
model.load_state_dict(torch.load(weight_path, map_location=self.device))
self.model = model.to(self.device).eval()
self._tf = transforms.Compose([
transforms.ToPILImage(),
transforms.Resize(448),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
The estimate() method:
def estimate(self, face_bgr):
t = self._tf(cv2.cvtColor(face_bgr, cv2.COLOR_BGR2RGB)).unsqueeze(0).to(self.device)
with torch.no_grad():
pitch_logits, yaw_logits = self.model(t)
pp = F.softmax(pitch_logits, 1)
yp = F.softmax(yaw_logits, 1)
to_rad = lambda p: float(np.radians(
(torch.sum(p * self.idx_tensor) * self._binwidth - self._angle).item()))
conf = _softmax_confidence(float(pp.max()), float(yp.max()), self._bins)
return to_rad(pp), to_rad(yp), conf
Step-by-step:
- Preprocess — Convert BGR→RGB, resize to 448×448, normalize with ImageNet stats.
- Forward pass — Model outputs two sets of logits: one for pitch bins, one for yaw bins.
- Softmax — Convert logits to probability distributions.
- Expectation — Compute the weighted sum
Σ(probability × bin_index)to get the predicted bin, then convert to degrees and radians. - Confidence — The peak softmax probability indicates how "certain" the model is. The
_softmax_confidencehelper maps the average peak from[1/n_bins, 1]onto[0, 1].
ONNX Engine¶
class _GazeONNXWithConf(GazeEstimationONNX):
def estimate(self, face_bgr):
out = self.session.run(self.output_names, {"input": self.preprocess(face_bgr)})
pitch, yaw = self.decode(out[0], out[1])
pp, yp = self.softmax(out[0]), self.softmax(out[1])
conf = _softmax_confidence(float(pp.max()), float(yp.max()), self._bins)
return pitch, yaw, conf
Extends the vendored GazeEstimationONNX class with confidence scoring using the same _softmax_confidence formula. The preprocess, decode, and softmax methods are inherited from the base class.
Confidence Scoring¶
Both engines share this helper:
def _softmax_confidence(pitch_probs_max, yaw_probs_max, n_bins):
uniform = 1.0 / n_bins
return float(np.clip(
((pitch_probs_max + yaw_probs_max) / 2 - uniform) / (1 - uniform),
0, 1
))
A uniform distribution (maximum uncertainty) maps to 0.0; a perfect one-hot (maximum certainty) maps to 1.0.
A4. The Plugin Class¶
class MGazePlugin(GazePlugin):
name = "mgaze"
mode = "per_face"
is_fallback = True
def __init__(self, engine):
self._engine = engine
def estimate(self, face_bgr):
return self._engine.estimate(face_bgr)
def run_pipeline(self, **kwargs):
from GazeTracking.pitchyaw_pipeline import run_pitchyaw_pipeline
return run_pitchyaw_pipeline(gaze_eng=self, **kwargs)
Key decisions¶
is_fallback = True— MGaze is tried last, only if no other gaze plugin was activated. This makes it the automatic default without blocking user-installed plugins.- Wrapper pattern — The plugin wraps an interchangeable engine (
GazeEstimationTorchor_GazeONNXWithConf). The plugin class itself is thin — it delegatesestimate()directly. run_pipeline()delegation — Instead of letting the gaze coordinator's default handler crop faces and callestimate()individually, MGaze delegates torun_pitchyaw_pipeline. This shared pipeline handles face cropping, left-to-right sorting, temporal smoothing, ray construction, and adaptive snap for any per-face pitch/yaw backend.
The run_pitchyaw_pipeline helper¶
Any per-face plugin that outputs (pitch, yaw, confidence) can use this shared pipeline:
def run_pipeline(self, **kwargs):
from GazeTracking.pitchyaw_pipeline import run_pitchyaw_pipeline
return run_pitchyaw_pipeline(gaze_eng=self, **kwargs)
The pipeline handles:
- Face cropping — Extracts face ROIs from the full frame using RetinaFace bounding boxes.
- Eye centre extraction — Uses RetinaFace keypoints for accurate gaze origin (falls back to bbox centre).
- Left-to-right sorting — Deterministic face ordering for consistent track ID assignment.
- Temporal smoothing — Applies the
GazeSmootherReIDif one is available in context. - Ray construction — Converts pitch/yaw to 2D direction, scales by ray length and face width.
- Forward gaze dead zone — Suppresses errant rays when both angles are near zero.
- Adaptive snap — Extends or snaps ray tips to nearby objects with hysteresis.
Returns the standard 7-tuple: (persons_gaze, face_confs, face_bboxes, face_track_ids, face_objs, ray_snapped, ray_extended).
A5. CLI Activation¶
@classmethod
def add_arguments(cls, parser):
g = parser.add_argument_group("MGaze backend")
g.add_argument("--mgaze-model", default=DEFAULT_ONNX_MODEL,
help="Path to MGaze model weights (.onnx or .pt)")
g.add_argument("--mgaze-arch", default=None, choices=ARCH_CHOICES,
help="Architecture name (required for .pt models)")
g.add_argument("--mgaze-dataset", default="gaze360",
help="Dataset config key (default: gaze360)")
The from_args method auto-selects between ONNX and PyTorch based on the file extension:
@classmethod
def from_args(cls, args):
model = getattr(args, "mgaze_model", None)
if not model:
return None
model = Path(model)
if not model.exists():
raise FileNotFoundError(f"MGaze model not found: {model}")
if model.suffix.lower() == ".pt":
if not arch:
raise ValueError("--mgaze-arch is required for .pt models")
engine = GazeEstimationTorch(str(model), arch, dataset, device=device)
else:
# ONNX path: auto-select execution provider
prov = [p for p in prefs if p in ort.get_available_providers()]
engine = _GazeONNXWithConf(model_path=None,
session=ort.InferenceSession(str(model), providers=prov))
return cls(engine)
ONNX provider selection¶
The ONNX path tries providers in priority order: CoreML (Apple Silicon) → CUDA → DirectML → CPU. This gives automatic hardware acceleration without user configuration.
A6. Running MGaze¶
# Default ONNX (auto-selected, shipped with MindSight)
python MindSight.py --source video.mp4
# Explicit ONNX model
python MindSight.py --source video.mp4 --mgaze-model weights/resnet18_gaze.onnx
# PyTorch model (requires architecture specification)
python MindSight.py --source video.mp4 \
--mgaze-model weights/resnet50_gaze360.pt \
--mgaze-arch resnet50 \
--mgaze-dataset gaze360
A7. Writing Your Own Per-Face Plugin¶
To create a new per-face gaze backend as a plugin:
# Plugins/GazeTracking/MyBackend/my_backend.py
from __future__ import annotations
import sys
from pathlib import Path
_REPO_ROOT = Path(__file__).parent.parent.parent.parent
if str(_REPO_ROOT) not in sys.path:
sys.path.insert(0, str(_REPO_ROOT))
from Plugins import GazePlugin
class MyGazeBackend(GazePlugin):
name = "my_gaze"
mode = "per_face"
def __init__(self, model_path: str):
# Load your model here
self._model = self._load_model(model_path)
def estimate(self, face_bgr):
"""
Receive a cropped face image (BGR numpy array).
Return (pitch_radians, yaw_radians, confidence).
"""
# Your inference here — preprocess, forward pass, postprocess
pitch, yaw = self._model.predict(face_bgr)
confidence = 0.8 # your confidence metric
return float(pitch), float(yaw), confidence
def run_pipeline(self, **kwargs):
"""Delegate to the shared per-face pipeline."""
from GazeTracking.pitchyaw_pipeline import run_pitchyaw_pipeline
return run_pitchyaw_pipeline(gaze_eng=self, **kwargs)
@classmethod
def add_arguments(cls, parser):
g = parser.add_argument_group("My Gaze Backend")
g.add_argument("--my-gaze-model", default=None,
help="Path to model weights. Activates this backend.")
@classmethod
def from_args(cls, args):
model = getattr(args, "my_gaze_model", None)
if not model:
return None
return cls(model_path=model)
PLUGIN_CLASS = MyGazeBackend
What you get for free¶
By implementing just estimate() and delegating run_pipeline() to run_pitchyaw_pipeline, your plugin automatically inherits:
- Face cropping from RetinaFace detections
- Eye-landmark gaze origin (with bbox-centre fallback)
- Temporal smoothing via
GazeSmootherReID - Left-to-right face sorting for deterministic track IDs
- Confidence-scaled ray length (
--conf-ray) - Adaptive ray snapping with hysteresis (
--adaptive-ray) - Forward gaze dead zone (
--forward-gaze-threshold) - All CLI gaze flags work without any extra code in your plugin
If you need more control¶
Override run_pipeline() entirely to handle smoothing, ray construction, or multi-face batching yourself. Your method receives:
| Kwarg | Type | Description |
|---|---|---|
frame |
ndarray | Full BGR frame |
faces |
list[dict] | RetinaFace face detections |
objects |
list[Detection] | Non-person detections |
gaze_cfg |
GazeConfig | Ray and snap parameters |
smoother |
GazeSmootherReID | None | Temporal smoothing tracker |
snap_hysteresis |
SnapHysteresisTracker | None | Snap hysteresis tracker |
aux_frames |
dict | Auxiliary per-participant video streams |
Must return the 7-tuple: (persons_gaze, face_confs, face_bboxes, face_track_ids, face_objs, ray_snapped, ray_extended).
Part B: Scene-Level Backend (Gazelle)¶
Gazelle is a scene-level gaze estimator built on DINOv2. It processes the entire scene image together with face bounding boxes in a single forward pass, producing per-face gaze-point heatmaps.
Source: Plugins/GazeTracking/Gazelle/gazelle_backend.py
B1. File Structure¶
Plugins/GazeTracking/Gazelle/
├── __init__.py
├── gazelle_backend.py # PLUGIN_CLASS = GazeEstimationGazelle
└── gazelle/ # Vendored Gazelle library
└── gazelle/
├── model.py # get_gazelle_model(), load_gazelle_state_dict()
├── backbone.py # DINOv2 backbone
├── dataloader.py
└── utils.py
B2. Class Definition¶
mode = "scene" tells the gaze coordinator to call estimate_frame() (full frame + all bounding boxes) rather than estimate() (single cropped face).
Model variants¶
_VALID_MODELS = {
"gazelle_dinov2_vitb14",
"gazelle_dinov2_vitl14",
"gazelle_dinov2_vitb14_inout",
"gazelle_dinov2_vitl14_inout",
}
| Variant | Backbone | In/Out scoring |
|---|---|---|
gazelle_dinov2_vitb14 |
ViT-B/14 | No |
gazelle_dinov2_vitb14_inout |
ViT-B/14 | Yes |
gazelle_dinov2_vitl14 |
ViT-L/14 | No |
gazelle_dinov2_vitl14_inout |
ViT-L/14 | Yes |
The _inout variants add a head predicting whether each person is looking inside or outside the frame. When in-frame confidence falls below the threshold, gaze confidence is attenuated.
Constructor¶
def __init__(self, model_name, ckpt_path, inout_threshold=0.5,
skip_frames=0, use_fp16=False, use_compile=False, device="auto"):
| Parameter | Purpose |
|---|---|
model_name |
Which variant to load |
ckpt_path |
Path to .pt checkpoint |
inout_threshold |
Confidence cutoff for _inout models (default 0.5) |
skip_frames |
Reuse cached results for N frames between inference |
use_fp16 |
Half-precision on CUDA/MPS |
use_compile |
torch.compile() wrapper (PyTorch 2.0+) |
device |
"auto", "cpu", "cuda", or "mps" |
B3. The estimate_frame() Method¶
The core method for scene-level backends:
Data flow¶
- Early return if no faces.
- Frame-skip check — reuse cached result if skip is active and face count unchanged.
- BGR→RGB —
frame_bgr[:, :, ::-1]zero-copy view, then PIL wrap. - Normalize bboxes —
(x1/w, y1/h, x2/w, y2/h)for Gazelle's[0,1]range. - Transform — Resize 448×448, ToTensor, ImageNet normalize, unsqueeze, to device.
- Forward pass —
model({"images": tensor, "bboxes": [norm]})withtorch.no_grad(). - Heatmap extraction —
out["heatmap"][0]gives[N, 64, 64]per-face heatmaps. - Batched peak extraction:
hm_flat = heatmaps.flatten(start_dim=1) # [N, 4096]
maxvals, argmaxes = hm_flat.max(dim=1) # [N], [N]
All N heatmaps are processed in one batched operation, with a single .cpu().numpy() call.
- Pixel coordinate conversion:
- Inout attenuation — For
_inoutmodels, if confidence < threshold:conf *= score. - Cache and return — Store results for frame-skip reuse.
B4. The raw_heatmaps() Method¶
Returns the full [N, 64, 64] sigmoid-activated heatmaps. Useful for visualization or analysis beyond the peak point.
B5. CLI Activation¶
| Flag | Type | Default | Description |
|---|---|---|---|
--gazelle-model PATH |
str | None | Checkpoint path. Activates the backend. |
--gazelle-name |
choice | gazelle_dinov2_vitb14 |
Model variant |
--gazelle-inout-threshold |
float | 0.5 | In/out confidence threshold |
--gazelle-device |
str | auto |
Device override |
--gazelle-skip-frames |
int | 0 | Frames between inference |
--gazelle-fp16 |
flag | False | Half-precision |
--gazelle-compile |
flag | False | torch.compile() |
B6. Running Gazelle¶
# Standard usage
python MindSight.py --source video.mp4 \
--gazelle-model checkpoints/gazelle_dinov2_vitb14_inout.pt \
--gazelle-name gazelle_dinov2_vitb14_inout --gazelle-fp16
# With frame-skipping for slower hardware
python MindSight.py --source video.mp4 \
--gazelle-model checkpoints/gazelle.pt --gazelle-skip-frames 2
Key Design Patterns (Both Modes)¶
Backend selection¶
The gaze coordinator tries plugins in registration order. The first from_args that returns a non-None instance wins. Plugins with is_fallback = True (like MGaze) are tried last, making them the automatic default.
Lazy loading¶
All expensive operations (model loading, weight transfer to GPU) happen inside from_args(), not at import time. If the activation flag is not set, no resources are consumed.
PLUGIN_CLASS sentinel¶
Both plugins expose PLUGIN_CLASS = ClassName at module level. This is what PluginRegistry.discover() looks for:
# At the bottom of your module:
PLUGIN_CLASS = MGazePlugin # per-face
PLUGIN_CLASS = GazeEstimationGazelle # scene-level
sys.path setup¶
Both plugins prepend the repo root to sys.path so that from Plugins import GazePlugin resolves from the subdirectory:
_REPO_ROOT = Path(__file__).parent.parent.parent.parent
if str(_REPO_ROOT) not in sys.path:
sys.path.insert(0, str(_REPO_ROOT))
The depth (number of .parent calls) depends on how deep the plugin file is relative to the repo root.
Part C: Composite Backend — Adding Gaze Processing Features (GazelleSnap)¶
Not every gaze plugin needs to replace the entire estimation backend. GazelleSnap demonstrates a composite pattern: it wraps an existing per-face pitch/yaw backend and augments it with periodic Gazelle heatmap inference. The pitch/yaw backend provides fast per-frame gaze rays; every N frames a Gazelle forward pass produces a heatmap whose centroid biases the ray endpoint before standard adaptive-snap logic is applied.
This pattern is useful when you want to add a processing feature (heatmap blending, multi-model fusion, custom filtering) on top of any existing backend without rewriting it.
Source: Plugins/GazeTracking/GazelleSnap/gazelle_snap_backend.py
C1. File Structure¶
Plugins/GazeTracking/GazelleSnap/
├── __init__.py
└── gazelle_snap_backend.py # PLUGIN_CLASS = GazelleSnapPlugin
No vendored libraries — GazelleSnap imports the existing Gazelle plugin at runtime to create its heatmap engine.
C2. The Composite Concept¶
flowchart LR
A[Per-face backend<br/>MGaze / L2CS / UniGaze] -->|pitch, yaw, conf<br/>every frame| C[GazelleSnap<br/>run_pipeline]
B[Gazelle heatmap engine] -->|64×64 heatmap<br/>every N frames| C
C -->|blended ray endpoint| D[Standard adaptive snap]
D --> E[Final gaze output]
GazelleSnap holds two inner engines:
_py_engine— Any per-face pitch/yaw backend (MGaze, L2CS, UniGaze). Called every frame for fast gaze rays._gz_engine— A Gazelle instance used only for itsraw_heatmaps()method. Called everysnap_intervalframes to produce heatmap-based snap targets.
The heatmap centroid is blended into the pitch/yaw ray endpoint with a confidence-weighted factor that decays over time as the heatmap ages.
C3. Class Definition¶
Despite wrapping a scene-level model, GazelleSnap declares mode = "per_face" because its primary estimation path is per-face pitch/yaw. is_fallback = False ensures it takes priority over the default MGaze backend when activated.
Constructor¶
def __init__(self, pitchyaw_engine, gazelle_engine, *,
snap_interval=30,
heatmap_threshold=0.5,
heatmap_weight=1.0,
heatmap_decay=0.85,
obj_snap="all"):
self._py_engine = pitchyaw_engine
self._gz_engine = gazelle_engine
self._snap_interval = max(1, snap_interval)
self._hm_threshold = heatmap_threshold
self._hm_weight = heatmap_weight
self._hm_decay = heatmap_decay
self._obj_snap = obj_snap
# Per-track cached heatmaps and ages
self._cached_heatmaps: dict[int, np.ndarray] = {}
self._heatmap_ages: dict[int, int] = {}
self._frame_counter = 0
| Parameter | Default | Purpose |
|---|---|---|
pitchyaw_engine |
— | The inner per-face backend instance (created by from_args) |
gazelle_engine |
— | The inner Gazelle instance (used for raw_heatmaps()) |
snap_interval |
30 | Run Gazelle inference every N frames |
heatmap_threshold |
0.5 | Fraction of heatmap peak for centroid thresholding |
heatmap_weight |
1.0 | Blend weight toward heatmap target (0=ignore, 1=full) |
heatmap_decay |
0.85 | Per-frame multiplicative decay for stale heatmap confidence |
obj_snap |
"all" |
Object snap targets: "all", "faces_only", or "off" |
Per-track state¶
_cached_heatmaps— Maps track ID → most recent 64×64 heatmap numpy array._heatmap_ages— Maps track ID → frames since last heatmap refresh. Used for confidence decay._frame_counter— Global frame counter for snap interval timing.
C4. The run_pipeline() Method¶
GazelleSnap overrides run_pipeline() entirely rather than using run_pitchyaw_pipeline, because it needs to inject heatmap blending between ray construction and adaptive snap. The pipeline has three phases:
Phase 1: Per-face pitch/yaw estimation (every frame)¶
for f in faces:
x1, y1 = max(0, int(f["bbox"][0])), max(0, int(f["bbox"][1]))
x2, y2 = min(w, int(f["bbox"][2])), min(h, int(f["bbox"][3]))
crop = frame[y1:y2, x1:x2]
pitch, yaw, gc = self._py_engine.estimate(crop)
# ... eye centre extraction, left-to-right sorting, temporal smoothing
This is identical to the standard run_pitchyaw_pipeline flow — crop faces, call the inner engine's estimate(), extract eye centres, sort, smooth.
Phase 2: Conditional Gazelle heatmap inference (every N frames)¶
run_gazelle = (
raw_face_bboxes
and self._frame_counter % self._snap_interval == 0
)
if run_gazelle:
heatmaps = self._get_raw_heatmaps(frame, raw_face_bboxes)
for fi, tid in enumerate(face_track_ids):
if fi < heatmaps.shape[0]:
self._cached_heatmaps[tid] = heatmaps[fi]
self._heatmap_ages[tid] = 0
Every snap_interval frames, Gazelle runs a single forward pass and caches the per-face heatmaps keyed by track ID. Between inference runs, heatmap ages increment and stale tracks are pruned:
# Age all heatmaps each frame
for tid in self._heatmap_ages:
self._heatmap_ages[tid] += 1
# Prune tracks no longer present
active_tids = set(face_track_ids)
for tid in list(self._cached_heatmaps):
if tid not in active_tids:
del self._cached_heatmaps[tid]
Phase 3: Ray construction with heatmap blending¶
This is where the composite magic happens. After constructing the standard pitch/yaw ray endpoint (fb), the plugin checks for a cached heatmap:
hm = self._cached_heatmaps.get(tid)
if hm is not None:
ht, hm_conf = _heatmap_to_target(hm, h, w, self._hm_threshold)
if ht is not None:
age = self._heatmap_ages.get(tid, 0)
hm_conf *= self._hm_decay ** age # exponential decay
blend = min(1.0, self._hm_weight * hm_conf * 2.0)
# Only snap if heatmap target is in front of the face
if np.dot(ht - c, d) > 0 and blend > 0.05:
fb = fb * (1.0 - blend) + ht * blend # linear interpolation
Key details:
_heatmap_to_target()converts the 64×64 heatmap to a pixel-space centroid by thresholding atpeak × threshold_fracand computing a weighted average of surviving pixels.- Exponential decay — Each frame without a fresh heatmap multiplies the confidence by
heatmap_decay(default 0.85). After 30 frames at 0.85 decay, confidence drops to ~0.4% of the original. - Directional check —
np.dot(ht - c, d) > 0ensures the heatmap target is in the same direction the person is looking (not behind them). - Linear blend — The ray endpoint is interpolated between the pure pitch/yaw position and the heatmap centroid, weighted by decaying confidence.
After heatmap blending, the standard adaptive-snap logic runs on the blended endpoint, with the obj_snap parameter controlling which targets are considered.
C5. The Heatmap Utility¶
def _heatmap_to_target(hm, h, w, threshold_frac):
peak = hm.max()
if peak < 0.05:
return None, 0.0
mask = hm >= (peak * threshold_frac)
ys, xs = np.where(mask)
weights = hm[mask]
cx = np.average(xs, weights=weights) / 64.0 * w
cy = np.average(ys, weights=weights) / 64.0 * h
confidence = float(weights.mean())
return np.array([cx, cy], dtype=float), confidence
This converts a 64×64 heatmap to a single pixel-space point by:
- Rejecting heatmaps with negligible peaks (< 0.05)
- Thresholding to keep only pixels above
peak × threshold_frac - Computing the weighted centroid of surviving pixels
- Scaling from heatmap coordinates (0–63) to pixel coordinates
Using a weighted centroid rather than a raw argmax gives a more stable target that accounts for the spatial distribution of gaze probability.
C6. CLI Activation and Engine Composition¶
GazelleSnap's from_args is the most complex of any gaze plugin because it must instantiate two inner engines:
@classmethod
def from_args(cls, args):
if not getattr(args, "gazelle_snap", False):
return None
# 1. Create the Gazelle heatmap engine
from Plugins.GazeTracking.Gazelle.gazelle_backend import GazeEstimationGazelle
gazelle_engine = GazeEstimationGazelle(gz_name, gazelle_ckpt, ...)
# 2. Create the inner pitch/yaw engine via the gaze factory
from GazeTracking.gaze_factory import create_gaze_engine
args.gazelle_snap = False # prevent recursion
try:
pitchyaw_engine = create_gaze_engine(plugin_args=args)
finally:
args.gazelle_snap = True # restore flag
# 3. Validate the inner engine is per-face
if getattr(pitchyaw_engine, "mode", None) != "per_face":
raise ValueError("--gazelle-snap requires a per-face pitch/yaw backend")
return cls(pitchyaw_engine, gazelle_engine, ...)
The recursion guard¶
The key trick is temporarily clearing args.gazelle_snap = False before calling create_gaze_engine(). This prevents the factory from selecting GazelleSnap again (infinite recursion). The factory then falls through to the next available backend (MGaze, L2CS, etc.) and returns it as the inner engine. The flag is restored in a finally block.
CLI flags¶
| Flag | Type | Default | Description |
|---|---|---|---|
--gazelle-snap |
flag | False | Activates the composite backend |
--gs-gazelle-model PATH |
str | None | Gazelle checkpoint for heatmap inference |
--gs-gazelle-name |
choice | gazelle_dinov2_vitb14 |
Gazelle model variant |
--gs-snap-interval N |
int | 30 | Frames between Gazelle inference |
--gs-heatmap-threshold F |
float | 0.5 | Heatmap peak threshold fraction |
--gs-heatmap-weight F |
float | 1.0 | Blend weight (0=ignore, 1=full) |
--gs-heatmap-decay F |
float | 0.85 | Per-frame confidence decay |
--gs-obj-snap |
choice | all |
Object snap targets: all, faces_only, off |
C7. Running GazelleSnap¶
# With default MGaze inner backend
python MindSight.py --source video.mp4 \
--gazelle-snap \
--gs-gazelle-model checkpoints/gazelle.pt \
--gs-snap-interval 20
# With L2CS as the inner backend
python MindSight.py --source video.mp4 \
--gazelle-snap \
--gs-gazelle-model checkpoints/gazelle.pt \
--l2cs-model weights/l2cs.pkl \
--gs-heatmap-weight 0.7
# Disable object snapping (heatmap only)
python MindSight.py --source video.mp4 \
--gazelle-snap \
--gs-gazelle-model checkpoints/gazelle.pt \
--gs-obj-snap off
Expected startup output:
C8. Key Design Patterns for Composite Plugins¶
Wrapping, not replacing¶
GazelleSnap does not implement its own gaze estimation. It wraps an existing backend (_py_engine) and adds a processing layer. The estimate() method simply delegates:
This means GazelleSnap automatically benefits from any improvements to the inner backend without code changes.
Owning run_pipeline() for injection¶
The reason GazelleSnap can't use run_pitchyaw_pipeline is that it needs to inject heatmap blending between ray construction and adaptive snap. By owning the full pipeline, it controls exactly where in the processing chain the blending occurs.
If your composite plugin only needs to modify the final output (e.g. filtering or scaling ray endpoints), you could instead use run_pitchyaw_pipeline and post-process its return value.
Per-track state keyed by track ID¶
Heatmaps are cached per track ID (not per face index), so they survive face reordering and brief occlusions. When a track disappears from the scene, its cached heatmap is pruned to prevent memory growth.
Exponential decay for temporal blending¶
The heatmap_decay ** age formula smoothly transitions from "high confidence" (fresh heatmap) to "no confidence" (stale heatmap) without a hard cutoff. At the default decay of 0.85:
| Age (frames) | Confidence retained |
|---|---|
| 0 | 100% |
| 5 | 44% |
| 10 | 20% |
| 20 | 4% |
| 30 | 0.4% |
This means a heatmap remains influential for roughly 10–15 frames before the blend fades to near-zero.
Factory recursion guard¶
The args.gazelle_snap = False / try / finally pattern is essential for composite plugins that create inner engines via the gaze factory. Without it, the factory would select GazelleSnap again, causing infinite recursion.
Writing your own composite plugin¶
To create a composite gaze plugin that adds a processing feature:
- Subclass
GazePluginwithmode = "per_face"andis_fallback = False - Accept an inner engine in
__init__ - Delegate
estimate()to the inner engine for compatibility - Override
run_pipeline()to add your custom processing step - In
from_args(), use the factory recursion guard to instantiate the inner engine - Validate the inner engine's mode if your processing requires it