Object Detection

Overview

MindSight uses YOLO for real-time object detection in video frames. Two modes are available:

Standard YOLO -- text-class prompting using COCO class names.
YOLOE -- visual prompting with reference images and annotated bounding boxes, enabling zero-shot detection of arbitrary objects.

Detections are split into two categories:

Category	Purpose
Persons	Used as input for face detection and gaze estimation
Objects	Treated as gaze targets for intersection testing

This separation drives the downstream gaze pipeline: persons produce gaze rays, objects receive them.
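The split can be sketched in a few lines of Python. This is a minimal illustration, not MindSight's actual code: the `Detection` record and the `split_detections` helper are hypothetical names, and the sketch assumes that the COCO class name `person` is what marks a detection as a person.

```python
from typing import NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical detection record: class name, score, and xyxy box."""
    label: str
    conf: float
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def split_detections(detections):
    """Partition detections into (persons, objects) by class name."""
    persons = [d for d in detections if d.label == "person"]
    objects = [d for d in detections if d.label != "person"]
    return persons, objects


frame_detections = [
    Detection("person", 0.91, (40, 30, 220, 460)),
    Detection("cup", 0.62, (300, 310, 350, 380)),
    Detection("knife", 0.48, (360, 330, 450, 350)),
]
persons, objects = split_detections(frame_detections)
print([d.label for d in persons])  # ['person']
print([d.label for d in objects])  # ['cup', 'knife']
```

In this sketch the `persons` list would feed face detection and gaze estimation, while the `objects` list would be tested against the resulting gaze rays.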

YOLO Model Selection

Select a model with the --model flag:

--model yolov8n.pt

Available models, from fastest to most accurate:

Model	Size	Speed	Accuracy
`yolov8n.pt`	Nano (default)	Fastest	Lowest
`yolov8s.pt`	Small	Fast	Low
`yolov8m.pt`	Medium	Moderate	Moderate
`yolov8l.pt`	Large	Slow	High
`yolov8x.pt`	Extra-large	Slowest	Highest

Weights are auto-downloaded on first use and cached locally.

Filtering Classes

Restrict detection to specific COCO class names with --classes:

--classes person knife cup

Exclude specific classes with --blacklist:

--blacklist chair couch

Both flags accept one or more space-separated COCO class names. When --classes is set, only those classes are detected. When --blacklist is set, those classes are excluded from the results. If both are provided, --classes is applied first, and --blacklist then removes any matching classes from that set.
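The order of the two filters can be illustrated with a short Python sketch. The `filter_classes` helper is a hypothetical name, not part of MindSight, and it assumes that an empty --classes list (the default) means "no restriction".

```python
def filter_classes(labels, classes=(), blacklist=()):
    """Apply the allow-list first, then remove blacklisted names."""
    allowed = set(classes)
    blocked = set(blacklist)
    kept = []
    for label in labels:
        # Assumption: an empty allow-list keeps every class.
        if allowed and label not in allowed:
            continue
        if label in blocked:
            continue
        kept.append(label)
    return kept


detected = ["person", "chair", "cup", "couch", "knife"]
print(filter_classes(detected, classes=["person", "knife", "cup"]))
# ['person', 'cup', 'knife']
print(filter_classes(detected, blacklist=["chair", "couch"]))
# ['person', 'cup', 'knife']
print(filter_classes(detected, classes=["person", "chair"], blacklist=["chair"]))
# ['person']
```

The last call shows the combined case: `chair` passes the allow-list but is then removed by the blacklist.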

Confidence Threshold

--conf 0.35

The confidence threshold (default 0.35) controls the minimum score a detection must reach to be kept. Higher values reduce false positives but may miss weaker detections. Lower values produce more detections at the cost of more noise.
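As a rough illustration of the trade-off, the sketch below filters a list of scored detections at two thresholds. The `apply_conf_threshold` helper is hypothetical, and treating a score equal to the threshold as kept is an assumption.

```python
def apply_conf_threshold(scored, conf=0.35):
    """Keep (label, score) pairs whose score reaches the threshold."""
    # Assumption: a score exactly equal to the threshold is kept.
    return [(label, score) for label, score in scored if score >= conf]


scored = [("person", 0.91), ("cup", 0.4), ("knife", 0.22)]
print(apply_conf_threshold(scored))            # [('person', 0.91), ('cup', 0.4)]
print(apply_conf_threshold(scored, conf=0.5))  # [('person', 0.91)]
```

Raising the threshold from 0.35 to 0.5 drops the weaker `cup` detection, which is a removed false positive if it was spurious and a miss if the cup was really there.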

Detection Scale

--detect-scale 1.0

Values less than 1.0 downscale the frame before running detection, then rescale the resulting coordinates back to the original resolution. This trades detection accuracy for speed -- useful on high-resolution video or slower hardware.

Value	Effect
`1.0`	Full resolution (default)
`0.5`	Half resolution (one quarter of the pixels), up to roughly 4x faster
`0.25`	Quarter resolution (one sixteenth of the pixels), up to roughly 16x faster
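The coordinate handling can be sketched as follows. This is a minimal illustration of the arithmetic only: `scaled_size` and `rescale_box` are hypothetical helpers, and the actual resizing of the image is left out.

```python
def scaled_size(width, height, detect_scale):
    """Frame size fed to the detector for a given scale factor."""
    return (
        max(1, round(width * detect_scale)),
        max(1, round(height * detect_scale)),
    )


def rescale_box(box, detect_scale):
    """Map an (x1, y1, x2, y2) box from the downscaled frame back to the original."""
    return tuple(coord / detect_scale for coord in box)


w, h = scaled_size(1920, 1080, 0.5)
print(w, h)                                   # 960 540
print((1920 * 1080) / (w * h))                # 4.0
print(rescale_box((100, 50, 200, 150), 0.5))  # (200.0, 100.0, 400.0, 300.0)
```

The pixel count shrinks with the square of the scale factor, which is where the 4x and 16x figures in the table come from. Real speedups depend on the model and hardware and are usually somewhat smaller.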

Object Persistence Cache

--obj-persistence N

When an object disappears from detection (due to momentary occlusion or a YOLO miss), the persistence cache keeps its last-known bounding box alive for N additional frames. This prevents downstream gaze hits from flickering.

Default: 0 (disabled).
Ghost detections are rendered slightly transparent to distinguish them from live detections.
The cached bounding box is static (not interpolated), so large values of N may leave the box at a stale position if the object has moved.
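The behavior described above can be sketched with a small cache class. This is an illustration under stated assumptions, not MindSight's implementation: the `PersistenceCache` class is hypothetical, and so is the way objects are keyed across frames (a plain string key here).

```python
class PersistenceCache:
    """Keep the last-known box of a vanished object for up to n_frames frames."""

    def __init__(self, n_frames=0):
        self.n_frames = n_frames
        self._last = {}  # object key -> (box, frames since last seen)

    def update(self, live):
        """Take {key: box} for this frame; return {key: (box, is_ghost)}."""
        out = {key: (box, False) for key, box in live.items()}
        for key, (box, missed) in list(self._last.items()):
            if key in live:
                continue
            missed += 1
            if missed <= self.n_frames:
                self._last[key] = (box, missed)
                out[key] = (box, True)  # static box, not interpolated
            else:
                del self._last[key]  # persistence window expired
        for key, box in live.items():
            self._last[key] = (box, 0)  # seen this frame: reset the counter
        return out


cache = PersistenceCache(n_frames=2)
cup = (300, 310, 350, 380)
print(cache.update({"cup": cup}))  # {'cup': ((300, 310, 350, 380), False)}
print(cache.update({}))            # {'cup': ((300, 310, 350, 380), True)}
print(cache.update({}))            # {'cup': ((300, 310, 350, 380), True)}
print(cache.update({}))            # {}
```

With `n_frames=0` the cache never emits a ghost, which matches the disabled default.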

Visual Prompt Mode (YOLOE)

Instead of detecting objects by COCO class name, visual prompt mode lets you provide reference images with annotated bounding boxes. This enables zero-shot detection of custom objects that are not in the COCO class set.

--vp-file prompt.vp.json --vp-model yoloe-26l-seg.pt

The .vp.json file describes the reference images and their annotated regions. YOLOE uses these visual examples to locate matching objects in video frames.

For a full walkthrough on creating and using visual prompts, see the Visual Prompts guide.

Parameter Reference

Flag	Type	Default	Description
`--model`	str	`yolov8n.pt`	YOLO model weights
`--conf`	float	`0.35`	Detection confidence threshold
`--classes`	str[]	`[]`	Filter to specific COCO class names
`--blacklist`	str[]	`[]`	Exclude specific COCO class names
`--skip-frames`	int	`1`	Run detection on every Nth frame (`1` = every frame)
`--detect-scale`	float	`1.0`	Scale factor for detection pass
`--vp-file`	str	None	Path to `.vp.json` visual prompt file
`--vp-model`	str	`yoloe-26l-seg.pt`	YOLOE model for VP mode
`--obj-persistence`	int	`0`	Frames to keep ghost detections alive
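For readers who want to see how the types and defaults in the table fit together, here is a hypothetical `argparse` parser that mirrors it. It is a sketch only: `build_parser` is an illustrative name, and MindSight's real command-line handling may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flag table; not MindSight's actual CLI code."""
    p = argparse.ArgumentParser(description="Object detection options (sketch)")
    p.add_argument("--model", type=str, default="yolov8n.pt")
    p.add_argument("--conf", type=float, default=0.35)
    p.add_argument("--classes", nargs="+", default=[])
    p.add_argument("--blacklist", nargs="+", default=[])
    p.add_argument("--skip-frames", type=int, default=1)
    p.add_argument("--detect-scale", type=float, default=1.0)
    p.add_argument("--vp-file", type=str, default=None)
    p.add_argument("--vp-model", type=str, default="yoloe-26l-seg.pt")
    p.add_argument("--obj-persistence", type=int, default=0)
    return p


args = build_parser().parse_args(
    ["--classes", "person", "knife", "cup", "--blacklist", "chair", "--conf", "0.5"]
)
print(args.classes)    # ['person', 'knife', 'cup']
print(args.blacklist)  # ['chair']
print(args.conf, args.detect_scale, args.obj_persistence)  # 0.5 1.0 0
```

Using `nargs="+"` matches the "one or more space-separated names" behavior of --classes and --blacklist described earlier.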

Under the hood

For implementation details, see developer/object-detection-module.md.