Object Detection
Overview
MindSight uses YOLO for real-time object detection in video frames. Two modes are available:
- Standard YOLO -- text-class prompting using COCO class names.
- YOLOE -- visual prompting with reference images and annotated bounding boxes, enabling zero-shot detection of arbitrary objects.
Detected objects are separated into two categories:
| Category | Purpose |
|---|---|
| Persons | Used as input for face detection and gaze estimation |
| Objects | Treated as gaze targets for intersection testing |
This separation drives the downstream gaze pipeline: persons produce gaze rays, objects receive them.
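The split can be sketched in a few lines of Python. This is an illustrative sketch only, not MindSight's actual code: the `Detection` class and `split_detections` function are hypothetical names, and the sketch assumes the COCO label `person` is what marks a detection as a person.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detection record used only for this illustration."""
    label: str    # COCO class name, e.g. "person" or "cup"
    box: tuple    # (x1, y1, x2, y2) in pixels
    conf: float   # detection confidence score


def split_detections(detections):
    """Partition detections into persons (gaze sources) and objects (gaze targets)."""
    persons = [d for d in detections if d.label == "person"]
    objects = [d for d in detections if d.label != "person"]
    return persons, objects


frame = [
    Detection("person", (10, 20, 110, 320), 0.91),
    Detection("cup", (200, 180, 240, 230), 0.62),
    Detection("laptop", (300, 150, 520, 300), 0.78),
]
persons, objects = split_detections(frame)
```

Here `persons` would feed face detection and gaze estimation, while `objects` would be the candidate targets for intersection testing.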
YOLO Model Selection
Select a model with the --model flag. Available models, from fastest to most accurate:
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| `yolov8n.pt` | Nano (default) | Fastest | Lowest |
| `yolov8s.pt` | Small | Fast | Low |
| `yolov8m.pt` | Medium | Moderate | Moderate |
| `yolov8l.pt` | Large | Slow | High |
| `yolov8x.pt` | Extra-large | Slowest | Highest |
Weights are auto-downloaded on first use and cached locally.
Filtering Classes
Restrict detection to specific COCO class names with --classes, or exclude specific classes with --blacklist.
Both flags accept one or more space-separated COCO class names. When --classes is set, only those classes are detected. When --blacklist is set, those classes are excluded from results. If both are provided, --classes is applied first, then --blacklist removes from that set.
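The order of operations described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `filter_labels` is a hypothetical helper, and it assumes that an empty --classes list (the default) means "no restriction".

```python
def filter_labels(labels, classes=(), blacklist=()):
    """Apply the allow-list first, then remove blacklisted classes.

    Assumption: an empty `classes` sequence means every class is allowed.
    """
    allowed = set(classes)
    blocked = set(blacklist)
    kept = []
    for label in labels:
        if allowed and label not in allowed:
            continue  # dropped by --classes
        if label in blocked:
            continue  # dropped by --blacklist
        kept.append(label)
    return kept


detected = ["person", "cup", "chair", "laptop"]
only_listed = filter_labels(detected, classes=["person", "cup"])
without_chair = filter_labels(detected, blacklist=["chair"])
both = filter_labels(detected, classes=["person", "cup"], blacklist=["cup"])
```

With both flags set, `cup` passes the allow-list and is then removed by the blacklist, so only `person` remains.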
Confidence Threshold
The confidence threshold (set with --conf, default 0.35) is the minimum score a detection must reach to be kept. Higher values reduce false positives but may miss weaker detections. Lower values produce more detections at the cost of more noise.
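To make the trade-off concrete, the sketch below applies two thresholds to the same made-up scores. The `keep_confident` helper and the sample scores are hypothetical, and the sketch assumes a detection scoring exactly at the threshold is kept.

```python
def keep_confident(detections, conf=0.35):
    """Keep (label, score) pairs whose score reaches the threshold."""
    return [(label, score) for label, score in detections if score >= conf]


# Made-up scores for illustration only.
raw = [("person", 0.92), ("cup", 0.41), ("book", 0.30), ("chair", 0.12)]

default_result = keep_confident(raw)            # default threshold 0.35
strict_result = keep_confident(raw, conf=0.60)  # fewer, more certain detections
```

Raising the threshold from 0.35 to 0.60 drops the weaker `cup` detection in this example, which is the "missed weaker detections" cost described above.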
Detection Scale
The --detect-scale flag sets the resolution of the detection pass. Values less than 1.0 downscale the frame before running detection, then rescale the resulting coordinates back to the original resolution. This trades detection accuracy for speed -- useful on high-resolution video or slower hardware.
| Value | Effect |
|---|---|
| `1.0` | Full resolution (default) |
| `0.5` | Half resolution, roughly 4x faster |
| `0.25` | Quarter resolution, roughly 16x faster |
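The "roughly 4x" and "roughly 16x" figures follow from pixel count, which shrinks with the square of the scale factor; actual speedups depend on the model and hardware. The sketch below shows that arithmetic and the coordinate rescaling step. Both helper names are hypothetical, and the sketch assumes the same scale factor is applied to width and height.

```python
def detection_size(width, height, scale):
    """Frame size used for the detection pass at a given scale factor."""
    return max(1, round(width * scale)), max(1, round(height * scale))


def rescale_box(box, scale):
    """Map an (x1, y1, x2, y2) box from the downscaled frame back to the original."""
    return tuple(coord / scale for coord in box)


full_w, full_h = 1920, 1080
small_w, small_h = detection_size(full_w, full_h, 0.5)

# Pixel count drops by 1 / scale**2, hence "roughly 4x" at scale 0.5.
pixel_ratio = (full_w * full_h) / (small_w * small_h)

# A box found in the half-resolution frame, mapped back to original pixels.
original_box = rescale_box((100, 50, 200, 150), 0.5)
```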
Object Persistence Cache
When an object disappears from detection (due to momentary occlusion or a YOLO miss), the persistence cache keeps its last-known bounding box alive for N additional frames, where N is set with --obj-persistence. This prevents downstream gaze hits from flickering.
- Default: `0` (disabled).
- Ghost detections are rendered slightly transparent to distinguish them from live detections.
- The cached bounding box is static (not interpolated), so large values may produce stale positions.
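The behaviour in the list above can be sketched as a small cache. This is an illustration under stated assumptions, not MindSight's implementation: `PersistenceCache` is a hypothetical class, and it assumes each object carries a stable identifier across frames (how the real module matches objects between frames is not described here).

```python
class PersistenceCache:
    """Keep the last-known box of a missing object for up to `max_age` frames."""

    def __init__(self, max_age):
        self.max_age = max_age
        self._entries = {}  # object id -> (box, frames since last seen)

    def update(self, live):
        """Take {id: box} for the current frame; return {id: (box, is_ghost)}."""
        result = {}
        for obj_id, box in live.items():
            self._entries[obj_id] = (box, 0)
            result[obj_id] = (box, False)
        for obj_id in list(self._entries):
            if obj_id in live:
                continue
            box, age = self._entries[obj_id]
            age += 1
            if age > self.max_age:
                del self._entries[obj_id]       # expired: stop reporting it
            else:
                self._entries[obj_id] = (box, age)
                result[obj_id] = (box, True)    # static, non-interpolated ghost
        return result


cache = PersistenceCache(max_age=2)
seen = cache.update({"cup": (0, 0, 10, 10)})
ghost_1 = cache.update({})
ghost_2 = cache.update({})
gone = cache.update({})
```

With `max_age=0` the first missed frame already exceeds the limit, so no ghosts are produced, which matches the "disabled" default.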
Visual Prompt Mode (YOLOE)
Instead of detecting objects by COCO class name, visual prompt mode lets you provide reference images with annotated bounding boxes. This enables zero-shot detection of custom objects that are not in the COCO class set.
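Pass the prompt file with --vp-file. The .vp.json file describes the reference images and their annotated regions, and YOLOE uses these visual examples to locate matching objects in video frames. The YOLOE weights can be changed with --vp-model (default yoloe-26l-seg.pt).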
For a full walkthrough on creating and using visual prompts, see the Visual Prompts guide.
Parameter Reference
| Flag | Type | Default | Description |
|---|---|---|---|
| `--model` | str | `yolov8n.pt` | YOLO model weights |
| `--conf` | float | `0.35` | Detection confidence threshold |
| `--classes` | str[] | `[]` | Filter to specific COCO class names |
| `--blacklist` | str[] | `[]` | Exclude specific COCO class names |
| `--skip-frames` | int | `1` | Run detection every N frames |
| `--detect-scale` | float | `1.0` | Scale factor for detection pass |
| `--vp-file` | str | None | Path to .vp.json visual prompt file |
| `--vp-model` | str | `yoloe-26l-seg.pt` | YOLOE model for VP mode |
| `--obj-persistence` | int | `0` | Frames to keep ghost detections alive |
Under the hood
For implementation details, see developer/object-detection-module.md.