// msc thesis · computer vision

VisionSleuth AI

An intelligent weapon detection and threat analysis platform for real-time video surveillance. It pairs fine-tuned YOLO11 models with a custom Threat Association Engine that scores weapon-person associations from spatial proximity and bounding-box overlap.

Status: MSc Thesis Project (Defense Pending) | Lead: Fatma Duran

Thesis Supervisor: Prof. Dr. Kaan Yılancıoğlu
Head of Biosecurity Master's Program, Üsküdar University
kaan.yilancioglu@uskudar.edu.tr


System Architecture

Core Stack

→ Detection: YOLO11n/s (fine-tuned)
→ Tracking: ByteTrack + EMA smoothing
→ Threat Analysis: Proximity + Overlap scoring
→ Backend: FastAPI 0.115 + Python 3.11
→ Frontend: Next.js 14 + Canvas API
→ Database: PostgreSQL + SQLAlchemy
→ Deployment: Render + Vercel

Model Versions

Version	Parameters	Size	Use case	Latency
V3 (YOLO11n)	2.6M	5.4 MB	Live analysis (CPU)	~94 ms/frame
V4 (YOLO11s)	9.4M	57 MB	Video upload (GPU)	~226 ms/frame

Machine Learning Pipeline

Two fine-tuned YOLO11 variants are maintained. V3 (YOLO11n) serves live analysis because of its low memory footprint; V4 (YOLO11s) provides higher accuracy (mAP@0.5 = 0.748) and handles offline video processing.

Custom Dataset: VisionSleuth Dataset

→ 28,409 images across 11 threat detection classes
→ Classes: Handgun, Rifle, Knife, Blunt Weapon, Scissors, Toothbrush, Smartphone, Person + 3 others
→ Hard-negative classes (scissors, toothbrush, smartphone) included to suppress false positives
→ Test set: 770 images with full evaluation across IoU thresholds 0.50-0.95

Validation Metrics (V4 - YOLO11s)

→ mAP@0.5: 0.748 (mean average precision over all 11 classes)
→ mAP@0.5:0.95: ~0.52 (stricter localization standard)
→ Precision: 0.805 (80.5% of detections are correct)
→ Recall: 0.686 (68.6% of all true objects detected)

Per-Class Performance

✓ Handgun: AP@0.5 = 0.908 — Excellent (4,777 training samples)

✓ Knife: AP@0.5 = 0.737 — Good (4,574 training samples)

✓ Blunt Weapon: AP@0.5 = 0.688 — Acceptable (2,960 training samples)

⚠ Rifle: AP@0.5 = 0.287 — CRITICAL (only 234 training samples) — Future work: augment data to 1,000+ samples

Threat Association Engine (v5)

Original academic contribution: instead of a binary IoU test, the engine computes a continuous threat_score that combines spatial proximity and bounding-box overlap to decide whether a weapon is associated with a person.

threat_score = weapon_weight × (α × proximity_score + β × overlap_score) × weapon_confidence
α (proximity weight) = 0.40
β (overlap weight) = 0.60 ← weapon-in-hand is the most certain signal
proximity_score = 1.0 when the weapon is at the person's center; falls to 0 at a distance of 30% of the frame diagonal
overlap_score = 1.0 when the weapon is inside the person bbox; 0.0 when there is no overlap
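
As an illustration of the formula above, here is a minimal Python sketch. It is not the engine's actual code. The weights 0.40 and 0.60 and the 30% cutoff come from the definitions above; the linear falloff of proximity_score, the definition of overlap_score as the fraction of the weapon box lying inside the person box, the (x1, y1, x2, y2) box layout, and all function names are assumptions made for this example.

```python
import math

ALPHA = 0.40          # proximity weight (from the text)
BETA = 0.60           # overlap weight (from the text)
MAX_DIST_FRAC = 0.30  # proximity reaches 0 at 30% of the frame diagonal

def center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def proximity_score(weapon, person, frame_w, frame_h):
    """1.0 at the person's center, 0.0 at or beyond the cutoff distance.
    Assumption: the falloff between those two points is linear."""
    wx, wy = center(weapon)
    px, py = center(person)
    dist = math.hypot(wx - px, wy - py)
    limit = MAX_DIST_FRAC * math.hypot(frame_w, frame_h)
    return max(0.0, 1.0 - dist / limit)

def overlap_score(weapon, person):
    """Assumption: fraction of the weapon box area inside the person box."""
    iw = max(0.0, min(weapon[2], person[2]) - max(weapon[0], person[0]))
    ih = max(0.0, min(weapon[3], person[3]) - max(weapon[1], person[1]))
    area = (weapon[2] - weapon[0]) * (weapon[3] - weapon[1])
    return (iw * ih) / area if area > 0 else 0.0

def threat_score(weapon, person, frame_w, frame_h,
                 weapon_weight, weapon_confidence):
    """weapon_weight * (ALPHA * proximity + BETA * overlap) * confidence."""
    combined = (ALPHA * proximity_score(weapon, person, frame_w, frame_h)
                + BETA * overlap_score(weapon, person))
    return weapon_weight * combined * weapon_confidence
```

Under these assumptions, a handgun (weight 0.95) detected at confidence 0.9, centered on a person and fully inside that person's box, scores 0.95 × (0.40 + 0.60) × 0.9 = 0.855.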

Alert Levels

Level	Threshold	UI behavior
CRITICAL	score ≥ 0.65	Red pulsing banner + audio alert
WARNING	0.35 ≤ score < 0.65	Orange indicator
UNCONFIRMED	0.05 < score < 0.35	Suppressed from UI (noise filtered)
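
The thresholds above map to a simple cascade. The sketch below is illustrative only: the function name is hypothetical, and returning None for scores at or below 0.05 is an assumption, since the table does not say how those are handled.

```python
from typing import Optional

CRITICAL_MIN = 0.65     # score >= 0.65
WARNING_MIN = 0.35      # score >= 0.35
UNCONFIRMED_MIN = 0.05  # score > 0.05 (strict inequality, per the table)

def alert_level(score: float) -> Optional[str]:
    """Map a threat score to an alert level, checking the highest tier first."""
    if score >= CRITICAL_MIN:
        return "CRITICAL"
    if score >= WARNING_MIN:
        return "WARNING"
    if score > UNCONFIRMED_MIN:
        return "UNCONFIRMED"
    # Assumption: anything at or below 0.05 is discarded as noise.
    return None
```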

Weapon Severity Weights

🔴 Rifle: 1.00 (highest lethality)
🔴 Handgun/Gun: 0.95 (high lethality, concealable)
🟠 Machete/Axe: 0.85–0.90 (high-lethality bladed)
🟠 Knife/Blade: 0.80 (common street weapon)
🟡 Blunt Weapon/Baseball Bat: 0.65 (severe but lower lethality)
🟡 Scissors: 0.50 (lowest threat; hard-negative class)
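
One plausible way to hold these weights is a lookup table keyed by class label, sketched below. The label strings, the function name, and the default of 0.0 for unlisted classes are assumptions for this example. Machete and Axe are left out because the list gives only a shared range (0.85–0.90), not a value per class.

```python
# Severity weights from the list above; label spellings are hypothetical.
SEVERITY_WEIGHTS = {
    "rifle": 1.00,
    "handgun": 0.95,
    "gun": 0.95,
    "knife": 0.80,
    "blade": 0.80,
    "blunt weapon": 0.65,
    "baseball bat": 0.65,
    "scissors": 0.50,
}

def severity_weight(label: str, default: float = 0.0) -> float:
    """Look up a class label case-insensitively.
    Assumption: classes without a listed weight contribute `default`."""
    return SEVERITY_WEIGHTS.get(label.strip().lower(), default)
```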

Temporal Tracking & Confirmation

ByteTrack maintains persistent object identity across frames, with Exponential Moving Average (EMA) smoothing applied on top. The EMA factor α = 0.40 (a separate parameter from the proximity weight α in the threat score) responds quickly while dampening single-frame spikes.

Live Analysis Mode

CONFIRM_FRAMES: 1 (instant display)
FORGET_FRAMES: 5 (evict track after 5 missed)
EMA α: 0.40 (fast response)
Use case: Webcam real-time analysis

Video Upload Mode

CONFIRM_FRAMES: 2 (multi-frame confirmation)
FORGET_FRAMES: 5 (evict track after 5 missed)
EMA α: 0.40 (same smoothing)
Use case: Video file processing (stricter)
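
The confirm/forget logic and the EMA can be sketched as a small per-track state machine. This is an illustration under several assumptions, not the project's implementation: track IDs are taken as given (for example from ByteTrack), the EMA is applied to a per-track threat score, the first observation seeds the average, and confirmation counts total frames seen rather than consecutive ones. Class and attribute names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TrackState:
    score: float     # EMA-smoothed value
    hits: int = 0    # frames in which the track was observed
    misses: int = 0  # consecutive frames in which it was missing

class ThreatSmoother:
    def __init__(self, confirm_frames: int, forget_frames: int = 5,
                 ema_alpha: float = 0.40):
        self.confirm_frames = confirm_frames
        self.forget_frames = forget_frames
        self.ema_alpha = ema_alpha
        self.tracks: dict[int, TrackState] = {}

    def update(self, observations: dict[int, float]) -> dict[int, float]:
        """Feed one frame of {track_id: raw_score}.
        Returns smoothed scores for confirmed tracks that are still alive."""
        a = self.ema_alpha
        for tid, raw in observations.items():
            st = self.tracks.get(tid)
            if st is None:
                # Assumption: seed the EMA with the first observation.
                st = self.tracks[tid] = TrackState(score=raw)
            else:
                st.score = a * raw + (1.0 - a) * st.score
            st.hits += 1
            st.misses = 0
        for tid in list(self.tracks):
            if tid not in observations:
                st = self.tracks[tid]
                st.misses += 1
                if st.misses >= self.forget_frames:
                    del self.tracks[tid]  # evict after FORGET_FRAMES misses
        return {tid: st.score for tid, st in self.tracks.items()
                if st.hits >= self.confirm_frames}
```

With confirm_frames=1 (live mode) a track is reported on its first frame; with confirm_frames=2 (upload mode) it is withheld until it has been seen twice.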

Frontend & Deployment

Detection Overlay

🔵 Blue: Person detected (no weapon)
🔴 Red (6px stroke, outer glow): CRITICAL, armed suspect
🟠 Orange: WARNING, potential threat
🟡 Yellow: Non-threat objects
🔊 Audio: Three-burst 880 Hz tone (Web Audio API)
⏱️ Debounce: 4-second suppression to prevent alert fatigue
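
The debounce rule can be sketched independently of the UI framework. The example below is written in Python purely to show the logic; the actual frontend is Next.js, and nothing here reflects its code. The class name, the injectable clock, and the choice that a suppressed alert does not restart the 4-second window are assumptions.

```python
import time

class AlertDebouncer:
    """Allow an alert at most once per suppression window."""

    def __init__(self, window_s: float = 4.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock        # injectable so the logic can be tested
        self._last_fired = None   # time of the last alert that was allowed

    def should_fire(self) -> bool:
        now = self.clock()
        if (self._last_fired is not None
                and now - self._last_fired < self.window_s):
            # Still inside the window: suppress, and leave the window as-is.
            return False
        self._last_fired = now
        return True
```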

Deployment Stack

→ Backend: Render (Web Service, uvicorn, 1 worker)
→ ML Model: best_v3.pt preloaded into RAM at startup
→ Frontend: Vercel (Next.js 14 static export)
→ Database: Render PostgreSQL (optional; memory-only fallback)
→ Monitoring: Sentry (error tracking), Prometheus (metrics)
→ CI/CD: GitHub Actions (lint, audit, test, secrets scan)

Known Limitations & Future Work

⚠️ Rifle Detection (AP=0.287): Severely imbalanced training data (234 samples). Future work: expand to 1,000+ samples via Roboflow, then fine-tune again.
⚠️ Camera Perspective: The model was trained on overhead CCTV footage, so a frontal webcam introduces domain shift. Mitigation: lower the confidence threshold to 0.15.
⚠️ CPU-only Inference: At 94 ms/frame on the Render free tier, live analysis is limited to ~10 FPS. GPU deployment would achieve <20 ms/frame.
⚠️ Database Persistence: Without DATABASE_URL, job history is lost on restart. Neon serverless PostgreSQL is recommended for production.
💡 Two-Stage Architecture: A person detector followed by a weapon classifier restricted to each person bbox could reduce false positives from background objects.
