Scene Reader
Real-Time Visual Accessibility Through Vision-Language Models
GPT-4V · YOLOv8 · BLIP-2 · RAG · Chain-of-Thought
DS-5690 Generative AI | Vanderbilt University | Fall 2025
Systematically evaluated 9 transformer-based vision AI approaches for providing real-time visual assistance to the 250 million blind and visually impaired individuals worldwide. Identified optimal architectures achieving sub-2-second latency for practical accessibility applications.
Key Achievement
0.54s
Mean latency (fastest approach)
3x
Speedup vs pure VLM approaches
$0.005
Cost per query (scalable)
The Accessibility Challenge
Over 7 million blind and visually impaired individuals in the United States face a fundamental barrier: they cannot access visual information that sighted individuals take for granted. Current accessibility tools are too slow, too expensive, or missing the contextual understanding necessary for real-time assistance.
Research Question: Which vision AI architecture achieves the optimal balance of latency, accuracy, and cost for real-time accessibility applications?
9 Approaches Tested
Baseline Approaches
- Approach 1: Pure VLM (GPT-4V, Claude, Gemini)
- Approach 2: Hybrid (YOLO detection + LLM generation; see the sketch after this list)
- Approach 3: Specialized (YOLO + OCR + Depth estimation)
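The hybrid pattern in Approach 2 is the backbone of the fastest variants. A minimal sketch, assuming the ultralytics YOLOv8 API and the OpenAI Python client; the prompt wording and model choices are illustrative, not the project's exact code:

```python
# Minimal sketch of the hybrid pipeline (Approach 2): a fast local
# detector extracts object labels, then a text-only LLM turns them into
# a spoken-style description. Assumes the ultralytics and openai
# packages; prompt wording and model choices are illustrative.
from openai import OpenAI
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # small YOLOv8 checkpoint, runs locally
client = OpenAI()              # reads OPENAI_API_KEY from the environment

def describe_scene(image_path: str) -> str:
    # 1. Local detection: fast and free per query.
    result = detector(image_path)[0]
    labels = [result.names[int(box.cls)] for box in result.boxes]

    # 2. Text-only generation: the image never crosses the API, so the
    #    request is a few dozen tokens instead of an image upload.
    prompt = (
        "You assist a blind user. Objects detected in the scene: "
        + ", ".join(labels)
        + ". Describe the scene in one short, spoken-style sentence."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Because only text crosses the API, this is where the 3x speedup over pure VLM calls and the ~$0.005 per-query cost come from.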
Optimized Variants
- Approach 2.5: Hybrid + caching + GPT-3.5 (fastest: 0.54s; caching sketched below)
- Approach 3.5: Specialized + complexity routing (0.93s)
- Approach 1.5: Progressive disclosure (BLIP-2 + GPT-4V)
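A minimal sketch of the caching idea behind Approach 2.5, assuming the detected-label set serves as the cache key; the key design and cache size are assumptions, not the project's exact implementation:

```python
# Minimal sketch of the LRU cache in Approach 2.5: reuse the LLM
# description when the detected object set repeats across frames.
# Key design and eviction policy here are assumptions.
from collections import OrderedDict

class SceneCache:
    def __init__(self, max_size: int = 256):
        self.max_size = max_size
        self._cache: OrderedDict = OrderedDict()

    def key(self, labels: list) -> tuple:
        # Identical object sets map to the same description.
        return tuple(sorted(labels))

    def get(self, labels: list):
        k = self.key(labels)
        if k in self._cache:
            self._cache.move_to_end(k)  # mark as recently used
            return self._cache[k]       # cache hit: skip the LLM call
        return None

    def put(self, labels: list, description: str) -> None:
        k = self.key(labels)
        self._cache[k] = description
        self._cache.move_to_end(k)
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict least-recently-used

# Usage: check the cache before calling the LLM.
# cached = cache.get(labels)
# description = cached if cached else describe_scene(image_path)
```

On a cache hit the LLM call is skipped entirely, consistent with the 15x speedup on hits noted under Key Insights.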
Alternative Architectures
- Approach 4: Local BLIP-2 (zero API cost, offline; sketched below)
- Approach 6: RAG with game knowledge base
- Approach 7: Chain-of-Thought for safety detection
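For Approach 4, a minimal sketch of fully offline BLIP-2 captioning via Hugging Face transformers; the checkpoint choice (Salesforce/blip2-opt-2.7b) is an assumption, as the project's exact variant is not specified here:

```python
# Minimal sketch of local BLIP-2 captioning (Approach 4): fully
# offline, zero API cost. The checkpoint choice is an assumption.
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

def caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True).strip()
```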
Results: Top 3 Approaches
Three approaches achieved sub-2-second latency (our threshold for practical usability):
1st: Approach 2.5 (Hybrid Optimized)
0.54s · YOLO + GPT-3.5-turbo + LRU cache
2nd: Approach 3.5 (Specialized)
0.93s · Multi-model routing with OCR + depth estimation
3rd: Approach 1.5 (Progressive)
1.62s · BLIP-2 quick preview + GPT-4V detailed (best quality; pattern sketched below)
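The progressive pattern in Approach 1.5 can be sketched as follows; quick_caption, detailed_describe, and announce are hypothetical placeholders for a local BLIP-2 caption, a GPT-4V call, and text-to-speech output, not the project's actual code:

```python
# Minimal sketch of progressive disclosure (Approach 1.5): announce a
# fast local caption immediately, then upgrade to the slower, detailed
# VLM description when it arrives. All helper names are hypothetical.
import asyncio

async def describe_progressively(image, quick_caption, detailed_describe, announce):
    announce(quick_caption(image))  # fast BLIP-2 preview, spoken first
    # Run the slower GPT-4V call without blocking the preview.
    detailed = await asyncio.to_thread(detailed_describe, image)
    announce(detailed)              # replace preview with the detailed description

# Example driver:
# asyncio.run(describe_progressively(img, blip2_caption, gpt4v_describe, tts.say))
```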
Key Insights
- Architecture matters more than model size: GPT-3.5 + YOLO beats GPT-4V alone
- Caching is transformative: 15x speedup on cache hits (40-60% hit rate)
- Hybrid pipelines outperform pure VLMs: 3x speedup with maintained quality
- All differences statistically significant: p < 0.001 across comparisons (testing sketched below)
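The significance claim can be checked with a paired test, since each image was run through every pipeline. A minimal sketch with SciPy; the latency values are illustrative, not the project's measured data:

```python
# Minimal sketch of a paired significance test on per-image latencies.
# Values below are illustrative placeholders, not measured results.
from scipy import stats

latency_hybrid = [0.51, 0.58, 0.49, 0.60, 0.55]    # Approach 2.5 (s)
latency_pure_vlm = [1.80, 2.10, 1.95, 2.30, 1.75]  # Approach 1 (s)

# Paired t-test: the same image produced both measurements.
t_stat, p_value = stats.ttest_rel(latency_hybrid, latency_pure_vlm)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```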
Scale of Testing
9
Approaches
564+
API Calls
42
Test Images
4
Scenarios
Scenarios: Gaming (12), Indoor Navigation (10), Outdoor Navigation (10), Text Reading (10)
Technologies Used
GPT-4V · Claude · Gemini · GPT-3.5-turbo · YOLOv8 · BLIP-2 · OCR · Depth Estimation · RAG · Chain-of-Thought
Team
Roshan Sivakumar & Dhesel Khando
Instructor: Prof. Jesse Spencer-Smith