Scene Reader

Real-Time Visual Accessibility Through Vision-Language Models

GPT-4V · YOLOv8 · BLIP-2 · RAG · Chain-of-Thought

DS-5690 Generative AI | Vanderbilt University | Fall 2025

Systematically evaluated 9 vision AI approaches, spanning pure vision-language models, detection pipelines, and hybrid architectures, for providing real-time visual assistance to the 250 million blind and visually impaired individuals worldwide. Identified architectures that achieve sub-2-second latency for practical accessibility applications.

Key Achievements

  • 0.54s mean latency (fastest approach)
  • 3x speedup vs. pure VLM approaches
  • $0.005 cost per query (scalable)

The Accessibility Challenge

Over 7 million blind and visually impaired individuals in the United States face a fundamental barrier: they cannot access visual information that sighted individuals take for granted. Current accessibility tools are too slow or too expensive, or they lack the contextual understanding necessary for real-time assistance.

Research Question: Which vision AI architecture achieves the optimal balance of latency, accuracy, and cost for real-time accessibility applications?

9 Approaches Tested

Baseline Approaches

  • Approach 1: Pure VLM (GPT-4V, Claude, Gemini)
  • Approach 2: Hybrid (YOLO detection + LLM generation)
  • Approach 3: Specialized (YOLO + OCR + Depth estimation)

Optimized Variants

  • Approach 2.5: Hybrid + caching + GPT-3.5 (fastest: 0.54s; pipeline sketched below)
  • Approach 3.5: Specialized + complexity routing (0.93s)
  • Approach 1.5: Progressive disclosure (BLIP-2 + GPT-4V)
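
The core of the hybrid approach fits in a few lines: a fast local detector finds objects, the detections are serialized into a compact text prompt, and a small LLM turns them into a spoken-style description. The sketch below illustrates that idea; it assumes the Ultralytics YOLOv8 package and the OpenAI Python SDK, and the prompt wording and position heuristic are hypothetical rather than our exact configuration.

```python
# Minimal sketch of the hybrid pipeline (Approach 2.5, without the cache):
# local YOLOv8 detection -> compact text summary -> GPT-3.5-turbo narration.
# Assumes the `ultralytics` and `openai` packages; prompt wording and the
# left/ahead/right heuristic are illustrative, not the exact project setup.
from ultralytics import YOLO
from openai import OpenAI

detector = YOLO("yolov8n.pt")   # small, fast detector for real-time use
client = OpenAI()               # reads OPENAI_API_KEY from the environment

def detect_objects(image_path: str) -> list[str]:
    """Run YOLOv8 and return human-readable detections like 'person (left)'."""
    result = detector(image_path)[0]
    labels = []
    for box in result.boxes:
        name = result.names[int(box.cls)]
        x_center = float(box.xywhn[0][0])  # normalized x position in [0, 1]
        side = "left" if x_center < 0.33 else "right" if x_center > 0.66 else "ahead"
        labels.append(f"{name} ({side})")
    return labels

def describe_scene(image_path: str) -> str:
    """Turn detections into a short spoken-style description with GPT-3.5."""
    detections = detect_objects(image_path)
    prompt = (
        "You assist a blind user. In one or two short sentences, describe this "
        f"scene from these detected objects: {', '.join(detections) or 'nothing detected'}."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=80,
    )
    return response.choices[0].message.content

print(describe_scene("street.jpg"))
```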

Alternative Architectures

  • Approach 4: Local BLIP-2 (zero API cost, offline)
  • Approach 6: RAG with game knowledge base
  • Approach 7: Chain-of-Thought for safety detection (prompt sketched below)
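
Approach 7 adds no extra models; it changes the prompt so the model reasons step by step about hazards before answering. The template below is a hypothetical example of such a chain-of-thought safety prompt, not the exact wording used in our experiments.

```python
# Hypothetical chain-of-thought prompt for safety detection (Approach 7).
# The model is asked to reason step by step about hazards before answering,
# so warnings ("curb ahead", "cyclist approaching") are grounded in explicit
# reasoning rather than a one-shot guess. Wording is illustrative only.
SAFETY_COT_PROMPT = """\
You are assisting a blind pedestrian. Detected objects: {detections}.

Think step by step:
1. Which of these objects could be a hazard (moving, sharp, at ground level)?
2. How close and in which direction is each potential hazard?
3. Is immediate action needed (stop, step left/right), or is the path clear?

Then answer on a final line starting with "WARNING:" if action is needed,
or "CLEAR:" if not, in one short sentence suitable for text-to-speech.
"""

def build_safety_prompt(detections: list[str]) -> str:
    return SAFETY_COT_PROMPT.format(detections=", ".join(detections))

print(build_safety_prompt(["car (ahead)", "curb (ahead)", "person (left)"]))
```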

Results: Top 3 Approaches

Three approaches achieved sub-2-second latency (our threshold for practical usability):

  • 1st: Approach 2.5 (Hybrid Optimized), 0.54s: YOLO + GPT-3.5-turbo + LRU cache
  • 2nd: Approach 3.5 (Specialized), 0.93s: multi-model routing with OCR + depth estimation (routing sketched below)
  • 3rd: Approach 1.5 (Progressive), 1.62s: BLIP-2 quick preview followed by a detailed GPT-4V description (best quality)
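
The routing in Approach 3.5 is what keeps its mean latency under one second: simple scenes take the fast detector-plus-LLM path, and only text-heavy or cluttered scenes pay for OCR and depth estimation. A minimal sketch of that decision follows; run_ocr and estimate_depth are hypothetical stand-ins for EasyOCR and MiDaS wrappers, detect_objects and describe_scene reuse the hybrid sketch above, and the thresholds are illustrative, not tuned project values.

```python
# Minimal sketch of complexity routing (Approach 3.5). run_ocr and
# estimate_depth are hypothetical placeholders for the specialist models
# (EasyOCR, MiDaS); detect_objects and describe_scene come from the hybrid
# sketch above. Thresholds are illustrative only.
def route_and_describe(image_path: str) -> str:
    detections = detect_objects(image_path)   # fast YOLOv8 pass, always runs

    # Heuristic complexity signals: text-like classes present, or many objects.
    has_text = any("sign" in d or "book" in d for d in detections)
    is_cluttered = len(detections) > 8

    if has_text:
        # Text-heavy scene: pay the extra cost of OCR so signs get read aloud.
        return describe_scene(image_path) + " Text found: " + run_ocr(image_path)
    if is_cluttered:
        # Cluttered scene: add monocular depth so the nearest obstacle can be named.
        nearest = estimate_depth(image_path, detections)
        return describe_scene(image_path) + f" Nearest obstacle: {nearest}."
    # Simple scene: fast path only (detector + GPT-3.5), lowest latency.
    return describe_scene(image_path)
```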

Key Insights

  • Architecture matters more than model size: YOLO + GPT-3.5 beats GPT-4V alone
  • Caching is transformative: 15x speedup on cache hits at a 40-60% hit rate (sketched below)
  • Hybrid pipelines outperform pure VLMs: 3x speedup while maintaining description quality
  • All differences are statistically significant: p < 0.001 across comparisons
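
The cache hits behind the 15x speedup come from temporal redundancy: consecutive frames usually contain the same objects, so when the detection signature repeats, the previous description can be reused without another LLM call. Below is a minimal sketch of that idea using functools.lru_cache keyed on the set of detected labels; call_llm is a hypothetical wrapper around the GPT-3.5 call shown earlier, and our actual cache key and eviction policy may differ.

```python
# Minimal sketch of detection-signature caching (the "15x on cache hits" idea).
# The LLM call is keyed on the set of detected labels and skipped when that
# set repeats across frames. call_llm is a hypothetical wrapper around the
# GPT-3.5 call above; the real cache key and eviction policy may differ.
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_description(detection_signature: frozenset[str]) -> str:
    # Cache miss: pay the LLM latency once per unique set of detections.
    prompt = (
        "You assist a blind user. Briefly describe a scene containing: "
        + ", ".join(sorted(detection_signature))
    )
    return call_llm(prompt)

def describe_frame(image_path: str) -> str:
    detections = detect_objects(image_path)           # fast local YOLOv8 pass
    return cached_description(frozenset(detections))  # near-instant on repeats
```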

Scale of Testing

  • 9 approaches
  • 564+ API calls
  • 42 test images
  • 4 scenarios: Gaming (12), Indoor Navigation (10), Outdoor Navigation (10), Text Reading (10)

Technologies Used

GPT-4V · GPT-3.5-turbo · Claude · Gemini · YOLOv8 · BLIP-2 · EasyOCR · MiDaS · Python · PyTorch · RAG

Team

Roshan Sivakumar & Dhesel Khando

Instructor: Prof. Jesse Spencer-Smith