
👁️ Give Your AI Super Vision!

Your AI can chat, create images, understand speech, analyze files, and talk back. Now let’s give it eyes! 👀

Imagine users uploading a business chart and your AI saying: “This shows a 23% increase in Q3 sales, with the highest growth in the mobile segment.” Or analyzing a screenshot and providing detailed UI/UX feedback!

What we’re building: Your AI will become a visual expert that can analyze photos, documents, charts, screenshots - anything visual - with professional-level insights!


Current state: Your AI can process text, but images are a mystery.
Target state: Your AI sees and understands any visual content!

🔄 The Visual Intelligence Transformation


Before (Blind AI):

User: [Uploads business chart] "What does this show?"
AI: "I can't see images, please describe it" 😕

After (AI with Vision):

User: [Uploads business chart] "What does this show?"
AI: "This bar chart shows quarterly revenue growth, with Q3 showing a 34% increase over Q2. The mobile division is your strongest performer with $2.3M in sales." 🤩

The magic: Your AI becomes a visual expert that understands images like a human!

Real-world scenarios your AI will handle:

  • 📈 Business charts - “Revenue increased 23% with mobile leading growth”
  • 📝 Documents - Extract key data, dates, and important information
  • 📱 Screenshots - “The login button should be bigger and more prominent”
  • 🎨 Photos - “This shows a golden retriever in a park with 3 people”
  • 📊 Dashboards - “Your conversion rate dropped 5% but user engagement is up”

Without vision AI:

❌ Manually examine every image
❌ Miss important visual patterns
❌ Time-consuming data extraction
❌ Limited to text-only analysis

With vision AI:

✅ Instant professional image analysis
✅ Fast, structured data extraction
✅ Spot patterns humans might miss
✅ Complete multimedia intelligence

🕰️ Your AI’s New Visual Superpowers


📄 Document Detective Mode

Perfect for: Invoices, contracts, forms, reports
AI becomes: Professional document analyst
Results: "Invoice #12345 dated March 15th for $2,847.50 from TechCorp"

📊 Chart Analyst Mode

Perfect for: Graphs, dashboards, data visualizations
AI becomes: Business intelligence expert
Results: "Sales peaked in Q3 at $1.2M, showing 45% growth over Q2"

🎯 Everything Mode

Perfect for: Photos, screenshots, anything visual
AI becomes: Universal visual expert
Results: "This UI mockup has good spacing but the CTA button needs more contrast"

The best part: One AI handles all types perfectly!


🛠️ Step 1: Add Super Vision to Your Backend


Great news: We’re using your proven Responses API patterns!

What you already know:

// Your familiar text analysis
const response = await client.responses.create({
  model: "gpt-4o",
  input: [expertPrompt, userMessage]
});

What we’re adding:

// Same pattern + image input!
const response = await client.responses.create({
  model: "gpt-4o", // Same model, now with vision!
  input: [
    expertPrompt,
    {
      role: "user",
      content: [
        { type: "input_text", text: "Analyze this image" },
        { type: "input_image", image_url: uploadedImage } // hosted URL or base64 data URL
      ]
    }
  ]
});

Perfect! Same Responses API, just with image superpowers added.
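One detail worth knowing: the input_image part accepts either a hosted URL or an inline base64 data URL. A minimal sketch of both forms (the data-URL variant is what we’ll build server-side; the example URL is hypothetical):

// Point the model at a hosted image...
{ type: "input_image", image_url: "https://example.com/chart.png" }

// ...or inline the pixels as a base64 data URL built from an uploaded file:
{ type: "input_image", image_url: `data:image/jpeg;base64,${base64Image}` }

Hosted URLs keep requests small; data URLs avoid needing public hosting for user uploads, which is why the endpoint below uses them.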

Simple concept: Image goes in → Expert analysis comes out!

// What we need to track:
const visionState = {
  uploadedImage: "user-screenshot.png", // What to analyze
  analysisMode: "general",              // How to analyze it
  visionSettings: {                     // Analysis options
    includeOCR: true,    // Extract text
    extractData: true,   // Find numbers/dates
    detailLevel: "high"  // Depth of analysis
  },
  aiResults: "Professional analysis...", // Expert insights!
}

Vision analysis types:

  • 📝 Document mode - Focus on text extraction and data
  • 📊 Chart mode - Analyze data visualizations and trends
  • 🎯 General mode - Comprehensive understanding of anything
  • 🔍 Detail levels - From quick summaries to deep analysis (see the sketch below)
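In the Responses API, that depth knob is the detail field on the image part. A minimal sketch, assuming a detailLevel value like the one in visionState above (the mapping is our own convention, not an SDK requirement):

// "low" is cheaper and faster — the model sees a downscaled image.
// "high" sends high-resolution tiles so small text and fine details survive.
const imagePart = {
  type: "input_image",
  image_url: imageUrl, // hosted URL or base64 data URL
  detail: visionState.visionSettings.detailLevel === "high" ? "high" : "low"
};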

Add one package for image optimization:

# In your backend folder
npm install sharp

What sharp does: Resizes and re-encodes oversized uploads before analysis - faster processing and more reliable results!
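If you want to see what sharp received before optimizing, here’s a quick optional check — a sketch assuming the upload’s buffer is in a variable named imageBuffer:

import sharp from 'sharp';

// Inspect the incoming image — handy when debugging odd uploads.
const { width, height, format } = await sharp(imageBuffer).metadata();
console.log(`Received a ${format} image at ${width}x${height}`);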

Add this to your server - same reliable patterns:

import sharp from 'sharp';

// 👁️ VISION ANALYSIS ENDPOINT: Add this to your existing server
app.post("/api/vision/analyze", upload.single("image"), async (req, res) => {
  try {
    // 🛡️ VALIDATION: Check if image was uploaded
    const uploadedImage = req.file;
    const { analysisType = "general" } = req.body;
    // Multipart form fields arrive as strings, so parse the booleans explicitly
    const includeOCR = req.body.includeOCR !== "false";
    const extractData = req.body.extractData !== "false";

    if (!uploadedImage) {
      return res.status(400).json({
        error: "Image file is required",
        success: false
      });
    }

    console.log(`👁️ Analyzing: ${uploadedImage.originalname} (${uploadedImage.size} bytes)`);

    // 🖼️ IMAGE OPTIMIZATION: Prepare image for vision analysis
    const optimizedImage = await optimizeImageForVision(uploadedImage.buffer);
    const base64Image = optimizedImage.toString('base64');
    // The optimizer re-encodes to JPEG, so the data URL must say image/jpeg
    const imageUrl = `data:image/jpeg;base64,${base64Image}`;

    // 🔍 ANALYSIS PROMPT: Generate appropriate prompt based on type
    const analysisPrompt = generateVisionPrompt(analysisType, includeOCR, extractData);

    // 🤖 AI VISION ANALYSIS: Process with GPT-4o via the Responses API
    const response = await openai.responses.create({
      model: "gpt-4o",
      input: [
        {
          role: "system",
          content: analysisPrompt.systemPrompt
        },
        {
          role: "user",
          content: [
            {
              type: "input_text",
              text: analysisPrompt.userPrompt
            },
            {
              type: "input_image",
              image_url: imageUrl,
              detail: "high"
            }
          ]
        }
      ]
    });

    // 📤 SUCCESS RESPONSE: Send analysis results
    res.json({
      success: true,
      file_info: {
        name: uploadedImage.originalname,
        size: uploadedImage.size,
        type: uploadedImage.mimetype
      },
      analysis: {
        type: analysisType,
        include_ocr: includeOCR,
        extract_data: extractData,
        result: response.output_text,
        model: "gpt-4o"
      },
      timestamp: new Date().toISOString()
    });
  } catch (error) {
    // 🚨 ERROR HANDLING: Handle analysis failures
    console.error("Vision analysis error:", error);
    res.status(500).json({
      error: "Failed to analyze image",
      details: error.message,
      success: false
    });
  }
});

// 🔧 HELPER FUNCTIONS: Vision analysis utilities

// Optimize image for better vision analysis
const optimizeImageForVision = async (imageBuffer) => {
  try {
    // Resize large images for better processing
    const optimized = await sharp(imageBuffer)
      .resize(2048, 2048, {
        fit: 'inside',
        withoutEnlargement: true
      })
      .jpeg({ quality: 85 })
      .toBuffer();
    return optimized;
  } catch (error) {
    console.error('Image optimization error:', error);
    return imageBuffer; // Return original if optimization fails
  }
};

// Generate analysis prompts based on type
const generateVisionPrompt = (analysisType, includeOCR, extractData) => {
  const baseSystem = "You are a professional visual analyst with expertise in document analysis, data extraction, and image understanding.";
  switch (analysisType) {
    case 'document':
      return {
        systemPrompt: `${baseSystem} You specialize in document analysis, OCR, and text extraction.`,
        userPrompt: `Analyze this document image with focus on:
1. TEXT EXTRACTION: ${includeOCR ? 'Extract all readable text content using OCR' : 'Summarize visible text content'}
2. DOCUMENT STRUCTURE: Identify document type, layout, and organization
3. KEY DATA: Extract important numbers, dates, names, and values
4. INSIGHTS: Provide analysis of the document's purpose and key information
Provide clear, structured analysis that's easy to understand.`
      };
    case 'chart':
      return {
        systemPrompt: `${baseSystem} You specialize in chart analysis, data visualization interpretation, and trend analysis.`,
        userPrompt: `Analyze this chart/graph with focus on:
1. CHART TYPE: Identify the type of visualization (bar, line, pie, etc.)
2. DATA EXTRACTION: ${extractData ? 'Extract specific numerical values and data points' : 'Summarize key trends and patterns'}
3. TRENDS: Identify patterns, trends, and significant changes
4. INSIGHTS: Provide business intelligence and actionable insights
Focus on accuracy and clear interpretation of the visual data.`
      };
    default: // general
      return {
        systemPrompt: `${baseSystem} You provide comprehensive visual analysis for any type of image.`,
        userPrompt: `Analyze this image comprehensively:
1. CONTENT DESCRIPTION: What do you see in this image?
2. KEY ELEMENTS: Important objects, text, or data visible
3. CONTEXT ANALYSIS: Purpose, setting, or business context
4. ACTIONABLE INSIGHTS: Useful observations or recommendations
${includeOCR ? 'Include any readable text content.' : ''}
${extractData ? 'Extract any numerical or structured data visible.' : ''}
Provide practical, useful analysis that helps users understand the image better.`
      };
  }
};

Function breakdown:

  1. Validation - Ensure we have an image to analyze
  2. Image optimization - Prepare image for better AI analysis
  3. Prompt generation - Create appropriate analysis prompts (sanity-check sketch below)
  4. Vision analysis - Process with GPT-4o vision capabilities
  5. Response formatting - Return structured results with metadata
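Because generateVisionPrompt is a pure function, you can sanity-check the prompts without spending an API call — a quick sketch:

// No server or API key needed — just inspect the generated prompts.
const prompt = generateVisionPrompt("chart", true, true);
console.log(prompt.systemPrompt); // "...You specialize in chart analysis..."
console.log(prompt.userPrompt);   // numbered instructions with data extraction enabled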

Step 2: Updating File Upload Configuration


Update your existing multer configuration to handle images:

// Update your existing multer setup to handle images
const upload = multer({
  storage: multer.memoryStorage(),
  limits: {
    fileSize: 25 * 1024 * 1024 // 25MB limit
  },
  fileFilter: (req, file, cb) => {
    // Accept all previous file types PLUS images
    const allowedTypes = [
      'application/pdf',
      'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
      'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
      'text/plain',
      'text/csv',
      'application/json',
      'text/javascript',
      'text/x-python',
      'audio/wav',
      'audio/mp3',
      'audio/mpeg',
      'audio/mp4',
      'audio/webm',
      'image/jpeg', // Add image support
      'image/png',  // Add image support
      'image/webp', // Add image support
      'image/gif'   // Add image support
    ];
    const extension = path.extname(file.originalname).toLowerCase();
    const allowedExtensions = ['.pdf', '.docx', '.xlsx', '.csv', '.txt', '.md', '.json', '.js', '.py', '.wav', '.mp3', '.jpeg', '.jpg', '.png', '.webp', '.gif'];
    if (allowedTypes.includes(file.mimetype) || allowedExtensions.includes(extension)) {
      cb(null, true);
    } else {
      cb(new Error('Unsupported file type'), false);
    }
  }
});
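One caveat: when fileFilter rejects a file, the error surfaces through Express’s error pipeline rather than inside your route handler, which renders as a generic 500 by default. A small sketch of an app-level error handler (register it after your routes) that turns those rejections into clean 400s:

// Convert multer/fileFilter errors into JSON 400s instead of generic 500s.
app.use((err, req, res, next) => {
  if (err instanceof multer.MulterError || err.message === 'Unsupported file type') {
    return res.status(400).json({ success: false, error: err.message });
  }
  next(err); // anything else falls through to the default handler
});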

Your backend now supports:

  • Text chat (existing functionality)
  • Streaming chat (existing functionality)
  • Image generation (existing functionality)
  • Audio transcription (existing functionality)
  • File analysis (existing functionality)
  • Text-to-speech (existing functionality)
  • Vision analysis (new functionality)

🔧 Step 3: Building the React Vision Component


Now let’s create a React component for vision analysis using the same patterns from your existing components.

Step 3A: Creating the Vision Analysis Component


Create a new file src/VisionAnalysis.jsx:

import { useState, useRef } from "react";
import { Upload, Eye, FileText, BarChart3, Download, Camera } from "lucide-react";

function VisionAnalysis() {
  // 🧠 STATE: Vision analysis data management
  const [selectedImage, setSelectedImage] = useState(null);    // Uploaded image
  const [analysisType, setAnalysisType] = useState("general"); // Analysis mode
  const [isAnalyzing, setIsAnalyzing] = useState(false);       // Processing status
  const [analysisResult, setAnalysisResult] = useState(null);  // Analysis results
  const [error, setError] = useState(null);                    // Error messages
  const [previewUrl, setPreviewUrl] = useState(null);          // Image preview
  const [options, setOptions] = useState({                     // Analysis options
    includeOCR: true,
    extractData: true
  });
  const fileInputRef = useRef(null);

  // 🔧 FUNCTIONS: Vision analysis logic engine

  // Handle image selection
  const handleImageSelect = (event) => {
    const file = event.target.files[0];
    if (file) {
      // Validate file size (25MB limit)
      if (file.size > 25 * 1024 * 1024) {
        setError('Image too large. Maximum size is 25MB.');
        return;
      }
      // Validate file type
      const allowedTypes = ['image/jpeg', 'image/png', 'image/webp', 'image/gif'];
      if (!allowedTypes.includes(file.type)) {
        setError('Unsupported image type. Please upload JPEG, PNG, WebP, or GIF files.');
        return;
      }
      setSelectedImage(file);
      setAnalysisResult(null);
      setError(null);
      // Create preview URL
      const url = URL.createObjectURL(file);
      setPreviewUrl(url);
    }
  };

  // Clear selected image
  const clearImage = () => {
    setSelectedImage(null);
    setAnalysisResult(null);
    setError(null);
    if (previewUrl) {
      URL.revokeObjectURL(previewUrl);
      setPreviewUrl(null);
    }
    if (fileInputRef.current) {
      fileInputRef.current.value = '';
    }
  };

  // Main vision analysis function
  const analyzeImage = async () => {
    // 🛡️ GUARDS: Prevent invalid analysis
    if (!selectedImage || isAnalyzing) return;

    // 🔄 SETUP: Prepare for analysis
    setIsAnalyzing(true);
    setError(null);
    setAnalysisResult(null);

    try {
      // 📤 FORM DATA: Prepare multipart form data
      const formData = new FormData();
      formData.append('image', selectedImage);
      formData.append('analysisType', analysisType);
      formData.append('includeOCR', options.includeOCR);
      formData.append('extractData', options.extractData);

      // 📡 API CALL: Send to your backend
      const response = await fetch("http://localhost:8000/api/vision/analyze", {
        method: "POST",
        body: formData
      });
      const data = await response.json();
      if (!response.ok) {
        throw new Error(data.error || 'Failed to analyze image');
      }

      // ✅ SUCCESS: Store analysis results
      setAnalysisResult(data);
    } catch (error) {
      // 🚨 ERROR HANDLING: Show user-friendly message
      console.error('Vision analysis failed:', error);
      setError(error.message || 'Something went wrong while analyzing the image');
    } finally {
      // 🧹 CLEANUP: Reset processing state
      setIsAnalyzing(false);
    }
  };

  // Download analysis results
  const downloadAnalysis = () => {
    if (!analysisResult) return;
    const element = document.createElement('a');
    const file = new Blob([JSON.stringify(analysisResult, null, 2)], { type: 'application/json' });
    element.href = URL.createObjectURL(file);
    element.download = `vision-analysis-${selectedImage.name}-${Date.now()}.json`;
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
  };

  // Analysis type options
  const analysisTypes = [
    { value: "general", label: "General Analysis", desc: "Comprehensive visual understanding", icon: Eye },
    { value: "document", label: "Document Analysis", desc: "OCR and text extraction focus", icon: FileText },
    { value: "chart", label: "Chart Analysis", desc: "Data visualization interpretation", icon: BarChart3 }
  ];

  // Format file size
  const formatFileSize = (bytes) => {
    if (bytes === 0) return '0 Bytes';
    const k = 1024;
    const sizes = ['Bytes', 'KB', 'MB'];
    const i = Math.floor(Math.log(bytes) / Math.log(k));
    return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
  };

  // 🎨 UI: Interface components
  return (
    <div className="min-h-screen bg-gradient-to-br from-indigo-50 to-purple-50 flex items-center justify-center p-4">
      <div className="bg-white rounded-2xl shadow-2xl w-full max-w-6xl flex flex-col overflow-hidden">
        {/* Header */}
        <div className="bg-gradient-to-r from-indigo-600 to-purple-600 text-white p-6">
          <div className="flex items-center space-x-3">
            <div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
              <Eye className="w-5 h-5" />
            </div>
            <div>
              <h1 className="text-xl font-bold">👁️ AI Vision Analysis</h1>
              <p className="text-indigo-100 text-sm">Analyze any image with AI intelligence!</p>
            </div>
          </div>
        </div>
        {/* Analysis Type Selection */}
        <div className="p-6 border-b border-gray-200">
          <h3 className="font-semibold text-gray-900 mb-4 flex items-center">
            <Camera className="w-5 h-5 mr-2 text-indigo-600" />
            Analysis Type
          </h3>
          <div className="grid grid-cols-1 md:grid-cols-3 gap-4">
            {analysisTypes.map((type) => {
              const IconComponent = type.icon;
              return (
                <button
                  key={type.value}
                  onClick={() => setAnalysisType(type.value)}
                  className={`p-4 rounded-lg border-2 text-left transition-all duration-200 ${
                    analysisType === type.value
                      ? 'border-indigo-500 bg-indigo-50 shadow-md'
                      : 'border-gray-200 hover:border-indigo-300 hover:bg-indigo-50'
                  }`}
                >
                  <div className="flex items-center mb-2">
                    <IconComponent className="w-5 h-5 mr-2 text-indigo-600" />
                    <h4 className="font-medium text-gray-900">{type.label}</h4>
                  </div>
                  <p className="text-sm text-gray-600">{type.desc}</p>
                </button>
              );
            })}
          </div>
        </div>
        {/* Analysis Options */}
        <div className="p-6 border-b border-gray-200">
          <h3 className="font-semibold text-gray-900 mb-4">Analysis Options</h3>
          <div className="grid grid-cols-1 md:grid-cols-2 gap-4">
            <label className="flex items-center space-x-3 p-3 rounded-lg border border-gray-200 hover:bg-gray-50 cursor-pointer">
              <input
                type="checkbox"
                checked={options.includeOCR}
                onChange={(e) => setOptions(prev => ({ ...prev, includeOCR: e.target.checked }))}
                className="w-4 h-4 text-indigo-600 rounded focus:ring-indigo-500"
              />
              <div>
                <span className="font-medium text-gray-900">Include OCR</span>
                <p className="text-sm text-gray-600">Extract text content from images</p>
              </div>
            </label>
            <label className="flex items-center space-x-3 p-3 rounded-lg border border-gray-200 hover:bg-gray-50 cursor-pointer">
              <input
                type="checkbox"
                checked={options.extractData}
                onChange={(e) => setOptions(prev => ({ ...prev, extractData: e.target.checked }))}
                className="w-4 h-4 text-indigo-600 rounded focus:ring-indigo-500"
              />
              <div>
                <span className="font-medium text-gray-900">Extract Data</span>
                <p className="text-sm text-gray-600">Find numerical data and structured information</p>
              </div>
            </label>
          </div>
        </div>
        {/* Image Upload Section */}
        <div className="p-6 border-b border-gray-200">
          <h3 className="font-semibold text-gray-900 mb-4 flex items-center">
            <Upload className="w-5 h-5 mr-2 text-indigo-600" />
            Upload Image for Analysis
          </h3>
          {!selectedImage ? (
            <div
              onClick={() => fileInputRef.current?.click()}
              className="border-2 border-dashed border-gray-300 rounded-xl p-8 text-center cursor-pointer hover:border-indigo-400 hover:bg-indigo-50 transition-colors duration-200"
            >
              <Upload className="w-12 h-12 text-gray-400 mx-auto mb-4" />
              <h4 className="text-lg font-semibold text-gray-700 mb-2">Upload Image</h4>
              <p className="text-gray-600 mb-4">
                Support for JPEG, PNG, WebP, and GIF files up to 25MB
              </p>
              <button className="px-6 py-3 bg-gradient-to-r from-indigo-600 to-purple-600 text-white rounded-xl hover:from-indigo-700 hover:to-purple-700 transition-all duration-200 inline-flex items-center space-x-2 shadow-lg">
                <Upload className="w-4 h-4" />
                <span>Choose Image</span>
              </button>
            </div>
          ) : (
            <div className="bg-gray-50 rounded-lg p-4 border border-gray-200">
              <div className="grid grid-cols-1 md:grid-cols-2 gap-4">
                {/* Image Preview */}
                <div>
                  <h4 className="font-medium text-gray-900 mb-2">Preview:</h4>
                  <img
                    src={previewUrl}
                    alt={selectedImage.name}
                    className="w-full h-48 object-cover rounded-lg border border-gray-200"
                  />
                </div>
                {/* Image Info */}
                <div>
                  <div className="flex items-center justify-between mb-4">
                    <div>
                      <h4 className="font-medium text-gray-900">{selectedImage.name}</h4>
                      <p className="text-sm text-gray-600">{formatFileSize(selectedImage.size)}</p>
                    </div>
                    <button
                      onClick={clearImage}
                      className="p-2 text-gray-400 hover:text-red-600 transition-colors duration-200"
                    >
                      ×
                    </button>
                  </div>
                  <button
                    onClick={analyzeImage}
                    disabled={isAnalyzing}
                    className="w-full bg-gradient-to-r from-indigo-600 to-purple-600 hover:from-indigo-700 hover:to-purple-700 disabled:from-gray-300 disabled:to-gray-300 text-white px-6 py-3 rounded-lg transition-all duration-200 flex items-center justify-center space-x-2 shadow-lg disabled:shadow-none"
                  >
                    {isAnalyzing ? (
                      <>
                        <div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
                        <span>Analyzing...</span>
                      </>
                    ) : (
                      <>
                        <Eye className="w-4 h-4" />
                        <span>Analyze Image</span>
                      </>
                    )}
                  </button>
                </div>
              </div>
            </div>
          )}
          <input
            ref={fileInputRef}
            type="file"
            accept="image/jpeg,image/png,image/webp,image/gif"
            onChange={handleImageSelect}
            className="hidden"
          />
        </div>
        {/* Results Section */}
        <div className="flex-1 p-6">
          {/* Error Display */}
          {error && (
            <div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
              <p className="text-red-700">
                <strong>Error:</strong> {error}
              </p>
            </div>
          )}
          {/* Analysis Results */}
          {analysisResult ? (
            <div className="bg-gray-50 rounded-lg p-4">
              <div className="flex items-center justify-between mb-4">
                <h4 className="font-semibold text-gray-900">Vision Analysis Results</h4>
                <button
                  onClick={downloadAnalysis}
                  className="bg-gradient-to-r from-blue-500 to-blue-600 hover:from-blue-600 hover:to-blue-700 text-white px-4 py-2 rounded-lg transition-all duration-200 flex items-center space-x-2"
                >
                  <Download className="w-4 h-4" />
                  <span>Download</span>
                </button>
              </div>
              <div className="space-y-4">
                {/* File Information */}
                <div className="bg-white rounded-lg p-4">
                  <h5 className="font-medium text-gray-700 mb-2">Image Information:</h5>
                  <div className="grid grid-cols-2 md:grid-cols-4 gap-4 text-sm">
                    <div>
                      <span className="text-gray-600">Name:</span>
                      <p className="font-medium">{analysisResult.file_info.name}</p>
                    </div>
                    <div>
                      <span className="text-gray-600">Size:</span>
                      <p className="font-medium">{formatFileSize(analysisResult.file_info.size)}</p>
                    </div>
                    <div>
                      <span className="text-gray-600">Type:</span>
                      <p className="font-medium">{analysisResult.file_info.type}</p>
                    </div>
                    <div>
                      <span className="text-gray-600">Analysis:</span>
                      <p className="font-medium capitalize">{analysisResult.analysis.type}</p>
                    </div>
                  </div>
                </div>
                {/* Analysis Content */}
                <div className="bg-white rounded-lg p-4">
                  <h5 className="font-medium text-gray-700 mb-2">AI Vision Analysis:</h5>
                  <div className="text-gray-900 leading-relaxed whitespace-pre-wrap max-h-96 overflow-y-auto">
                    {analysisResult.analysis.result}
                  </div>
                </div>
              </div>
            </div>
          ) : !isAnalyzing && !error && (
            // Welcome State
            <div className="text-center py-12">
              <div className="w-16 h-16 bg-indigo-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
                <Eye className="w-8 h-8 text-indigo-600" />
              </div>
              <h3 className="text-lg font-semibold text-gray-700 mb-2">
                Ready to Analyze!
              </h3>
              <p className="text-gray-600 max-w-md mx-auto">
                Upload any image to get AI-powered visual analysis, text extraction, and intelligent insights.
              </p>
            </div>
          )}
        </div>
      </div>
    </div>
  );
}

export default VisionAnalysis;

Step 3B: Adding Vision Analysis to Navigation


Update your src/App.jsx to include the new vision analysis component:

import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import FileAnalysis from "./FileAnalysis";
import TextToSpeech from "./TextToSpeech";
import VisionAnalysis from "./VisionAnalysis";
import { MessageSquare, Image, Mic, Folder, Volume2, Eye } from "lucide-react";

function App() {
  // 🧠 STATE: Navigation management
  const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', 'audio', 'files', 'speech', or 'vision'

  // 🎨 UI: Main app with navigation
  return (
    <div className="min-h-screen bg-gray-100">
      {/* Navigation Header */}
      <nav className="bg-white shadow-sm border-b border-gray-200">
        <div className="max-w-6xl mx-auto px-4">
          <div className="flex items-center justify-between h-16">
            {/* Logo */}
            <div className="flex items-center space-x-3">
              <div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
                <span className="text-white font-bold text-sm">AI</span>
              </div>
              <h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
            </div>
            {/* Navigation Buttons */}
            <div className="flex space-x-2">
              <button
                onClick={() => setCurrentView("chat")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "chat"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <MessageSquare className="w-4 h-4" />
                <span>Chat</span>
              </button>
              <button
                onClick={() => setCurrentView("images")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "images"
                    ? "bg-purple-100 text-purple-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Image className="w-4 h-4" />
                <span>Images</span>
              </button>
              <button
                onClick={() => setCurrentView("audio")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "audio"
                    ? "bg-blue-100 text-blue-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Mic className="w-4 h-4" />
                <span>Audio</span>
              </button>
              <button
                onClick={() => setCurrentView("files")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "files"
                    ? "bg-green-100 text-green-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Folder className="w-4 h-4" />
                <span>Files</span>
              </button>
              <button
                onClick={() => setCurrentView("speech")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "speech"
                    ? "bg-orange-100 text-orange-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Volume2 className="w-4 h-4" />
                <span>Speech</span>
              </button>
              <button
                onClick={() => setCurrentView("vision")}
                className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
                  currentView === "vision"
                    ? "bg-indigo-100 text-indigo-700 shadow-sm"
                    : "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
                }`}
              >
                <Eye className="w-4 h-4" />
                <span>Vision</span>
              </button>
            </div>
          </div>
        </div>
      </nav>
      {/* Main Content */}
      <main className="h-[calc(100vh-4rem)]">
        {currentView === "chat" && <StreamingChat />}
        {currentView === "images" && <ImageGenerator />}
        {currentView === "audio" && <AudioTranscription />}
        {currentView === "files" && <FileAnalysis />}
        {currentView === "speech" && <TextToSpeech />}
        {currentView === "vision" && <VisionAnalysis />}
      </main>
    </div>
  );
}

export default App;

Let’s test your vision analysis feature step by step to make sure everything works correctly.

First, verify your backend route works by testing it directly:

Test with a simple image:

# Test the endpoint with an image file
curl -X POST http://localhost:8000/api/vision/analyze \
  -F "image=@test-image.jpg" \
  -F "analysisType=general" \
  -F "includeOCR=true" \
  -F "extractData=true"

Expected response:

{
  "success": true,
  "file_info": {
    "name": "test-image.jpg",
    "size": 245678,
    "type": "image/jpeg"
  },
  "analysis": {
    "type": "general",
    "include_ocr": true,
    "extract_data": true,
    "result": "This image shows...",
    "model": "gpt-4o"
  },
  "timestamp": "2024-01-15T10:30:00.000Z"
}

Start both servers:

Backend (in your backend folder):

npm run dev

Frontend (in your frontend folder):

npm run dev

Test the complete flow:

  1. Navigate to Vision → Click the “Vision” tab in navigation
  2. Select analysis type → Choose “General”, “Document”, or “Chart” analysis
  3. Configure options → Enable OCR or data extraction as needed
  4. Upload an image → Try a screenshot, document, or chart
  5. Analyze → Click “Analyze Image” and see loading state
  6. View results → See AI analysis with image information
  7. Download → Test downloading analysis as JSON file
  8. Switch images → Try different image types and analysis modes

Test error scenarios:

❌ Large image: Upload image larger than 25MB
❌ Wrong type: Upload unsupported file (like .txt or .mp4)
❌ Empty upload: Try to analyze without selecting an image (see the smoke test below)
❌ Corrupt image: Upload damaged image file
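For the empty-upload case, a minimal smoke test — assuming Node 18+ so fetch and FormData are built in:

// Post the form without an "image" field and confirm the 400 path works.
const res = await fetch("http://localhost:8000/api/vision/analyze", {
  method: "POST",
  body: new FormData() // no image attached
});
console.log(res.status);       // expect 400
console.log(await res.json()); // { error: "Image file is required", success: false }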

Expected behavior:

  • Clear error messages displayed
  • No application crashes
  • User can try again with different image
  • Image upload resets properly after errors

Congratulations! You’ve extended your existing application with complete AI vision analysis:

  • Extended your backend with vision processing and GPT-4o integration
  • Added React vision component following the same patterns as your other features
  • Implemented intelligent image analysis for documents, charts, and general content
  • Created flexible analysis modes with OCR and data extraction options
  • Added download functionality for analysis results
  • Maintained consistent design with your existing application

Your application now has:

  • Text chat with streaming responses
  • Image generation with DALL-E 3 and GPT-Image-1
  • Audio transcription with Whisper voice recognition
  • File analysis with intelligent document processing
  • Text-to-speech with natural voice synthesis
  • Vision analysis with GPT-4o visual intelligence
  • Unified navigation between all features
  • Professional UI with consistent TailwindCSS styling

Complete OpenAI mastery achieved! You now have a comprehensive application that leverages all major OpenAI capabilities in a unified, professional interface. 👁️
