🔊 Make Your AI Talk Back!

Your AI can chat, create images, understand audio, and analyze files. Now let’s give it a voice! 🎤

Imagine users asking “What’s the weather like?” and your AI speaking back in a warm, friendly voice instead of just showing text. Or reading long articles aloud while they work on other things!

What we’re building: Your AI will be able to speak any text in 6 different voice personalities - from professional business tones to energetic marketing voices. It’s like having a team of voice actors inside your app!


Current state: Your AI shows brilliant text responses
Target state: Users can hear your AI speak with natural voices!

Before (Silent AI):

User: "Explain quantum physics"
AI: [Shows long text explanation]
User: [Has to read everything] 😴

After (Speaking AI):

User: "Explain quantum physics"
AI: [Shows text AND speaks it] 🔊
User: [Can listen while doing other things] 🎧

The magic: Your AI becomes accessible, engaging, and multitask-friendly!

Real-world impact:

  • 📱 Accessibility heroes - Visually impaired users can fully enjoy your app
  • 🏃‍♀️ Multitasking magic - Users can listen while exercising, driving, or working
  • 🧠 Learning boost - Audio learners absorb information better when they hear it
  • 📚 Instant podcasts - Turn any article into audio content on demand
  • 🎯 Better engagement - Voice keeps users active instead of passive readers

Without voice AI:

❌ Hire expensive voice actors
❌ Use robotic computer voices
❌ Miss users who prefer listening over reading
❌ Limited to text-only experiences

With voice AI:

✅ Professional voices in seconds
✅ Natural, engaging speech
✅ Serve all learning styles
✅ Complete multimedia experience

OpenAI gives you a complete voice acting team! Each one has a distinct personality:

🎙️ Alloy - The Professional

Perfect for: Business presentations, formal content
Sounds like: Your trusted corporate spokesperson
User feels: Confident and professional

🌊 Echo - The Calm Companion

Perfect for: Meditation apps, soothing content
Sounds like: Your gentle yoga instructor
User feels: Relaxed and peaceful

📚 Fable - The Master Storyteller

Perfect for: Creative content, engaging stories
Sounds like: Your favorite audiobook narrator
User feels: Captivated and entertained

🎯 Onyx - The Authority

Perfect for: News, important announcements
Sounds like: Your trusted news anchor
User feels: Informed and confident

☀️ Nova - The Friendly Helper

Perfect for: Tutorials, customer support
Sounds like: Your helpful best friend
User feels: Welcome and supported

✨ Shimmer - The Energy Booster

Perfect for: Marketing, motivational content
Sounds like: Your enthusiastic coach
User feels: Excited and motivated

Pro tip: We’ll build a voice selector so users can choose their favorite!


🛠️ Step 1: Add Voice Power to Your Backend

Good news: We’re using the exact same patterns you already know!

What you already have:

// Your familiar Response API pattern
const response = await openai.responses.create({
model: "gpt-4o",
input: [systemPrompt, userMessage]
});

What we’re adding:

// New voice synthesis (same style!)
const speech = await openai.audio.speech.create({
model: "tts-1",
voice: "alloy",
input: textToSpeak
});

Perfect! Same patterns, just different endpoints.

Simple concept: Text goes in → Beautiful voice comes out!

// What we need to track:
const voiceState = {
textInput: "Hello, I'm your AI assistant!", // What to say
selectedVoice: "nova", // Who says it
audioSettings: { // How to say it
speed: 1.0, // Normal speed
quality: "hd", // High definition
format: "mp3" // Audio format
},
generatedAudio: "audio-file-url", // Result!
}

Voice options:

  • 🏃‍♂️ TTS-1 - Fast generation (great for testing)
  • 💎 TTS-1-HD - Premium quality (perfect for production)
  • ⚡ Speed control - From 0.25x (slow) to 4x (fast)
  • 🎵 Formats - MP3, Opus, AAC, FLAC
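
Before these options reach the API, it's worth clamping and validating them on the server. Here's a minimal normalizer sketch; the `normalizeTtsSettings` helper and its fallback defaults are illustrative assumptions, not part of any SDK:

```javascript
// Hypothetical helper: clamp and validate user-supplied TTS settings
// before passing them to the API. The fallback defaults are assumptions.
const TTS_MODELS = ["tts-1", "tts-1-hd"];
const TTS_FORMATS = ["mp3", "opus", "aac", "flac"];

function normalizeTtsSettings({ model, speed, format } = {}) {
  return {
    // Unknown models fall back to the fast tier
    model: TTS_MODELS.includes(model) ? model : "tts-1",
    // Speed is clamped into the documented 0.25x-4x range
    speed: Math.max(0.25, Math.min(4.0, Number(speed) || 1.0)),
    // Unknown formats fall back to mp3
    format: TTS_FORMATS.includes(format) ? format : "mp3",
  };
}

console.log(normalizeTtsSettings({ model: "tts-1-hd", speed: 9, format: "wav" }));
// → { model: 'tts-1-hd', speed: 4, format: 'mp3' }
```

Sanitizing once at the boundary means the generation endpoint never has to worry about out-of-range speeds or typo'd format names.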

Add this to your existing server - same patterns you know and love:

import fs from 'fs';
import path from 'path';
// 🔊 VOICE PROFILES: Available AI voices with personalities
const VOICE_PROFILES = {
alloy: {
name: "Alloy",
description: "Professional and versatile",
bestFor: "Business content, presentations"
},
echo: {
name: "Echo",
description: "Calm and soothing",
bestFor: "Meditation, relaxation content"
},
fable: {
name: "Fable",
description: "Expressive storyteller",
bestFor: "Stories, creative content"
},
onyx: {
name: "Onyx",
description: "Deep and authoritative",
bestFor: "News, formal announcements"
},
nova: {
name: "Nova",
description: "Warm and friendly",
bestFor: "Customer service, tutorials"
},
shimmer: {
name: "Shimmer",
description: "Bright and energetic",
bestFor: "Marketing, upbeat content"
}
};
// 🔧 HELPER FUNCTIONS: Audio processing utilities
const saveAudioToTemp = async (audioBuffer, format = 'mp3') => {
const tempDir = path.join(process.cwd(), "temp");
// Create temp directory if it doesn't exist
if (!fs.existsSync(tempDir)) {
fs.mkdirSync(tempDir, { recursive: true });
}
// Create unique filename
const filename = `tts-${Date.now()}.${format}`;
const filepath = path.join(tempDir, filename);
// Write audio file
fs.writeFileSync(filepath, audioBuffer);
// Auto-cleanup after 1 hour
setTimeout(() => {
try {
if (fs.existsSync(filepath)) {
fs.unlinkSync(filepath);
console.log(`🧹 Cleaned up: ${filename}`);
}
} catch (error) {
console.error("Error cleaning up audio file:", error);
}
}, 3600000); // 1 hour
return { filepath, filename };
};
// 🔊 AI Text-to-Speech endpoint - add this to your existing server
app.post("/api/tts/generate", async (req, res) => {
try {
// 🛡️ VALIDATION: Check required inputs
const {
text,
voice = "alloy",
model = "tts-1",
speed = 1.0,
format = "mp3"
} = req.body;
if (!text || text.trim() === "") {
return res.status(400).json({
error: "Text is required",
success: false
});
}
if (text.length > 4096) {
return res.status(400).json({
error: "Text too long. Maximum 4096 characters allowed.",
current_length: text.length,
success: false
});
}
console.log(`🔊 Generating speech: ${text.substring(0, 50)}... (${voice})`);
// 🎙️ AI SPEECH GENERATION: Convert text to speech
const response = await openai.audio.speech.create({
model: model, // tts-1 (fast) or tts-1-hd (high quality)
voice: voice, // AI voice personality
input: text.trim(), // Text to convert
response_format: format, // Audio format (mp3, opus, aac, flac)
speed: Math.max(0.25, Math.min(4.0, speed)) // Speaking speed (0.25x to 4x)
});
// 💾 AUDIO PROCESSING: Save audio file
const audioBuffer = Buffer.from(await response.arrayBuffer());
const { filepath, filename } = await saveAudioToTemp(audioBuffer, format);
// 📤 SUCCESS RESPONSE: Send audio info and download link
res.json({
success: true,
audio: {
filename: filename,
format: format,
size: audioBuffer.length,
duration_estimate: Math.ceil(text.length / 14), // ~14 characters per second
download_url: `/api/tts/download/${filename}`
},
generation: {
voice: voice,
voice_info: VOICE_PROFILES[voice],
model: model,
speed: speed,
text_length: text.length
},
timestamp: new Date().toISOString()
});
} catch (error) {
// 🚨 ERROR HANDLING: Handle TTS failures
console.error("Text-to-speech error:", error);
res.status(500).json({
error: "Failed to generate speech",
details: error.message,
success: false
});
}
});
// 📥 Audio Download endpoint - serve generated audio files
app.get("/api/tts/download/:filename", (req, res) => {
try {
const { filename } = req.params;
const filepath = path.join(process.cwd(), "temp", filename);
// Security check - ensure filename is safe
if (!filename.match(/^tts-\d+\.(mp3|opus|aac|flac)$/)) {
return res.status(400).json({ error: "Invalid filename" });
}
// Check if file exists
if (!fs.existsSync(filepath)) {
return res.status(404).json({ error: "Audio file not found or expired" });
}
// Serve audio file
const extension = path.extname(filename).substring(1);
res.setHeader('Content-Type', `audio/${extension}`);
res.setHeader('Content-Disposition', `attachment; filename="${filename}"`);
const audioBuffer = fs.readFileSync(filepath);
res.send(audioBuffer);
} catch (error) {
console.error("Audio download error:", error);
res.status(500).json({
error: "Failed to download audio",
message: error.message
});
}
});
// 🎙️ Voice Information endpoint - get available voices
app.get("/api/tts/voices", (req, res) => {
res.json({
success: true,
voices: VOICE_PROFILES,
models: [
{
id: "tts-1",
name: "TTS-1",
description: "Fast, cost-effective synthesis",
quality: "standard"
},
{
id: "tts-1-hd",
name: "TTS-1 HD",
description: "High-definition audio quality",
quality: "premium"
}
],
formats: ["mp3", "opus", "aac", "flac"],
speed_range: { min: 0.25, max: 4.0, default: 1.0 },
text_limit: 4096
});
});

What this does (step by step):

  1. ✅ Validates text - Makes sure we have something to say
  2. 🎭 Picks voice - Selects the right AI personality
  3. 🎙️ Generates speech - OpenAI creates beautiful audio
  4. 💾 Saves file - Stores audio temporarily for download
  5. 📤 Returns results - Sends back audio URL and metadata
  6. 🧹 Cleans up - Removes old files automatically

Same reliable patterns as your chat and image features!
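
The request half of that flow can be exercised from any Node script. This sketch only builds the payload the endpoint expects; the helper name and the localhost base URL are assumptions for local development:

```javascript
// Hypothetical helper: build the fetch arguments for the /api/tts/generate
// endpoint described above. The base URL is an assumption for local dev.
function buildTtsRequest(text, { voice = "alloy", model = "tts-1", speed = 1.0, format = "mp3" } = {}) {
  // Mirror the server-side validation so bad requests fail fast
  if (!text || !text.trim()) throw new Error("Text is required");
  if (text.length > 4096) throw new Error("Text too long (max 4096 characters)");
  return {
    url: "http://localhost:8000/api/tts/generate",
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: text.trim(), voice, model, speed, format }),
    },
  };
}

const req = buildTtsRequest("Hello from the docs!", { voice: "nova" });
console.log(JSON.parse(req.options.body).voice); // → nova
// Usage (with a running server): fetch(req.url, req.options)
```

Validating client-side as well as server-side gives users instant feedback without a round trip.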

Add this middleware to handle text-to-speech specific errors:

// 🚨 TTS ERROR HANDLING: Handle text-to-speech errors
app.use((error, req, res, next) => {
if (error.message && error.message.includes('Invalid voice')) {
return res.status(400).json({
error: "Invalid voice selected. Please choose from: alloy, echo, fable, onyx, nova, shimmer",
success: false
});
}
if (error.message && error.message.includes('text too long')) {
return res.status(400).json({
error: "Text exceeds maximum length of 4096 characters",
success: false
});
}
next(error);
});
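
Express recognizes error middleware by its four-parameter signature and only invokes it once an error has been raised, so it must be registered after your routes. This toy model (not Express itself, just a simplified dispatcher for intuition) shows the idea:

```javascript
// Toy model (NOT Express itself): handlers with four parameters are
// treated as error handlers and only run once an error exists; they run
// in registration order, which is why they're registered after routes.
function dispatch(stack, req, res) {
  let err = null;
  for (const fn of stack) {
    const isErrorHandler = fn.length === 4;
    if (err && isErrorHandler) {
      let passed = null;
      fn(err, req, res, (e) => { passed = e; }); // next(e) re-raises
      err = passed; // null means the handler responded
    } else if (!err && !isErrorHandler) {
      try { fn(req, res, () => {}); } catch (e) { err = e; }
    }
  }
  return res;
}

const stack = [
  () => { throw new Error("Invalid voice: robot"); }, // a failing route
  (error, req, res, next) => { // error middleware, like the one above
    if (error.message.includes("Invalid voice")) { res.status = 400; return; }
    next(error);
  },
];
console.log(dispatch(stack, {}, {}).status); // → 400
```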

Your backend now supports:

  • Text chat (existing functionality)
  • Streaming chat (existing functionality)
  • Image generation (existing functionality)
  • Audio transcription (existing functionality)
  • File analysis (existing functionality)
  • Text-to-speech (new functionality)

🔧 Step 3: Building the React Text-to-Speech Component

Now let’s create a React component for text-to-speech using the same patterns from your existing components.

Step 3A: Creating the Text-to-Speech Component

Create a new file src/TextToSpeech.jsx:

import { useState, useRef, useEffect } from "react";
import { Volume2, Play, Pause, Download, Settings } from "lucide-react";
function TextToSpeech() {
// 🧠 STATE: Text-to-speech data management
const [text, setText] = useState(""); // Text to convert
const [selectedVoice, setSelectedVoice] = useState("alloy"); // AI voice selection
const [audioSettings, setAudioSettings] = useState({ // TTS settings
model: "tts-1",
speed: 1.0,
format: "mp3"
});
const [isGenerating, setIsGenerating] = useState(false); // Processing status
const [generatedAudio, setGeneratedAudio] = useState([]); // Generated audio list
const [currentlyPlaying, setCurrentlyPlaying] = useState(null); // Audio playback state
const [voices, setVoices] = useState({}); // Available voices
const [error, setError] = useState(null); // Error messages
const audioRef = useRef(null);
// Load available voices on component mount
useEffect(() => {
fetchVoices();
}, []);
const fetchVoices = async () => {
try {
const response = await fetch("http://localhost:8000/api/tts/voices");
const data = await response.json();
if (data.success) {
setVoices(data.voices);
}
} catch (error) {
console.error('Failed to fetch voices:', error);
}
};
// 🔧 FUNCTIONS: Text-to-speech logic engine
// Main speech generation function
const generateSpeech = async () => {
// 🛡️ GUARDS: Prevent invalid generation
if (!text.trim() || isGenerating) return;
// 🔄 SETUP: Prepare for generation
setIsGenerating(true);
setError(null);
try {
// 📤 API CALL: Send to your backend
const response = await fetch("http://localhost:8000/api/tts/generate", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
text: text.trim(),
voice: selectedVoice,
...audioSettings
})
});
const data = await response.json();
if (!response.ok) {
throw new Error(data.error || 'Failed to generate speech');
}
// ✅ SUCCESS: Store generated audio
const newAudio = {
id: Date.now(),
text: text.trim(),
voice: selectedVoice,
settings: audioSettings,
audio: data.audio,
generation: data.generation,
timestamp: new Date().toISOString()
};
setGeneratedAudio(prev => [newAudio, ...prev]);
setText(""); // Clear input after successful generation
} catch (error) {
// 🚨 ERROR HANDLING: Show user-friendly message
console.error('Speech generation failed:', error);
setError(error.message || 'Something went wrong while generating speech');
} finally {
// 🧹 CLEANUP: Reset generation state
setIsGenerating(false);
}
};
// Audio playback function
const playAudio = async (audioItem) => {
try {
if (currentlyPlaying?.id === audioItem.id) {
// Pause current audio
if (audioRef.current) {
audioRef.current.pause();
setCurrentlyPlaying(null);
}
return;
}
// Stop any currently playing audio
if (audioRef.current) {
audioRef.current.pause();
}
// Create new audio element
const audio = new Audio(`http://localhost:8000${audioItem.audio.download_url}`);
audioRef.current = audio;
audio.onloadstart = () => setCurrentlyPlaying({ ...audioItem, status: 'loading' });
audio.oncanplay = () => setCurrentlyPlaying({ ...audioItem, status: 'ready' });
audio.onplay = () => setCurrentlyPlaying({ ...audioItem, status: 'playing' });
audio.onpause = () => setCurrentlyPlaying({ ...audioItem, status: 'paused' });
audio.onended = () => setCurrentlyPlaying(null);
audio.onerror = () => {
setCurrentlyPlaying(null);
setError('Failed to play audio');
};
await audio.play();
} catch (error) {
console.error('Audio playback error:', error);
setCurrentlyPlaying(null);
setError('Failed to play audio');
}
};
// Download audio function
const downloadAudio = (audioItem) => {
try {
const link = document.createElement('a');
link.href = `http://localhost:8000${audioItem.audio.download_url}`;
link.download = `speech-${audioItem.id}.${audioItem.audio.format}`;
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
} catch (error) {
console.error('Download error:', error);
setError('Failed to download audio');
}
};
// Sample texts for quick testing
const sampleTexts = [
"Welcome to our application! I'm excited to help you with AI-powered text-to-speech.",
"Once upon a time, in the world of artificial intelligence, voices came alive with just a few lines of code.",
"This is a test of the emergency broadcast system. This is only a test.",
"Take a deep breath and relax as you listen to this calming AI-generated voice.",
"Breaking news: AI technology continues to amaze us with natural-sounding speech synthesis."
];
// Utility functions
const formatFileSize = (bytes) => {
if (bytes === 0) return '0 Bytes';
const k = 1024;
const sizes = ['Bytes', 'KB', 'MB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
};
const formatDuration = (seconds) => {
const mins = Math.floor(seconds / 60);
const secs = Math.floor(seconds % 60);
return `${mins}:${secs.toString().padStart(2, '0')}`;
};
// 🎨 UI: Interface components
return (
<div className="min-h-screen bg-gradient-to-br from-orange-50 to-red-50 flex items-center justify-center p-4">
<div className="bg-white rounded-2xl shadow-2xl w-full max-w-4xl flex flex-col overflow-hidden">
{/* Header */}
<div className="bg-gradient-to-r from-orange-600 to-red-600 text-white p-6">
<div className="flex items-center space-x-3">
<div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
<Volume2 className="w-5 h-5" />
</div>
<div>
<h1 className="text-xl font-bold">🔊 AI Text-to-Speech</h1>
<p className="text-orange-100 text-sm">Convert any text to natural speech!</p>
</div>
</div>
</div>
{/* Voice Settings Section */}
<div className="p-6 border-b border-gray-200">
<h3 className="font-semibold text-gray-900 mb-4 flex items-center">
<Settings className="w-5 h-5 mr-2 text-orange-600" />
Voice Settings
</h3>
<div className="grid grid-cols-1 md:grid-cols-4 gap-4">
{/* Voice Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Voice</label>
<select
value={selectedVoice}
onChange={(e) => setSelectedVoice(e.target.value)}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
{Object.entries(voices).map(([key, voice]) => (
<option key={key} value={key}>
{voice.name} - {voice.description}
</option>
))}
</select>
</div>
{/* Model Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Quality</label>
<select
value={audioSettings.model}
onChange={(e) => setAudioSettings(prev => ({ ...prev, model: e.target.value }))}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
<option value="tts-1">Standard (Fast)</option>
<option value="tts-1-hd">HD (High Quality)</option>
</select>
</div>
{/* Speed Control */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">
Speed ({audioSettings.speed}x)
</label>
<input
type="range"
min="0.25"
max="4"
step="0.05"
value={audioSettings.speed}
onChange={(e) => setAudioSettings(prev => ({ ...prev, speed: parseFloat(e.target.value) }))}
disabled={isGenerating}
className="w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer disabled:cursor-not-allowed"
/>
</div>
{/* Format Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Format</label>
<select
value={audioSettings.format}
onChange={(e) => setAudioSettings(prev => ({ ...prev, format: e.target.value }))}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
<option value="mp3">MP3</option>
<option value="opus">Opus</option>
<option value="aac">AAC</option>
<option value="flac">FLAC</option>
</select>
</div>
</div>
</div>
{/* Text Input Section */}
<div className="p-6 border-b border-gray-200">
<div className="mb-4">
<div className="flex justify-between items-center mb-2">
<label className="block text-sm font-medium text-gray-700">Text to Convert</label>
<span className="text-sm text-gray-500">{text.length}/4096 characters</span>
</div>
<textarea
value={text}
onChange={(e) => setText(e.target.value)}
placeholder="Enter the text you want to convert to speech..."
className="w-full px-4 py-3 border border-gray-300 rounded-xl focus:outline-none focus:ring-2 focus:ring-orange-500 focus:border-transparent transition-all duration-200 resize-none"
rows={4}
maxLength={4096}
disabled={isGenerating}
/>
</div>
{/* Sample Texts */}
<div className="mb-4">
<p className="text-sm text-gray-600 mb-2">Quick samples:</p>
<div className="flex flex-wrap gap-2">
{sampleTexts.map((sample, index) => (
<button
key={index}
onClick={() => setText(sample)}
disabled={isGenerating}
className="px-3 py-1 text-sm bg-gray-100 hover:bg-orange-100 text-gray-700 hover:text-orange-700 rounded-full transition-colors duration-200 disabled:opacity-50 disabled:cursor-not-allowed"
>
{sample.substring(0, 30)}...
</button>
))}
</div>
</div>
{/* Generate Button */}
<div className="flex justify-center">
<button
onClick={generateSpeech}
disabled={isGenerating || !text.trim()}
className="px-8 py-3 bg-gradient-to-r from-orange-600 to-red-600 hover:from-orange-700 hover:to-red-700 disabled:from-gray-300 disabled:to-gray-300 text-white rounded-xl transition-all duration-200 flex items-center space-x-2 shadow-lg disabled:shadow-none"
>
{isGenerating ? (
<>
<div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
<span>Generating...</span>
</>
) : (
<>
<Volume2 className="w-4 h-4" />
<span>Generate Speech</span>
</>
)}
</button>
</div>
</div>
{/* Results Section */}
<div className="flex-1 p-6">
{/* Error Display */}
{error && (
<div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
<p className="text-red-700">
<strong>Error:</strong> {error}
</p>
</div>
)}
{/* Generated Audio List */}
{generatedAudio.length === 0 ? (
<div className="text-center py-12">
<div className="w-16 h-16 bg-orange-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
<Volume2 className="w-8 h-8 text-orange-600" />
</div>
<h3 className="text-lg font-semibold text-gray-700 mb-2">
No Audio Generated Yet
</h3>
<p className="text-gray-600 max-w-md mx-auto">
Enter some text above and click "Generate Speech" to create your first AI voice.
</p>
</div>
) : (
<div className="space-y-4">
<h4 className="font-semibold text-gray-900 mb-4">
Generated Audio ({generatedAudio.length})
</h4>
{generatedAudio.map((audioItem) => (
<div key={audioItem.id} className="bg-gray-50 rounded-lg p-4 border border-gray-200">
<div className="flex items-start justify-between mb-3">
<div className="flex-1">
<div className="flex items-center space-x-2 mb-2">
<div className="p-1 bg-orange-100 rounded">
<Volume2 className="w-4 h-4 text-orange-600" />
</div>
<span className="font-medium text-gray-900 text-sm">
{voices[audioItem.voice]?.name || audioItem.voice}
</span>
<span className="text-xs text-gray-500">
{new Date(audioItem.timestamp).toLocaleTimeString()}
</span>
</div>
<p className="text-sm text-gray-700 mb-2 line-clamp-2">
{audioItem.text}
</p>
<div className="flex flex-wrap gap-1 text-xs">
<span className="px-2 py-1 bg-orange-100 text-orange-800 rounded-full">
{audioItem.settings.model}
</span>
<span className="px-2 py-1 bg-blue-100 text-blue-800 rounded-full">
{audioItem.settings.speed}x speed
</span>
<span className="px-2 py-1 bg-green-100 text-green-800 rounded-full">
{formatFileSize(audioItem.audio.size)}
</span>
<span className="px-2 py-1 bg-gray-100 text-gray-800 rounded-full">
~{formatDuration(audioItem.audio.duration_estimate)}
</span>
</div>
</div>
<div className="flex items-center space-x-2">
<button
onClick={() => playAudio(audioItem)}
className="p-2 bg-orange-500 hover:bg-orange-600 text-white rounded-lg transition-colors duration-200"
title={currentlyPlaying?.id === audioItem.id ? "Pause" : "Play"}
>
{currentlyPlaying?.id === audioItem.id && currentlyPlaying?.status === 'playing' ? (
<Pause className="w-4 h-4" />
) : (
<Play className="w-4 h-4" />
)}
</button>
<button
onClick={() => downloadAudio(audioItem)}
className="p-2 bg-green-500 hover:bg-green-600 text-white rounded-lg transition-colors duration-200"
title="Download audio"
>
<Download className="w-4 h-4" />
</button>
</div>
</div>
</div>
))}
</div>
)}
</div>
</div>
</div>
);
}
export default TextToSpeech;
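
The component’s two utility helpers are pure functions, so you can sanity-check them outside React. This standalone sketch copies them verbatim from the component:

```javascript
// Copies of the component's pure helpers, runnable outside React.
const formatFileSize = (bytes) => {
  if (bytes === 0) return '0 Bytes';
  const k = 1024;
  const sizes = ['Bytes', 'KB', 'MB'];
  const i = Math.floor(Math.log(bytes) / Math.log(k));
  return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
};

const formatDuration = (seconds) => {
  const mins = Math.floor(seconds / 60);
  const secs = Math.floor(seconds % 60);
  return `${mins}:${secs.toString().padStart(2, '0')}`;
};

console.log(formatFileSize(15420)); // → 15.06 KB
console.log(formatDuration(65));    // → 1:05
```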

Step 3B: Adding Text-to-Speech to Navigation

Update your src/App.jsx to include the new text-to-speech component:

import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import FileAnalysis from "./FileAnalysis";
import TextToSpeech from "./TextToSpeech";
import { MessageSquare, Image, Mic, Folder, Volume2 } from "lucide-react";
function App() {
// 🧠 STATE: Navigation management
const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', 'audio', 'files', or 'speech'
// 🎨 UI: Main app with navigation
return (
<div className="min-h-screen bg-gray-100">
{/* Navigation Header */}
<nav className="bg-white shadow-sm border-b border-gray-200">
<div className="max-w-6xl mx-auto px-4">
<div className="flex items-center justify-between h-16">
{/* Logo */}
<div className="flex items-center space-x-3">
<div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
<span className="text-white font-bold text-sm">AI</span>
</div>
<h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
</div>
{/* Navigation Buttons */}
<div className="flex space-x-2">
<button
onClick={() => setCurrentView("chat")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "chat"
? "bg-blue-100 text-blue-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<MessageSquare className="w-4 h-4" />
<span>Chat</span>
</button>
<button
onClick={() => setCurrentView("images")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "images"
? "bg-purple-100 text-purple-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Image className="w-4 h-4" />
<span>Images</span>
</button>
<button
onClick={() => setCurrentView("audio")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "audio"
? "bg-blue-100 text-blue-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Mic className="w-4 h-4" />
<span>Audio</span>
</button>
<button
onClick={() => setCurrentView("files")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "files"
? "bg-green-100 text-green-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Folder className="w-4 h-4" />
<span>Files</span>
</button>
<button
onClick={() => setCurrentView("speech")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "speech"
? "bg-orange-100 text-orange-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Volume2 className="w-4 h-4" />
<span>Speech</span>
</button>
</div>
</div>
</div>
</nav>
{/* Main Content */}
<main className="h-[calc(100vh-4rem)]">
{currentView === "chat" && <StreamingChat />}
{currentView === "images" && <ImageGenerator />}
{currentView === "audio" && <AudioTranscription />}
{currentView === "files" && <FileAnalysis />}
{currentView === "speech" && <TextToSpeech />}
</main>
</div>
);
}
export default App;

Let’s test your text-to-speech feature step by step to make sure everything works correctly.

First, verify your backend route works by testing it directly:

Test with a simple text:

curl -X POST http://localhost:8000/api/tts/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello, this is a test of AI voice synthesis.", "voice": "alloy", "model": "tts-1"}'

Expected response:

{
"success": true,
"audio": {
"filename": "tts-1234567890.mp3",
"format": "mp3",
"size": 15420,
"duration_estimate": 3,
"download_url": "/api/tts/download/tts-1234567890.mp3"
},
"generation": {
"voice": "alloy",
"voice_info": {
"name": "Alloy",
"description": "Professional and versatile"
},
"model": "tts-1",
"speed": 1.0,
"text_length": 44
}
}

Start both servers:

Backend (in your backend folder):

npm run dev

Frontend (in your frontend folder):

npm run dev

Test the complete flow:

  1. Navigate to Speech → Click the “Speech” tab in navigation
  2. Select voice settings → Choose voice, quality, speed, and format
  3. Enter text → Type or select a sample text
  4. Generate speech → Click “Generate Speech” and see loading state
  5. Listen to audio → Click play button to hear the generated voice
  6. Download audio → Test downloading the speech file
  7. Try different voices → Test all six AI voices with the same text

Test all six voices with the same text to hear their personalities:

🎙️ Alloy: Professional and neutral
🌊 Echo: Calm and soothing
📚 Fable: Expressive storyteller
🎯 Onyx: Deep and authoritative
☀️ Nova: Warm and friendly
✨ Shimmer: Bright and energetic
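
Rather than clicking through the UI six times, you can build all six request payloads in a loop. The voice list below is copied from the server’s VOICE_PROFILES; the sample sentence is arbitrary:

```javascript
// Build one request body per voice so the same sentence can be generated
// with every personality. Actually sending them requires a running server.
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];
const sentence = "The quick brown fox jumps over the lazy dog.";

const payloads = VOICES.map((voice) => ({
  text: sentence,
  voice,
  model: "tts-1", // the fast tier is fine for comparisons
  format: "mp3",
}));

console.log(payloads.length); // → 6
// Usage (with a running server):
// for (const body of payloads) {
//   await fetch("http://localhost:8000/api/tts/generate", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(body),
//   });
// }
```

Listening to the six results back-to-back is the fastest way to hear the personality differences described above.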

Expected behavior:

  • Each voice has distinct personality and tone
  • Audio quality is clear and natural
  • Playback controls work smoothly
  • Download generates proper audio files

Congratulations! You’ve completed your comprehensive OpenAI mastery application with text-to-speech:

  • Extended your backend with voice synthesis and audio file management
  • Added React speech component following the same patterns as your other features
  • Implemented six AI voices with distinct personalities and use cases
  • Created flexible audio settings for quality, speed, and format control
  • Added playback functionality with play/pause controls
  • Maintained consistent design with your existing application

Your complete application now has:

  • Text chat with streaming responses
  • Image generation with DALL-E 3 and GPT-Image-1
  • Audio transcription with Whisper voice recognition
  • File analysis with intelligent document processing
  • Text-to-speech with six AI voice personalities
  • Unified navigation between all features
  • Professional UI with consistent TailwindCSS styling

🎉 You’ve built a complete OpenAI mastery application! Your users can now chat with AI, generate images, transcribe audio, analyze files, and hear AI responses spoken aloud - all in one seamless experience.

Your application demonstrates mastery of OpenAI’s entire ecosystem and provides a solid foundation for building even more advanced AI-powered applications. 🔊