
πŸ”Š Make Your AI Talk Back!

Your AI can chat, create images, understand audio, and analyze files. Now let’s give it a voice! 🎀

Imagine users asking β€œWhat’s the weather like?” and your AI speaking back in a warm, friendly voice instead of just showing text. Or reading long articles aloud while they work on other things!

What we’re building: Your AI will be able to speak any text in 6 different voice personalities - from professional business tones to energetic marketing voices. It’s like having a team of voice actors inside your app!


Current state: Your AI shows brilliant text responses.
Target state: Users can hear your AI speak with natural voices!

Before (Silent AI):

User: "Explain quantum physics"
AI: [Shows long text explanation]
User: [Has to read everything] 😴

After (Speaking AI):

User: "Explain quantum physics"
AI: [Shows text AND speaks it] πŸ”Š
User: [Can listen while doing other things] 🎧

The magic: Your AI becomes accessible, engaging, and multitask-friendly!

Real-world impact:

  • πŸ“± Accessibility heroes - Visually impaired users can fully enjoy your app
  • πŸƒβ€β™€οΈ Multitasking magic - Users can listen while exercising, driving, or working
  • 🧠 Learning boost - Audio learners absorb information better when they hear it
  • πŸ“š Instant podcasts - Turn any article into audio content on demand
  • 🎯 Better engagement - Voice keeps users active instead of passive readers

Without voice AI:

❌ Hire expensive voice actors
❌ Use robotic computer voices
❌ Miss the many users who prefer listening over reading
❌ Limited to text-only experiences

With voice AI:

βœ… Professional voices in seconds
βœ… Natural, engaging speech
βœ… Serve all learning styles
βœ… Complete multimedia experience

OpenAI gives you a complete voice acting team! Each one has a distinct personality:

πŸŽ™οΈ Alloy - The Professional

Perfect for: Business presentations, formal content
Sounds like: Your trusted corporate spokesperson
User feels: Confident and professional

🌊 Echo - The Calm Companion

Perfect for: Meditation apps, soothing content
Sounds like: Your gentle yoga instructor
User feels: Relaxed and peaceful

πŸ“š Fable - The Master Storyteller

Perfect for: Creative content, engaging stories
Sounds like: Your favorite audiobook narrator
User feels: Captivated and entertained

🎯 Onyx - The Authority

Perfect for: News, important announcements
Sounds like: Your trusted news anchor
User feels: Informed and confident

β˜€οΈ Nova - The Friendly Helper

Perfect for: Tutorials, customer support
Sounds like: Your helpful best friend
User feels: Welcome and supported

✨ Shimmer - The Energy Booster

Perfect for: Marketing, motivational content
Sounds like: Your enthusiastic coach
User feels: Excited and motivated

Pro tip: We’ll build a voice selector so users can choose their favorite!


πŸ› οΈ Step 1: Add Voice Power to Your Backend


Good news: We’re using the exact same patterns you already know!

What you already have:

// Your familiar Response API pattern
const response = await client.responses.create({
model: "gpt-4o",
input: [systemPrompt, userMessage]
});

What we’re adding:

// New voice synthesis (same style!)
const speech = await client.audio.speech.create({
model: "tts-1",
voice: "alloy",
input: textToSpeak
});

Perfect! Same patterns, just different endpoints.

Simple concept: Text goes in β†’ Beautiful voice comes out!

// What we need to track:
const voiceState = {
textInput: "Hello, I'm your AI assistant!", // What to say
selectedVoice: "nova", // Who says it
audioSettings: { // How to say it
speed: 1.0, // Normal speed
quality: "hd", // High definition
format: "mp3" // Audio format
},
generatedAudio: "audio-file-url", // Result!
}

Voice options:

  • πŸƒβ€β™‚οΈ TTS-1 - Fast generation (great for testing)
  • πŸ’Ž TTS-1-HD - Premium quality (perfect for production)
  • ⚑ Speed control - From 0.25x (slow) to 4x (fast)
  • 🎡 Formats - MP3, Opus, AAC, FLAC
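Before wiring these options into a route, it helps to pin down the valid ranges in code. A minimal sketch (the helper names here are illustrative, not part of the OpenAI SDK):

```javascript
// Clamp speed to the supported 0.25x–4x range and whitelist the audio formats.
const clampSpeed = (speed) => Math.max(0.25, Math.min(4.0, speed));

const SUPPORTED_FORMATS = ["mp3", "opus", "aac", "flac"];
const isSupportedFormat = (format) => SUPPORTED_FORMATS.includes(format);
```

We'll use exactly this clamping expression inside the generation endpoint, so out-of-range slider values can never reach the API.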

Add this to your existing server - same patterns you know and love:

import fs from 'fs';
import path from 'path';
// πŸ”Š VOICE PROFILES: Available AI voices with personalities
const VOICE_PROFILES = {
alloy: {
name: "Alloy",
description: "Professional and versatile",
bestFor: "Business content, presentations"
},
echo: {
name: "Echo",
description: "Calm and soothing",
bestFor: "Meditation, relaxation content"
},
fable: {
name: "Fable",
description: "Expressive storyteller",
bestFor: "Stories, creative content"
},
onyx: {
name: "Onyx",
description: "Deep and authoritative",
bestFor: "News, formal announcements"
},
nova: {
name: "Nova",
description: "Warm and friendly",
bestFor: "Customer service, tutorials"
},
shimmer: {
name: "Shimmer",
description: "Bright and energetic",
bestFor: "Marketing, upbeat content"
}
};
// πŸ”§ HELPER FUNCTIONS: Audio processing utilities
const saveAudioToTemp = async (audioBuffer, format = 'mp3') => {
const tempDir = path.join(process.cwd(), "temp");
// Create temp directory if it doesn't exist
if (!fs.existsSync(tempDir)) {
fs.mkdirSync(tempDir, { recursive: true });
}
// Create unique filename
const filename = `tts-${Date.now()}.${format}`;
const filepath = path.join(tempDir, filename);
// Write audio file
fs.writeFileSync(filepath, audioBuffer);
// Auto-cleanup after 1 hour
setTimeout(() => {
try {
if (fs.existsSync(filepath)) {
fs.unlinkSync(filepath);
console.log(`🧹 Cleaned up: ${filename}`);
}
} catch (error) {
console.error("Error cleaning up audio file:", error);
}
}, 3600000); // 1 hour
return { filepath, filename };
};
// πŸ”Š AI Text-to-Speech endpoint - add this to your existing server
app.post("/api/tts/generate", async (req, res) => {
try {
// πŸ›‘οΈ VALIDATION: Check required inputs
const {
text,
voice = "alloy",
model = "tts-1",
speed = 1.0,
format = "mp3"
} = req.body;
if (!text || text.trim() === "") {
return res.status(400).json({
error: "Text is required",
success: false
});
}
if (text.length > 4096) {
return res.status(400).json({
error: "Text too long. Maximum 4096 characters allowed.",
current_length: text.length,
success: false
});
}
console.log(`πŸ”Š Generating speech: ${text.substring(0, 50)}... (${voice})`);
// πŸŽ™οΈ AI SPEECH GENERATION: Convert text to speech
const response = await openai.audio.speech.create({
model: model, // tts-1 (fast) or tts-1-hd (high quality)
voice: voice, // AI voice personality
input: text.trim(), // Text to convert
response_format: format, // Audio format (mp3, opus, aac, flac)
speed: Math.max(0.25, Math.min(4.0, speed)) // Speaking speed (0.25x to 4x)
});
// πŸ’Ύ AUDIO PROCESSING: Save audio file
const audioBuffer = Buffer.from(await response.arrayBuffer());
const { filepath, filename } = await saveAudioToTemp(audioBuffer, format);
// πŸ“€ SUCCESS RESPONSE: Send audio info and download link
res.json({
success: true,
audio: {
filename: filename,
format: format,
size: audioBuffer.length,
duration_estimate: Math.ceil(text.length / 14), // ~14 characters per second
download_url: `/api/tts/download/${filename}`
},
generation: {
voice: voice,
voice_info: VOICE_PROFILES[voice],
model: model,
speed: speed,
text_length: text.length
},
timestamp: new Date().toISOString()
});
} catch (error) {
// 🚨 ERROR HANDLING: Handle TTS failures
console.error("Text-to-speech error:", error);
res.status(500).json({
error: "Failed to generate speech",
details: error.message,
success: false
});
}
});
// πŸ“₯ Audio Download endpoint - serve generated audio files
app.get("/api/tts/download/:filename", (req, res) => {
try {
const { filename } = req.params;
const filepath = path.join(process.cwd(), "temp", filename);
// Security check - ensure filename is safe
if (!filename.match(/^tts-\d+\.(mp3|opus|aac|flac)$/)) {
return res.status(400).json({ error: "Invalid filename" });
}
// Check if file exists
if (!fs.existsSync(filepath)) {
return res.status(404).json({ error: "Audio file not found or expired" });
}
// Serve audio file
const extension = path.extname(filename).substring(1);
res.setHeader('Content-Type', `audio/${extension}`);
res.setHeader('Content-Disposition', `attachment; filename="${filename}"`);
const audioBuffer = fs.readFileSync(filepath);
res.send(audioBuffer);
} catch (error) {
console.error("Audio download error:", error);
res.status(500).json({
error: "Failed to download audio",
message: error.message
});
}
});
// πŸŽ™οΈ Voice Information endpoint - get available voices
app.get("/api/tts/voices", (req, res) => {
res.json({
success: true,
voices: VOICE_PROFILES,
models: [
{
id: "tts-1",
name: "TTS-1",
description: "Fast, cost-effective synthesis",
quality: "standard"
},
{
id: "tts-1-hd",
name: "TTS-1 HD",
description: "High-definition audio quality",
quality: "premium"
}
],
formats: ["mp3", "opus", "aac", "flac"],
speed_range: { min: 0.25, max: 4.0, default: 1.0 },
text_limit: 4096
});
});
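The download route's regex guard deserves a closer look, since it is what stops path-traversal requests. Extracted as a standalone helper (illustrative; the route above inlines the same pattern):

```javascript
// Only serve files this server itself created: "tts-<timestamp>.<known extension>".
// Anything else, including "../" tricks, is rejected before touching the filesystem.
const isSafeTtsFilename = (name) =>
  /^tts-\d+\.(mp3|opus|aac|flac)$/.test(name);
```

Because the regex is anchored with `^` and `$`, a request like `/api/tts/download/..%2F..%2Fetc%2Fpasswd` fails the check before any file I/O happens.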

What this does (step by step):

  1. βœ… Validates text - Makes sure we have something to say
  2. 🎭 Picks voice - Selects the right AI personality
  3. πŸŽ™οΈ Generates speech - OpenAI creates beautiful audio
  4. πŸ’Ύ Saves file - Stores audio temporarily for download
  5. πŸ“€ Returns results - Sends back audio URL and metadata
  6. 🧹 Cleans up - Removes old files automatically

Same reliable patterns as your chat and image features!
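One detail worth calling out: the duration_estimate field in the response is only a heuristic, not a measured value. The same formula, standalone, assuming roughly 14 characters of English per second of speech at 1x speed:

```javascript
// Rough length estimate: ~14 characters of English per second of TTS audio at
// normal speed. Good enough for a UI label, not for precise timing.
const estimateDurationSeconds = (text) => Math.ceil(text.length / 14);
```

For exact durations you would need to inspect the generated audio file itself; for showing "~3 seconds" next to a play button, the estimate is plenty.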

Add this middleware to handle text-to-speech specific errors:

// 🚨 TTS ERROR HANDLING: Handle text-to-speech errors
app.use((error, req, res, next) => {
if (error.message && error.message.includes('Invalid voice')) {
return res.status(400).json({
error: "Invalid voice selected. Please choose from: alloy, echo, fable, onyx, nova, shimmer",
success: false
});
}
if (error.message && error.message.includes('text too long')) {
return res.status(400).json({
error: "Text exceeds maximum length of 4096 characters",
success: false
});
}
next(error);
});
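Outside of Express, the same branching can be expressed as a pure function, which makes the message-to-status mapping easy to unit-test (the function name is illustrative, not part of the server above):

```javascript
// Map known TTS error messages to HTTP status codes; anything
// unrecognized falls through as a 500 server error.
const classifyTtsError = (message = "") => {
  if (message.includes("Invalid voice")) return 400;
  if (message.includes("text too long")) return 400;
  return 500;
};
```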

Your backend now supports:

  • Text chat (existing functionality)
  • Streaming chat (existing functionality)
  • Image generation (existing functionality)
  • Audio transcription (existing functionality)
  • File analysis (existing functionality)
  • Text-to-speech (new functionality)
πŸ”§ Step 3: Building the React Text-to-Speech Component

Now let’s create a React component for text-to-speech using the same patterns from your existing components.

Step 3A: Creating the Text-to-Speech Component

Create a new file src/TextToSpeech.jsx:
import { useState, useRef, useEffect } from "react";
import { Volume2, Play, Pause, Download, Settings } from "lucide-react";
function TextToSpeech() {
// 🧠 STATE: Text-to-speech data management
const [text, setText] = useState(""); // Text to convert
const [selectedVoice, setSelectedVoice] = useState("alloy"); // AI voice selection
const [audioSettings, setAudioSettings] = useState({ // TTS settings
model: "tts-1",
speed: 1.0,
format: "mp3"
});
const [isGenerating, setIsGenerating] = useState(false); // Processing status
const [generatedAudio, setGeneratedAudio] = useState([]); // Generated audio list
const [currentlyPlaying, setCurrentlyPlaying] = useState(null); // Audio playback state
const [voices, setVoices] = useState({}); // Available voices
const [error, setError] = useState(null); // Error messages
const audioRef = useRef(null);
// Load available voices on component mount
useEffect(() => {
fetchVoices();
}, []);
const fetchVoices = async () => {
try {
const response = await fetch("http://localhost:8000/api/tts/voices");
const data = await response.json();
if (data.success) {
setVoices(data.voices);
}
} catch (error) {
console.error('Failed to fetch voices:', error);
}
};
// πŸ”§ FUNCTIONS: Text-to-speech logic engine
// Main speech generation function
const generateSpeech = async () => {
// πŸ›‘οΈ GUARDS: Prevent invalid generation
if (!text.trim() || isGenerating) return;
// πŸ”„ SETUP: Prepare for generation
setIsGenerating(true);
setError(null);
try {
// πŸ“€ API CALL: Send to your backend
const response = await fetch("http://localhost:8000/api/tts/generate", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
text: text.trim(),
voice: selectedVoice,
...audioSettings
})
});
const data = await response.json();
if (!response.ok) {
throw new Error(data.error || 'Failed to generate speech');
}
// βœ… SUCCESS: Store generated audio
const newAudio = {
id: Date.now(),
text: text.trim(),
voice: selectedVoice,
settings: audioSettings,
audio: data.audio,
generation: data.generation,
timestamp: new Date().toISOString()
};
setGeneratedAudio(prev => [newAudio, ...prev]);
setText(""); // Clear input after successful generation
} catch (error) {
// 🚨 ERROR HANDLING: Show user-friendly message
console.error('Speech generation failed:', error);
setError(error.message || 'Something went wrong while generating speech');
} finally {
// 🧹 CLEANUP: Reset generation state
setIsGenerating(false);
}
};
// Audio playback function
const playAudio = async (audioItem) => {
try {
if (currentlyPlaying?.id === audioItem.id) {
// Pause current audio
if (audioRef.current) {
audioRef.current.pause();
setCurrentlyPlaying(null);
}
return;
}
// Stop any currently playing audio
if (audioRef.current) {
audioRef.current.pause();
}
// Create new audio element
const audio = new Audio(`http://localhost:8000${audioItem.audio.download_url}`);
audioRef.current = audio;
audio.onloadstart = () => setCurrentlyPlaying({ ...audioItem, status: 'loading' });
audio.oncanplay = () => setCurrentlyPlaying({ ...audioItem, status: 'ready' });
audio.onplay = () => setCurrentlyPlaying({ ...audioItem, status: 'playing' });
audio.onpause = () => setCurrentlyPlaying({ ...audioItem, status: 'paused' });
audio.onended = () => setCurrentlyPlaying(null);
audio.onerror = () => {
setCurrentlyPlaying(null);
setError('Failed to play audio');
};
await audio.play();
} catch (error) {
console.error('Audio playback error:', error);
setCurrentlyPlaying(null);
setError('Failed to play audio');
}
};
// Download audio function
const downloadAudio = (audioItem) => {
try {
const link = document.createElement('a');
link.href = `http://localhost:8000${audioItem.audio.download_url}`;
link.download = `speech-${audioItem.id}.${audioItem.audio.format}`;
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
} catch (error) {
console.error('Download error:', error);
setError('Failed to download audio');
}
};
// Sample texts for quick testing
const sampleTexts = [
"Welcome to our application! I'm excited to help you with AI-powered text-to-speech.",
"Once upon a time, in the world of artificial intelligence, voices came alive with just a few lines of code.",
"This is a test of the emergency broadcast system. This is only a test.",
"Take a deep breath and relax as you listen to this calming AI-generated voice.",
"Breaking news: AI technology continues to amaze us with natural-sounding speech synthesis."
];
// Utility functions
const formatFileSize = (bytes) => {
if (bytes === 0) return '0 Bytes';
const k = 1024;
const sizes = ['Bytes', 'KB', 'MB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
};
const formatDuration = (seconds) => {
const mins = Math.floor(seconds / 60);
const secs = Math.floor(seconds % 60);
return `${mins}:${secs.toString().padStart(2, '0')}`;
};
// 🎨 UI: Interface components
return (
<div className="min-h-screen bg-gradient-to-br from-orange-50 to-red-50 flex items-center justify-center p-4">
<div className="bg-white rounded-2xl shadow-2xl w-full max-w-4xl flex flex-col overflow-hidden">
{/* Header */}
<div className="bg-gradient-to-r from-orange-600 to-red-600 text-white p-6">
<div className="flex items-center space-x-3">
<div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
<Volume2 className="w-5 h-5" />
</div>
<div>
<h1 className="text-xl font-bold">πŸ”Š AI Text-to-Speech</h1>
<p className="text-orange-100 text-sm">Convert any text to natural speech!</p>
</div>
</div>
</div>
{/* Voice Settings Section */}
<div className="p-6 border-b border-gray-200">
<h3 className="font-semibold text-gray-900 mb-4 flex items-center">
<Settings className="w-5 h-5 mr-2 text-orange-600" />
Voice Settings
</h3>
<div className="grid grid-cols-1 md:grid-cols-4 gap-4">
{/* Voice Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Voice</label>
<select
value={selectedVoice}
onChange={(e) => setSelectedVoice(e.target.value)}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
{Object.entries(voices).map(([key, voice]) => (
<option key={key} value={key}>
{voice.name} - {voice.description}
</option>
))}
</select>
</div>
{/* Model Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Quality</label>
<select
value={audioSettings.model}
onChange={(e) => setAudioSettings(prev => ({ ...prev, model: e.target.value }))}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
<option value="tts-1">Standard (Fast)</option>
<option value="tts-1-hd">HD (High Quality)</option>
</select>
</div>
{/* Speed Control */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">
Speed ({audioSettings.speed}x)
</label>
<input
type="range"
min="0.25"
max="4"
step="0.05"
value={audioSettings.speed}
onChange={(e) => setAudioSettings(prev => ({ ...prev, speed: parseFloat(e.target.value) }))}
disabled={isGenerating}
className="w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer disabled:cursor-not-allowed"
/>
</div>
{/* Format Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Format</label>
<select
value={audioSettings.format}
onChange={(e) => setAudioSettings(prev => ({ ...prev, format: e.target.value }))}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
<option value="mp3">MP3</option>
<option value="opus">Opus</option>
<option value="aac">AAC</option>
<option value="flac">FLAC</option>
</select>
</div>
</div>
</div>
{/* Text Input Section */}
<div className="p-6 border-b border-gray-200">
<div className="mb-4">
<div className="flex justify-between items-center mb-2">
<label className="block text-sm font-medium text-gray-700">Text to Convert</label>
<span className="text-sm text-gray-500">{text.length}/4096 characters</span>
</div>
<textarea
value={text}
onChange={(e) => setText(e.target.value)}
placeholder="Enter the text you want to convert to speech..."
className="w-full px-4 py-3 border border-gray-300 rounded-xl focus:outline-none focus:ring-2 focus:ring-orange-500 focus:border-transparent transition-all duration-200 resize-none"
rows={4}
maxLength={4096}
disabled={isGenerating}
/>
</div>
{/* Sample Texts */}
<div className="mb-4">
<p className="text-sm text-gray-600 mb-2">Quick samples:</p>
<div className="flex flex-wrap gap-2">
{sampleTexts.map((sample, index) => (
<button
key={index}
onClick={() => setText(sample)}
disabled={isGenerating}
className="px-3 py-1 text-sm bg-gray-100 hover:bg-orange-100 text-gray-700 hover:text-orange-700 rounded-full transition-colors duration-200 disabled:opacity-50 disabled:cursor-not-allowed"
>
{sample.substring(0, 30)}...
</button>
))}
</div>
</div>
{/* Generate Button */}
<div className="flex justify-center">
<button
onClick={generateSpeech}
disabled={isGenerating || !text.trim()}
className="px-8 py-3 bg-gradient-to-r from-orange-600 to-red-600 hover:from-orange-700 hover:to-red-700 disabled:from-gray-300 disabled:to-gray-300 text-white rounded-xl transition-all duration-200 flex items-center space-x-2 shadow-lg disabled:shadow-none"
>
{isGenerating ? (
<>
<div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
<span>Generating...</span>
</>
) : (
<>
<Volume2 className="w-4 h-4" />
<span>Generate Speech</span>
</>
)}
</button>
</div>
</div>
{/* Results Section */}
<div className="flex-1 p-6">
{/* Error Display */}
{error && (
<div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
<p className="text-red-700">
<strong>Error:</strong> {error}
</p>
</div>
)}
{/* Generated Audio List */}
{generatedAudio.length === 0 ? (
<div className="text-center py-12">
<div className="w-16 h-16 bg-orange-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
<Volume2 className="w-8 h-8 text-orange-600" />
</div>
<h3 className="text-lg font-semibold text-gray-700 mb-2">
No Audio Generated Yet
</h3>
<p className="text-gray-600 max-w-md mx-auto">
Enter some text above and click "Generate Speech" to create your first AI voice.
</p>
</div>
) : (
<div className="space-y-4">
<h4 className="font-semibold text-gray-900 mb-4">
Generated Audio ({generatedAudio.length})
</h4>
{generatedAudio.map((audioItem) => (
<div key={audioItem.id} className="bg-gray-50 rounded-lg p-4 border border-gray-200">
<div className="flex items-start justify-between mb-3">
<div className="flex-1">
<div className="flex items-center space-x-2 mb-2">
<div className="p-1 bg-orange-100 rounded">
<Volume2 className="w-4 h-4 text-orange-600" />
</div>
<span className="font-medium text-gray-900 text-sm">
{voices[audioItem.voice]?.name || audioItem.voice}
</span>
<span className="text-xs text-gray-500">
{new Date(audioItem.timestamp).toLocaleTimeString()}
</span>
</div>
<p className="text-sm text-gray-700 mb-2 line-clamp-2">
{audioItem.text}
</p>
<div className="flex flex-wrap gap-1 text-xs">
<span className="px-2 py-1 bg-orange-100 text-orange-800 rounded-full">
{audioItem.settings.model}
</span>
<span className="px-2 py-1 bg-blue-100 text-blue-800 rounded-full">
{audioItem.settings.speed}x speed
</span>
<span className="px-2 py-1 bg-green-100 text-green-800 rounded-full">
{formatFileSize(audioItem.audio.size)}
</span>
<span className="px-2 py-1 bg-gray-100 text-gray-800 rounded-full">
~{formatDuration(audioItem.audio.duration_estimate)}
</span>
</div>
</div>
<div className="flex items-center space-x-2">
<button
onClick={() => playAudio(audioItem)}
className="p-2 bg-orange-500 hover:bg-orange-600 text-white rounded-lg transition-colors duration-200"
title={currentlyPlaying?.id === audioItem.id ? "Pause" : "Play"}
>
{currentlyPlaying?.id === audioItem.id && currentlyPlaying?.status === 'playing' ? (
<Pause className="w-4 h-4" />
) : (
<Play className="w-4 h-4" />
)}
</button>
<button
onClick={() => downloadAudio(audioItem)}
className="p-2 bg-green-500 hover:bg-green-600 text-white rounded-lg transition-colors duration-200"
title="Download audio"
>
<Download className="w-4 h-4" />
</button>
</div>
</div>
</div>
))}
</div>
)}
</div>
</div>
</div>
);
}
export default TextToSpeech;
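The two formatting helpers inside the component are pure functions, so you can sanity-check them outside React. Copied verbatim from TextToSpeech.jsx:

```javascript
// Human-readable file size (Bytes/KB/MB) from a raw byte count.
const formatFileSize = (bytes) => {
  if (bytes === 0) return '0 Bytes';
  const k = 1024;
  const sizes = ['Bytes', 'KB', 'MB'];
  const i = Math.floor(Math.log(bytes) / Math.log(k));
  return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
};

// mm:ss formatting for the duration estimate.
const formatDuration = (seconds) => {
  const mins = Math.floor(seconds / 60);
  const secs = Math.floor(seconds % 60);
  return `${mins}:${secs.toString().padStart(2, '0')}`;
};

console.log(formatFileSize(15420)); // β†’ "15.06 KB"
console.log(formatDuration(125));   // β†’ "2:05"
```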

Update your src/App.jsx to include the new text-to-speech component:

import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import FileAnalysis from "./FileAnalysis";
import TextToSpeech from "./TextToSpeech";
import { MessageSquare, Image, Mic, Folder, Volume2 } from "lucide-react";
function App() {
// 🧠 STATE: Navigation management
const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', 'audio', 'files', or 'speech'
// 🎨 UI: Main app with navigation
return (
<div className="min-h-screen bg-gray-100">
{/* Navigation Header */}
<nav className="bg-white shadow-sm border-b border-gray-200">
<div className="max-w-6xl mx-auto px-4">
<div className="flex items-center justify-between h-16">
{/* Logo */}
<div className="flex items-center space-x-3">
<div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
<span className="text-white font-bold text-sm">AI</span>
</div>
<h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
</div>
{/* Navigation Buttons */}
<div className="flex space-x-2">
<button
onClick={() => setCurrentView("chat")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "chat"
? "bg-blue-100 text-blue-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<MessageSquare className="w-4 h-4" />
<span>Chat</span>
</button>
<button
onClick={() => setCurrentView("images")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "images"
? "bg-purple-100 text-purple-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Image className="w-4 h-4" />
<span>Images</span>
</button>
<button
onClick={() => setCurrentView("audio")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "audio"
? "bg-blue-100 text-blue-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Mic className="w-4 h-4" />
<span>Audio</span>
</button>
<button
onClick={() => setCurrentView("files")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "files"
? "bg-green-100 text-green-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Folder className="w-4 h-4" />
<span>Files</span>
</button>
<button
onClick={() => setCurrentView("speech")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "speech"
? "bg-orange-100 text-orange-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Volume2 className="w-4 h-4" />
<span>Speech</span>
</button>
</div>
</div>
</div>
</nav>
{/* Main Content */}
<main className="h-[calc(100vh-4rem)]">
{currentView === "chat" && <StreamingChat />}
{currentView === "images" && <ImageGenerator />}
{currentView === "audio" && <AudioTranscription />}
{currentView === "files" && <FileAnalysis />}
{currentView === "speech" && <TextToSpeech />}
</main>
</div>
);
}
export default App;

Let’s test your text-to-speech feature step by step to make sure everything works correctly.

First, verify your backend route works by testing it directly:

Test with a simple text:

curl -X POST http://localhost:8000/api/tts/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello, this is a test of AI voice synthesis.", "voice": "alloy", "model": "tts-1"}'

Expected response:

{
"success": true,
"audio": {
"filename": "tts-1234567890.mp3",
"format": "mp3",
"size": 15420,
"duration_estimate": 3,
"download_url": "/api/tts/download/tts-1234567890.mp3"
},
"generation": {
"voice": "alloy",
"voice_info": {
"name": "Alloy",
"description": "Professional and versatile"
},
"model": "tts-1",
"speed": 1.0,
"text_length": 44
}
}
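If you want to assert on that response programmatically (for example in an integration test), a minimal shape check could look like this (the helper name is illustrative, not part of the app):

```javascript
// Returns true when an object matches the /api/tts/generate success shape above.
const looksLikeTtsResponse = (r) =>
  Boolean(
    r &&
    r.success === true &&
    r.audio &&
    typeof r.audio.filename === "string" &&
    r.audio.download_url === `/api/tts/download/${r.audio.filename}` &&
    r.generation &&
    typeof r.generation.voice === "string"
  );
```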

Start both servers:

Backend (in your backend folder):

npm run dev

Frontend (in your frontend folder):

npm run dev

Test the complete flow:

  1. Navigate to Speech β†’ Click the β€œSpeech” tab in navigation
  2. Select voice settings β†’ Choose voice, quality, speed, and format
  3. Enter text β†’ Type or select a sample text
  4. Generate speech β†’ Click β€œGenerate Speech” and see loading state
  5. Listen to audio β†’ Click play button to hear the generated voice
  6. Download audio β†’ Test downloading the speech file
  7. Try different voices β†’ Test all six AI voices with the same text

Test all six voices with the same text to hear their personalities:

πŸŽ™οΈ Alloy: Professional and neutral
🌊 Echo: Calm and soothing
πŸ“š Fable: Expressive storyteller
🎯 Onyx: Deep and authoritative
β˜€οΈ Nova: Warm and friendly
✨ Shimmer: Bright and energetic

Expected behavior:

  • Each voice has distinct personality and tone
  • Audio quality is clear and natural
  • Playback controls work smoothly
  • Download generates proper audio files

Congratulations! You’ve completed your comprehensive OpenAI mastery application with text-to-speech:

  • βœ… Extended your backend with voice synthesis and audio file management
  • βœ… Added React speech component following the same patterns as your other features
  • βœ… Implemented six AI voices with distinct personalities and use cases
  • βœ… Created flexible audio settings for quality, speed, and format control
  • βœ… Added playback functionality with play/pause controls
  • βœ… Maintained consistent design with your existing application

Your complete application now has:

  • Text chat with streaming responses
  • Image generation with DALL-E 3 and GPT-Image-1
  • Audio transcription with Whisper voice recognition
  • File analysis with intelligent document processing
  • Text-to-speech with six AI voice personalities
  • Unified navigation between all features
  • Professional UI with consistent TailwindCSS styling

πŸŽ‰ You’ve built a complete OpenAI mastery application! Your users can now chat with AI, generate images, transcribe audio, analyze files, and hear AI responses spoken aloud - all in one seamless experience.

Your application demonstrates mastery of OpenAI’s entire ecosystem and provides a solid foundation for building even more advanced AI-powered applications. πŸ”Š