🔊 Make Your AI Talk Back!

Your AI can chat, create images, understand audio, and analyze files. Now let’s give it a voice! 🎤

Imagine users asking “What’s the weather like?” and your AI speaking back in a warm, friendly voice instead of just showing text. Or reading long articles aloud while they work on other things!

What we’re building: Your AI will be able to speak any text in 6 different voice personalities - from professional business tones to energetic marketing voices. It’s like having a team of voice actors inside your app!


Current state: Your AI shows brilliant text responses
Target state: Users can hear your AI speak with natural voices!

Before (Silent AI):

User: "Explain quantum physics"
AI: [Shows long text explanation]
User: [Has to read everything] 😴

After (Speaking AI):

User: "Explain quantum physics"
AI: [Shows text AND speaks it] 🔊
User: [Can listen while doing other things] 🎧

The magic: Your AI becomes accessible, engaging, and multitask-friendly!

Real-world impact:

  • 📱 Accessibility heroes - Visually impaired users can fully enjoy your app
  • 🏃‍♀️ Multitasking magic - Users can listen while exercising, driving, or working
  • 🧠 Learning boost - Audio learners absorb information better when they hear it
  • 📚 Instant podcasts - Turn any article into audio content on demand
  • 🎯 Better engagement - Voice keeps users active instead of passive readers

Without voice AI:

❌ Hire expensive voice actors
❌ Use robotic computer voices
❌ Miss users who prefer listening over reading
❌ Limited to text-only experiences

With voice AI:

✅ Professional voices in seconds
✅ Natural, engaging speech
✅ Serve all learning styles
✅ Complete multimedia experience

OpenAI gives you a complete voice acting team! Each one has a distinct personality:

🎙️ Alloy - The Professional

Perfect for: Business presentations, formal content
Sounds like: Your trusted corporate spokesperson
User feels: Confident and professional

🌊 Echo - The Calm Companion

Perfect for: Meditation apps, soothing content
Sounds like: Your gentle yoga instructor
User feels: Relaxed and peaceful

📚 Fable - The Master Storyteller

Perfect for: Creative content, engaging stories
Sounds like: Your favorite audiobook narrator
User feels: Captivated and entertained

🎯 Onyx - The Authority

Perfect for: News, important announcements
Sounds like: Your trusted news anchor
User feels: Informed and confident

☀️ Nova - The Friendly Helper

Perfect for: Tutorials, customer support
Sounds like: Your helpful best friend
User feels: Welcome and supported

✨ Shimmer - The Energy Booster

Perfect for: Marketing, motivational content
Sounds like: Your enthusiastic coach
User feels: Excited and motivated

Pro tip: We’ll build a voice selector so users can choose their favorite!


🛠️ Step 1: Add Voice Power to Your Backend

Good news: We’re using the exact same patterns you already know!

What you already have:

// Your familiar Response API pattern
const response = await openai.responses.create({
model: "gpt-4o",
input: [systemPrompt, userMessage]
});

What we’re adding:

// New voice synthesis (same style!)
const speech = await openai.audio.speech.create({
model: "tts-1",
voice: "alloy",
input: textToSpeak
});

Perfect! Same patterns, just different endpoints.

Simple concept: Text goes in → Beautiful voice comes out!

// What we need to track:
const voiceState = {
textInput: "Hello, I'm your AI assistant!", // What to say
selectedVoice: "nova", // Who says it
audioSettings: { // How to say it
speed: 1.0, // Normal speed
quality: "hd", // High definition
format: "mp3" // Audio format
},
generatedAudio: "audio-file-url", // Result!
}

Voice options:

  • 🏃‍♂️ TTS-1 - Fast generation (great for testing)
  • 💎 TTS-1-HD - Premium quality (perfect for production)
  • ⚡ Speed control - From 0.25x (slow) to 4x (fast)
  • 🎵 Formats - MP3, Opus, AAC, FLAC
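
Before these options reach the API, it's worth clamping and validating them on the server. Here's a minimal normalizer sketch; the `normalizeTtsSettings` helper and its fallback defaults are illustrative assumptions, not part of any SDK:

```javascript
// Hypothetical helper: clamp and validate user-supplied TTS settings
// before passing them to the API. The fallback defaults are assumptions.
const TTS_MODELS = ["tts-1", "tts-1-hd"];
const TTS_FORMATS = ["mp3", "opus", "aac", "flac"];

function normalizeTtsSettings({ model, speed, format } = {}) {
  return {
    // Unknown models fall back to the fast tier
    model: TTS_MODELS.includes(model) ? model : "tts-1",
    // Speed is clamped into the documented 0.25x-4x range
    speed: Math.max(0.25, Math.min(4.0, Number(speed) || 1.0)),
    // Unknown formats fall back to mp3
    format: TTS_FORMATS.includes(format) ? format : "mp3",
  };
}

console.log(normalizeTtsSettings({ model: "tts-1-hd", speed: 9, format: "wav" }));
// → { model: 'tts-1-hd', speed: 4, format: 'mp3' }
```

Sanitizing once at the boundary means the generation endpoint never has to worry about out-of-range speeds or typo'd format names.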

Add this to your existing server - same patterns you know and love:

import fs from 'fs';
import path from 'path';
// 🔊 VOICE PROFILES: Available AI voices with personalities
const VOICE_PROFILES = {
alloy: {
name: "Alloy",
description: "Professional and versatile",
bestFor: "Business content, presentations"
},
echo: {
name: "Echo",
description: "Calm and soothing",
bestFor: "Meditation, relaxation content"
},
fable: {
name: "Fable",
description: "Expressive storyteller",
bestFor: "Stories, creative content"
},
onyx: {
name: "Onyx",
description: "Deep and authoritative",
bestFor: "News, formal announcements"
},
nova: {
name: "Nova",
description: "Warm and friendly",
bestFor: "Customer service, tutorials"
},
shimmer: {
name: "Shimmer",
description: "Bright and energetic",
bestFor: "Marketing, upbeat content"
}
};
// 🔧 HELPER FUNCTIONS: Audio processing utilities
const saveAudioToTemp = async (audioBuffer, format = 'mp3') => {
const tempDir = path.join(process.cwd(), "temp");
// Create temp directory if it doesn't exist
if (!fs.existsSync(tempDir)) {
fs.mkdirSync(tempDir, { recursive: true });
}
// Create unique filename
const filename = `tts-${Date.now()}.${format}`;
const filepath = path.join(tempDir, filename);
// Write audio file
fs.writeFileSync(filepath, audioBuffer);
// Auto-cleanup after 1 hour
setTimeout(() => {
try {
if (fs.existsSync(filepath)) {
fs.unlinkSync(filepath);
console.log(`🧹 Cleaned up: ${filename}`);
}
} catch (error) {
console.error("Error cleaning up audio file:", error);
}
}, 3600000); // 1 hour
return { filepath, filename };
};
// 🔊 AI Text-to-Speech endpoint - add this to your existing server
app.post("/api/tts/generate", async (req, res) => {
try {
// 🛡️ VALIDATION: Check required inputs
const {
text,
voice = "alloy",
model = "tts-1",
speed = 1.0,
format = "mp3"
} = req.body;
if (!text || text.trim() === "") {
return res.status(400).json({
error: "Text is required",
success: false
});
}
if (text.length > 4096) {
return res.status(400).json({
error: "Text too long. Maximum 4096 characters allowed.",
current_length: text.length,
success: false
});
}
console.log(`🔊 Generating speech: ${text.substring(0, 50)}... (${voice})`);
// 🎙️ AI SPEECH GENERATION: Convert text to speech
const response = await openai.audio.speech.create({
model: model, // tts-1 (fast) or tts-1-hd (high quality)
voice: voice, // AI voice personality
input: text.trim(), // Text to convert
response_format: format, // Audio format (mp3, opus, aac, flac)
speed: Math.max(0.25, Math.min(4.0, speed)) // Speaking speed (0.25x to 4x)
});
// 💾 AUDIO PROCESSING: Save audio file
const audioBuffer = Buffer.from(await response.arrayBuffer());
const { filepath, filename } = await saveAudioToTemp(audioBuffer, format);
// 📤 SUCCESS RESPONSE: Send audio info and download link
res.json({
success: true,
audio: {
filename: filename,
format: format,
size: audioBuffer.length,
duration_estimate: Math.ceil(text.length / 14), // ~14 characters per second
download_url: `/api/tts/download/${filename}`
},
generation: {
voice: voice,
voice_info: VOICE_PROFILES[voice],
model: model,
speed: speed,
text_length: text.length
},
timestamp: new Date().toISOString()
});
} catch (error) {
// 🚨 ERROR HANDLING: Handle TTS failures
console.error("Text-to-speech error:", error);
res.status(500).json({
error: "Failed to generate speech",
details: error.message,
success: false
});
}
});
// 📥 Audio Download endpoint - serve generated audio files
app.get("/api/tts/download/:filename", (req, res) => {
try {
const { filename } = req.params;
const filepath = path.join(process.cwd(), "temp", filename);
// Security check - ensure filename is safe
if (!filename.match(/^tts-\d+\.(mp3|opus|aac|flac)$/)) {
return res.status(400).json({ error: "Invalid filename" });
}
// Check if file exists
if (!fs.existsSync(filepath)) {
return res.status(404).json({ error: "Audio file not found or expired" });
}
// Serve audio file
const extension = path.extname(filename).substring(1);
res.setHeader('Content-Type', `audio/${extension}`);
res.setHeader('Content-Disposition', `attachment; filename="${filename}"`);
const audioBuffer = fs.readFileSync(filepath);
res.send(audioBuffer);
} catch (error) {
console.error("Audio download error:", error);
res.status(500).json({
error: "Failed to download audio",
message: error.message
});
}
});
// 🎙️ Voice Information endpoint - get available voices
app.get("/api/tts/voices", (req, res) => {
res.json({
success: true,
voices: VOICE_PROFILES,
models: [
{
id: "tts-1",
name: "TTS-1",
description: "Fast, cost-effective synthesis",
quality: "standard"
},
{
id: "tts-1-hd",
name: "TTS-1 HD",
description: "High-definition audio quality",
quality: "premium"
}
],
formats: ["mp3", "opus", "aac", "flac"],
speed_range: { min: 0.25, max: 4.0, default: 1.0 },
text_limit: 4096
});
});

What this does (step by step):

  1. ✅ Validates text - Makes sure we have something to say
  2. 🎭 Picks voice - Selects the right AI personality
  3. 🎙️ Generates speech - OpenAI creates beautiful audio
  4. 💾 Saves file - Stores audio temporarily for download
  5. 📤 Returns results - Sends back audio URL and metadata
  6. 🧹 Cleans up - Removes old files automatically

Same reliable patterns as your chat and image features!
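
The request half of that flow can be exercised from any Node script. This sketch only builds the payload the endpoint expects; the helper name and the localhost base URL are assumptions for local development:

```javascript
// Hypothetical helper: build the fetch arguments for the /api/tts/generate
// endpoint described above. The base URL is an assumption for local dev.
function buildTtsRequest(text, { voice = "alloy", model = "tts-1", speed = 1.0, format = "mp3" } = {}) {
  // Mirror the server-side validation so bad requests fail fast
  if (!text || !text.trim()) throw new Error("Text is required");
  if (text.length > 4096) throw new Error("Text too long (max 4096 characters)");
  return {
    url: "http://localhost:8000/api/tts/generate",
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: text.trim(), voice, model, speed, format }),
    },
  };
}

const req = buildTtsRequest("Hello from the docs!", { voice: "nova" });
console.log(JSON.parse(req.options.body).voice); // → nova
// Usage (with a running server): fetch(req.url, req.options)
```

Validating client-side as well as server-side gives users instant feedback without a round trip.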

Add this middleware to handle text-to-speech specific errors:

// 🚨 TTS ERROR HANDLING: Handle text-to-speech errors
app.use((error, req, res, next) => {
if (error.message && error.message.includes('Invalid voice')) {
return res.status(400).json({
error: "Invalid voice selected. Please choose from: alloy, echo, fable, onyx, nova, shimmer",
success: false
});
}
if (error.message && error.message.includes('text too long')) {
return res.status(400).json({
error: "Text exceeds maximum length of 4096 characters",
success: false
});
}
next(error);
});
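
Express recognizes error middleware by its four-parameter signature and only invokes it once an error has been raised, so it must be registered after your routes. This toy model (not Express itself, just a simplified dispatcher for intuition) shows the idea:

```javascript
// Toy model (NOT Express itself): handlers with four parameters are
// treated as error handlers and only run once an error exists; they run
// in registration order, which is why they're registered after routes.
function dispatch(stack, req, res) {
  let err = null;
  for (const fn of stack) {
    const isErrorHandler = fn.length === 4;
    if (err && isErrorHandler) {
      let passed = null;
      fn(err, req, res, (e) => { passed = e; }); // next(e) re-raises
      err = passed; // null means the handler responded
    } else if (!err && !isErrorHandler) {
      try { fn(req, res, () => {}); } catch (e) { err = e; }
    }
  }
  return res;
}

const stack = [
  () => { throw new Error("Invalid voice: robot"); }, // a failing route
  (error, req, res, next) => { // error middleware, like the one above
    if (error.message.includes("Invalid voice")) { res.status = 400; return; }
    next(error);
  },
];
console.log(dispatch(stack, {}, {}).status); // → 400
```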

Your backend now supports:

  • Text chat (existing functionality)
  • Streaming chat (existing functionality)
  • Image generation (existing functionality)
  • Audio transcription (existing functionality)
  • File analysis (existing functionality)
  • Text-to-speech (new functionality)

🔧 Step 3: Building the React Text-to-Speech Component

Now let’s create a React component for text-to-speech using the same patterns from your existing components.

Step 3A: Creating the Text-to-Speech Component

Create a new file src/TextToSpeech.jsx:

import { useState, useRef, useEffect } from "react";
import { Volume2, Play, Pause, Download, Settings } from "lucide-react";
function TextToSpeech() {
// 🧠 STATE: Text-to-speech data management
const [text, setText] = useState(""); // Text to convert
const [selectedVoice, setSelectedVoice] = useState("alloy"); // AI voice selection
const [audioSettings, setAudioSettings] = useState({ // TTS settings
model: "tts-1",
speed: 1.0,
format: "mp3"
});
const [isGenerating, setIsGenerating] = useState(false); // Processing status
const [generatedAudio, setGeneratedAudio] = useState([]); // Generated audio list
const [currentlyPlaying, setCurrentlyPlaying] = useState(null); // Audio playback state
const [voices, setVoices] = useState({}); // Available voices
const [error, setError] = useState(null); // Error messages
const audioRef = useRef(null);
// Load available voices on component mount
useEffect(() => {
fetchVoices();
}, []);
const fetchVoices = async () => {
try {
const response = await fetch("http://localhost:8000/api/tts/voices");
const data = await response.json();
if (data.success) {
setVoices(data.voices);
}
} catch (error) {
console.error('Failed to fetch voices:', error);
}
};
// 🔧 FUNCTIONS: Text-to-speech logic engine
// Main speech generation function
const generateSpeech = async () => {
// 🛡️ GUARDS: Prevent invalid generation
if (!text.trim() || isGenerating) return;
// 🔄 SETUP: Prepare for generation
setIsGenerating(true);
setError(null);
try {
// 📤 API CALL: Send to your backend
const response = await fetch("http://localhost:8000/api/tts/generate", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
text: text.trim(),
voice: selectedVoice,
...audioSettings
})
});
const data = await response.json();
if (!response.ok) {
throw new Error(data.error || 'Failed to generate speech');
}
// ✅ SUCCESS: Store generated audio
const newAudio = {
id: Date.now(),
text: text.trim(),
voice: selectedVoice,
settings: audioSettings,
audio: data.audio,
generation: data.generation,
timestamp: new Date().toISOString()
};
setGeneratedAudio(prev => [newAudio, ...prev]);
setText(""); // Clear input after successful generation
} catch (error) {
// 🚨 ERROR HANDLING: Show user-friendly message
console.error('Speech generation failed:', error);
setError(error.message || 'Something went wrong while generating speech');
} finally {
// 🧹 CLEANUP: Reset generation state
setIsGenerating(false);
}
};
// Audio playback function
const playAudio = async (audioItem) => {
try {
if (currentlyPlaying?.id === audioItem.id) {
// Pause current audio
if (audioRef.current) {
audioRef.current.pause();
setCurrentlyPlaying(null);
}
return;
}
// Stop any currently playing audio
if (audioRef.current) {
audioRef.current.pause();
}
// Create new audio element
const audio = new Audio(`http://localhost:8000${audioItem.audio.download_url}`);
audioRef.current = audio;
audio.onloadstart = () => setCurrentlyPlaying({ ...audioItem, status: 'loading' });
audio.oncanplay = () => setCurrentlyPlaying({ ...audioItem, status: 'ready' });
audio.onplay = () => setCurrentlyPlaying({ ...audioItem, status: 'playing' });
audio.onpause = () => setCurrentlyPlaying({ ...audioItem, status: 'paused' });
audio.onended = () => setCurrentlyPlaying(null);
audio.onerror = () => {
setCurrentlyPlaying(null);
setError('Failed to play audio');
};
await audio.play();
} catch (error) {
console.error('Audio playback error:', error);
setCurrentlyPlaying(null);
setError('Failed to play audio');
}
};
// Download audio function
const downloadAudio = (audioItem) => {
try {
const link = document.createElement('a');
link.href = `http://localhost:8000${audioItem.audio.download_url}`;
link.download = `speech-${audioItem.id}.${audioItem.audio.format}`;
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
} catch (error) {
console.error('Download error:', error);
setError('Failed to download audio');
}
};
// Sample texts for quick testing
const sampleTexts = [
"Welcome to our application! I'm excited to help you with AI-powered text-to-speech.",
"Once upon a time, in the world of artificial intelligence, voices came alive with just a few lines of code.",
"This is a test of the emergency broadcast system. This is only a test.",
"Take a deep breath and relax as you listen to this calming AI-generated voice.",
"Breaking news: AI technology continues to amaze us with natural-sounding speech synthesis."
];
// Utility functions
const formatFileSize = (bytes) => {
if (bytes === 0) return '0 Bytes';
const k = 1024;
const sizes = ['Bytes', 'KB', 'MB'];
const i = Math.floor(Math.log(bytes) / Math.log(k));
return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
};
const formatDuration = (seconds) => {
const mins = Math.floor(seconds / 60);
const secs = Math.floor(seconds % 60);
return `${mins}:${secs.toString().padStart(2, '0')}`;
};
// 🎨 UI: Interface components
return (
<div className="min-h-screen bg-gradient-to-br from-orange-50 to-red-50 flex items-center justify-center p-4">
<div className="bg-white rounded-2xl shadow-2xl w-full max-w-4xl flex flex-col overflow-hidden">
{/* Header */}
<div className="bg-gradient-to-r from-orange-600 to-red-600 text-white p-6">
<div className="flex items-center space-x-3">
<div className="w-10 h-10 bg-white bg-opacity-20 rounded-full flex items-center justify-center">
<Volume2 className="w-5 h-5" />
</div>
<div>
<h1 className="text-xl font-bold">🔊 AI Text-to-Speech</h1>
<p className="text-orange-100 text-sm">Convert any text to natural speech!</p>
</div>
</div>
</div>
{/* Voice Settings Section */}
<div className="p-6 border-b border-gray-200">
<h3 className="font-semibold text-gray-900 mb-4 flex items-center">
<Settings className="w-5 h-5 mr-2 text-orange-600" />
Voice Settings
</h3>
<div className="grid grid-cols-1 md:grid-cols-4 gap-4">
{/* Voice Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Voice</label>
<select
value={selectedVoice}
onChange={(e) => setSelectedVoice(e.target.value)}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
{Object.entries(voices).map(([key, voice]) => (
<option key={key} value={key}>
{voice.name} - {voice.description}
</option>
))}
</select>
</div>
{/* Model Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Quality</label>
<select
value={audioSettings.model}
onChange={(e) => setAudioSettings(prev => ({ ...prev, model: e.target.value }))}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
<option value="tts-1">Standard (Fast)</option>
<option value="tts-1-hd">HD (High Quality)</option>
</select>
</div>
{/* Speed Control */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">
Speed ({audioSettings.speed}x)
</label>
<input
type="range"
min="0.25"
max="4"
step="0.05"
value={audioSettings.speed}
onChange={(e) => setAudioSettings(prev => ({ ...prev, speed: parseFloat(e.target.value) }))}
disabled={isGenerating}
className="w-full h-2 bg-gray-200 rounded-lg appearance-none cursor-pointer disabled:cursor-not-allowed"
/>
</div>
{/* Format Selection */}
<div>
<label className="block text-sm font-medium text-gray-700 mb-2">Format</label>
<select
value={audioSettings.format}
onChange={(e) => setAudioSettings(prev => ({ ...prev, format: e.target.value }))}
disabled={isGenerating}
className="w-full px-3 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-orange-500 disabled:bg-gray-100"
>
<option value="mp3">MP3</option>
<option value="opus">Opus</option>
<option value="aac">AAC</option>
<option value="flac">FLAC</option>
</select>
</div>
</div>
</div>
{/* Text Input Section */}
<div className="p-6 border-b border-gray-200">
<div className="mb-4">
<div className="flex justify-between items-center mb-2">
<label className="block text-sm font-medium text-gray-700">Text to Convert</label>
<span className="text-sm text-gray-500">{text.length}/4096 characters</span>
</div>
<textarea
value={text}
onChange={(e) => setText(e.target.value)}
placeholder="Enter the text you want to convert to speech..."
className="w-full px-4 py-3 border border-gray-300 rounded-xl focus:outline-none focus:ring-2 focus:ring-orange-500 focus:border-transparent transition-all duration-200 resize-none"
rows={4}
maxLength={4096}
disabled={isGenerating}
/>
</div>
{/* Sample Texts */}
<div className="mb-4">
<p className="text-sm text-gray-600 mb-2">Quick samples:</p>
<div className="flex flex-wrap gap-2">
{sampleTexts.map((sample, index) => (
<button
key={index}
onClick={() => setText(sample)}
disabled={isGenerating}
className="px-3 py-1 text-sm bg-gray-100 hover:bg-orange-100 text-gray-700 hover:text-orange-700 rounded-full transition-colors duration-200 disabled:opacity-50 disabled:cursor-not-allowed"
>
{sample.substring(0, 30)}...
</button>
))}
</div>
</div>
{/* Generate Button */}
<div className="flex justify-center">
<button
onClick={generateSpeech}
disabled={isGenerating || !text.trim()}
className="px-8 py-3 bg-gradient-to-r from-orange-600 to-red-600 hover:from-orange-700 hover:to-red-700 disabled:from-gray-300 disabled:to-gray-300 text-white rounded-xl transition-all duration-200 flex items-center space-x-2 shadow-lg disabled:shadow-none"
>
{isGenerating ? (
<>
<div className="w-4 h-4 border-2 border-white border-t-transparent rounded-full animate-spin"></div>
<span>Generating...</span>
</>
) : (
<>
<Volume2 className="w-4 h-4" />
<span>Generate Speech</span>
</>
)}
</button>
</div>
</div>
{/* Results Section */}
<div className="flex-1 p-6">
{/* Error Display */}
{error && (
<div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-4">
<p className="text-red-700">
<strong>Error:</strong> {error}
</p>
</div>
)}
{/* Generated Audio List */}
{generatedAudio.length === 0 ? (
<div className="text-center py-12">
<div className="w-16 h-16 bg-orange-100 rounded-2xl flex items-center justify-center mx-auto mb-4">
<Volume2 className="w-8 h-8 text-orange-600" />
</div>
<h3 className="text-lg font-semibold text-gray-700 mb-2">
No Audio Generated Yet
</h3>
<p className="text-gray-600 max-w-md mx-auto">
Enter some text above and click "Generate Speech" to create your first AI voice.
</p>
</div>
) : (
<div className="space-y-4">
<h4 className="font-semibold text-gray-900 mb-4">
Generated Audio ({generatedAudio.length})
</h4>
{generatedAudio.map((audioItem) => (
<div key={audioItem.id} className="bg-gray-50 rounded-lg p-4 border border-gray-200">
<div className="flex items-start justify-between mb-3">
<div className="flex-1">
<div className="flex items-center space-x-2 mb-2">
<div className="p-1 bg-orange-100 rounded">
<Volume2 className="w-4 h-4 text-orange-600" />
</div>
<span className="font-medium text-gray-900 text-sm">
{voices[audioItem.voice]?.name || audioItem.voice}
</span>
<span className="text-xs text-gray-500">
{new Date(audioItem.timestamp).toLocaleTimeString()}
</span>
</div>
<p className="text-sm text-gray-700 mb-2 line-clamp-2">
{audioItem.text}
</p>
<div className="flex flex-wrap gap-1 text-xs">
<span className="px-2 py-1 bg-orange-100 text-orange-800 rounded-full">
{audioItem.settings.model}
</span>
<span className="px-2 py-1 bg-blue-100 text-blue-800 rounded-full">
{audioItem.settings.speed}x speed
</span>
<span className="px-2 py-1 bg-green-100 text-green-800 rounded-full">
{formatFileSize(audioItem.audio.size)}
</span>
<span className="px-2 py-1 bg-gray-100 text-gray-800 rounded-full">
~{formatDuration(audioItem.audio.duration_estimate)}
</span>
</div>
</div>
<div className="flex items-center space-x-2">
<button
onClick={() => playAudio(audioItem)}
className="p-2 bg-orange-500 hover:bg-orange-600 text-white rounded-lg transition-colors duration-200"
title={currentlyPlaying?.id === audioItem.id ? "Pause" : "Play"}
>
{currentlyPlaying?.id === audioItem.id && currentlyPlaying?.status === 'playing' ? (
<Pause className="w-4 h-4" />
) : (
<Play className="w-4 h-4" />
)}
</button>
<button
onClick={() => downloadAudio(audioItem)}
className="p-2 bg-green-500 hover:bg-green-600 text-white rounded-lg transition-colors duration-200"
title="Download audio"
>
<Download className="w-4 h-4" />
</button>
</div>
</div>
</div>
))}
</div>
)}
</div>
</div>
</div>
);
}
export default TextToSpeech;
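
The component’s two utility helpers are pure functions, so you can sanity-check them outside React. This standalone sketch copies them verbatim from the component:

```javascript
// Copies of the component's pure helpers, runnable outside React.
const formatFileSize = (bytes) => {
  if (bytes === 0) return '0 Bytes';
  const k = 1024;
  const sizes = ['Bytes', 'KB', 'MB'];
  const i = Math.floor(Math.log(bytes) / Math.log(k));
  return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
};

const formatDuration = (seconds) => {
  const mins = Math.floor(seconds / 60);
  const secs = Math.floor(seconds % 60);
  return `${mins}:${secs.toString().padStart(2, '0')}`;
};

console.log(formatFileSize(15420)); // → 15.06 KB
console.log(formatDuration(65));    // → 1:05
```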

Step 3B: Adding Text-to-Speech to Navigation

Update your src/App.jsx to include the new text-to-speech component:

import { useState } from "react";
import StreamingChat from "./StreamingChat";
import ImageGenerator from "./ImageGenerator";
import AudioTranscription from "./AudioTranscription";
import FileAnalysis from "./FileAnalysis";
import TextToSpeech from "./TextToSpeech";
import { MessageSquare, Image, Mic, Folder, Volume2 } from "lucide-react";
function App() {
// 🧠 STATE: Navigation management
const [currentView, setCurrentView] = useState("chat"); // 'chat', 'images', 'audio', 'files', or 'speech'
// 🎨 UI: Main app with navigation
return (
<div className="min-h-screen bg-gray-100">
{/* Navigation Header */}
<nav className="bg-white shadow-sm border-b border-gray-200">
<div className="max-w-6xl mx-auto px-4">
<div className="flex items-center justify-between h-16">
{/* Logo */}
<div className="flex items-center space-x-3">
<div className="w-8 h-8 bg-gradient-to-r from-blue-500 to-purple-600 rounded-lg flex items-center justify-center">
<span className="text-white font-bold text-sm">AI</span>
</div>
<h1 className="text-xl font-bold text-gray-900">OpenAI Mastery</h1>
</div>
{/* Navigation Buttons */}
<div className="flex space-x-2">
<button
onClick={() => setCurrentView("chat")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "chat"
? "bg-blue-100 text-blue-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<MessageSquare className="w-4 h-4" />
<span>Chat</span>
</button>
<button
onClick={() => setCurrentView("images")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "images"
? "bg-purple-100 text-purple-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Image className="w-4 h-4" />
<span>Images</span>
</button>
<button
onClick={() => setCurrentView("audio")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "audio"
? "bg-blue-100 text-blue-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Mic className="w-4 h-4" />
<span>Audio</span>
</button>
<button
onClick={() => setCurrentView("files")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "files"
? "bg-green-100 text-green-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Folder className="w-4 h-4" />
<span>Files</span>
</button>
<button
onClick={() => setCurrentView("speech")}
className={`px-4 py-2 rounded-lg flex items-center space-x-2 transition-all duration-200 ${
currentView === "speech"
? "bg-orange-100 text-orange-700 shadow-sm"
: "text-gray-600 hover:text-gray-900 hover:bg-gray-100"
}`}
>
<Volume2 className="w-4 h-4" />
<span>Speech</span>
</button>
</div>
</div>
</div>
</nav>
{/* Main Content */}
<main className="h-[calc(100vh-4rem)]">
{currentView === "chat" && <StreamingChat />}
{currentView === "images" && <ImageGenerator />}
{currentView === "audio" && <AudioTranscription />}
{currentView === "files" && <FileAnalysis />}
{currentView === "speech" && <TextToSpeech />}
</main>
</div>
);
}
export default App;

Let’s test your text-to-speech feature step by step to make sure everything works correctly.

First, verify your backend route works by testing it directly:

Test with a simple text:

curl -X POST http://localhost:8000/api/tts/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello, this is a test of AI voice synthesis.", "voice": "alloy", "model": "tts-1"}'

Expected response:

{
"success": true,
"audio": {
"filename": "tts-1234567890.mp3",
"format": "mp3",
"size": 15420,
"duration_estimate": 3,
"download_url": "/api/tts/download/tts-1234567890.mp3"
},
"generation": {
"voice": "alloy",
"voice_info": {
"name": "Alloy",
"description": "Professional and versatile"
},
"model": "tts-1",
"speed": 1.0,
"text_length": 44
}
}

Start both servers:

Backend (in your backend folder):

npm run dev

Frontend (in your frontend folder):

npm run dev

Test the complete flow:

  1. Navigate to Speech → Click the “Speech” tab in navigation
  2. Select voice settings → Choose voice, quality, speed, and format
  3. Enter text → Type or select a sample text
  4. Generate speech → Click “Generate Speech” and see loading state
  5. Listen to audio → Click play button to hear the generated voice
  6. Download audio → Test downloading the speech file
  7. Try different voices → Test all six AI voices with the same text

Test all six voices with the same text to hear their personalities:

🎙️ Alloy: Professional and neutral
🌊 Echo: Calm and soothing
📚 Fable: Expressive storyteller
🎯 Onyx: Deep and authoritative
☀️ Nova: Warm and friendly
✨ Shimmer: Bright and energetic
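
Rather than clicking through the UI six times, you can build all six request payloads in a loop. The voice list below is copied from the server’s VOICE_PROFILES; the sample sentence is arbitrary:

```javascript
// Build one request body per voice so the same sentence can be generated
// with every personality. Actually sending them requires a running server.
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];
const sentence = "The quick brown fox jumps over the lazy dog.";

const payloads = VOICES.map((voice) => ({
  text: sentence,
  voice,
  model: "tts-1", // the fast tier is fine for comparisons
  format: "mp3",
}));

console.log(payloads.length); // → 6
// Usage (with a running server):
// for (const body of payloads) {
//   await fetch("http://localhost:8000/api/tts/generate", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(body),
//   });
// }
```

Listening to the six results back-to-back is the fastest way to hear the personality differences described above.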

Expected behavior:

  • Each voice has distinct personality and tone
  • Audio quality is clear and natural
  • Playback controls work smoothly
  • Download generates proper audio files

Congratulations! You’ve completed your comprehensive OpenAI mastery application with text-to-speech:

  • Extended your backend with voice synthesis and audio file management
  • Added React speech component following the same patterns as your other features
  • Implemented six AI voices with distinct personalities and use cases
  • Created flexible audio settings for quality, speed, and format control
  • Added playback functionality with play/pause controls
  • Maintained consistent design with your existing application

Your complete application now has:

  • Text chat with streaming responses
  • Image generation with DALL-E 3 and GPT-Image-1
  • Audio transcription with Whisper voice recognition
  • File analysis with intelligent document processing
  • Text-to-speech with six AI voice personalities
  • Unified navigation between all features
  • Professional UI with consistent TailwindCSS styling

🎉 You’ve built a complete OpenAI mastery application! Your users can now chat with AI, generate images, transcribe audio, analyze files, and hear AI responses spoken aloud - all in one seamless experience.

Your application demonstrates mastery of OpenAI’s entire ecosystem and provides a solid foundation for building even more advanced AI-powered applications. 🔊