ElevenLabs API Research

Assigned to: Backend Developer
Duration: 20 hours (2.5 working days)

SDK Repository: https://github.com/RageAgainstThePixel/ElevenLabs-DotNet


Motivation & Goal

Why This Research?

MicDots transforms QR codes into voice experiences. To build the MVP, we need to understand ElevenLabs text-to-speech capabilities and make critical technical decisions:

  • Which model should we use? (Turbo v2.5 vs Flash v2.5)
  • How do custom voices work? (Client-provided voice IDs OR default voices)
  • What are the costs and performance? (Real metrics for budget planning)
  • How do we implement it? (SDK integration, error handling, data structure)

Voice Selection:

We'll test with 3 voices:

  • If client provides voice IDs: Use their 3 voice IDs
  • If client doesn't provide: Use defaults (1 male, 1 female, 1 male British)

What We'll Deliver:

After 20 hours of comprehensive research, the client will have:

  • ✅ Audio samples from both models with 3 test voices
  • ✅ Technical metrics (cost, speed, file size, quality)
  • ✅ Detailed performance analysis and edge case testing
  • ✅ Developer recommendation with justification
  • ✅ Everything needed to make an informed model selection decision

Decision Flow:

  1. Developer tests both models, provides data and audio samples
  2. Client listens, evaluates quality, reviews costs
  3. Client decides which model to use based on their priorities
  4. Developer implements the chosen solution

This research removes guesswork and enables confident, data-driven decisions for the MicDots MVP.


Task Breakdown & Time Estimation

| ID | Task | Description | Time Estimation | Day |
| --- | --- | --- | --- | --- |
| MD-P1-REL-01 | Setup & Basic TTS | SDK installation, authentication, basic text-to-speech working | 2 hours | Day 1 |
| MD-P1-REL-02 | Model Comparison | Test both models with 3 samples, comprehensive quality analysis, performance metrics | 6 hours | Day 1-2 |
| MD-P1-REL-03 | Voice Customization | Test with client's 3 voice IDs, validate they work, edge-case testing | 4 hours | Day 2 |
| MD-P1-REL-04 | Code Samples | Working SDK integration examples with error handling | 3 hours | Day 2 |
| MD-P1-REL-05 | Client Review Prep | Organize audio samples, prepare comparison materials | 2 hours | Day 3 |
| MD-P1-REL-06 | Documentation | Complete deliverable template with findings | 3 hours | Day 3 |

Total Time Estimation: 20 hours (Day 1: 8h, Day 2: 7h, Day 3: 5h)

Focus Areas:

  • Comprehensive Testing: Thorough model comparison with quality analysis
  • Performance Metrics: Real-world response times and scalability testing
  • Edge Cases: Test error scenarios and special character handling
  • Client Decision: Provide complete materials for informed model selection
  • Production-Ready: Code samples with proper error handling and best practices

Pre-Requirements (What We Need from Client)

📋 View Detailed Pre-Requirements →

Before starting this research, confirm the following checklist:

Required from Client:

  • ElevenLabs Account Information (email, tier, usage limits)
  • API Credentials (API Key with appropriate permissions)
  • Use Case Specifications (target audience, voice characteristics)

Optional but Helpful:

  • Voice IDs for Testing (3 voice IDs, mix of male/female)
    • If client provides voice IDs, we'll test with those
    • If NOT provided, we'll use defaults: 1 male, 1 female, 1 male British
    • Future requirement: we will eventually need 8 voices
  • Budget Constraints (max cost per generation, monthly budget)
  • Quality Expectations (audio quality standards, clarity requirements)

Audio Format (Pre-defined):

  • MP3, 128 kbps, Mono - Optimized for voice speech and slow internet

Before Starting Day 1:

  • API key is active and tested
  • Voice IDs are valid and accessible with the provided API key
  • Account has sufficient credits/quota for testing (~18 samples)
  • Client is available for questions during the research period

Research Tasks

1. Setup & Basic TTS

Duration: 2 hours (Day 1)

Objective: Install SDK, configure environment, and get basic text-to-speech working

SDK Repository: https://github.com/RageAgainstThePixel/ElevenLabs-DotNet

Installation:

dotnet add package ElevenLabs-DotNet

Audio Format Requirements:

  • Format: MP3 only
  • Bitrate: 128 kbps (optimized for voice speech)
  • Channels: Mono (single channel)
  • Purpose: Good quality for slow internet, optimized for voice/QR code use case

SDK Initialization Example:

using ElevenLabs;
using ElevenLabs.TextToSpeech;

// Initialize the SDK client
var api = new ElevenLabsClient("your-api-key-here");

// Verify API connection by listing available voices
var voices = await api.VoicesEndpoint.GetAllVoicesAsync();
Console.WriteLine($"API Connected! Found {voices.Count} voices");

Basic Text-to-Speech Example:

using ElevenLabs;
using ElevenLabs.TextToSpeech;

var api = new ElevenLabsClient("your-api-key-here");

// Voice settings tune delivery (stability, similarity); the audio format itself is set via outputFormat below
var voiceSettings = new VoiceSettings(
    stability: 0.5f,
    similarityBoost: 0.75f
);

// Test text
string text = "Welcome to MicDots! Scan the QR code to hear your personalized message.";

// Use default voice for initial test
string voiceId = "21m00Tcm4TlvDq8ikWAM"; // Rachel (ElevenLabs default voice)

// Generate speech (uses mp3_44100_128 by default)
var audioClip = await api.TextToSpeechEndpoint.TextToSpeechAsync(
    text: text,
    voiceId: voiceId,
    voiceSettings: voiceSettings,
    outputFormat: OutputFormat.MP3_44100_128 // MP3, 44.1kHz, 128 kbps, mono
);

// Save to file
await File.WriteAllBytesAsync("test_output.mp3", audioClip.ClipData.ToArray());
Console.WriteLine("Audio generated successfully!");

Audio Format Configuration:

The SDK supports these output formats. For MicDots, use MP3_44100_128:

// Available output formats
OutputFormat.MP3_44100_128 // ← Use this for MicDots (MP3, 44.1kHz, 128 kbps, mono)
OutputFormat.MP3_44100_192 // Higher quality MP3
OutputFormat.PCM_16000 // Raw PCM audio
OutputFormat.PCM_22050 // Raw PCM audio
OutputFormat.PCM_24000 // Raw PCM audio
OutputFormat.PCM_44100 // Raw PCM audio

Tasks:

  • Install ElevenLabs-DotNet SDK via NuGet
  • Initialize SDK client with API key
  • Test API connection by listing voices
  • Configure SDK to output MP3_44100_128 format
  • Generate first audio sample using example code above
  • Verify audio file is playable and format is correct (MP3, 128 kbps, mono)
  • Verify file size is reasonable for mobile/slow connections
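The format and size checks above can be partially automated. A minimal sketch (the helper is an illustrative assumption, not SDK functionality): an MP3 file normally begins with an "ID3" tag or an MPEG frame-sync byte, so a quick header check catches saves that silently wrote an error payload instead of audio.

```csharp
using System;
using System.IO;

// Quick header check: an MP3 usually starts with "ID3" or a frame sync
// (0xFF followed by a byte whose top three bits are set). Illustrative helper, not SDK code.
static bool LooksLikeMp3(byte[] data)
{
    if (data.Length < 3) return false;
    bool hasId3Tag = data[0] == (byte)'I' && data[1] == (byte)'D' && data[2] == (byte)'3';
    bool hasFrameSync = data[0] == 0xFF && (data[1] & 0xE0) == 0xE0;
    return hasId3Tag || hasFrameSync;
}

if (File.Exists("test_output.mp3"))
{
    var bytes = File.ReadAllBytes("test_output.mp3");
    Console.WriteLine($"Header looks like MP3: {LooksLikeMp3(bytes)}");
    Console.WriteLine($"Size: {bytes.Length / 1024} KB"); // eyeball against the slow-connection budget
}
```

Verifying bitrate, channel count, and playback still needs an audio tool or listening by ear; the header check only rules out obviously broken files.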

Success Criteria:

  • SDK installed and working
  • API connection active and can list voices
  • Basic TTS conversion successful
  • Audio format verified: MP3, 128 kbps, mono
  • Developer understands how to use SDK and configure audio format

2. Model Comparison

Duration: 6 hours (Day 1-2)

Objective: Test both models (Turbo v2.5 and Flash v2.5) with all 3 sample texts, run comprehensive performance tests, and provide the materials the client needs to select a model

Documentation: https://elevenlabs.io/docs/models

Models to Focus On:

  • Turbo v2.5 - Fast, optimized for English (Model ID: eleven_turbo_v2_5)
  • Eleven Flash v2.5 - Ultra-fast, low latency, English optimized (Model ID: eleven_flash_v2_5)

How to Specify Models in SDK:

The ElevenLabs SDK uses the Model enum or string identifiers to specify which model to use:

using ElevenLabs;
using ElevenLabs.TextToSpeech;

var api = new ElevenLabsClient("your-api-key-here");

// Method 1: Using Model enum (recommended)
var turboModel = Model.ElevenTurboV2_5;
var flashModel = Model.ElevenFlashV2_5;

// Method 2: Using string identifiers
string turboModelId = "eleven_turbo_v2_5";
string flashModelId = "eleven_flash_v2_5";

Complete Model Comparison Example:

using ElevenLabs;
using ElevenLabs.TextToSpeech;
using System.Diagnostics;

var api = new ElevenLabsClient("your-api-key-here");

string text = "Welcome to MicDots! Scan the QR code to hear your personalized message.";
string voiceId = "21m00Tcm4TlvDq8ikWAM"; // Rachel

var voiceSettings = new VoiceSettings(stability: 0.5f, similarityBoost: 0.75f);

// Test Turbo v2.5
var stopwatch = Stopwatch.StartNew();
var turboAudio = await api.TextToSpeechEndpoint.TextToSpeechAsync(
    text: text,
    voiceId: voiceId,
    model: Model.ElevenTurboV2_5, // ← Specify Turbo v2.5
    voiceSettings: voiceSettings,
    outputFormat: OutputFormat.MP3_44100_128
);
stopwatch.Stop();

await File.WriteAllBytesAsync("short_turbo.mp3", turboAudio.ClipData.ToArray());
Console.WriteLine($"Turbo v2.5: {stopwatch.ElapsedMilliseconds}ms, Size: {turboAudio.ClipData.Length / 1024}KB");

// Test Eleven Flash v2.5
stopwatch.Restart();
var flashAudio = await api.TextToSpeechEndpoint.TextToSpeechAsync(
    text: text,
    voiceId: voiceId,
    model: Model.ElevenFlashV2_5, // ← Specify Flash v2.5
    voiceSettings: voiceSettings,
    outputFormat: OutputFormat.MP3_44100_128
);
stopwatch.Stop();

await File.WriteAllBytesAsync("short_flash.mp3", flashAudio.ClipData.ToArray());
Console.WriteLine($"Flash v2.5: {stopwatch.ElapsedMilliseconds}ms, Size: {flashAudio.ClipData.Length / 1024}KB");

Model Identifiers Reference:

| Model Name | SDK Enum | String Identifier |
| --- | --- | --- |
| Turbo v2.5 | Model.ElevenTurboV2_5 | "eleven_turbo_v2_5" |
| Flash v2.5 | Model.ElevenFlashV2_5 | "eleven_flash_v2_5" |
| Multilingual v2 | Model.ElevenMultilingualV2 | "eleven_multilingual_v2" |
| Monolingual v1 | Model.ElevenMonolingualV1 | "eleven_monolingual_v1" |

Required Test Examples:

  1. Short promotional text (75 chars):

    • "Welcome to MicDots! Scan the QR code to hear your personalized message."
  2. Medium product description (240 chars):

    • "This limited edition product features premium materials and cutting-edge technology. Designed for professionals who demand excellence, it combines durability with elegant aesthetics. Perfect for both everyday use and special occasions."
  3. Long narrative text (600 chars):

    • "In today's fast-paced digital world, communication has evolved beyond traditional text and images. QR codes have become ubiquitous, appearing on everything from restaurant menus to museum exhibits. But what if these codes could speak? What if instead of reading static information, users could simply scan and listen? That's the vision behind our platform - transforming silent QR codes into interactive audio experiences. Whether you're a business owner looking to engage customers, an educator creating accessible content, or a marketer crafting memorable campaigns, voice-enabled QR codes open up new possibilities for connection and engagement."

Tasks:

  • Test Turbo v2.5 with all 3 samples
  • Test Flash v2.5 with all 3 samples
  • Generate and save all 6 audio files (2 models × 3 examples)
  • Label clearly: short_turbo.mp3, short_flash.mp3, medium_turbo.mp3, etc.
  • Organize in folders by model for easy comparison
  • Verify audio format is MP3, 128 kbps, mono
  • Record all metrics for client decision

Metrics to Record:

  • File size for each sample
  • Generation time (API response time)
  • Audio duration
  • Cost per sample
  • API rate limits or constraints observed
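Recording each run in a fixed CSV shape keeps the comparison table easy to assemble later. A sketch under stated assumptions: ElevenLabs bills per character, and Turbo v2.5 and Flash v2.5 are commonly listed at 0.5 credits per character — verify the rate on the current pricing page before quoting costs. The helper name and the placeholder measurements below are illustrative.

```csharp
using System;

// Assumed rate: 0.5 credits per character for Turbo/Flash v2.5 — confirm on the pricing page.
static double EstimateCredits(string text, double creditsPerChar = 0.5)
    => text.Length * creditsPerChar;

string shortText = "Welcome to MicDots! Scan the QR code to hear your personalized message.";

// One CSV row per sample; generationMs and fileSizeKB here are placeholders,
// to be replaced with the Stopwatch timing and ClipData length from the real runs.
Console.WriteLine("model,voice,text,generationMs,fileSizeKB,estCredits");
Console.WriteLine(string.Join(",", "turbo", "rachel", "short", 850, 55, EstimateCredits(shortText)));
```

Dollar cost then follows from the subscription tier's credits-per-dollar rate, which the client's account information should confirm.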

Materials to Provide Client:

  • All 6 audio samples organized by model
  • Technical metrics comparison table
  • Cost comparison
  • Speed/performance comparison
  • Developer observations on quality and reliability

Success Criteria:

  • All 6 audio samples generated
  • All metrics recorded
  • Files organized for client review
  • Client has everything needed to make decision

3. Voice Customization

Duration: 4 hours (Day 2)

Objective: Test with client's 3 voice IDs, validate they work with both models, and test edge cases (special characters, long text, API failures)

What is Voice Customization?

Voice customization involves testing client-provided voice IDs and validating they work correctly:

  1. Validate voice IDs: Verify voice IDs are recognized by ElevenLabs API
  2. Test accessibility: Confirm voices are accessible with client's API key
  3. Generate samples: Test voice IDs with both models
  4. Compare quality: Let client hear their chosen voices with both models
  5. Define data structure: Determine what voice data to store in database

How to Validate Voice IDs:

using ElevenLabs;
using ElevenLabs.Voices;

var api = new ElevenLabsClient("your-api-key-here");

// Method 1: Get specific voice by ID
string voiceId = "21m00Tcm4TlvDq8ikWAM"; // Rachel

try
{
    var voice = await api.VoicesEndpoint.GetVoiceAsync(voiceId);

    Console.WriteLine($"✓ Voice ID Valid: {voice.VoiceId}");
    Console.WriteLine($"  Name: {voice.Name}");
    Console.WriteLine($"  Category: {voice.Category}");
    Console.WriteLine($"  Labels: {string.Join(", ", voice.Labels.Select(l => $"{l.Key}={l.Value}"))}");

    // Extract gender, accent, age from labels
    var gender = voice.Labels.ContainsKey("gender") ? voice.Labels["gender"] : "unknown";
    var accent = voice.Labels.ContainsKey("accent") ? voice.Labels["accent"] : "unknown";
    var age = voice.Labels.ContainsKey("age") ? voice.Labels["age"] : "unknown";

    Console.WriteLine($"  Gender: {gender}, Accent: {accent}, Age: {age}");
}
catch (Exception ex)
{
    Console.WriteLine($"✗ Voice ID Invalid: {voiceId}");
    Console.WriteLine($"  Error: {ex.Message}");
}

// Method 2: List all available voices
var allVoices = await api.VoicesEndpoint.GetAllVoicesAsync();

Console.WriteLine($"\nTotal voices available: {allVoices.Count}");

foreach (var v in allVoices.Take(5)) // Show first 5
{
    Console.WriteLine($"- {v.Name} (ID: {v.VoiceId})");
}

How to Test Multiple Voices with Both Models:

using ElevenLabs;
using ElevenLabs.TextToSpeech;

var api = new ElevenLabsClient("your-api-key-here");

// Client-provided voice IDs (or use defaults)
var voiceIds = new Dictionary<string, string>
{
    { "voice1", "21m00Tcm4TlvDq8ikWAM" }, // Rachel (Female, American)
    { "voice2", "VR6AewLTigWG4xSOukaG" }, // Arnold (Male, American)
    { "voice3", "TX3LPaxmHKxFdv7VOQHJ" }  // Liam (Male, British)
};

var models = new[]
{
    ("turbo", Model.ElevenTurboV2_5),
    ("flash", Model.ElevenFlashV2_5)
};

var testTexts = new Dictionary<string, string>
{
    { "short", "Welcome to MicDots! Scan the QR code to hear your personalized message." },
    { "medium", "This limited edition product features premium materials and cutting-edge technology." },
    { "long", "In today's fast-paced digital world, communication has evolved beyond traditional text..." }
};

// Generate all combinations: 3 voices × 2 models × 3 texts = 18 audio files
foreach (var (voiceName, voiceId) in voiceIds)
{
    // Validate voice first
    try
    {
        var voice = await api.VoicesEndpoint.GetVoiceAsync(voiceId);
        Console.WriteLine($"Testing voice: {voice.Name} ({voiceId})");

        foreach (var (modelName, model) in models)
        {
            foreach (var (textName, text) in testTexts)
            {
                try
                {
                    var stopwatch = System.Diagnostics.Stopwatch.StartNew();

                    var audio = await api.TextToSpeechEndpoint.TextToSpeechAsync(
                        text: text,
                        voiceId: voiceId,
                        model: model,
                        outputFormat: OutputFormat.MP3_44100_128
                    );

                    stopwatch.Stop();

                    // Save with clear naming: short_turbo_rachel.mp3
                    string filename = $"{textName}_{modelName}_{voice.Name.ToLower()}.mp3";
                    await File.WriteAllBytesAsync(filename, audio.ClipData.ToArray());

                    Console.WriteLine($"  ✓ {filename} - {stopwatch.ElapsedMilliseconds}ms, {audio.ClipData.Length / 1024}KB");
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"  ✗ Failed: {textName}_{modelName}_{voiceName} - {ex.Message}");
                }
            }
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($"✗ Voice validation failed: {voiceId} - {ex.Message}");
    }
}

Default Voice IDs (if client doesn't provide):

| Voice Name | Voice ID | Gender | Accent | Use Case |
| --- | --- | --- | --- | --- |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Female | American | Clear, professional |
| Arnold | VR6AewLTigWG4xSOukaG | Male | American | Deep, authoritative |
| Liam | TX3LPaxmHKxFdv7VOQHJ | Male | British | Sophisticated, accent variety |

Tasks:

  • Validate all 3 client-provided voice IDs using code above
  • Use SDK to fetch voice details (name, gender, accent, labels)
  • Test each voice ID with both models (Turbo v2.5 and Flash v2.5)
  • Generate all audio files (3 voices × 2 models × 3 samples = 18 total)
  • Save audio files with clear naming convention: {length}_{model}_{voicename}.mp3
  • Verify audio format: MP3, 128 kbps, mono
  • Record metrics for each voice/model combination (time, size, cost)
  • Test error scenarios (invalid voice ID, permission issues)
  • Determine minimum required data to store in database

Metrics to Record:

  • File size for each voice/model combination
  • Generation time per voice
  • Audio duration
  • Cost per voice/model combination
  • Validation results (which voices work, which don't)

Voice Gallery Data Structure to Test:

Test which data structure works best:

Option 1: Voice ID Only (Minimal)

{
  "voiceId": "21m00Tcm4TlvDq8ikWAM"
}

Option 2: Full Voice Object (Recommended)

{
  "voiceId": "21m00Tcm4TlvDq8ikWAM",
  "name": "Rachel",
  "language": "en",
  "gender": "female",
  "accent": "American"
}
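If Option 2 is adopted, the stored object round-trips cleanly through System.Text.Json. A minimal sketch — the anonymous shape mirrors the JSON above and is an illustration, not a final schema:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Option 2 payload as it might be persisted per voice (field names mirror the JSON above).
var voiceRecord = new
{
    voiceId = "21m00Tcm4TlvDq8ikWAM",
    name = "Rachel",
    language = "en",
    gender = "female",
    accent = "American"
};

string json = JsonSerializer.Serialize(voiceRecord, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(json);

// Round-trip into a dictionary to confirm the keys survive storage as-is.
var restored = JsonSerializer.Deserialize<Dictionary<string, string>>(json);
Console.WriteLine($"Restored voiceId: {restored?["voiceId"]}");
```

In practice the gender and accent values can be filled from the Labels dictionary returned by GetVoiceAsync rather than typed by hand.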

Success Criteria:

  • All 18 audio samples generated (3 voices × 2 models × 3 samples)
  • All voice IDs validated and working
  • Files organized by voice and model
  • All metrics recorded
  • Recommended data structure defined

4. Code Samples

Duration: 3 hours (Day 2)

Objective: Create working SDK integration examples in C# with comprehensive error handling

Required Code Samples:

  1. Basic Text-to-Speech Conversion

    • Initialize SDK client
    • Convert text to speech
    • Save audio file locally
    • Configure audio format (MP3, 128 kbps, mono)
  2. Voice Validation

    • Fetch voice details by voice ID
    • Validate voice exists and is accessible
    • Test audio generation with voice ID
    • Handle validation errors
  3. Error Handling

    • Handle API errors gracefully
    • Retry logic for failed requests
    • Timeout handling
    • Invalid voice ID handling
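For the retry requirement above, a small generic wrapper keeps the TTS call sites clean. A sketch with exponential backoff — WithRetryAsync is an illustrative helper, not part of the ElevenLabs-DotNet SDK, and the simulated flaky operation stands in for a real TextToSpeechAsync call:

```csharp
using System;
using System.Threading.Tasks;

// Generic retry helper with exponential backoff (illustrative, not part of the SDK).
// Retries any async operation up to maxAttempts times, doubling the delay each time.
static async Task<T> WithRetryAsync<T>(Func<Task<T>> operation, int maxAttempts = 3, int baseDelayMs = 500)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (Exception ex) when (attempt < maxAttempts)
        {
            int delay = baseDelayMs * (1 << (attempt - 1)); // 500ms, 1000ms, 2000ms, ...
            Console.WriteLine($"Attempt {attempt} failed ({ex.Message}); retrying in {delay}ms");
            await Task.Delay(delay);
        }
    }
}

// Usage sketch: wrap a TTS call (simulated here by an operation that fails twice, then succeeds).
int calls = 0;
var result = await WithRetryAsync(async () =>
{
    calls++;
    if (calls < 3) throw new InvalidOperationException("transient error");
    await Task.Delay(1); // stand-in for api.TextToSpeechEndpoint.TextToSpeechAsync(...)
    return "audio-bytes";
});
Console.WriteLine($"Succeeded after {calls} attempts: {result}");
```

In production, catching only transient failures (timeouts, 429/5xx responses) rather than every Exception avoids pointlessly retrying permanently invalid requests such as a bad voice ID.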

Tasks:

  • Write clean, documented C# code
  • Use ElevenLabs-DotNet SDK best practices
  • Include error handling in all samples
  • Test all code samples
  • Add comments explaining each step

Success Criteria:

  • All code samples working and tested
  • Code is clean and well-documented
  • Samples cover all required scenarios

5. Client Review Prep

Duration: 2 hours (Day 3)

Objective: Organize audio samples and prepare final deliverable for client review

Tasks:

  • Organize all audio files in clear folder structure
    • /turbo-v2.5/ - All Turbo v2.5 samples
    • /eleven-flash/ - All Eleven Flash samples
  • Verify file naming conventions are clear
    • Format: {length}_{model}_{voicename}.mp3
  • Create sample organization documentation
  • Verify all audio files are MP3, 128 kbps, mono
  • Test that all audio files play correctly
  • Prepare evaluation templates for client
  • Final quality check

Folder Structure Example:

/elevenlabs-research-samples/
├── turbo-v2.5/
│   ├── short_turbo_voice1.mp3
│   ├── short_turbo_voice2.mp3
│   ├── short_turbo_voice3.mp3
│   ├── medium_turbo_voice1.mp3
│   ├── medium_turbo_voice2.mp3
│   ├── medium_turbo_voice3.mp3
│   ├── long_turbo_voice1.mp3
│   ├── long_turbo_voice2.mp3
│   └── long_turbo_voice3.mp3
└── eleven-flash/
    ├── short_flash_voice1.mp3
    ├── short_flash_voice2.mp3
    ├── short_flash_voice3.mp3
    ├── medium_flash_voice1.mp3
    ├── medium_flash_voice2.mp3
    ├── medium_flash_voice3.mp3
    ├── long_flash_voice1.mp3
    ├── long_flash_voice2.mp3
    └── long_flash_voice3.mp3
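Sorting the generated files into this structure can be scripted from the naming convention. A sketch — FolderForSample is an illustrative helper, and the File.Move line is left commented until the samples actually exist:

```csharp
using System;
using System.IO;

// Maps a sample filename ({length}_{model}_{voicename}.mp3) to its review folder.
// Folder names follow the structure above; the helper itself is an illustrative assumption.
static string FolderForSample(string filename)
{
    var parts = Path.GetFileNameWithoutExtension(filename).Split('_');
    if (parts.Length < 3) throw new ArgumentException($"Unexpected name: {filename}");
    return parts[1] switch
    {
        "turbo" => "turbo-v2.5",
        "flash" => "eleven-flash",
        _ => throw new ArgumentException($"Unknown model in name: {filename}")
    };
}

// Usage sketch: sort generated files into the review folders.
foreach (var file in new[] { "short_turbo_voice1.mp3", "long_flash_voice3.mp3" })
{
    string folder = Path.Combine("elevenlabs-research-samples", FolderForSample(file));
    Directory.CreateDirectory(folder);
    Console.WriteLine($"{file} -> {folder}");
    // File.Move(file, Path.Combine(folder, file)); // uncomment once the samples exist
}
```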

Success Criteria:

  • All audio files organized and accessible
  • File names are clear and consistent
  • All files tested and working
  • Ready for client review

6. Documentation

Duration: 3 hours (Day 3)

Objective: Complete deliverable template with all findings

📄 Deliverable Template →

Complete the deliverable template with all research findings, technical metrics, audio samples, and recommendations for client review.

Sections to Complete:

  1. Audio Samples Delivered

    • Document sample organization
    • List test texts used
    • Table of voice IDs tested
  2. Technical Metrics

    • Complete performance data table for all tests
    • Calculate average metrics per model
  3. Technical Implementation Notes

    • SDK integration complexity
    • Error handling observations
    • API rate limits & constraints
    • Audio format verification
  4. Model Comparison Summary

    • Technical trade-offs
    • Pros and cons for each model
  5. Red Flags & Technical Concerns

    • Reliability issues
    • Deprecation warnings
    • API limitations
    • Performance concerns
  6. Voice Gallery System Research

    • Voice ID validation results
    • Minimum required data structure
    • Code samples
  7. Developer Recommendation

    • Our best approach
    • Recommended model and justification
    • Technical implementation approach
    • Cost implications
    • Risk assessment
  8. Client Decision Section

    • Quality assessment templates (for client)
    • Prepare for final model/voice selection
  9. Code Samples

    • Include all working code in appendix

Tasks:

  • Fill out all sections of deliverable template
  • Review for completeness
  • Verify all data is accurate
  • Check all code samples are included
  • Proofread documentation
  • Final review before submission

Success Criteria:

  • Deliverable template 100% complete
  • All sections filled with accurate data
  • Ready for client review and decision

Client Decision Process

After research completion:

  1. Developer provides:

    • All audio samples organized
    • Technical metrics and cost data
    • Developer recommendation
  2. Client evaluates:

    • Listen to all audio samples
    • Rate quality using provided scale (1-5)
    • Review technical metrics and costs
  3. Client decides:

    • Select preferred model (Turbo v2.5 or Flash v2.5)
    • Select preferred voices
    • Approve budget
  4. Developer implements:

    • Integrate selected model
    • Use client-selected voices
    • Implement based on technical findings

Client Evaluation Scale:

  • 5 - Excellent: Ready to use, professional quality
  • 4 - Good: High quality, minor imperfections
  • 3 - Normal: Acceptable quality, usable
  • 2 - Bad: Poor quality, noticeable issues
  • 1 - Not ready to use: Unacceptable quality

Resources


Notes Section

Use this space to document findings, issues, or observations during research:

Day 1 Notes:
-
-
-

Day 2 Notes:
-
-
-

Issues Encountered:
-
-

Recommendations:
-
-