ElevenLabs API Research

Assigned to: Backend Developer
Duration: 20 hours (2.5 working days)

SDK Repository: https://github.com/RageAgainstThePixel/ElevenLabs-DotNet


Motivation & Goal

Why This Research?

MicDots transforms QR codes into voice experiences. To build the MVP, we need to understand ElevenLabs text-to-speech capabilities and make critical technical decisions:

  • Which model should we use? (Turbo v2.5 vs Flash v2.5)
  • How do custom voices work? (Client-provided voice IDs OR default voices)
  • What are the costs and performance? (Real metrics for budget planning)
  • How do we implement it? (SDK integration, error handling, data structure)

Voice Selection:

We'll test with 3 voices:

  • If client provides voice IDs: Use their 3 voice IDs
  • If client doesn't provide: Use defaults (1 male, 1 female, 1 male British)

What We'll Deliver:

After 20 hours of comprehensive research, the client will have:

  • ✅ Audio samples from both models with 3 test voices
  • ✅ Technical metrics (cost, speed, file size, quality)
  • ✅ Detailed performance analysis and edge case testing
  • ✅ Developer recommendation with justification
  • ✅ Everything needed to make an informed model selection decision

Decision Flow:

  1. Developer tests both models, provides data and audio samples
  2. Client listens, evaluates quality, reviews costs
  3. Client decides which model to use based on their priorities
  4. Developer implements the chosen solution

This research removes guesswork and enables confident, data-driven decisions for the MicDots MVP.


Task Breakdown & Time Estimation

| ID | Task | Description | Time Estimation | Day |
| --- | --- | --- | --- | --- |
| MD-P1-REL-01 | Setup & Basic TTS | SDK installation, authentication, basic text-to-speech working | 2 hours | Day 1 |
| MD-P1-REL-02 | Model Comparison | Test both models with 3 samples, comprehensive quality analysis, performance metrics | 6 hours | Day 1-2 |
| MD-P1-REL-03 | Voice Customization | Test with client's 3 voice IDs, validate they work, edge-case testing | 4 hours | Day 2 |
| MD-P1-REL-04 | Code Samples | Working SDK integration examples with error handling | 3 hours | Day 2 |
| MD-P1-REL-05 | Client Review Prep | Organize audio samples, prepare comparison materials | 2 hours | Day 3 |
| MD-P1-REL-06 | Documentation | Complete deliverable template with findings | 3 hours | Day 3 |

Total Time Estimation: 20 hours (Day 1: 8h, Day 2: 7h, Day 3: 5h)

Focus Areas:

  • Comprehensive Testing: Thorough model comparison with quality analysis
  • Performance Metrics: Real-world response times and scalability testing
  • Edge Cases: Test error scenarios and special character handling
  • Client Decision: Provide complete materials for informed model selection
  • Production-Ready: Code samples with proper error handling and best practices

Pre-Requirements (What We Need from Client)

📋 View Detailed Pre-Requirements →

Before starting this research, confirm the following checklist:

Required from Client:

  • ElevenLabs Account Information (email, tier, usage limits)
  • API Credentials (API Key with appropriate permissions)
  • Use Case Specifications (target audience, voice characteristics)

Optional but Helpful:

  • Voice IDs for Testing (3 voice IDs, mix of male/female)
    • If client provides voice IDs, we'll test with those
    • If NOT provided, we'll use defaults: 1 male, 1 female, 1 male British
    • Future requirement: we will eventually need 8 voices
  • Budget Constraints (max cost per generation, monthly budget)
  • Quality Expectations (audio quality standards, clarity requirements)

Audio Format (Pre-defined):

  • MP3, 128 kbps, Mono - Optimized for voice speech and slow internet

Before Starting Day 1:

  • API key is active and tested
  • Voice IDs are valid and accessible with the provided API key
  • Account has sufficient credits/quota for testing (~18 samples)
  • Client is available for questions during the research period

Research Tasks

1. Setup & Basic TTS

Duration: 2 hours (Day 1)

Objective: Install SDK, configure environment, and get basic text-to-speech working

SDK Repository: https://github.com/RageAgainstThePixel/ElevenLabs-DotNet

Installation:

dotnet add package ElevenLabs-DotNet

Audio Format Requirements:

  • Format: MP3 only
  • Bitrate: 128 kbps (optimized for voice speech)
  • Channels: Mono (single channel)
  • Purpose: Good quality for slow internet, optimized for voice/QR code use case

SDK Initialization Example:

using ElevenLabs;
using ElevenLabs.TextToSpeech;

// Initialize the SDK client
var api = new ElevenLabsClient("your-api-key-here");

// Verify API connection by listing available voices
var voices = await api.VoicesEndpoint.GetAllVoicesAsync();
Console.WriteLine($"API Connected! Found {voices.Count} voices");

Basic Text-to-Speech Example:

using ElevenLabs;
using ElevenLabs.TextToSpeech;

var api = new ElevenLabsClient("your-api-key-here");

// Voice settings tune delivery (stability, similarity); the audio format itself is set via outputFormat below
var voiceSettings = new VoiceSettings(
    stability: 0.5f,
    similarityBoost: 0.75f
);

// Test text
string text = "Welcome to MicDots! Scan the QR code to hear your personalized message.";

// Use default voice for initial test
string voiceId = "21m00Tcm4TlvDq8ikWAM"; // Rachel (ElevenLabs default voice)

// Generate speech (uses mp3_44100_128 by default)
var audioClip = await api.TextToSpeechEndpoint.TextToSpeechAsync(
    text: text,
    voiceId: voiceId,
    voiceSettings: voiceSettings,
    outputFormat: OutputFormat.MP3_44100_128 // MP3, 44.1kHz, 128 kbps, mono
);

// Save to file
await File.WriteAllBytesAsync("test_output.mp3", audioClip.ClipData.ToArray());
Console.WriteLine("Audio generated successfully!");

Audio Format Configuration:

The SDK supports these output formats. For MicDots, use MP3_44100_128:

// Available output formats
OutputFormat.MP3_44100_128 // ← Use this for MicDots (MP3, 44.1kHz, 128 kbps, mono)
OutputFormat.MP3_44100_192 // Higher quality MP3
OutputFormat.PCM_16000 // Raw PCM audio
OutputFormat.PCM_22050 // Raw PCM audio
OutputFormat.PCM_24000 // Raw PCM audio
OutputFormat.PCM_44100 // Raw PCM audio

Tasks:

  • Install ElevenLabs-DotNet SDK via NuGet
  • Initialize SDK client with API key
  • Test API connection by listing voices
  • Configure SDK to output MP3_44100_128 format
  • Generate first audio sample using example code above
  • Verify audio file is playable and format is correct (MP3, 128 kbps, mono)
  • Verify file size is reasonable for mobile/slow connections
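The format and size checks above can be partially automated. A minimal sketch (the helper is an illustrative assumption, not SDK functionality): an MP3 file normally begins with an "ID3" tag or an MPEG frame-sync byte, so a quick header check catches saves that silently wrote an error payload instead of audio.

```csharp
using System;
using System.IO;

// Quick header check: an MP3 usually starts with "ID3" or a frame sync
// (0xFF followed by a byte whose top three bits are set). Illustrative helper, not SDK code.
static bool LooksLikeMp3(byte[] data)
{
    if (data.Length < 3) return false;
    bool hasId3Tag = data[0] == (byte)'I' && data[1] == (byte)'D' && data[2] == (byte)'3';
    bool hasFrameSync = data[0] == 0xFF && (data[1] & 0xE0) == 0xE0;
    return hasId3Tag || hasFrameSync;
}

if (File.Exists("test_output.mp3"))
{
    var bytes = File.ReadAllBytes("test_output.mp3");
    Console.WriteLine($"Header looks like MP3: {LooksLikeMp3(bytes)}");
    Console.WriteLine($"Size: {bytes.Length / 1024} KB"); // eyeball against the slow-connection budget
}
```

Verifying bitrate, channel count, and playback still needs an audio tool or listening by ear; the header check only rules out obviously broken files.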

Success Criteria:

  • SDK installed and working
  • API connection active and can list voices
  • Basic TTS conversion successful
  • Audio format verified: MP3, 128 kbps, mono
  • Developer understands how to use SDK and configure audio format

2. Model Comparison

Duration: 6 hours (Day 1-2)

Objective: Test both models (Turbo v2.5 and Flash v2.5) with all 3 sample texts, run comprehensive performance tests, and provide the materials the client needs to select a model

Documentation: https://elevenlabs.io/docs/models

Models to Focus On:

  • Turbo v2.5 - Fast, optimized for English (Model ID: eleven_turbo_v2_5)
  • Eleven Flash v2.5 - Ultra-fast, low latency, English optimized (Model ID: eleven_flash_v2_5)

How to Specify Models in SDK:

The ElevenLabs SDK uses the Model enum or string identifiers to specify which model to use:

using ElevenLabs;
using ElevenLabs.TextToSpeech;

var api = new ElevenLabsClient("your-api-key-here");

// Method 1: Using Model enum (recommended)
var turboModel = Model.ElevenTurboV2_5;
var flashModel = Model.ElevenFlashV2_5;

// Method 2: Using string identifiers
string turboModelId = "eleven_turbo_v2_5";
string flashModelId = "eleven_flash_v2_5";

Complete Model Comparison Example:

using ElevenLabs;
using ElevenLabs.TextToSpeech;
using System.Diagnostics;

var api = new ElevenLabsClient("your-api-key-here");

string text = "Welcome to MicDots! Scan the QR code to hear your personalized message.";
string voiceId = "21m00Tcm4TlvDq8ikWAM"; // Rachel

var voiceSettings = new VoiceSettings(stability: 0.5f, similarityBoost: 0.75f);

// Test Turbo v2.5
var stopwatch = Stopwatch.StartNew();
var turboAudio = await api.TextToSpeechEndpoint.TextToSpeechAsync(
    text: text,
    voiceId: voiceId,
    model: Model.ElevenTurboV2_5, // ← Specify Turbo v2.5
    voiceSettings: voiceSettings,
    outputFormat: OutputFormat.MP3_44100_128
);
stopwatch.Stop();

await File.WriteAllBytesAsync("short_turbo.mp3", turboAudio.ClipData.ToArray());
Console.WriteLine($"Turbo v2.5: {stopwatch.ElapsedMilliseconds}ms, Size: {turboAudio.ClipData.Length / 1024}KB");

// Test Eleven Flash v2.5
stopwatch.Restart();
var flashAudio = await api.TextToSpeechEndpoint.TextToSpeechAsync(
    text: text,
    voiceId: voiceId,
    model: Model.ElevenFlashV2_5, // ← Specify Flash v2.5
    voiceSettings: voiceSettings,
    outputFormat: OutputFormat.MP3_44100_128
);
stopwatch.Stop();

await File.WriteAllBytesAsync("short_flash.mp3", flashAudio.ClipData.ToArray());
Console.WriteLine($"Flash v2.5: {stopwatch.ElapsedMilliseconds}ms, Size: {flashAudio.ClipData.Length / 1024}KB");

Model Identifiers Reference:

| Model Name | SDK Enum | String Identifier |
| --- | --- | --- |
| Turbo v2.5 | Model.ElevenTurboV2_5 | "eleven_turbo_v2_5" |
| Flash v2.5 | Model.ElevenFlashV2_5 | "eleven_flash_v2_5" |
| Multilingual v2 | Model.ElevenMultilingualV2 | "eleven_multilingual_v2" |
| Monolingual v1 | Model.ElevenMonolingualV1 | "eleven_monolingual_v1" |

Required Test Examples:

  1. Short promotional text (75 chars):

    • "Welcome to MicDots! Scan the QR code to hear your personalized message."
  2. Medium product description (240 chars):

    • "This limited edition product features premium materials and cutting-edge technology. Designed for professionals who demand excellence, it combines durability with elegant aesthetics. Perfect for both everyday use and special occasions."
  3. Long narrative text (600 chars):

    • "In today's fast-paced digital world, communication has evolved beyond traditional text and images. QR codes have become ubiquitous, appearing on everything from restaurant menus to museum exhibits. But what if these codes could speak? What if instead of reading static information, users could simply scan and listen? That's the vision behind our platform - transforming silent QR codes into interactive audio experiences. Whether you're a business owner looking to engage customers, an educator creating accessible content, or a marketer crafting memorable campaigns, voice-enabled QR codes open up new possibilities for connection and engagement."

Tasks:

  • Test Turbo v2.5 with all 3 samples
  • Test Flash v2.5 with all 3 samples
  • Generate and save all 6 audio files (2 models × 3 examples)
  • Label clearly: short_turbo.mp3, short_flash.mp3, medium_turbo.mp3, etc.
  • Organize in folders by model for easy comparison
  • Verify audio format is MP3, 128 kbps, mono
  • Record all metrics for client decision

Metrics to Record:

  • File size for each sample
  • Generation time (API response time)
  • Audio duration
  • Cost per sample
  • API rate limits or constraints observed
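Recording each run in a fixed CSV shape keeps the comparison table easy to assemble later. A sketch under stated assumptions: ElevenLabs bills per character, and Turbo v2.5 and Flash v2.5 are commonly listed at 0.5 credits per character — verify the rate on the current pricing page before quoting costs. The helper name and the placeholder measurements below are illustrative.

```csharp
using System;

// Assumed rate: 0.5 credits per character for Turbo/Flash v2.5 — confirm on the pricing page.
static double EstimateCredits(string text, double creditsPerChar = 0.5)
    => text.Length * creditsPerChar;

string shortText = "Welcome to MicDots! Scan the QR code to hear your personalized message.";

// One CSV row per sample; generationMs and fileSizeKB here are placeholders,
// to be replaced with the Stopwatch timing and ClipData length from the real runs.
Console.WriteLine("model,voice,text,generationMs,fileSizeKB,estCredits");
Console.WriteLine(string.Join(",", "turbo", "rachel", "short", 850, 55, EstimateCredits(shortText)));
```

Dollar cost then follows from the subscription tier's credits-per-dollar rate, which the client's account information should confirm.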

Materials to Provide Client:

  • All 6 audio samples organized by model
  • Technical metrics comparison table
  • Cost comparison
  • Speed/performance comparison
  • Developer observations on quality and reliability

Success Criteria:

  • All 6 audio samples generated
  • All metrics recorded
  • Files organized for client review
  • Client has everything needed to make decision

3. Voice Customization

Duration: 4 hours (Day 2)

Objective: Test with client's 3 voice IDs, validate they work with both models, and test edge cases (special characters, long text, API failures)

What is Voice Customization?

Voice customization involves testing client-provided voice IDs and validating they work correctly:

  1. Validate voice IDs: Verify voice IDs are recognized by ElevenLabs API
  2. Test accessibility: Confirm voices are accessible with client's API key
  3. Generate samples: Test voice IDs with both models
  4. Compare quality: Let client hear their chosen voices with both models
  5. Define data structure: Determine what voice data to store in database

How to Validate Voice IDs:

using ElevenLabs;
using ElevenLabs.Voices;

var api = new ElevenLabsClient("your-api-key-here");

// Method 1: Get specific voice by ID
string voiceId = "21m00Tcm4TlvDq8ikWAM"; // Rachel

try
{
    var voice = await api.VoicesEndpoint.GetVoiceAsync(voiceId);

    Console.WriteLine($"✓ Voice ID Valid: {voice.VoiceId}");
    Console.WriteLine($"  Name: {voice.Name}");
    Console.WriteLine($"  Category: {voice.Category}");
    Console.WriteLine($"  Labels: {string.Join(", ", voice.Labels.Select(l => $"{l.Key}={l.Value}"))}");

    // Extract gender, accent, age from labels
    var gender = voice.Labels.ContainsKey("gender") ? voice.Labels["gender"] : "unknown";
    var accent = voice.Labels.ContainsKey("accent") ? voice.Labels["accent"] : "unknown";
    var age = voice.Labels.ContainsKey("age") ? voice.Labels["age"] : "unknown";

    Console.WriteLine($"  Gender: {gender}, Accent: {accent}, Age: {age}");
}
catch (Exception ex)
{
    Console.WriteLine($"✗ Voice ID Invalid: {voiceId}");
    Console.WriteLine($"  Error: {ex.Message}");
}

// Method 2: List all available voices
var allVoices = await api.VoicesEndpoint.GetAllVoicesAsync();

Console.WriteLine($"\nTotal voices available: {allVoices.Count}");

foreach (var v in allVoices.Take(5)) // Show first 5
{
    Console.WriteLine($"- {v.Name} (ID: {v.VoiceId})");
}

How to Test Multiple Voices with Both Models:

using ElevenLabs;
using ElevenLabs.TextToSpeech;

var api = new ElevenLabsClient("your-api-key-here");

// Client-provided voice IDs (or use defaults)
var voiceIds = new Dictionary<string, string>
{
    { "voice1", "21m00Tcm4TlvDq8ikWAM" }, // Rachel (Female, American)
    { "voice2", "VR6AewLTigWG4xSOukaG" }, // Arnold (Male, American)
    { "voice3", "TX3LPaxmHKxFdv7VOQHJ" }  // Liam (Male, British)
};

var models = new[]
{
    ("turbo", Model.ElevenTurboV2_5),
    ("flash", Model.ElevenFlashV2_5)
};

var testTexts = new Dictionary<string, string>
{
    { "short", "Welcome to MicDots! Scan the QR code to hear your personalized message." },
    { "medium", "This limited edition product features premium materials and cutting-edge technology." },
    { "long", "In today's fast-paced digital world, communication has evolved beyond traditional text..." }
};

// Generate all combinations: 3 voices × 2 models × 3 texts = 18 audio files
foreach (var (voiceName, voiceId) in voiceIds)
{
    // Validate voice first
    try
    {
        var voice = await api.VoicesEndpoint.GetVoiceAsync(voiceId);
        Console.WriteLine($"Testing voice: {voice.Name} ({voiceId})");

        foreach (var (modelName, model) in models)
        {
            foreach (var (textName, text) in testTexts)
            {
                try
                {
                    var stopwatch = System.Diagnostics.Stopwatch.StartNew();

                    var audio = await api.TextToSpeechEndpoint.TextToSpeechAsync(
                        text: text,
                        voiceId: voiceId,
                        model: model,
                        outputFormat: OutputFormat.MP3_44100_128
                    );

                    stopwatch.Stop();

                    // Save with clear naming: short_turbo_rachel.mp3
                    string filename = $"{textName}_{modelName}_{voice.Name.ToLower()}.mp3";
                    await File.WriteAllBytesAsync(filename, audio.ClipData.ToArray());

                    Console.WriteLine($"  ✓ {filename} - {stopwatch.ElapsedMilliseconds}ms, {audio.ClipData.Length / 1024}KB");
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"  ✗ Failed: {textName}_{modelName}_{voiceName} - {ex.Message}");
                }
            }
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($"✗ Voice validation failed: {voiceId} - {ex.Message}");
    }
}

Default Voice IDs (if client doesn't provide):

| Voice Name | Voice ID | Gender | Accent | Use Case |
| --- | --- | --- | --- | --- |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Female | American | Clear, professional |
| Arnold | VR6AewLTigWG4xSOukaG | Male | American | Deep, authoritative |
| Liam | TX3LPaxmHKxFdv7VOQHJ | Male | British | Sophisticated, accent variety |

Tasks:

  • Validate all 3 client-provided voice IDs using code above
  • Use SDK to fetch voice details (name, gender, accent, labels)
  • Test each voice ID with both models (Turbo v2.5 and Flash v2.5)
  • Generate all audio files (3 voices × 2 models × 3 samples = 18 total)
  • Save audio files with clear naming convention: {length}_{model}_{voicename}.mp3
  • Verify audio format: MP3, 128 kbps, mono
  • Record metrics for each voice/model combination (time, size, cost)
  • Test error scenarios (invalid voice ID, permission issues)
  • Determine minimum required data to store in database

Metrics to Record:

  • File size for each voice/model combination
  • Generation time per voice
  • Audio duration
  • Cost per voice/model combination
  • Validation results (which voices work, which don't)

Voice Gallery Data Structure to Test:

Test which data structure works best:

Option 1: Voice ID Only (Minimal)

{
  "voiceId": "21m00Tcm4TlvDq8ikWAM"
}

Option 2: Full Voice Object (Recommended)

{
  "voiceId": "21m00Tcm4TlvDq8ikWAM",
  "name": "Rachel",
  "language": "en",
  "gender": "female",
  "accent": "American"
}
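If Option 2 is adopted, the stored object round-trips cleanly through System.Text.Json. A minimal sketch — the anonymous shape mirrors the JSON above and is an illustration, not a final schema:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Option 2 payload as it might be persisted per voice (field names mirror the JSON above).
var voiceRecord = new
{
    voiceId = "21m00Tcm4TlvDq8ikWAM",
    name = "Rachel",
    language = "en",
    gender = "female",
    accent = "American"
};

string json = JsonSerializer.Serialize(voiceRecord, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(json);

// Round-trip into a dictionary to confirm the keys survive storage as-is.
var restored = JsonSerializer.Deserialize<Dictionary<string, string>>(json);
Console.WriteLine($"Restored voiceId: {restored?["voiceId"]}");
```

In practice the gender and accent values can be filled from the Labels dictionary returned by GetVoiceAsync rather than typed by hand.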

Success Criteria:

  • All 18 audio samples generated (3 voices × 2 models × 3 samples)
  • All voice IDs validated and working
  • Files organized by voice and model
  • All metrics recorded
  • Recommended data structure defined

4. Code Samples

Duration: 3 hours (Day 2)

Objective: Create working SDK integration examples in C# with comprehensive error handling

Required Code Samples:

  1. Basic Text-to-Speech Conversion

    • Initialize SDK client
    • Convert text to speech
    • Save audio file locally
    • Configure audio format (MP3, 128 kbps, mono)
  2. Voice Validation

    • Fetch voice details by voice ID
    • Validate voice exists and is accessible
    • Test audio generation with voice ID
    • Handle validation errors
  3. Error Handling

    • Handle API errors gracefully
    • Retry logic for failed requests
    • Timeout handling
    • Invalid voice ID handling
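For the retry requirement above, a small generic wrapper keeps the TTS call sites clean. A sketch with exponential backoff — WithRetryAsync is an illustrative helper, not part of the ElevenLabs-DotNet SDK, and the simulated flaky operation stands in for a real TextToSpeechAsync call:

```csharp
using System;
using System.Threading.Tasks;

// Generic retry helper with exponential backoff (illustrative, not part of the SDK).
// Retries any async operation up to maxAttempts times, doubling the delay each time.
static async Task<T> WithRetryAsync<T>(Func<Task<T>> operation, int maxAttempts = 3, int baseDelayMs = 500)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (Exception ex) when (attempt < maxAttempts)
        {
            int delay = baseDelayMs * (1 << (attempt - 1)); // 500ms, 1000ms, 2000ms, ...
            Console.WriteLine($"Attempt {attempt} failed ({ex.Message}); retrying in {delay}ms");
            await Task.Delay(delay);
        }
    }
}

// Usage sketch: wrap a TTS call (simulated here by an operation that fails twice, then succeeds).
int calls = 0;
var result = await WithRetryAsync(async () =>
{
    calls++;
    if (calls < 3) throw new InvalidOperationException("transient error");
    await Task.Delay(1); // stand-in for api.TextToSpeechEndpoint.TextToSpeechAsync(...)
    return "audio-bytes";
});
Console.WriteLine($"Succeeded after {calls} attempts: {result}");
```

In production, catching only transient failures (timeouts, 429/5xx responses) rather than every Exception avoids pointlessly retrying permanently invalid requests such as a bad voice ID.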

Tasks:

  • Write clean, documented C# code
  • Use ElevenLabs-DotNet SDK best practices
  • Include error handling in all samples
  • Test all code samples
  • Add comments explaining each step

Success Criteria:

  • All code samples working and tested
  • Code is clean and well-documented
  • Samples cover all required scenarios

5. Client Review Prep

Duration: 2 hours (Day 3)

Objective: Organize audio samples and prepare final deliverable for client review

Tasks:

  • Organize all audio files in clear folder structure
    • /turbo-v2.5/ - All Turbo v2.5 samples
    • /eleven-flash/ - All Eleven Flash samples
  • Verify file naming conventions are clear
    • Format: {length}_{model}_{voicename}.mp3
  • Create sample organization documentation
  • Verify all audio files are MP3, 128 kbps, mono
  • Test that all audio files play correctly
  • Prepare evaluation templates for client
  • Final quality check

Folder Structure Example:

/elevenlabs-research-samples/
├── turbo-v2.5/
│   ├── short_turbo_voice1.mp3
│   ├── short_turbo_voice2.mp3
│   ├── short_turbo_voice3.mp3
│   ├── medium_turbo_voice1.mp3
│   ├── medium_turbo_voice2.mp3
│   ├── medium_turbo_voice3.mp3
│   ├── long_turbo_voice1.mp3
│   ├── long_turbo_voice2.mp3
│   └── long_turbo_voice3.mp3
└── eleven-flash/
    ├── short_flash_voice1.mp3
    ├── short_flash_voice2.mp3
    ├── short_flash_voice3.mp3
    ├── medium_flash_voice1.mp3
    ├── medium_flash_voice2.mp3
    ├── medium_flash_voice3.mp3
    ├── long_flash_voice1.mp3
    ├── long_flash_voice2.mp3
    └── long_flash_voice3.mp3
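Sorting the generated files into this structure can be scripted from the naming convention. A sketch — FolderForSample is an illustrative helper, and the File.Move line is left commented until the samples actually exist:

```csharp
using System;
using System.IO;

// Maps a sample filename ({length}_{model}_{voicename}.mp3) to its review folder.
// Folder names follow the structure above; the helper itself is an illustrative assumption.
static string FolderForSample(string filename)
{
    var parts = Path.GetFileNameWithoutExtension(filename).Split('_');
    if (parts.Length < 3) throw new ArgumentException($"Unexpected name: {filename}");
    return parts[1] switch
    {
        "turbo" => "turbo-v2.5",
        "flash" => "eleven-flash",
        _ => throw new ArgumentException($"Unknown model in name: {filename}")
    };
}

// Usage sketch: sort generated files into the review folders.
foreach (var file in new[] { "short_turbo_voice1.mp3", "long_flash_voice3.mp3" })
{
    string folder = Path.Combine("elevenlabs-research-samples", FolderForSample(file));
    Directory.CreateDirectory(folder);
    Console.WriteLine($"{file} -> {folder}");
    // File.Move(file, Path.Combine(folder, file)); // uncomment once the samples exist
}
```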

Success Criteria:

  • All audio files organized and accessible
  • File names are clear and consistent
  • All files tested and working
  • Ready for client review

6. Documentation

Duration: 3 hours (Day 3)

Objective: Complete deliverable template with all findings

📄 Deliverable Template →

Complete the deliverable template with all research findings, technical metrics, audio samples, and recommendations for client review.

Sections to Complete:

  1. Audio Samples Delivered

    • Document sample organization
    • List test texts used
    • Table of voice IDs tested
  2. Technical Metrics

    • Complete performance data table for all tests
    • Calculate average metrics per model
  3. Technical Implementation Notes

    • SDK integration complexity
    • Error handling observations
    • API rate limits & constraints
    • Audio format verification
  4. Model Comparison Summary

    • Technical trade-offs
    • Pros and cons for each model
  5. Red Flags & Technical Concerns

    • Reliability issues
    • Deprecation warnings
    • API limitations
    • Performance concerns
  6. Voice Gallery System Research

    • Voice ID validation results
    • Minimum required data structure
    • Code samples
  7. Developer Recommendation

    • Our best approach
    • Recommended model and justification
    • Technical implementation approach
    • Cost implications
    • Risk assessment
  8. Client Decision Section

    • Quality assessment templates (for client)
    • Prepare for final model/voice selection
  9. Code Samples

    • Include all working code in appendix

Tasks:

  • Fill out all sections of deliverable template
  • Review for completeness
  • Verify all data is accurate
  • Check all code samples are included
  • Proofread documentation
  • Final review before submission

Success Criteria:

  • Deliverable template 100% complete
  • All sections filled with accurate data
  • Ready for client review and decision

Client Decision Process

After research completion:

  1. Developer provides:

    • All audio samples organized
    • Technical metrics and cost data
    • Developer recommendation
  2. Client evaluates:

    • Listen to all audio samples
    • Rate quality using provided scale (1-5)
    • Review technical metrics and costs
  3. Client decides:

    • Select preferred model (Turbo v2.5 or Flash v2.5)
    • Select preferred voices
    • Approve budget
  4. Developer implements:

    • Integrate selected model
    • Use client-selected voices
    • Implement based on technical findings

Client Evaluation Scale:

  • 5 - Excellent: Ready to use, professional quality
  • 4 - Good: High quality, minor imperfections
  • 3 - Normal: Acceptable quality, usable
  • 2 - Bad: Poor quality, noticeable issues
  • 1 - Not ready to use: Unacceptable quality

Resources


Notes Section

Use this space to document findings, issues, or observations during research:

Day 1 Notes:
-
-
-

Day 2 Notes:
-
-
-

Issues Encountered:
-
-

Recommendations:
-
-