Every Story, Ready to Listen: Azure Speech HD Voices
Photo: Unsplash
Reading aloud is one of the oldest forms of storytelling. The audiobook turned that impulse into a commercial format, but for indie authors audiobooks have always been a luxury, because production costs time and money. You need a microphone, a quiet environment, editing software, and hours of post-production.
OutaStory does it differently: every story can automatically be presented as an audio version — chapter by chapter, with Azure Speech HD Voices that sound like real narrators.
How the audio process works
Audio publishing is step 3 of the three-step publish wizard. The author picks a voice from a short preview list, then everything runs asynchronously.
Technically, several things happen one after another:
1. SSML generation via GPT-5:
The raw text of each chapter is sent to Azure OpenAI, model gpt-5. The task: convert the text into valid SSML (Speech Synthesis Markup Language). SSML lets you control pauses, emphasis, speaking pace, and sentence melody — far beyond "read this text out loud."
The model knows the SSML standard, but it doesn't always produce valid XML right away. That's why we built OutaStory.Ssml, our own library that validates the generated SSML against the SSML specification. If validation fails, the request is resubmitted with an error-correction instruction.
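The validate-and-retry loop described above can be sketched roughly as follows. This is a minimal Python sketch, not OutaStory.Ssml itself: `check_ssml` only tests well-formedness and the root element, and `generate`, `fake_model`, and `max_attempts` are hypothetical stand-ins for the Azure OpenAI call and whatever retry policy the real pipeline uses.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace


def check_ssml(ssml: str) -> Optional[str]:
    """Return None if the simplified check passes, else an error message."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return f"not well-formed XML: {exc}"
    if root.tag != "{" + SSML_NS + "}speak":
        return f"root element must be <speak> in the SSML namespace, got {root.tag}"
    return None


def generate_valid_ssml(
    chapter_text: str,
    generate: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,  # assumption: the article does not state a retry count
) -> str:
    """Call `generate` until its output validates, feeding each error back in."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        ssml = generate(chapter_text, error)
        error = check_ssml(ssml)
        if error is None:
            return ssml
    raise ValueError(f"no valid SSML after {max_attempts} attempts: {error}")


def fake_model(text: str, previous_error: Optional[str]) -> str:
    """Hypothetical stand-in for the LLM: fails once, then corrects itself."""
    if previous_error is None:
        return f'<speak xmlns="{SSML_NS}"><p>{text}</speak>'  # <p> never closed
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">'
        f"<p>{text}</p></speak>"
    )


ssml = generate_valid_ssml("Hello.", fake_model)
print(ssml)
```

The error message from the failed attempt is passed into the next call, which mirrors the idea of resubmitting with an error-correction instruction.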
2. MP3 synthesis via Azure Speech:
The validated SSML is handed to Azure Speech, specifically to its HD Voices. HD Voices don't sound like the classic text-to-speech systems from the nineties. They modulate, they breathe, they have rhythm. The difference compared to standard voices is noticeable.
Synthesis happens chapter by chapter. Each chapter is its own MP3 file, stored in Azure Blob Storage, in the audio container.
3. Service Bus and ordering:
Generation is driven via the ssmlprocessor Azure Service Bus queue. Important detail: this queue is session-enabled. That means messages for the same draft are processed in the correct order — chapter 1 before chapter 2, and so on. That sounds obvious, but it isn't trivial in distributed systems.
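To make the ordering guarantee concrete, here is a small in-memory model of session-based processing. It does not use the Azure Service Bus SDK; it only illustrates the idea that messages sharing a session ID are handled in arrival order. `ChapterMessage` and `process_by_session` are hypothetical names, and using the draft ID as the session ID is an assumption inferred from the description above.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, List


@dataclass(frozen=True)
class ChapterMessage:
    session_id: str  # assumed here to be the draft ID
    chapter: int


def process_by_session(messages: Iterable[ChapterMessage]) -> Dict[str, List[int]]:
    """Group messages by session and handle each session in arrival order."""
    sessions: Dict[str, Deque[ChapterMessage]] = defaultdict(deque)
    for msg in messages:
        sessions[msg.session_id].append(msg)

    processed: Dict[str, List[int]] = {}
    for session_id, queue in sessions.items():
        handled: List[int] = []
        # A session is drained front to back, so chapters cannot overtake each other.
        while queue:
            handled.append(queue.popleft().chapter)
        processed[session_id] = handled
    return processed


# Messages from two drafts arrive interleaved on the same queue.
incoming = [
    ChapterMessage("draft-a", 1),
    ChapterMessage("draft-b", 1),
    ChapterMessage("draft-a", 2),
    ChapterMessage("draft-b", 2),
    ChapterMessage("draft-a", 3),
]
print(process_by_session(incoming))
```

Without sessions, several competing consumers could pick up chapters 1, 2, and 3 of the same draft at the same time and finish them in any order.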
The 200,000-character limit
Azure Speech HD Voices have a limit per synthesis request. On top of that, we introduced an overall limit of 200,000 characters per story for OutaStory. That's roughly equivalent to a novel of 100 to 120 pages, or several shorter stories combined.
200,000 characters is generous enough for most indie stories. For our 42 flagship stories, each around 14,000 words (roughly 90,000 characters), the limit is never reached.
The limit is a safety net — it prevents half a million characters from accidentally being sent to the synthesizer, which would be both expensive and slow.
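A guard like this can be sketched in a few lines. The following is a minimal sketch, assuming the limit is counted over the raw chapter text (the article does not say whether the generated SSML markup counts) and that exactly 200,000 characters is still allowed; `check_story_length` and `MAX_STORY_CHARS` are hypothetical names.

```python
from typing import Sequence

MAX_STORY_CHARS = 200_000  # overall per-story limit named in the article


def check_story_length(chapters: Sequence[str], limit: int = MAX_STORY_CHARS) -> int:
    """Return the total character count, or raise if the story exceeds the limit."""
    total = sum(len(chapter) for chapter in chapters)
    if total > limit:
        raise ValueError(f"story has {total} characters, limit is {limit}")
    return total


# A flagship-sized story: about 90,000 characters across 9 chapters.
flagship = ["x" * 10_000] * 9
used = check_story_length(flagship)
print(f"{used} characters, {used / MAX_STORY_CHARS:.0%} of the limit")
```

With the figures from the text, a 90,000-character flagship story uses 45% of the budget, which is why the limit only triggers on unusually long or accidentally duplicated input.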
What makes HD Voices better
I want to briefly explain why we chose Azure Speech HD Voices over plain standard text-to-speech.
Standard TTS reads aloud. HD Voices narrate. The difference shows up in:
- Sentence melody: HD Voices raise their pitch at the end of a question and lower it at the end of a statement — automatically.
- Pauses: Paragraphs get real breathing pauses instead of being read in one continuous rush.
- Emotional coloring: Exciting scenes sound different from quiet moments, when the SSML marks them accordingly.
- Natural pacing: HD Voices vary their tempo slightly, which sounds more human than uniform reading.
For a platform where listening stands on equal footing with reading, the choice wasn't even a question. It was HD or nothing.
The in-house SSML library: OutaStory.Ssml
I want to briefly touch on this library, because it safeguards a disproportionately important part of audio quality.
SSML is a W3C standard, but the standard leaves room for interpretation. Different implementations interpret edge cases differently. Azure Speech has its own extensions that aren't part of the standard but are valid. GPT-5 knows the standard but sometimes produces incorrectly nested tags.
OutaStory.Ssml solves this in two steps: first schema validation against the SSML standard (with Azure extensions), then a semantic check against known problem patterns. If anything fails, the library returns a structured error, which gets sent as part of the retry prompt to the LLM.
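The two-step idea can be illustrated with a small Python sketch. This is not the OutaStory.Ssml API: the element allowlist is a tiny stand-in for real schema validation (and omits Azure extensions entirely), the two "problem patterns" are hypothetical examples, and `SsmlIssue`, `validate`, and `retry_hint` are names invented for this illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace
# Illustrative allowlist only; a real schema is far larger.
ALLOWED_ELEMENTS = {"speak", "voice", "p", "s", "break", "emphasis", "prosody"}


@dataclass(frozen=True)
class SsmlIssue:
    stage: str    # "schema" or "semantic"
    message: str


def _local_name(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]


def validate(ssml: str) -> List[SsmlIssue]:
    # Step 1: structural check (stand-in for real schema validation).
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [SsmlIssue("schema", f"not well-formed XML: {exc}")]
    issues = [
        SsmlIssue("schema", f"unexpected element <{_local_name(el.tag)}>")
        for el in root.iter()
        if not el.tag.startswith("{" + SSML_NS + "}")
        or _local_name(el.tag) not in ALLOWED_ELEMENTS
    ]
    if issues:
        return issues
    # Step 2: semantic check against (hypothetical) known problem patterns.
    p_tag = "{" + SSML_NS + "}p"
    for outer in root.iter(p_tag):
        if any(inner is not outer for inner in outer.iter(p_tag)):
            issues.append(SsmlIssue("semantic", "<p> nested inside <p>"))
    for brk in root.iter("{" + SSML_NS + "}break"):
        if len(brk) or (brk.text or "").strip():
            issues.append(SsmlIssue("semantic", "<break> must be empty"))
    return issues


def retry_hint(issues: List[SsmlIssue]) -> str:
    """Render issues as text that could be appended to a retry prompt."""
    return "\n".join(f"[{issue.stage}] {issue.message}" for issue in issues)


GOOD = (
    f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">'
    '<p>She paused.<break time="500ms"/> '
    '<emphasis level="strong">Then</emphasis> she ran.</p></speak>'
)
NESTED = f'<speak xmlns="{SSML_NS}"><p>One<p>Two</p></p></speak>'

print(validate(GOOD))
print(retry_hint(validate(NESTED)))
```

Returning a list of typed issues rather than a bare boolean is what makes the retry useful: the model is told which stage failed and why, instead of just being asked to try again.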
That might sound like over-engineering. In practice, it has prevented corrupted MP3 files and synthesis runs that abort with an internal error.
Feature flag: AudioGeneration
Like the AI covers, audio has a feature flag too: AudioGeneration. On hosts where Azure Speech isn't configured, the wizard step stays visible but disabled. That allows local development without real Azure credentials.
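The "visible but disabled" behavior can be sketched as a small decision function. This is a minimal sketch under assumptions: the configuration keys (`FeatureFlags__AudioGeneration`, `AzureSpeech__Key`, `AzureSpeech__Region`) and the `WizardStep` type are hypothetical, and whether the real flag is set by hand or derived from the presence of credentials is not stated in the article.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WizardStep:
    visible: bool
    enabled: bool


def audio_step_state(config: Mapping[str, str]) -> WizardStep:
    """Decide how the audio step of the publish wizard is shown."""
    # Hypothetical key names; the real configuration layout is not documented here.
    flag_on = config.get("FeatureFlags__AudioGeneration", "false").lower() == "true"
    speech_configured = bool(config.get("AzureSpeech__Key")) and bool(
        config.get("AzureSpeech__Region")
    )
    # The step is always shown; it is only usable when audio can actually run.
    return WizardStep(visible=True, enabled=flag_on and speech_configured)


local_dev = {}  # no Azure credentials on a developer machine
production = {
    "FeatureFlags__AudioGeneration": "true",
    "AzureSpeech__Key": "placeholder-key",
    "AzureSpeech__Region": "westeurope",
}
print(audio_step_state(local_dev))
print(audio_step_state(production))
```

Keeping the step visible on unconfigured hosts means the wizard layout stays the same in development and production; only the interactivity differs.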
What's next?
Next week: how slug-based URLs make stories and authors linkable and discoverable — and what's behind draft-{guid} until an author picks her own slug.
