SSML Markup Capabilities for Speech Synthesis

09-09-2025

SSML (Speech Synthesis Markup Language) is an XML-based markup language. It is used to annotate text that a speech synthesis engine, such as a neural network voice, converts into speech.

What is its purpose? With SSML, you can control tone, accent, and pronunciation, and add pauses and other audio effects.
Usage goals: The main goal is to make synthesized speech sound natural and expressive. SSML also helps ensure that numbers, dates, phone numbers, and other specific information are pronounced correctly.
Who created it? SSML was developed by the World Wide Web Consortium (W3C), the organization that sets web standards.
What is its mission? SSML aims to provide a standard way to control speech synthesis across applications and platforms.

The SSML specification is available on the official W3C website: https://www.w3.org/TR/speech-synthesis/

Basic Rules for Writing SSML Tags

SSML tags are enclosed in angle brackets, as in HTML. Example: <speak>text</speak>.
Most tags come in pairs: an opening tag and a closing tag. Empty tags such as <break> are the exception and are written as a single self-closing tag, for example <break time="500ms"/>.
Within tags, you can use attributes to adjust pronunciation settings.
Some tags can be nested within others.
SSML tag and attribute syntax follows the XML standard, so every document must be well-formed XML.
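
The rules above can be checked with any XML parser. The following is a minimal sketch in Python, not part of the SpeechGen documentation: the sample text, the helper names is_well_formed and tag_names, and the choice of <prosody> and <emphasis> are illustrative assumptions. Both are standard W3C SSML elements, but whether a particular voice honors them depends on the voice. The check covers XML well-formedness only, not whether a synthesizer accepts the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical SSML sample: a paired root tag, a self-closing <break>,
# an attribute on each tag, and one tag nested inside another.
SSML = (
    '<speak>'
    'Hello. <break time="500ms"/> '
    '<prosody rate="slow">This is '
    '<emphasis level="strong">nested</emphasis> markup.</prosody>'
    '</speak>'
)


def is_well_formed(ssml):
    """Return True if the string parses as XML (well-formedness only)."""
    try:
        ET.fromstring(ssml)
    except ET.ParseError:
        return False
    return True


def tag_names(ssml):
    """Return the element names in document order."""
    return [element.tag for element in ET.fromstring(ssml).iter()]


if __name__ == "__main__":
    print(is_well_formed(SSML))
    print(tag_names(SSML))
    # A <break> that is neither self-closed nor paired breaks the XML rules.
    print(is_well_formed('<speak>Hello <break time="500ms"></speak>'))
```

A check like this catches unclosed or mismatched tags before the text is sent for synthesis.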

Supported Tags

SpeechGen supports the most common SSML tags. Some voices may ignore certain tags or attributes. Details are given in the documentation for each tag.

Below is a list of main tags with links to detailed documentation for each.