Review Speech Synthesis from a Product's Perspective

Driven by deep learning advancements, speech synthesis has made significant strides in recent years. Existing technologies can produce human-like speech, enabling a wide range of applications.

However, current speech synthesis systems lack fine-grained control over the generated speech. This limitation makes it hard to create personalized and contextually relevant speech, which is crucial to the user experience in many applications.

In this article, we first review the current state of speech synthesis technologies, including their strengths and weaknesses. We then discuss the control mechanisms available in existing systems and their limitations. Finally, we envision the future of speech synthesis, highlighting the potential for more advanced control mechanisms and their implications for product development.

Current State of Speech Synthesis

Controllable Speech Synthesis

Future Directions

Citation

@article{lu2025tts,
  title   = "Review Speech Synthesis from a Product's Perspective",
  author  = "Lu, Yan",
  journal = "https://jason-cs18.github.io/",
  year    = "2025",
  month   = "Apr",
  url     = "https://jason-cs18.github.io/mlsys/blog/avatar/tts"
}

References

Xie et al. “Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey” arXiv preprint arXiv:2412.06602 (2024).
Wang et al. “Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens” arXiv preprint arXiv:2503.01710 (2025).
Du et al. “CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models” arXiv preprint arXiv:2412.10117 (2024).
Chen et al. “F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching” arXiv preprint arXiv:2410.06885 (2024).
Jiang et al. “MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis” arXiv preprint arXiv:2502.18924 (2025).