Modern Text-to-Speech Systems Review






Familiarity with general machine learning approaches (CNNs, GRUs, etc.) is required to follow this video.

The review covers systems developed in 2016-2018.

The transcript is available in our Medium publication: https://medium.com/hydrosphere-io/webinar-transcript-modern-text-to-speech-systems-review-for-intermediate-advanced-ml-audience-6c12ed9d46bb

References:
1. Van Den Oord, Aäron, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. “WaveNet: A generative model for raw audio.” In SSW, p. 125. 2016.

2. Paine, Tom Le, Pooya Khorrami, Shiyu Chang, Yang Zhang, Prajit Ramachandran, Mark A. Hasegawa-Johnson, and Thomas S. Huang. “Fast wavenet generation algorithm.” arXiv preprint arXiv:1611.09482 (2016).

3. Arik, Sercan O., Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li et al. “Deep voice: Real-time neural text-to-speech.” arXiv preprint arXiv:1702.07825 (2017).

4. Wang, Yuxuan, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang et al. “Tacotron: Towards end-to-end speech synthesis.” arXiv preprint arXiv:1703.10135 (2017).

5. Arik, Sercan, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. “Deep voice 2: Multi-speaker neural text-to-speech.” arXiv preprint arXiv:1705.08947 (2017).

6. Tjandra, Andros, Sakriani Sakti, and Satoshi Nakamura. “Listening while speaking: Speech chain by deep learning.” Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017.

7. Ping, Wei, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. “Deep voice 3: Scaling text-to-speech with convolutional sequence learning.” (2018).

8. Shen, Jonathan, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions.” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779-4783. IEEE, 2018.

9. Wang, Yuxuan, et al. “Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis.” arXiv preprint arXiv:1803.09017 (2018).

10. Jia, Ye, et al. “Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis.” arXiv preprint arXiv:1806.04558 (2018).

11. Li, N., S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou. “Close to Human Quality TTS with Transformer.” arXiv preprint arXiv:1809.08895 (2018).