Progress of Text to Speech Synthesis towards High-quality and Flexible Voice Creation: from Diphone-based to Statistical Approaches

  Dr. Masami Akamine
  Toshiba Research and Consulting Company

  Prof. Qiangfu Zhao
  The University of Aizu

Date & Time:
  Sep. 16, 14:00-15:30

  Lecture room S1, The University of Aizu

  Text-to-speech synthesis (TTS) is a core technology in human-centric systems for providing human-friendly interfaces. TTS has a long history of technology development. In 1983, Prof. Klatt of MIT demonstrated a full TTS system called Klattalk. The synthesized speech was fairly intelligible but muffled and robotic, with flat intonation. We started research on TTS at Toshiba in 1995. This talk will give an overview of the TTS development done at Toshiba research labs over the last 20 years toward high-quality and highly flexible voices. The talk covers closed-loop training (CLT) of synthesis units, unit selection and fusion, a statistical approach to flexible TTS, and deployment to products and services. The CLT method automatically creates diphone speech units that minimize the squared error between synthesized and recorded reference speech. Diphone-based speech synthesis with CLT fits embedded systems such as car navigation systems well because it produces high-quality voices with a small footprint. To create more natural and clearer voices, we developed unit-selection-based systems with multiple unit selection and fusion. For content creation using TTS, we need a wide variety of voices in terms of voice characteristics, emotions and speaking styles. For this purpose, cluster adaptive training was introduced into a statistical approach based on hidden Markov models with state outputs generated by Gaussian mixture models. This approach represents the acoustic models of the speech parameters used for synthesis as a linear combination of cluster models. Different voices and emotional voices are easy to generate by manipulating the weights of the cluster models. The talk will conclude with an introduction to some products and services deployed in the Japanese market.

  Masami Akamine received his Ph.D. degree in Electrical Engineering from Tohoku University in 1985. Since 1985 he has been with the Toshiba Research and Development Center. He is currently a Senior Fellow at Toshiba Research and Consulting, responsible for coordinating research programs among research groups in Kawasaki, Japan; Cambridge, UK; and Beijing, China. His research interests include speech coding, speech synthesis, automatic speech recognition and their applications. He was recognized as an outstanding researcher by the Minister of Education, Science and Technology of Japan in 2001. He received the Technology Development Award from the Acoustical Society of Japan in 2002, the Society Best Paper Award from IEICE Japan in 2003, the Prime Minister’s Prize from the Japan Institute of Invention and Innovation in 2008, and the Achievement Award from IEICE Japan in 2012. He was also honored to receive the Purple Ribbon Medal from the Emperor of Japan in 2013. He is a senior member of IEEE and served as a member of the Speech and Language Technical Committee for two years from 2012.

  Prof. Qiangfu Zhao, The University of Aizu
  Email: qf-zhao@u-aizu.ac.jp