■ Lecture Summary
Title:
Progress of Text to Speech Synthesis towards High-quality and Flexible Voice Creation: from Diphone-based to Statistical Approaches
Lecturer:
Dr. Masami Akamine
Toshiba Research and Consulting Company
Coordinator:
Prof. Qiangfu Zhao
The University of Aizu
Date & Time:
Sep.16, 14:00-15:30
Place:
Lecture room S1, The University of Aizu
Abstract:
Text-to-speech synthesis (TTS) is a core technology in human-centric
systems for providing human-friendly interfaces. TTS has a long history
of technology development. In 1983, Prof. Klatt of MIT demonstrated a
full TTS system called Klattalk. Its synthesized speech was fairly
intelligible but muffled and robotic, with flat intonation. We started
research on TTS at Toshiba in 1995. This talk will give an overview of
the development of TTS at Toshiba research labs over the last 20 years
toward high-quality and highly flexible voices. The talk covers
closed-loop training (CLT) of synthesis units, unit selection and
fusion, a statistical approach to flexible TTS, and deployment to
products and services. The CLT method automatically creates diphone
speech units that minimize the squared error between synthesized and
recorded reference speech. Diphone-based speech synthesis with CLT is
well suited to embedded systems such as car navigation systems because
it produces high-quality voices with a small footprint. To create more
natural and clearer voices, we developed unit-selection-based systems
with multiple unit selection and fusion.
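As a rough illustration of the squared-error criterion behind CLT, the
sketch below assumes that each synthesized segment is a known linear
modification of one unit vector (for example, a pitch and duration
change written as a matrix). Under that assumption the unit that
minimizes the total squared error against the recorded references is an
ordinary least-squares solution. The names train_unit and
total_squared_error, the modification matrices and the random data are
all hypothetical and are not taken from the Toshiba system.

```python
import numpy as np


def train_unit(mod_matrices, references):
    """Return the unit u minimizing sum_j ||r_j - A_j @ u||^2.

    mod_matrices: list of (n_j, d) arrays, assumed linear synthesis operators.
    references:   list of (n_j,) arrays, recorded reference segments.
    """
    stacked_ops = np.vstack(mod_matrices)        # shape (sum n_j, d)
    stacked_refs = np.concatenate(references)    # shape (sum n_j,)
    unit, *_ = np.linalg.lstsq(stacked_ops, stacked_refs, rcond=None)
    return unit


def total_squared_error(unit, mod_matrices, references):
    """Sum of squared errors between synthesized and reference segments."""
    return sum(
        float(np.sum((ref - op @ unit) ** 2))
        for op, ref in zip(mod_matrices, references)
    )


# Hypothetical toy data: 5 reference segments generated from one hidden unit.
rng = np.random.default_rng(0)
unit_len = 8
hidden_unit = rng.standard_normal(unit_len)
mods = [rng.standard_normal((12, unit_len)) for _ in range(5)]
refs = [op @ hidden_unit + 0.01 * rng.standard_normal(12) for op in mods]

trained_unit = train_unit(mods, refs)
print(total_squared_error(trained_unit, mods, refs))
```

In this toy setting the trained unit can never have a larger total
squared error than any other candidate, including the hidden unit that
generated the data, because it is the least-squares optimum.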
For content creation using TTS, we need a wide variety of voices in
terms of characteristics, emotions and speaking styles. For this
purpose, cluster adaptive training was introduced into a statistical
approach based on Hidden Markov Models with state outputs generated by
Gaussian Mixture Models. This approach represents the acoustic models
of the speech parameters used for speech synthesis as a linear
combination of cluster models. Different voices and emotional voices
are easy to generate by manipulating the weights of the cluster models.
The talk will conclude with an introduction to some products and
services deployed in the Japanese market.
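The linear combination of cluster models can be sketched as a weighted
sum of per-cluster mean vectors, where changing the weights changes the
resulting voice. This is a minimal sketch only: the function name
combine_cluster_means, the three clusters and all numeric values are
hypothetical, and a real system would apply the combination to every
Gaussian component of the acoustic model rather than to a single vector.

```python
import numpy as np


def combine_cluster_means(cluster_means, weights):
    """Return the weighted sum of cluster mean vectors.

    cluster_means: array-like of shape (num_clusters, dim).
    weights:       array-like of shape (num_clusters,).
    """
    cluster_means = np.asarray(cluster_means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if cluster_means.shape[0] != weights.shape[0]:
        raise ValueError("need exactly one weight per cluster")
    return weights @ cluster_means


# Hypothetical cluster means for one acoustic parameter vector (3 clusters).
cluster_means = np.array([
    [1.0, 0.0, 2.0],   # assumed base cluster
    [0.0, 1.0, 0.0],   # assumed cluster for one speaking style
    [0.5, 0.5, 0.5],   # assumed cluster for another style
])

neutral = combine_cluster_means(cluster_means, [1.0, 0.0, 0.0])
blended = combine_cluster_means(cluster_means, [1.0, 0.5, 0.0])
print(neutral, blended)
```

Only the weight vector differs between the two calls, which is the
property that makes it easy to move between voices and emotions.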
Biography of the lecturer:
Masami Akamine received his Ph.D. degree in Electrical Engineering from
Tohoku University in 1985. Since 1985 he has been with the Toshiba
Research and Development Center. He is currently a Senior Fellow at
Toshiba Research and Consulting, responsible for coordinating research
programs among research groups in Kawasaki, Japan; Cambridge, UK; and
Beijing, China. His research interests include speech coding, speech
synthesis, automatic speech recognition and their applications. He was
recognized as an outstanding researcher by the Minister of Education,
Science and Technology of Japan in 2001. He received the Technology
Development Award from the Acoustical Society of Japan in 2002, the
Society Best Paper Award from IEICE Japan in 2003, the Prime Minister’s
Prize from the Japan Institute of Invention and Innovation in 2008, and
the Achievement Award from IEICE Japan in 2012. He was also honored to
receive the Purple Ribbon Medal from the Emperor of Japan in 2013. He
is a senior member of IEEE and served as a member of the Speech and
Language Technical Committee for two years from 2012.
Contact information:
Prof. Qiangfu Zhao, The University of Aizu
Email: qf-zhao@u-aizu.ac.jp