Speech Synthesis: Historical Perspective – Current Status of Bangla Speech Synthesis
Generating intelligible and natural speech with computers or other hand-held devices has long been a goal of the speech research community. The purpose is to assist visually impaired people and to improve human-machine interaction (e.g., dialogue systems, speech-to-speech translation). This process is usually referred to as speech synthesis, also known as Text-to-Speech (TTS). A speech synthesis system consists of two main components: the frontend and the waveform synthesis module. The frontend analyzes the input text and transforms it into a spoken form, while the synthesizer transforms the spoken form into the speech waveform. The first part is mainly language specific: it requires the development of linguistic resources such as a phoneme inventory, lexical rules for normalizing text to its standard spoken form, a pronunciation dictionary, and linguistic and speech corpora for training machine learning models. The second part is language independent, and several techniques have been developed for it over the last few decades, ranging from rule-based approaches to data-driven approaches (e.g., concatenative synthesis, statistical parametric synthesis). This talk will highlight these techniques, including their advantages, limitations, and challenges. In addition, a major focus will be given to the current status of Bangla speech synthesis and possible research avenues that the research community can address.
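The frontend stage described above can be sketched in a few lines. The example below is a minimal illustrative toy, not a real Bangla frontend: the tiny pronunciation dictionary, the digit-expansion table, and all phoneme symbols are invented for illustration, standing in for the language-specific resources (lexical normalization rules and pronunciation dictionary) the abstract mentions.

```python
# Toy TTS frontend sketch: text normalization followed by a
# pronunciation-dictionary lookup. All entries below are hypothetical,
# invented purely to illustrate the pipeline shape.

import re

# Hypothetical pronunciation dictionary: word -> phoneme sequence
LEXICON = {
    "call": ["K", "AO", "L"],
    "me": ["M", "IY"],
    "at": ["AE", "T"],
    "nine": ["N", "AY", "N"],
}

# Toy normalization table mapping digits to their spoken form
DIGIT_WORDS = {"9": "nine"}

def normalize(text):
    """Normalize raw text to spoken form: lowercase words, expand digits."""
    tokens = re.findall(r"[A-Za-z]+|\d", text)
    return [DIGIT_WORDS.get(t, t.lower()) for t in tokens]

def to_phonemes(text):
    """Map normalized words to phonemes via dictionary lookup."""
    phones = []
    for word in normalize(text):
        phones.extend(LEXICON.get(word, ["<UNK>"]))
    return phones

print(to_phonemes("Call me at 9"))
```

A real system would replace the lookup with grapheme-to-phoneme rules or models for out-of-vocabulary words, and the resulting phoneme sequence would then be passed to the language-independent waveform synthesizer.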