Title :

Text To Speech Generation with Chosen Accent and Noise Profiles for Aerospace and Industrial Domains

Area of research :

Computer Sciences and Information Technology

Focus area :

Text To Speech (TTS) System

Principal Investigator :

Dr Hema Murthy, Professor, Indian Institute of Technology (IIT) Madras

Executive Summary :

The objective of the project is to design and improve a Text To Speech (TTS) generation system for safety domains such as aerospace and industrial applications. Text To Speech systems, also known as speech synthesis systems, have steadily improved. However, the output of contemporary speech synthesis systems remains clearly distinguishable from actual human speech. The objectives of the proposal are manifold:

a. Improve the voice experience by identifying and closely matching the accent of the operator
b. Improve the voice experience through selective code switching, thereby mixing in the local lingo
c. Attempt to generate emotive speech so that the user/operator perceives the system as almost human
d. Generate voice for the desired accent and gender
e. Generate voice adapted / trained to a defined safety domain such as aerospace or industrial applications
f. Generate voice mixed with selective noise profiles so that the system can be used in applications such as training
g. Optimize the system towards near-real-time performance

The project shall focus on user/operator accent identification, prosody analysis and generation that satisfies the user/operator, noise mixing to match the environment of a specific domain, and selective code switching and mixing to improve the voice experience. Classical methods such as concatenative synthesis offer good fidelity, but their footprint is very large. Statistical parametric synthesis techniques provide good intelligibility but fall short on naturalness. The objective is to improve fidelity and accuracy using deep learning based techniques. Deep generative approaches such as Google's WaveNet, Lyrebird and Generative Adversarial Networks have been employed to copy and emulate a user's speech characteristics. These methods may be explored for non-native English accents where only a limited data set is available. In addition, appropriate prosody needs to be incorporated, which requires appropriate prosodic analysis. The generated speech may be limited to a sampling rate of around 48 kHz, and a speech utterance may be limited to a maximum of 25 words in a single phrase. The performance of the TTS system may be measured using Mean Opinion Score (MOS) / subjective quality evaluation tests and word error rate (WER).

Honeywell, an industry partner, shall benefit from the Accent and Noise Sensitive TTS Generation System (the deliverable) by customizing the solution and integrating it with an Automatic Speech Recognition system, thereby building conversational interfaces. Promoting natural interfaces and hands-free operation would improve the overall safety and operational efficiency of Honeywell products, solutions and services.
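
The summary names word error rate (WER) as one of the evaluation measures but does not say how it would be computed. As an illustration only, the sketch below uses the standard definition: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a hypothesis transcript, divided by the number of reference words. The function name, the lowercasing and whitespace tokenization, and the sample phrases are assumptions made for this example and are not taken from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Assumption: words are compared case-insensitively after splitting on
    whitespace; a real evaluation would likely apply fuller text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: what was synthesized vs. what a listener
# (or a speech recognizer) transcribed from the audio.
reference = "climb to flight level three five zero"
hypothesis = "climb to flight level three nine zero"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

For a TTS system, the hypothesis would typically come from transcribing the synthesized audio, so a lower WER suggests more intelligible output. It says nothing about naturalness, which is why the summary pairs it with MOS.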

Total Budget (INR):

34,84,800

Achievements :

Production of accented speech:

1. Development of a generic Indian English voice, and its adaptation to various Indian English accents.
2. Development of generic Indian-language voices (Aryan and Dravidian), and their adaptation to 9 Indian languages.
3. Adaptation to speaker and language.

Publications :

4

Organizations involved