Title :

Development of Spoken Language Corpora for Under Resourced Languages

Area of research :

Engineering Sciences

Principal Investigator :

Dr. Tanmay Bhowmik, Pandit Deendayal Energy University, Gandhinagar, Gujarat

Timeline Start Year :

2024

Timeline End Year :

2027

Executive Summary :

Under-resourced languages are those with limited resources such as speech data, language models, or text corpora. They are often spoken by smaller communities and are less well studied than more widely spoken languages. Addressing this gap requires creating speech corpora based on spoken language, which contains prosodic words. Such corpora can improve the uniformity and robustness of current automatic speech recognition (ASR) systems. A spoken language corpus is a collection of recorded speech that is transcribed and annotated with linguistic information, such as phonetic and prosodic features, and that can be used for developing and evaluating speech recognition and language processing systems. These corpora are necessary for training speech recognition systems, developing language models, conducting linguistic research, and preserving cultural heritage. Speech recognition systems typically use machine learning algorithms, which require large amounts of annotated speech data; a spoken language corpus provides a foundation for training these systems and improving their accuracy and performance. Linguistic research on spoken language corpora, focusing on phonetics, prosody, and syntax, can deepen our understanding of a language and its structure. In conclusion, spoken language corpora are crucial resources for developing and evaluating speech recognition and language processing systems, as well as for linguistic research and cultural heritage preservation.

Total Budget (INR):

18,30,000

Organizations involved