Title :

Crowdsourcing for Language Processing (CLAP): A platform for collecting multilingual data for speech and language processing

Area of research :

Computer Sciences and Information Technology

Focus area :

Development of multilingual data collection platform

Principal Investigator :

Prof. Preethi Jyothi, Indian Institute of Technology (IIT) Bombay

Timeline Start Year :

2019

Executive Summary :

Over the last decade, Artificial Intelligence (AI) has increasingly made inroads into society and our lives. However, as with all other technologies, making it accessible to people from all strata of Indian society remains an important challenge. This challenge manifests itself mostly at the interface between humans and computers. A key component of this interface that is highly sensitive to the cultural and linguistic background of the users is automatic speech recognition (ASR). Building competitive ASR systems for Indian languages requires large amounts of labeled speech data, i.e., speech clips in different Indian languages accompanied by their corresponding text. Publicly available repositories of labeled speech in Indian languages are currently limited. This project aims to collect large volumes of labeled speech in a number of Indian languages in a scalable manner using crowdsourcing. Towards this end, the investigators will undertake the following:

1) Investigators will design an Android mobile application and a corresponding backend server to crowdsource tasks for labeled speech in various Indian languages. Users of this app (the workforce) will be given two types of tasks: a "Speak" task, where users read out prompts in their native tongues, and a "Verify" task, where users confirm whether a prompt and its corresponding speech (obtained from a different user) are well matched.

2) To collect large volumes of data, it is essential to have effective mechanisms to recruit the workforce and to retain it. For recruitment, investigators will explore Facebook/Google social media advertising and will contact student bodies (as part of the National Social Service scheme), taxi driver associations, etc. As incentives, investigators will employ gamification, PayTM-based money transfers, and the coupling of AI education with data collection by presenting internship opportunities for top students. Investigators will explore various combinations of these mechanisms to determine the scheme that gives the best return on investment.

3) Given the crowdsourced nature of the collection, poor-quality data can creep into the corpora. To tackle this challenge, investigators will employ a host of techniques to post-process the data. Operations such as noise reduction and volume control will be applied to all speech clips. Verify tasks coupled with majority voting could be used to catch instances of spamming in Speak tasks. As an additional quality measure, investigators will perform automatic random checks on each user using gold-standard Verify tasks (where the outcome is known) to catch spammers.

Investigators intend to release the collected speech data as publicly available corpora that researchers and industry practitioners can use to build or bootstrap their systems. Investigators believe this would be a very valuable contribution towards furthering research on speech technologies in Indian languages.
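
The quality checks described in the summary (majority voting over Verify tasks, plus gold-standard Verify tasks to catch spammers) could look roughly like the Python sketch below. It is only an illustration and not the CLAP implementation: the function names, the (clip_id, user_id, verdict) record layout, and the min_accuracy and min_tasks thresholds are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_label(votes):
    """Return the majority verdict for one clip.

    votes: list of booleans (True = speech matches the prompt).
    Returns None when there are no votes or the vote is tied.
    """
    if not votes:
        return None
    counts = Counter(votes)
    yes, no = counts[True], counts[False]
    if yes == no:
        return None
    return yes > no


def accept_clips(verifications):
    """Aggregate Verify-task verdicts per clip by majority voting.

    verifications: iterable of (clip_id, user_id, verdict) tuples.
    This record layout is hypothetical.
    """
    by_clip = defaultdict(list)
    for clip_id, _user_id, verdict in verifications:
        by_clip[clip_id].append(verdict)
    return {clip_id: majority_label(v) for clip_id, v in by_clip.items()}


def flag_spammers(gold_answers, responses, min_accuracy=0.7, min_tasks=3):
    """Flag users whose accuracy on gold-standard Verify tasks is too low.

    gold_answers: {clip_id: known_verdict} for the gold tasks.
    responses: iterable of (clip_id, user_id, verdict) tuples.
    min_accuracy and min_tasks are illustrative thresholds, not project values.
    """
    seen = Counter()
    correct = Counter()
    for clip_id, user_id, verdict in responses:
        if clip_id in gold_answers:
            seen[user_id] += 1
            correct[user_id] += int(verdict == gold_answers[clip_id])
    return {
        user for user, n in seen.items()
        if n >= min_tasks and correct[user] / n < min_accuracy
    }
```

In this sketch a tied vote yields None rather than a forced decision, so such clips can be routed to further Verify tasks, and users with fewer than min_tasks gold responses are never flagged, which avoids penalizing new contributors on too little evidence.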

Co-PI:

Dr Kameswari Chebrolu, Associate Professor, Indian Institute of Technology (IIT) Bombay

Total Budget (INR):

36,88,960

Achievements :

The CLAP application developed for this project has already reached more than 2,000 users across India. It has received high ratings from hundreds of users, who affirmed that the application is easy and engaging to use. Investigators have collected speech data in four languages, and efforts are ongoing to collect speech in three additional languages.