Title :

Advancement of NLP Techniques for Indian Languages with Focus on Bangla and Hindi

Area of research :

Computer Sciences and Information Technology

Principal Investigator :

Dr. Arnab Bhattacharya, Indian Institute of Technology Kanpur (IITK), Uttar Pradesh

Timeline Start Year :

2024

Timeline End Year :

2027

Executive Summary :

India is a land of many languages, and the digital divide is being addressed by providing resources in people's mother tongues. Automated processing of natural language has improved significantly in recent years, but performance on even basic NLP tasks remains weak for Indian languages, particularly Bangla and Hindi. The project aims to improve performance for these languages in four ways: creating large corpora, creating benchmarks, building better NLP models, and using cross-lingual knowledge transfer. Large corpora are essential for building state-of-the-art deep-learning NLP tools, but newspapers, blogs, and social media posts lack the necessary quality and variety. Literary texts are better suited to this purpose, and task-specific annotation of them can yield high-quality benchmark datasets similar to what GLUE provides for English. The project also aims to build a generalized framework for automatic grammar correction for Indian languages, so that it can be reused beyond Bangla and Hindi. Cross-lingual knowledge transfer from higher-resource Indian languages to lower-resource ones can help create better models because these languages share traits such as scripts and sentence structure. Finally, the project plans to showcase this work on an interactive website where users can download resources, try out trained models, help annotate data, and provide feedback. Together, these efforts aim to bridge the digital divide and improve the performance of NLP tools for Indian languages.

Co-PI:

Dr. Pawan Goyal, Indian Institute of Technology (IIT) Kharagpur, West Bengal-721302

Total Budget (INR):

40,80,973

Organizations involved