I²R Research Highlights

Advancing Southeast Asian Natural Language Processing with Culturally Aware Models

Cultural Nuances in NLP Models
Recent advances in Natural Language Processing (NLP) have produced practical technologies such as BERT, T5, and GPT-4. However, these models are largely rooted in English-language datasets and cultural contexts, which limits their effectiveness in multilingual regions such as Southeast Asia.

To address this gap, models such as SEA-LION, SeaLLM, and Sailor have been developed and tailored to the linguistic and cultural nuances of the region. These models support languages such as Bahasa Indonesia, Thai, and Tagalog, offering more culturally relevant solutions.

Instruction Fine-Tuning
Instruction fine-tuning is a key technique for ensuring that pre-trained Large Language Models (LLMs) follow instructions in ways aligned with regional preferences. The CRAFT (Cultural Reasoning and Fine-Tuning) approach generates culturally relevant instructions from large, unlabelled corpora, reducing the need for costly manual labelling.

For example, CRAFT-generated questions cover topics such as Singapore’s urban planning and cultural influences, providing tailored insights for the region.



Challenges in Cultural Data
Collecting culture-intensive instructions is challenging because cultural nuances are highly context-specific. The CRAFT method combines keyword filtering with LLM-powered question generation to construct culturally rich datasets from a corpus of up to a trillion tokens.
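The two steps described above, keyword filtering followed by LLM-powered question generation, can be illustrated with a minimal Python sketch. It is not the CRAFT implementation: the keyword list, the prompt wording, and the function names (build_keyword_pattern, filter_passages, build_prompts) are all hypothetical, and the sketch stops at building prompts rather than calling any particular LLM.

```python
import re

# Hypothetical keyword list; the keywords actually used by CRAFT are not stated here.
CULTURAL_KEYWORDS = ["singapore", "hawker", "hdb"]

# Hypothetical prompt template for the question-generation step.
PROMPT_TEMPLATE = (
    "Read the passage below and write one question about the local "
    "culture or context it describes, followed by an answer.\n\n"
    "Passage:\n{passage}\n"
)


def build_keyword_pattern(keywords):
    """Compile one case-insensitive, whole-word pattern from the keywords."""
    escaped = (re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def filter_passages(passages, pattern):
    """Yield only the passages that mention at least one keyword."""
    for passage in passages:
        if pattern.search(passage):
            yield passage


def build_prompts(passages):
    """Wrap each selected passage in the question-generation prompt."""
    return [PROMPT_TEMPLATE.format(passage=p.strip()) for p in passages]


# Toy corpus standing in for a large unlabelled text collection.
corpus = [
    "Hawker centres are a big part of daily life in Singapore.",
    "The quick brown fox jumps over the lazy dog.",
    "Most residents live in HDB flats built by the public housing board.",
]

pattern = build_keyword_pattern(CULTURAL_KEYWORDS)
selected = list(filter_passages(corpus, pattern))
prompts = build_prompts(selected)
```

In this sketch, each entry in prompts would then be sent to an LLM of choice, and the generated question-answer pairs collected as instruction data. At the scale mentioned above, the filtering step would need to be streamed and distributed rather than run over an in-memory list.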

Harnessing Processing Power
Processing these vast datasets requires immense computing resources. By leveraging clusters such as NSCC and LUMI (20,000+ GPU cores), the team processed the data in days, making feasible tasks that were previously impractical.

Impact and Applications
Culturally aware NLP models are critical for serving Southeast Asia, enabling more accurate and culturally sensitive AI solutions. They stand to benefit industries that rely on cultural understanding, from education to customer service.

Future Plans
The team plans to extend the technology to handle multi-round dialogues, improve evaluation set construction, and enhance multilingual capabilities. A multimodal approach that integrates auditory and visual data is expected to further improve the models' understanding of local nuances.

Awards
This work received the Best Paper Award at the C3NLP Workshop (ACL 2024, Bangkok) for “CRAFT: Extracting and Tuning Cultural Instructions from the Wild.”


Read more about this paper here: https://aclanthology.org/2024.c3nlp-1.4/

Bin Wang, Geyu Lin, Zhengyuan Liu, Chengwei Wei, Nancy F. Chen, Liu Tianchi, Qiongqiong Wang, Liu Xiaochen, Zou Bowei, and Aw Ai Ti