Singapore builds AI model to ‘represent’ Southeast Asia

What is the context?

A Singapore government-backed project to build a large language model incorporating Southeast Asian data has attracted fans and critics

  • A Singapore government-backed project is building a large language model trained on Southeast Asian data
  • SEA-LION is meant to better represent the region’s languages and culture
  • Critics say regional LLMs may exclude dissenting voices and perpetuate misleading narratives

SINGAPORE – Like millions around the world, Southeast Asians have been trying out large language models such as Meta’s Llama 2 and Mistral AI, but in their native Bahasa Indonesia or Thai. The result has usually been gibberish in English.

Technology experts warn that this leaves them at a disadvantage, as generative AI transforms education, work and governance around the world.

A Singapore government-led initiative aims to redress the imbalance with a Southeast Asian LLM, the first in a family of models called SEA-LION (Southeast Asian Languages in One Network) that is trained on the region’s languages and cultural norms.

This open-source model, which was trained on data in 11 Southeast Asian languages, including Vietnamese, Thai and Bahasa Indonesia, is a cheaper and more efficient option for businesses, governments and academia in the region, said Leslie Teo of AI Singapore.

“Do we want to force everyone in Southeast Asia to adapt to the machine, or do we want to make it more accessible so that people in the region can take full advantage of the technology without having to be English speakers?” he said.

“We’re not trying to compete with the big LLMs; we’re trying to complement them, so we’re better represented,” Teo, senior AI product manager, told Context.

There are more than 7,000 languages spoken around the world. However, LLMs, including OpenAI’s GPT-4 and Meta’s Llama 2, which are used to build AI systems such as chatbots and other tools, are largely developed for and trained on English.

Governments and technology companies are trying to fill this gap, with India creating datasets in local languages, an LLM in the United Arab Emirates powering generative AI tools in Arabic, and AI models in China, Japan and Vietnam built in local languages.

These models could help local residents participate more equitably in a global AI economy that is largely dominated by big tech companies, said Nuurrianti Jalli, an assistant professor in Oklahoma State University’s College of Communication.

“Regional LLMs are also needed because they support self-reliance in technology,” she said. “Less reliance on Western LLMs can provide better privacy for local residents and is better aligned with the national or regional interest.”

Verification and filtering

Multilingual language models, trained on text from several languages simultaneously, can infer semantic and syntactic connections between high-resource languages, which have more data, and low-resource languages, researchers say.

These models can be used in a variety of applications, from translation to customer service chatbots, to content moderation on social media platforms that have struggled to identify hate speech in low-resource languages like Burmese or Amharic.

About 13% of SEA-LION’s data is sourced from Southeast Asian languages, more than any other major LLM, Teo said. More than 9% of its data comes from Chinese text, and about 63% from English text.

Multilingual language models are often trained on translated text and other poor-quality data that may contain errors, so AI Singapore is “cautious” about the data used to train SEA-LION, Teo said in his office at the National University of Singapore.

“The era of original data has passed – a lot of stuff on the internet now is LLM-generated material, so we need to verify and filter,” he said.

He added: “We cannot be perfect, but we also cannot get rid of everything that we consider bad.”

More governments are contributing data, and companies are testing SEA-LION, which can be deployed faster because of its smaller size and is cheaper to set up and certify, Teo said.

The majority of its customer interactions are in Bahasa Indonesia, so models “with this local fluency will enhance our ability to communicate with customers and improve their experiences,” said Paul Kondelis, associate vice president of data science at Indonesian e-commerce company Tokopedia.

Bias in the data

As more countries and regions build their own LLMs, digital and human rights experts worry that the models will only reproduce the dominant views expressed online, which can be particularly problematic in countries with authoritarian governments or strict media censorship, or in those that lack a strong civil society.

For example, Chinese social media platforms censor references to the Tiananmen Square uprising and criticism of the government, while many Southeast Asian countries have enacted laws to limit content that authorities consider misleading.

“Training models on such data risks perpetuating biased, prejudiced, incomplete and even misleading narratives,” Jalli said.

“Models may fail to highlight important social and political issues such as human rights violations, corruption, or valid criticism of political forces,” she said.

In response to a query about former Indonesian President Suharto, for example, Llama 2 and GPT-4 cited his spotty human rights record, while SEA-LION’s response focused largely on his achievements.

If a model is trained only on positive articles about a government, it is more likely to adopt a worldview in which the government is wholly positive and to leave out opposing views, said Alia Bhatia, a policy analyst at the Center for Democracy and Technology, an American non-profit organization.

“Regional LLMs may better reflect the linguistic and cultural differences of speakers of the local language, but they may also have less information about the world in general,” she added.

“There is a real danger that government-backed models will inculcate a reactionary view of history and undermine democratic values.”

But the alternative – relying entirely on Western LLMs that are subject to “disproportionately large influences” from wealthy Western liberal democracies – means perpetuating various biases related to cultural values, political beliefs and social norms, according to AI Singapore.

“These LLMs have a very particular West Coast American bias – they’re very woke. They don’t represent us,” Teo said.

“We’re not saying our perspective is the only perspective – we’re just trying to rebalance it.”

(Reporting by Reena Chandran. Editing by Zoe Tabari.)
