How difficult is it to create a virtual assistant that can communicate in low-resource languages like Vietnamese or Swahili?
One of the best aspects of today’s hyperconnected world is that the fruits of technological innovation can spread across the globe. There is no reason a technological breakthrough made in the Western world cannot also be replicated to help societies in Africa or Asia.
Globalisation and cross-border trade have also given companies access to new geographical markets where people speak different languages. As Facebook and WhatsApp expand into Asia, they stand to gain more customers if they cater to languages such as Hindi, Telugu, Mandarin, Tagalog, Malay, and Vietnamese. Similarly, TikTok, the Chinese video-sharing platform, has expanded into Europe and the Americas to capture new markets.
This same democratisation of technology should also apply to artificial intelligence applications. However, there is one difference: AI models require training, and the availability of training data often dictates where, and by whom, the technology can be used with satisfactory results.
Conversational AI solutions illustrate this issue starkly. Virtual assistants have, since their early days, been deployed mainly on English-language websites, and they are trained on whatever data is most readily available. There are more than 7,000 known languages spoken in the world today, and they are not evenly distributed among the global population. English and Chinese are spoken by the greatest numbers of people, which makes them the most documented languages and the ones with the most publicly available data sets for training.
What are Low-resource Languages?
English and Chinese aside, there are many languages that are native to a sizeable number of people but do not have substantial data sets for training an AI model. According to Statista, English and Chinese together account for 45.3% of internet users. Languages like Vietnamese, Swahili, Hindi, Thai, Urdu, and Bengali are spoken by large populations but are far less represented on the internet.
Low-resource languages are those that have relatively less data available for training conversational AI systems.
In contrast, English, Chinese, Spanish, French, Japanese, and other widely documented languages are high-resource. There is already a vast corpus of data in these languages that can be tapped for training AI models.
It is also common to encounter what are known as mixed or creole languages, which blend elements of several languages. For example, Singlish in Singapore mixes English with elements drawn from Chinese, Malay, Tamil, and other dialects. Manglish in neighbouring Malaysia is similar to Singlish but has its own distinct features.
To train virtual assistants, data from multiple sources is used, including books, scientific papers, and dictionaries, as well as specific corpora for machine translation and various kinds of annotated text. For the low-resource and mixed languages mentioned above, training becomes difficult because these resources are scarce.
Yet there are still ways to train NLP algorithms in low-resource languages, where the amount of data and documented knowledge of the language fall short of the usual standards: 1) the traditional approach of translating the data into a high-resource language; and 2) approaches that apply transfer learning from large pretrained models and fine-tune them on the low-resource language.
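As a rough illustration of the first approach, the Python sketch below routes a low-resource-language message through machine translation and then reuses an existing English intent classifier. The helper functions are hypothetical stand-ins for whatever translation service and English NLU model a team already has; they are stubbed out here purely to show the shape of the pipeline.

```python
# Sketch of approach 1: translate into a high-resource language, then reuse
# an existing English NLU pipeline. The two helpers below are hypothetical
# stand-ins for a real machine-translation service and a real English
# intent classifier.

def translate_to_english(text: str, source_lang: str) -> str:
    # Placeholder: in practice, call a machine-translation API or model here.
    demo = {"Tôi muốn đặt lịch hẹn": "I want to book an appointment"}
    return demo.get(text, text)

def classify_english_intent(text: str) -> str:
    # Placeholder: in practice, run your existing English intent model here.
    return "book_appointment" if "appointment" in text.lower() else "unknown"

def handle_message(text: str, source_lang: str) -> str:
    """Route a low-resource-language message through translation + English NLU."""
    english_text = translate_to_english(text, source_lang=source_lang)
    return classify_english_intent(english_text)

print(handle_message("Tôi muốn đặt lịch hẹn", "vi"))  # -> book_appointment
```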
Best Practices for Multi-language Conversational AI
The good news is that businesses can still find a way to make conversational AI solutions work in low-resource languages. Ideally, they should work with a vendor who can support them through the following best practices.
Expand the Available Data
The most straightforward approach to tackling low-resource languages is to address the data shortage directly by compiling text in the language during the data collection phase. This can achieve useful results, but it requires extensive preparatory work, and it is recommended to involve an expert in the language at this stage (a simple corpus-compilation sketch follows the list below). Several international projects are already building data sets for low-resource languages:
- An Crúbadán – Corpus Building for Minority Languages
- The Human Language Project
- The Leipzig Corpora Collection
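As one possible starting point for such data collection, the sketch below compiles a raw corpus from gathered text files and keeps only the lines identified as the target language. The folder layout and the choice of the open-source langdetect package are illustrative assumptions, and a language expert should still review the resulting corpus.

```python
# Sketch: compile a raw training corpus for a low-resource language (here
# Vietnamese, ISO code "vi") by filtering collected text files with a
# language-identification library. Paths and the target language are
# illustrative assumptions.
from pathlib import Path
from langdetect import detect  # pip install langdetect

TARGET_LANG = "vi"
SOURCE_DIR = Path("collected_texts")   # folder of gathered .txt documents
OUTPUT_FILE = Path("corpus_vi.txt")

kept = 0
with OUTPUT_FILE.open("w", encoding="utf-8") as out:
    for path in sorted(SOURCE_DIR.glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if len(line) < 20:          # skip fragments too short to identify reliably
                continue
            try:
                if detect(line) == TARGET_LANG:
                    out.write(line + "\n")
                    kept += 1
            except Exception:           # langdetect raises on text it cannot identify
                continue

print(f"Kept {kept} lines of {TARGET_LANG} text in {OUTPUT_FILE}")
```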
Explore Cross-lingual Transfer Learning Models
There are also NLP models being developed that employ what is called cross-lingual transfer learning. This approach rests on the idea that languages share certain fundamental structural similarities, which can be exploited to build models for low-resource languages from high-resource ones.
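To make the idea concrete, here is a minimal sketch that assumes the open-source sentence-transformers library and its multilingual model paraphrase-multilingual-MiniLM-L12-v2. An intent classifier is trained only on English examples, yet because the encoder maps different languages into a shared embedding space, it can be applied to Vietnamese queries it has never seen. The intents and sentences are placeholders.

```python
# Sketch of cross-lingual transfer: train an intent classifier on English
# examples only, then apply it to a low-resource language via a shared
# multilingual embedding space. Model name, intents, and sentences are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# High-resource (English) training data.
train_texts = [
    "I want to book an appointment",
    "Please cancel my booking",
    "What are your opening hours?",
]
train_labels = ["book", "cancel", "hours"]

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(train_texts), train_labels)

# Low-resource (Vietnamese) queries, never seen during training.
queries = ["Tôi muốn đặt lịch hẹn", "Mấy giờ các bạn mở cửa?"]
print(clf.predict(encoder.encode(queries)))  # expected: ['book', 'hours']
```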
Some vendors of conversational AI solutions have specialised expertise in these areas. They may take it upon themselves to gather a language corpus over time and train models in the low-resource languages.
Supplement with More Examples
The publicly available data sets may still not be enough to train virtual assistants satisfactorily. In these cases, look for more examples in the same language that can be used to scale the training process. These could take the form of product and marketing collateral, PDF documents, or chat logs.
If such documented sources are unavailable, involve a customer-facing team. Through interviews or data from support channels such as phone calls, they can supply examples of real-world queries asked by users in the language.
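For instance, exported chat logs can be turned into labelled training utterances. The sketch below assumes a simple CSV export with utterance and intent columns; the file and column names are hypothetical and will differ from one support system to another.

```python
# Sketch: turn an exported chat log into deduplicated training examples.
# The CSV layout ("utterance", "intent") and the file names are assumptions;
# adapt them to whatever your support channel actually exports.
import csv
import json

examples = {}
with open("chat_log_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        utterance = row["utterance"].strip()
        if utterance and utterance not in examples:   # drop blanks and duplicates
            examples[utterance] = row["intent"]

with open("training_examples.json", "w", encoding="utf-8") as f:
    json.dump([{"text": t, "intent": i} for t, i in examples.items()],
              f, ensure_ascii=False, indent=2)

print(f"Wrote {len(examples)} unique training examples")
```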
Look for Multi-lingual Virtual Assistant Vendors
It is not good practice to run disparate conversational AI systems, each handling a single language. For example, if you partner with one vendor for English-language markets, another for Vietnamese, and yet another for Thai, the results will be less than ideal. Look for a solution that works across multiple languages and stick with it.
Keep the Responses Consistent Across Languages
Giving your virtual assistants the ability to communicate with users across languages is a good first step, but it is not enough. The responses given in these different languages must also be consistent and accurate. Especially in high-impact scenarios like healthcare, you cannot afford to have responses in one language differ from those in another.
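One way to enforce this is to keep a single response catalogue keyed by intent, with one entry per supported language, and to check it automatically for gaps. The sketch below is a minimal illustration; the intents, languages, and wording are placeholders.

```python
# Sketch: a single source of truth for responses, keyed by intent and
# language, plus a check that no language is missing an answer.
# Intents, languages, and wording are illustrative placeholders.
SUPPORTED_LANGS = ["en", "vi", "th"]

RESPONSES = {
    "opening_hours": {
        "en": "We are open from 9am to 6pm, Monday to Friday.",
        "vi": "Chúng tôi mở cửa từ 9 giờ sáng đến 6 giờ chiều, thứ Hai đến thứ Sáu.",
    },
    "book_appointment": {
        "en": "Sure, let me help you book an appointment.",
    },
}

def missing_translations(responses, langs):
    """Return (intent, language) pairs that have no response yet."""
    return [(intent, lang) for intent, by_lang in responses.items()
            for lang in langs if lang not in by_lang]

# Flags every gap before it reaches users, e.g. the missing Thai answers above.
print(missing_translations(RESPONSES, SUPPORTED_LANGS))
```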
Some Flexibility for Regional Differences is Good
Even among regions that speak the same low-resource language, there can be differences. These could be geographical or cultural, or they could stem from legal and regulatory requirements specific to a region. For example, even though Malay may be the language preferred by some bank customers in both Malaysia and Singapore, the responses could vary because of differences in legal and regulatory guidelines between the two countries.
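Such regional variation can be handled as overrides on top of the shared catalogue, falling back to the language-wide default where no override exists. The sketch below is again illustrative; the markets and wording are assumptions, and the responses are shown in English for readability.

```python
# Sketch: regional overrides on top of a shared language default.
# Lookups fall back from (language, region) to the language-wide default.
# Markets and wording are illustrative; responses are shown in English for
# readability, but in production they would be written in Malay.
DEFAULT_RESPONSES = {
    ("ms", None): "You can change your card limit in the app.",
}
REGIONAL_OVERRIDES = {
    ("ms", "SG"): "You can change your card limit in the app, subject to MAS guidelines.",
    ("ms", "MY"): "You can change your card limit in the app, subject to BNM guidelines.",
}

def get_response(lang: str, region: str) -> str:
    """Prefer a region-specific answer; otherwise fall back to the language default."""
    return REGIONAL_OVERRIDES.get((lang, region), DEFAULT_RESPONSES[(lang, None)])

print(get_response("ms", "SG"))  # Singapore-specific wording
print(get_response("ms", "BN"))  # no override -> shared Malay default
```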
To find out how KeyReply can help you implement a multi-language conversational AI, get in touch with our experts.