Language models will now be trained in European languages on one of the world's largest text collections

16.11.2022

Translation services such as Google Translate and virtual assistants such as Apple's Siri are technologies many of us use daily.

These services are based on advanced language models, often presented as artificial intelligence, with names such as BERT or GPT, that have been trained using machine learning.


The language models are owned by a few American and Chinese technology companies. Not only does this give considerable market power to certain actors, it also leads to biases in which languages the models are trained on. Commercially less important languages often have weaker models, or no custom models at all.

Open language models that support all European languages

High-Performance Language Technologies (HPLT) is a project that will challenge the current near-monopoly, in which a few large technology companies are behind the world-leading services. HPLT will focus on multilingualism and develop training material and language models that support European languages.

HPLT is led by Charles University in the Czech Republic and is a collaboration between five universities (Oslo in Norway, Edinburgh in Scotland, Prague in the Czech Republic, Helsinki and Turku in Finland), two providers of high-capacity services (Norwegian Sigma2 and Czech Cesnet), and a private company (Spanish Prompsit). HPLT has received funding from the EU's Horizon Europe program to develop language models for deep learning and machine translation tools at scale, with support for all official European languages and many more. The result will be open, downloadable, high-quality models. Part of the motivation behind the project is also to preserve European languages.

— By training language models in all major European languages, the HPLT project is going to change the situation completely. Many will benefit from this, especially researchers outside the large companies, and start-ups that can develop new services using the language models. This can of course also include further development of virtual assistants such as Siri, but this time built with transparent and publicly available technology under the hood, says Andrey Kutuzov at the Department of Informatics at the University of Oslo, one of the researchers behind the project.

Kutuzov previously played a leading role in developing NorBERT, one of the first neural language models for Norwegian Bokmål and Nynorsk, created in collaboration with the National Library of Norway.

Models are trained on a copy of the Internet

Modern language technology is impossible without training and fine-tuning large deep-learning models. The models are trained using neural networks, a rough simplification of the human brain. Deep learning is a method within machine learning in which a neural network is trained to solve advanced tasks on its own. Training neural language models requires large amounts of data and enormous parallel computing power, typically from dozens or hundreds of graphics processing units (GPUs).
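
To make the idea concrete, the sketch below trains a toy character-level language model with simple next-token prediction. It assumes the open-source PyTorch library, and the tiny LSTM stands in for the far larger transformer models, such as BERT or GPT, mentioned above; it is an illustration, not the project's code.

```python
# A minimal sketch of next-token language-model training (assumes PyTorch).
# Real models train transformer architectures on billions of words across
# many GPUs; this toy LSTM only illustrates the same basic training loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = "open language models for all european languages "
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        hidden, _ = self.rnn(self.embed(x))
        return self.head(hidden)

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Each position predicts the next character in the sequence.
x, y = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)
for step in range(200):
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, len(vocab)), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.3f}")
```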

To carry out the project, the HPLT researchers will use text data from the Internet Archive, which is perhaps best known for its iconic Wayback Machine. The Internet Archive contains an enormous number of web pages in various languages, a collection that easily surpasses most of the data sets used to train recent language models.

The researchers will download the most relevant data from European domains and establish copies in Norway and the Czech Republic. The websites are then cleaned and the texts extracted for use in training the language models. We are talking about billions of words of text. The project aims to build the largest open text collection ever assembled for languages other than English.
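
As an illustration of what cleaning and text extraction can look like, here is a hedged sketch that reads HTML pages out of a WARC file (the container format used by web archives such as the Internet Archive) and strips the markup. It assumes the open-source warcio and BeautifulSoup libraries; the file name and the filtering rules are illustrative, not the project's actual pipeline.

```python
# Sketch: pull plain text out of archived web pages in a WARC file.
# Assumes the open-source warcio and beautifulsoup4 packages.
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def extract_texts(warc_path):
    """Yield (url, plain_text) pairs for HTML responses in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in ctype:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()
            # Strip markup; a real pipeline would also detect the language,
            # deduplicate, and filter boilerplate before training.
            text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
            if text:
                yield url, text

# Example usage (hypothetical file name):
# for url, text in extract_texts("example.warc.gz"):
#     print(url, text[:80])
```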

Fast and secure data transfer to Europe's fastest supercomputer

In terms of size, this is about seven petabytes of data, roughly the storage capacity of 1.5 million standard DVDs. Storing the Internet Archive on DVDs is, of course, out of the question. Instead, the seven petabytes of raw data will be stored on the new national storage infrastructure, NIRD, which is owned by Sigma2 and operated by NRIS (Norwegian Research Infrastructure Services).
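
The DVD comparison is easy to check with a back-of-the-envelope calculation, assuming decimal petabytes and standard 4.7 GB single-layer discs:

```python
# Rough check of the storage comparison above (assumed decimal units
# and a standard 4.7 GB single-layer DVD).
bytes_total = 7 * 10**15           # seven petabytes
dvd_bytes = 4.7 * 10**9            # one single-layer DVD
print(f"{bytes_total / dvd_bytes:,.0f} DVDs")  # ~1,489,362
```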

It is no easy task to transfer and store such large amounts of data. Data capacity and transmission speed between data clusters in Norway and abroad, including the Internet Archive's data centre in California, are essential to carrying out the research. In Norway, we have the Research Network, operated and developed by Sikt. This high-capacity network is connected to international research networks so that data is transferred quickly and securely between the national systems and Europe's fastest supercomputer, LUMI in Finland, where the language-model training will be carried out. Norway is part owner of LUMI through Sigma2, and both Norwegian and Finnish language technology researchers are already among LUMI's pilot users.

Before the training starts, the text data must be cleaned and pre-processed. Together with Czech Cesnet, Sigma2 will provide local CPU power to ensure robust downloading, storage and pre-processing of the web archive data.

— This is a critical part of the project, required both to improve the quality of the training sets and to significantly reduce the amount of data that is copied to storage on LUMI. It is sub-optimal and expensive to store large amounts of data on the high-performance storage of computing facilities such as LUMI, so it is important that the data sets are processed in advance, says Lorand Szentannai, senior advisor at Sigma2.
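
A minimal sketch of the kind of pre-processing described here might look like the following: normalizing whitespace, dropping very short fragments, and removing exact duplicates before the data is copied onward to LUMI. The length threshold and the hash-based deduplication are illustrative assumptions, not the project's actual rules.

```python
# Illustrative corpus cleaning: whitespace normalization, a minimum-length
# filter, and exact-duplicate removal via content hashing (assumptions only).
import hashlib

def clean_corpus(lines, min_words=5):
    """Yield each sufficiently long text once, dropping exact duplicates."""
    seen = set()
    for line in lines:
        text = " ".join(line.split())       # normalize whitespace
        if len(text.split()) < min_words:   # drop very short fragments
            continue
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:                  # skip exact duplicates
            continue
        seen.add(digest)
        yield text

docs = [
    "Hello   world, this is a test sentence.",
    "Hello world, this is a test sentence.",
    "too short",
]
print(list(clean_corpus(docs)))  # only one document survives
```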

AI resources are becoming increasingly important

The supercomputer LUMI's enormous computing capacity comes primarily from a large number of GPUs, which are very well suited to research involving artificial intelligence, especially deep learning.

— We see that the demand for AI resources from both academia and industry is constantly increasing, and as a national supplier we must offer world-class computing and storage resources. LUMI will be key to enabling research breakthroughs in fields driven by large-scale computation and data processing, says Gunnar Bøe, Managing Director of Sigma2.

Language technology is one of the newer scientific domains to use high-performance computing (HPC) in research. HPC has traditionally been dominated by the natural sciences, but constant developments in technology and digital working methods mean that more and more disciplines depend on processing and analyzing big data in their research.

Both Norway and the Czech Republic are part of the consortium of ten countries that together run the LUMI supercomputer, owned by the EuroHPC Joint Undertaking (EuroHPC JU), a European joint initiative that ensures European researchers access to world-class supercomputers.