2022 was a whirlwind year for Natural Language Processing (NLP), the sub-discipline of Computer Science that enables data systems to "make sense" of human language. NLP has a long and rich tradition in Norway, including at the universities in Bergen, Oslo, and Tromsø and at NTNU, where research and development for a niche language like Norwegian – in both its written variants, Bokmål and Nynorsk – naturally takes special focus.
Current NLP is dominated by applications of deep learning to vast volumes of textual training data, often digital texts extracted from online content and libraries. So-called large language models (LLMs) – with names like BERT, T5, and GPT – have been at the core of much NLP since 2019, in high-profile applications such as automated translation and virtual assistants, as well as in more mundane ones like content recommendation and writing aids.
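To make this concrete, the short sketch below shows a BERT-style model at work on a writing-aid-like task: predicting a word that has been masked out. It is a minimal example assuming the Hugging Face transformers library; the multilingual checkpoint bert-base-multilingual-cased is chosen purely for illustration.

```python
# Minimal sketch: a masked language model predicts a blanked-out word.
# Assumes the Hugging Face transformers library is installed; the
# checkpoint "bert-base-multilingual-cased" is illustrative only.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# The model ranks candidate words for the [MASK] position.
for candidate in fill("Oslo is the [MASK] of Norway."):
    print(candidate["token_str"], round(candidate["score"], 3))
```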
Large-scale training
In a nutshell, these models are very large artificial neural networks with around 1 billion to hundreds of billions of parameters – scalar values that determine the flow of information through the network. These parameters are learned from natural language training data, and LLM training, evaluation, and deployment all call for large-scale GPU compute resources. Training a comparatively small model like the Norwegian BERT requires about 10,000 GPU hours, while the compute costs for a massive model like ChatGPT remain undisclosed.
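For a sense of what these parameter counts mean in practice, the sketch below loads a pretrained Norwegian BERT and counts its learned parameters. It assumes the Hugging Face transformers library, and the model identifier ltg/norbert2 (a Norwegian BERT from the UiO Language Technology Group) is an assumption chosen for illustration; any comparable checkpoint would do.

```python
# Minimal sketch: inspecting the scale of a pretrained LLM.
# The model identifier "ltg/norbert2" is assumed for illustration.
from transformers import AutoModel

model = AutoModel.from_pretrained("ltg/norbert2")

# Each parameter is one learned scalar; summing the element counts of
# all parameter tensors gives the total model size.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f} million parameters")
```

As a rough illustration of the training budget, 10,000 GPU hours spread across, say, 128 GPUs would correspond to roughly three days of wall-clock time.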
Access to the Internet Archive
In 2022, the University of Oslo (UiO) entered into a partnership with the Norwegian National Library on research access to both digitized and born-digital Norwegian-language content, aiming to prepare a more general framework for research usage of digital library data. In parallel, UiO established a collaboration with the Internet Archive – best known for its iconic Wayback Machine – under the auspices of a new Horizon Europe project, with partners in the Czech Republic, Finland, Spain, and the UK.
Tens of billions of words
In total, 7 petabytes of Internet data and tens of billions of words of library content will be prepared for NLP training on the new Norwegian Infrastructure for Research Data (NIRD) in the Lefdal Mine Datacenter, as well as on a parallel facility in the Czech Republic. Professor and Project Leader Stephan Oepen and his team were selected, together with Finnish NLP teams, to participate in the LUMI-G burn-in pilot in late 2022.
The supercomputer LUMI's enormous computing capacity comes primarily from its many graphics processing units (GPUs), which are especially well suited to research involving artificial intelligence (AI) and deep learning. During the pilot, Oepen's team successfully ported and optimized training code for multiple LLM architectures to the AMD ecosystem and trained dozens of new LLMs for Finnish and Norwegian.
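In practice, porting PyTorch training code to AMD hardware can be less invasive than it sounds: ROCm builds of PyTorch expose AMD GPUs through the familiar torch.cuda interface, so device-agnostic code runs largely unchanged. The sketch below illustrates this generic pattern with a single transformer layer; it is not the project's actual training code.

```python
# Generic device-agnostic training step, not the project's own code.
# On a ROCm build of PyTorch, torch.cuda.is_available() also reports
# AMD GPUs, so the same script runs on NVIDIA and AMD systems alike.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tiny stand-in for an LLM: one transformer encoder layer.
layer = torch.nn.TransformerEncoderLayer(
    d_model=768, nhead=12, batch_first=True
).to(device)
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-4)

# One illustrative optimization step on random data.
batch = torch.randn(8, 128, 768, device=device)  # (batch, seq, hidden)
loss = layer(batch).pow(2).mean()  # dummy loss for demonstration
loss.backward()
optimizer.step()
```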
Repository for 60 European languages
These models will become part of an emerging European repository of open language data and pre-trained LLMs for at least 60 languages. Research access to massive natural language training data outside of select corporate environments, combined with the unprecedented GPU compute capacity of LUMI, is expected to help “level the playing field” for university-based NLP research.
About Natural Language Processing
Natural Language Processing, or Language Technology, is one of the newer scientific domains that use high-performance computing (HPC) in research. HPC has traditionally been dominated by the natural sciences; however, constant developments in technology and digital working methods mean that more and more disciplines depend on processing and analyzing big data in their research.