The Mímir Project, initiated by the Norwegian government and led by the National Library of Norway (NB), evaluates the impact of copyrighted material on the performance of generative large language models (LLMs) for Norwegian. It serves as a model for integrating copyrighted content into AI development while maintaining ethical and legal standards.
Evaluating the impact of copyrighted materials on generative large language models for Norwegian languages
The project is a groundbreaking national effort conducted in collaboration with the Language Technology Group (LTG) at the University of Oslo and NorwAI at NTNU.
A cornerstone of the project is access to high-performance computing resources provided by Sigma2. Utilising GPUs on the LUMI supercomputer, one of Europe's most powerful computing infrastructures, Mímir trained and evaluated 17 LLMs with 7 billion parameters each. This computational capability enabled comprehensive experiments, including fine-tuning on various datasets, such as copyrighted newspapers and books, to assess their influence on model quality.
The impact on Norwegian LLMs and ethical AI development
The Mímir Project stands out internationally for its empirical approach to quantifying the contributions of high-quality, curated copyrighted material to LLM performance. Key findings indicate that including this material yields significant performance improvements, particularly in domains requiring factual accuracy and linguistic richness. Integrating datasets curated from NB's extensive digital collection and other sources proved essential to these outcomes.
Collaboration with NorwAI and LTG has been pivotal: both groups brought expertise in language technology, data preparation, and advanced evaluation techniques. Their joint effort has enabled the rapid development and benchmarking of Norwegian LLMs that accurately handle tasks such as sentiment analysis, translation and summarisation.
Mímir advances Norwegian LLM capabilities and offers a knowledge base for policy discussions around copyright in AI training. It provides a template for balancing technological innovation with fair compensation for content creators, fostering a sustainable and ethical AI development ecosystem. In the future, Norwegian public language models are expected to be published and maintained by NB, with computational resources and data storage provided by Sigma2, and in close collaboration with NorwAI and LTG.
About the Mímir project
The Mímir Project is an initiative by the Norwegian government aimed at evaluating the significance and impact of copyrighted material on the development and performance of generative large language models for the Norwegian language and context.
This effort involves three leading institutions: the National Library of Norway (NB), the University of Oslo (UiO/LTG), and the Norwegian University of Science and Technology (NTNU/NorwAI).