
EuroLLM-9B Aims to Improve State of the Art LLM Support for European Languages

EuroLLM-9B is an open-source large language model built in Europe and tailored to European languages, covering all the official EU languages as well as 11 additional languages that, while not official, are commercially important. According to the team behind it, its performance makes it one of the best European-made LLMs of its size.

EuroLLM-9B is the second LLM created within the EuroLLM initiative, coming a few months after the smaller EuroLLM-1.7B.

A key component behind EuroLLM-9B's stronger performance on European languages is its tokenizer, built on a vocabulary of 128,000 word pieces drawn from European languages. The model was pre-trained on approximately 4 trillion tokens using the GPU infrastructure of the Barcelona-based MareNostrum 5 supercomputer.
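To illustrate what building a word-piece vocabulary involves, the following toy sketch trains a tiny BPE tokenizer on a few European-language sentences using the Hugging Face tokenizers library. This is only an illustration of the general technique; the corpus, vocabulary size, and settings here are made up for the example and are not EuroLLM's actual training pipeline or data:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy multilingual corpus (illustrative only)
corpus = [
    "Obrigado pela sua ajuda",   # Portuguese
    "Merci pour votre aide",     # French
    "Danke für deine Hilfe",     # German
    "Grazie per il tuo aiuto",   # Italian
]

# Train a small byte-pair-encoding tokenizer over the corpus
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# Words seen during training split into few (often single) word pieces
enc = tokenizer.encode("Merci pour votre aide")
print(enc.tokens)
```

The same principle, applied at the scale of a 128,000-piece vocabulary, keeps the token count per sentence low across many languages, which is what makes a multilingual vocabulary effective.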

In the post-training phase, the EuroLLM team fine-tuned the model on publicly available datasets to make it capable of handling multi-turn conversations and following instructions. One of the team's goals was to demonstrate the model's suitability for fine-tuning for specific use cases.

According to the team, the model excels at translating texts across all supported languages, a task in which it outperforms Gemma-2-9B-IT and Aya-Expanse-8B.

To assess the model's performance, the team ran benchmarks both in English and in EU languages. Unsurprisingly, for European languages EuroLLM outperforms both European models, such as Mistral-7B and Salamandra-7B, and non-European models, including Llama-3.1-8B and Qwen-2.5-7B, with Gemma-2-9B achieving comparable results. On English benchmarks, EuroLLM-9B shows good performance, on a par with Mistral-7B.

As expected, a 9B model cannot match the performance of a 70B model. However, its scores come remarkably close to those of larger models, especially when using a beam size of 4.

The model is available on Hugging Face, where you can run it as shown in the following snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-9B-Instruct"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are EuroLLM --- an AI assistant specialized in European languages that provides safe, educational and helpful answers.",
    },
    {
        "role": "user",
        "content": "What is the capital of Portugal? How would you describe it?",
    },
]

# Render the conversation with the model's chat template, then generate a reply
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

As several Reddit users point out, the need for open-source models tailored to European languages is real, since even larger models like Llama 3.3 70B may perform unsatisfactorily on them, not to mention the cost of fine-tuning models of that size.

The EuroLLM team is already at work on a larger version of the model, to make it more competitive with larger models, but has not clarified when it could become available.
