The decision by Andrej Karpathy to open-source llama2.c, his roughly 700-line inference engine, demystified how developers can interact with LLMs. The public repository's popularity took off, gathering thousands of stars, forks and ports to other languages. Even before that, JLama had started its journey towards becoming the first pure Java inference engine for Hugging Face models, from Gemma to Mixtral. The implementation leverages the Vector API and its PanamaTensorOperations class, with a native fallback, and is available in Maven Central.
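Most of the inference work reduces to vector and matrix multiplications, which is exactly what the Vector API accelerates with SIMD instructions. As an illustration of the mechanism (a minimal sketch, not JLama's actual code), a dot product written against the incubating jdk.incubator.vector module could look like this:

// Illustrative sketch of a SIMD dot product with the incubating Vector API.
// Run with: java --add-modules jdk.incubator.vector DotProduct.java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Process SPECIES.length() floats per iteration via fused multiply-add
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for elements that do not fill a whole vector
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}

With SPECIES_PREFERRED, the JVM picks the widest vector shape the CPU supports, for example eight float lanes per instruction on AVX2 hardware.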
The library was developed by Jake Luciani, chief software architect at DataStax, and is currently the only Java inference library available in the Maven Central repository. Other alternatives exist, such as bindings to native implementations or ports to JVM languages like Java (e.g. llama2j, llama2.java or llama3.java), Scala or Kotlin.
Released under the Apache License, the project was built using Java 21 and the new Vector API, promising faster inference. To get started, developers can download models from Hugging Face using the run-cli.sh script:
$ ./run-cli.sh download gpt2-medium
$ ./run-cli.sh download -t XXXXXXXX meta-llama/Llama-2-7b-chat-hf
$ ./run-cli.sh download intfloat/e5-small-v2
Afterwards, developers can chat with the model or complete a prompt:
$ ./run-cli.sh complete -p "The best part of waking up is " -t 0.7 -tc 16 -q Q4 -wq I8 models/Llama-2-7b-chat-hf
$ ./run-cli.sh chat -p "Tell me a joke about cats." -t 0.7 -tc 16 -q Q4 -wq I8 models/Llama-2-7b-chat-hf
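Because the library ships as a regular Maven artifact, the same completion can also be run programmatically. The sketch below follows the ModelSupport.loadModel and generate calls documented in the project's README; the exact generate overload (assumed here to take a session id, prompt, temperature, token count, an end-of-sequence flag and a token consumer) may differ between JLama versions:

// Minimal sketch of embedding JLama in an application; API names follow the
// project's README and may vary between versions.
import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.safetensors.DType;

import java.io.File;
import java.util.UUID;

public class CompleteExample {
    public static void main(String[] args) {
        // Load the model downloaded earlier, with F32 working memory
        // and I8 working quantization
        AbstractModel model = ModelSupport.loadModel(
                new File("models/Llama-2-7b-chat-hf"), DType.F32, DType.I8);

        // Complete the prompt with temperature 0.7 and up to 16 new tokens,
        // streaming each generated token to stdout (assumed overload)
        model.generate(UUID.randomUUID(), "The best part of waking up is ",
                0.7f, 16, false, (token, time) -> System.out.print(token));
    }
}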
For those who want "to just chat with a large language model", the library provides a simple web UI, which can be started using:
$ ./run-cli.sh download tjake/llama2-7b-chat-hf-jlama-Q4
$ ./run-cli.sh serve models/llama2-7b-chat-hf-jlama-Q4
After starting the application, the web UI can be accessed at http://localhost:8080/ui/index.html.
JLama implements features like distributed inference, flash attention, mixture of experts, and the Hugging Face SafeTensors model and tokenizer formats. For the moment, the Gemma, Llama and Llama2, Mistral and Mixtral, GPT-2 and BERT models are supported, together with BPE and WordPiece tokenizers. According to the project roadmap, more models will be supported in the future, together with LoRA and GraalVM support.
As pointed out by Karpathy, former AI director at Tesla and founding scientist at OpenAI, in the documentation of his repository:
You might think that you need many billion parameter LLMs to do anything useful, but in fact, very small LLMs can have surprisingly strong performance if you make the domain narrow enough. [...] The era of small large generative models is close.
The Java ecosystem offers several options for those who want to integrate LLMs into their applications: dedicated integrations for Spring or Quarkus applications, or plain libraries like JLama, which can easily be integrated into any vanilla Java application. On top of that, new applications like Devoxx Genie or extensions to Podman Desktop aim to help developers get started with LLMs more quickly.