
JLama: The First Pure Java Model Inference Engine Implemented With Vector API and Project Panama

The decision by Andrej Karpathy to open-source llama2.c, a roughly 700-line LLM inference engine written in plain C, demystified how developers can interact with LLMs. The public repository's popularity took off, counting thousands of stars, forks, and ports to other languages. Even before that, JLama had begun its journey toward becoming the first pure Java inference engine for Hugging Face models, from Gemma to Mixtral. The implementation leverages the Vector API through its PanamaTensorOperations class, falls back to native code where needed, and is available in Maven Central.
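
The Vector API gives Java code portable access to the CPU's SIMD instructions, which is where most of the speedup in the hot matrix-multiplication loops of an inference engine comes from. The following is a minimal sketch of the kind of kernel the API enables, not JLama's actual PanamaTensorOperations code: a dot product that processes a full hardware vector of floats per iteration. Because the API is still incubating, it must be compiled and run with --add-modules jdk.incubator.vector.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    // Widest vector shape the current CPU supports (e.g. 8 or 16 floats)
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Main loop: one fused multiply-add per SPECIES.length() elements
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for the leftover elements
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}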

The library was developed by Jake Luciani, chief software architect at DataStax, and is currently the only Java inference library available in the Maven Central repository. Outside of Maven Central, alternatives exist in the form of bindings to the native implementation, or ports to JVM languages such as Java (e.g. llama2j, llama2.java or llama3.java), Scala or Kotlin.

Released under the Apache License, the project is built with Java 21 and the new Vector API, which promises faster inference. To get started, developers can download models from Hugging Face using the run-cli.sh script:

    
$ ./run-cli.sh download gpt2-medium
$ ./run-cli.sh download -t XXXXXXXX meta-llama/Llama-2-7b-chat-hf
$ ./run-cli.sh download intfloat/e5-small-v2
    

Afterwards, developers can chat with the model or complete a prompt:

    
$ ./run-cli.sh complete -p "The best part of waking up is " -t 0.7 -tc 16 -q Q4 -wq I8 models/Llama-2-7b-chat-hf
$ ./run-cli.sh chat -p "Tell me a joke about cats." -t 0.7 -tc 16 -q Q4 -wq I8 models/Llama-2-7b-chat-hf
    

For those who want "to just chat with a large language model", the library provides a simple web UI, which can be started using:

    
$ ./run-cli.sh download tjake/llama2-7b-chat-hf-jlama-Q4
$ ./run-cli.sh serve models/llama2-7b-chat-hf-jlama-Q4
    

Once the application is running, the UI can be accessed at http://localhost:8080/ui/index.html.

JLama implements features such as distributed inference, flash attention, mixture of experts, and the Hugging Face SafeTensors model and tokenizer formats. At the moment, the Gemma, Llama and Llama2, Mistral and Mixtral, GPT-2 and BERT models are supported, together with BPE and WordPiece tokenizers. According to the project roadmap, more models will be supported in the future, together with LoRA and GraalVM support.
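
Mixture of experts, as used by Mixtral, replaces a single feed-forward block with several expert blocks plus a small router that decides which of them process each token. The snippet below is a schematic of that routing step only, assuming Mixtral-style top-2 selection; it is an illustration of the technique, not JLama's implementation:

import java.util.Arrays;
import java.util.Comparator;

public class MoeRouting {

    // Indices of the k largest router logits
    static int[] topK(float[] logits, int k) {
        Integer[] idx = new Integer[logits.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> -logits[i]));
        int[] chosen = new int[k];
        for (int i = 0; i < k; i++) chosen[i] = idx[i];
        return chosen;
    }

    static float[] softmax(float[] x) {
        float max = Float.NEGATIVE_INFINITY;
        for (float v : x) max = Math.max(max, v);
        float sum = 0f;
        float[] out = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = (float) Math.exp(x[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < x.length; i++) out[i] /= sum;
        return out;
    }

    public static void main(String[] args) {
        float[] routerLogits = {0.1f, 2.3f, -0.5f, 1.7f}; // one score per expert
        int[] experts = topK(routerLogits, 2);            // Mixtral routes each token to 2 of 8 experts
        float[] gateLogits = {routerLogits[experts[0]], routerLogits[experts[1]]};
        float[] weights = softmax(gateLogits);            // mixing weights for the chosen experts
        System.out.printf("experts %s, weights %s%n",
                Arrays.toString(experts), Arrays.toString(weights));
    }
}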

As pointed out by Karpathy, former AI director at Tesla and founding scientist at OpenAI, in the documentation of his repository:

You might think that you need many billion parameter LLMs to do anything useful, but in fact, very small LLMs can have surprisingly strong performance if you make the domain narrow enough.  [...] The era of small large generative models is close.

The Java ecosystem offers several alternatives for those who want to integrate LLMs into their applications: dedicated integrations for Spring or Quarkus applications, or plain libraries like JLama, which can easily be embedded in any vanilla Java application. On top of that, new applications like Devoxx Genie or extensions to Podman Desktop aim to help developers get started with LLMs more quickly.
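
As a sketch of what such embedded use could look like, the snippet below follows the shape of the example in the JLama README. The class and method names (SafeTensorSupport.maybeDownloadModel, ModelSupport.loadModel, the generate signature) are taken from the repository and may differ between versions, so treat it as an illustration rather than a stable API reference:

import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.SafeTensorSupport;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;

import java.io.File;
import java.util.UUID;

public class JlamaChat {
    public static void main(String[] args) throws Exception {
        // Download a pre-quantized model from Hugging Face (cached under ./models)
        File model = SafeTensorSupport.maybeDownloadModel("./models",
                "tjake/llama2-7b-chat-hf-jlama-Q4");

        // Load the model; the DType arguments control working-memory precision
        AbstractModel m = ModelSupport.loadModel(model, DType.F32, DType.I8);

        // Use the model's chat template when it has one, a raw prompt otherwise
        PromptContext ctx = m.promptSupport().isPresent()
                ? m.promptSupport().get().builder()
                        .addUserMessage("Tell me a joke about cats.")
                        .build()
                : PromptContext.of("Tell me a joke about cats.");

        // Temperature 0.7, at most 256 generated tokens
        Generator.Response r = m.generate(UUID.randomUUID(), ctx, 0.7f, 256);
        System.out.println(r.responseText);
    }
}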
