llr3 is a minimal CLI tool for running Llama 3 inference locally, written in Rust (2024 edition). It provides single-shot execution: send a prompt, get a response, and exit cleanly. The tool automatically downloads the Llama 3 model on first run if it is not already present, making setup seamless.
Built with llama-cpp-2 for direct in-process LLM inference and clap for argument parsing, llr3 delivers fast, deterministic responses without requiring external services or complex configurations.
```bash
cargo build --release
```

Binary location: `./target/release/llr3`
```bash
./target/release/llr3 -p "What is Rust?"
./target/release/llr3 --prompt "Explain ownership" -m ~/llama/custom-model.gguf
./target/release/llr3 -s
```

| Flag | Long | Required | Default | Description |
|---|---|---|---|---|
| -p | --prompt | No | - | The prompt to send to the model |
| -s | --serve | No | false | Start REST API server on port 8080 |
| -m | --model | No | ~/llama/llama-3.gguf | Path to GGUF model file |
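These flags map naturally onto clap's derive API. Below is a minimal sketch of what the argument definitions might look like; the struct and field names are illustrative, and only the flags, defaults, and help text come from the table above.

```rust
use clap::Parser;

/// Illustrative sketch of the llr3 CLI surface (not the actual source).
#[derive(Parser)]
struct Args {
    /// The prompt to send to the model
    #[arg(short = 'p', long)]
    prompt: Option<String>,

    /// Start REST API server on port 8080
    #[arg(short = 's', long, default_value_t = false)]
    serve: bool,

    /// Path to GGUF model file (note: clap does not expand `~` itself)
    #[arg(short = 'm', long, default_value = "~/llama/llama-3.gguf")]
    model: String,
}

fn main() {
    let args = Args::parse();
    println!("serve: {}, model: {}", args.serve, args.model);
}
```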
On first execution, llr3 automatically downloads the Llama 3 Q4_K_M model (~4.7GB) from Hugging Face and saves it to ~/llama/llama-3.gguf. Subsequent runs use the cached model.
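A sketch of what that first-run check could look like, assuming the `reqwest` crate for the download; the URL below is a placeholder, and the actual llr3 implementation may differ.

```rust
use std::{fs, io, path::Path};

/// Hypothetical first-run check: reuse the cached model if present,
/// otherwise stream the download to disk (keeps memory flat for a ~4.7GB file).
fn ensure_model(path: &Path) -> io::Result<()> {
    if path.exists() {
        return Ok(()); // cached from a previous run
    }
    if let Some(dir) = path.parent() {
        fs::create_dir_all(dir)?;
    }
    // Placeholder URL; the real Hugging Face source is not shown here.
    let mut resp = reqwest::blocking::get("https://huggingface.co/...")
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    let mut out = fs::File::create(path)?;
    io::copy(&mut resp, &mut out)?;
    Ok(())
}
```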
```bash
./target/release/llr3 -p "Hello, who are you?"
```

Expected: a clean text response from Llama 3, no debug output, and the program exits after the response.
llr3 was created to provide a straightforward, no-nonsense way to run local LLM inference without the overhead of servers, APIs, or cloud services. Key motivations:
- Privacy: All inference happens locally, no data leaves your machine
- Speed: Direct in-process execution without network latency
- Simplicity: Single binary, minimal dependencies, zero configuration
- Deterministic: Greedy sampling ensures reproducible results
- Rust 2024: Leverages latest Rust features for safety and performance
- Self-contained: Auto-downloads model, no manual setup required
Perfect for scripting, automation, quick queries, or integration into larger workflows where you need fast, local AI responses.
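For example, a larger Rust program could shell out to the binary using only the documented `-p` flag. A minimal sketch (the path assumes a release build; the `ask` helper is hypothetical):

```rust
use std::process::Command;

/// Run llr3 once with a prompt and capture its stdout.
fn ask(prompt: &str) -> std::io::Result<String> {
    let output = Command::new("./target/release/llr3")
        .args(["-p", prompt])
        .output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    let answer = ask("Summarize Rust ownership in one sentence.")?;
    println!("{answer}");
    Ok(())
}
```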
- Single-shot only: No conversation memory or multi-turn dialogue
- Context size: Fixed at 2048 tokens for input context
- Max generation: Limited to 512 tokens per response
- Model format: Only supports GGUF models
- Sampling: Greedy sampling only (deterministic, no temperature control; see the sketch below)
- No streaming: output is printed as it is generated, but there is no streaming API
- CPU-bound: Uses CPU inference via llama.cpp (no GPU acceleration configured)
- Model size: Default Q4_K_M quantization balances quality and size (~4.7GB)
- No fine-tuning: Uses pre-trained Llama 3 model as-is
For multi-turn conversations, GPU acceleration, or advanced sampling, consider using larger LLM frameworks like ollama or llama.cpp directly.
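To make the determinism point concrete: greedy sampling simply takes the argmax of the logits at every step, so the same prompt always produces the same tokens. This is a generic illustration, not the actual llama-cpp-2 calls llr3 makes.

```rust
/// Greedy sampling illustration: always pick the highest-logit token.
fn greedy_pick(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .expect("logits must be non-empty")
}

fn main() {
    let logits = [0.1_f32, 2.7, -0.3, 1.4];
    // The same logits always yield the same token, hence reproducible output.
    assert_eq!(greedy_pick(&logits), 1);
    // llr3 repeats a pick like this up to 512 times, within the
    // fixed 2048-token context described above.
    println!("picked token id {}", greedy_pick(&logits));
}
```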
Start the server using the -s flag:
```bash
cargo build --release
./target/release/llr3 -s
```

Server output:

```
Starting server on http://127.0.0.1:8080
Visit http://127.0.0.1:8080/ui for the chat interface
```
| Method | Path | Description |
|---|---|---|
| GET | /ui | Web-based chat interface |
| GET | /stream/{prompt} | Send prompt and receive response as plain text |
Visit http://127.0.0.1:8080/ui in your browser for the interactive chat interface.
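The `/stream/{prompt}` endpoint can also be called programmatically. Here is a minimal client sketch, assuming the `reqwest` and `urlencoding` crates (neither ships with llr3); the prompt travels in the URL path, so it must be percent-encoded:

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let prompt = "what is the capital of brazil?";
    // The prompt is a path segment, so spaces and punctuation must be encoded.
    let url = format!(
        "http://127.0.0.1:8080/stream/{}",
        urlencoding::encode(prompt)
    );
    let body = reqwest::blocking::get(&url)?.text()?;
    println!("{body}"); // e.g. "The capital of Brazil is Brasília."
    Ok(())
}
```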
```bash
./test-rest.sh "what is the capital of brazil?"
```

```
HTTP/1.1 200 OK
content-length: 35
content-type: text/plain
date: Sat, 24 Jan 2026 23:27:15 GMT

The capital of Brazil is Brasília.
```
```bash
./test-rest.sh "how much is 1 + 4?"
```

```
HTTP/1.1 200 OK
content-length: 16
content-type: text/plain
date: Sat, 24 Jan 2026 23:27:29 GMT

The answer is 5!
```
```bash
./test-rest.sh "hello world in java?"
```

```
HTTP/1.1 200 OK
content-length: 1228
content-type: text/plain
date: Sat, 24 Jan 2026 23:29:57 GMT
```

Response body:
A classic!
Here is the traditional "Hello, World!" program in Java:
```java
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
```
Let me explain what's going on:
* `public class HelloWorld`: This declares a public class named `HelloWorld`.
* `public static void main(String[] args)`: This is the entry point of the program, where the Java Virtual Machine (JVM) starts executing the code. The `main` method is declared as `public`, which means it can be accessed from outside the class, and `static`, which means it can be called without creating an instance of the class. The `String[] args` parameter is an array of strings that represents the command-line arguments passed to the program.
* `System.out.println("Hello, World!");`: This statement uses the `System.out` object to print the string "Hello, World!" to the console, followed by a newline character.
To compile and run this program, you can use the following steps:
1. Save the code in a file named `HelloWorld.java`.
2. Compile the code using the `javac` command: `javac HelloWorld.java`
3. Run the program using the `java` command: `java HelloWorld`
This will print "Hello, World!" to the console.

