llr3 is a minimal CLI tool for running Llama 3 inference locally, written in Rust (2024 edition). It provides single-shot execution: send a prompt, get a response, and exit cleanly. The tool automatically downloads the Llama 3 model on first run if it is not already present, making setup seamless.
Built with llama-cpp-2 for direct in-process LLM inference and clap for argument parsing, llr3 delivers fast, deterministic responses without requiring external services or complex configurations.
```bash
cargo build --release
```

Binary location: `./target/release/llr3`
```bash
./target/release/llr3 -p "What is Rust?"
./target/release/llr3 --prompt "Explain ownership" -m ~/llama/custom-model.gguf
./target/release/llr3 -s
```

| Flag | Long | Required | Default | Description |
|---|---|---|---|---|
| -p | --prompt | No | - | The prompt to send to the model |
| -s | --serve | No | false | Start REST API server on port 8080 |
| -m | --model | No | ~/llama/llama-3.gguf | Path to GGUF model file |
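These flags map naturally onto clap's derive API. Below is a minimal sketch of what the argument definitions might look like; the struct and field names are illustrative, and only the flags, defaults, and help text come from the table above.

```rust
use clap::Parser;

/// Illustrative sketch of the llr3 CLI surface (not the actual source).
#[derive(Parser)]
struct Args {
    /// The prompt to send to the model
    #[arg(short = 'p', long)]
    prompt: Option<String>,

    /// Start REST API server on port 8080
    #[arg(short = 's', long, default_value_t = false)]
    serve: bool,

    /// Path to GGUF model file (note: clap does not expand `~` itself)
    #[arg(short = 'm', long, default_value = "~/llama/llama-3.gguf")]
    model: String,
}

fn main() {
    let args = Args::parse();
    println!("serve: {}, model: {}", args.serve, args.model);
}
```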
On first execution, llr3 automatically downloads the Llama 3 Q4_K_M model (~4.7GB) from Hugging Face and saves it to ~/llama/llama-3.gguf. Subsequent runs use the cached model.
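A sketch of what that first-run check could look like, assuming the `reqwest` crate for the download; the URL below is a placeholder, and the actual llr3 implementation may differ.

```rust
use std::{fs, io, path::Path};

/// Hypothetical first-run check: reuse the cached model if present,
/// otherwise stream the download to disk (keeps memory flat for a ~4.7GB file).
fn ensure_model(path: &Path) -> io::Result<()> {
    if path.exists() {
        return Ok(()); // cached from a previous run
    }
    if let Some(dir) = path.parent() {
        fs::create_dir_all(dir)?;
    }
    // Placeholder URL; the real Hugging Face source is not shown here.
    let mut resp = reqwest::blocking::get("https://huggingface.co/...")
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    let mut out = fs::File::create(path)?;
    io::copy(&mut resp, &mut out)?;
    Ok(())
}
```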
```bash
./target/release/llr3 -p "Hello, who are you?"
```

Expected: a clean text response from Llama 3, no debug output, and the program exits after the response.
llr3 was created to provide a straightforward, no-nonsense way to run local LLM inference without the overhead of servers, APIs, or cloud services. Key motivations:
- Privacy: All inference happens locally, no data leaves your machine
- Speed: Direct in-process execution without network latency
- Simplicity: Single binary, minimal dependencies, zero configuration
- Deterministic: Greedy sampling ensures reproducible results
- Rust 2024: Leverages latest Rust features for safety and performance
- Self-contained: Auto-downloads model, no manual setup required
Perfect for scripting, automation, quick queries, or integration into larger workflows where you need fast, local AI responses.
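For example, a larger Rust program could shell out to the binary using only the documented `-p` flag. A minimal sketch (the path assumes a release build; the `ask` helper is hypothetical):

```rust
use std::process::Command;

/// Run llr3 once with a prompt and capture its stdout.
fn ask(prompt: &str) -> std::io::Result<String> {
    let output = Command::new("./target/release/llr3")
        .args(["-p", prompt])
        .output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    let answer = ask("Summarize Rust ownership in one sentence.")?;
    println!("{answer}");
    Ok(())
}
```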
- Single-shot only: No conversation memory or multi-turn dialogue
- Context size: Fixed at 2048 tokens for input context
- Max generation: Limited to 512 tokens per response
- Model format: Only supports GGUF models
- Sampling: Greedy sampling only (deterministic, no temperature control; see the sketch below)
- No streaming: output is printed as it is generated, but there is no streaming API
- CPU-bound: Uses CPU inference via llama.cpp (no GPU acceleration configured)
- Model size: Default Q4_K_M quantization balances quality and size (~4.7GB)
- No fine-tuning: Uses pre-trained Llama 3 model as-is
For multi-turn conversations, GPU acceleration, or advanced sampling, consider using larger LLM frameworks like ollama or llama.cpp directly.
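To make the determinism point concrete: greedy sampling simply takes the argmax of the logits at every step, so the same prompt always produces the same tokens. This is a generic illustration, not the actual llama-cpp-2 calls llr3 makes.

```rust
/// Greedy sampling illustration: always pick the highest-logit token.
fn greedy_pick(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, _)| i)
        .expect("logits must be non-empty")
}

fn main() {
    let logits = [0.1_f32, 2.7, -0.3, 1.4];
    // The same logits always yield the same token, hence reproducible output.
    assert_eq!(greedy_pick(&logits), 1);
    // llr3 repeats a pick like this up to 512 times, within the
    // fixed 2048-token context described above.
    println!("picked token id {}", greedy_pick(&logits));
}
```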
Start the server using the -s flag:
```bash
cargo build --release
./target/release/llr3 -s
```

Server output:

```
Starting server on http://127.0.0.1:8080
Visit http://127.0.0.1:8080/ui for the chat interface
```
| Method | Path | Description |
|---|---|---|
| GET | /ui | Web-based chat interface |
| GET | /stream/{prompt} | Send prompt and receive response as plain text |
Visit http://127.0.0.1:8080/ui in your browser for the interactive chat interface.
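The `/stream/{prompt}` endpoint can also be called programmatically. Here is a minimal client sketch, assuming the `reqwest` and `urlencoding` crates (neither ships with llr3); the prompt travels in the URL path, so it must be percent-encoded:

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let prompt = "what is the capital of brazil?";
    // The prompt is a path segment, so spaces and punctuation must be encoded.
    let url = format!(
        "http://127.0.0.1:8080/stream/{}",
        urlencoding::encode(prompt)
    );
    let body = reqwest::blocking::get(&url)?.text()?;
    println!("{body}"); // e.g. "The capital of Brazil is Brasília."
    Ok(())
}
```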
```bash
./test-rest.sh "what is the capital of brazil?"
```

```
HTTP/1.1 200 OK
content-length: 35
content-type: text/plain
date: Sat, 24 Jan 2026 23:27:15 GMT

The capital of Brazil is Brasília.
```
```bash
./test-rest.sh "how much is 1 + 4?"
```

```
HTTP/1.1 200 OK
content-length: 16
content-type: text/plain
date: Sat, 24 Jan 2026 23:27:29 GMT

The answer is 5!
```
```bash
./test-rest.sh "hello world in java?"
```

```
HTTP/1.1 200 OK
content-length: 1228
content-type: text/plain
date: Sat, 24 Jan 2026 23:29:57 GMT
```

Response body:
A classic!
Here is the traditional "Hello, World!" program in Java:
```java
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
```
Let me explain what's going on:
* `public class HelloWorld`: This declares a public class named `HelloWorld`.
* `public static void main(String[] args)`: This is the entry point of the program, where the Java Virtual Machine (JVM) starts executing the code. The `main` method is declared as `public`, which means it can be accessed from outside the class, and `static`, which means it can be called without creating an instance of the class. The `String[] args` parameter is an array of strings that represents the command-line arguments passed to the program.
* `System.out.println("Hello, World!");`: This statement uses the `System.out` object to print the string "Hello, World!" to the console, followed by a newline character.
To compile and run this program, you can use the following steps:
1. Save the code in a file named `HelloWorld.java`.
2. Compile the code using the `javac` command: `javac HelloWorld.java`
3. Run the program using the `java` command: `java HelloWorld`
This will print "Hello, World!" to the console.

