
# llr3 - Local Llama Rust 3 CLI Agent

## What is llr3?

llr3 is a minimal CLI tool for running Llama 3 inference locally using Rust 2024 edition. It provides single-shot execution: send a prompt, get a response, and exit cleanly. The tool automatically downloads the Llama 3 model on first run if not present, making setup seamless.

Built with llama-cpp-2 for direct in-process LLM inference and clap for argument parsing, llr3 delivers fast, deterministic responses without requiring external services or complex configurations.

## How to Use

### Build

```bash
cargo build --release
```

Binary location: `./target/release/llr3`

### Run

```bash
./target/release/llr3 -p "What is Rust?"
./target/release/llr3 --prompt "Explain ownership" -m ~/llama/custom-model.gguf
./target/release/llr3 -s
```

### CLI Arguments

| Flag | Long | Required | Default | Description |
|------|------|----------|---------|-------------|
| `-p` | `--prompt` | No | - | The prompt to send to the model |
| `-s` | `--serve` | No | `false` | Start the REST API server on port 8080 |
| `-m` | `--model` | No | `~/llama/llama-3.gguf` | Path to the GGUF model file |
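
A minimal sketch of how these flags could be declared with clap's derive API; the struct and field names here are illustrative, not the project's actual source:

```rust
use clap::Parser;

/// llr3 - Local Llama Rust 3 CLI Agent (illustrative sketch)
#[derive(Parser, Debug)]
#[command(name = "llr3")]
struct Args {
    /// The prompt to send to the model
    #[arg(short, long)]
    prompt: Option<String>,

    /// Start the REST API server on port 8080 (defaults to false)
    #[arg(short, long)]
    serve: bool,

    /// Path to the GGUF model file
    /// (note: clap does not expand `~`; the real tool must resolve the home dir)
    #[arg(short, long, default_value = "~/llama/llama-3.gguf")]
    model: String,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```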

### First Run

On first execution, llr3 automatically downloads the Llama 3 Q4_K_M model (~4.7GB) from Hugging Face and saves it to ~/llama/llama-3.gguf. Subsequent runs use the cached model.
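
A rough sketch of what this first-run check can look like, assuming a blocking HTTP client such as `reqwest`; the actual URL, paths, and error handling in llr3 may differ:

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Download the model if it is not already cached (illustrative sketch).
/// `url` would point at the Q4_K_M GGUF file on Hugging Face.
fn ensure_model(path: &Path, url: &str) -> io::Result<()> {
    if path.exists() {
        return Ok(()); // cached from a previous run
    }
    if let Some(dir) = path.parent() {
        std::fs::create_dir_all(dir)?; // e.g. ~/llama
    }
    let mut resp = reqwest::blocking::get(url).map_err(io::Error::other)?;
    let mut file = File::create(path)?;
    // Stream the ~4.7GB download to disk rather than buffering it in memory.
    io::copy(&mut resp, &mut file)?;
    Ok(())
}
```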

### Example

```bash
./target/release/llr3 -p "Hello, who are you?"
```

Expected: a clean text response from Llama 3 with no debug output; the program exits after responding.

## Rationale

llr3 was created to provide a straightforward, no-nonsense way to run local LLM inference without the overhead of servers, APIs, or cloud services. Key motivations:

- **Privacy**: all inference happens locally; no data leaves your machine
- **Speed**: direct in-process execution without network latency
- **Simplicity**: single binary, minimal dependencies, zero configuration
- **Deterministic**: greedy sampling ensures reproducible results (see the sketch after this list)
- **Rust 2024**: leverages the latest Rust features for safety and performance
- **Self-contained**: auto-downloads the model, no manual setup required
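
On the determinism point: greedy decoding always picks the single highest-probability token at each step, so the same prompt and model produce the same output every time. A minimal illustration of the idea in plain Rust (independent of the llama-cpp-2 API; `greedy_pick` is an illustrative name):

```rust
/// Greedy sampling: always take the argmax of the logits.
/// Given the same model state, this is fully deterministic.
fn greedy_pick(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| a.total_cmp(b))
        .map(|(i, _)| i)
        .expect("logits must be non-empty")
}

fn main() {
    let logits = [0.1, 2.5, -0.3, 1.9];
    println!("next token id: {}", greedy_pick(&logits)); // prints 1
}
```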

Perfect for scripting, automation, quick queries, or integration into larger workflows where you need fast, local AI responses.

## Limitations

- **Single-shot only**: no conversation memory or multi-turn dialogue
- **Context size**: fixed at 2048 tokens of input context
- **Max generation**: limited to 512 tokens per response
- **Model format**: only GGUF models are supported
- **Sampling**: greedy sampling only (deterministic, no temperature control)
- **No streaming**: output is printed as it is generated, but there is no streaming API
- **CPU-bound**: uses CPU inference via llama.cpp (no GPU acceleration configured)
- **Model size**: the default Q4_K_M quantization balances quality and size (~4.7GB)
- **No fine-tuning**: uses the pre-trained Llama 3 model as-is

For multi-turn conversations, GPU acceleration, or advanced sampling, consider using larger LLM frameworks like ollama or llama.cpp directly.

## REST API Server

Start the server using the `-s` flag:

```bash
cargo build --release
./target/release/llr3 -s
```

Server output:

```
Starting server on http://127.0.0.1:8080
Visit http://127.0.0.1:8080/ui for the chat interface
```

### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/ui` | Web-based chat interface |
| GET | `/stream/{prompt}` | Send a prompt and receive the response as plain text |
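
The README does not name the web framework behind the server. Purely as an illustration, a `/stream/{prompt}` route could be wired up like this with axum; the framework choice, handler name, and echo body are assumptions, and the real handler runs Llama 3 inference instead:

```rust
use axum::{extract::Path, routing::get, Router};

// Hypothetical handler: the real server would run Llama 3 inference here
// and return the generated text as plain text.
async fn stream(Path(prompt): Path<String>) -> String {
    format!("echo: {prompt}")
}

#[tokio::main]
async fn main() {
    // axum 0.8 path syntax, matching the documented /stream/{prompt} endpoint.
    let app = Router::new().route("/stream/{prompt}", get(stream));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await.unwrap();
    println!("Starting server on http://127.0.0.1:8080");
    axum::serve(listener, app).await.unwrap();
}
```

Whatever the actual stack, the documented endpoint can be exercised with plain curl (URL-encode spaces in the prompt):

```bash
curl "http://127.0.0.1:8080/stream/what%20is%20rust"
```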

### Chat UI

Visit http://127.0.0.1:8080/ui in your browser for the interactive chat interface.

### Testing the REST API

```
./test-rest.sh "what is the capital of brazil?"
HTTP/1.1 200 OK
content-length: 35
content-type: text/plain
date: Sat, 24 Jan 2026 23:27:15 GMT

The capital of Brazil is Brasília.%
```

```
./test-rest.sh "how much is 1 + 4?"
HTTP/1.1 200 OK
content-length: 16
content-type: text/plain
date: Sat, 24 Jan 2026 23:27:29 GMT

The answer is 5!%
```

```
./test-rest.sh "hello world in java?"
HTTP/1.1 200 OK
content-length: 1228
content-type: text/plain
date: Sat, 24 Jan 2026 23:29:57 GMT

A classic!

Here is the traditional "Hello, World!" program in Java:

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}

Let me explain what's going on:

* `public class HelloWorld`: This declares a public class named `HelloWorld`.
* `public static void main(String[] args)`: This is the entry point of the program, where the Java Virtual Machine (JVM) starts executing the code. The `main` method is declared as `public`, which means it can be accessed from outside the class, and `static`, which means it can be called without creating an instance of the class. The `String[] args` parameter is an array of strings that represents the command-line arguments passed to the program.
* `System.out.println("Hello, World!");`: This statement uses the `System.out` object to print the string "Hello, World!" to the console, followed by a newline character.

To compile and run this program, you can use the following steps:

1. Save the code in a file named `HelloWorld.java`.
2. Compile the code using the `javac` command: `javac HelloWorld.java`
3. Run the program using the `java` command: `java HelloWorld`

This will print "Hello, World!" to the console.
```
