Quickstart
Start the server
Make sure you have uv installed, then pick one of the following ways to run the server: Ollama (local), OpenAI, or Docker.

Ollama (local)
Install Ollama, then pull a model and start the server:
```bash
ollama pull llama3.2:3b
export OLLAMA_URL=http://localhost:11434/v1
uvx --from 'llama-stack[starter]' llama stack run starter
```
OpenAI

Set your OpenAI API key and start the server:

```bash
export OPENAI_API_KEY=sk-xxx
uvx --from 'llama-stack[starter]' llama stack run starter
```
Docker

Run a pre-built container image from Docker Hub:

```bash
docker run -it \
  -p 8321:8321 \
  -v ~/.llama:/root/.llama \
  -e OLLAMA_URL=http://host.docker.internal:11434 \
  llamastack/distribution-starter
```

Note that host.docker.internal resolves out of the box on Docker Desktop (macOS and Windows); on Linux, add --add-host=host.docker.internal:host-gateway to the docker run command.
The uvx command above is great for trying things out. For a real project, install into a persistent environment:
```bash
uv init my-ai-app && cd my-ai-app
uv add 'llama-stack[starter]' openai
export OLLAMA_URL=http://localhost:11434/v1
uv run llama stack run starter
```
The server is now running at http://localhost:8321. You can use any OpenAI-compatible client.
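Because the server speaks the OpenAI wire format, a plain chat completion call works unchanged. A minimal sketch with the OpenAI Python SDK (the model id assumes the Ollama setup above):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Llama Stack server.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

completion = client.chat.completions.create(
    model="ollama/llama3.2:3b",  # assumes the Ollama option above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```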
Verify it works
Before writing any code, confirm the server is healthy and models are registered:
With curl:

```bash
curl -s http://localhost:8321/v1/models | python -m json.tool
```

Or with Python:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
for model in client.models.list():
    print(model.id)
```
You should see output listing available models, for example:
```json
{
  "data": [
    {
      "id": "ollama/llama3.2:3b",
      "object": "model",
      "owned_by": "llama_stack",
      ...
    }
  ]
}
```
If the list is empty or the command fails, check the Troubleshooting section below.
Try it out
Open a new terminal and save the following as app.py:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What is Llama Stack?",
)
print(response.output_text)
```
Then install the OpenAI client and run the script:

```bash
pip install openai && python app.py
```
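To see tokens as they are generated, the Responses API in the OpenAI SDK can also stream. A sketch, assuming Llama Stack emits the same streaming events OpenAI does:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# stream=True yields incremental events instead of a single final response.
stream = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What is Llama Stack?",
    stream=True,
)
for event in stream:
    # Print text deltas as they arrive; skip other event types.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()
```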
Add RAG in 10 lines
Upload a file, create a vector store, and ask questions about it:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Upload a document
file = client.files.create(
    file=open("my-document.pdf", "rb"),
    purpose="assistants",
)

# Create a vector store and index the file
vector_store = client.vector_stores.create(
    name="my-docs",
    file_ids=[file.id],
)

# Ask questions with file search
response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="Summarize the key points",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store.id],
    }],
)
print(response.output_text)
```
That's it. Same OpenAI SDK, local model, your own vector store.
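One caveat: file indexing is asynchronous, so a query fired immediately after creating the store can miss the document. A sketch of a wait helper, assuming the server reports per-file status the way the OpenAI vector store API does (wait_for_indexing is a hypothetical helper name):

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

def wait_for_indexing(vector_store_id: str, timeout: float = 60.0) -> None:
    """Poll until every file in the vector store finishes indexing."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = client.vector_stores.files.list(vector_store_id=vector_store_id)
        statuses = [f.status for f in files.data]
        if statuses and all(s == "completed" for s in statuses):
            return
        if any(s == "failed" for s in statuses):
            raise RuntimeError("a file failed to index")
        time.sleep(1)
    raise TimeoutError("indexing did not finish in time")
```

Call wait_for_indexing(vector_store.id) between creating the store and the file_search query.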
Troubleshooting
Port already in use
If you see Address already in use, another process is using port 8321. Either stop it or run on a different port:
```bash
uvx --from 'llama-stack[starter]' llama stack run starter --port 8322
```
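If you are unsure whether anything is actually listening, a quick standard-library check from Python (localhost and 8321 match the defaults above):

```python
import socket

# connect_ex returns 0 when something accepts the connection.
with socket.socket() as s:
    in_use = s.connect_ex(("localhost", 8321)) == 0
print("port 8321 is in use" if in_use else "port 8321 is free")
```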
Connection refused
If curl returns Connection refused, the server has not finished starting. Wait a few seconds for model registration to complete, then try again. Check the terminal where you started the server for errors.
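When scripting against the server, a small retry loop avoids racing its startup. A sketch using only the standard library (the "data" key matches the /v1/models output shown earlier):

```python
import json
import time
import urllib.request

# Poll the models endpoint until the server answers, for up to ~30 seconds.
for _ in range(30):
    try:
        with urllib.request.urlopen("http://localhost:8321/v1/models") as resp:
            models = json.load(resp)
        print(f"server is up, {len(models['data'])} models registered")
        break
    except OSError:
        time.sleep(1)
else:
    raise SystemExit("server did not come up; check its logs")
```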
Model not found
If the API returns an error about a model not being found, make sure you pulled the model first. For Ollama:
```bash
ollama pull llama3.2:3b
```
Then restart the Llama Stack server. You can verify available models with curl http://localhost:8321/v1/models.
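To check for the exact id from code instead of eyeballing the curl output, reuse the client from the verification step (the model id assumes the Ollama setup):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# True only if the id your app requests is actually registered.
print(any(m.id == "ollama/llama3.2:3b" for m in client.models.list()))
```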
What's next?
- Use the OpenAI SDK - full compatibility guide
- Try the Responses API - build agents with tool calling
- Add vector search - RAG with file search and vector stores
- See all providers - swap to vLLM, Bedrock, Azure, and more