Quickstart
Start the server
Make sure you have uv installed, then pick one of the following ways to run the server: Ollama (local), OpenAI, or Docker.

Ollama (local)
Install Ollama, then pull a model and start the server:
```bash
ollama pull llama3.2:3b
export OLLAMA_URL=http://localhost:11434/v1
uvx --from 'llama-stack[starter]' llama stack run starter
```
OpenAI

Set your OpenAI API key and start the server:

```bash
export OPENAI_API_KEY=sk-xxx
uvx --from 'llama-stack[starter]' llama stack run starter
```
Docker

Run a pre-built container image from Docker Hub:

```bash
docker run -it \
  -p 8321:8321 \
  -v ~/.llama:/root/.llama \
  -e OLLAMA_URL=http://host.docker.internal:11434 \
  llamastack/distribution-starter
```

Note that host.docker.internal resolves out of the box on Docker Desktop (macOS and Windows); on Linux, add --add-host=host.docker.internal:host-gateway to the docker run command.
The uvx command above is great for trying things out. For a real project, install into a persistent environment:
```bash
uv init my-ai-app && cd my-ai-app
uv add 'llama-stack[starter]' openai
export OLLAMA_URL=http://localhost:11434/v1
uv run llama stack run starter
```
The server is now running at http://localhost:8321. You can use any OpenAI-compatible client.
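Because the server speaks the OpenAI wire format, a plain chat completion call works unchanged. A minimal sketch with the OpenAI Python SDK (the model id assumes the Ollama setup above):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Llama Stack server.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

completion = client.chat.completions.create(
    model="ollama/llama3.2:3b",  # assumes the Ollama option above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```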
Verify it works
Before writing any code, confirm the server is healthy and models are registered:
With curl:

```bash
curl -s http://localhost:8321/v1/models | python -m json.tool
```

Or with Python:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
for model in client.models.list():
    print(model.id)
```
You should see output listing available models, for example:
```json
{
  "data": [
    {
      "id": "ollama/llama3.2:3b",
      "object": "model",
      "owned_by": "llama_stack",
      ...
    }
  ]
}
```
If the list is empty or the command fails, check the Troubleshooting section below.
Try it out
Open a new terminal and save the following as app.py:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What is Llama Stack?",
)
print(response.output_text)
```
Then install the OpenAI client and run the script:

```bash
pip install openai && python app.py
```
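To see tokens as they are generated, the Responses API in the OpenAI SDK can also stream. A sketch, assuming Llama Stack emits the same streaming events OpenAI does:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# stream=True yields incremental events instead of a single final response.
stream = client.responses.create(
    model="ollama/llama3.2:3b",
    input="What is Llama Stack?",
    stream=True,
)
for event in stream:
    # Print text deltas as they arrive; skip other event types.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()
```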
Add RAG in 10 lines
Upload a file, create a vector store, and ask questions about it:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# Upload a document
file = client.files.create(
    file=open("my-document.pdf", "rb"),
    purpose="assistants",
)

# Create a vector store and index the file
vector_store = client.vector_stores.create(
    name="my-docs",
    file_ids=[file.id],
)

# Ask questions with file search
response = client.responses.create(
    model="ollama/llama3.2:3b",
    input="Summarize the key points",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store.id],
    }],
)
print(response.output_text)
```
That's it. Same OpenAI SDK, local model, your own vector store.
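One caveat: file indexing is asynchronous, so a query fired immediately after creating the store can miss the document. A sketch of a wait helper, assuming the server reports per-file status the way the OpenAI vector store API does (wait_for_indexing is a hypothetical helper name):

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

def wait_for_indexing(vector_store_id: str, timeout: float = 60.0) -> None:
    """Poll until every file in the vector store finishes indexing."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = client.vector_stores.files.list(vector_store_id=vector_store_id)
        statuses = [f.status for f in files.data]
        if statuses and all(s == "completed" for s in statuses):
            return
        if any(s == "failed" for s in statuses):
            raise RuntimeError("a file failed to index")
        time.sleep(1)
    raise TimeoutError("indexing did not finish in time")
```

Call wait_for_indexing(vector_store.id) between creating the store and the file_search query.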
Troubleshooting
Port already in use
If you see Address already in use, another process is using port 8321. Either stop it or run on a different port:
```bash
uvx --from 'llama-stack[starter]' llama stack run starter --port 8322
```
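If you are unsure whether anything is actually listening, a quick standard-library check from Python (localhost and 8321 match the defaults above):

```python
import socket

# connect_ex returns 0 when something accepts the connection.
with socket.socket() as s:
    in_use = s.connect_ex(("localhost", 8321)) == 0
print("port 8321 is in use" if in_use else "port 8321 is free")
```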
Connection refused
If curl returns Connection refused, the server has not finished starting. Wait a few seconds for model registration to complete, then try again. Check the terminal where you started the server for errors.
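When scripting against the server, a small retry loop avoids racing its startup. A sketch using only the standard library (the "data" key matches the /v1/models output shown earlier):

```python
import json
import time
import urllib.request

# Poll the models endpoint until the server answers, for up to ~30 seconds.
for _ in range(30):
    try:
        with urllib.request.urlopen("http://localhost:8321/v1/models") as resp:
            models = json.load(resp)
        print(f"server is up, {len(models['data'])} models registered")
        break
    except OSError:
        time.sleep(1)
else:
    raise SystemExit("server did not come up; check its logs")
```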
Model not found
If the API returns an error about a model not being found, make sure you pulled the model first. For Ollama:
```bash
ollama pull llama3.2:3b
```
Then restart the Llama Stack server. You can verify available models with curl http://localhost:8321/v1/models.
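To check for the exact id from code instead of eyeballing the curl output, reuse the client from the verification step (the model id assumes the Ollama setup):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")

# True only if the id your app requests is actually registered.
print(any(m.id == "ollama/llama3.2:3b" for m in client.models.list()))
```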
What's next?
- Use the OpenAI SDK - full compatibility guide
- Try the Responses API - build agents with tool calling
- Add vector search - RAG with file search and vector stores
- See all providers - swap to vLLM, Bedrock, Azure, and more