In this tutorial, you'll learn how to use ContextLab to analyze text documents for token usage, redundancy, and salience.
- Python 3.11+
- ContextLab installed (`pip install contextlab`)
- An OpenAI API key (or use mock mode)
```bash
# Create .env file
echo "OPENAI_API_KEY=your-key-here" > .env

# Or use mock mode for testing
export CONTEXTLAB_MOCK_MODE=true
```

Create a few sample documents:
```bash
mkdir -p docs
echo "Machine learning is a subset of AI." > docs/ml.txt
echo "Deep learning uses neural networks." > docs/dl.txt
echo "Machine learning is a subset of artificial intelligence." > docs/ml2.txt
```

Run the analysis from the command line:

```bash
contextlab analyze docs/*.txt \
  --model gpt-4o-mini \
  --chunk-size 512 \
  --overlap 50 \
  --out .contextlab \
  --mock
```

Or use the Python API:

```python
import asyncio
from contextlab import analyze

async def main():
    report = await analyze(
        paths=["docs/*.txt"],
        model="gpt-4o-mini",
        chunk_size=512,
        overlap=50,
        output_dir=".contextlab"
    )
    print(f"Analyzed {len(report.chunks)} chunks")
    print(f"Total tokens: {report.total_tokens}")

    # Find the most salient chunks
    top_chunks = sorted(report.chunks, key=lambda c: c.salience, reverse=True)[:3]
    for chunk in top_chunks:
        print(f"\nChunk {chunk.id}:")
        print(f"  Salience: {chunk.salience:.3f}")
        print(f"  Text: {chunk.text[:60]}...")

asyncio.run(main())
```

Each chunk contains:
- `id`: Unique identifier
- `text`: Chunk content
- `tokens`: Token count
- `salience`: TF-IDF-based importance score (0-1)
- `redundancy`: Maximum cosine similarity with any other chunk (0-1)
- `embedding`: Vector representation used for similarity
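As a mental model, a chunk record with these fields could be sketched as a plain dataclass. This is an illustration built from the field list above, not ContextLab's actual class:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    # Illustrative stand-in for ContextLab's chunk record;
    # the real class may differ.
    id: str            # unique identifier
    text: str          # chunk content
    tokens: int        # token count
    salience: float    # TF-IDF-based importance score, 0-1
    redundancy: float  # max cosine similarity with other chunks, 0-1
    embedding: List[float] = field(default_factory=list)

chunk = Chunk(id="c1", text="Machine learning is a subset of AI.",
              tokens=9, salience=0.42, redundancy=0.81)
print(chunk.salience)  # 0.42
```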
Salience measures how "important" or "distinctive" a chunk is based on TF-IDF scoring:
- High salience (>0.5): Contains distinctive keywords
- Medium salience (0.2-0.5): Average content
- Low salience (<0.2): Common/generic content
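To build intuition for TF-IDF-based salience, here is a minimal sketch: score each chunk by the mean TF-IDF weight of its terms, then normalize to 0-1. This is a simplified stand-in, not ContextLab's exact formula:

```python
import math
from collections import Counter

def tfidf_salience(chunks):
    """Score each chunk by the mean TF-IDF weight of its terms (simplified)."""
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    # document frequency: how many chunks each term appears in
    df = Counter(t for doc in docs for t in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        weights = [(tf[t] / len(doc)) * math.log(n / df[t]) for t in tf] or [0.0]
        scores.append(sum(weights) / len(weights))
    # normalize to 0-1 so scores are comparable across runs
    top = max(scores) or 1.0
    return [s / top for s in scores]

chunks = [
    "machine learning is a subset of ai",
    "deep learning uses neural networks",
    "machine learning is a subset of artificial intelligence",
]
print(tfidf_salience(chunks))
```

The second chunk scores highest here because almost none of its terms appear in the other chunks, which is exactly the "distinctive keywords" behavior described above.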
Redundancy measures similarity to other chunks:
- High redundancy (>0.8): Very similar to another chunk
- Medium redundancy (0.5-0.8): Some overlap
- Low redundancy (<0.5): Mostly unique content
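Redundancy as "max cosine similarity with any other chunk" can be sketched in a few lines. These are hypothetical helpers for illustration, not ContextLab's implementation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def redundancy(embeddings):
    """For each embedding, the max cosine similarity to any OTHER embedding."""
    return [
        max(cosine(e, other) for j, other in enumerate(embeddings) if j != i)
        for i, e in enumerate(embeddings)
    ]

# Toy 2-d embeddings: the first two vectors are near-duplicates
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(redundancy(embs))
```

The near-duplicate pair scores above the 0.8 "high redundancy" threshold, while the unrelated vector stays well below 0.5.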
Different models have different context windows. Adjust chunk size accordingly:
```python
# For GPT-4 with a 128k context window
report = await analyze(
    paths=["large_docs/*.md"],
    model="gpt-4",
    chunk_size=2048,
    overlap=200
)
```

You can also pass raw text directly instead of file paths:

```python
report = await analyze(
    text="Your text content here...",
    model="gpt-4o-mini"
)
```

ContextLab automatically selects the right tokenizer:
```python
# For GPT models
report = await analyze(paths=["docs/*.txt"], model="gpt-4")

# For Claude models
report = await analyze(paths=["docs/*.txt"], model="claude-3")

# For Llama models
report = await analyze(paths=["docs/*.txt"], model="llama-2")
```

Visualize the results from the command line:

```bash
# List all runs
contextlab viz

# View a specific run
contextlab viz <run_id> --headless
```

Or query stored runs programmatically:

```python
from contextlab.io.ds import DataStore

store = DataStore()
runs = store.list_runs(limit=10)
for run in runs:
    print(f"Run {run.run_id}: {run.num_chunks} chunks, {run.total_tokens} tokens")
```

Finally, a few best practices:

- Choose appropriate chunk sizes: smaller chunks (256-512 tokens) for fine-grained analysis, larger chunks (1024-2048 tokens) for document-level analysis
- Use overlap wisely: 10-20% overlap preserves context across chunk boundaries
- Mock mode for testing: use the `--mock` flag to avoid API costs during development
- Batch processing: process multiple documents in one call for efficiency
- Storage management: periodically clean up old runs with `store.delete_run(run_id)`
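The chunk-size and overlap guidance amounts to sliding a window over the token stream, stepping by `chunk_size - overlap` each time. A minimal sketch (my own illustration, using a plain token list rather than a real tokenizer):

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=50)
print(len(chunks), len(chunks[0]))  # 3 512
```

Note that each chunk begins with the last 50 tokens of the previous one, and the final chunk is shorter than `chunk_size` when the token count doesn't divide evenly.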