
Tutorial 1: Introduction to Context Analysis

In this tutorial, you'll learn how to use ContextLab to analyze text documents for token usage, redundancy, and salience.

Prerequisites

  • Python 3.11+
  • ContextLab installed (pip install contextlab)
  • OpenAI API key (or use mock mode)

Basic Analysis

Step 1: Set up your environment

# Create .env file
echo "OPENAI_API_KEY=your-key-here" > .env

# Or use mock mode for testing
export CONTEXTLAB_MOCK_MODE=true

Step 2: Prepare sample documents

Create a few sample documents (ml.txt and ml2.txt are deliberate near-duplicates, so the redundancy scores will have something to detect):

mkdir -p docs
echo "Machine learning is a subset of AI." > docs/ml.txt
echo "Deep learning uses neural networks." > docs/dl.txt
echo "Machine learning is a subset of artificial intelligence." > docs/ml2.txt

Step 3: Run analysis

Using CLI

contextlab analyze docs/*.txt \
    --model gpt-4o-mini \
    --chunk-size 512 \
    --overlap 50 \
    --out .contextlab \
    --mock

Using Python SDK

import asyncio
from contextlab import analyze

async def main():
    report = await analyze(
        paths=["docs/*.txt"],
        model="gpt-4o-mini",
        chunk_size=512,
        overlap=50,
        output_dir=".contextlab"
    )

    print(f"Analyzed {len(report.chunks)} chunks")
    print(f"Total tokens: {report.total_tokens}")

    # Find most salient chunks
    top_chunks = sorted(report.chunks, key=lambda c: c.salience, reverse=True)[:3]
    for chunk in top_chunks:
        print(f"\nChunk {chunk.id}:")
        print(f"  Salience: {chunk.salience:.3f}")
        print(f"  Text: {chunk.text[:60]}...")

asyncio.run(main())

Understanding Results

Chunks

Each chunk contains:

  • id: Unique identifier
  • text: Chunk content
  • tokens: Token count
  • salience: TF-IDF-based importance score (0-1)
  • redundancy: Max cosine similarity with other chunks (0-1)
  • embedding: Vector representation (for similarity)
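The fields above can be pictured as a simple record. This is an illustrative sketch only; the class below is hypothetical and stands in for whatever chunk objects `analyze()` actually returns:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    # Hypothetical stand-in for ContextLab's chunk objects; the field
    # names mirror the list above, but the class itself is ours.
    id: str
    text: str
    tokens: int
    salience: float      # TF-IDF-based importance score, 0-1
    redundancy: float    # max cosine similarity to other chunks, 0-1
    embedding: List[float] = field(default_factory=list)

chunk = Chunk(id="c0", text="Machine learning is a subset of AI.",
              tokens=9, salience=0.42, redundancy=0.91)
print(chunk.salience)  # 0.42
```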

Salience

Salience measures how "important" or "distinctive" a chunk is based on TF-IDF scoring:

  • High salience (>0.5): Contains distinctive keywords
  • Medium salience (0.2-0.5): Average content
  • Low salience (<0.2): Common/generic content
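The bands above translate directly into a small bucketing helper. A minimal sketch (the helper and bucket names are ours, not part of the ContextLab API):

```python
def salience_bucket(score: float) -> str:
    """Map a 0-1 salience score to the bands described above."""
    if score > 0.5:
        return "high"      # distinctive keywords
    if score >= 0.2:
        return "medium"    # average content
    return "low"           # common/generic content

print(salience_bucket(0.62))  # high
print(salience_bucket(0.07))  # low
```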

Redundancy

Redundancy measures similarity to other chunks:

  • High redundancy (>0.8): Very similar to another chunk
  • Medium redundancy (0.5-0.8): Some overlap
  • Low redundancy (<0.5): Mostly unique content
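Since redundancy is defined as the maximum cosine similarity against the other chunks, you can reproduce it from the embeddings yourself. A plain-Python sketch (ContextLab computes this internally; the functions here are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def redundancy(embeddings, i):
    """Max cosine similarity of chunk i against every other chunk."""
    return max(cosine(embeddings[i], e)
               for j, e in enumerate(embeddings) if j != i)

vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(round(redundancy(vecs, 0), 3))  # near-duplicate of vecs[1], so close to 1
```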

Advanced Usage

Custom chunk sizes

Different models have different context windows. Adjust chunk size accordingly:

# For GPT-4 with 128k context
report = await analyze(
    paths=["large_docs/*.md"],
    model="gpt-4",
    chunk_size=2048,
    overlap=200
)

Analyzing text directly

report = await analyze(
    text="Your text content here...",
    model="gpt-4o-mini"
)

Using different tokenizers

ContextLab automatically selects an appropriate tokenizer for each model family:

# For GPT models
report = await analyze(paths=["docs/*.txt"], model="gpt-4")

# For Claude models
report = await analyze(paths=["docs/*.txt"], model="claude-3")

# For Llama models
report = await analyze(paths=["docs/*.txt"], model="llama-2")

Inspecting Results

View stored data

# List all runs
contextlab viz

# View specific run
contextlab viz <run_id> --headless

Access via Python

from contextlab.io.ds import DataStore

store = DataStore()
runs = store.list_runs(limit=10)

for run in runs:
    print(f"Run {run.run_id}: {run.num_chunks} chunks, {run.total_tokens} tokens")

Best Practices

  1. Choose appropriate chunk sizes: Smaller chunks (256-512 tokens) for fine-grained analysis, larger chunks (1024-2048 tokens) for document-level analysis

  2. Use overlap wisely: 10-20% overlap preserves context boundaries

  3. Mock mode for testing: Use --mock flag to avoid API costs during development

  4. Batch processing: Process multiple documents in one call for efficiency

  5. Storage management: Periodically clean up old runs with store.delete_run(run_id)
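Practice 2 is easy to misjudge, so here is what a 10-20% overlap actually does to the token windows. The helper below is our own sketch of sliding-window chunking, not ContextLab's implementation:

```python
def sliding_chunks(tokens, chunk_size=512, overlap=50):
    """Yield token windows where each chunk repeats `overlap` tokens
    from the end of the previous one, preserving boundary context."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

toks = list(range(1000))
chunks = sliding_chunks(toks, chunk_size=512, overlap=50)
print(len(chunks), chunks[1][0])  # 3 462 -- second chunk starts 50 tokens early
```

With `chunk_size=512` and `overlap=50` (roughly 10%), each new chunk starts 462 tokens after the previous one, so a sentence straddling a boundary appears whole in at least one chunk.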

Next Steps