
Tutorial 1: Introduction to Context Analysis

In this tutorial, you'll learn how to use ContextLab to analyze text documents for token usage, redundancy, and salience.

Prerequisites

  • Python 3.11+
  • ContextLab installed (pip install contextlab)
  • OpenAI API key (or use mock mode)

Basic Analysis

Step 1: Set up your environment

# Create .env file
echo "OPENAI_API_KEY=your-key-here" > .env

# Or use mock mode for testing
export CONTEXTLAB_MOCK_MODE=true

Step 2: Prepare sample documents

Create a few sample documents (ml.txt and ml2.txt are deliberate near-duplicates, so the redundancy scores will have something to detect):

mkdir -p docs
echo "Machine learning is a subset of AI." > docs/ml.txt
echo "Deep learning uses neural networks." > docs/dl.txt
echo "Machine learning is a subset of artificial intelligence." > docs/ml2.txt

Step 3: Run analysis

Using CLI

contextlab analyze docs/*.txt \
    --model gpt-4o-mini \
    --chunk-size 512 \
    --overlap 50 \
    --out .contextlab \
    --mock

Using Python SDK

import asyncio
from contextlab import analyze

async def main():
    report = await analyze(
        paths=["docs/*.txt"],
        model="gpt-4o-mini",
        chunk_size=512,
        overlap=50,
        output_dir=".contextlab"
    )

    print(f"Analyzed {len(report.chunks)} chunks")
    print(f"Total tokens: {report.total_tokens}")

    # Find most salient chunks
    top_chunks = sorted(report.chunks, key=lambda c: c.salience, reverse=True)[:3]
    for chunk in top_chunks:
        print(f"\nChunk {chunk.id}:")
        print(f"  Salience: {chunk.salience:.3f}")
        print(f"  Text: {chunk.text[:60]}...")

asyncio.run(main())

Understanding Results

Chunks

Each chunk contains:

  • id: Unique identifier
  • text: Chunk content
  • tokens: Token count
  • salience: TF-IDF-based importance score (0-1)
  • redundancy: Max cosine similarity with other chunks (0-1)
  • embedding: Vector representation (for similarity)
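The fields above can be pictured as a simple record. This is an illustrative sketch only; the class below is hypothetical and stands in for whatever chunk objects `analyze()` actually returns:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    # Hypothetical stand-in for ContextLab's chunk objects; the field
    # names mirror the list above, but the class itself is ours.
    id: str
    text: str
    tokens: int
    salience: float      # TF-IDF-based importance score, 0-1
    redundancy: float    # max cosine similarity to other chunks, 0-1
    embedding: List[float] = field(default_factory=list)

chunk = Chunk(id="c0", text="Machine learning is a subset of AI.",
              tokens=9, salience=0.42, redundancy=0.91)
print(chunk.salience)  # 0.42
```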

Salience

Salience measures how "important" or "distinctive" a chunk is based on TF-IDF scoring:

  • High salience (>0.5): Contains distinctive keywords
  • Medium salience (0.2-0.5): Average content
  • Low salience (<0.2): Common/generic content
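The bands above translate directly into a small bucketing helper. A minimal sketch (the helper and bucket names are ours, not part of the ContextLab API):

```python
def salience_bucket(score: float) -> str:
    """Map a 0-1 salience score to the bands described above."""
    if score > 0.5:
        return "high"      # distinctive keywords
    if score >= 0.2:
        return "medium"    # average content
    return "low"           # common/generic content

print(salience_bucket(0.62))  # high
print(salience_bucket(0.07))  # low
```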

Redundancy

Redundancy measures similarity to other chunks:

  • High redundancy (>0.8): Very similar to another chunk
  • Medium redundancy (0.5-0.8): Some overlap
  • Low redundancy (<0.5): Mostly unique content
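Since redundancy is defined as the maximum cosine similarity against the other chunks, you can reproduce it from the embeddings yourself. A plain-Python sketch (ContextLab computes this internally; the functions here are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def redundancy(embeddings, i):
    """Max cosine similarity of chunk i against every other chunk."""
    return max(cosine(embeddings[i], e)
               for j, e in enumerate(embeddings) if j != i)

vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(round(redundancy(vecs, 0), 3))  # near-duplicate of vecs[1], so close to 1
```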

Advanced Usage

Custom chunk sizes

Different models have different context windows. Adjust chunk size accordingly:

# For GPT-4 with 128k context
report = await analyze(
    paths=["large_docs/*.md"],
    model="gpt-4",
    chunk_size=2048,
    overlap=200
)

Analyzing text directly

report = await analyze(
    text="Your text content here...",
    model="gpt-4o-mini"
)

Using different tokenizers

ContextLab automatically selects an appropriate tokenizer for each model family:

# For GPT models
report = await analyze(paths=["docs/*.txt"], model="gpt-4")

# For Claude models
report = await analyze(paths=["docs/*.txt"], model="claude-3")

# For Llama models
report = await analyze(paths=["docs/*.txt"], model="llama-2")

Inspecting Results

View stored data

# List all runs
contextlab viz

# View specific run
contextlab viz <run_id> --headless

Access via Python

from contextlab.io.ds import DataStore

store = DataStore()
runs = store.list_runs(limit=10)

for run in runs:
    print(f"Run {run.run_id}: {run.num_chunks} chunks, {run.total_tokens} tokens")

Best Practices

  1. Choose appropriate chunk sizes: Smaller chunks (256-512 tokens) for fine-grained analysis, larger chunks (1024-2048 tokens) for document-level analysis

  2. Use overlap wisely: 10-20% overlap preserves context boundaries

  3. Mock mode for testing: Use --mock flag to avoid API costs during development

  4. Batch processing: Process multiple documents in one call for efficiency

  5. Storage management: Periodically clean up old runs with store.delete_run(run_id)
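Practice 2 is easy to misjudge, so here is what a 10-20% overlap actually does to the token windows. The helper below is our own sketch of sliding-window chunking, not ContextLab's implementation:

```python
def sliding_chunks(tokens, chunk_size=512, overlap=50):
    """Yield token windows where each chunk repeats `overlap` tokens
    from the end of the previous one, preserving boundary context."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

toks = list(range(1000))
chunks = sliding_chunks(toks, chunk_size=512, overlap=50)
print(len(chunks), chunks[1][0])  # 3 462 -- second chunk starts 50 tokens early
```

With `chunk_size=512` and `overlap=50` (roughly 10%), each new chunk starts 462 tokens after the previous one, so a sentence straddling a boundary appears whole in at least one chunk.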

Next Steps