# Parallel Data API Reference

## Core Functions

### `run_enrichment(config_file: str | Path) -> None`

Run data enrichment using a YAML configuration file.

**Parameters:**
- `config_file`: Path to YAML configuration file

**Raises:**
- `FileNotFoundError`: If config file doesn't exist
- `ValueError`: If config is invalid
- `NotImplementedError`: If source type is not supported

**Example:**
```python
from parallel_web_tools import run_enrichment

run_enrichment("configs/my_enrichment.yaml")
```

---

### `run_enrichment_from_dict(config: dict) -> None`

Run data enrichment using a configuration dictionary.

**Parameters:**
- `config`: Configuration dictionary matching YAML schema

**Raises:**
- `ValueError`: If config is invalid
- `NotImplementedError`: If source type is not supported

**Example:**
```python
from parallel_web_tools import run_enrichment_from_dict

config = {
    "source": "data.csv",
    "target": "enriched.csv",
    "source_type": "csv",
    "source_columns": [
        {"name": "company", "description": "Company name"}
    ],
    "enriched_columns": [
        {"name": "revenue", "description": "Annual revenue"}
    ]
}

run_enrichment_from_dict(config)
```

---

## Schema Classes

### `SourceType(Enum)`

Enumeration of supported data source types.

**Values:**
- `SourceType.CSV`: CSV file source
- `SourceType.DUCKDB`: DuckDB database source
- `SourceType.BIGQUERY`: Google BigQuery source

**Example:**
```python
from parallel_web_tools import SourceType

source_type = SourceType.CSV
```

---

### `Column(dataclass)`

Represents a column with name and description.

**Attributes:**
- `name` (str): Column name
- `description` (str): Column description

**Example:**
```python
from parallel_web_tools import Column

col = Column("revenue", "Annual revenue in USD")
```

---

### `InputSchema(dataclass)`

Schema for input data configuration.

**Attributes:**
- `source` (str): Source location (file path, table name)
- `target` (str): Target location
- `source_type` (SourceType): Type of data source
- `source_columns` (list[Column]): Input columns
- `enriched_columns` (list[Column]): Columns to enrich

**Example:**
```python
from parallel_web_tools import InputSchema, Column, SourceType

schema = InputSchema(
    source="data.csv",
    target="enriched.csv",
    source_type=SourceType.CSV,
    source_columns=[Column("company", "Company name")],
    enriched_columns=[Column("revenue", "Annual revenue")]
)
```

---

## Utility Functions

### `load_schema(filename: str) -> dict`

Load schema from YAML file.

**Parameters:**
- `filename`: Path to YAML file

**Returns:**
- Dictionary containing schema configuration

**Example:**
```python
from parallel_web_tools import load_schema

schema_dict = load_schema("config.yaml")
```

---

### `parse_schema(schema: dict) -> InputSchema`

Parse schema dictionary into InputSchema object.

**Parameters:**
- `schema`: Schema dictionary

**Returns:**
- `InputSchema` object

**Raises:**
- `ParseError`: If schema is invalid

**Example:**
```python
from parallel_web_tools import parse_schema

schema_dict = {
    "source": "data.csv",
    "target": "enriched.csv",
    "source_type": "csv",
    "source_columns": [{"name": "company", "description": "Company name"}],
    "enriched_columns": [{"name": "revenue", "description": "Annual revenue"}]
}

schema = parse_schema(schema_dict)
```

---

## Processor Functions (Advanced)

For direct access to processors (advanced usage):

### `process_csv(schema: InputSchema) -> None`

Process CSV file and enrich data.

### `process_duckdb(schema: InputSchema) -> None`

Process DuckDB table and enrich data.

### `process_bigquery(schema: InputSchema) -> None`

Process BigQuery table and enrich data.

**Example:**
```python
from parallel_web_tools import InputSchema, Column, SourceType
from parallel_web_tools.processors import process_csv

schema = InputSchema(
    source="data.csv",
    target="enriched.csv",
    source_type=SourceType.CSV,
    source_columns=[Column("company", "Company name")],
    enriched_columns=[Column("revenue", "Annual revenue")]
)

process_csv(schema)
```

---

## Configuration Schema

### YAML Configuration Format

```yaml
source: path/to/source  # File path or table name
target: path/to/target  # Output location
source_type: csv  # One of: csv, duckdb, bigquery
source_columns:
  - name: column_name
    description: Column description
enriched_columns:
  - name: new_column_name
    description: Description of enriched column
```

### Dictionary Configuration Format

```python
{
    "source": "path/to/source",
    "target": "path/to/target",
    "source_type": "csv",  # or "duckdb", "bigquery"
    "source_columns": [
        {"name": "column_name", "description": "Column description"}
    ],
    "enriched_columns": [
        {"name": "new_column", "description": "Description"}
    ]
}
```

---

## Environment Variables

Environment variables:
- `PARALLEL_API_KEY`: Your Parallel API key (required)
- `DUCKDB_FILE`: Path to DuckDB file (optional, default: `data/file.db`)
- `BIGQUERY_PROJECT`: Google Cloud Project ID for BigQuery (optional)

To load them from `.env.local`:
```python
from dotenv import load_dotenv

load_dotenv(".env.local")
```

---

## Error Handling

The functions in this reference may raise the following exceptions:
- `FileNotFoundError`: Config or data file not found
- `ValueError`: Invalid configuration
- `NotImplementedError`: Unsupported source type
- `ParseError`: Schema parsing failed

**Example with error handling:**
```python
from parallel_web_tools import run_enrichment, ParseError

try:
    run_enrichment("config.yaml")
except FileNotFoundError:
    print("Config file not found")
except (ValueError, ParseError) as e:
    # ValueError is the documented error for an invalid config;
    # ParseError is raised when schema parsing fails.
    print(f"Invalid configuration: {e}")
except Exception as e:
    # Includes NotImplementedError for unsupported source types.
    print(f"Enrichment failed: {e}")
```
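
**Example: checking a configuration dictionary before running**

Because an invalid configuration is only reported once enrichment is invoked, it can be convenient to check the dictionary's shape first. The sketch below is not part of `parallel_web_tools`: `find_config_problems` is a hypothetical helper, and its rules are an assumption based only on the keys and `source_type` values listed in this reference. The library's own validation may be stricter or differ in detail.

```python
# Hypothetical pre-flight check; not part of parallel_web_tools.
# The required keys and source types are taken from this reference.
REQUIRED_KEYS = ("source", "target", "source_type", "source_columns", "enriched_columns")
SOURCE_TYPES = ("csv", "duckdb", "bigquery")


def find_config_problems(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []

    # Every top-level key from the configuration format must be present.
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")

    # source_type must be one of the documented values.
    if "source_type" in config and config["source_type"] not in SOURCE_TYPES:
        problems.append(f"unsupported source_type: {config['source_type']!r}")

    # Each column entry needs both a name and a description.
    for key in ("source_columns", "enriched_columns"):
        if key not in config:
            continue  # already reported as missing above
        columns = config[key]
        if not isinstance(columns, list):
            problems.append(f"{key} must be a list")
            continue
        for i, column in enumerate(columns):
            if not isinstance(column, dict) or not {"name", "description"} <= column.keys():
                problems.append(f"{key}[{i}] needs 'name' and 'description'")

    return problems


if __name__ == "__main__":
    # Deliberately incomplete config to show how problems are reported.
    for problem in find_config_problems({"source": "data.csv", "source_type": "xml"}):
        print(problem)
```

Returning a list rather than raising on the first issue lets a caller report every problem at once. One possible use is to raise `ValueError` when the list is non-empty and only then call `run_enrichment_from_dict(config)`.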