I'm a Data & AI Architect with deep specialization in agentic AI systems and large-scale data platforms. I design end-to-end solutions at the intersection of data engineering and autonomous AI: multi-agent architectures, real-time pipelines, and cloud-native systems that scale.
I'm currently developing customized data science agents that combine traditional data processing with intelligent automation. My work focuses on:
- Data Agents Development: Creating intelligent agents for data processing and analysis.
- Agentic Workflows: Designing autonomous systems for data pipeline management.
- AI-Powered Data Solutions: Integrating LLMs with traditional data engineering patterns.
🔬 Current Project: Developing production-ready AI agents for BigQuery analytics, combining ADK, MCP, and BQML capabilities with RAG-enhanced documentation retrieval.
As a thought leader in data analytics and AI, I actively contribute to the developer community through speaking engagements and knowledge sharing.
Singapore Technology Week 2025 - October 9, 2025
Talk: "Unleash the Power of Generative AI in BigQuery with Colab Data Science Agents and BigFrames"
Demonstrated practical applications of Generative AI in BigQuery, showcasing how to leverage Colab Data Science Agents and BigFrames for advanced data analytics workflows. Explored the integration of AI-powered tools with BigQuery to enable data scientists and analysts to build intelligent data processing pipelines with natural language interfaces and automated insights generation.
Key Topics: Generative AI, BigQuery, Colab Data Science Agents, BigFrames, Data & AI workshops
Google Cloud Next Extended Singapore 2025 - June 14, 2025
Talk: "Metadata: The Key to Unlocking Data Analytics in the Agentic Era"
Presented insights on Google Cloud's latest data analytics innovations from Next '25, focusing on AI integration with BigQuery and the crucial role of metadata in enabling AI agents. Covered specialized AI agents for various user roles, AI-assisted notebooks, and the BigQuery AI Query Engine's capabilities with both structured and unstructured data.
Key Topics: BigQuery metadata, AI agents, data governance, query optimization, autonomous data processing
GDG Monthly Meetup #10 - October 24, 2024
Talk: "Harnessing Real-Time Insights: LLM Inference for Streaming Data with SQL"
Explored practical techniques for performing real-time inference on streaming data using large language models (LLMs) and SQL. Demonstrated seamless integration of LLMs into existing application workflows, enabling real-time insights, predictions, and classifications directly within familiar SQL environments.
Key Topics: Real-time data processing, LLM integration, streaming analytics, SQL-based AI inference
bq-agent-app - Multi-Agent BigQuery System
A powerful AI-powered data analysis system combining BigQuery with Google Agent Development Kit (ADK). Features multi-agent orchestration with specialized sub-agents for data retrieval, data science workflows, and BQML operations. Includes RAG corpus integration for BQML documentation and MCP protocol support.
Tech Stack: Python, ADK, MCP, Gemini 2.5, BigQuery, Vertex AI
Key Features: Multi-agent architecture, Python code execution, Statistical analysis, BQML with RAG, MCP integration
mcp-cr - Model Context Protocol Server
A comprehensive tutorial for deploying MCP (Model Context Protocol) servers to Google Cloud Run, featuring a zoo animal database with interactive tools. Demonstrates modern AI integration patterns with cloud-native deployment.
Tech Stack: Python, FastMCP, Google Cloud Run, Docker
Key Features: MCP server implementation, Cloud Run deployment, Interactive AI tools, RESTful API
mdm-gcp - Master Data Management with AI
Production-ready MDM solution with 5-strategy AI matching for batch processing and 4-strategy real-time matching for streaming. Features vector embeddings with Gemini, fuzzy matching, business rules, and AI natural language reasoning. Unified batch and streaming architecture with BigQuery and Spanner.
Tech Stack: Python, BigQuery, Spanner, Gemini, Vertex AI Vector Search
Key Features: 5-strategy AI matching, Vector embeddings, Real-time streaming, Unified batch+streaming architecture
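The multi-strategy matching idea can be sketched roughly as follows. This is a minimal, standard-library illustration and not the repository's implementation: the strategy set, the `WEIGHTS` values, the record fields, and the `threshold` are all hypothetical. `difflib` stands in for the fuzzy matcher, and toy vectors stand in for Gemini embeddings.

```python
import math
from difflib import SequenceMatcher

# Hypothetical weights; not taken from the mdm-gcp repository.
WEIGHTS = {"exact": 0.3, "fuzzy": 0.3, "vector": 0.4}


def exact_score(a: dict, b: dict) -> float:
    """Return 1.0 when normalized emails are identical, else 0.0."""
    return 1.0 if a["email"].strip().lower() == b["email"].strip().lower() else 0.0


def fuzzy_score(a: dict, b: dict) -> float:
    """Character-level similarity of the two names, in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def vector_score(a: dict, b: dict) -> float:
    """Cosine similarity of precomputed embeddings (toy vectors here)."""
    dot = sum(x * y for x, y in zip(a["embedding"], b["embedding"]))
    norm = math.sqrt(sum(x * x for x in a["embedding"])) * math.sqrt(
        sum(y * y for y in b["embedding"])
    )
    return dot / norm if norm else 0.0


def match_score(a: dict, b: dict) -> float:
    """Weighted combination of the individual strategy scores."""
    scores = {
        "exact": exact_score(a, b),
        "fuzzy": fuzzy_score(a, b),
        "vector": vector_score(a, b),
    }
    return sum(WEIGHTS[name] * value for name, value in scores.items())


def is_match(a: dict, b: dict, threshold: float = 0.8) -> bool:
    """Treat two records as the same entity at or above the threshold."""
    return match_score(a, b) >= threshold
```

Keeping each strategy as a separate function makes it possible to run a cheaper subset on the streaming path and the full set in batch, which is one plausible reading of the 5-strategy versus 4-strategy split described above.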
data-clean-room-demo - BigQuery Data Clean Rooms
Comprehensive BigQuery Data Clean Room implementation with Analytics Hub integration. Demonstrates privacy-preserving analytics, BQML collaborative ML, and secure data sharing patterns with automated setup scripts for both DCR and DCX deployments.
Tech Stack: Python, BigQuery, Analytics Hub, BQML, Vertex AI
Key Features: Privacy-preserving analytics, BQML collaborative ML, Analytics Hub automation, Data exchange patterns
random-stuff - BigQuery Analytics Toolkit
Production-ready BigQuery tools and demos covering advanced analytics patterns. Includes FinOps cost optimization, geospatial routing, Places Insights competitive analysis, RLS/CLS security with Dataform, Firebase Analytics integration, Streaming CDC pipelines, and dbt migration workflows.
Tech Stack: Python, BigQuery, Dataform, dbt, PySpark, Jupyter
Key Features: FinOps cookbook, Geospatial analysis, Places Insights, RLS/CLS security, Streaming CDC, dbt+Spark+BQ, Firebase Analytics
random-stuff/agent_stuff - AI Agent Configs & Guides
Curated collection of AI agent configurations, coding standards, and workspace architecture guides for multi-model agentic workflows. Includes OpenClaw workspace architecture guides for Anthropic and Gemini, Google-style coding standards for AI-generated code, BigQuery data science agent prompt libraries, and opencode configuration scripts.
Tech Stack: Python, OpenClaw, Anthropic Claude, Gemini, Google Cloud
Key Features: OpenClaw workspace architecture guides (Anthropic + Gemini), Google-style AI coding standards (Python/Go/Java), BQ agent prompt library, opencode config + sync scripts, dbt migration agents
spark-hybrid-compute - Advanced Spark Integration
Comprehensive solution for Spark integration with BigLake Metastore and Apache Iceberg, supporting both Dataproc and Docker-based deployments. Demonstrates hybrid cloud computing patterns for modern data lakes.
Tech Stack: Apache Spark, BigLake, Apache Iceberg, Dataproc, Docker, Jupyter
Key Features: Hybrid cloud architecture, Iceberg table management, BigQuery integration, Multiple deployment options
bigquery-antipattern-recognition - BigQuery SQL Optimization
Enhanced fork of Google Cloud Platform's utility for identifying and rewriting common anti-patterns in BigQuery SQL. Added query grouping functionality and clustering optimization patterns for improved performance analysis.
Tech Stack: Java, BigQuery, Maven, Docker, Cloud Run, Vertex AI
Key Features: 15+ antipattern detections, AI-powered SQL rewriting, Query grouping analysis, Remote UDF deployment
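To give a feel for what anti-pattern detection means, here is a deliberately naive Python sketch. It is not how the Java utility works: the rule names, regular expressions, and advice strings are hypothetical, and regex matching is only a rough stand-in for proper SQL parsing. It flags two commonly cited BigQuery anti-patterns, `SELECT *` and `ORDER BY` without `LIMIT`.

```python
import re

# Hypothetical, simplified rules; the real utility has its own detection logic.
RULES = [
    (
        "select_star",
        re.compile(r"\bSELECT\s+\*", re.IGNORECASE),
        "Select only the columns you need instead of SELECT *.",
    ),
    (
        "order_without_limit",
        # Matches ORDER BY only when no LIMIT keyword appears after it.
        re.compile(r"\bORDER\s+BY\b(?!.*\bLIMIT\b)", re.IGNORECASE | re.DOTALL),
        "ORDER BY without LIMIT sorts the full result set.",
    ),
]


def find_antipatterns(sql: str) -> list[tuple[str, str]]:
    """Return (rule_name, recommendation) pairs for each rule that matches."""
    return [(name, advice) for name, pattern, advice in RULES if pattern.search(sql)]
```

A regex approach like this misfires on comments, string literals, and subqueries, which is why production tools typically work on a parsed representation of the query instead.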
sheets-pyspark - Google Sheets with PySpark
Integration of Google Sheets as a data source for PySpark on Dataproc Serverless. Includes Airflow demo for scheduling notebook execution with three deployment options: PythonVirtualenvOperator, Vertex AI Custom Training, and Dataproc Serverless.
Tech Stack: Python, PySpark, Dataproc Serverless, Airflow, Google Sheets API, Jupyter
Key Features: Sheets as data source, Dataproc Serverless, Airflow orchestration, Multiple execution options
dataflow-kafka-bq-examples - Kafka to BigQuery Streaming
Comprehensive Dataflow examples for streaming Kafka data to BigQuery. Features multi-branch processing, Beam SQL aggregations, multi-stream joins, and both custom Java pipelines and Flex Templates for different deployment scenarios.
Tech Stack: Java, Apache Beam, Kafka, Dataflow, BigQuery, Beam SQL
Key Features: Multi-branch processing, Beam SQL joins, Real-time aggregations, Flex Template deployment
beam-dataflow-iceberg-bqms - Beam with Iceberg Tables
Demonstration of Apache Beam with standard BigQueryIO and Managed I/O for BigQuery operations. Showcases 8 pipeline patterns including BigQuery Iceberg and BigLake Iceberg table operations with automatic schema handling.
Tech Stack: Python, Apache Beam, Dataflow, Apache Iceberg, BigQuery, BigLake
Key Features: Managed I/O, BigQuery Iceberg tables, BigLake integration, Multiple pipeline patterns
cf-pubsub-to-bq - Real-Time Data Ingestion
Complete real-time data pipeline solution from Pub/Sub to BigQuery using Cloud Run Functions. Includes data generation, streaming processing, and automated table management.
Tech Stack: Go, Pub/Sub, BigQuery, Cloud Run Functions, Dataflow
Key Features: Real-time processing, Automated data generation, Partitioned tables, End-to-end pipeline
dataflow-pubsub-to-bq-examples-py - Pub/Sub to BigQuery Streaming
Python streaming pipeline from Pub/Sub to BigQuery using BigQuery Storage Write API. Features micro-batching, Pub/Sub metadata capture, and partitioned tables with DirectRunner and DataflowRunner V2 support.
Tech Stack: Python, Apache Beam, Dataflow, Pub/Sub, BigQuery
Key Features: Storage Write API, Micro-batching, Pub/Sub metadata capture, Runner V2 support
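Micro-batching groups individual messages into small batches before each write, trading a little latency for fewer write calls. The sketch below shows the idea in plain Python rather than Beam; the function name `micro_batches` and the default batch size are hypothetical and are not taken from the repository.

```python
from typing import Iterable, Iterator


def micro_batches(
    messages: Iterable[dict], max_batch_size: int = 3
) -> Iterator[list[dict]]:
    """Yield lists of at most max_batch_size messages, preserving order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: list[dict] = []
    for message in messages:
        batch.append(message)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # Flush the final partial batch.
        yield batch
```

This version flushes on size only. A streaming pipeline would normally also flush on a time bound so that a slow trickle of messages is not held back indefinitely.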
dataflow-pubsub-perf-test - Dataflow/BigQuery Performance Testing
Test infrastructure for diagnosing the Dataflow/BigQuery "Noisy Neighbor" throughput degradation pattern. Six rounds of testing across Pub/Sub and Kafka sources (Python + Java SDKs): 2.2 billion rows, 2.4 TB, 901k rows/sec peak, zero errors. Confirmed linear scaling and identified a shared Kafka consumer group as the root cause of production degradation. Exceeded the BigQuery Storage Write API regional quota and sustained it.
Tech Stack: Java, Python, Apache Beam, Dataflow, Pub/Sub, Kafka (Google Managed), BigQuery Storage Write API
Key Features: 2.2B rows / 2.4 TB scale testing, 901k rows/sec peak throughput, Noisy Neighbor root-cause diagnosis, Multi-source testing (Pub/Sub + Kafka), Python + Java SDK coverage
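The headline figures can be turned into rough back-of-the-envelope numbers. The sketch below is only an estimate under stated assumptions: it treats 2.4 TB as decimal terabytes and assumes rows at peak were the same size as the overall average, neither of which is stated in the summary. The function names are illustrative.

```python
TOTAL_ROWS = 2.2e9           # 2.2 billion rows, from the summary
TOTAL_BYTES = 2.4e12         # 2.4 TB, assuming decimal terabytes
PEAK_ROWS_PER_SEC = 901_000  # peak throughput, from the summary


def average_row_bytes(total_bytes: float, total_rows: float) -> float:
    """Average row size across the whole test, in bytes."""
    return total_bytes / total_rows


def implied_bytes_per_sec(rows_per_sec: float, row_bytes: float) -> float:
    """Byte throughput implied by a row rate, assuming uniform row size."""
    return rows_per_sec * row_bytes


row_bytes = average_row_bytes(TOTAL_BYTES, TOTAL_ROWS)
peak_bytes = implied_bytes_per_sec(PEAK_ROWS_PER_SEC, row_bytes)
print(f"average row size: about {row_bytes:.0f} bytes")
print(f"implied peak throughput: about {peak_bytes / 1e6:.0f} MB/s")
```

Under those assumptions the average row is a little over 1 KB and the peak works out to just under 1 GB/s; the real figures depend on the actual row-size distribution during the peak window.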
gemini-cli-1c - One-Click Gemini CLI Setup
Automated one-command installation script for a complete development environment with NVM, Node.js, and Google's Gemini CLI. Streamlines developer onboarding for AI-powered workflows.
Tech Stack: Shell, Node.js, NVM, Gemini CLI
Key Features: One-command installation, Environment configuration, Developer productivity tools
vision-sandbox - Agentic Vision Tool
Agentic vision tool built as an OpenClaw skill, leveraging Gemini's native code execution sandbox for spatial grounding, visual math, and UI auditing tasks. Demonstrates OpenClaw skill architecture for vision-based agentic workflows.
Tech Stack: Python, Gemini, Google Cloud, OpenClaw
Key Features: Agentic vision analysis, Spatial grounding, Visual math, UI auditing, OpenClaw skill architecture
- Big Data Processing: Apache Spark, Dataproc, distributed computing, Iceberg tables
- Data Warehousing: BigQuery, data modeling, partitioning strategies, performance optimization
- Real-Time Streaming: Pub/Sub, Kafka, Apache Beam, event-driven architectures
- Database Technologies: PostgreSQL, Spanner, Redis, Cassandra
- Master Data Management: AI-powered entity resolution, vector embeddings, multi-strategy matching
- AI Agents: Multi-agent systems, agentic workflows, autonomous data processing
- LLM Integration: Gemini AI, prompt engineering, RAG systems, AI-powered analytics
- ML Engineering: Model deployment, MLOps, BQML, production ML systems
- Vector Search: Semantic similarity, embeddings generation, hybrid search strategies
- Google Cloud Platform: Comprehensive expertise across data, AI, and compute services
- Serverless Computing: Cloud Functions, Cloud Run, event-driven architectures
- Infrastructure as Code: Terraform, deployment automation
- Data Governance: Data Clean Rooms, Analytics Hub, privacy-preserving analytics
- Website: johanesalxd.cc
- X (Twitter): @johanesalxd
- LinkedIn: johanesalxd




