<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>Developers &amp; Practitioners</title><link>https://cloud.google.com/blog/topics/developers-practitioners/</link><description>Developers &amp; Practitioners</description><atom:link href="https://cloudblog.withgoogle.com/blog/topics/developers-practitioners/rss/" rel="self"></atom:link><language>en</language><lastBuildDate>Wed, 22 Apr 2026 14:27:09 +0000</lastBuildDate><image><url>https://cloud.google.com/blog/topics/developers-practitioners/static/blog/images/google.a51985becaa6.png</url><title>Developers &amp; Practitioners</title><link>https://cloud.google.com/blog/topics/developers-practitioners/</link></image><item><title>Next '26 Hands-On: 10 Codelabs to Build Featured Tech</title><link>https://cloud.google.com/blog/topics/developers-practitioners/next-26-hands-on-10-codelabs-to-build-featured-tech/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Significant contributors to this article include &lt;strong&gt;Megan O'Keefe&lt;/strong&gt;, Senior Staff Developer Advocate, and &lt;/span&gt;&lt;strong&gt;Karl Weinmeister&lt;/strong&gt;, Director of Developer Relations.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you are joining us in person in Las Vegas or tuning in virtually from around the world, Google Cloud Next '26 offers a deep look into the practical evolution of AI. With &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;89% of sessions&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; this year dedicated to artificial intelligence, the focus has shifted from high-level concepts to the "Day 2" reality of building and maintaining agentic systems.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We've assembled &lt;strong&gt;55+ new codelabs&lt;/strong&gt; across Cloud at Next, and we want to share 10 highlights with you. The following curated list of codelabs is designed to help you translate the announcements from the talks and demos into functional code. These labs provide a structured way to explore the latest in multi-agent orchestration, data grounding, and enterprise security for your own workflows.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Dive into Codelabs!&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;1&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Build Rich Agent Experiences (ADK + A2UI)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/adk-a2ui/#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Improve user interaction&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;through intuitive, high-quality interfaces that allow users to interact with agentic systems seamlessly.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;2&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Building a Multi-Agent System&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/multi-agent-system#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Build&lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;the&lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;architecture required to make multiple agents work together to achieve a shared goal.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;3&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond the Simple SELECT: AlloyDB NL2SQL&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/alloydb-querydata#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Democratize data access by building systems that allow users to query complex databases using natural language, supported by high-speed vector search.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;4&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Beat Fraud with an AI Shield (Spanner &amp;amp; BigQuery Graph)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/spanner-bigquery-graph/#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Implement real-time reasoning with Spanner and BigQuery Graph databases. Analyze complex relationships in your data to prevent fraud at the point of transaction.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;5&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Building Secure Agents: Protecting Access and Data&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/showcase-build-secure-agent/#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Protect the reasoning engine with&lt;strong&gt; &lt;/strong&gt;&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Model Armor&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Identity and Access Management (IAM) &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;to manage agent access and ensure that sensitive data remains protected during execution.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;6&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Ground Agents with Google Maps Platform&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/maps-grounding/#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Use Geo-intelligent logistics to ground your agents in real-world location data to optimize field operations and logistics in real-time.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;7&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy and Scale Agents on Agent Engine&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/adk-deploy-scale/#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Deploy agents as containerized microservices that scale dynamically with your workload.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;8&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;The Ultimate Guide to Cloud Run: From Zero to Production&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/ultimate-cloud-run-guide/#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Achieve rapid deployment using this lab as a blueprint for moving from a local prototype to a production-ready, auto-scaling platform on Cloud Run.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;9&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;—Developer Keynote: Building Agents with Skills &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;|&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/dev-keynote/building-agents-with-skills#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Learn the ins and outs of AI agent development including Agent Development Kit (ADK), prompting, Agent Skill usage, and MCP. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;10—General Keynote: Forecasting with AI Agents | &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/next26/gen-keynote/raw-data-forecasting#0" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Codelab&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Transform unstructured chaos into actionable business intelligence in seconds&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Start Building Today&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These codelabs will connect you to the heart of the conference. You'll be able to bridge the high-level announcements, talks, and demos into the reality of the technology featured at Next '26. Whether you're here in person or attending virtually, these labs provide the concrete skills to drive real-world value during the conference &lt;em&gt;&lt;strong&gt;and&lt;/strong&gt;&lt;/em&gt; long after the conference ends.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;And there's more!&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Go to the &lt;a href="https://codelabs.developers.google.com/?event=googlecloudnext2026" rel="noopener" target="_blank"&gt;Codelab landing page&lt;/a&gt; to find the &lt;code&gt;Cloud Next '26&lt;/code&gt; tag and access &lt;strong&gt;more than 75 total&lt;/strong&gt; &lt;strong&gt;codelabs&lt;/strong&gt; that support the featured tech at this year's conference.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 13:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/next-26-hands-on-10-codelabs-to-build-featured-tech/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/corrected_updated_final_codelabs_image.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Next '26 Hands-On: 10 Codelabs to Build Featured Tech</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/corrected_updated_final_codelabs_image.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/next-26-hands-on-10-codelabs-to-build-featured-tech/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Mandy Grover</name><title>Strategic Content, Google Cloud</title><department></department><company></company></author></item><item><title>Level Up Your Agents: Announcing Google's Official Skills Repository</title><link>https://cloud.google.com/blog/topics/developers-practitioners/level-up-your-agents-announcing-googles-official-skills-repository/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI models improve, technical practitioners are increasingly turning to agentic AI tools to build with Google Cloud products, from Firebase and the Gemini API, 
to BigQuery and GKE. But how can you ensure that the model is equipped with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;accurate, up-to-date information &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;about these technologies? &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One way to do this is to plug your AI agent into a grounded, real-time information source. For instance, &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google offers a Model Context Protocol (MCP) server for its developer documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. But heavily using MCP servers can cause a problem called “context bloat,” where huge amounts of context are loaded into the context window, confusing the model and racking up token costs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We need a way to equip agents with additional, condensed expertise — and we can do this with &lt;/span&gt;&lt;a href="https://agentskills.io/home" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Skills.&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://agentskills.io/home" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Skills&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; are “a simple, open format for giving agents new capabilities and expertise.” Think of a skill as compact, agent-first documentation for a specific technology or task. Skills are written in Markdown and can contain reference files, code snippets, and other assets. Agents load in skill information &lt;/span&gt;&lt;a href="https://agentskills.io/what-are-skills#how-skills-work" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;only as-needed,&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; reducing the risk of context bloat. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, on Day 1 of &lt;/span&gt;&lt;a href="https://www.googlecloudevents.com/next-vegas/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Next 2026,&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; we’re excited to announce the launch of Google’s official Agent Skills repository: &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/google/skills" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;github.com/google/skills&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This repository is starting off with thirteen skills, focused on Google Cloud technologies: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A selection of products&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;AlloyDB, BigQuery, Cloud Run, Cloud SQL, Firebase, Gemini API, and Google Kubernetes Engine (GKE).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Three &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/architecture/framework"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Well-Architected Pillar&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;skills: Security, Reliability, and Cost Optimization &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;“Recipe” skills for Google Cloud Onboarding, Authentication, and Network Observability. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
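To make the format concrete, here is a minimal sketch of what one of these skills can look like, following the open SKILL.md layout (a Markdown file with YAML frontmatter); the skill name and contents below are hypothetical, not taken from the repository:

```markdown
---
name: cloud-run-deploy
description: Guidance for deploying containerized services to Cloud Run.
---

# Cloud Run deployment skill

When the user asks to deploy a service to Cloud Run:

1. Build and push the container image.
2. Deploy with `gcloud run deploy`, confirming the region and
   authentication flags with the user.
3. Verify the service URL responds before reporting success.
```

Because the agent reads the frontmatter first and loads the body only when the skill is relevant, the full instructions stay out of the context window until they are needed.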
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image_1_BwwkF6A.max-1000x1000.png"
        
          alt="image (1)"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;npx skills install &lt;/code&gt;&lt;a href="http://github.com/google/skills" rel="noopener" target="_blank"&gt;&lt;code style="text-decoration: underline; vertical-align: baseline;"&gt;github.com/google/skills&lt;/code&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to install these skills to your agents of choice, including &lt;/span&gt;&lt;a href="https://antigravity.google/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Antigravity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://geminicli.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and third-party agents. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/agent_skills-2.max-1000x1000.png"
        
          alt="agent_skills-2"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Stay tuned as we launch additional skills in this repo in the coming weeks and months! &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now get building!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 13:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/level-up-your-agents-announcing-googles-official-skills-repository/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Agent_Skills_Blog_-_Hero.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Level Up Your Agents: Announcing Google's Official Skills Repository</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Agent_Skills_Blog_-_Hero.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/level-up-your-agents-announcing-googles-official-skills-repository/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Megan O'Keefe</name><title>Senior Staff Developer Advocate</title><department></department><company></company></author></item><item><title>What’s new with the Cross-Cloud Network at Next ‘26</title><link>https://cloud.google.com/blog/products/networking/whats-new-in-cloud-networking-at-next26/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While generative AI sparked a revolution, the true paradigm shift is the rapid evolution from standalone AI models to multi-agent autonomous systems. In this new era, the network transcends basic connectivity to become the critical integration layer for your agentic enterprise.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI agents and services surge, your core applications remain as vital as ever. To thrive in this rapidly evolving landscape, you need a planet-scale network to connect, protect, govern, deliver, and secure all your users, data, agents, AI services, and core applications across clouds and on-premises.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud's Cross-Cloud Network provides this unified foundation, and is now used by 65% of the Fortune 100 and handles up to 27 exabytes of data per month. At Google Cloud Next, we are introducing networking innovations to accelerate your AI infrastructure, strengthen security, and simplify operations. &lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Optimized networking infrastructure for AI &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As we move toward an agentic world, the network must support massive-scale inference paired with reinforcement learning. At Google, we’ve spent years refining this cycle to power our own global AI services. Today, we’re announcing AI infrastructure network innovations that bring this same architecture directly to your workloads, across &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;agents&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;inference&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;training&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, and beyond.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Networking for agents&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Enterprise Agent Platform&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a comprehensive enterprise environment designed to build, scale, govern, and optimize the next generation of autonomous agents. Key innovations being announced in preview include: &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Gateway:&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt; Air-traffic control for agentic traffic&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agent Gateway understands MCP and A2A agentic protocols and provides an open, extensible, scalable way to enforce centralized governance policies to securely connect agents, models, and tools across runtimes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Ambient networking: &lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;A seismic shift in service-to-service connectivity&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ambient networking, a new integrated data plane for Google Kubernetes Engine (GKE) and Cloud Run, provides service discovery, zero-trust access, and traffic management without the need for complex and resource-heavy sidecar proxies. It reduces operational overhead and enables up to a 10x reduction in GKE resource usage for layer 4 (L4) mesh capabilities&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ambient networking underpins two new capabilities:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Service bindings &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;automatically establish service-to-service connectivity, allowing developers to move faster to build and scale their agentic applications and services.&lt;/span&gt;&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Network Services Monitoring&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; bridges application and network observability gaps resulting in faster root-cause analysis and simplified troubleshooting.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Rich partner integrations and customizations&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the help of &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/service-extensions/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Service Extensions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we are developing solutions&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;for identity, governance, and AI security for agent-to-anywhere traffic. Coming soon in preview to Agent Gateway are:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Identity and governance administration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Offering delegated authorization to Cloud IAM and partner services from Okta, Ping, Saviynt, and Silverfort to enforce real-time, contextual governance policies based on application and business context.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Runtime security:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; As a universal enforcement point by integrating with Google Cloud’s Model Armor and partner solutions from Broadcom, Check Point, Cisco, CrowdStrike, Exabeam, F5, Netskope, Palo Alto Networks, Thales, and Zscaler. Together, these can help to secure agentic communications against emerging AI attack vectors.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These innovations are built on an open foundation including Envoy and Kubernetes, providing strong, integrated governance in multicloud environments using standard Kubernetes Gateway APIs.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Networking for inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google we run inference at scale with optimized use of distributed GPU and TPU resources, automatic failover between regions for high availability, and optimized global request routing for fast end-user performance. GKE Inference Gateway delivers these capabilities to our cloud customers including the following new innovations:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Multi-region support &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;allows scaling inference services across regions, enabling cross-regional failover, optimized utilization, and reduced global latency (preview).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Predictive latency boost&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; improves utilization with intelligent request routing based on predefined performance targets (preview).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Disaggregated serving&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; leverages llm-d’s SGLang support, offering the flexibility to choose between vLLM and SGLang for model serving (GA).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-pull_quote"&gt;&lt;div class="uni-pull-quote h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;div class="uni-pull-quote__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;
      &lt;div class="uni-pull-quote__inner-wrapper h-c-copy h-c-copy"&gt;
        &lt;q class="uni-pull-quote__text"&gt;Gemini Enterprise Agent Platform reduced Time to First Token (TTFT) latency by over 35% for Qwen3-Coder by using GKE Inference Gateway.&lt;/q&gt;

        
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Before GKE Inference Gateway, managing our inference stack with Ray Serve created a complex, dual-orchestration layer that was a significant burden on our small operations team. Moving to the Inference Gateway and native Kubernetes deployments was the 'North Star' architecture we needed to simplify management and achieve robust production stability with a GKE-native batteries-included solution.”&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Mikhail Lubinets, Lead HPC Engineer, Technology Innovation Institute&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Networking for training&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;At Google, we build and run the largest AI models in the world — and we built a network to support that. Some of the new enhancements are:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Massive scale with &lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;Virgo Network&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This new &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;non-blocking data center fabric&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; removes latency barriers: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Virgo&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; can link up-to 134,000 chips with 47 Petabits/sec of non-blocking bi-sectional bandwidth in a single fabric. This delivers a staggering 1.6M Exaflops of FP4 compute. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;With &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;enhancements in Pathways and JAX&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, you can further connect these Virgo fabrics to scale to over 1 million TPU chips in a single training cluster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;We are also making Virgo Network&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; available on NVIDIA Vera Rubin NVL72&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, supporting up to 960,000 GPUs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more on Virgo Network, check out this &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/introducing-virgo-megascale-data-center-fabric"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;blog&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Accelerator network profiles&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It’s easier than ever to handle the complex networking prerequisites for accelerator-equipped GKE node pools with &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/introducing-managed-dranet-in-google-kubernetes-engine"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DRANET&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which improves bandwidth for distributed AI/ML workloads by up to 60% (GA).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;AI-native Cloud Interconnect&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;SLA-backed, and optimized for efficiency, Cloud Interconnect supports petabit-scale data transfers and is available with a fixed price option. Cloud Interconnect now supports:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;400 Gbps circuits with up to 3.2 Tbps in a single connection (GA)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Partner Cross-Cloud Interconnect for AWS (GA), CoreWeave (in preview soon), and Lumen (in preview soon)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Cross-Cloud Network for AI and core applications&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Cross-Cloud Network helps ensure you can securely connect users, data, locations, applications, services, and infrastructure anywhere in the world, at planetary scale. We designed our global multi-shard network to scale horizontally to meet the demands of the AI era and enable us to accommodate our 10x WAN traffic growth from 2020 to 2025.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These are some of the improvements we’re making to the Cross-Cloud Network: &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Ultra Low Latency Solution for financial exchanges &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In partnership with CME Group, we are bringing the world's leading derivatives marketplace to Google Cloud. To support CME Group’s performance requirements, we developed an ultra low latency (ULL) networking and compute solution. This fully managed cloud environment will allow CME Group and its clients to migrate its core trading systems to Google Cloud. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now in preview, the solution is designed to meet the unique and exacting requirements of running financial exchanges in the cloud. It includes several new technologies:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deterministic high-performance compute &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;powered by ULL networking, with bare metal and VM form factors, delivers a comprehensive portfolio for your trading compute needs. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scalable multicast data distribution &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;with hardware-based ultra-low latency enables reliable one-to-many market data sharing.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Nanosecond-level clock sync &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;enabled by &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/networking/understanding-the-firefly-clock-synchronization-protocol/"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Firefly&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a novel clock synchronization system. Firefly achieves sub-10ns NIC-to-NIC synchronization to support high-frequency trading.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Advanced network observability &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;with 64-bit nanosecond timestamps, support for multiple traffic-mirroring destinations and multicast traffic, and support for auditing and regulatory requirements.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Low-latency inference &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;allowing exchange participants to connect their AI-driven services to the exchange’s infrastructure. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;The Google Cloud Ultra Low Latency Solution provides the level of performance necessary for CME Group futures and options markets to run in the cloud, expanding access to clients worldwide.” &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;- Sunil Cutinho, CIO, CME Group&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cross-cloud observability for networks, applications, and agents&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you’re running core applications or new AI agents, you need visibility into your network infrastructure. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Network Insights&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, now in preview, offers network performance monitoring (NPM) and digital experience monitoring (DEM) to dramatically reduce the mean time to detect and mitigate network-related agent, application, and API issues.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud Network Insights is enabled by technologies from Broadcom’s AppNeta and powered by AI-enabling natural language queries through Gemini Cloud Assist.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"In an environment as complex and high-scale as Sabre’s, total visibility isn't just a luxury — it's a requirement for operational resilience. Cloud Network Insights will enable us to further shift our posture from reactive troubleshooting to proactive optimization. By providing granular, real-time telemetry across our global cloud footprint, it helps eliminate the traditional 'black box' of the network, allowing our teams to resolve bottlenecks before they impact the traveler experience."&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Alfredo Rodriguez, VP Cloud Platform Infrastructure, Sabre Corporation&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Cloud Network Insights closes the 'visibility gap' between the private corporate network and the public cloud, empowering our joint customers to pinpoint performance bottlenecks in seconds rather than hours.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Alan Davidson, CIO, Broadcom&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cross-Cloud Network for distributed applications&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Multicloud and hybrid networks require secure, reliable, and high-performance connectivity. New enhancements for our foundational networking services and tools include:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Private Service Connect &lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Private Service Connect traffic volume grew 4x in 2025 and it now supports &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/private-service-connect-compatibility"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;40+ Google and third-party published services&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling secure private global access to your managed services. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Private Service Connect endpoint-based security &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;allows for granular authorization policies for producer-to-consumer service communications (preview).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Gemini Cloud Assist for Private Service Connect&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; provides for automated troubleshooting (preview).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud-native IP address management (IPAM)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Number Registry &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;is an IPAM solution powered by agentic technologies. Network admins can easily find free IP ranges, track utilization, and allocate resources (preview). It also integrates with Infoblox Universal DDI for Cross-Cloud Network IPAM discovery and enforcement.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Hybrid Subnets&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; allow you to migrate legacy workloads from on-premises to a VPC without needing to change hard-coded IP addresses (GA).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud NAT &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;allows you to connect your IPv6-only workloads to private IPv4 destinations using the combined power of DNS64 and private NAT64 (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
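&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Cloud NAT item above pairs DNS64 with NAT64: the resolver synthesizes an AAAA answer by embedding the IPv4 destination in an IPv6 prefix (RFC 6052), typically the well-known 64:ff9b::/96, and NAT64 translates the packets in flight. A minimal sketch of that address synthesis, assuming the well-known prefix and illustrative addresses:&lt;/span&gt;&lt;/p&gt;

```python
import ipaddress

def dns64_synthesize(ipv4: str, prefix: str = "64:ff9b::/96") -> ipaddress.IPv6Address:
    """Embed an IPv4 address in a /96 DNS64 prefix (RFC 6052 layout)."""
    net = ipaddress.IPv6Network(prefix)
    assert net.prefixlen == 96, "IPv4 bits occupy the low 32 bits of a /96 prefix"
    # OR the 32-bit IPv4 value into the low 32 bits of the prefix.
    return ipaddress.IPv6Address(
        int(net.network_address) | int(ipaddress.IPv4Address(ipv4))
    )

# RFC 6052's worked example: 192.0.2.33 inside the well-known prefix.
print(dns64_synthesize("192.0.2.33"))  # 64:ff9b::c000:221
```

&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;An IPv6-only client simply connects to the synthesized address; the IPv4 bits sit in the low 32 bits of the /96, which is what lets the NAT64 gateway recover the original destination.&lt;/span&gt;&lt;/p&gt;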
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Network Connectivity Center (NCC)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Partner Cross-Cloud Interconnect for AWS is available as a connectivity type in NCC (preview).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Support for static routes using an &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;internal load balancer as the next hop&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; allows the integration of Secure Web Proxy and third-party network security virtual appliances (GA).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Support for &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;privately used public IP&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (PUPI) allows the exchange of PUPI IPv4 addresses with VPC spokes and producer VPC spokes (GA).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
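&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The ILB-next-hop item above can be sketched with gcloud; all names and addresses here are hypothetical placeholders, and the next hop is the internal IP of a forwarding rule for an internal passthrough Network Load Balancer fronting the security appliances:&lt;/span&gt;&lt;/p&gt;

```shell
# Hypothetical example: steer all egress from prod-vpc through an internal
# load balancer (identified by its forwarding rule's IP) that fronts
# third-party inspection appliances; NCC can then propagate this route.
gcloud compute routes create inspect-egress \
    --network=prod-vpc \
    --destination-range=0.0.0.0/0 \
    --next-hop-ilb=10.128.0.99
```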
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular networking charge visibility&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cost Explorer and the new App Optimize API now provide attribution of associated Data Transfer costs to the originating resources for Google Cloud products (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cross-Cloud Network for internet-facing services&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As part of Cross-Cloud Network, the &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/cross-cloud-network#deliver-internet-facing-apps-and-content"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Global Front End&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; simplifies how you deliver, scale, and protect web, API, and AI workloads. New capabilities include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Global Front End Enterprise delivers&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; simplified consumption by combining capabilities from global Cloud Load Balancing, Google Cloud Armor, Cloud CDN, and Service Extensions with up to 15% lower TCO (in preview soon). &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Post quantum cryptography &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;(PQC) helps secure your workloads with industry-standard algorithms that provide a layered defense against both classical and quantum adversaries.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google tag gateway,&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; enabling advertisers to serve tags from their own domain, which can significantly improve the accuracy and resilience of measurement signals (GA soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud CDN&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, an important part of the Global Front End, now offers:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Built-in image optimization &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;to help you deliver content that best fits your end users’ screens and saves on bandwidth costs (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE Gateway support&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; so you can enable and manage caching services using GKE APIs (GA).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Cross-Cloud Network’s Cloud WAN for global enterprises&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cloud WAN is a fully managed, reliable global backbone to connect your enterprise. New capabilities include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Expanded geographic reach: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Our network spans more than 10 million kilometers of terrestrial and subsea fiber, and Network Connectivity Center’s site-to-site data transfer is now available in over &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/network-connectivity/docs/network-connectivity-center/concepts/locations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;25 countries&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;NCC Gateway &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;enables third-party secure service edge (SSE) integrations from Palo Alto Networks (GA soon) and Symantec (preview).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The Verified Peering Provider program&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;which offers highly reliable internet connectivity to Google, now has dramatically expanded availability through &lt;/span&gt;&lt;a href="https://peering.google.com/#/options/verified-peering-provider" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;175+ providers worldwide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Last mile connectivity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Provision site-to-cloud private connectivity in minutes with preferred partners from the Google Cloud console (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;“Cloud WAN enables Dun &amp;amp; Bradstreet to evolve our global network via composable, cloud-native constructs. Leveraging NCC, we’ve built a resilient, high-performance platform that simplifies operations and optimizes costs. This foundation supports continued modernization and AI-driven workloads. We expect to extend this architecture as new patterns emerge, maintaining our blueprints-first approach.”&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; - Josh Barry, VP, Network Engineering, Dun &amp;amp; Bradstreet&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;AI-powered security against evolving threats&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The threat landscape is evolving faster than ever, with AI-driven attacks. Staying ahead requires the latest defenses. Cross-Cloud Network relies on Cloud NGFW and Cloud Armor for advanced security capabilities. Here’s the latest on those offerings.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud NGFW &lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Advanced malware sandbox &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;uses AI models trained on data from 70k+ customers &lt;/span&gt;&lt;a href="https://www.paloaltonetworks.com/apps/pan/public/downloadResource?pagePath=/content/pan/en_US/resources/datasheets/advanced-wildfire" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;to stop 99% of known and unknown malware&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, including evasive zero-days. Advanced malware sandbox is powered by Palo Alto Networks Advanced Wildfire (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Internal Application and proxy Network Load Balancer &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;support helps to enforce consistent, service-centric security for abstracted services like GKE, Cloud Run, and Private Service Connect traffic (preview).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Project-level policies &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;allow for creating and managing Cloud NGFW endpoints, security profiles, and security profile groups at the project level (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Armor &lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Managed rules, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;built-in rulesets across 15 threat categories, deliver automated threat protection against a broad set of attacks and zero-day CVEs. This is powered by Thales Imperva based on visibility to &lt;/span&gt;&lt;a href="https://engage-cybersec.thalesgroup.com/rs/727-WRL-406/images/EMEA-2025-Partner-Connect-05-Shailes-Nanda.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;1.5 trillion web requests each month&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud Fraud Defense integration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; helps to &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;discern the legitimacy and authorization of bots, humans, and agents. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/introducing-google-cloud-fraud-defense-the-next-evolution-of-recaptcha"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Fraud Defense&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is the evolution of reCAPTCHA, which protects over 14 million domains from fraud and abuse.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Adaptive protection for Network Load Balancers &amp;amp; VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; brings advanced machine learning to L3/L4 traffic, to detect and mitigate volumetric DDoS attacks (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A simplified user experience&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with a visual rule builder makes custom rule creation easier (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong style="vertical-align: baseline;"&gt;AI-powered network operations&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Finally, new AI-powered technologies in &lt;/span&gt;&lt;a href="https://cloud.google.com/products/gemini/cloud-assist"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Cloud Assist&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; can help automate manual tasks, ease troubleshooting, predict reliability issues, improve security, and help optimize your network to reduce toil and improve reliability with new specialist agents. These include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A network security agent&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; that streamlines network security operations by assisting with policy generation, recommendations, and impact analysis (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A network agent &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;that optimizes workload placement for performance and reliability, and also provides advanced cost estimation for observability services (in preview soon).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Additionally, to enable customers and partners to build their own agents, we are releasing &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Network observability MCP tools and agent skills.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; This will allow their agents to leverage connectivity tests, and allows for natural language querying of VPC Flow Logs (both in preview).&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong style="vertical-align: baseline;"&gt;The network that scales with you&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We built our Cross-Cloud Network on the same global infrastructure that powers Google’s largest AI and internet services. This provides you with a blazing-fast, planet-scale foundation that is both secure by design and open by principle, allowing you to integrate your trusted partners across any environment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As we move into the agentic era, our flexible, future-proof solutions ensure you can quickly adopt the latest AI technologies while maintaining the reliability of your core applications. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whatever comes next, we’ve built the network to help you lead it. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Attend our networking sessions at Next ’26 to learn more, or learn more about the &lt;/span&gt;&lt;a href="https://cloud.google.com/solutions/cross-cloud-network?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cross-Cloud Network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/whats-new-in-cloud-networking-at-next26/</guid><category>Hybrid &amp; Multicloud</category><category>Infrastructure Modernization</category><category>Developers &amp; Practitioners</category><category>Google Cloud Next</category><category>Networking</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_5_Dark.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>What’s new with the Cross-Cloud Network at Next ‘26</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_5_Dark.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/whats-new-in-cloud-networking-at-next26/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rob Enns</name><title>VP/GM of Cloud Networking</title><department></department><company></company></author></item><item><title>Introducing Gemini Enterprise Agent Platform, powering the next wave of agents</title><link>https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span 
style="vertical-align: baseline;"&gt;In the early days of generative AI, building safe and reliable business tools took massive engineering effort and a high tolerance for trial and error. We helped solve that with &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, our trusted AI development platform. But today, we’re managing a different level of complexity, with agents interacting across multiple systems — and often without security and governance guardrails. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To move toward a truly autonomous enterprise, one where agents can act with the same independence and reliability as a member of your team, you need a foundation that can sustain that level of trust. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What’s new: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re launching &lt;/span&gt;&lt;a href="https://console.cloud.google.com/agent-platform/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Enterprise Agent Platform&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; — our new, comprehensive platform to build, scale, govern, and optimize agents. It’s the evolution of Vertex AI, bringing the model selection, model building, and agent building capabilities that customers love, together with new features for agent integration, DevOps, orchestration, and security. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agent Platform provides a single destination for your technical teams to build agents that can transform your products, services, and operations. These agents can be seamlessly delivered to your employees through the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/whats-new-in-gemini-enterprise"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Enterprise app&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, all while remaining tightly integrated with your IT operations to help ensure control, governance, and security as you scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The platform also provides first-class access to more than 200 of the world’s leading models through Model Garden. This includes our latest first-party breakthroughs like &lt;/span&gt;&lt;a href="https://deepmind.google/models/gemini/pro/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini 3.1 Pro&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://deepmind.google/models/gemini-image/flash/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini 3.1 Flash Image&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://deepmind.google/models/lyria/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Lyria 3&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, alongside our open models like &lt;/span&gt;&lt;a href="https://deepmind.google/models/gemma/gemma-4/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemma 4&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. And, of course, customers have full flexibility to use the best model for the job with support for third-party models like Anthropic’s Claude Opus, Sonnet and Haiku. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Moving forward, all Vertex AI services and roadmap evolutions will be delivered exclusively through the Agent Platform, rather than as a standalone service, to power the next generation of agent development.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Why Agent Platform matters for your business: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Agent Platform helps you move from managing individual AI tasks to delegating business outcomes with total confidence. You can: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Choose the right environment for the job — from the low-code, visual interface of the new &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Studio,&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to the code-first logic of the upgraded &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. We’ve simplified the entire lifecycle with AI-native coding capabilities to help you ship production-grade agents faster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Scale:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Clear the path to production with the re-engineered &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Runtime&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This supports long-running agents that maintain state for days at a time and are backed by &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Bank&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for persistent, long-term context.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Govern: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Establish centralized control with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Identity, Agent Registry, and Agent Gateway&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. These capabilities help ensure every agent — whether built on Agent Platform or sourced from our partner ecosystem — has a trackable identity and operates within enterprise-grade guardrails. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Optimize:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Guarantee quality with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Simulation, Agent Evaluation, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; Agent Observability&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. These tools provide full execution traces and a real-time lens into agent reasoning to help ensure your agents always hit their goals.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_0_gemini_enterprise_agent_platform.max-1000x1000.jpg"
        
          alt="1 gemini enterprise agent platform"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Get started with Agent Platform: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Visit &lt;/span&gt;&lt;a href="https://console.cloud.google.com/agent-platform/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Platform&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the Google Cloud console to explore new features and start building today. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Keep reading for a deeper look at our latest releases and how Agent Platform helps you deliver the production-ready agents you can trust at every stage of the journey.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;How customers are achieving more with Gemini Enterprise Agent Platform&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_gemini_enterprise_agent_platform.max-1000x1000.jpg"
        
          alt="2- GEAP Logo Wall"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;"Burns &amp;amp; McDonnell uses Agent Platform to transform how organizational knowledge is applied across the enterprise. Using ADK, we are building an AI agent that turns decades of project data into real-time, actionable intelligence. Agent Platform enables this innovation to scale responsibly by combining deterministic business rules with probabilistic reasoning — making AI a trusted operational capability, not just a productivity tool. With Agent Platform, we aren’t just managing knowledge; we are activating experience to drive faster, more confident decisions." &lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;– Matt Olson, Chief Innovation Officer, Burns &amp;amp; McDonnell&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;“Color Health uses Agent Platform to power our Virtual Cancer Clinic, delivering end-to-end care. By building our Color Assistant with the Agent Development Kit (ADK) and scaling it via Agent Runtime, we are helping more women get screened for breast cancer. The Color Assistant engages users to check screening eligibility, connects them to clinicians, and helps schedule appointments. The power of the agent lies in the scale it enables — helping us reach more people and respond to individual risk and eligibility in real time.” &lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;– Jayodita Sanghvi, PhD., Head of AI Platform, Color&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;“By rebuilding Comcast’s Xfinity Assistant with Agent Development Kit (ADK), we’ve moved beyond simple scripted automation to conversational generative intelligence that delivers personalized troubleshooting and self-service support to our customers. Agent Runtime has been a massive accelerator, allowing us to deploy a sophisticated multi-agent architecture that increases digital containment while ensuring secure, grounded interactions via Gemini. We aren't just reducing repeat interactions by solving customers’ issues the first time; we're redefining the customer experience at scale.” &lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;– Rick Rioboli, Chief Technical Officer, Connectivity &amp;amp; Platforms, Comcast&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;“Geotab uses Agent Platform to rapidly accelerate our AI Agent Center of Excellence. Google's Agent Development Kit (ADK) provides the flexibility to orchestrate various frameworks under a single, governable path to production, while offering an exceptional developer experience that dramatically speeds up our build-test-deploy cycle. For Geotab, ADK is the foundation that allows us to rapidly and safely scale our agentic AI solutions across the enterprise” &lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;– Mike Branch, Vice President, Data &amp;amp; Analytics, GeoTab&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;"Gurunavi uses Agent Platform to power 'UMAME!', an AI restaurant discovery app that leverages Memory Bank to achieve a deep understanding of user context. Unlike conventional prompt-based systems, our agent remembers a user's past actions and preferences to proactively present the best options. This eliminates the need for manual searches and creates a seamless experience that will improve user satisfaction by 30% or more. We view this memory function as a non-negotiable feature for the future of new culinary experiences.” &lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;– Toshiaki Iwamoto, CTO, Gurunavi&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;"At L'Oréal, Beauty Tech is not just a support function — it is a powerful catalyst to create the beauty that moves the world. To live up to that ambition, we decided to build our own proprietary Beauty Tech Agentic Platform, powered by Google Cloud. Leveraging Agent Development Kit (ADK), we are leading a fundamental shift: moving from deterministic workflow automation to autonomous, outcome-oriented agent orchestration. Our agents are not locked in a vacuum — through Model Context Protocol (MCP), they are securely connected to our single sources of truth, including our Beauty Tech Data Platform and core operational applications. Google Cloud gives us the resilience, the multi-LLM flexibility, and the enterprise-grade trust framework we need to scale this platform globally, while keeping human oversight at the center."&lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt; – Etienne BERTIN, Group CIO, L'Oréal&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;“Payhawk uses Agent Platform to transform our AI agents from simple task executors into genuine financial assistants. By leveraging Memory Bank, we have moved from stateless interactions to long-term context retention. Our agents now act like dedicated team members, autonomously recalling user-specific constraints and history. For example, our Financial Controller Agent now remembers a user’s habits to auto-submit expenses, reducing submission time by over 50%. This shift allows our agents to anticipate needs based on past behavior rather than just reacting to prompts.”&lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt; – Diyan Bogdanov, Principal Applied AI Engineer, Payhawk&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;"PayPal uses Agent Platform to rapidly build and deploy agents in production. Specifically, we use Agent Development Kit (ADK) and visual tools to inspect agent interactions, and manage multi-agent workflows. This provides the step-by-step visibility we need to visualize the flow of intent and payment mandates. Finally, Agent Payment Protocol (AP2) on Agent Platform provides the critical foundation for trusted agent payments. helping our ecosystem accelerate the shipping of secure agent-based commerce experiences." &lt;/span&gt;&lt;strong style="font-style: italic; vertical-align: baseline;"&gt;– Nitin Sharma, Principal Engineer, AI, PayPal&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Build AI agents&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Build agents quickly and easily by empowering your developers, business users and everyone in between to build and deploy agents at scale.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Build smarter agents, faster&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A major upgrade to ADK: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;More than six trillion tokens are processed monthly on Gemini models through &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/adk"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Unlock more powerful reasoning by organizing agents into a network of sub-agents. This new, graph-based framework allows you to define clear, reliable logic for how agents work together to solve complex problems.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Workspaces are secure-by-design: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Give agents a hardened, sandboxed environment to run bash commands and manage files safely, isolated from your core systems.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Multimodal streaming:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Bring human-like stability to real-time interactions with multimodal support for live audio and video cues.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
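The sub-agent pattern described above can be sketched in plain Python. This is an illustrative toy, not the ADK API: the `Coordinator` and `SubAgent` classes and the agent names are hypothetical, standing in for ADK's own agent types and routing logic.

```python
# Toy coordinator that routes tasks to a network of sub-agents by
# skill. All names here are hypothetical illustrations, not ADK APIs.

class SubAgent:
    def __init__(self, name, skill):
        self.name = name
        self.skill = skill  # the topic this agent can handle

    def handle(self, task):
        return f"{self.name} handled: {task}"

class Coordinator:
    """Routes each incoming task to the sub-agent whose skill matches."""
    def __init__(self, sub_agents):
        self.sub_agents = {a.skill: a for a in sub_agents}

    def run(self, skill, task):
        agent = self.sub_agents.get(skill)
        if agent is None:
            return f"no sub-agent for skill '{skill}'"
        return agent.handle(task)

coordinator = Coordinator([
    SubAgent("billing-agent", "billing"),
    SubAgent("research-agent", "research"),
])
print(coordinator.run("billing", "refund order #42"))
# → billing-agent handled: refund order #42
```

In a real multi-agent system the routing decision would itself be model-driven; the fixed skill map above only illustrates the "clear, reliable logic" the graph-based framework is meant to encode.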
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Connect your agents to the enterprise&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Securely access any system:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Use plug-and-play architecture with Native Ecosystem Integrations to connect agents to your internal data and tools without custom coding.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Automate background operations:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Activate your data in BigQuery and Pub/Sub with Batch &amp;amp; Event-driven agents. This way, you can run massive, asynchronous tasks like content evaluation or data analysis in the background.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
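Batch and event-driven agent work of this kind can be approximated with a simple queue-draining loop. This is a stdlib sketch under stated assumptions: a `queue.Queue` stands in for a Pub/Sub subscription, and `handle_invoice` is a hypothetical handler, not a platform API.

```python
import queue

# Minimal event-driven dispatch loop. In the platform described above,
# events would arrive from Pub/Sub or BigQuery; here a local queue and
# a hypothetical invoice handler stand in for illustration.

def handle_invoice(event):
    # Pretend "processing": tag the event as handled.
    return {"id": event["id"], "status": "processed"}

def drain(events, handler):
    """Consume every queued event and return the handler results."""
    results = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return results
        results.append(handler(event))

events = queue.Queue()
for i in range(3):
    events.put({"id": i})
print(drain(events, handle_invoice))
```

The point of the pattern is that the handler runs asynchronously in the background, triggered by data arriving, rather than by an interactive prompt.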
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Go from idea to production in hours&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enable AI-driven development: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;A programmatic interface for coding agents to access Google’s complete suite of agentic capabilities, allowing them to build, evaluate, and deploy production-ready agents on your behalf.&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Bringing agent building directly to Agent Studio:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Now, you can move seamlessly from building simple prompts to deploying complex agents in &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/agent-studio/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Studio&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Once you're ready for deep customization, export your logic directly into ADK to continue development in a full-code environment.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Get a head start with pre-built agents: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Access a curated set of agent templates in &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/agent-garden"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Garden&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; — including code modernization, financial analysis, economic research, invoice processing, and more — that serve as immediate building blocks for your multi-agent systems.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Scale AI agents&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To move from a proof-of-concept to a live environment, you need a platform that can handle the performance, state, and security requirements of real-world work. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Powering high-performance agent execution&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The latest Agent Runtime: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Our revamped &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/runtime"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Runtime&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; delivers sub-second cold starts and allows you to provision new agents in seconds.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Support for multi-day workflows:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You can now deploy long-running agents that run autonomously for days at a time. This allows your agents to manage complex, multi-step workflows and deep reasoning tasks that require extended persistence, like managing a sales prospecting sequence. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Autonomous action with security-by-design environments:&lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/sandboxes"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/a&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/sandbox/code-execution-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Sandbox&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;provides a hardened environment to safely execute model-generated code and perform computer use tasks like browser-based automation without risk to your host systems.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent-to-agent orchestration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enables agents to seamlessly delegate tasks to one another, including support for complex, generative, and deterministic orchestration patterns. This ensures that for critical flows such as compliance, your agents follow well-specified paths every time.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Move beyond temporary session data to high-accuracy context&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Personalize interactions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/memory-bank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Memory Bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; dynamically generates and curates long-term memories from conversations. Using new Memory Profiles, agents can recall high-accuracy details with low latency, ensuring context is never lost.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Link AI interactions to your existing records: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Store and manage history using &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/sessions"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Sessions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. With Custom Session IDs, you can use your own unique identifiers to track sessions and map them directly to your internal database and CRM records.&lt;/span&gt;&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Enable real-time, human-like interactions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Using the WebSocket protocol for Bidirectional Streaming, you can help ensure your agents are highly responsive during live customer or employee interactions, processing audio and video without lag.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
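The Custom Session ID idea, keying agent sessions by identifiers your own systems already use, can be sketched as follows. The `SessionStore` class and its field names are illustrative assumptions, not the Agent Sessions API.

```python
# Toy session store keyed by caller-supplied IDs, so a session can be
# looked up later by the same identifier an internal database or CRM
# already uses. Class and field names here are hypothetical.

class SessionStore:
    def __init__(self):
        self._sessions = {}

    def create(self, session_id, user):
        # The caller chooses the ID (e.g. an existing CRM record key),
        # so no separate mapping table is needed.
        self._sessions[session_id] = {"user": user, "events": []}
        return session_id

    def append(self, session_id, event):
        self._sessions[session_id]["events"].append(event)

    def get(self, session_id):
        return self._sessions.get(session_id)

store = SessionStore()
store.create("crm-case-1017", user="ada")
store.append("crm-case-1017", "asked about screening eligibility")
print(store.get("crm-case-1017"))
```

Because the session key is the CRM key itself, support tooling can join agent history against existing records with no extra lookup step.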
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Govern AI agents&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Govern with a secure-by-design architecture that applies enterprise rigor to every agent in your fleet – from the ones you build on Agent Platform to the ones you source from our partner ecosystem.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Manage all of your agents through a single source of truth for identity and access.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Assign every agent a verifiable identity: &lt;/strong&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/runtime/agent-identity"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Identity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; improves the security posture of your agents by ensuring every agent receives a unique cryptographic ID. This creates a clear, auditable trail for every action an agent takes, mapped back to defined &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/scale/runtime/agent-identity"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;authorization policies&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Maintain a central library of approved tools:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Our new &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/govern/agent-registry"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Registry&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; provides a single source of truth for your enterprise. It indexes every internal agent, tool, and skill, simplifying discovery and ensuring only governed, approved assets are available to your users.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Manage your agent fleet from one control point:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/govern/gateways/agent-gateway-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; acts as the air traffic control for your agent ecosystem. It provides secure, unified connectivity between agents and tools across any environment, while enforcing consistent security policies and &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/model-armor?e=48754805&amp;amp;hl=en"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Model Armor&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; protections to safeguard against prompt injection and data leakage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Use AI-powered insights to detect hidden risks and suspicious behavior before they impact your business.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Detect suspicious behavior in real-time:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Agent Anomaly Detection uses statistical models and an LLM-as-a-judge framework to flag unusual reasoning. This works alongside &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/govern/view-security-findings"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Threat Detection&lt;/span&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;to provide visibility into malicious activity, such as reverse shells or connections to known bad IP addresses.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Uncover vulnerabilities automatically:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/govern/view-security-findings"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Security&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; dashboard, powered by &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/security-command-center"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Security Command Center&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, unifies threat detection and risk analysis. It allows your teams to map relationships between agents and models, automate asset discovery, and scan for vulnerabilities in the underlying operating system and language packages.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
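A statistical anomaly flag of the kind described above can be illustrated with a simple z-score check over a per-session metric. The metric (tool calls per session) and the threshold are illustrative assumptions only, not how the product's detector works.

```python
import statistics

# Flag sessions whose tool-call count sits far from the fleet's norm.
# The metric and threshold are illustrative assumptions.

def flag_anomalies(counts, z_threshold=2.0):
    mean = statistics.mean(counts)
    stdev = statistics.stdev(counts)
    if stdev == 0:
        return []  # no variation, nothing stands out
    return [i for i, c in enumerate(counts)
            if abs(c - mean) / stdev > z_threshold]

calls_per_session = [4, 5, 6, 5, 4, 5, 60]  # last one is suspicious
print(flag_anomalies(calls_per_session))
# → [6]
```

A production detector would combine signals like this with semantic review of the agent's reasoning (the LLM-as-a-judge half), since a count alone cannot say whether unusual activity is malicious.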
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Optimize AI agents &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agent Platform gives you the visibility needed to understand how your AI is performing, making it easy to refine their logic and get smarter over time.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Test your agents before they ship&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Simulate realistic conversations: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Use &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/optimize/evaluation/evaluate-simulated"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Simulation&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to test agents against human-like synthetic user interactions and virtualized tools in a controlled environment. Agents are automatically scored based on task success and safety across multi-step conversations.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Monitor and improve in production&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Track live performance: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Use &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/optimize/evaluation/agent-evaluation"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Evaluation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to continuously score agents against live traffic using multi-turn autoraters that can evaluate the logic of an entire conversation, not just a single response. With turnkey dashboards and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/optimize/observability/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Observability&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can visually trace complex reasoning to debug issues as they happen.&lt;/span&gt;&lt;/li&gt;
&lt;li role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Automate agent refinement: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of manually digging through logs, &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/optimize/evaluation/optimize-agent"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Optimizer&lt;/span&gt;&lt;/a&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;automatically clusters real-world failures and suggests refined system instructions to improve accuracy.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Detailed technical guides and a full list of updates are available in our updated &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/release-notes"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;release notes&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;a href="http://cloud.google.com/products/gemini-enterprise-agent-platform"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Platform &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;is the new standard for enterprise agent development, built to help you move from experimentation to production-scale impact, starting today.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform/</guid><category>Developers &amp; Practitioners</category><category>Google Cloud Next</category><category>AI &amp; Machine Learning</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/0_gemini_enterprise_agent_platform.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Introducing Gemini Enterprise Agent Platform, powering the next wave of 
agents</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/0_gemini_enterprise_agent_platform.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Michael Gerstenhaber</name><title>VP, Product Management, Cloud AI</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Michael Bachman</name><title>VP/GM, Cloud Foundations</title><department></department><company></company></author></item><item><title>Next ‘26: Redefining security for the AI era with Google Cloud and Wiz</title><link>https://cloud.google.com/blog/products/identity-security/next26-redefining-security-for-the-ai-era-with-google-cloud-and-wiz/</link><description>&lt;div class="block-aside"&gt;&lt;dl&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The AI era demands a new security era. Organizations are facing the dual challenge of harnessing the potential of AI while &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/threat-intelligence/defending-enterprise-ai-vulnerabilities?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;defending against its malicious use&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and Google Cloud can help you adapt and thrive.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The latest research from Google Cloud shows that adversaries are using AI to &lt;/span&gt;&lt;a href="https://cloud.google.com/transform/new-mandiant-report-boost-basics-with-ai-to-counter-adversaries/"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;accelerate the speed, scale, and sophistication of attacks&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Meanwhile, &lt;/span&gt;&lt;a href="https://cloud.google.com/security/resources/m-trends?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;M-Trends 2026&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; also showed that increased threat actor coordination has driven down the time to hand-off from an initial access to a secondary threat actor from eight hours to 22 seconds in the last three years.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today at Google Cloud Next, we are showcasing how Google Cloud can help you defend against increasingly sophisticated threats at machine speed, protect AI and multicloud environments, and secure cloud workloads at scale. &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;Delivering agentic defense &lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our full-stack AI approach, from the chips to the models, gives you a competitive advantage with better integration and velocity to help protect customers. Not only can Google action insights from the world’s largest threat observatory and Mandiant frontline experts, but we also bring cutting-edge insights and breakthroughs from Google DeepMind, to help make your platforms more secure. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today we are introducing three new agents in &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/security-operations"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Security Operations&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to help you defend at the speed of AI. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Threat Hunting agent&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, now in preview, can help teams proactively hunt for novel attack patterns and stealthy adversary behaviors that bypass traditional defenses. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Detection Engineering agent&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, now in preview, can identify coverage gaps and create new detections for threat scenarios, reducing toil and transforming detection creation from a manual craft into an automated science. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong style="vertical-align: baseline;"&gt;Third-Party Context agent, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;coming soon to preview, can enrich your workflows with contextual data from third-party content. &lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/1_-_Threat_Hunt_Initiation.gif"
        
          alt="1 - Threat Hunt Initiation"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="mhwgf"&gt;Initiating a threat hunt with the Threat Hunting agent&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Triage and Investigation agent&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; processed over &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;5 million alerts&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in the last year, reducing a typical 30-minute manual analysis to 60 seconds with Gemini.&lt;/span&gt;&lt;span style="text-decoration: line-through; vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;“Operational resilience and cybersecurity are the bedrock of customer trust at BBVA. By integrating advanced artificial intelligence, such as the Triage and Investigation agent, we are able to scale in new ways," said Diego Martinez Blanco, head of Security Technology, BBVA. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;“It handles the initial heavy lifting and filters out false positives so we can prioritize issues that require human attention. The agent's transparent explanations allow our team to understand recommendations and ultimately dedicate our resources to more complex investigations,” he said.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can build your own security agents with &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;remote Google Cloud model context protocol (MCP) server support for Google Security Operations&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, now generally available. To make it even easier, you can also access the MCP server client directly from the Google Security Operations chat interface, available in preview. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-pull_quote"&gt;&lt;div class="uni-pull-quote h-c-page"&gt;
  &lt;section class="h-c-grid"&gt;
    &lt;div class="uni-pull-quote__wrapper h-c-grid__col h-c-grid__col--8 h-c-grid__col-m--6 h-c-grid__col-l--6
      h-c-grid__col--offset-2 h-c-grid__col-m--offset-3 h-c-grid__col-l--offset-3"&gt;
      &lt;div class="uni-pull-quote__inner-wrapper h-c-copy h-c-copy"&gt;
        &lt;q class="uni-pull-quote__text"&gt;Organizations leveraging an intelligence-led, AI-augmented approach to modern security operations with Google Cloud&amp;#x27;s agentic defense can realize a strong ROI.&lt;/q&gt;

        
          &lt;cite class="uni-pull-quote__author"&gt;
            
            
              &lt;span class="uni-pull-quote__author-meta"&gt;
                
                  &lt;strong class="h-u-font-weight-medium"&gt;Christopher Kissel&lt;/strong&gt;&lt;br /&gt;
                
                
                  Research Vice President, IDC
                
              &lt;/span&gt;
            
          &lt;/cite&gt;
        
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/section&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/2_-_Threat_Hunt_report.gif"
        
          alt="2 - Threat Hunt report"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="mhwgf"&gt;Findings report created by the Threat Hunting agent&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Security teams can also automate response actions with &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/rsac-26-supercharging-agentic-ai-defense-with-frontline-threat-intelligence"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;agentic automation&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;in Google Security Operations. To further move teams from manual triage to agentic defense, we introduced &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/bringing-dark-web-intelligence-into-the-ai-era"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;dark web intelligence&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in Google Threat Intelligence, now in preview. Internal tests show it can &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;analyze millions of daily external events with 98% accuracy&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to elevate threats that truly matter.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;"IDC found that organizations experienced measurable operational gains, including substantial reductions in mean time to detect and mean time to respond, fewer false positives, and higher analyst productivity with AI-powered context and automation. These operational improvements translate into significant &lt;/span&gt;&lt;a href="https://services.google.com/fh/files/misc/gti_idc_business_value_report.pdf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;business outcomes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, such as shorter disruption periods, lower incident-related costs, and improved executive confidence in security posture and decision-making," said Christopher Kissel, research vice president, IDC. "Organizations leveraging an intelligence-led, AI-augmented approach to modern security operations with Google Cloud's agentic defense can realize a strong ROI." &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong style="vertical-align: baseline;"&gt;New partner-supported workflows for Google Security Operations&lt;/strong&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are also announcing a robust cohort of &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/next26-announcing-new-partner-supported-workflows-for-google-security-operations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;new partner integrations&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for Google Security Operations. Designed to deliver high-fidelity security workflows right out of the box, our latest participating Google Cloud Security integration ecosystem partners include Darktrace, Gigamon, and SAP.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Protecting AI and cloud applications across any infrastructure&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI and cloud applications are built across multiple platforms and models. To protect them end-to-end, we want to make it easier and faster to mitigate risk, regardless of where and how you build. This support includes major cloud environments like Amazon Web Services, Google Cloud, Microsoft Azure, and Oracle Cloud; software-as-a-service (SaaS) environments like OpenAI; and even custom hosted environments. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/google-completes-acquisition-of-wiz?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Wiz, now a part of Google Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, expands and deepens our ability to protect the apps you build and run. Wiz empowers you to quickly and securely adopt AI, while also helping protect the AI development lifecycle. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Wiz announced its &lt;/span&gt;&lt;a href="https://www.wiz.io/blog/introducing-wiz-ai-app" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;AI-Application Protection Platform&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (AI-APP) at the RSA Conference, providing deep visibility, risk posture, and runtime analysis for your AI applications. Wiz also announced &lt;/span&gt;&lt;a href="https://www.wiz.io/blog/introducing-wiz-agents" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Wiz Security Agents&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://www.wiz.io/blog/introducing-wiz-workflows" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Wiz Workflows&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, helping you identify and respond to risks and threats at machine speed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we’re taking our commitment to secure customers in any cloud, platform, and AI environment further. Wiz now &lt;/span&gt;&lt;a href="https://www.wiz.io/blog/wiz-databricks-security-graph" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;supports Databricks&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; as well as new agent studios like AWS Agentcore, Gemini Enterprise Agent Platform, Microsoft Azure Copilot Studio, and Salesforce Agentforce, so customers gain visibility however their teams choose to build.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition, Wiz continues to support security ecosystems with integrations to the outer layer of the cloud, including &lt;/span&gt;&lt;a href="http://wiz.io/blog/wiz-apigee-integration-for-api-discovery" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Apigee&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://www.cloudflare.com/press/press-releases/2026/cloudflare-partners-with-wiz-to-secure-the-global-ai-attack-surface/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloudflare AI Security for Apps&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and the &lt;/span&gt;&lt;a href="https://www.wiz.io/blog/introducing-wiz-vercel-integration" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vercel platform&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, further extending the power of the Wiz Security Graph. We’ve also updated how we integrate security detections from Wiz Defend with Google Security Operations and Mandiant Threat Defense to help analysts more easily configure automatic threat information forwarding.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Wiz is also announcing new capabilities designed to secure the AI-native development lifecycle, helping teams to innovate faster and more securely:  &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Secure vibe-coded applications: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Wiz is announcing a new integration, generally available in May, that runs Wiz security scanning directly inside the Lovable platform so vulnerabilities, secrets, and misconfigurations caught by Wiz surface in Lovable's built-in security view, right where teams are already building.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Secure AI-generated code&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Wiz removes risks from AI-generated code the moment it is created. Inline AI security hooks integrate directly into IDEs and agent workflows to evaluate prompts and scan AI-generated output instantly, injecting security guardrails before the code is ever committed.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent-based remediation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Wiz Skills equip coding agents and AI-native IDEs with full code-to-cloud context and validated attack surface findings from the Wiz Security Graph. These capabilities enable teams to trigger automated, agent-driven remediation workflows either locally from the developer's individual IDE or globally at the repository and pull request level within your version control system.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Eliminate shadow AI&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Wiz’s dynamic &lt;/span&gt;&lt;a href="https://www.wiz.io/academy/ai-security/ai-bom-ai-bill-of-materials" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI-Bill of Materials&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (AI-BOM)&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; automatically inventories all AI frameworks, models, and IDE extensions across your environment. This provides complete visibility into what is writing code across your stack, allowing you to track sanctioned corporate tools like Gemini Code Assist and GitHub Copilot while simultaneously uncovering unapproved shadow AI plugins.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can learn more about the &lt;/span&gt;&lt;a href="https://wiz.io/blog/wiz-at-google-cloud-next" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Wiz announcements here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Securing your agents and the agentic web&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In addition to securing your cloud and AI workloads, Google Cloud’s secure-by-design foundation can help you innovate at the speed of AI — from agents to fraud defense to the web.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Securing and governing agents with the Gemini Enterprise Agent Platform&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;To build, orchestrate, govern, and optimize agents&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;today we are announcing &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/introducing-gemini-enterprise-agent-platform"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini Enterprise Agent Platform&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; including:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Identity&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to enable access management and &lt;/span&gt;&lt;a href="https://cloud.google.com/transform/these-4-ai-governance-tips-help-counter-shadow-agents"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI governance at scale&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Our new&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;capability provides agents unique identities to operate autonomously with specific authentication flows, and with scoped human delegation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Agent Gateway, &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;which&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;enables policy enforcement for all agent-to-agent and agent-to-tool connections. It governs your enterprise agent traffic and understands agent protocols like MCP and Agent2Agent (A2A) to inspect and secure every agent interaction.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Model Armor&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;our runtime protection for model and agent interactions, now integrates with Agent Gateway, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Runtime&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;/a&gt;&lt;a href="https://docs.cloud.google.com/model-armor/model-armor-langchain-integration"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Langchain&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; available in preview, and &lt;/span&gt;&lt;a href="https://firebase.google.com/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Firebase&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, generally available, to help developers add inline enforcement and sanitization of agent traffic and interactions without the need to change code. These integrations expand Model Armor's protection against runtime risks such as prompt injections, tool poisoning, and sensitive data leakage across &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/model-armor/integrations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud services and our AI portfolio&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Securing the agentic web with Google Cloud Fraud Defense and Chrome Enterprise&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Today, we are evolving reCAPTCHA with the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/introducing-google-cloud-fraud-defense-the-next-evolution-of-recaptcha"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;launch of &lt;/span&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Fraud Defense&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, generally available. This comprehensive platform is designed to discern the legitimacy and authorization of bots, humans, and agents. Using the same scale and signals that protect Google’s own ecosystem, Fraud Defense will soon offer in preview agent-specific capabilities for human users and AI agents that can help secure the digital commerce journey, from account creation and login to payment and checkout.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our commitment to securing AI extends to the browser, a vital endpoint for interacting with AI. &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/chrome-enterprise/new-ways-to-navigate-the-ai-era-with-googles-enterprise-platforms-and-devices"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Chrome Enterprise&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides comprehensive data protection for the AI era with the visibility and controls needed to embrace AI safely without compromising corporate data:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;AI-aware extension threat detections&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, now in preview, can surface advanced extension telemetry that helps security teams detect and respond to anomalous AI agent activity. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;New &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;shadow AI reporting&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, generally available soon, can help you gain visibility into the shadow AI landscape by flagging employee use of unsanctioned web-based AI and SaaS applications. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong style="vertical-align: baseline;"&gt;What’s new in Trusted Cloud&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We continue to offer new security controls and enhance capabilities across identity, data, and  networking on our cloud platform to help you secure your environments. Today we’re announcing the following updates:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Simplifying permissions with modern IAM&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;To help achieve least privilege quickly and simply, we’ve streamlined our predefined roles catalog with easy-to-use administrator, editor, and viewer roles, such as the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/iam/docs/role-picker-gemini"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;IAM role picker&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and the ability to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/docs/authentication/reauthentication"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;re-authenticate sensitive actions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Data security&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We are announcing several new capabilities for our cloud platform data security portfolio to help protect your most sensitive data and accelerate AI transformation.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Confidential Computing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In partnership with NVIDIA, today we’re announcing &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/confidential-computing"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Confidential Computing&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; support for G4 VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, featuring NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs on Google Compute Engine (GCE) Confidential G4 VMs, available in preview globally, to help strengthen confidentiality and integrity for a wide spectrum of sensitive AI workloads. In partnership with Intel, we’re also introducing the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preview of C4 Confidential VMs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, bringing Intel TDX to 6th Gen Xeon processors to help protect diverse AI and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/compute/c4-vms-based-on-intel-6th-gen-xeon-granite-rapids-now-ga"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;analytics workloads&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; while providing industry-leading compute density and performance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Key Management Services (KMS)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We are announcing the new &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Confidential External Key Manager (cEKM)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; in preview, giving you the flexibility to host and protect external keys in any region and maintain verifiable control within a confidential environment.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Post-quantum cryptography (PQC)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: We are introducing &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;KMS Quantum Safe Key Imports&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, available&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;in preview, to help you bring your own keys with quantum-safe algorithms. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Secret Manager&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: To help prevent password leaks and mitigate prompt injection risks, we are announcing the general availability of the native integration of our &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Secret Manager with Agent Development Kit&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
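To make the Secret Manager item above concrete, here is a minimal sketch of how an agent could pull a secret at runtime instead of hard-coding it. The native Agent Development Kit integration announced here is not shown, so this uses the standard google-cloud-secret-manager client directly; the project and secret names are placeholders.

```python
# Sketch: reading an API key from Secret Manager at agent-tool runtime,
# so the value never lives in source code or agent prompts. The native
# Agent Development Kit integration is not shown here; this uses the
# standard google-cloud-secret-manager client directly.

def secret_version_name(project: str, secret_id: str, version: str = "latest") -> str:
    # Secret Manager addresses each secret version by a fully
    # qualified resource name.
    return f"projects/{project}/secrets/{secret_id}/versions/{version}"

def fetch_secret(project: str, secret_id: str, version: str = "latest") -> str:
    # Lazy import so the helper above stays usable without the client
    # library installed; requires `pip install google-cloud-secret-manager`.
    from google.cloud import secretmanager

    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(
        request={"name": secret_version_name(project, secret_id, version)}
    )
    return response.payload.data.decode("utf-8")
```

An agent tool would call `fetch_secret("my-project", "reddit-api-key")` (hypothetical names) once at startup and hand the value to the downstream client, rather than embedding it in a prompt where it could leak through prompt injection.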
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Network security &lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud’s Cross-Cloud Network security products offer several new capabilities:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud NGFW: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We’re announcing the &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/firewall?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud NGFW&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;advanced malware sandbox&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, in preview later this year, to help defend against highly evasive zero-day threats. This capability is powered by &lt;/span&gt;&lt;a href="https://www.paloaltonetworks.com/apps/pan/public/downloadResource?pagePath=/content/pan/en_US/resources/datasheets/advanced-wildfire" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Palo Alto Networks Advanced Wildfire&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, trained on data from &lt;/span&gt;&lt;a href="https://www.paloaltonetworks.com/apps/pan/public/downloadResource?pagePath=/content/pan/en_US/resources/datasheets/advanced-wildfire" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;more than 70,000 Palo Alto Networks customers to stop 99% of known and unknown malware&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Armor: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We have released new &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/armor/docs"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Armor&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; managed rules, powered by Thales Imperva&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;and&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;available in preview, to detect Layer 7 application attacks and zero-day CVEs (like &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/identity-security/responding-to-cve-2025-55182"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;React2Shell&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;). &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Advancing Google Cloud security with SCC&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;As our Google Cloud-native security solution, Security Command Center (SCC) establishes a cloud security baseline to protect both your traditional and AI applications on Google Cloud:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;AI agents, models, and MCP servers are secured by providing continuous discovery and comprehensive risk analysis to identify threats, vulnerabilities, and misconfigurations.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;SCC will add deep runtime visibility to uncover shadow AI for your Google Cloud workloads. Coming soon in preview, SCC will automatically discover unmanaged agentic workloads — including agents, MCP servers hosted on Cloud Run, GKE, and inference endpoints running on GKE, and surface those as posture findings in SCC.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Our enhanced &lt;/span&gt;&lt;a href="https://cloud.google.com/security/products/security-command-center?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Security Command Center Standard tier&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides data security posture management, compliance, vulnerability management, and risk analysis to help any Google Cloud customer establish strong security, compliance and risk coverage from the start at no additional costs. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Take the next step&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When you make Google part of your security team, you gain the power of an intelligence-driven, AI-native defense; the freedom of an open cloud that’s secure-by-design; and the industry's most-battle tested experts as an extension of your organization. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more on these new innovations and how you can secure what’s next, &lt;/span&gt;&lt;a href="https://www.googlecloudevents.com/next-vegas/session-library?session_id=3818847&amp;amp;name=secure-what&amp;amp;" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;tune in to watch our security spotlight&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. And be sure to check out the many great security breakout sessions — live and on-demand — to learn more about all of our Next ‘26 announcements.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 22 Apr 2026 12:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/identity-security/next26-redefining-security-for-the-ai-era-with-google-cloud-and-wiz/</guid><category>AI &amp; Machine Learning</category><category>Networking</category><category>Developers &amp; Practitioners</category><category>Google Cloud Next</category><category>Security &amp; Identity</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_3_Dark.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Next ‘26: Redefining security for the AI era with Google Cloud and Wiz</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_3_Dark.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/products/identity-security/next26-redefining-security-for-the-ai-era-with-google-cloud-and-wiz/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Francis deSouza</name><title>COO, Google Cloud and President, Security Products</title><department></department><company></company></author></item><item><title>From keynote to the terminal: Join our Next ‘26 developer 
livestreams</title><link>https://cloud.google.com/blog/topics/developers-practitioners/join-our-next26-developer-livestreams/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The main stage at Google Cloud Next is where the vision is set. This year, we’re bridging the gap between those massive "Cloud-scale" announcements and your local terminal.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We are thrilled to announce the Next ‘26 developer livestreams, a daily broadcast live from the show floor at Google Cloud Next. We aren't just reporting the news, we’re deconstructing it into actionable demos and immediate workflows before the keynote seats are even cold.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What to expect&lt;/span&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Real-time demos that turn inspiration into versioning.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Energy from the show floor delivered straight to your screen.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Interviews with the builders, community leaders, and disruptors moving at light speed.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Schedule&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 1: From the Next ‘26 main stage to the terminal&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;When: Wednesday, April 22, beginning at 11 AM PT&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=m9HeWXndjAU"
      data-glue-modal-trigger="uni-modal-m9HeWXndjAU-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/yt_vid_2_ctenqNy.max-1000x1000.png);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;From the Next ‘26 main stage to the terminal&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-m9HeWXndjAU-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="m9HeWXndjAU"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=m9HeWXndjAU"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Immediately following the opening keynote host Jason Davenport kicks things off with special guests including Acquired's Ben Gilbert and David Rosenthal to get their reaction to the day’s announcements. Then we dive into the hardware and platforms powering the next wave of AI with Addy Osmani, Shubham Saboo, Philip Kelly of Baseten, Yasmeen Ahmad, and other surprise guests. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Day 2: Next ‘26 Developer keynote deep-dive &lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;When: Thursday, April 23, beginning at 12 PM PT&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=JemyjTlOvy0"
      data-glue-modal-trigger="uni-modal-JemyjTlOvy0-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        

        &lt;div class="article-video__aspect-image"
          style="background-image: url(https://storage.googleapis.com/gweb-cloudblog-publish/images/yt_vid_1.max-1000x1000.png);"&gt;
          &lt;span class="h-u-visually-hidden"&gt;Next ‘26 Developer keynote deep-dive&lt;/span&gt;
        &lt;/div&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-JemyjTlOvy0-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="JemyjTlOvy0"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=JemyjTlOvy0"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Fresh off the Developer Keynote, we’re taking the tech to the terminal. We’ll be live-coding agentic workflows and testing new announcements in real-world scenarios.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Host Stephanie Wong will sit down with Michele Catasta (President &amp;amp; Head of AI at Replit). We’ll also feature a "hot off the press" breakdown with Google Cloud’s Sarah Kennedy and Ricky Robinett, plus a security deep dive with Ankur Kotwal and Wiz’s Salman Ladha. And hear from LangChain’s Harrison Chase, and conversations with Googlers Kevin Moore and Ines Envid, and more!&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Where to watch&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Don’t just watch the news — build it with us. We’ll be streaming live across all your favorite platforms. Bookmark the links below and set your reminders now!&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/@googlecloudtech/streams" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Tech YouTube&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/GoogleCloudTech" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Tech X&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/showcase/google-cloud/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud LinkedIn&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.com/channels/1009525727504384150/@home" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Discord&lt;/span&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Replays will be available on-demand.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Next digital pass &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make sure you don't miss any of the action, claim your complimentary digital pass today. Stream select breakout and Spotlight sessions, catch the big keynote announcements as they drop, and enjoy short-form videos – all from wherever you happen to be. Plus, your digital ticket unlocks special offers once Next wraps up. &lt;/span&gt;&lt;a href="https://www.googlecloudevents.com/next-vegas/developer-experiences?utm_source=cgc-blog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=FY26-Q2-GLOBAL-GLO27877-physicalevent-er-next26-mc-105752&amp;amp;utm_content=cgc-blog-lp-devs&amp;amp;utm_term=-" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Register now&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;! &lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What’s next after Next? Stay agent-ready with GEAR&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The conversation around AI agents is moving fast. Want to stay in the loop? &lt;/span&gt;&lt;a href="https://developers.google.com/program/gear?utm_source=cgc-blog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=FY-26-Q2-GEAR-sign-up&amp;amp;utm_content=livestream-blog-cgc&amp;amp;utm_term=-" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Join the Gemini Enterprise Agent Ready (GEAR)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; program and get access to curated news and learning materials from the experts at Google.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 21 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/join-our-next26-developer-livestreams/</guid><category>AI &amp; Machine Learning</category><category>Google Cloud Next</category><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/Next_26_developer_livestreams.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>From keynote to the terminal: Join our Next ‘26 developer livestreams</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/Next_26_developer_livestreams.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/join-our-next26-developer-livestreams/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Google Cloud Content &amp; Editorial </name><title></title><department></department><company></company></author></item><item><title>Introducing the Builders Hub from the Google Developer 
Program</title><link>https://cloud.google.com/blog/topics/developers-practitioners/introducing-the-builders-hub-from-the-google-developer-program/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Today’s developer experience is often spread across dozens of consoles, documentation pages, and sites. We know that the friction of jumping between surfaces can slow down the most important part of your day: building.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To solve this, we are introducing &lt;a href="http://builders.google" rel="noopener" target="_blank"&gt;Builders Hub&lt;/a&gt; within Google Developer Program as a new centralized service designed to provide developers with a unified entry point, a workbench for projects, and resources—including personalized suggestions for community engagement and learning.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/builders_hub_-_cropped_1080_height_1.max-1000x1000.png"
        
          alt="builders hub - cropped 1080 height (1)"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you are a vibe coder, an AI Builder, or a professional developer, Builders Hub has something to offer you. Learn more about how the new Builders Hub &lt;em&gt;&lt;strong&gt;helps you move faster&lt;/strong&gt;&lt;/em&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;A Frictionless "Front Door"&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/projects_-_full_list_grid_1.max-1000x1000.png"
        
          alt="projects - full list grid (1)"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Getting started should be measured in seconds, not hours. Builders Hub eliminates onboarding complexity by providing a unified activation point for all Google developer tools.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Unified Project Dashboard&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Access and view all of your Google Cloud, Firebase and AI Studio projects and apps from a single destination. You can now see at a glance exactly which services are enabled across your entire environment without hopping between separate consoles.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Personalized Learning &amp;amp; Interests&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Receive tailored recommendations and compatible interest suggestions based on the specific services you’ve selected. Builders Hub understands your tech stack and serves up the most relevant learning paths to help you master new tools faster.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Integrated Workbench: Build While You Learn&lt;/span&gt;&lt;/h2&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/codelabs_-_Cloud_next_26_filtered_1.max-1000x1000.png"
        
          alt="codelabs - Cloud next &amp;#x27;26 filtered (1)"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We’re moving beyond static documentation. The new Builders Hub introduces an interactive environment where learning and execution happen side-by-side, allowing you to focus on innovation.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Integrated Credits &amp;amp; Seamless Execution&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Unlock Google Cloud credits directly within a Codelab to get started with zero friction. This seamless flow allows you to spin up real environments immediately, so you can learn by doing without the traditional operational toil of manual billing or account setup.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Showcase Your Proficiency with Badges:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Every milestone counts. Unlock and showcase digital badges that highlight your specific achievements and skill sets. These credentials allow you to prove your proficiency to the global community and potential employers.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Grow With Your Career&lt;/span&gt;&lt;/h2&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/Your_Profile_1.max-1000x1000.png"
        
          alt="Your Profile (1)"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With Builders Hub, Google Developer Program is no longer just a place to start—it’s where you build a legacy. We’ve expanded the Hub to prioritize community and professional recognition, giving you the tools to turn your technical proficiency into career-defining milestones.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Discover Local Communities &amp;amp; Events:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Connect with other builders in your backyard. The Hub now features a dedicated discovery engine for communities and local events, making it easier than ever to build your network and find your tribe.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The Pulse of Google Developers&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Stay connected with an integrated feed of upcoming events and recent blog posts from across all of Google’s developer channels, curated directly within your workbench.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Get Started&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The transition to agentic, AI-driven development requires a new set of tools and a more integrated experience. Builders Hub is built to be your workbench for this next era.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Access the new Builders Hub today&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; by signing into Google Developer Program at &lt;/span&gt;&lt;a href="http://builders.google" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;builders.google&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 21 Apr 2026 13:26:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/introducing-the-builders-hub-from-the-google-developer-program/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/hero_image_-_blog_post.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Introducing the Builders Hub from the Google Developer Program</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/hero_image_-_blog_post.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/introducing-the-builders-hub-from-the-google-developer-program/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Chris Demeke</name><title>Group Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Bala Muthukrishnan</name><title>Senior Product Manager</title><department></department><company></company></author></item><item><title>Create Expert Content: Deploying a Multi-Agent System with Terraform and Cloud Run</title><link>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-deploying-a-multi-agent-system-with-terraform-and-cloud-run/</link><description>&lt;div 
class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal: a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the first three parts of this series, we laid the essential groundwork by establishing its core capabilities and local verification process:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1"&gt;part 1&lt;/a&gt;, &lt;span style="vertical-align: baseline;"&gt;we standardize the agent's capabilities through the Model Context Protocol (MCP), connecting it to Reddit for trend discovery and Google Cloud Docs for technical grounding. In &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run?utm_campaign=CDR_0x91b1edb5_default_b8022895&amp;amp;utm_medium=external&amp;amp;utm_source=social"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;part 2&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, we built a multi-agent architecture and integrated the Vertex AI memory bank to allow the system to learn and persist user preferences across different conversations. In &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;part 3&lt;/a&gt;, we verified the full end-to-end lifecycle locally using a dedicated test runner to ensure that research, content creation, and cloud-based memory retrieval were perfectly synchronized.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you’d like to dive straight into the code, you can clone the repository &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment to Cloud Run and the Path to Production&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To help you transition from this local prototype to a production service, this final part focuses on building the production backbone of your agent using the foundational deployment patterns provided by the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Starter Pack&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. We will implement the essential structural components required for monitoring, data integrity, and long-term state management in the cloud. You will learn to implement the application server and helper utilities needed for a production-ready deployment before provisioning secure, reproducible infrastructure with Terraform.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While the Dockerfile packages your agent's code and its specialized dependencies, such as Node.js for the Reddit MCP tool, Terraform is used to build the platform it lives on. Terraform automates the creation of your Artifact Registry, least-privilege service accounts, and Secret Manager integrations to ensure your API keys remain protected.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By the end of this part, you will have a standardized application framework deployed on Google Cloud Run and a roadmap for graduating your prototype through continuous evaluation, CI/CD and advanced observability.&lt;/span&gt;&lt;/p&gt;
&lt;h2 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Production Utilities and Server: Building the System's Body&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this section, you implement the structural components required for monitoring and long-term state management in the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The Application Server:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Initializing the FastAPI server and establishing a vital connection to the Vertex AI memory bank.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Implementing Telemetry: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Enabling 'Agent Traces' for visibility into internal reasoning.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Application Server &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;fast_api_app.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; file serves as the vital entry point for your agent, transforming the core logic into a production FastAPI server that acts as the "body" of your system. When deploying to Cloud Run, this server is essential because it provides the necessary web interface to listen for incoming HTTP requests and dispatch them to the agent for processing. Beyond basic serving, its most critical role is establishing a connection to the Vertex AI memory bank by defining a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;MEMORY_URI&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which allows the ADK framework to persist and retrieve user preferences across different production sessions. Additionally, the application server initializes production-grade telemetry for real-time monitoring.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Go back to the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent folder.&lt;/code&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;cd ..&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8dd60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste the following code in &lt;/span&gt;&lt;code&gt;dev_signal_agent/fast_api_app.py&lt;/code&gt;: &lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import os\r\nfrom fastapi import FastAPI\r\nfrom google.adk.cli.fast_api import get_fast_api_app\r\nfrom google.cloud import logging as cloud_logging\r\nfrom vertexai import agent_engines\r\nfrom dev_signal_agent.app_utils.env import init_environment\r\n\r\n# --- Initialization &amp;amp; Secure Secret Retrieval ---\r\n# We now unpack the SECRETS dictionary returned by our updated env.py\r\nPROJECT_ID, MODEL_LOC, SERVICE_LOC, SECRETS = init_environment()\r\nlogger = cloud_logging.Client().logger(__name__)\r\n\r\n# Access sensitive credentials from the SECRETS dictionary \r\n# These keys stay in memory and are NOT injected into os.environ\r\nREDDIT_CLIENT_ID = SECRETS.get(&amp;quot;REDDIT_CLIENT_ID&amp;quot;)\r\nREDDIT_CLIENT_SECRET = SECRETS.get(&amp;quot;REDDIT_CLIENT_SECRET&amp;quot;)\r\nREDDIT_USER_AGENT = SECRETS.get(&amp;quot;REDDIT_USER_AGENT&amp;quot;)\r\nDK_API_KEY = SECRETS.get(&amp;quot;DK_API_KEY&amp;quot;)\r\n\r\n# --- Configuration &amp;amp; Sessions ---\r\nAGENT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\r\n# Non-sensitive configuration uses environment variables \r\nBUCKET = os.environ.get(&amp;quot;AI_ASSETS_BUCKET&amp;quot;) \r\nUSE_IN_MEMORY = os.environ.get(&amp;quot;USE_IN_MEMORY_SESSION&amp;quot;, &amp;quot;&amp;quot;).lower() in (&amp;quot;true&amp;quot;, &amp;quot;1&amp;quot;)\r\n\r\n# --- MEMORY BANK CONNECTION ---\r\ndef _get_memory_bank_uri():\r\n    if USE_IN_MEMORY: return None, None\r\n    # We use \&amp;#x27;dev_signal_agent\&amp;#x27; as the display name for the Vertex AI memory bank\r\n    name = os.environ.get(&amp;quot;AGENT_ENGINE_MEMORY_BANK_NAME&amp;quot;, &amp;quot;dev_signal_agent&amp;quot;) \r\n    existing = list(agent_engines.list(filter=f&amp;quot;display_name={name}&amp;quot;))\r\n    ae = existing[0] if existing else agent_engines.create(display_name=name)\r\n    uri = f&amp;quot;agentengine://{ae.resource_name}&amp;quot;\r\n    print(f&amp;quot;DEBUG: Connecting to Memory Bank: {uri} (display_name={name})&amp;quot;)\r\n    return uri, uri\r\n\r\nSESSION_URI, MEMORY_URI = _get_memory_bank_uri()\r\n\r\n# --- Initialize FastAPI with ADK ---\r\napp: FastAPI = get_fast_api_app(\r\n    agents_dir=AGENT_DIR,\r\n    web=True,\r\n    artifact_service_uri=f&amp;quot;gs://{BUCKET}&amp;quot; if BUCKET else None,\r\n    allow_origins=os.getenv(&amp;quot;ALLOW_ORIGINS&amp;quot;, &amp;quot;&amp;quot;).split(&amp;quot;,&amp;quot;) if os.getenv(&amp;quot;ALLOW_ORIGINS&amp;quot;) else None,\r\n    session_service_uri=SESSION_URI,\r\n    memory_service_uri=MEMORY_URI, # &amp;lt;--- Connects the Memory Bank\r\n    otel_to_cloud=True,            # &amp;lt;--- Enables production telemetry\r\n)\r\n\r\nif __name__ == &amp;quot;__main__&amp;quot;:\r\n    import uvicorn\r\n    # Standard Cloud Run port is 8080 \r\n    uvicorn.run(app, host=&amp;quot;0.0.0.0&amp;quot;, port=8080)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8de20&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Implementing Telemetry&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a production environment, visibility into your agent's reasoning is critical. We leverage the built-in observability features of the Google ADK by setting the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;otel_to_cloud=True&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; flag in our application server. This single parameter handles the majority of the instrumentation automatically, exporting "Agent Traces" directly to the Google Cloud Console. These traces provide a "visual waterfall" of the agent's operation, including individual agent thought processes, LLM invocations, and MCP tool calls.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Monitoring vs. Targeted Evaluation&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;It is essential to understand that production tracing is subject to sampling to balance performance and cost. Because Cloud Run captures only a subset of requests, not every individual user interaction will be visible.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;System Traces (Monitoring):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Used to analyze behavior "at large," such as identifying latency bottlenecks or system timeouts.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reasoning Traces (Evaluation):&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; High-quality evaluation mandates targeted trace capture. This means calling the agent specifically for a test case where you know you will evaluate that particular request in full detail.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
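&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make targeted trace capture concrete, here is a minimal Python sketch of invoking the deployed agent for one known test case, so that that single request's trace can later be inspected in full. The endpoint path and payload shape follow common ADK FastAPI conventions (a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;/run&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; endpoint taking an &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;app_name&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;/&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;new_message&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; body), and the service URL is a placeholder; verify both against your own deployment:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```python
import json
import urllib.request


def build_run_payload(app_name: str, user_id: str, session_id: str, text: str) -> dict:
    # Request body shape assumed from the ADK FastAPI server's /run endpoint.
    return {
        "app_name": app_name,
        "user_id": user_id,
        "session_id": session_id,
        "new_message": {"role": "user", "parts": [{"text": text}]},
    }


def run_eval_case(base_url: str, payload: dict) -> dict:
    # Deliberately call the agent for one known test case; note the time and
    # session ID so the matching trace is easy to find in Trace Explorer.
    req = urllib.request.Request(
        f"{base_url}/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example (hypothetical service URL):
# result = run_eval_case(
#     "https://dev-signal-xxxx-uc.a.run.app",
#     build_run_payload("dev_signal_agent", "eval-user", "eval-session-1",
#                       "Summarize this week's top Cloud Run questions"),
# )
```

&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Because you initiated the request yourself, you know exactly which trace to pull up and evaluate end to end, rather than hoping the sampler captured an organic user interaction.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;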
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Viewing the Trace&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To see your traces, navigate to the Trace Explorer in the Google Cloud Console and filter for your service (e.g., &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;). Clicking a specific Trace ID opens a Gantt chart that allows you to distinguish between cognitive reasoning failures (wrong decisions) and physical system issues (timeouts).&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/trace.max-1000x1000.png"
        
          alt="trace"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For advanced configurations, refer to the following documentation:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/run/docs/trace#trace_sampling_rate?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run Trace Sampling&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/stackdriver/docs/instrumentation/ai-agent-adk#configure?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Configuring ADK Telemetry&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/trace/docs/collect-view-multimodal-prompts-responses?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Multimodal Trace Capture&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://google.github.io/adk-docs/integrations/bigquery-agent-analytics/" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;BigQuery Agent Analytics Integration&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Infrastructure as Code: Provisioning Secure Cloud Resources&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We utilize the infrastructure-as-code patterns provided by the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Starter Pack&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;'s security-first design. The starter pack builds the professional platform required to automate the creation of least-privilege service accounts and robust secret management in seconds.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Using Terraform ensures that your entire Google Cloud environment - from IAM roles to Secret Manager versions - is defined in reproducible, secure code. We break our infrastructure into the following logical blocks:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Resources &amp;amp; Variables&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Define the specific project, region, and sensitive API secrets used by the agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Core Infrastructure&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Enable essential APIs and provision a private Artifact Registry to host your agent's container images.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Identity &amp;amp; Access Management (IAM)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Configure specialized Service Accounts that strictly follow the Principle of Least Privilege to ensure your system remains secure.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Secret Management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Securely ingest API credentials into Google Secret Manager for protected runtime access.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Run Configuration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Define the container environment, resource limits, and automated secret injection for the final deployment.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To begin provisioning, return to the root folder of your project (dev-signal) and create the necessary deployment directories:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;cd ..\r\nmkdir deployment\r\ncd deployment\r\nmkdir terraform\r\ncd terraform&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8dee0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Terraform Resources and Variables&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;variables.tf&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; file defines the configurable parameters for your deployment, allowing you to customize the infrastructure without altering the underlying logic. It includes variables for the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;project_id&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, the deployment &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;region&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (defaulting to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;us-central1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;), and the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;service_name&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for your Cloud Run instance. Furthermore, it defines a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;secrets&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; map used to securely ingest sensitive API credentials—such as Reddit and Developer Knowledge keys—into Google Secret Manager for runtime access. This modular approach ensures your production environment remains reproducible, secure, and adaptable across different projects.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste the following code into &lt;/span&gt;&lt;code&gt;deployment/terraform/variables.tf&lt;/code&gt;:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;variable &amp;quot;project_id&amp;quot; {\r\n description = &amp;quot;The Google Cloud Project ID&amp;quot;\r\n type        = string\r\n}\r\nvariable &amp;quot;region&amp;quot; {\r\n description = &amp;quot;The Google Cloud region to deploy to&amp;quot;\r\n type        = string\r\n default     = &amp;quot;us-central1&amp;quot;\r\n}\r\nvariable &amp;quot;service_name&amp;quot; {\r\n description = &amp;quot;The name of the Cloud Run service&amp;quot;\r\n type        = string\r\n default     = &amp;quot;dev-signal&amp;quot;\r\n}\r\nvariable &amp;quot;secrets&amp;quot; {\r\n description = &amp;quot;A map of secret names and their values (e.g., REDDIT_CLIENT_ID, DK_API_KEY)&amp;quot;\r\n type        = map(string)\r\n default     = {}\r\n}\r\nvariable &amp;quot;ai_assets_bucket&amp;quot; {\r\n description = &amp;quot;The GCS bucket for storing AI assets&amp;quot;\r\n type        = string\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d9d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Core Infrastructure Logic &lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We define our infrastructure in logical blocks. Here is what each part does:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Enable APIs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Ensures the project has the necessary services active (Cloud Run, Vertex AI, etc.). We use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;disable_on_destroy = false&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to prevent accidental data loss if the Terraform is destroyed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste the following code into &lt;/span&gt;&lt;code&gt;deployment/terraform/main.tf&lt;/code&gt;:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;resource &amp;quot;google_project_service&amp;quot; &amp;quot;services&amp;quot; {\r\n  project = var.project_id\r\n  for_each = toset([\r\n    &amp;quot;run.googleapis.com&amp;quot;,\r\n    &amp;quot;artifactregistry.googleapis.com&amp;quot;,\r\n    &amp;quot;cloudbuild.googleapis.com&amp;quot;,\r\n    &amp;quot;aiplatform.googleapis.com&amp;quot;,\r\n    &amp;quot;secretmanager.googleapis.com&amp;quot;,\r\n    &amp;quot;logging.googleapis.com&amp;quot;\r\n  ])\r\n  service            = each.key\r\n  disable_on_destroy = false\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d340&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Artifact Registry&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Creates a private Docker registry to store our agent's container images.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;resource &amp;quot;google_artifact_registry_repository&amp;quot; &amp;quot;repo&amp;quot; {\r\n  location      = var.region\r\n  project       = var.project_id\r\n  repository_id = &amp;quot;dev-signal-repo&amp;quot;\r\n  description   = &amp;quot;Docker repository for Dev Signal Agent&amp;quot;\r\n  format        = &amp;quot;DOCKER&amp;quot;\r\n  depends_on    = [google_project_service.services]\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d850&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;3. Service Account &amp;amp; IAM: Adhering to the Principle of Least Privilege&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; - This is a critical security step. In accordance with the Principle of Least Privilege, we avoid using the default compute service account and instead provision a dedicated user-managed service account (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal-sa&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;). By designating this as the Cloud Run service identity, we can grant it only the minimum necessary permissions—specifically &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;roles/aiplatform.user&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;roles/logging.logWriter&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;roles/storage.objectAdmin&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This granular access control ensures that the agent has the exact permissions required to interact with Vertex AI and Cloud Storage without over-granting access to other sensitive cloud resources, significantly reducing the potential impact of a compromised account. 
Learn more &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/iam/docs/best-practices-service-accounts?content_ref=because%20a%20service%20account%20is%20a%20principal%20you%20must%20limit%20its%20privileges%20to%20reduce%20the%20potential%20harm%20that%20can%20be%20done%20by%20a%20compromised%20service%20account&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;best practices for using service accounts securely&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;resource &amp;quot;google_service_account&amp;quot; &amp;quot;agent_sa&amp;quot; {\r\n  project      = var.project_id\r\n  account_id   = &amp;quot;${var.service_name}-sa&amp;quot;\r\n  display_name = &amp;quot;Dev Signal Agent Service Account&amp;quot;\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8da30&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
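&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The block above creates the service account itself; the role bindings described earlier are not shown. As an illustrative sketch (not the repository's exact code), the three roles named above could be granted with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google_project_iam_member&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; resources like this:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```hcl
# Illustrative: bind only the minimum roles named above to the agent's service account.
resource "google_project_iam_member" "agent_roles" {
  for_each = toset([
    "roles/aiplatform.user",     # Vertex AI access (models, memory bank)
    "roles/logging.logWriter",   # write application logs
    "roles/storage.objectAdmin", # read/write AI assets in Cloud Storage
  ])
  project = var.project_id
  role    = each.key
  member  = "serviceAccount:${google_service_account.agent_sa.email}"
}
```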
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;4. &lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;Secret Management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This handles your API keys securely. It creates secrets in Google Secret Manager and gives the agent's Service Account permission to access them at runtime.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;resource &amp;quot;google_secret_manager_secret&amp;quot; &amp;quot;agent_secrets&amp;quot; {\r\n  project   = var.project_id\r\n  for_each  = toset(keys(var.secrets))\r\n  secret_id = each.key\r\n  replication {\r\n    auto {}\r\n  }\r\n  depends_on = [google_project_service.services]\r\n}\r\n\r\nresource &amp;quot;google_secret_manager_secret_version&amp;quot; &amp;quot;agent_secrets_version&amp;quot; {\r\n  for_each    = toset(keys(var.secrets))\r\n  secret      = google_secret_manager_secret.agent_secrets[each.key].id\r\n  secret_data = var.secrets[each.key]\r\n}\r\n\r\nresource &amp;quot;google_secret_manager_secret_iam_member&amp;quot; &amp;quot;secret_accessor&amp;quot; {\r\n  project   = var.project_id\r\n  for_each  = toset(keys(var.secrets))\r\n  secret_id = google_secret_manager_secret.agent_secrets[each.key].id\r\n  role      = &amp;quot;roles/secretmanager.secretAccessor&amp;quot;\r\n  member    = &amp;quot;serviceAccount:${google_service_account.agent_sa.email}&amp;quot;\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d910&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;5. Cloud Run Configuration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Security Best Practice:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To satisfy production security standards, our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;main.tf&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; grants the Service Account the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;secretmanager.secretAccessor&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; role. Our Python application then uses the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/secret-manager/docs/best-practices#coding-practices"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Secret Manager SDK &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;to pull these credentials directly into local memory at runtime, ensuring they never touch the container's environment configuration&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# 5. Cloud Run Service Deployment\r\nresource &amp;quot;google_cloud_run_v2_service&amp;quot; &amp;quot;default&amp;quot; {\r\n  project  = var.project_id\r\n  name     = var.service_name\r\n  location = var.region\r\n  ingress  = &amp;quot;INGRESS_TRAFFIC_ALL&amp;quot;\r\n\r\n  template {\r\n    service_account = google_service_account.agent_sa.email\r\n    \r\n    containers {\r\n      image = &amp;quot;us-docker.pkg.dev/cloudrun/container/hello&amp;quot; # Placeholder until first build\r\n     \r\n      env {\r\n        name  = &amp;quot;GOOGLE_CLOUD_PROJECT&amp;quot;\r\n        value = var.project_id\r\n      }\r\n      env {\r\n        name  = &amp;quot;GOOGLE_CLOUD_LOCATION&amp;quot;\r\n        value = &amp;quot;global&amp;quot;\r\n      }\r\n      env {\r\n        name  = &amp;quot;GOOGLE_GENAI_USE_VERTEXAI&amp;quot;\r\n        value = &amp;quot;True&amp;quot;\r\n      }\r\n      env {\r\n        name  = &amp;quot;AI_ASSETS_BUCKET&amp;quot;\r\n        value = var.ai_assets_bucket\r\n      }\r\n\r\n      resources {\r\n        limits = {\r\n          cpu    = &amp;quot;1&amp;quot;\r\n          memory = &amp;quot;2Gi&amp;quot;\r\n        }\r\n      }\r\n    }\r\n  }\r\n  \r\n  traffic {\r\n    type    = &amp;quot;TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST&amp;quot;\r\n    percent = 100\r\n  }\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d880&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Provision the Infrastructure&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before we can deploy our code, we need to provision the Google Cloud infrastructure we just defined.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Initialize Terraform&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This downloads the necessary provider plugins. Run this in the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;deployment/terraform&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; folder&lt;/span&gt;:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;terraform init&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8db20&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Create a Variables File&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;deployment/terraform/terraform.tfvars&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and update it with your project details and secrets.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;project_id = &amp;quot;your-project-id&amp;quot;\r\nregion     = &amp;quot;us-central1&amp;quot;\r\nservice_name      = &amp;quot;dev-signal&amp;quot;\r\nai_assets_bucket  = &amp;quot;your-bucket-name&amp;quot;\r\nsecrets = {\r\n  REDDIT_CLIENT_ID     = &amp;quot;your_client_id&amp;quot;\r\n  REDDIT_CLIENT_SECRET = &amp;quot;your_client_secret&amp;quot;\r\n  REDDIT_USER_AGENT    = &amp;quot;your_user_agent&amp;quot;\r\n  DK_API_KEY           = &amp;quot;your_dk_api_key&amp;quot;\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d040&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Plan configuration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This allows you to review the changes before they are applied. Run this in the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;deployment/terraform&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; folder:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;terraform plan -out=plan.tfplan&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d1f0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Apply Configuration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Once you have reviewed the plan and confirmed it does what you want, run:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;terraform apply plan.tfplan&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8dcd0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment: Containerization and the Cloud Build Pipeline&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this final stage of the build process, we package our agent's "body" and "brain" into a portable, production-ready container. This ensures that every component - from our Python logic to the Node.js environment required for the Reddit MCP tool - is bundled together with its exact dependencies.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We utilize a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dockerfile&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to define this environment and a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Makefile&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to orchestrate the deployment pipeline. When you trigger the deployment, &lt;/span&gt;&lt;a href="https://console.cloud.google.com/cloud-build/builds" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Build&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; takes your local source code, builds the container image according to the Dockerfile, and stores it in the private Artifact Registry created earlier by Terraform. Finally, the pipeline automatically updates your Cloud Run service to serve traffic using this fresh image, completing the journey from local code to a live, secure cloud workload.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code&gt;dev-signal/Dockerfile&lt;/code&gt;:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;FROM python:3.12-slim\r\n\r\n# Install Node.js and npm for MCP tools (like reddit-mcp)\r\nRUN apt-get update &amp;amp;&amp;amp; apt-get install -y \\\r\n    curl \\\r\n    &amp;amp;&amp;amp; curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \\\r\n    &amp;amp;&amp;amp; apt-get install -y nodejs \\\r\n    &amp;amp;&amp;amp; npm install -g reddit-mcp \\\r\n    &amp;amp;&amp;amp; apt-get clean \\\r\n    &amp;amp;&amp;amp; rm -rf /var/lib/apt/lists/*\r\n\r\nRUN pip install --no-cache-dir uv==0.8.13\r\n\r\nWORKDIR /code\r\n\r\nCOPY ./pyproject.toml ./README.md ./uv.lock* ./\r\nCOPY ./dev_signal_agent ./dev_signal_agent\r\n\r\nRUN uv sync --frozen\r\n\r\nEXPOSE 8080\r\n\r\nCMD [&amp;quot;uv&amp;quot;, &amp;quot;run&amp;quot;, &amp;quot;uvicorn&amp;quot;, &amp;quot;dev_signal_agent.fast_api_app:app&amp;quot;, &amp;quot;--host&amp;quot;, &amp;quot;0.0.0.0&amp;quot;, &amp;quot;--port&amp;quot;, &amp;quot;8080&amp;quot;]&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8dc40&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Makefile&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; automates the build and deploys.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code&gt;dev-signal/Makefile&lt;/code&gt;:&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;PROJECT_ID ?= $(shell gcloud config get-value project)\r\nREGION     ?= us-central1\r\nIMAGE_REPO ?= dev-signal-repo\r\nIMAGE      := $(REGION)-docker.pkg.dev/$(PROJECT_ID)/$(IMAGE_REPO)/agent:latest\r\n\r\n# Deploy via Cloud Build &amp;amp; Container\r\ndocker-deploy:\r\n\t@echo &amp;quot;Building and deploying to $(PROJECT_ID) via Cloud Build...&amp;quot;\r\n\tgcloud builds submit --tag $(IMAGE) --project $(PROJECT_ID) .\r\n\tgcloud run services update dev-signal \\\r\n\t\t--image $(IMAGE) \\\r\n\t\t--region $(REGION) \\\r\n\t\t--project $(PROJECT_ID) \\\r\n\t\t--labels dev-tutorial=dev-signal-agent&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d790&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy Application&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now that our infrastructure is ready, we can build and deploy the application code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Run the following command from the root of your project:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;make docker-deploy&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d100&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What happens when you run this?&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Google Cloud Build takes your local code and the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;Dockerfile&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, builds a container image, and stores it in the Artifact Registry.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deploy&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It updates the Cloud Run service defined in Terraform to use this new image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When the deployment completes, you should get a message like this:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;code style="vertical-align: baseline;"&gt;Service [dev-signal] revision [dev-signal...] has been deployed and is serving 100 percent of traffic.&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code style="vertical-align: baseline;"&gt;Service URL: https://dev-signal-...-.us-central1.run.app&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Verification: Accessing and Testing Your Deployed Agent&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Since production services are private by default, this section covers how to grant permissions and access the agent securely.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Managing IAM Permissions:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Granting the necessary &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;run.invoker&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; role to authorized users.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Secure Access via Cloud Run Proxy:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Using the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gcloud&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; proxy to interact with your live service.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Granting User Permissions&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before you can invoke the service, you must grant your Google account the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;roles/run.invoker&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; role for this specific service. Run the following command:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud run services add-iam-policy-binding dev-signal \\\r\n  --member=&amp;quot;user:$(gcloud config get-value account)&amp;quot; \\\r\n  --role=&amp;quot;roles/run.invoker&amp;quot; \\\r\n  --region=us-central1 \\\r\n  --project=$(gcloud config get-value project)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8dfd0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Launch the Proxy&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, access your private service securely via the proxy:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud run services proxy dev-signal \\\r\n  --region us-central1 \\\r\n  --project $(gcloud config get-value project)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271c8d3d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Visit &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;http://localhost:8080&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to chat with your deployed agent! S&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;ee a possible test scenario in &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;part 3&lt;/a&gt; of the series.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Congratulations! You have successfully built &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;What we covered:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1"&gt;&lt;strong style="vertical-align: baseline;"&gt;Tooling (MCP)&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: You connected your agent to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Reddit&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Docs&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, and a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Local Image Generator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; using the Model Context Protocol.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run"&gt;&lt;strong style="vertical-align: baseline;"&gt;Architecture&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: You implemented a &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Root Orchestrator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; managing specialized agents (Scanner, Expert, Drafter).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: You integrated &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to give your agent long-term persistence across sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Production&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: You deployed the entire stack to &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud Run&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; using &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Terraform&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for secure, reproducible infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You now have a solid foundation for building sophisticated, stateful AI applications on Google Cloud.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 17 Apr 2026 08:56:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-deploying-a-multi-agent-system-with-terraform-and-cloud-run/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Deploying a Multi-Agent System with Terraform and Cloud Run</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-deploying-a-multi-agent-system-with-terraform-and-cloud-run/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Building Event-Driven Data Agents with BigQuery, Pub/Sub, and ADK</title><link>https://cloud.google.com/blog/topics/developers-practitioners/building-event-driven-data-agents-with-bigquery-pubsub-and-adk/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Need for Real-Time Autonomous Agents&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Data is only as valuable as your ability to act on it. In the modern enterprise, reacting to events hours—or even minutes—after they occur is often too late. Whether you're dealing with financial fraud or dynamic supply chain disruptions, every second counts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But many systems still rely on slow scheduled batch jobs or fragile microservices that constantly poll for changes. By the time a problem surfaces, it's often too late. That leaves human investigators scrambling to piece things together by digging through logs and database queries. It's a slow, painful process that just doesn't scale.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Enter Event-Driven Data Agents&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;What if, instead of waiting for slow pipelines and manual triage, your data platform could instantly push an alert as soon as an anomaly is detected, triggering an autonomous AI agent to investigate and resolve it?&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is the promise of the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Event-Driven Data Agent&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; architecture. By combining &lt;/span&gt;&lt;a href="https://cloud.google.com/bigquery"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;BigQuery&lt;/strong&gt;&lt;/a&gt;&lt;strong style="vertical-align: baseline;"&gt; continuous queries&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/pubsub/docs/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Pub/Sub&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-development-kit/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;ADK Agents&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; on &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Agent Engine&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can build a pipeline that triages events in real time and autonomously investigates them. The agent uses advanced reasoning to gather context, analyze the data, and either resolve the issue on the spot or escalate it to a person when human-in-the-loop intervention is needed.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Hybrid Architecture: How it Works&lt;/span&gt;&lt;/h2&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/blog_image2.max-1000x1000.jpg"
        
          alt="blog_image2"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This event-driven pipeline leverages three core building blocks:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Detection:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; BigQuery continuous queries monitor live data streams and detect anomalies using a rules-based engine.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Routing:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Pub/Sub reliably delivers these events, using Single Message Transforms (SMTs) to reshape the payloads into the exact format your AI agents expect, thereby triggering the agentic pipeline to start its investigation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Resolution:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A Vertex AI Agent (built with ADK) receives the event, investigates using custom tools, and logs its decision.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let’s dive in and explore each component. To make this concrete, we'll walk through a simple use case: detecting and investigating fraudulent financial transactions in real-time.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Part 1: BigQuery Continuous Queries&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/bigquery/docs/continuous-queries-introduction"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;BigQuery continuous queries&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; allow you to build real-time event streams natively using standard SQL. They are persistent SQL queries that run continuously, analyzing incoming data and immediately exporting SQL results to destinations like Pub/Sub.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The shift from pulling to pushing streaming events natively in BigQuery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; means you can detect complex anomalies (like a user transacting in two different countries within a user-specified window) within your data warehouse using standard SQL. There’s no need to move your data to a separate streaming analytics engine.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This transformation is powered by the launch of BigQuery continuous query &lt;/span&gt;&lt;a href="https://cloud.google.com/bigquery/docs/continuous-queries#stateful_processing_with_joins_and_windowing_aggregations"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;stateful data processing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in public preview, which introduces native support for stream-to-stream JOINs, windowed aggregations, and tumbling windows. By allowing you to correlate disparate data streams and calculate complex metrics—such as rolling averages or sum totals—directly in BigQuery, we are democratizing stream processing for any SQL user. This eliminates the need for specialized external tools or deep data science expertise to build a real-time 'System of Action' that detects and reacts to events as they happen. This approach also helps manage LLM token costs; by using stateful SQL to filter for specific anomalies, you ensure that your agents only process the exact context they need, rather than overwhelming them with raw data.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Implementing this is straightforward.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By combining a standard SQL query with an EXPORT DATA statement, you can &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/bigquery/docs/continuous-queries#export-pubsub"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;route matching rows directly into a Pub/Sub topic&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; the second they occur:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;EXPORT DATA OPTIONS (\r\n  format = &amp;quot;CLOUD_PUBSUB&amp;quot;,\r\n  uri = &amp;quot;https://pubsub.googleapis.com/projects/YOUR_PROJECT_ID/topics/cymbal-bank-escalations-topic&amp;quot;\r\n) AS (\r\n  WITH TransactionHeuristics AS (\r\n    SELECT\r\n      *,\r\n      _CHANGE_TIMESTAMP AS bq_changed_ts,\r\n    FROM APPENDS(TABLE `cymbal_bank.retail_transactions`, CURRENT_TIMESTAMP() - INTERVAL 10 MINUTE)\r\n  )\r\n  SELECT\r\n    TO_JSON_STRING(STRUCT(\r\n      window_end,\r\n      user_id,\r\n      COUNT(*) AS tx_count,\r\n      SUM(amount) AS total_window_spend,\r\n      MAX_BY(merchant_name, amount) AS highest_value_merchant,\r\n      MAX_BY(merchant_category_code, amount) AS highest_value_mcc,\r\n      100 AS final_risk_score,\r\n      STRUCT(\r\n        APPROX_COUNT_DISTINCT(location_country) &amp;gt; 1 AS is_impossible_travel,\r\n        LOGICAL_OR(NOT is_trusted_device) AS has_security_mismatch\r\n      ) AS logic_signals\r\n    )) AS data\r\n  FROM TUMBLE(TABLE TransactionHeuristics, &amp;quot;bq_changed_ts&amp;quot;, INTERVAL 2 MINUTE)\r\n  GROUP BY window_start, window_end, user_id\r\n  HAVING APPROX_COUNT_DISTINCT(location_country) &amp;gt; 1\r\n);&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-sql&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726c387df0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
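The core of the detection logic above — bucket a user's change events into 2-minute tumbling windows and flag any window that spans more than one country — can be sketched in plain Python (the record shape here is hypothetical, for reasoning about the SQL only):

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # 2-minute tumbling windows, matching the SQL TUMBLE interval

def flag_impossible_travel(events):
    """events: iterable of (user_id, epoch_seconds, country) tuples.

    Returns {(user_id, window_start): countries} for windows spanning more than
    one country, mirroring HAVING APPROX_COUNT_DISTINCT(location_country) > 1.
    """
    windows = defaultdict(set)
    for user_id, ts, country in events:
        window_start = ts - (ts % WINDOW_SECONDS)  # tumbling (non-overlapping) bucket
        windows[(user_id, window_start)].add(country)
    return {key: countries for key, countries in windows.items() if len(countries) > 1}

events = [
    ("u1", 0, "US"),
    ("u1", 60, "FR"),   # same 2-minute window, second country -> flagged
    ("u2", 10, "US"),
    ("u2", 130, "US"),  # different window, same country -> not flagged
]
# flag_impossible_travel(events) -> {("u1", 0): {"US", "FR"}}
```

Because the windows tumble rather than slide, each event lands in exactly one bucket, which is what keeps the SQL version cheap to run continuously.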
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Part 2: Pub/Sub &amp;amp; Single Message Transforms (SMT)&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Bridging the schema gap with Pub/Sub.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The exported event data from our continuous query is sent directly to a Pub/Sub topic. Before this raw data can be consumed by our AI agent, the payload needs to be transformed to match the schema expected by our agent.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of deploying something like a dedicated Cloud Function to reformat these messages, you can handle it entirely within the Pub/Sub subscription using a &lt;/span&gt;&lt;a href="https://cloud.google.com/pubsub/docs/smts/smts-overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Single Message Transform (SMT)&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. SMTs allow you to run lightweight, inline &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/pubsub/docs/smts/udfs-overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;JavaScript User-Defined Functions (UDFs) &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;directly within Pub/Sub to map, reshape, or clean the payload on the fly.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For instance, you can define a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;transform.yaml&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; with a JavaScript snippet that intercepts the BigQuery payload and wraps it in the exact &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;query&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; format our Agent Engine expects:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;function process(res) {\r\n  let bq_payload = JSON.parse(res.message.data);\r\n  res.message.data = JSON.stringify({&amp;quot;query&amp;quot;: bq_payload});\r\n  return res;\r\n}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-js&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726c3876d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
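The effect of that UDF is easy to reason about locally: take the raw message data (the JSON row BigQuery exported) and wrap it under a `query` key. A Python mirror of the same reshaping, useful for unit-testing the transform logic before you wire it into the subscription:

```python
import json

def transform(message_data: str) -> str:
    """Mirror of the SMT UDF: parse the exported BigQuery row and wrap it
    as {"query": <row>} so the payload matches what the agent expects."""
    bq_payload = json.loads(message_data)
    return json.dumps({"query": bq_payload})

# Example with a hypothetical exported row:
raw = json.dumps({"user_id": "u1", "tx_count": 3, "final_risk_score": 100})
wrapped = json.loads(transform(raw))
# wrapped["query"]["user_id"] == "u1"
```

Keeping the transform this small is the point of SMTs: the reshaping runs inline in Pub/Sub, so there is no extra service to deploy, scale, or monitor.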
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;To configure the routing pipeline&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;, you create a Pub/Sub Push Subscription. This subscription automatically pushes every transformed BigQuery event directly to your AI agent's webhook endpoint:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;gcloud pubsub subscriptions create cymbal-bank-escalations-sub \
  --topic=projects/$PROJECT_ID/topics/cymbal-bank-escalations-topic \
  --message-transforms-file=setup/transform.yaml \
  --push-endpoint="https://YOUR_AGENT_WEBHOOK_URL" \
  --push-no-wrapper \
  --ack-deadline=600 \
  --push-auth-service-account="adk-agent-sa@$PROJECT_ID.iam.gserviceaccount.com"&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/pub_sub_screenshot.max-1000x1000.png"
        
          alt="pub_sub_screenshot"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Notice the push-endpoint parameter above. This webhook URL is generated by our final architectural piece: the AI Agent itself.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Part 3: ADK and Vertex AI Agent Engine&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When an agent is deployed to &lt;/span&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/overview"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Agent Engine&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the platform automatically provisions a secure &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;streamQuery&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; endpoint specifically designed to receive these incoming events.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;This is the brain of the operation.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Once an anomaly is detected and routed via Pub/Sub, the message triggers an ADK agent deployed on Vertex AI.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
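Once the agent is deployed, the push subscription's webhook URL is just that deployment's streamQuery URL. As a rough sketch (the helper function and sample resource name below are illustrative, not part of the official SDK), you can derive it from the Agent Engine resource name returned at deployment time:

```python
# Sketch: derive the Pub/Sub push endpoint from a deployed Agent Engine's
# resource name (e.g. the resource_name of the object returned by the
# Vertex AI SDK deployment call). Helper and sample values are illustrative.

def stream_query_endpoint(resource_name: str) -> str:
    """Build the streamQuery URL that Pub/Sub pushes transformed events to."""
    # resource_name has the form:
    #   projects/PROJECT/locations/LOCATION/reasoningEngines/ENGINE_ID
    parts = resource_name.split("/")
    location = parts[parts.index("locations") + 1]
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"{resource_name}:streamQuery"
    )

print(stream_query_endpoint(
    "projects/my-project/locations/us-central1/reasoningEngines/42"
))
# https://us-central1-aiplatform.googleapis.com/v1/projects/my-project/locations/us-central1/reasoningEngines/42:streamQuery
```

This is the value you would paste in place of `https://YOUR_AGENT_WEBHOOK_URL` in the subscription command above.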
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/vertex_ai_agent_engine.max-1000x1000.png"
        
          alt="vertex_ai_agent_engine"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;To implement the reasoning loop,&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; you define your agent, equipped with tools, and deploy it:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;investigation_agent = Agent(
    model="gemini-2.5-flash",
    name="fraud_investigation_agent",
    description="Expert fraud analyst agent that autonomously investigates alerts...",
    instruction=(
        "You are an expert fraud investigator for Cymbal Bank. "
        "Your goal is to investigate financial transaction alerts, "
        "determine if they are fraudulent, and take appropriate action. "
        "Use the BigQuery toolset to analyze data in the transactions table... "
        "Use the Google Search toolset to search for the merchant... "
        "Consolidate your findings and use the escalate_to_human tool if required..."
    ),
    tools=[
        bigquery_toolset,
        google_search,
    ],
)&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Equipped with specific instructions and this custom toolset, the agent autonomously investigates the alert by actively gathering external context. It can query BigQuery for a user’s transaction history, analyze unstructured data like receipts, or ground its findings with Google Search to verify a merchant's reputation. Ultimately, it categorizes the transaction as a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;FALSE_POSITIVE&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or flags it as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;ESCALATION_NEEDED&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Human-in-the-Loop Advantage&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This approach is central to the architecture's scalability. By effectively filtering out the noise, it dramatically reduces operational overhead and ensures that your investigators only spend their time on the most complex cases. And since ADK offers an impressive array of &lt;/span&gt;&lt;a href="https://google.github.io/adk-docs/integrations/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;tools and integrations&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can have your agent escalate events to a wide range of enterprise systems for human-in-the-loop engagement, or even automate pipelines end-to-end with human-on-the-loop observability.&lt;/span&gt;&lt;/p&gt;
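For instance, the escalate_to_human tool referenced in the agent's instruction could be as simple as a plain Python function registered as an ADK tool. The sketch below is hypothetical: the parameters, verdict labels, and escalation side effects are placeholders for whichever enterprise system you integrate.

```python
# Hypothetical sketch of an escalate_to_human function tool.
# In ADK, a plain Python function with a docstring can be passed in the
# agent's tools list; everything below is illustrative, not a shipped API.

def escalate_to_human(transaction_id: str, verdict: str, summary: str) -> dict:
    """Escalate a flagged transaction to a human fraud investigator."""
    if verdict not in ("FALSE_POSITIVE", "ESCALATION_NEEDED"):
        return {"status": "error", "detail": f"unknown verdict {verdict!r}"}
    # A real implementation might open a ticket, publish to a
    # case-management topic, or page an on-call analyst here.
    return {
        "status": "ok",
        "case": {"id": transaction_id, "verdict": verdict, "summary": summary},
    }

result = escalate_to_human("txn-1042", "ESCALATION_NEEDED",
                           "Merchant reputation could not be verified.")
print(result["status"])  # ok
```

Returning a structured dict (rather than raising) lets the model read the tool's outcome and adjust its reasoning on the next turn.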
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Bringing it All Together: Agent Analytics&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once your pipeline is live, the work shifts from building to monitoring. Unlike traditional software, autonomous agents run persistently in the background. Because they operate behind the scenes, having deep observability into &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;what&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; they are doing, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;how long&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; they take, and &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;how much&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; they cost is critical.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;By initializing the &lt;/span&gt;&lt;a href="https://adk.dev/integrations/bigquery-agent-analytics/" rel="noopener" target="_blank"&gt;&lt;code style="text-decoration: underline; vertical-align: baseline;"&gt;BigQuery Agent Analytics plugin&lt;/code&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; during deployment, the ADK automatically logs all trace data, tool usage, and execution latency directly into BigQuery:&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/bigquery_results3.max-1000x1000.png"
        
          alt="bigquery_results3"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By joining this trace data with the structured decisions output by your agent, you unlock rich analytics. This enables you to build dynamic dashboards and set up custom alerts to monitor your AI workforce in real-time. You &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/adk-bigquery-agent-analytics-plugin" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;can check out this Codelab&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to learn more about using and the Agent Analytics Plugin.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Conclusion&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The convergence of real-time data streaming and Agentic AI is changing how we handle operational alerts.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Detect in real-time&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with BigQuery continuous queries.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Transform and Route&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with Pub/Sub SMTs.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Investigate and Resolve&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with Vertex AI Agent Engine.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Analyze&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; with BigQuery Agent Analytics Plugin&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This architecture enables you to build a proactive, autonomous workforce capable of handling anomalies the moment they occur—all within a governed, scalable, and serverless Google Cloud environment.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to get hands-on?&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://codelabs.developers.google.com/bigquery-adk-event-driven-agents" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Check out our codelab&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for a step-by-step guide on how to build this Cymbal Bank pipeline from scratch!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 10 Apr 2026 21:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/building-event-driven-data-agents-with-bigquery-pubsub-and-adk/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/blog_hero_image_final.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Building Event-Driven Data Agents with BigQuery, Pub/Sub, and ADK</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/blog_hero_image_final.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/building-event-driven-data-agents-with-bigquery-pubsub-and-adk/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Rachael Deacon-Smith</name><title>Developer Advocate, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Nick Orlove</name><title>BigQuery Product Manager</title><department></department><company></company></author></item><item><title>Migrating to Google Cloud’s Application Load Balancer: A practical guide</title><link>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Migrating your existing 
application load balancer infrastructure from an on-premises hardware solution to Cloud Load Balancing offers substantial advantages in scalability, cost-efficiency, and tight integration within the Google Cloud ecosystem. Yet, a fundamental question often arises: "What about our current load balancer configurations?"&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Existing on-premises load balancer configurations often contain years of business-critical logic for traffic manipulation. The good news is that not only can you fully migrate existing functionalities, but this migration also presents a significant opportunity to modernize and simplify your traffic management.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;This guide outlines a practical approach for migrating your existing load balancer to Google Cloud’s Application Load Balancer. It addresses common functionalities, leveraging both its declarative configurations and the innovative, event-driven Service Extensions edge compute capability.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;A simple, phased approach to migration&lt;/span&gt;&lt;/h3&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Transitioning from an imperative, script-based system to a cloud-native, declarative-first model requires a structured plan. We recommend a straightforward, four-phase approach.&lt;/span&gt;&lt;/p&gt;
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 1: Discovery and mapping&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Before commencing any migration, you must understand what you have. Analyze and categorize your current load balancer configurations. What is each rule's intent? Is it performing a simple HTTP-to-HTTPS redirect? Is it engaged in HTTP header manipulation (addition or removal)? Or is it handling complex, custom authentication logic? &lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Most configurations typically fall into two primary categories:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Common patterns:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Logic that is common to most web applications, such as redirects, URL rewrites, basic header manipulation, and IP-based access control lists (ACLs).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: circle; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Bespoke business logic:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Complex logic unique to your application, like custom proprietary token authentication, advanced header extraction / replacement, dynamic backend selection based on HTTP attributes, or HTTP response body manipulation. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 2: Choose your Google Cloud equivalent&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Once your rules are categorized, the next step involves mapping them to the appropriate Google Cloud feature. This is not a one-to-one replacement; it's a strategic choice.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Option 1: the declarative path (for ~80% of rules)&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;For the majority of common patterns, leveraging the Application Load Balancer's built-in declarative features is usually the best approach. Instead of a script, you define the desired state in a configuration file. This is simpler to manage, version-control, and scale.&lt;/span&gt;&lt;/p&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Common patterns to declarative feature mapping:  &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Redirects/rewrites&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Application Load Balancer URL maps&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;ACLs/throttling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Google Cloud Armor security policies&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="3" style="list-style-type: square; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Session persistence&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; -&amp;gt; &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;backend service configuration&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Option 2: The programmatic path (for complex, bespoke rules)&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;When dealing with complex, bespoke business logic, you have a programmatic equivalent: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/service-extensions/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Service Extensions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, a powerful edge compute capability that allows you to inject custom code (written in Rust, C++ or Go) directly into the load balancer's data path. This approach gives you flexibility in a modern, managed, and high-performance framework.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_bkebSe1.max-1000x1000.jpg"
        
          alt="image1"&gt;
        
        &lt;/a&gt;
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="s1mli"&gt;This flowchart helps you decide the appropriate Google Cloud feature for each configuration&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 3: Test and validate&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Once you’ve chosen the appropriate path for your configurations, you are ready to &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;deploy your new Application Load Balancer configuration in a staging environment that mirrors your production setup. Thoroughly test all application functionality, paying close attention to the migrated logic. Use a combination of automated testing and manual QA to validate the redirects, security policies, and that the custom Service Extensions logic are behaving as expected.&lt;/span&gt;&lt;/p&gt;
&lt;h4 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Phase 4: Phased cutover (canary deployment)&lt;/span&gt;&lt;/h4&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Don't flip a single switch for all your traffic; instead, implement a phased migration strategy. Start the transitioning process by routing a small percentage of production traffic (e.g., 5-10%) to your new Google Cloud load balancer. During this initial period, be sure to monitor key metrics like latency, error rates, and application performance. As you gain confidence, you can progressively increase the percentage of traffic routed to the Application Load Balancer. Always have a clear rollback plan to revert back to the legacy infrastructure in the event you encounter critical issues.&lt;/span&gt;&lt;/p&gt;
&lt;h3 style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Best practices for a smooth migration&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Drawing from our practical experience, we have compiled the following recommendations to assist you in planning your load balancer migrations. &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Analyze first, migrate second:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; A thorough analysis of your existing configurations is the most critical step. Don't "lift and shift" logic that is no longer needed.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Prefer declarative:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Always default to Google Cloud's managed, declarative features (URL Maps, Cloud Armor) first. They are simpler, more scalable, and require less maintenance.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Use Service Extensions strategically:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Reserve Service Extensions for the complex, bespoke business logic that declarative features cannot handle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Monitor everything:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Continuously monitor both your existing load balancers and Google Cloud load balancers during the migration. Watch key metrics like traffic volume, latency, and error rates to detect and address issues instantly.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation" style="text-align: justify;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Train your team:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Ensure your team is trained on Cloud Load Balancing concepts. This will empower them to effectively operate and maintain the new infrastructure.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p style="text-align: justify;"&gt;&lt;span style="vertical-align: baseline;"&gt;Migrating from the existing on-premises load balancer infrastructure is more than just a technical task, it's an opportunity to modernize your application delivery. By thoughtfully mapping your current load balancing configurations and capabilities to either declarative Application Load Balancer features or programmatic Service Extensions, you can build a more scalable, resilient, and cost-effective infrastructure destined for future demands.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started, review the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/load-balancing/docs/application-load-balancer"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Application Load Balancer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/service-extensions/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Service Extensions&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; features and advanced capabilities to come up with the right design for your application. For more guidance and complex use cases, contact your &lt;/span&gt;&lt;a href="https://cloud.google.com/contact"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud team&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 10 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</guid><category>Cloud Migration</category><category>Developers &amp; Practitioners</category><category>Networking</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Migrating to Google Cloud’s Application Load Balancer: A practical guide</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/migrate-on-prem-application-load-balancing-to-google-cloud/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gopinath Balakrishnan</name><title>Customer Engineer, Google Cloud</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Xiaozang Li</name><title>Customer Engineer, Google 
Cloud</title><department></department><company></company></author></item><item><title>Create Expert Content: Local Testing of a Multi-Agent System with Memory</title><link>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built Dev Signal: a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1"&gt;part 1&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run?utm_campaign=CDR_0x91b1edb5_default_b8022895&amp;amp;utm_medium=external&amp;amp;utm_source=social"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;part 2&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of this series, we established the essential groundwork by standardizing the core capabilities through the Model Context Protocol (MCP) and constructing a multi-agent architecture integrated with the Vertex AI memory bank to provide long-term intelligence and persistence. Now, we'll explore how to test your multi-agent system locally!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you’d like to dive straight into the code and explore it at your own pace, you can clone the repository &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Testing the Agent Locally&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before transitioning your agentic system to Google Cloud Run, it is essential to ensure that its specialized components work seamlessly together on your workstation. This testing phase allows you to validate trend discovery, technical grounding, and creative drafting within a local feedback loop, saving time and resources during the development process.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this section, you will configure your local secrets, implement environment-aware utilities, and use a dedicated test runner to verify that Dev Signal can correctly retrieve user preferences from the Vertex AI memory bank on the cloud. This local verification ensures that your agent's "brain" and "hands" are properly synchronized before moving to deployment.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Environment Setup&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;.env&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; file in your project root. These variables are used for local development and will be replaced by Terraform/Secret Manager in production.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal/.env&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and update it with your own details.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Note&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GOOGLE_CLOUD_LOCATION&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; is set to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;global&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; because that is where Gemini-3-flash-preview is supported. We will use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;GOOGLE_CLOUD_LOCATION&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the model location.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Google Cloud Configuration\r\nGOOGLE_CLOUD_PROJECT=your-project-id\r\nGOOGLE_CLOUD_LOCATION=global\r\nGOOGLE_CLOUD_REGION=us-central1\r\nGOOGLE_GENAI_USE_VERTEXAI=True\r\nAI_ASSETS_BUCKET=your_bucket_name\r\n\r\n# Reddit API Credentials\r\nREDDIT_CLIENT_ID=your_client_id\r\nREDDIT_CLIENT_SECRET=your_client_secret\r\nREDDIT_USER_AGENT=my-agent/0.1\r\n\r\n# Developer Knowledge API Key\r\nDK_API_KEY=your_api_key&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f72714d2fa0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
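As a minimal illustration (a sketch, not python-dotenv's actual implementation), the work that `load_dotenv()` performs on a file like the one above can be approximated as: parse `KEY=VALUE` lines, skip comments and blanks, and populate `os.environ` without clobbering variables that are already set.

```python
import os

def load_env_text(text: str) -> dict:
    """Sketch of load_dotenv(): parse KEY=VALUE lines, skip comments and
    blank lines, and set each variable in os.environ only if it is not
    already defined. Returns the parsed values as a dictionary."""
    loaded = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        # Existing environment variables take precedence over the file.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded

# Two of the variables from the .env file above:
env = load_env_text(
    "# Google Cloud Configuration\n"
    "GOOGLE_CLOUD_LOCATION=global\n"
    "GOOGLE_CLOUD_REGION=us-central1\n"
)
```

This precedence (real environment wins over the `.env` file) is what lets the same code run unchanged in production, where Terraform/Secret Manager supplies the values instead.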
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Helper Utilities&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for your application utils.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;cd dev_signal_agent\r\nmkdir app_utils\r\ncd app_utils&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f72714d20d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Environment configuration &lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This module standardizes how the agent discovers the active Google Cloud Project and Region, ensuring a seamless transition between development environments. Using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_dotenv()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, the script first checks for local configurations before falling back to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google.auth.default()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or environment variables to retrieve the Project ID. This automated approach ensures your agent is properly authenticated and grounded in the correct cloud context without requiring manual configuration changes.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond basic project discovery, the script provides a robust &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Secret Management&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; layer. It attempts to resolve sensitive credentials, such as Reddit API keys, first from the local environment (for rapid development) and then dynamically from the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/secret-manager/docs/reference/rest?rep_location=me-central2&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Secret Manager API&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for production security. By returning these as a dictionary rather than injecting them into environment variables, the module maintains a clean security posture.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The script further calibrates the environment by distinguishing between global and regional requirements for different AI services. It specifically assigns the "global" location for models to access cutting-edge preview features while designating a regional location, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;us-central1&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, for infrastructure like the Vertex AI Agent Engine. By finalizing this setup with a global SDK initialization, the module integrates these settings into the session, allowing the rest of your application to interact with models and memory banks without having to repeatedly pass project or location parameters.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code into &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/app_utils/env.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import os\r\nimport google.auth\r\nimport vertexai\r\nfrom google.cloud import secretmanager\r\nfrom dotenv import load_dotenv\r\n\r\ndef _fetch_secrets(project_id: str):\r\n    &amp;quot;&amp;quot;&amp;quot;Fetch secrets from Secret Manager and return them as a dictionary.&amp;quot;&amp;quot;&amp;quot;\r\n    secrets_to_fetch = [&amp;quot;REDDIT_CLIENT_ID&amp;quot;, &amp;quot;REDDIT_CLIENT_SECRET&amp;quot;, &amp;quot;REDDIT_USER_AGENT&amp;quot;, &amp;quot;DK_API_KEY&amp;quot;]\r\n    fetched_secrets = {}\r\n\r\n    # First, check local environment (for local development via .env)\r\n    for s in secrets_to_fetch:\r\n        val = os.getenv(s)\r\n        if val:\r\n            fetched_secrets[s] = val\r\n\r\n    # If keys are missing (common in production), fetch from Secret Manager API\r\n    if len(fetched_secrets) &amp;lt; len(secrets_to_fetch):\r\n        client = secretmanager.SecretManagerServiceClient()\r\n        for secret_id in secrets_to_fetch:\r\n            if secret_id not in fetched_secrets:\r\n                name = f&amp;quot;projects/{project_id}/secrets/{secret_id}/versions/latest&amp;quot;\r\n                try:\r\n                    response = client.access_secret_version(request={&amp;quot;name&amp;quot;: name})\r\n                    # DO NOT set os.environ[secret_id] here.\r\n                    # Keep it in this dictionary only.\r\n                    fetched_secrets[secret_id] = response.payload.data.decode(&amp;quot;UTF-8&amp;quot;)\r\n                except Exception as e:\r\n                    print(f&amp;quot;Warning: Could not fetch {secret_id} from Secret Manager: {e}&amp;quot;)\r\n\r\n    return fetched_secrets\r\n\r\ndef init_environment():\r\n    &amp;quot;&amp;quot;&amp;quot;Consolidated environment discovery.&amp;quot;&amp;quot;&amp;quot;\r\n    load_dotenv()\r\n    try:\r\n        _, project_id = google.auth.default()\r\n    except Exception:\r\n        project_id = os.getenv(&amp;quot;GOOGLE_CLOUD_PROJECT&amp;quot;)\r\n    \r\n    model_location = os.getenv(&amp;quot;GOOGLE_CLOUD_LOCATION&amp;quot;, &amp;quot;global&amp;quot;)\r\n    service_location = os.getenv(&amp;quot;GOOGLE_CLOUD_REGION&amp;quot;, &amp;quot;us-central1&amp;quot;)\r\n    \r\n    secrets = {}\r\n    if project_id:\r\n        vertexai.init(project=project_id, location=service_location)\r\n        # Fetch secrets into a local variable\r\n        secrets = _fetch_secrets(project_id)\r\n        \r\n    return project_id, model_location, service_location, secrets&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f72714d2e50&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
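The local-first precedence used by `_fetch_secrets` (take what the environment already provides, call Secret Manager only for what is missing, and log failures without crashing) can be sketched in isolation. Here a simple callable stands in for the Secret Manager client; the stub and its return values are purely illustrative.

```python
def resolve_secrets(names, local_env, fetch_remote):
    """Local-first secret resolution: values present in local_env
    (e.g. os.environ populated from .env) win; fetch_remote, standing in
    for a Secret Manager lookup, is called only for missing keys.
    A failed remote fetch logs a warning instead of raising."""
    resolved = {name: local_env[name] for name in names if local_env.get(name)}
    for name in names:
        if name not in resolved:
            try:
                resolved[name] = fetch_remote(name)
            except Exception as exc:
                print(f"Warning: could not fetch {name}: {exc}")
    return resolved

# Locally, REDDIT_CLIENT_ID comes from .env; DK_API_KEY falls back to the stub.
secrets = resolve_secrets(
    ["REDDIT_CLIENT_ID", "DK_API_KEY"],
    {"REDDIT_CLIENT_ID": "local-id"},
    lambda name: f"remote-{name}",
)
```

Returning the resolved values as a dictionary, rather than writing them back into `os.environ`, keeps credentials scoped to the code that actually needs them.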
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Local testing script&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Google ADK comes with a built-in Web UI, which is excellent for visualizing agent logic and tool composition.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can launch it by running in the project root:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv run adk web&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f72714d2a30&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;However, the default Web UI will not test the long-term memory integration described in this tutorial because it is not pre-connected to a Vertex AI memory session. By default, the generic UI often relies on in-memory services that do not persist data across sessions. Therefore, we use the dedicated &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; script to explicitly initialize the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;VertexAiMemoryBankService&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This ensures that even in a local environment, your agent is communicating with the real cloud-based memory bank to validate preference persistence.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; script:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Connects to the real &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Agent Engine&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in the cloud for memory storage.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Uses an in-memory session service for local chat history (so you can wipe it easily).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Run a chat loop where you can talk to your agent.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Go back to the root folder &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;cd ../..&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f72714d26d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev-signal&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/test_local.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import asyncio\r\nimport os\r\nimport google.auth\r\nimport vertexai\r\nimport uuid\r\nfrom dotenv import load_dotenv\r\nfrom google.adk.runners import Runner\r\nfrom google.adk.memory.vertex_ai_memory_bank_service import VertexAiMemoryBankService\r\nfrom google.adk.sessions import InMemorySessionService\r\nfrom vertexai import agent_engines\r\nfrom google.genai import types\r\nfrom dev_signal_agent.agent import root_agent\r\n\r\n# Load environment variables\r\nload_dotenv()\r\n\r\nasync def main():\r\n    # 1. Setup Configuration\r\n    project_id = os.getenv(&amp;quot;GOOGLE_CLOUD_PROJECT&amp;quot;)\r\n    # Agent Engine (Memory) MUST use a regional endpoint\r\n    resource_location = &amp;quot;us-central1&amp;quot;\r\n    agent_name = &amp;quot;dev-signal&amp;quot;\r\n    \r\n    print(f&amp;quot;--- Initializing Vertex AI in {resource_location} ---&amp;quot;)\r\n    vertexai.init(project=project_id, location=resource_location)\r\n\r\n    # 2. Find the Agent Engine Resource for Memory\r\n    existing_agents = list(agent_engines.list(filter=f&amp;quot;display_name={agent_name}&amp;quot;))\r\n    if existing_agents:\r\n        agent_engine = existing_agents[0]\r\n        agent_engine_id = agent_engine.resource_name.split(&amp;quot;/&amp;quot;)[-1]\r\n        print(f&amp;quot;✅ Using persistent Memory Bank from Agent: {agent_engine_id}&amp;quot;)\r\n    else:\r\n        print(f&amp;quot;❌ Error: Agent Engine \&amp;#x27;{agent_name}\&amp;#x27; not found. Please deploy with Terraform first.&amp;quot;)\r\n        return\r\n\r\n    # 3. Initialize Services\r\n    # We use InMemorySessionService for easier local testing (IDs are flexible)\r\n    # BUT we use VertexAiMemoryBankService for REAL cloud persistence\r\n    session_service = InMemorySessionService()\r\n    \r\n    memory_service = VertexAiMemoryBankService(\r\n        project=project_id,\r\n        location=resource_location,\r\n        agent_engine_id=agent_engine_id\r\n    )\r\n\r\n    # 4. Create a Runner\r\n    runner = Runner(\r\n        agent=root_agent,\r\n        app_name=&amp;quot;dev-signal&amp;quot;,\r\n        session_service=session_service,\r\n        memory_service=memory_service \r\n    )\r\n\r\n    # 5. Run a Test Loop\r\n    user_id = &amp;quot;local-tester&amp;quot;\r\n    \r\n    print(&amp;quot;\\n--- TEST SCENARIO ---&amp;quot;)\r\n    print(&amp;quot;1. Start a session, tell the agent your preference (e.g., \&amp;#x27;write in rhymes\&amp;#x27;).&amp;quot;)\r\n    print(&amp;quot;2. Type \&amp;#x27;new\&amp;#x27; to start a FRESH session (local state wiped).&amp;quot;)\r\n    print(&amp;quot;3. Ask for a blog post. The agent should retrieve your preference from the CLOUD memory.&amp;quot;)\r\n    \r\n    current_session_id = f&amp;quot;session-{str(uuid.uuid4())[:8]}&amp;quot;\r\n    await session_service.create_session(\r\n        app_name=&amp;quot;dev-signal&amp;quot;,\r\n        user_id=user_id,\r\n        session_id=current_session_id\r\n    )\r\n    print(f&amp;quot;\\n--- Chat Session (ID: {current_session_id}) ---&amp;quot;)\r\n\r\n    while True:\r\n        user_input = input(&amp;quot;\\nYou: &amp;quot;)\r\n        \r\n        if user_input.lower() in [&amp;quot;exit&amp;quot;, &amp;quot;quit&amp;quot;]:\r\n            break\r\n            \r\n        if user_input.lower() == &amp;quot;new&amp;quot;:\r\n            # Simulate starting a completely fresh session\r\n            current_session_id = f&amp;quot;session-{str(uuid.uuid4())[:8]}&amp;quot;\r\n            await session_service.create_session(\r\n                app_name=&amp;quot;dev-signal&amp;quot;,\r\n                user_id=user_id,\r\n                session_id=current_session_id\r\n            )\r\n            print(f&amp;quot;\\n--- Fresh Session Started (ID: {current_session_id}) ---&amp;quot;)\r\n            print(&amp;quot;(Local history is empty, retrieval must come from Memory Bank)&amp;quot;)\r\n            continue\r\n\r\n        print(&amp;quot;Agent is thinking...&amp;quot;)\r\n        async for event in runner.run_async(\r\n            user_id=user_id,\r\n            session_id=current_session_id,\r\n            new_message=types.Content(parts=[types.Part(text=user_input)])\r\n        ):\r\n            if event.content and event.content.parts:\r\n                for part in event.content.parts:\r\n                    if part.text:\r\n                        print(f&amp;quot;Agent: {part.text}&amp;quot;)\r\n            \r\n            if event.get_function_calls():\r\n                for fc in event.get_function_calls():\r\n                    print(f&amp;quot;Tool Call: {fc.name}&amp;quot;)\r\n\r\nif __name__ == &amp;quot;__main__&amp;quot;:\r\n    asyncio.run(main())&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f72714d2ee0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Running the Test&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, ensure you have your Application Default Credentials set up:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud auth application-default login&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f72714d27c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Then run the script:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv run test_local.py&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f72714d2be0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;section id="test-scenario"&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Test Scenario&lt;/span&gt;&lt;/h2&gt;
&lt;/section&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This scenario validates the full end-to-end lifecycle of the agent: from discovery and research to multimodal content creation and long-term memory retrieval.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Phase &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;1: Teaching &amp;amp; Multimodal Creation (Session 1)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Goal: Establish technical context and set a specific stylistic preference.&lt;/span&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Discovery&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the agent to find trending Cloud Run topics.&lt;/span&gt;&lt;/p&gt;
&lt;p role="presentation"&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Find high-engagement questions about AI agents on Cloud Run from the last 21 days."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test1.max-1000x1000.png"
        
          alt="test1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test2.max-1000x1000.png"
        
          alt="test2"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Research&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Instruct the agent to perform a deep dive on a specific result.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Use the GCP Expert to research topic #1."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test3.max-1000x1000.png"
        
          alt="test3"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Personalization&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Request a blog post and explicitly set your style preference.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"Draft a blog post based on this research. From now on, I want all my technical blogs written in the style of a 90s Rap Song."&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test4.max-1000x1000.png"
        
          alt="test4"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Image generation&lt;/span&gt;&lt;/h4&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Ask the agent to generate an image that demonstrates the main ideas in the blog using the Nano Banana Pro tool. The image would be saved to your bucket in Google Cloud and you should get the path to see it which will look like this: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://storage.mtls.cloud.google.com/...&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
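Assuming the asset tool hands back a gs:// object URI (an assumption for illustration; the bucket and object names below are hypothetical), the mapping to that browsable URL is a simple path rewrite:

```python
def gcs_to_browser_url(gs_uri: str) -> str:
    """Rewrite a gs://bucket/object URI into the authenticated browser
    URL of the form https://storage.mtls.cloud.google.com/bucket/object."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a GCS URI: {gs_uri}")
    return "https://storage.mtls.cloud.google.com/" + gs_uri[len(prefix):]

# Hypothetical bucket and object names:
url = gcs_to_browser_url("gs://your_bucket_name/blog/hero.png")
```

Opening the resulting URL in a browser requires that your Google account has read access to the bucket.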
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/tokenoptimization.max-1000x1000.png"
        
          alt="tokenoptimization"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Phase &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;2: Long-Term Memory Recall (Session 2)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;Goal: Verify the agent recalls preferences across a completely fresh session.&lt;/span&gt;&lt;/em&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Type &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;new&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; in the console to wipe local session history and start a fresh state.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Retrieval) Inquire about your stored preferences to test the Vertex AI memory bank.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Input&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;"What are my current topics of interest and what is my preferred blogging style?"&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Verification: Confirm the agent successfully retrieves your "AI Agents on Cloud Run" interest and "Rap" style from the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/test5.max-1000x1000.png"
        
          alt="test5"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt;Final Test&lt;/strong&gt;: Ask for a new blog on a different topic (e.g., "GKE Autopilot") and ensure it is automatically written as a rap song without being prompted.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this part of our series we focused on verifying the agent's functionality in a local environment before proceeding to cloud deployment. By configuring local secrets and utilizing environment-aware utilities, we used a dedicated test runner to confirm that the core reasoning and tool logic are properly integrated. We successfully validated the full lifecycle: from Reddit discovery to expert content creation, confirming that the agent correctly retrieves preferences from the cloud-based Vertex AI memory bank even in completely fresh sessions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to run the test scenario yourself? Clone the &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and try the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;test_local.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; script to see 'Dev Signal' retrieve your preferences from the Vertex AI memory bank in real-time. For a deeper dive into the underlying mechanics of memory orchestration, check out this &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/quickstart-adk?content_ref=manage%20long%20term%20memories%20for%20you%20this%20tutorial%20demonstrates%20how%20you%20can%20use%20memory%20bank%20with%20the%20adk%20to%20manage%20long%20term%20memories%20create%20your%20local%20adk%20agent%20and%20runner&amp;amp;utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;quickstart guide&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the final part of this series, we will transition our prototype into a production service on Google Cloud Run using Terraform for secure infrastructure, and explore the roadmap to production excellence through continuous evaluation and security.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for the helpful review and feedback on this article.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, Follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 10 Apr 2026 08:11:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Local Testing of a Multi-Agent System with Memory</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Experimenting with GPUs: GKE managed DRANET and Inference Gateway AI 
Deployment</title><link>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building and serving models on infrastructure is a strong use case for businesses. In Google Cloud, you have the ability to design your AI infrastructure to suit your workloads. Recently, I experimented with Google Kubernetes Engine &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/allocate-network-resources-dra"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;(GKE) managed DRANET&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; while deploying a model for inference with NVIDIA B200 GPUs on GKE. In this blog, we will explore this setup in easy to follow steps.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;What is DRANET?&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Dynamic Resource Allocation (DRA)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is a Kubernetes feature that lets you request and share resources among Pods. DRANET builds on DRA to let you request and allocate networking resources for your Pods, including network interfaces that support TPUs and Remote Direct Memory Access (RDMA). In my case, that meant the RDMA-capable interfaces used by high-end GPUs.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;How GPU RDMA VPC works &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/rdma-network-profiles#overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;RDMA network&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is set up as an isolated VPC, which is regional and assigned a network profile type; in this case, the profile type is RoCEv2. This VPC is dedicated to GPU-to-GPU communication. The GPU VM families have RDMA-capable NICs that connect to the RDMA VPC, and the GPUs on multiple nodes communicate over this low-latency, high-speed, rail-aligned fabric.&lt;/span&gt;&lt;/p&gt;
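&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;GKE managed DRANET creates this RDMA VPC for you. Purely as an illustration of what that automation involves, the sketch below composes the equivalent gcloud commands as strings; the resource names and the network-profile name format are assumptions, and the commands are echoed rather than executed so the sketch runs anywhere.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```shell
#!/usr/bin/env bash
# Illustrative only: GKE managed DRANET normally creates the RDMA VPC automatically.
# Resource names and the RoCEv2 network-profile name format are assumptions.
REGION="us-central1"
RDMA_PROFILE="${REGION}-b-vpc-roce"   # an assumed RoCEv2 network-profile name
RDMA_VPC="gpu-rdma-net"

# Compose the commands as strings; echoed, not executed, so no project is needed.
CREATE_VPC="gcloud compute networks create ${RDMA_VPC} \
  --network-profile=${RDMA_PROFILE} --subnet-mode=custom"
CREATE_SUBNET="gcloud compute networks subnets create ${RDMA_VPC}-sub \
  --network=${RDMA_VPC} --region=${REGION} --range=10.10.0.0/16"

echo "${CREATE_VPC}"
echo "${CREATE_SUBNET}"
```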
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Design pattern example&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our aim was to deploy an LLM (DeepSeek) onto a GKE cluster with &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A4 nodes&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; that support 8 B200 GPUs, and serve it privately via &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. To set up an &lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt; GKE cluster you can use the Cluster Toolkit, but in my case I wanted to test the &lt;span style="vertical-align: baseline;"&gt;GKE managed &lt;/span&gt;DRANET dynamic setup of the networking that supports RDMA for GPU communication.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1-archgpu.max-1000x1000.png"
        
          alt="1-archgpu"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This design utilizes the following services to provide an end-to-end solution:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;VPC:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Three VPCs in total: one created manually, plus two created automatically by &lt;span style="vertical-align: baseline;"&gt;GKE managed &lt;/span&gt;DRANET (one standard and one for RDMA).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To deploy the workload.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GKE Inference Gateway:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; To expose the workload internally using a regional internal Application Load Balancer of type gke-l7-rilb.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;A4 VMs:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; These support RoCEv2 and provide 8 NVIDIA B200 GPUs per VM.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Putting it together &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get access to the A4 VMs, a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/consumption-models#comparison"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;future reservation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; was used; the reservation is linked to a specific zone.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Begin:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Set up the environment &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vpc/docs/create-modify-vpc-networks#create-custom-network"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;standard VPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, with firewall rules and subnet in the same zone as the reservation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/load-balancing/docs/proxy-only-subnets#proxy_only_subnet_create"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;proxy-only subnet&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;; this will be used by the internal regional Application Load Balancer attached to the GKE Inference Gateway.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
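&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a sketch of the two steps above, the commands below are composed as strings and echoed rather than executed; the names, region, and IP range are assumptions for illustration.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```shell
#!/usr/bin/env bash
# Sketch of the environment setup above; names, region, and ranges are assumptions.
REGION="us-central1"
GVNIC_NETWORK_PREFIX="dranet-demo"

# Standard VPC for the cluster (firewall rules and the node subnet would follow).
CREATE_VPC="gcloud compute networks create ${GVNIC_NETWORK_PREFIX}-main \
  --subnet-mode=custom"

# Proxy-only subnet for the regional internal Application Load Balancer.
CREATE_PROXY_SUBNET="gcloud compute networks subnets create ${GVNIC_NETWORK_PREFIX}-proxy \
  --network=${GVNIC_NETWORK_PREFIX}-main --region=${REGION} \
  --range=10.129.0.0/23 --purpose=REGIONAL_MANAGED_PROXY --role=ACTIVE"

# Echoed rather than executed so the sketch runs without a Google Cloud project.
echo "${CREATE_VPC}"
echo "${CREATE_PROXY_SUBNET}"
```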
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Next&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Create a standard GKE cluster with a default node pool.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud container clusters create $CLUSTER_NAME \\\r\n    --location=$ZONE \\\r\n    --num-nodes=1 \\\r\n    --machine-type=e2-standard-16 \\\r\n    --network=${GVNIC_NETWORK_PREFIX}-main \\\r\n    --subnetwork=${GVNIC_NETWORK_PREFIX}-sub \\\r\n    --release-channel rapid \\\r\n    --enable-dataplane-v2 \\\r\n    --enable-ip-alias \\\r\n    --addons=HttpLoadBalancing,RayOperator \\\r\n    --gateway-api=standard \\\r\n    --enable-ray-cluster-logging \\\r\n    --enable-ray-cluster-monitoring \\\r\n    --enable-managed-prometheus \\\r\n    --enable-dataplane-v2-metrics \\\r\n    --monitoring=SYSTEM&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7273d5b0a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once that is complete you can connect to your cluster:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud container clusters get-credentials $CLUSTER_NAME --zone $ZONE --project $PROJECT&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7273d5ba60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/allocate-network-resources-dra#enable-dra-driver-gpu"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GPU node pool&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (this example uses, A4 VM with reservation) and additionals flags: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;--accelerator-network-profile=auto&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (GKE automatically adds the gke.networks.io/accelerator-network-profile: auto label to the nodes)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;code style="vertical-align: baseline;"&gt;--node-labels=cloud.google.com/gke-networking-dra-driver=true&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (enables DRA for high-performance networking)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;gcloud beta container node-pools create $NODE_POOL_NAME \\\r\n  --cluster $CLUSTER_NAME \\\r\n  --location $ZONE \\\r\n  --node-locations $ZONE \\\r\n  --machine-type a4-highgpu-8g \\\r\n  --accelerator type=nvidia-b200,count=8,gpu-driver-version=latest \\\r\n  --enable-autoscaling --num-nodes=1 --total-min-nodes=1 --total-max-nodes=3 \\\r\n  --reservation-affinity=specific \\\r\n--reservation=projects/$PROJECT/reservations/$RESERVATION_NAME/reservationBlocks/$BLOCK_NAME \\\r\n   --accelerator-network-profile=auto \\\r\n--node-labels=cloud.google.com/gke-networking-dra-driver=true&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7273d5bc40&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Next:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Create a ResourceClaimTemplate, which will be used to attach the networking resources to your deployments. The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;deviceClassName: mrdma.google.com &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;is used for GPU workloads:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: resource.k8s.io/v1\r\nkind: ResourceClaimTemplate\r\nmetadata:\r\n  name: all-mrdma\r\nspec:\r\n  spec:\r\n    devices:\r\n      requests:\r\n      - name: req-mrdma\r\n        exactly:\r\n          deviceClassName: mrdma.google.com\r\n          allocationMode: All&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7273d5b490&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy model and inference&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Now that the cluster and node pool are set up,&lt;/span&gt; we can deploy a model and serve it via the Inference Gateway. In my experiment I used DeepSeek, but this could be any model.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy model and services&lt;/span&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt; nodeSelector: gke.networks.io/accelerator-network-profile: auto &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;is used to schedule the Pods onto the GPU nodes&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt; resourceClaims: &lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;field attaches the networking resource claim we defined earlier&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
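&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a minimal sketch, the two bullets above reduce to the following Pod-spec excerpt; the full Deployment below carries the complete container spec.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```yaml
# Excerpt only: the fields that wire a Pod to DRANET-managed networking.
spec:
  nodeSelector:
    gke.networks.io/accelerator-network-profile: auto   # land on the GPU nodes
  resourceClaims:
  - name: rdma-claim
    resourceClaimTemplateName: all-mrdma   # the ResourceClaimTemplate created earlier
  containers:
  - name: vllm-inference
    resources:
      claims:
      - name: rdma-claim   # attach the claimed RDMA devices to this container
```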
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Create a secret (&lt;/span&gt;&lt;a href="https://huggingface.co/docs/hub/security-tokens#how-to-manage-user-access-tokens" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;I used a Hugging Face&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;strong&gt; token)&lt;/strong&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;kubectl create secret generic hf-secret \\\r\n  --from-literal=hf_token=${HF_TOKEN}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7273d5b160&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;apiVersion: apps/v1\r\nkind: Deployment\r\nmetadata:\r\n  name: deepseek-v3-1-deploy\r\nspec:\r\n  replicas: 1\r\n  selector:\r\n    matchLabels:\r\n      app: deepseek-v3-1\r\n  template:\r\n    metadata:\r\n      labels:\r\n        app: deepseek-v3-1\r\n        ai.gke.io/model: deepseek-v3-1\r\n        ai.gke.io/inference-server: vllm\r\n        examples.ai.gke.io/source: user-guide\r\n    spec:\r\n      containers:\r\n      - name: vllm-inference\r\n        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250819_0916_RC01\r\n        resources:\r\n          requests:\r\n            cpu: &amp;quot;190&amp;quot;\r\n            memory: &amp;quot;1800Gi&amp;quot;\r\n            ephemeral-storage: &amp;quot;1Ti&amp;quot;\r\n            nvidia.com/gpu: &amp;quot;8&amp;quot;\r\n          limits:\r\n            cpu: &amp;quot;190&amp;quot;\r\n            memory: &amp;quot;1800Gi&amp;quot;\r\n            ephemeral-storage: &amp;quot;1Ti&amp;quot;\r\n            nvidia.com/gpu: &amp;quot;8&amp;quot;\r\n          claims:\r\n          - name: rdma-claim\r\n        command: [&amp;quot;python3&amp;quot;, &amp;quot;-m&amp;quot;, &amp;quot;vllm.entrypoints.openai.api_server&amp;quot;]\r\n        args:\r\n        - --model=$(MODEL_ID)\r\n        - --tensor-parallel-size=8\r\n        - --host=0.0.0.0\r\n        - --port=8000\r\n        - --max-model-len=32768\r\n        - --max-num-seqs=32\r\n        - --gpu-memory-utilization=0.90\r\n        - --enable-chunked-prefill\r\n        - --enforce-eager\r\n        - --trust-remote-code\r\n        env:\r\n        - name: MODEL_ID\r\n          value: deepseek-ai/DeepSeek-V3.1\r\n        - name: HUGGING_FACE_HUB_TOKEN\r\n          valueFrom:\r\n            secretKeyRef:\r\n              name: hf-secret\r\n              key: hf_token\r\n        volumeMounts:\r\n        - mountPath: /dev/shm\r\n          name: 
dshm\r\n        livenessProbe:\r\n          httpGet:\r\n            path: /health\r\n            port: 8000\r\n          initialDelaySeconds: 1800\r\n          periodSeconds: 10\r\n        readinessProbe:\r\n          httpGet:\r\n            path: /health\r\n            port: 8000\r\n          initialDelaySeconds: 1800\r\n          periodSeconds: 5\r\n      volumes:\r\n      - name: dshm\r\n        emptyDir:\r\n            medium: Memory\r\n      nodeSelector:\r\n        gke.networks.io/accelerator-network-profile: auto\r\n      resourceClaims:\r\n      - name: rdma-claim\r\n        resourceClaimTemplateName: all-mrdma\r\n---\r\napiVersion: v1\r\nkind: Service\r\nmetadata:\r\n  name: deepseek-v3-1-service\r\nspec:\r\n  selector:\r\n    app: deepseek-v3-1\r\n  type: ClusterIP\r\n  ports:\r\n    - protocol: TCP\r\n      port: 8000\r\n      targetPort: 8000\r\n---\r\napiVersion: monitoring.googleapis.com/v1\r\nkind: PodMonitoring\r\nmetadata:\r\n  name: deepseek-v3-1-monitoring\r\nspec:\r\n  selector:\r\n    matchLabels:\r\n      app: deepseek-v3-1\r\n  endpoints:\r\n  - port: 8000\r\n    path: /metrics\r\n    interval: 30s&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7273d5b5b0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Deploy GKE Inference Gateway&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/deploy-gke-inference-gateway#prepare-environment"&gt;installs the needed Custom Resource Definitions (CRDs) in your GKE cluster:&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For GKE versions &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;1.34.0-gke.1626000&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; or later, install only the alpha &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;InferenceObjective&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; CRD:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.0/config/crd/bases/inference.networking.x-k8s.io_inferenceobjectives.yaml&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7273d5b1c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Create Inference pool  &lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;helm install deepseek-v3-pool \\\r\n  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \\\r\n  --version v1.0.1 \\\r\n  --set inferencePool.modelServers.matchLabels.app=deepseek-v3-1 \\\r\n  --set provider.name=gke \\\r\n  --set inferenceExtension.monitoring.gke.enabled=true&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7273d5b8e0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Create the Gateway, HTTPRoute and InferenceObjective&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# 1. The Regional Internal Gateway (ILB)\r\napiVersion: gateway.networking.k8s.io/v1\r\nkind: Gateway\r\nmetadata:\r\n  name: deepseek-v3-gateway\r\n  namespace: default\r\nspec:\r\n  gatewayClassName: gke-l7-rilb\r\n  listeners:\r\n  - name: http\r\n    protocol: HTTP\r\n    port: 80\r\n    allowedRoutes:\r\n      namespaces:\r\n        from: Same\r\n---\r\n# 2. The HTTPRoute (Routing to the Pool)\r\napiVersion: gateway.networking.k8s.io/v1\r\nkind: HTTPRoute\r\nmetadata:\r\n  name: deepseek-v3-route\r\n  namespace: default\r\nspec:\r\n  parentRefs:\r\n  - name: deepseek-v3-gateway\r\n  rules:\r\n  - matches:\r\n    - path:\r\n        type: PathPrefix\r\n        value: /\r\n    backendRefs:\r\n    - group: inference.networking.k8s.io\r\n      kind: InferencePool\r\n      name: deepseek-v3-pool\r\n---\r\n# 3. The Inference Objective (Performance Logic)\r\napiVersion: inference.networking.x-k8s.io/v1alpha2\r\nkind: InferenceObjective\r\nmetadata:\r\n  name: deepseek-v3-objective\r\n  namespace: default\r\nspec:\r\n  poolRef:\r\n    name: deepseek-v3-pool&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7255317b20&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Once complete, you can create a test VM in your main VPC and make a call to the IP address of the GKE Inference Gateway:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;curl -N -s -X POST &amp;quot;http://$GATEWAY_IP/v1/chat/completions&amp;quot; \\\r\n  -H &amp;quot;Content-Type: application/json&amp;quot; \\\r\n  -d \&amp;#x27;{\r\n    &amp;quot;model&amp;quot;: &amp;quot;deepseek-ai/DeepSeek-V3.1&amp;quot;,\r\n    &amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Box A: red. Box B: blue. Box C: empty. Move A to C, Move B to A, Swap B and C. Where is red?&amp;quot;}],\r\n    &amp;quot;stream&amp;quot;: true\r\n  }\&amp;#x27; | stdbuf -oL grep &amp;quot;data: &amp;quot; | sed -u \&amp;#x27;s/^data: //\&amp;#x27; | grep -v &amp;quot;\\[DONE\\]&amp;quot; | \\\r\n  jq --unbuffered -rj \&amp;#x27;.choices[0].delta | (.reasoning_content // .reasoning // .content // empty)\&amp;#x27;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7255317040&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
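&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The pipeline after curl simply peels off the Server-Sent Events framing. The snippet below runs the same grep/sed stage on a canned stream so you can see what each stage removes; the JSON payloads are made up, and jq is omitted to keep the sketch dependency-free.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```shell
#!/usr/bin/env bash
# Canned Server-Sent Events shaped like the gateway's stream; payloads are illustrative.
STREAM='data: {"choices":[{"delta":{"content":"Red"}}]}
data: {"choices":[{"delta":{"content":" is in box C."}}]}
data: [DONE]'

# Same idea as the curl pipeline above: keep "data: " lines, strip the prefix,
# and drop the [DONE] sentinel that terminates the stream.
PAYLOADS=$(printf '%s\n' "$STREAM" | grep '^data: ' | sed 's/^data: //' | grep -v '\[DONE\]')
echo "$PAYLOADS"
```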
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Next Steps&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To take a deeper dive into GKE managed DRANET and GKE Inference Gateway, review the following:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Blog: &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-device-management-with-dra-dynamic-resource-allocation?e=48754805"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DRA: A new era of Kubernetes device management with Dynamic Resource Allocation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Document set: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/config-auto-net-for-accelerators"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;DRANET&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Documentation: &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/ai-hypercomputer/docs/overview"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;AI Hypercomputer&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Want to ask a question, find out more, or share a thought? Please connect with me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/ammett/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;LinkedIn&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 08 Apr 2026 10:05:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</guid><category>Networking</category><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/0-hero-dranet.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Experimenting with GPUs: GKE managed DRANET and Inference Gateway AI Deployment</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/0-hero-dranet.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/experimenting-with-gpus-gke-managed-dranet-and-inference-gateway-ai-deployment/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Ammett Williams</name><title>Developer Relations Engineer</title><department></department><company></company></author></item><item><title>See beyond the IP and secure URLs with Google Cloud NGFW</title><link>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a cloud-first world, traditional IP-based defenses are no longer enough to protect your perimeter. 
As services migrate to shared infrastructure and content delivery networks, relying on static IP addresses and FQDNs can create security gaps.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Because a single IP address can host multiple services, and IP addresses can change frequently, we are introducing domain filtering with a wildcard capability in Cloud Next Generation Firewall (NGFW) Enterprise. This new capability provides increased security and granular policy controls.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why domain and SNI filtering matters&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Cloud NGFW URL filtering service performs deep inspections of HTTP payloads to secure workloads against threats from both public and internal networks. This service elevates security controls to the application layer and helps restrict access to malicious domains. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Key use cases include: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Granular egress control&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This capability enables the precise allowing and blocking of connections based on domain names and SNI information found in egress HTTP(S) messages. By inspecting Layer 7 (L7) headers, it offers significantly finer control than traditional filtering based solely on IP addresses and FQDNs, which can be inefficient when a single IP hosts multiple services.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Control access without decrypting&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: For organizations that prefer not to perform full TLS decryption on their traffic, Cloud NGFW can still enforce security policies by controlling traffic based on SNI headers provided during the TLS handshake. This allows for effective domain-level filtering while maintaining end-to-end encryption for privacy or compliance reasons.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reduced operational overhead&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Implementing domain-based filtering helps reduce the constant maintenance typically required to track frequently changing IP addresses and DNS records. By focusing on stable domain identities rather than dynamic network attributes, security teams can minimize the manual effort involved in updating firewall rulebases.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Flexible matching&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The service utilizes matcher strings within URL lists, supporting limited wildcard domains to define criteria for both domains and subdomains. For example, using a wildcard like &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;*.example.com&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; allows a single filter to cover all associated subdomains, providing a more scalable solution than defining thousands of individual FQDN entries.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Improved security: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;URL filtering significantly enhances your security posture by protecting against sophisticated attacks such as SNI header spoofing. By evaluating L7 headers before allowing access to an application, Cloud NGFW ensures that attackers cannot bypass security controls by simply spoofing lower-layer identifiers. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
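&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the wildcard semantics concrete, here is a small shell sketch that approximates matching a domain against a filter such as *.example.com. It illustrates the idea only; it is not Cloud NGFW's actual matcher.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```shell
#!/usr/bin/env bash
# Approximates the wildcard matching described above; illustrative only,
# not Cloud NGFW's implementation.
matches_filter() {
  local domain="$1" filter="$2"
  case "$domain" in
    $filter) return 0 ;;   # unquoted so the shell expands * as a glob
    *)       return 1 ;;
  esac
}

matches_filter "api.example.com" "*.example.com" && echo "api.example.com: matched"
matches_filter "example.org" "*.example.com" || echo "example.org: no match"
```

A single `*.example.com` entry therefore covers `api.example.com`, `www.example.com`, and any other subdomain, without enumerating FQDNs one by one.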
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;How Cloud NGFW URL filtering works&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The URL filtering service functions by inspecting traffic at L7 using a distributed architecture. &lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_zzP0Xt6.max-1000x1000.png"
        
          alt="image1"&gt;
        
      
        &lt;figcaption class="article-image__caption "&gt;&lt;p data-block-key="6nmqq"&gt;Cloud NGFW URL filtering service&lt;/p&gt;&lt;/figcaption&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You can get started with URL filtering in three simple steps.&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deploy Cloud NGFW endpoints&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The first step is to &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-firewall-endpoints#create-firewall-endpoint"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;create and deploy a Cloud NGFW endpoint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in a zone. The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-firewall-endpoints"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;NGFW endpoint&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is an organization level resource. Please ensure you have the right permission before deploying the endpoint.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Once the endpoint is deployed you can &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-firewall-endpoint-associations#create-end-assoc-network"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;associate it to one or more VPCs&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of your choice.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Create security profiles and security profile groups:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-security-profiles#url-filtering-profile"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;URL filtering security profile&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; holds the URL filters with matcher strings and an action (allow or deny).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-security-profile-groups"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;security profile group&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; acts as a container for these security profiles, which is then referenced by a firewall policy rule. &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-urlf-security-profiles#create-urlf-security-profile"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create URL filtering security profiles&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with desired URLs, wildcard FQDNs and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/configure-security-profile-groups#create-security-profile-group"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;add them to a security profile group&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Once the security profile group is created, you will need to reference the security profile group in firewall policies.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Policy enforcement:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;ol&gt;
&lt;li aria-level="2" style="list-style-type: lower-alpha; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;You enable the service by configuring a hierarchical or global network firewall policy rule using the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;apply_security_profile_group&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; action, specifying the name of your security profile group. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/ol&gt;
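&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Put together, the three steps above map to a handful of gcloud commands. The sketch below is illustrative rather than authoritative: resource names, zones, and IDs are placeholders, and the exact subcommand for creating a URL filtering profile should be verified against the documentation linked above.&lt;/span&gt;&lt;/p&gt;

```shell
# 1a. Create a Cloud NGFW endpoint (organization-level resource).
gcloud network-security firewall-endpoints create my-ngfw-endpoint \
    --zone=us-central1-a \
    --organization=ORGANIZATION_ID \
    --billing-project=PROJECT_ID

# 1b. Associate the endpoint with a VPC network.
gcloud network-security firewall-endpoint-associations create my-assoc \
    --zone=us-central1-a \
    --network=projects/PROJECT_ID/global/networks/my-vpc \
    --endpoint=organizations/ORGANIZATION_ID/locations/us-central1-a/firewallEndpoints/my-ngfw-endpoint \
    --project=PROJECT_ID

# 2. Create a URL filtering security profile and a security profile group.
#    (Subcommand names are indicative; confirm against the security-profiles docs.)
gcloud network-security security-profiles url-filtering create my-urlf-profile \
    --organization=ORGANIZATION_ID --location=global
gcloud network-security security-profile-groups create my-spg \
    --organization=ORGANIZATION_ID --location=global

# 3. Reference the profile group from a global network firewall policy rule.
gcloud compute network-firewall-policies rules create 1000 \
    --firewall-policy=my-policy --global-firewall-policy \
    --direction=EGRESS --layer4-configs=tcp:443 \
    --dest-ip-ranges=0.0.0.0/0 \
    --action=apply_security_profile_group \
    --security-profile-group=//networksecurity.googleapis.com/organizations/ORGANIZATION_ID/locations/global/securityProfileGroups/my-spg \
    --enable-logging
```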
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more information about configuring a firewall policy rule, see the following:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/using-firewall-policies#create-ingress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an ingress hierarchical firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/using-firewall-policies#create-egress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an egress hierarchical firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/use-network-firewall-policies#create-ingress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an ingress global network firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/use-network-firewall-policies#create-egress-rule-target-vm"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Create an egress global network firewall policy rule&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Getting started&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Get started with Cloud NGFW URL filtering by visiting our &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/firewall/docs/about-url-filtering"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://codelabs.developers.google.com/cloud-ngfw-enterprise-urlf" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;codelab&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 07 Apr 2026 17:30:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</guid><category>Networking</category><category>Developers &amp; Practitioners</category><category>Security &amp; Identity</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>See beyond the IP and secure URLs with Google Cloud NGFW</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/identity-security/see-beyond-the-ip-and-secure-urls-with-google-cloud-ngfw/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Uttam Ramesh</name><title>Product Manager</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Susan Wu</name><title>Outbound Product Manager</title><department></department><company></company></author></item><item><title>Envoy: A future-ready foundation for agentic AI networking</title><link>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In today's agentic AI environments, the network has a new 
set of responsibilities.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a traditional application stack, the network mainly moves requests between services. But as discussed in a recent white paper,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://services.google.com/fh/files/misc/cloud_infrastructure_in_the_agent_native_era.pdf" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Cloud Infrastructure in the Agent-Native Era&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;,&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in an agentic system the network sits in the middle of model calls, tool invocations, agent-to-agent interactions, and policy decisions that can shape what an agent is allowed to do. The rapid proliferation of agents, often built on diverse frameworks, necessitates a consistent enforcement of governance and security across all agentic paths at scale. To achieve this, the enforcement layer must shift from the application level to the underlying infrastructure. That means the network can no longer operate as a blind transport layer. It has to understand more, enforce better, and adapt faster. This shift is precisely where Envoy comes in.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As a high-performance distributed proxy and universal data plane, Envoy is built for massive scale. Trusted by demanding enterprise environments, including Google Cloud, it supports everything from single-service deployments to complex service meshes using Ingress, Egress, and Sidecar patterns. Because of its deep extensibility, robust policy integration, and operational maturity, Envoy is uniquely suited for an era where protocols change quickly and the cost of weak control is steep. For teams building agentic AI, Envoy is more than a concept: it's a practical, production-ready foundation.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/1_xPxMxF4.max-1000x1000.jpg"
        
          alt="1"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic AI changes the networking problem&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic workloads still often use HTTP as a transport, but they break some of the assumptions that traditional HTTP intermediaries rely on. Protocols such as&lt;/span&gt;&lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Model Context Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (MCP) and&lt;/span&gt;&lt;a href="https://github.com/google/A2A" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent2agent&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (A2A) use&lt;/span&gt;&lt;a href="https://www.jsonrpc.org/specification" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;JSON-RPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or&lt;/span&gt;&lt;a href="https://grpc.io" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;gRPC&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; over HTTP, adding protocol-level phases such as MCP initialization, where client and server exchange their capabilities, on top of standard HTTP request/response semantics. The key aspects of agentic systems that require intermediaries to adapt include:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Diverse enterprise governance imperatives. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The primary challenge is satisfying the wide spectrum of non-negotiable enterprise requirements for safety, security, data privacy, and regulatory compliance. These needs often go beyond standard network policies and require deep integration with internal systems, custom logic, and the ability to rapidly adapt to new organizational rules or external regulations. This demands a highly extensible framework where enterprises can plug in their specific governance models.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Policy attributes live inside message bodies, not headers.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Unlike traditional web traffic where policy inputs like paths and headers are readily accessible, agentic protocols frequently bury critical attributes (e.g., model names, tool calls, resource IDs) deep within JSON-RPC or gRPC payloads. This shift requires intermediaries to possess the ability to parse and understand message contents to apply context-aware policies.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Handling diverse and evolving protocol characteristics. &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic protocols are not uniform. Some, like MCP with Streamable HTTP, can introduce stateful interactions requiring session management across distributed proxies (e.g., using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;Mcp-Session-Id&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;). The need to support such varied behaviors, along with future protocol innovations, reinforces the necessity of an inherently adaptable and extensible networking foundation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
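&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To make the second point concrete, here is a representative MCP tools/call request (the field values are illustrative). The attributes a gateway policy cares about, the JSON-RPC method and the tool name, sit inside the body rather than in HTTP headers:&lt;/span&gt;&lt;/p&gt;

```json
{
  "jsonrpc": "2.0",
  "id": 42,
  "method": "tools/call",
  "params": {
    "name": "get_issue",
    "arguments": {
      "owner": "example-org",
      "repo": "example-repo",
      "issue_number": 123
    }
  }
}
```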
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These factors mean enterprises need more than just connectivity. The network must now serve as a central point for enforcing the crucial governance needs mentioned earlier. This includes providing capabilities like centralized security, comprehensive auditability, fine-grained policy enforcement, and dynamic guardrails, all while keeping pace with the rapid evolution of protocols and agent behaviors. Put simply, agentic AI transforms the network from a mere transit path into a critical control point.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why Envoy fits this shift&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is a strong fit for agentic AI networking for three reasons. Envoy is:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Battle-tested.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Enterprises already rely on Envoy in high-scale, security-sensitive environments, making it a credible platform to anchor a new generation of traffic management and policy enforcement.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Extensible.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy can be extended through native filters, Rust modules, WebAssembly (Wasm) modules, and &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;external processing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; patterns. That gives platform teams room to adopt new protocols without having to rebuild their networking layer every time the ecosystem changes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Operationally useful today.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy already acts as a gateway, enforcement point, observability layer, and integration surface for control planes. That makes it a practical choice for organizations that need to move now, not after the standards settle.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on these core strengths, Envoy has introduced specific architectural advancements to meet the unique demands of agentic networking:&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;1. Envoy understands agent traffic&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The first requirement for agentic networking is simple: The gateway needs to understand what the agent is actually trying to do.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s harder than it sounds. In protocols such as MCP, A2A, and OpenAI-style APIs, important policy signals may live inside the request body. Traditional HTTP proxies are optimized to treat bodies as opaque byte streams. That design is efficient, but it limits what the proxy can enforce. For protocols that use JSON messages, a proxy may need to buffer the entire request body to locate attribute values needed for policy application — especially when those attributes appear at the end of the JSON message. Business logic specific to gen AI protocols, such as rate limiting based on consumed tokens, may also require parsing server responses.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy addresses this by deframing protocol messages carried over HTTP and exposing useful attributes to the rest of the filter chain. The extensibility model for gen AI protocols was guided by two goals:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Easy reuse of existing HTTP extensions that work with gen AI protocols out of the box, such as RBAC or tracers.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;Easy access to deframed messages for gen-AI-specific extensions, so that developers can focus on gen AI business logic without needing to deal with HTTP or JSON envelopes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Based on these goals, new extensions for gen AI protocols are still built as HTTP extensions and configured in the HTTP filter chain. This provides flexibility to mix HTTP-native business logic, such as OAuth or mTLS authorization, with gen AI protocol logic in a single chain. A deframing extension parses the protocol messages carried by HTTP and provides an ambient context with extracted attributes, or even the entirety of parsed messages, to downstream extensions via well-known filter state and metadata values.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of forcing every policy component to parse JSON envelopes or protocol-specific message formats on its own, Envoy makes those attributes available as structured metadata. Once the gateway has deframed protocol messages, existing Envoy extensions such as &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ext_authz&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or RBAC can read protocol properties to evaluate policies using protocol-specific attributes such as tool names for MCP, message attributes for A2A, or model names for OpenAI.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Access logs can include message attributes for enhanced monitoring and auditing. The protocol attributes are also available to the &lt;/span&gt;&lt;a href="https://cel.dev/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Common Expression Language&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (CEL) runtime, simplifying creation of complex policy expressions in RBAC or composite extensions.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/2_t4lf1kG.max-1000x1000.png"
        
          alt="2"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Buffering and memory management&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is designed to use as little memory as possible when proxying HTTP requests. However, parsing agentic protocols may require an arbitrary amount of buffer space, especially when extensions require the entire message to be in memory. The flexibility of allowing extensions to use larger buffers needs to be balanced with adequate protection from memory exhaustion, especially in the presence of untrusted traffic.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To achieve this, Envoy now provides a per-request buffer size limit. Buffers that hold request data are also integrated with the overload manager, enabling a full range of protective actions under memory pressure, such as reducing idle timeouts or resetting requests that consume the most memory for an extended duration. These changes pave the way for Envoy to serve as a gateway and policy-enforcement point for gen AI protocols without compromising its resource efficiency.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;2. Envoy enforces policy on things that matter&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Understanding traffic is only useful if the gateway can act on it.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In agentic systems, policy is not just about which service an agent can reach. It’s about which tools an agent can call, which models it can use, what identity it presents, how much it can consume, and what kinds of outputs require additional controls. Those are higher-value decisions than simple layer-4 or path-based controls, and they are exactly the kinds of controls enterprises care about when agents are allowed to take action on their behalf.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is well-positioned here because it can combine transport-level security with application-aware policy enforcement. Teams can authenticate workloads with mTLS and SPIFFE identities, then enforce protocol-specific rules with RBAC, external authorization, external processing, access logging, and CEL-based policy expressions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This capability is crucial because it lets platform teams decouple agent development from enforcement. Developers can focus on building useful agents, while operators enforce a consistent zero-trust posture at the network layer, even as tools, models, and protocols continue to change.&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;A prime example of this zero-trust decoupling is the critical "user-behind-agent" scenario, where an AI agent must execute tasks on a human user's behalf. Traditionally, handing user credentials directly to an application introduces severe security risks — if the agent is compromised or manipulated via prompt injection, an attacker could exfiltrate or misuse those credentials. By offloading identity management to Envoy, the proxy can automatically insert user delegation tokens into outbound requests at the infrastructure layer. Because the agent never directly holds the sensitive credential, the risk of a compromised agent misusing or leaking the token is completely neutralized, ensuring actions remain strictly bound to the user's actual permissions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Case study: Restricting an agent to specific GitHub MCP tools&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Consider an agent that triages GitHub issues.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The GitHub MCP server may expose dozens of tools, but the agent may only need a small read-only subset, such as &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;list_issues&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;get_issue&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;get_issue_comments&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. In most enterprises, that difference matters. A useful agent should not automatically become an unrestricted one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With Envoy in front of the MCP server, the gateway can verify the agent identity using SPIFFE during the mTLS handshake, parse the MCP message via &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/mcp/v3/mcp.proto#envoy-v3-api-msg-extensions-filters-http-mcp-v3-mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;the deframing filter&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, extract the requested method and tool name, and enforce a policy that allows only the approved tool calls for that specific agent identity. RBAC uses metadata created by the MCP deframing filter to check the method and tool name in the MCP message:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;envoy.filters.http.rbac:\r\n  &amp;quot;@type&amp;quot;: type.googleapis.com/envoy.extensions.filters.http.rbac.v3.RBACPerRoute\r\n  rbac:\r\n    rules:\r\n      policies:\r\n        github-issue-reader-policy:\r\n          permissions:\r\n            - and_rules:\r\n                rules:\r\n                  - sourced_metadata:\r\n                      metadata_matcher:\r\n                        filter: envoy.http.filters.mcp\r\n                        path: [{ key: &amp;quot;method&amp;quot; }]\r\n                        value: { string_match: { exact: &amp;quot;tools/call&amp;quot; } }\r\n                  - sourced_metadata:\r\n                      metadata_matcher:\r\n                        filter: envoy.http.filters.mcp\r\n                        path: [{ key: &amp;quot;params&amp;quot; }, { key: &amp;quot;name&amp;quot; }]\r\n                        value:\r\n                          or_match:\r\n                            value_matchers:\r\n                              - string_match: { exact: &amp;quot;list_issues&amp;quot; }\r\n                              - string_match: { exact: &amp;quot;get_issue&amp;quot; }\r\n                              - string_match: { exact: &amp;quot;get_issue_comments&amp;quot; }\r\n          principals:\r\n            - authenticated:\r\n                principal_name:\r\n                  exact: &amp;quot;spiffe://cluster.local/ns/github-agents/sa/issue-triage-agent&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f727146b310&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That’s the real value: Policy is enforced centrally, close to the traffic, and in terms that match the agent's actual behavior.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/3_jtbLCMn.max-1000x1000.png"
        
          alt="3"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Beyond static rules: External authorization&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;A complex compliance policy that can’t be expressed using RBAC rules can be implemented in an external authorization service using the &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ext_authz&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; protocol. Envoy provides MCP message attributes along with HTTP headers in the context of the ext_authz RPC. It can also forward the agent's SPIFFE identity from the peer certificate:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;http_filters:\r\n  - name: envoy.filters.http.ext_authz\r\n    typed_config:\r\n      &amp;quot;@type&amp;quot;: type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz\r\n      grpc_service:\r\n        envoy_grpc:\r\n          cluster_name: auth_service_cluster\r\n      include_peer_certificate: true\r\n      metadata_context_namespaces:\r\n        - envoy.http.filters.mcp&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7255317880&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This allows external services to make authorization decisions based on the full combination of agent identity, MCP method, tool name, and any other protocol attributes, without the agent or the MCP server needing to be aware of the policy layer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Protocol-native error responses&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;When Envoy denies a request, the error should be meaningful to the calling agent. For MCP traffic, Envoy can use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;local_reply_config&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to map HTTP error codes to appropriate JSON-RPC error responses. For example, a 403 Forbidden can be mapped to a JSON-RPC response with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;isError: true&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; and a human-readable message, ensuring the agent receives a protocol-appropriate denial rather than an opaque HTTP status code.&lt;/span&gt;&lt;/p&gt;
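As an illustrative sketch only (not taken from this post), a local_reply_config along the following lines could rewrite a gateway-issued 403 into a JSON-RPC-shaped body. The matcher and body fields follow Envoy's LocalReplyConfig API, but the exact response shape is an assumption, and a production mapping would also need to carry the original request's JSON-RPC id:

```yaml
# Hypothetical sketch: map an HTTP 403 produced by the gateway into a
# JSON-RPC style denial body. The body shape here is an assumption.
local_reply_config:
  mappers:
    - filter:
        status_code_filter:
          comparison:
            op: EQ
            value:
              default_value: 403
              runtime_key: unused_runtime_key
      status_code: 200   # deliver the denial as a syntactically valid JSON-RPC response
      body_format_override:
        json_format:
          jsonrpc: "2.0"
          result:
            isError: true
            content:
              - type: "text"
                text: "Denied by gateway policy: this agent may not call the requested tool."
```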
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;3. Envoy supports stateful agent interactions at scale&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Not all agent traffic is stateless. Some protocols, including Streamable HTTP for MCP, can rely on session-oriented behavior. That creates a new challenge for intermediaries, especially when traffic flows through multiple gateway instances to achieve scale and resilience. An MCP session effectively binds the agent to the server that established it, and all intermediaries need to know this to direct incoming MCP connections to the correct server.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If a session is established on one backend, later requests in that conversation need to reach the right destination. That sounds straightforward for a single-proxy deployment, but it becomes more complicated in horizontally scaled systems, where multiple Envoy instances may handle different requests from the same agent.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Passthrough gateway&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In the simpler passthrough mode, Envoy establishes one upstream connection for each downstream connection. Its primary use is enforcing centralized policies, such as client authorization, RBAC, rate limiting, and authentication, for external MCP servers. The session state transferred between intermediaries needs to include only the address of the server that established the session over the initial HTTP connection, so that all session-related requests are directed to that server.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Session state transfer between different Envoy instances is achieved by appending encoded session state to the MCP session ID provided by the MCP server. Envoy removes the session-state suffix from the session ID before forwarding the request to the destination MCP server. This session stickiness is enabled by configuring Envoy's &lt;/span&gt;&lt;a href="https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/http/stateful_session/envelope/v3/envelope.proto" rel="noopener" target="_blank"&gt;&lt;code style="text-decoration: underline; vertical-align: baseline;"&gt;envoy.http.stateful_session.envelope&lt;/code&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; extension.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
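For illustration, enabling that extension might look roughly like the sketch below. The filter and extension names come from the Envoy docs linked above, but the exact type URLs and the lowercase session-ID header name are assumptions to verify against the envelope extension reference:

```yaml
# Hypothetical sketch: envelope-based session stickiness for MCP traffic.
http_filters:
  - name: envoy.filters.http.stateful_session
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.stateful_session.v3.StatefulSession
      session_state:
        name: envoy.http.stateful_session.envelope
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.http.stateful_session.envelope.v3.EnvelopeSessionState
          header:
            name: "mcp-session-id"   # Envoy appends encoded session state to this ID
```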
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/4_j0wGyAp.max-1000x1000.png"
        
          alt="4"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Aggregating gateway&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;In aggregating mode, Envoy acts as a single MCP server by aggregating the capabilities, tools, and resources of multiple backend MCP servers. In addition to enforcing policies, this simplifies agent configuration and unifies policy application for multiple MCP servers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Session management in this mode is more complicated because the session state also needs to include mapping from tools and resources to the server addresses and session IDs that advertised them. The session ID that Envoy provides to the agent is created before tools or resources are known, and the mapping has to be established later, after the MCP initialization phases between Envoy and the backend MCP servers are complete.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One approach, currently implemented in Envoy, is to combine the name of a tool or resource with the identifier and session ID of its origin server. Exact tool or resource names are typically not meaningful to the agent, so they can carry this additional provenance information. If unmodified tool or resource names are desirable, another approach is for an Envoy instance that does not have the mapping to recreate it by issuing a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;tools/list&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; command before calling a specific tool. This approach trades added latency for avoiding the complexity of deploying an external global store of MCP sessions, and is currently in planning based on user feedback.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/5_61xwM79.max-1000x1000.png"
        
          alt="5"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This matters because it moves Envoy beyond simple traffic forwarding. It allows Envoy to serve as a reliable intermediary for real agent workflows, including those spanning multiple requests, tools, and backends.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;4. Envoy supports agent discovery&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy is adding support for the A2A protocol and agent discovery via a well-known AgentCard endpoint. AgentCard, a JSON document with agent capabilities, enables discovery and multi-agent coordination by advertising skills, authentication requirements, and service endpoints. The AgentCard can be provisioned statically via direct response configuration or obtained from a centralized agent registry server via xDS or ext_proc APIs. A more detailed description of A2A implementation and agent discovery will be published in a forthcoming blog post.&lt;/span&gt;&lt;/p&gt;
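To make the static option concrete, a hypothetical route sketch could serve a card through Envoy's direct_response. The well-known path and the card fields shown here are placeholders, not details from this post:

```yaml
# Hypothetical sketch: statically serve an AgentCard from Envoy route config.
# Path and card contents are placeholders; check the A2A spec for the
# current well-known location.
routes:
  - match:
      path: "/.well-known/agent-card.json"
    direct_response:
      status: 200
      body:
        inline_string: '{"name": "issue-triage-agent", "skills": [], "url": "https://agents.example.com/a2a"}'
```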
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;5. Envoy is a complete solution for agentic networking challenges&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Building on the same foundation that enabled policy application for MCP protocol in demanding deployments, Envoy is adding support for OpenAI and transcoding of agentic protocols into RESTful HTTP APIs. This transcoding capability simplifies the integration of gen AI agents with existing RESTful applications, with out-of-the-box support for OpenAPI-based applications and custom options via dynamic modules or Wasm extensions. In addition to transcoding, Envoy is being strengthened in critical areas for production readiness, such as advanced policy applications like quota management, comprehensive telemetry adhering to&lt;/span&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;OpenTelemetry semantic conventions for generative AI systems&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and integrated guardrails for secure agent operation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Guardrails for safe agents&lt;br/&gt;&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The next significant area of investment is centralized management and application of guardrails for all agentic traffic. Integrating policy enforcement points with external guardrails presently requires bespoke implementation, and this problem area is ripe for standardization.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Control planes make this operational&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The gateway is only part of the story. To achieve this policy management and rollout at scale, a separate control plane is required to dynamically configure the data plane using the xDS protocol, also known as the universal data plane API.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That is where control planes become important. Cloud Service Mesh, alongside open-source projects such as &lt;/span&gt;&lt;a href="https://aigateway.envoyproxy.io/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Envoy AI Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://github.com/kubernetes-sigs/kube-agentic-networking" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;kube-agentic-networking&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, uses Envoy as the data plane while giving operators higher-level ways to define and manage policy for agentic workloads.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This combination is powerful: Envoy provides the enforcement and extensibility in the traffic path, while control planes provide the operating model teams need to deploy that capability consistently.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Why this matters now&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The shift towards agentic systems and gen AI protocols such as MCP, A2A, and OpenAI necessitates an evolution in network intermediaries. The primary complexities Envoy addresses include:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Deep protocol inspection.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Protocol deframing extensions extract policy-relevant attributes (tool names, model names, resource paths) from the body of HTTP requests, enabling precise policy enforcement where traditional proxies would only see an opaque byte stream.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Fine-grained policy enforcement.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By exposing these internal attributes, existing Envoy extensions like RBAC and ext_authz can evaluate policies based on protocol-specific criteria. This allows network operators to enforce a unified, zero-trust security posture, ensuring agents comply with access policies for specific tools or resources.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Stateful transport management.&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Envoy supports managing session state for the Streamable HTTP transport used by MCP, enabling robust deployments in both passthrough and aggregating gateway modes, even across a fleet of intermediaries.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Agentic AI protocols are still in their early stages, and the protocol landscape will continue to evolve. That’s exactly why the networking layer needs to be adaptable. Enterprises should not have to rebuild their security and traffic infrastructure every time a new agent framework, transport pattern, or tool protocol gains traction. They need a foundation that can absorb change without sacrificing control.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Envoy brings together three qualities that are hard to get in one place: proven production maturity, deep extensibility, and growing protocol awareness for agentic workloads. By leveraging Envoy as an agent gateway, organizations can decouple security and policy enforcement from agent development code.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;That makes Envoy more than just a proxy that happens to handle AI traffic. It makes Envoy a future-ready foundation for agentic AI networking.&lt;/span&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to the additional co-authors of this blog: Boteng Yao, Software Engineer, Google and Tianyu Xia, Software Engineer, Google and Sisira Narayana, Sr Product Manager, Google.&lt;/span&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 03 Apr 2026 16:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</guid><category>Containers &amp; Kubernetes</category><category>AI &amp; Machine Learning</category><category>GKE</category><category>Developers &amp; Practitioners</category><category>Networking</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Envoy: A future-ready foundation for agentic AI networking</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/products/networking/the-case-for-envoy-networking-in-the-agentic-ai-era/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Yan Avlasov</name><title>Staff Software Engineer, Google</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Erica Hughberg</name><title>Product and Product Marketing Manager, Tetrate</title><department></department><company></company></author></item><item><title>Activating Your Data Layer for Production-Ready AI</title><link>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When discussing applications and systems using generative AI and the new opportunities they present, one component of the ecosystem is irreplaceable - data. &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Specifically, the data that companies gather, hold, and use daily. 
This data serves as the backbone for applications, analytics, knowledge bases, and much more. We use databases to store and work with this data, and most, if not all, AI-driven initiatives and new applications are going to use that data layer.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But how can we start to use the data in our AI systems? Let me introduce you to some of the labs showing how to prepare and use the data with AI models in Google databases.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Semantic Search: Text Embeddings in the Database&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our journey starts by preparing your data for semantic search and running your first tests, augmenting the Gen AI model's response by grounding it with your semantic search results. This grounding data is the basis for retrieval-augmented generation (RAG). Then, you can improve the performance of your search by indexing your embeddings using the latest indexing techniques.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One of the options is the &lt;/span&gt;&lt;a href="https://cloud.google.com/products/alloydb?e=48754805"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google AlloyDB database&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which has direct integration with AI models and supports the most demanding workloads. The following lab guides us through all the steps, starting from creating an AlloyDB cluster, loading sample data, and generating embeddings, to using those embeddings to generate an augmented response from the Gen AI model.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271d717f0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI integration is not limited to AlloyDB. All Google Cloud databases have AI integration and are capable of generating and using embeddings for semantic search. For example, if you are using &lt;/span&gt;&lt;a href="https://cloud.google.com/sql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud SQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, you can also generate and use embeddings for semantic search directly within your existing &lt;/span&gt;&lt;a href="https://cloud.google.com/sql/postgresql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;PostgreSQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;a href="https://cloud.google.com/sql/mysql"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;MySQL&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; instances.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The next two labs are very similar to the previous one, but instead of Google AlloyDB for PostgreSQL, we are using Cloud SQL for PostgreSQL and Cloud SQL for MySQL to use semantic search as the grounding engine for the model's response. Some steps are of course different due to variations in SQL language and different database engines, but the main idea stays the same: use our data to ground the model response and improve output.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the labs!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271d71b50&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Semantic search using text data is one of the cornerstones and important features making responses much more reliable and useful, but Google Gen AI models can offer much more. Let's talk about multimodal search.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Multimodal Embeddings: Bring Images to the Search&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In real life, of course, we use all our senses, including vision, to evaluate the world around us. The Google multimodal embedding models bring an additional layer of understanding, improving search by using embeddings not only for text but also for images.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the following lab, we use a catalog of products stored in AlloyDB and supplemented by images in Google Cloud Storage. We show how text descriptions and images can complement or substitute for each other in semantic search, naturally incorporating image-based input into our responses.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271d71760&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Preparing the data and taking the first steps are important for a general understanding of RAG and the tools available for search, but Google also offers direct AI integration that can help with your data analysis without any data preparation.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;AlloyDB AI Functions and Reranking&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Google AlloyDB database comes with additional AI integrations that help you use some AI capabilities without data preparation. For example, the AI.IF function can perform semantic search on the fly, evaluating sentiment or comparing data in columns with a natural language query, returning results filtered by the query condition. Also, you can apply a ranking function to the search output, improving the final result. You can try some of the new functionality using the following lab and let us know if it can help in your use case.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271d717c0&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But what if somebody is not particularly savvy with SQL, or not familiar with the data structure in your database? AlloyDB AI natural language (NL2SQL) can help with that.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Generate SQL using AlloyDB AI Natural Language&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The "alloydb_ai_nl" AlloyDB extension allows you not only to generate SQL queries based on the default metadata available out of the box, but also to build automatic or custom context, helping to get the best out of query generation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The NL2SQL functions can add a layer describing your data structure, the relations between tables, and metadata based on real data samples from your tables, without compromising the data itself. This provides the information the AI model needs to build the best query. The following lab helps you get started with the new features and generate your first queries based on your data schema.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-aside"&gt;&lt;dl&gt;
    &lt;dt&gt;aside_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;title&amp;#x27;, &amp;#x27;Go to the lab!&amp;#x27;), (&amp;#x27;body&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7271d71250&amp;gt;), (&amp;#x27;btn_text&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;href&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;image&amp;#x27;, None)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;From Tests to Production&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Those labs are part of the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;From Data Foundations to Advanced RAG&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; module of our &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/production-ready-ai-with-google-cloud-learning-path"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Production-Ready AI with Google Cloud&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; program. Check the other modules and see if they can help you adopt the AI capabilities provided by our Google Cloud services and tools. The end goal is a high-quality application that uses the full potential of modern technologies.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And keep an eye on the release notes for AlloyDB and Cloud SQL - the engineering team is busy working on new features and improvements. Happy testing.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 02 Apr 2026 13:18:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/hero_new.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Activating Your Data Layer for Production-Ready AI</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/hero_new.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/activating-your-data-layer-for-production-ready-ai/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Gleb Otochkin</name><title>Cloud Advocate, Databases</title><department></department><company></company></author></item><item><title>Create Expert Content: Architect A Personalized Multi-Agent System with Long-Term Memory</title><link>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In support of our mission to accelerate the developer journey on Google Cloud, we built &lt;strong&gt;Dev Signal&lt;/strong&gt;—a multi-agent system designed to transform raw community signals into reliable technical guidance by automating the path from discovery to expert creation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In the &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;first part&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; of this series for the &lt;strong&gt;Dev Signal&lt;/strong&gt;, we laid the essential groundwork for this system by establishing a project environment and equipping core capabilities through the Model Context Protocol (MCP). We standardized our external integrations, connecting to Reddit for trend discovery, Google Cloud Docs for technical grounding, and building a custom Nano Banana Pro MCP server for multimodal image generation. If you missed &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Part 1&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or want to explore the code directly, you can find the complete project implementation in our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, in Part 2, we focus on building the multi-agent architecture and integrating the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to personalize these capabilities. We will implement a Root Orchestrator that manages three specialist agents: the Reddit Scanner, GCP Expert, and Blog Drafter, to provide a seamless flow from trend discovery to expert content creation. We will also integrate a long-term memory layer that enables the agent to learn from your feedback and persist your stylistic preferences across different conversations. This ensures that Dev Signal doesn't just process data, but actually learns to match your professional voice over time.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Infrastructure and Model Setup&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, we initialize the environment and the shared Gemini model.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt; &lt;/code&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;from google.adk.agents import Agent\r\nfrom google.adk.apps import App\r\nfrom google.adk.models import Gemini\r\nfrom google.adk.tools import google_search, AgentTool, load_memory_tool, preload_memory_tool\r\nfrom google.adk.tools.tool_context import ToolContext\r\nfrom google.genai import types\r\nfrom dev_signal_agent.app_utils.env import init_environment\r\nfrom dev_signal_agent.tools.mcp_config import (\r\n    get_reddit_mcp_toolset, \r\n    get_dk_mcp_toolset, \r\n    get_nano_banana_mcp_toolset\r\n)\r\n\r\nPROJECT_ID, MODEL_LOC, SERVICE_LOC, SECRETS = init_environment()\r\n\r\n\r\nshared_model = Gemini(\r\n    model=&amp;quot;gemini-3-flash-preview&amp;quot;, \r\n    vertexai=True, \r\n    project=PROJECT_ID, \r\n    location=MODEL_LOC,\r\n    retry_options=types.HttpRetryOptions(attempts=3),\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7239a018b0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Memory Ingestion Logic&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We want Dev Signal&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; to do more than just follow instructions - we want it to learn from you. By capturing your preferences, such as specific technical interests on Reddit or a preferred blogging style, the agent can personalize its output for future use. To achieve this, we use the &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to persist session history across different conversations.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Long-term Memory&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We automate this through the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;save_session_to_memory_callback&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function. This callback is configured to run automatically after every turn, ensuring that session details are captured and stored in the memory bank without manual intervention.&lt;/span&gt;&lt;/p&gt;
&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;How Managed Memory Works&lt;/span&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Ingestion&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;save_session_to_memory_callback&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; sends the conversation data to Vertex AI.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Embedding&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Vertex AI converts the text into numerical vectors (embeddings) that capture the semantic meaning of your preferences.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Storage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: These vectors are stored in a managed index, enabling the agent to perform semantic searches and retrieve relevant history in future sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Retrieval&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The agent recalls this history using built-in ADK tools. The PreloadMemoryTool proactively brings in context at the start of an interaction, while the LoadMemoryTool allows the agent to fetch specific memories on an as-needed basis.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
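The four stages above can be sketched as a toy, self-contained mock. The bag-of-words "embedding" and the in-memory index below are illustrative stand-ins for what the managed Vertex AI Memory Bank service handles for you; they are not its real API.

```python
# Conceptual mock of the ingest -> embed -> store -> retrieve cycle.
# A real deployment uses Vertex AI Memory Bank; this sketch only shows
# the shape of the pattern with a toy bag-of-words "embedding".
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

memory_index = []  # stand-in for the managed vector index

def add_session_to_memory(fact: str) -> None:
    # Ingestion + embedding + storage.
    memory_index.append((embed(fact), fact))

def load_memory(query: str, top_k: int = 1) -> list:
    # Retrieval: semantic search over stored memories.
    q = embed(query)
    ranked = sorted(memory_index, key=lambda m: cosine(q, m[0]), reverse=True)
    return [fact for _, fact in ranked[:top_k]]

add_session_to_memory("User prefers a witty blogging style")
add_session_to_memory("User is interested in Cloud Run agents")
print(load_memory("what writing style does the user like?"))
# -> ['User prefers a witty blogging style']
```

The point of the mock is the division of labor: the callback only needs to ship sessions into the store, and the agent only needs to issue a semantic query; everything in between is managed by the service.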
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;async def save_session_to_memory_callback(*args, **kwargs) -&amp;gt; None:\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Defensive callback to persist session history to the Vertex AI memory bank.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    ctx = kwargs.get(&amp;quot;callback_context&amp;quot;) or (args[0] if args else None)\r\n    \r\n    # Check connection to Memory Service\r\n    if ctx and hasattr(ctx, &amp;quot;_invocation_context&amp;quot;) and ctx._invocation_context.memory_service:\r\n        # Save the session!\r\n        await ctx._invocation_context.memory_service.add_session_to_memory(\r\n            ctx._invocation_context.session\r\n        )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7239a01220&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Short-term Memory&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;add_info_to_state&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function serves as the agent's short-term working memory, allowing the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gcp_expert&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to reliably hand off its detailed findings to the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;blog_drafter&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; within the same session. This working memory and the conversation transcript are managed by the Vertex AI Session Service to ensure that active context survives server restarts or transient failures.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The boundary between session-based state and long-term persistence &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;- It is important to note that while this service provides stability during an active interaction, this short-term memory does not persist between different sessions. Starting a fresh session ID effectively resets this working state, ensuring a clean slate for new tasks. Cross-session continuity, where the agent remembers your stylistic preferences or past feedback, is handled by the Vertex AI Memory Bank.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def add_info_to_state(tool_context: ToolContext, key: str, data: str) -&amp;gt; dict:\r\n    tool_context.state[key] = data\r\n    return {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;, &amp;quot;message&amp;quot;: f&amp;quot;Saved \&amp;#x27;{key}\&amp;#x27; to state.&amp;quot;}&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7239a019d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
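To see how this handoff behaves, here is a self-contained sketch. The `FakeToolContext` class is a hypothetical stand-in for the `ToolContext` that the ADK runtime injects; only its `state` dict matters for this illustration.

```python
# Hypothetical stand-in for the ADK-injected ToolContext: in production
# the runtime supplies this object, and `state` is backed by the
# Vertex AI Session Service rather than a plain dict.
class FakeToolContext:
    def __init__(self):
        self.state: dict = {}

def add_info_to_state(tool_context, key: str, data: str) -> dict:
    tool_context.state[key] = data
    return {"status": "success", "message": f"Saved '{key}' to state."}

ctx = FakeToolContext()

# gcp_expert writes its research under the agreed-upon key...
add_info_to_state(ctx, "technical_research_findings", "Cloud Run findings...")

# ...and blog_drafter later reads the same session state to ground its draft.
findings = ctx.state["technical_research_findings"]
print(findings)  # -> Cloud Run findings...
```

The key name is the contract between the two specialists: the expert's instruction tells it to save under `technical_research_findings`, and the drafter's instruction template reads that same key back.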
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Specialist 1: Reddit Scanner (Discovery)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Reddit scanner is our “Trend Spotter," it identifies high-engagement questions from the last 21 days (3 weeks) to ensure that all research findings remain both timely and relevant.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Usage:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; It leverages &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_memory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to retrieve your past areas of interest and preferred topics from the Vertex AI memory bank If relevant history exists, the agent prioritizes those specific topics in its search to provide a personalized discovery experience.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Beyond simple retrieval, each sub-agent actively updates its memories by listening for new preferences and explicitly acknowledging them during the chat. This process captures relevant information in the session history, where an automated callback then persists it to the long-term Vertex AI memory bank for future use.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This memory management is supported by two distinct retrieval patterns within the Google Agent Development Kit (ADK). The first is the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;PreloadMemoryTool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which proactively brings in historical context at the beginning of every interaction to ensure the agent is fully briefed before addressing the current request. The second is the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;LoadMemoryTool&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which the agent uses on an as-needed basis, calling upon it only when it decides that deeper past knowledge would be beneficial for the current step in the workflow.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# Singleton toolsets\r\nreddit_mcp = get_reddit_mcp_toolset(\r\n    client_id=SECRETS.get(&amp;quot;REDDIT_CLIENT_ID&amp;quot;, &amp;quot;&amp;quot;),\r\n    client_secret=SECRETS.get(&amp;quot;REDDIT_CLIENT_SECRET&amp;quot;, &amp;quot;&amp;quot;),\r\n    user_agent=SECRETS.get(&amp;quot;REDDIT_USER_AGENT&amp;quot;, &amp;quot;&amp;quot;)\r\n)\r\nreddit_scanner = Agent(\r\n    name=&amp;quot;reddit_scanner&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a Reddit research specialist. Your goal is to identify high-engagement questions \r\n    from the last 3 weeks on specific topics of interest, such as AI/agents on Cloud Run.\r\n    \r\n    Follow these steps:\r\n    1. **MEMORY CHECK**: Use `load_memory` to retrieve the user\&amp;#x27;s **past areas of interest** and **preferred topics**. Calibrate your search to align with these interests.\r\n    2. Use the Reddit MCP tools to search for relevant subreddits and posts.\r\n    3. Filter results for posts created within the last 21 days (3 weeks).\r\n    4. Analyze &amp;quot;high-engagement&amp;quot; based on upvote counts and the number of comments.\r\n    5. Recommend the most important and relevant questions for a technical audience.\r\n    6. **CRITICAL**: For each recommended question, provide a direct link to the original thread and a concise summary of the discussion.\r\n    7. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. 
Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[reddit_mcp, load_memory_tool.LoadMemoryTool()],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7239a01e50&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Specialist 2: GCP Expert (Grounding)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The GCP expert is our "The Technical Authority". It triangulates facts by synthesizing official documentation from the Google Cloud Developer Knowledge MCP Server, community sentiment from Reddit, and broader context from Google Search.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;dk_mcp = get_dk_mcp_toolset(api_key=SECRETS.get(&amp;quot;DK_API_KEY&amp;quot;, &amp;quot;&amp;quot;))\r\n\r\n\r\nsearch_agent = Agent(\r\n    name=&amp;quot;search_agent&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;Execute Google Searches and return raw, structured results (Title, Link, Snippet).&amp;quot;,\r\n    tools=[google_search],\r\n)\r\ngcp_expert = Agent(\r\n    name=&amp;quot;gcp_expert&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a Google Cloud Platform (GCP) documentation expert. \r\n    Your goal is to provide accurate, detailed, and cited answers to technical questions by synthesizing official documentation with community insights.\r\n    \r\n    For EVERY technical question, you MUST perform a comprehensive research sweep using ALL available tools:\r\n    \r\n    1. **Official Docs (Grounding)**: Use DeveloperKnowledge MCP (`search_documents`) to find the definitive technical facts.\r\n    2. **Social Media Research (Reddit)**: Use the Reddit MCP to research the question on social media. This allows you to find real-world user discussions, common pain points, or alternative solutions that might not be in official documentation.\r\n    3. 
**Broader Context (Web/Social)**: Use the `search_agent` tool to find recent technical blogs, social media discussions, or tutorials.\r\n    \r\n    Synthesize your answer:\r\n    - Start with the official answer based on GCP docs.\r\n    - Add &amp;quot;Social Media Insights&amp;quot; or &amp;quot;Common Issues&amp;quot; sections derived from Reddit and Web Search findings.\r\n    - **CRITICAL**: After providing your answer, you MUST use the `add_info_to_state` tool to save your full technical response under the key: `technical_research_findings`.\r\n    - Cite your sources specifically at the end of your response, providing **direct links** (URLs) to the official documentation, blog posts, and Reddit threads used.\r\n    - **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[dk_mcp, AgentTool(search_agent), reddit_mcp, add_info_to_state],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7239a01a90&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt; Specialist 3: Blog Drafter (Creativity)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The blog drafter is our Content Creator. It drafts the blog based on the expert's findings and offers to generate visuals.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Memory Usage&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It checks &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;load_memory&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the user's &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;preferred writing style&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (e.g. "Witty", "Rap") stored in the &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;nano_mcp = get_nano_banana_mcp_toolset()\r\n\r\n\r\nblog_drafter = Agent(\r\n    name=&amp;quot;blog_drafter&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a professional technical blogger specializing in Google Cloud Platform. \r\n    Your goal is to draft high-quality blog posts based on technical research provided by the GDE expert and reliable documentation.\r\n    \r\n    You have access to the research findings from the gcp_expert_agent here:\r\n    {{ technical_research_findings }}\r\n \r\n    Follow these steps:\r\n    1. **MEMORY CHECK**: Use `load_memory` to retrieve past blog posts, **areas of interest**, and user feedback on writing style. Adopt the user\&amp;#x27;s preferred style and depth.\r\n    2. **REVIEW &amp;amp; GROUND**: Review the technical research findings provided above. **CRITICAL**: Use the `dk_mcp` (Developer Knowledge) tool to verify key facts, technical limitations, and API details. Ensure every claim in your blog is grounded in official documentation.\r\n    3. Draft a blog post that is engaging, accurate, and helpful for a technical audience.\r\n    4. Include code snippets or architectural diagrams if relevant.\r\n    5. Provide a &amp;quot;Resources&amp;quot; section with links to the official documentation used.\r\n    6. Ensure the tone is professional yet accessible, while adhering to any style preferences found in memory.\r\n    7. **VISUALS**: After presenting the drafted blog post, explicitly ask the user: &amp;quot;Would you like me to generate an infographic-style header image to illustrate these key points?&amp;quot; If they agree, use the `generate_image` tool (Nano Banana).\r\n    8. **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. 
Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[dk_mcp, load_memory_tool.LoadMemoryTool(), nano_mcp],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7239a017c0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Root Orchestrator&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The root agent serves as the system's strategist, managing a team of specialist agents and orchestrating their actions based on the specific goals provided by the user. At the start of a conversation, the orchestrator retrieves memory to establish context by checking for the user's past areas of interest, preferred topics, or previous projects. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;/agent.py&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;root_agent = Agent(\r\n    name=&amp;quot;root_orchestrator&amp;quot;,\r\n    model=shared_model,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a technical content strategist. You manage three specialists:\r\n    1. reddit_scanner: Finds trending questions and high-engagement topics on Reddit.\r\n    2. gcp_expert: Provides technical answers based on official GCP documentation.\r\n    3. blog_drafter: Writes professional blog posts based on technical research.\r\n \r\n    Your responsibilities:\r\n    - **MEMORY CHECK**: At the start of a conversation, use `load_memory` to check if the user has specific **areas of interest**, preferred topics, or past projects. Tailor your suggestions accordingly.\r\n    - **CAPTURE PREFERENCES**: Actively listen for user preferences, interests, or project details. Explicitly acknowledge them to ensure they are captured in the session history for future personalization.\r\n    - If the user wants to find trending topics or questions from Reddit, delegate to reddit_scanner.\r\n    - If the user has a technical question or wants to research a specific theme, delegate to gcp_expert.\r\n    - **CRITICAL**: After the gcp_expert provides an answer, you MUST ask the user: \r\n      &amp;quot;Would you like me to draft a technical blog post based on this answer?&amp;quot;\r\n    - If the user agrees or asks to write a blog, delegate to blog_drafter.\r\n    - Be proactive in helping the user navigate from discovery (Reddit) to research (Docs) to content creation (Blog).\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[load_memory_tool.LoadMemoryTool(), preload_memory_tool.PreloadMemoryTool()],\r\n    after_agent_callback=save_session_to_memory_callback,\r\n    sub_agents=[reddit_scanner, gcp_expert, blog_drafter]\r\n)\r\n\r\napp = App(root_agent=root_agent, name=&amp;quot;dev_signal_agent&amp;quot;)&amp;#x27;), 
(&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f7239a012b0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this part of our series, we built multi-agent architecture and implemented a robust, dual-layered memory system. We established a Root Orchestrator, managing three specialist agents: a Reddit Scanner for trend discovery, a GCP Expert for technical grounding, and a Blog Drafter for creative content creation. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By utilizing short-term state to pass information reliably between specialists and integrating the Vertex AI memory bank for long-term persistence, we’ve enabled the agent to learn from your feedback and remember specific writing styles across different conversations. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;part 3&lt;/a&gt;, we will show you how to test the agent locally to verify these components on your workstation, before transitioning to a full production deployment on Google Cloud Run in part 4. Can't wait for Part 3? The full implementation is already available for you to explore on &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To learn more about the underlying technology, explore the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI Memory Bank overview&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or dive into the official &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-development-kit/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK Documentation&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to see how to orchestrate complex multi-agent workflows.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; for the helpful review and feedback on this article.&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, Follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Tue, 31 Mar 2026 09:31:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Architect A Personalized Multi-Agent System with Long-Term Memory</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item><item><title>Five techniques to reach the efficient frontier of LLM inference</title><link>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</link><description>&lt;div 
class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Every dollar that you spend on model inference buys you a position on a graph of latency and throughput. On this plot is a curve of optimal configurations, where you've squeezed the maximum possible performance from your hardware. That curve, borrowed from portfolio theory in finance, is the &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Efficient_frontier" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;efficient frontier&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With the assumption that you have a fixed budget for hardware, you can trade latency for throughput. But, you can't improve one aspect without sacrificing the other, unless the frontier curve itself moves. There are two fundamentally different dynamics at play, and this is the central insight for anyone running LLMs in production.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The first dynamic is &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;getting to the frontier&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, which involves applying the full stack of techniques available to you today. This part is within your control. &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tensortllm?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Continuous batching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#model-memory"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;paged attention&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/concepts/about-gke-inference-gateway?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;intelligent routing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/blog/posts/from-research-to-production-accelerate-oss-llm-with-eagle-3-on-vertex?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;speculative decoding&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#quantization"&gt;&lt;span style="text-decoration: 
underline; vertical-align: baseline;"&gt;quantization&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; all exist right now. If you're not using these techniques, you're operating below the frontier and leaving performance on the table.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The second dynamic is that &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;the frontier itself is constantly moving outward&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. This part is largely outside of your control. Researchers publish new algorithms. Hardware vendors ship new architectures. Open-source projects mature. Each breakthrough redefines what's physically achievable and expands the curve so that yesterday's optimal configuration is today's inefficiency.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Your job as a platform engineer is to stay as close to the frontier as possible as you build infrastructure that's flexible enough to absorb each new advance as it arrives. This article gives you the tools to do just that.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Why inference has an efficient frontier&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Every LLM request has two computational phases, and each is bottlenecked by a different hardware resource.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;1. Prefill (Compute-Bound)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In this phase, the GPU processes your entire input prompt at once to build the &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#attention-layer-optimization?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;key-value (KV) cache&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; for the attention mechanism. Because the prompt's tokens are processed in parallel, the GPU's compute cores (tensor cores) are highly utilized. This phase is fast and efficient: the processors have all of the data that they need, immediately available, to perform massive matrix multiplications. Longer prompts simply mean more parallel computation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;2. Decode (Memory-Bandwidth-Bound)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: This phase generates new tokens, one at a time, &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Autoregressive_model" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;autoregressively&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. To generate even a single token, the GPU can't batch the work. It must fetch the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;entire&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; model's weights and the growing KV cache from &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;High-Bandwidth Memory (HBM)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; into the compute cores. The GPU then computes that one token and repeats the entire process for the next one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This mismatch is the fundamental reason that the frontier exists. You can't optimize a single system for both phases simultaneously without making some tradeoffs.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
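The gap between the two phases is easy to quantify with a back-of-the-envelope roofline model. The sketch below uses illustrative, assumed numbers (an 8B-parameter FP16 model, ~1 PFLOP/s of tensor compute, ~2 TB/s of HBM bandwidth); the point is the shape of the bounds, not the exact figures.

```python
# Roofline sketch: prefill is limited by compute, decode by memory bandwidth.
# All hardware numbers below are illustrative assumptions.

PARAMS = 8e9          # model parameters (assumed 8B model)
BYTES_PER_PARAM = 2   # FP16 weights
FLOPS = 1e15          # peak tensor compute, FLOP/s (assumed)
HBM_BW = 2e12         # memory bandwidth, bytes/s (assumed)

def prefill_seconds(prompt_tokens: int) -> float:
    """Prefill is compute-bound: roughly 2 * params FLOPs per token,
    all processed in parallel."""
    return (2 * PARAMS * prompt_tokens) / FLOPS

def decode_seconds_per_token(batch_size: int = 1) -> float:
    """Decode is bandwidth-bound: every step re-reads all weights from HBM,
    amortized across the batch."""
    return (PARAMS * BYTES_PER_PARAM) / HBM_BW / batch_size

ttft = prefill_seconds(2048)        # time to first token for a 2K prompt
tbt = decode_seconds_per_token()    # time between tokens at batch size 1

print(f"TTFT lower bound: {ttft * 1e3:.1f} ms")
print(f"TBT  lower bound: {tbt * 1e3:.1f} ms -> {1 / tbt:.0f} tok/s")
```

Under these assumptions, a 2K-token prefill takes tens of milliseconds while decode at batch size 1 is capped at a few hundred tokens per second, and batching raises decode throughput almost for free, which is exactly the tradeoff the frontier curve describes.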
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/prefill-vs-decode.max-1000x1000.jpg"
        
          alt="prefill-vs-decode"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The two axes of inference&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of risk and return, the efficient frontier of LLM inference measures a different fundamental tradeoff, with the assumption that the hardware budget is fixed:&lt;/span&gt;&lt;/p&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;
&lt;div style="color: #5f6368; overflow-x: auto; overflow-y: hidden; width: 100%;"&gt;&lt;table border="1" style="border-collapse: collapse; width: 99.9748%;"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong&gt;Axis&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;a href="https://bentoml.com/llm/inference-optimization/llm-inference-metrics" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Key metrics measured&lt;/strong&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;strong&gt;Hardware constraint&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency (the X-Axis)&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Time to First Token (TTFT) + Time Between Tokens (TBT)&lt;/span&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Compute (prefill) and memory bandwidth (decode)&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;strong style="vertical-align: baseline;"&gt;Throughput (the Y-Axis)&lt;/strong&gt;&lt;/td&gt;
&lt;td style="width: 31.5124%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Total tokens per second across all concurrent users&lt;/span&gt;&lt;/td&gt;
&lt;td style="width: 31.5133%;"&gt;&lt;span style="vertical-align: baseline;"&gt;Batch size × memory capacity&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Cost is the constraint that buys the curve itself: increase your hardware budget, or wait for the industry to deliver a new algorithmic breakthrough, and the entire frontier shifts outward. For a given budget and software stack, you can apply today's best practices to move from a sub-optimal point &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;towards&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; that frontier.&lt;/span&gt;&lt;/p&gt;
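As a minimal sketch of how the two axes translate into request-level numbers (all figures below are illustrative, and in practice TBT itself grows as the batch gets larger):

```python
def request_latency(ttft_s: float, tbt_s: float, output_tokens: int) -> float:
    """End-to-end latency for one request: time to the first token,
    plus the per-token gap for each subsequent token."""
    return ttft_s + tbt_s * (output_tokens - 1)

def aggregate_throughput(tbt_s: float, batch_size: int) -> float:
    """Tokens/s across all concurrent users when decode steps are batched."""
    return batch_size / tbt_s

# Illustrative numbers: 50 ms TTFT, 20 ms TBT, 256 output tokens, 32 users.
print(f"{request_latency(0.050, 0.020, 256):.2f} s per request")
print(f"{aggregate_throughput(0.020, 32):.0f} tok/s across the batch")
```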
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Getting to the frontier: Five techniques within your control&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Most production inference systems today operate &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;below&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; the frontier. They're leaving performance on the table, not because better techniques don't exist, but because they haven't adopted them yet. Everything described in this section is available today. If you're not applying these techniques, you're choosing to operate below the curve.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/interventions.max-1000x1000.jpg"
        
          alt="interventions"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;1. Semantic routing across model tiers&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Not every query needs a 400B-parameter model. Simple classification, summarization, or formatting tasks can be routed to smaller, quantized models that are orders of magnitude cheaper per token. A lightweight classifier at the gateway edge analyzes query complexity and routes accordingly: frontier-class models for hard reasoning, and small models for everything else.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/how-gke-inference-gateway-improved-latency-for-vertex-ai?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Semantic routing&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; pushes your system dramatically closer to its theoretical maximum throughput, and avoids wasted cycles on easy tasks, without sacrificing aggregate output quality.&lt;/span&gt;&lt;/p&gt;
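A minimal sketch of what such a gateway-edge router might look like. The tier names and the scoring heuristics below are hypothetical; a production router would typically use a trained lightweight classifier rather than keyword rules.

```python
# Hypothetical gateway-edge router: score query complexity cheaply,
# then pick a model tier. Tier names and rules are illustrative only.

SMALL_MODEL = "small-8b-int4"      # assumed tier name
FRONTIER_MODEL = "frontier-400b"   # assumed tier name

# Crude complexity signals; a real classifier learns these from data.
REASONING_MARKERS = ("prove", "step by step", "debug", "why does", "derive")

def route(query: str) -> str:
    """Send easy tasks to the small tier, hard reasoning to the frontier tier."""
    q = query.lower()
    hard = len(q.split()) > 200 or any(m in q for m in REASONING_MARKERS)
    return FRONTIER_MODEL if hard else SMALL_MODEL

print(route("Summarize this paragraph in one sentence."))      # small tier
print(route("Prove this invariant holds and debug the loop."))  # frontier tier
```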
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;2. Prefill and decode disaggregation&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Physically separating prefill and decode phases onto different hardware is one of the most architecturally significant optimizations available today.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The prefill phase needs compute-dense GPUs. The decode phase needs high-bandwidth memory. If you force both phases onto the same GPU, then one resource is always underutilized.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To push both phases toward their theoretical hardware limits independently, run dedicated prefill clusters and decode clusters. Connect these clusters with high-speed networks that transfer only the compressed KV cache state from the prefill cluster to the decode cluster.&lt;/span&gt;&lt;/p&gt;
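One way to see why co-location wastes capacity is a toy utilization model. The per-phase utilization figures below are illustrative assumptions, not measurements:

```python
# Toy model: prefill keeps compute busy but barely touches bandwidth;
# decode is the reverse. Numbers are illustrative assumptions.

# (compute utilization, bandwidth utilization) for each phase on one GPU
PREFILL = (0.90, 0.20)
DECODE = (0.15, 0.85)

def colocated(prefill_frac: float) -> tuple[float, float]:
    """One shared pool runs both phases; utilization is a time-weighted blend."""
    d = 1 - prefill_frac
    return (PREFILL[0] * prefill_frac + DECODE[0] * d,
            PREFILL[1] * prefill_frac + DECODE[1] * d)

# With a 30/70 prefill/decode time split, the blended pool never saturates
# either resource, while dedicated pools run each phase near its own limit.
compute, bandwidth = colocated(0.30)
print(f"co-located:    compute {compute:.0%}, bandwidth {bandwidth:.0%}")
print(f"disaggregated: prefill pool compute {PREFILL[0]:.0%}, "
      f"decode pool bandwidth {DECODE[1]:.0%}")
```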
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;3. Quantization: Trading precision for speed&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When you &lt;/span&gt;&lt;a href="https://cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/llm-optimization#quantization?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;reduce model weights&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; from FP16 to the INT8 or INT4 formats, you cut the memory footprint to half or a quarter. Because the decode phase is memory-bandwidth-bound, 4-bit weights can be read up to 4× faster than 16-bit weights. This approach provides a direct TBT improvement.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The tradeoff is quality because naive quantization degrades model outputs. Modern techniques like Activation-aware Weight Quantization (AWQ) and GPTQ preserve the quality of sensitive weights, but aggressively compress others, to achieve near-FP16 quality at INT4 speeds.&lt;/span&gt;&lt;/p&gt;
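The memory arithmetic behind that speedup is straightforward. A small sketch, using an illustrative 70B-parameter model (weights only; KV cache and activations add more on top):

```python
def weight_footprint_gb(params: float, bits: int) -> float:
    """GB needed to hold the weights alone, at a given precision."""
    return params * bits / 8 / 1e9

PARAMS = 70e9  # illustrative 70B-parameter model
for bits in (16, 8, 4):
    gb = weight_footprint_gb(PARAMS, bits)
    # In the bandwidth-bound decode phase, reading fewer bytes per step
    # translates roughly linearly into a lower TBT.
    print(f"{bits:>2}-bit: {gb:.0f} GB of weights, "
          f"~{16 // bits}x faster weight reads than FP16")
```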
&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;4. Context routing: The biggest lever that most teams miss&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a production deployment with dozens of model replicas, the&lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt; routing layer &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;is where the biggest competitive advantages are won or lost today.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In 2026, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/open-models/model-garden-published-notebooks/model_garden_advanced_features#prefix_caching_"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;prefix caching&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; is foundational. If ten users ask questions about the exact same 100-page RAG document, or use the identical massive system prompt, you shouldn't run the compute-heavy prefill phase ten times. You should compute the KV cache once, store it, and then let the other nine users reuse it. This approach slashes TTFT by up to 85% and drastically reduces compute costs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But, there's a catch: a standard L4 load balancer scatters requests randomly. If user 2's request lands on a different GPU than user 1's request, the prefix cache is useless. The system has to recompute the cache from scratch.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is why &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;context-aware L7 routing&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; is the differentiator. An intelligent router inspects the incoming prompt's prefix and intentionally routes the request to the specific pod that &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;already holds that context in its cache&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. You stop wasting compute power on redundant work and instantly push your latency and throughput closer to the physical limits of your hardware.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
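A minimal sketch of the idea behind prefix affinity, using a hash of the prompt's leading bytes as the routing key. The pod names and the 1 KB key length are assumptions; a production L7 router matches against actual per-pod cache state rather than a static hash, but the effect is the same: shared prefixes land on the same replica.

```python
import hashlib

# Hypothetical prefix-affinity router: requests that share the same leading
# bytes (system prompt, RAG document) hash to the same pod, so only the
# first request pays the prefill cost for that prefix.

PODS = ["pod-a", "pod-b", "pod-c", "pod-d"]  # assumed replica names
PREFIX_CHARS = 1024  # how much of the prompt forms the routing key (assumed)

def pick_pod(prompt: str) -> str:
    """Deterministically map the prompt's prefix to one replica."""
    key = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).digest()
    return PODS[int.from_bytes(key[:4], "big") % len(PODS)]

# A long shared system prompt (longer than the routing key, so per-user
# question text never changes the key).
system_prompt = "You are a support agent for ExampleCo. Answer politely. " * 30

# Ten users with different questions but the same prefix hit one pod.
pods = {pick_pod(system_prompt + f" user question {i}") for i in range(10)}
print(pods)  # a single pod, because the routing key is identical
```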
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/prefix-aware-routing.max-1000x1000.jpg"
        
          alt="prefix-aware-routing"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3 role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;5. Speculative decoding&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Remember: during the decode phase, tensor cores are mostly idle because the bottleneck is memory bandwidth. &lt;/span&gt;&lt;a href="https://cloud.google.com/vertex-ai/docs/blog/posts/from-research-to-production-accelerate-oss-llm-with-eagle-3-on-vertex?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Speculative decoding&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; exploits this idle compute capacity.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A small, fast "draft" model generates several candidate tokens cheaply. The large target model then &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;verifies&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; all of the candidates in a single forward pass, which is a parallel compute-bound operation, rather than a sequential memory-bound one. If the draft model predicted the candidates correctly, you've generated 4-5 tokens for the memory cost of one.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This approach directly breaks the TBT floor set by memory bandwidth. If you're not using speculative decoding for latency-sensitive workloads, you're not leveraging one of the most impactful optimizations available.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Although the addition of a draft model can introduce some operational complexity and slightly increase compute costs, the draft model is relatively tiny compared to the main model. This tradeoff for latency is worthwhile.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Note that some newer models have introduced &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;self-speculative decoding&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, which eliminates the overhead of managing a second model. These models use specialized internal layers (often called &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;prediction heads&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;) that are trained to predict several future tokens simultaneously. These heads typically achieve high acceptance rates for their drafted tokens.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Case study: How Vertex AI moved closer to the frontier&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The Vertex AI engineering team moved closer to the frontier when they adopted &lt;/span&gt;&lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/how-gke-inference-gateway-improved-latency-for-vertex-ai?utm_campaign=CDR_0x2b6f3004_default&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GKE Inference Gateway&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, which is built on the standard Kubernetes Gateway API. Inference Gateway intercepted requests at Layer 7 and added two critical layers of intelligence:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Load-aware routing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It scraped real-time metrics (like KV cache utilization and queue depth) directly from the model server's Prometheus endpoints, and routed each request to the pod that could serve it the fastest.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Content-aware routing&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Crucially, it inspected request prefixes and routed traffic to the pod that already held that specific context in its KV cache, avoiding expensive re-computation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
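A minimal sketch of the load-aware half of that logic. The metric names, weights, and pod inventory below are illustrative, not the gateway's actual scoring function:

```python
# Hypothetical load-aware pod selection from scraped metrics. Real gateways
# read live model-server metrics (KV cache utilization, queue depth) rather
# than this static snapshot.

pods = {
    "pod-a": {"kv_cache_util": 0.90, "queue_depth": 6},
    "pod-b": {"kv_cache_util": 0.40, "queue_depth": 1},
    "pod-c": {"kv_cache_util": 0.75, "queue_depth": 3},
}

def pick_least_loaded(metrics: dict) -> str:
    """Score each pod by a weighted blend of cache pressure and queueing,
    then pick the least-loaded one. Weights are illustrative."""
    def score(m: dict) -> float:
        return 0.5 * m["kv_cache_util"] + 0.5 * (m["queue_depth"] / 10)
    return min(metrics, key=lambda name: score(metrics[name]))

print(pick_least_loaded(pods))  # pod-b: lowest cache pressure and queue
```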
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When the production workloads were migrated to this intelligent routing architecture, the Vertex AI team proved that optimizing the network layer is key to unlocking performance at scale. Validated on production traffic, the results were stark:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;35% faster TTFT&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; for Qwen3-Coder (context-heavy coding agent workloads)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;2x better P95 tail latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (52% improvement) for DeepSeek V3.1 (bursty chat workloads)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Doubled prefix cache hit rate&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (optimized from 35% to 70%)&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The bottom line&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;LLM inference has an efficient frontier, which represents a hard boundary where latency and throughput are optimally balanced for a given compute budget.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Getting to that frontier is within your control&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. The techniques exist today: continuous batching, paged attention, intelligent L7 routing, speculative decoding, quantization, and prefill and decode disaggregation. The GKE Inference Gateway case study shows that routing alone, without changing hardware, models, or cluster size, cut TTFT by 35% and doubled cache efficiency. If you're not applying the full stack, you're operating below the curve and overpaying for every token.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;The frontier itself keeps moving outward&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;. This part is outside of your control. Researchers publish new algorithms, hardware vendors ship new architectures, and open-source serving frameworks integrate these algorithms and architectures. What was a cutting-edge optimization 18 months ago is now table stakes. Your job isn't to predict which breakthrough comes next; it's to build infrastructure flexible enough to absorb it when it arrives.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The organizations that will win on inference economics aren't the ones with the most GPUs. They're the ones that systematically close the gap to today's frontier while they stay ready for tomorrow's.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Have you applied any of these optimization techniques to your own LLM inference workloads? I'd love to hear about your experience! Share what you've built with me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/karlweinmeister/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;LinkedIn&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://x.com/kweinmeister" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;, or &lt;/span&gt;&lt;a href="https://bsky.app/profile/kweinmeister.bsky.social" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Bluesky&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;!&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Fri, 27 Mar 2026 10:02:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/hero-image.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Five techniques to reach the efficient frontier of LLM inference</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/hero-image.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/five-techniques-to-reach-the-efficient-frontier-of-llm-inference/</url></og><author 
xmlns:author="http://www.w3.org/2005/Atom"><name>Karl Weinmeister</name><title>Director, Developer Relations</title><department></department><company></company></author></item><item><title>The new AI literacy: Insights from student developers</title><link>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;AI has made it easier than ever for student developers to work efficiently, tackle harder problems, and pursue ambitious projects. But for students earning technical degrees, these new capabilities also create genuine tensions around learning. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;How much should I use AI? What should I use it for? &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As 90% of technology professionals now use AI in their daily work according to &lt;/span&gt;&lt;a href="https://dora.dev/dora-report-2025/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google's DORA 2025 report&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, understanding how the next generation navigates these tools matters more than ever. Contrary to fears that students use AI to cheat or are becoming intellectually lazy, our research with UC Berkeley students reveals something different. Students treated AI as a learning partner rather than a shortcut, using it strategically for some tasks while deliberately turning it off for others. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI becomes foundational to software development, the question isn't whether to adopt these tools but how to work with them thoughtfully. The students at UC Berkeley are showing us one answer: with curiosity, caution, and a commitment to genuine learning that technology can support but never replace.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The research&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our team of four student researchers (Andrew Harlan, Mindy Tsai, Kenny Ly, and Karissa Wong) conducted a mixed methods research project with UC Berkeley students in Computer Science, Electrical Engineering, Design, and Data Science to understand how they're integrating AI into their academic work. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A separate UC Berkeley study (conducted by Edward Fraser, Jessie Deng, and Eileen Thai) used eye-tracking technology to observe how developers with one to five years of experience actually interact with AI coding assistants. Both student teams were supported by dedicated mentors, with Googlers Harini Sampath, Becky Sohn, and Derek DeBellis advising the mixed methods research, and UC Berkeley Professor John Chuang, PhD, advising the eye-tracking study.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Together, these studies reveal three key insights about how students balance AI's capabilities with their need to develop genuine expertise. The patterns emerging among students closely mirror what DORA research has found in professional developers.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #1: The 24/7 office hour&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;AI as a tutor, not a shortcut&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;When asked to describe their relationship with AI, every student in our study used educational terms. They referred to AI as a "tutor" or "teacher," not an assistant or productivity tool.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"AI is a teacher...in the sense that it is most helpful for understanding dense content and potentially parts of code that are prewritten in the database to allow for fundamental understanding of the project."&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I use [AI] as my own private tutor...to [cover] any specific topics in the classes or lectures...not just in CS classes but in all classes."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This framing matters because it reveals strategic use rather than dependency. Rather than asking AI to complete assignments, students described using AI metacognitively to identify gaps in their knowledge, clarify confusing concepts, and guide their learning process. They used AI to summarize academic papers mentioned in lectures so they could decide which ones warranted deeper reading. They asked AI to explain why their code produced specific errors.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One student explained their workflow:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"When I don't understand what my professor is explaining, I ask AI to help me understand the concept or what a piece of code is doing. If I don't know how to begin a lab, I give the prompt to AI to figure out where to start, then write the code myself and ask AI to correct my work."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For students with learning disabilities, this constant availability addresses a real access gap:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"As a student with a learning disability, I need more time to understand a problem. AI has helped me a lot—it's like having a 24/7 TA."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By extending access beyond limited office hours, AI allows students to iterate on their understanding without waiting for help. This frees up cognitive space for higher-level thinking:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I spend less time actually coding and more time on big picture ideation. Now, my time is spent thinking through logic, concepts, and coming up with ideas creatively, rather than producing code manually."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;These accounts portray AI as a scaffold for exploration rather than a producer of finished work. This mirrors what DORA research found: when AI handles routine toil, developers can focus more energy on delivering user value.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #2: Active resistance to overdependence&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;Building guardrails to protect learning&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Despite embracing AI as a learning tool, students expressed genuine anxiety about becoming too dependent on it.&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"If AI disappeared, I'd struggle more with figuring out how to solve things on my own."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In a recent study using EEG to measure brain activity during essay writing, researchers found that AI users showed weaker cognitive engagement patterns than those using search engines or no tools. Frequent AI users who later wrote without assistance also remembered less of their content and felt less ownership over it, a deficit the authors termed "cognitive debt."&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our research revealed a positive signal: rather than passively accepting this risk, students responded by establishing deliberate boundaries.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;One mechanical engineering student described how she's developed a competency-based system over years of working with electronics: &lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"When I use basic sensors like a servo or ultrasonic, I can still code that myself. But when I have more complex sensors where I don't necessarily know the exact functions, that's when I'll use AI." She explained her reasoning: "I have the background to understand why things aren't working, but I don't always know the direct language to fix it, so AI is good for helping overcome that."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a recent project building a tactile storytelling tool, she knew the basic concept but needed help structuring the counting and comparison system. "AI was really useful in setting up that structure, but I still had to code after to fine-tune it." She's clear about the division of labor: "I'm still working with doing the code myself. I wouldn't say that I'm just handing it off like a technical expert. I'm working in tandem with it. I have to be the initiator of what I want it to actually do. If I just give it a blind request, it's not useful at all."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Even when students do engage AI, they often set explicit rules:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Sometimes I tell AI not to give me the full answer, just to guide me in the right direction."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Students have developed several specific strategies to prevent overreliance:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Limiting access to powerful models:&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I don't want to pay for AI tools because it could lead me to overuse the models."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Alternating between assisted and unassisted work:&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I have actually gone back to hand-coding for certain things, like a for-loop for example."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Warning against "vibe coding":&lt;/strong&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"AI tools can definitely be a good companion to boost developer productivity. However, one needs to be very mindful and not get used to vibe coding. It's very important to understand and validate the code AI is generating and use it appropriately."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This anxiety is itself a form of metacognitive awareness. Students recognize that the path of least resistance may not be the path of greatest learning. This mirrors DORA's findings: despite 90% adoption, about 30% of practitioners report little to no trust in AI-generated code. Effective AI use requires mastering critical evaluation and verification, not just adoption.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Finding #3: Knowing when to use AI and when to turn it off&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;span style="vertical-align: baseline;"&gt;What the eye-tracking data reveals&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;A separate study using eye-tracking technology provides behavioral validation. When researchers observed developers with one to five years of experience interacting with AI coding assistants, they found stark differences in AI engagement depending on task type:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;During interpretive tasks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; requiring deep understanding: &amp;lt;1% visual attention on AI&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;During mechanical tasks&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; like boilerplate code: 19% visual attention on AI&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Developers actively ignored AI suggestions during complex work, even when those suggestions were accurate and could save time. AI creates cognitive load during deep understanding work, and experienced developers know when to turn it off.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Strategic selectivity, not blanket adoption&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Students in our interviews echoed this context-dependent approach:&lt;/span&gt;&lt;/p&gt;
&lt;p style="padding-left: 40px;"&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"I typically use AI to generate ideas for a starting point."&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;"Despite knowing AI was allowed, I wanted to go through the friction of learning and failing and having space for creativity."&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Customization matters&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Most AI coding assistants now let developers toggle inline suggestions, enable on-demand only modes, or adjust suggestion frequency. By experimenting with these settings, developers can align AI behavior with the cognitive demands of different tasks, reducing disruption during deep work while maintaining assistance for routine tasks.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;What this means for the industry&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Students are modeling the future of AI-augmented development&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The students in these studies are ahead of the curve. They've developed a literacy that knows when to engage AI, how to verify its output, and when to work manually to preserve understanding. For teams navigating AI adoption, the student experience offers direction:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Experiment with customization&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to find configurations that support rather than disrupt work&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Build verification practices&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; into workflows rather than accepting suggestions uncritically&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Create space for unassisted work&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; on complex problems where understanding matters more than speed&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;As AI becomes foundational to software development, the question isn't whether to adopt these tools but how to work with them thoughtfully. The students at UC Berkeley are showing us one answer: with curiosity, caution, and a commitment to genuine learning that technology can support but never replace.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To learn more about how professionals across the industry are navigating AI adoption, &lt;/span&gt;&lt;a href="https://dora.dev/dora-report-2025/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;download the DORA 2025 State of AI-assisted Software Development Report&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. You can also &lt;/span&gt;&lt;a href="https://dora.dev/insights/tags/uc-berkeley/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;read the full research articles from our collaboration with researchers at UC Berkeley.&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr/&gt;
&lt;p&gt;&lt;sup&gt;&lt;em&gt;&lt;span style="vertical-align: baseline;"&gt;1. Kosmyna, Nataliya, et al. "Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task." &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;arXiv&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;, 10 June 2025, doi:10.48550/arXiv.2506.08872. Accessed 28 Jan. 2026.&lt;/span&gt;&lt;/em&gt;&lt;/sup&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Thu, 26 Mar 2026 17:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</guid><category>AI &amp; Machine Learning</category><category>Developers &amp; Practitioners</category><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>The new AI literacy: Insights from student developers</title><description></description><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/how-uc-berkeley-students-use-ai-as-a-learning-partner/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Andrew Harlan, Ph.D.</name><title>UX Researcher &amp; Creative Technologist, Independent</title><department></department><company></company></author><author xmlns:author="http://www.w3.org/2005/Atom"><name>Steve Fadden, Ph.D.</name><title>UX Research Lead, Google</title><department></department><company></company></author></item><item><title>Building Distributed AI Agents</title><link>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let's be honest: building an AI agent that works &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;once&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; is easy. 
Building an AI agent that works &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;reliably&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; in production, integrated with your existing React or Node.js application? That's a whole different ball game.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;(TL;DR: Want to jump straight to the code? Check out the &lt;/span&gt;&lt;a href="https://github.com/amitkmaraj/course-creation-ai-agent-architecture" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt;Course Creator Agent Architecture on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;.)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We've all been there. You have a complex workflow—maybe it's researching a topic, generating content, and then grading it. You shove it all into one massive Python script or a giant prompt. It works on your machine, but the moment you try to hook it up to your sleek frontend, things get messy. Latency spikes, debugging becomes a nightmare, and scaling is impossible without duplicating the entire monolith.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;But what if you didn't have to rewrite your entire application to accommodate AI? What if you could just... plug it in?&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this post, we're going to explore a better way: the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;orchestrator pattern&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;. Instead of just one powerful agent that does everything, we'll build a team of specialized, distributed microservices. This approach lets you integrate powerful AI capabilities directly into your existing frontend applications without the headache of a monolithic rewrite.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll use Google's &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to build the agents, the &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent-to-Agent (A2A)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; protocol to connect them and let them communicate with each other, and deploy them as scalable microservices on &lt;/span&gt;&lt;a href="https://cloud.google.com/run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Why Distributed Agents? (And Why Your Frontend Team Will Love You)&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Imagine you have a polished Next.js application. You want to add a "Course Creator" feature.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you build a monolithic agent, your frontend has to wait for a single, long-running process to finish everything. If the research part hangs, the whole request times out. You also can’t scale individual agents independently. For example, if your judge agent needs more processing, you’ll have to scale &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;all&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; your agents up, instead of just the judge agent.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By adopting a distributed orchestrator pattern, you gain scalability and flexibility:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Seamless integration:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Your frontend talks to one endpoint (the orchestrator), which manages the chaos behind the scenes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Independent scaling:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Is the judge step slow? Scale just that service to 100 instances. Your research service can stay small.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Modularity:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You can write the high-performance networking parts in Go and the data science parts in Python. They just speak HTTP.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
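To make the "seamless integration" point concrete, here is a minimal sketch of the frontend-facing surface: one function, one orchestrator URL. The URL and payload shape are placeholders (not from the repo), and the HTTP transport is injected so the routing logic is visible without a network call; in a real app you would pass something like `requests.post`.

```python
import json

# Hypothetical endpoint; the only URL the frontend ever needs to know.
ORCHESTRATOR_URL = "https://orchestrator.example.com/rpc"

def create_course(topic, post):
    """Send one request to the orchestrator; it fans out to the
    researcher and judge services internally."""
    payload = {"task": "create_course", "topic": topic}
    return post(ORCHESTRATOR_URL, json.dumps(payload))

# The caller never learns about researcher-service or judge-service:
sent = []
create_course("Intro to Rust", post=lambda url, body: sent.append((url, body)))
print(sent[0][0])  # only the orchestrator URL appears
```

The researcher and judge services can move, scale, or be rewritten without the frontend changing a line.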
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;The Blueprint: Course Creator App&lt;/span&gt;&lt;/h2&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/original_images/building-distributed-ai-agents-course-creator.gif"
        
          alt="building-distributed-ai-agents-course-creator"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Let's build that course creator system. We'll break it down into three distinct specialists:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The researcher&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A specialist that digs up information.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The judge&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: A QA specialist that ensures quality.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;The orchestrator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The manager that coordinates the work and talks to your frontend.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 1: Hiring the Specialist (The Researcher)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;First, we need someone to do the legwork. We'll build a focused agent using ADK whose only job is to use Google Search.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# researcher/app/agent.py\r\nfrom google.adk.agents import Agent\r\nfrom google.adk.tools import google_search\r\n\r\nresearcher = Agent(\r\n    name=&amp;quot;researcher&amp;quot;,\r\n    model=&amp;quot;gemini-2.5-flash&amp;quot;,\r\n    description=&amp;quot;Gathers information on a topic using Google Search.&amp;quot;,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are an expert researcher. Your goal is to find comprehensive information.\r\n    Use the `google_search` tool to find relevant information.\r\n    Summarize your findings clearly.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    tools=[google_search],\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726c776af0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;See? Simple. It doesn't know about courses or frontends. It just researches.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 2: The Judge (Structured Output)&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-judge.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-judge"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We can't have our agents rambling. We need strict pass or fail grades so our code can make decisions. We use &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Pydantic&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to enforce this contract.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# judge/app/agent.py\r\nfrom pydantic import BaseModel, Field\r\nfrom typing import Literal\r\n\r\nclass JudgeFeedback(BaseModel):\r\n    status: Literal[&amp;quot;pass&amp;quot;, &amp;quot;fail&amp;quot;] = Field(\r\n        description=&amp;quot;Whether the research is sufficient (\&amp;#x27;pass\&amp;#x27;) or needs more work (\&amp;#x27;fail\&amp;#x27;).&amp;quot;\r\n    )\r\n    feedback: str = Field(\r\n        description=&amp;quot;Detailed feedback on what is missing.&amp;quot;\r\n    )\r\n\r\njudge = Agent(\r\n    name=&amp;quot;judge&amp;quot;,\r\n    model=&amp;quot;gemini-2.5-flash&amp;quot;,\r\n    description=&amp;quot;Evaluates research findings.&amp;quot;,\r\n    instruction=&amp;quot;&amp;quot;&amp;quot;\r\n    You are a strict editor. Evaluate the findings.\r\n    If they are missing key info, output status=\&amp;#x27;fail\&amp;#x27; and provide feedback.\r\n    &amp;quot;&amp;quot;&amp;quot;,\r\n    output_schema=JudgeFeedback, # Enforce the contract!\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726c776a90&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, when the judge speaks, it speaks JSON. Your application logic can trust it.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 3: The Universal Language (A2A Protocol)&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-a2a-protoco.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-a2a-protocol"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Here's the magic. We wrap these agents as web services using the &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A2A Protocol&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. Think of it as a universal language for agents. It lets them describe what they do (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;agent.json&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;) and talk over standard HTTP.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# researcher/app/server.py\r\nfrom fastapi import FastAPI\r\nfrom a2a.server.apps import A2AFastAPIApplication\r\nfrom app.agent import app as adk_app\r\n\r\n# ... setup runner ...\r\n\r\n# Create the A2A App wrapper\r\na2a_app = A2AFastAPIApplication(agent_card=agent_card, http_handler=request_handler)\r\n\r\napp = FastAPI(lifespan=lifespan)\r\n\r\n# Register routes: /.well-known/agent.json and /rpc\r\na2a_app.add_routes_to_app(app)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726c776760&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, your researcher is a microservice running on port 8000. It's ready to be called by anyone—including your orchestrator.&lt;/span&gt;&lt;/p&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Step 4: The Orchestrator Pattern&lt;/span&gt;&lt;/h3&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-orchestrato.max-1000x1000.png"
        
          alt="building-distributed-ai-agents-orchestrator"&gt;
        
        &lt;/a&gt;
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is where it all comes together. The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;orchestrator&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is the general contractor. It doesn't do the research; it hires the researcher. It doesn't make judgments; it asks the judge.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Crucially, &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;this is the only agent your frontend needs to know about&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;# orchestrator/app/agent.py\r\nfrom google.adk.agents import LoopAgent, SequentialAgent\r\nfrom google.adk.agents.remote_a2a_agent import RemoteA2aAgent\r\n\r\n# Connect to the remote Researcher service\r\nresearcher = RemoteA2aAgent(\r\n    name=&amp;quot;researcher&amp;quot;,\r\n    agent_card=&amp;quot;http://researcher-service:8000/.well-known/agent.json&amp;quot;,\r\n    description=&amp;quot;Gathers information on a topic.&amp;quot;\r\n)\r\n\r\n# Connect to the remote Judge service\r\njudge = RemoteA2aAgent(\r\n    name=&amp;quot;judge&amp;quot;,\r\n    agent_card=&amp;quot;http://judge-service:8000/.well-known/agent.json&amp;quot;,\r\n    description=&amp;quot;Evaluates research findings.&amp;quot;\r\n)\r\n\r\n# The Orchestrator manages the loop\r\nresearch_loop = LoopAgent(\r\n    name=&amp;quot;research_loop&amp;quot;,\r\n    sub_agents=[researcher, judge, escalation_checker],\r\n    max_iterations=3,\r\n)\r\n\r\n# The full pipeline\r\nroot_agent = SequentialAgent(\r\n    name=&amp;quot;course_creation_pipeline&amp;quot;,\r\n    sub_agents=[research_loop, content_builder],\r\n)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726c776dc0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The orchestrator handles the complexity—retries, loops, state management—so your frontend stays clean and simple.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Deployment: The "Grocery Store" Model&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Deploying this system on Cloud Run gives you what I call the "grocery store" model. If the checkout lines (researcher tasks) get long, you don't build a new store. You just open more registers. Cloud Run scales your researcher service independently to handle the load, while your judge service stays lean.&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Caveats &amp;amp; Security Considerations&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Of course, with great power comes great responsibility (and security reviews).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Authentication&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: In this demo, agents talk over open HTTP. In production, you &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;must&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; lock this down. Use mTLS, OIDC, or API keys to ensure that only your orchestrator can talk to your researcher.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Latency&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Every hop adds time. Use this pattern for coarse-grained tasks (like "research this topic") rather than chatty, low-level interactions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Error handling&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Networks fail. Your orchestrator needs to be robust enough to handle timeouts and retries gracefully.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
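The error-handling caveat above is worth a concrete shape. Below is a minimal retry-with-backoff helper of the kind the orchestrator needs around each downstream agent call; the delay values are illustrative, and the `sleep` function is injectable so the logic can be exercised without waiting.

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying transient timeouts with exponential backoff."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except TimeoutError as err:        # only retry transient failures
            last_err = err
            if i != attempts - 1:
                sleep(base_delay * (2 ** i))   # 0.5s, 1s, 2s, ...
    raise last_err

# A flaky downstream call that succeeds on the third try:
state = {"n": 0}
def flaky():
    state["n"] += 1
    if state["n"] >= 3:
        return "ok"
    raise TimeoutError("agent timed out")

print(call_with_retries(flaky, sleep=lambda s: None))  # ok
```

Catching only `TimeoutError` (rather than all exceptions) matters: retrying a request that fails deterministically, such as a malformed payload, just multiplies the damage.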
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Ready to Build?&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Stop trying to build one giant agent that does it all. By using the &lt;/span&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;orchestrator pattern&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; and distributed microservices, you can build AI systems that are scalable, maintainable, and—best of all—play nicely with the apps that you already have.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Want to see the code? Check out the full &lt;/span&gt;&lt;a href="https://github.com/amitkmaraj/course-creation-ai-agent-architecture" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Course Creator Agent Architecture on GitHub&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;And if you're ready to deploy, get started with &lt;/span&gt;&lt;a href="https://cloud.google.com/run"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;ADK&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, and &lt;/span&gt;&lt;a href="https://a2a-protocol.org" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;A2A&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to bring your agent team to life.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 18 Mar 2026 19:00:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-hero.max-600x600.jpg" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Building Distributed AI Agents</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/building-distributed-ai-agents-hero.max-600x600.jpg</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/building-distributed-ai-agents/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Amit Maraj</name><title>AI Developer Relations Engineer</title><department></department><company></company></author></item><item><title>Create Expert Content: Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud 
Run</title><link>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</link><description>&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;My team’s mission is to accelerate the developer journey from writing code to running secure AI workloads on Google Cloud. To help developers succeed, we focus on identifying their most pressing questions and building demos that provide straightforward, easy-to-implement solutions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Recently, I was struck with inspiration when the new &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; was released. It led me to build &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;—a multi-agent system designed with &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Agent Development Kit (ADK)&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;—to identify technical questions from Reddit, research them using official documentation, and draft detailed technical blogs. 
&lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; also provides custom visuals using &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/products/nano-banana-pro/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Nano Banana Pro&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. I even integrated a long-term &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;memory&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; layer so the agent remembers my specific preferences and blogging style.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;By connecting my coding assistant, &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/gemini/docs/codeassist/gemini-cli?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, to the developer knowledge MCP server, I built and deployed this entire system to &lt;/span&gt;&lt;a href="https://cloud.google.com/run/docs?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Run&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; in just two days.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Whether you want to learn how to architect a complex multi-agent system with long term memory, leverage local and remote MCP servers for tool standardization, or write detailed Terraform scripts for secure Cloud Run deployment, I'll show you how!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;If you’d rather dive straight into the code and explore it at your own pace, you can clone the repository &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-video"&gt;



&lt;div class="article-module article-video "&gt;
  &lt;figure&gt;
    &lt;a class="h-c-video h-c-video--marquee"
      href="https://youtube.com/watch?v=abZxJiXGrJs"
      data-glue-modal-trigger="uni-modal-abZxJiXGrJs-"
      data-glue-modal-disabled-on-mobile="true"&gt;

      
        &lt;img src="//img.youtube.com/vi/abZxJiXGrJs/maxresdefault.jpg"
             alt="A YouTube video that walks through a demo to set up the Dev Signal system"/&gt;
      
      &lt;svg role="img" class="h-c-video__play h-c-icon h-c-icon--color-white"&gt;
        &lt;use xlink:href="#mi-youtube-icon"&gt;&lt;/use&gt;
      &lt;/svg&gt;
    &lt;/a&gt;

    
  &lt;/figure&gt;
&lt;/div&gt;

&lt;div class="h-c-modal--video"
     data-glue-modal="uni-modal-abZxJiXGrJs-"
     data-glue-modal-close-label="Close Dialog"&gt;
   &lt;a class="glue-yt-video"
      data-glue-yt-video-autoplay="true"
      data-glue-yt-video-height="99%"
      data-glue-yt-video-vid="abZxJiXGrJs"
      data-glue-yt-video-width="100%"
      href="https://youtube.com/watch?v=abZxJiXGrJs"
      ng-cloak&gt;
   &lt;/a&gt;
&lt;/div&gt;

&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;What you'll learn&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;&lt;span style="vertical-align: baseline;"&gt;In this four-part blog series, I’ll walk you through the step-by-step process of how I brought this project to life. &lt;/span&gt;&lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;Each blog post captures the journey of building and deploying &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 1: Tools for building agent capabilities (this blog post) &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– You’ll begin by setting up your project environment and equipping your agent with tools using the Model Context Protocol (MCP). You’ll learn how to connect to Reddit for trend discovery, Google Cloud docs for technical grounding, and a custom Nano Banana Pro tool for image generation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run"&gt;Part 2: The Multi-Agent Architecture with Long-term Memory&lt;/a&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– You’ll build the "brain" of the system by implementing a root orchestrator and a team of specialized agents. You’ll also integrate the &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/agent-builder/agent-engine/memory-bank/overview?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, enabling the agent to learn and persist your preferences across sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;&lt;strong style="vertical-align: baseline;"&gt;Part 3: Testing the agent Locally&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; – Before moving to the cloud, you’ll synchronize the agent's components and verify its performance on your workstation. You’ll use a dedicated test runner to simulate the full lifecycle of discovery, research, and multimodal creation, &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt;with a special focus on validating long-term memory persistence by connecting your local agent directly to the cloud-based Vertex AI memory bank.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-deploying-a-multi-agent-system-with-terraform-and-cloud-run"&gt;Part 4: Deployment to Cloud Run and the Path to Production&lt;/a&gt; &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;– Finally, you’ll deploy your service on Google Cloud Run using Terraform for reproducible infrastructure. You’ll also discuss the next steps required for a high quality secure production system.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Getting started with Dev Signal&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is an intelligent monitoring agent designed to filter noise and create value. &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; operates in the following ways:&lt;/span&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Discovery&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Scouts Reddit for high-engagement technical questions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Grounding&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Researches answers using official Google Cloud documentation to ensure accuracy.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Creation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Drafts professional technical blog posts based on its findings.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Multimodal Generation&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Generates custom infographic headers for those posts.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: decimal; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Long-Term Memory&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Uses &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Vertex AI memory bank&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; to remember your feedback across different sessions.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
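Conceptually, the first three stages above form a sequential pipeline in which each stage consumes the previous stage's output. A toy sketch of that flow (the stage functions and data shapes here are stand-ins, not Dev Signal's actual code):

```python
def discover():
    """Stand-in for the Reddit discovery step."""
    return [{"title": "How do I secure Cloud Run?", "score": 412}]

def ground(questions):
    """Stand-in for research grounded in official documentation."""
    return [{**q, "sources": ["cloud.google.com/run/docs"]} for q in questions]

def create(researched):
    """Stand-in for drafting a blog post from the grounded research."""
    return [{"draft": f"Answering: {r['title']}", "sources": r["sources"]}
            for r in researched]

def run_pipeline():
    # Chain the stages the way an orchestrator sequences its sub-agents.
    return create(ground(discover()))

posts = run_pipeline()
```

In the real system each stage is a specialized agent rather than a plain function, but the data-flow shape is the same: discovery feeds grounding, and grounding feeds creation.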
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Prerequisites&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Before you begin, verify the following is installed: &lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Python 3.12+&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;uv&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (Python package manager): &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;curl -LsSf https://astral.sh/uv/install.sh | sh&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://cloud.google.com/sdk/docs/install?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud SDK&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gcloud&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; CLI) installed and authenticated.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://developer.hashicorp.com/terraform/install" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Terraform&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (for infrastructure as code).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.npmjs.com/downloading-and-installing-node-js-and-npm" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Node.js &amp;amp; npm&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; (required for the Reddit MCP tool).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;You will also need:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;span style="vertical-align: baseline;"&gt;A &lt;/span&gt;&lt;a href="https://docs.cloud.google.com/resource-manager/docs/creating-managing-projects?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Google Cloud Project&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; with billing enabled.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://docs.cloud.google.com/endpoints/docs/openapi/enable-api?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;&lt;strong&gt;APIs Enabled&lt;/strong&gt;&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: Vertex AI, Cloud Run, Secret Manager, Artifact Registry.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reddit API Credentials&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (Client ID, Secret) - You can get these from the &lt;/span&gt;&lt;a href="https://www.reddit.com/prefs/apps" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reddit Developer Portal&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Developer Knowledge API Key&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; (for Google Cloud docs search) - Instructions on how to get it are &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. &lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
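Since these credentials are read from the environment, it helps to fail fast when one is missing. A small sketch: the Reddit variable names match those used in the agent's Reddit tool configuration, while `DEVELOPER_KNOWLEDGE_API_KEY` is an illustrative placeholder (check the repository for the exact name it expects):

```python
import os

# The Reddit names match the Reddit MCP tool's expected environment variables;
# DEVELOPER_KNOWLEDGE_API_KEY is an illustrative placeholder.
REQUIRED_VARS = [
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
    "DEVELOPER_KNOWLEDGE_API_KEY",
]

def missing_credentials(environ=os.environ):
    """Return the required credential names absent (or empty) in environ."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Demo against a deliberately incomplete environment:
missing = missing_credentials({"REDDIT_CLIENT_ID": "abc123"})
```

Running a check like this at startup turns a confusing mid-run tool failure into an immediate, actionable error message.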
&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Project Setup&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Dev Signal&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; system was built by first running the&lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/agent-starter-pack" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt; Agent Starter Pack,&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; following the automated architect workflow described in the &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=XCGbDx7aSks" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;Agent Factory episode&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Remigiusz Samborski&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/vkolesnikov/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Vlad Kolesnikov&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;. This foundation provided the project’s modular directory structure, which is used to separate concerns between Agent Logic, Server Code, Utilities, and Tools.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The starter pack acts as a powerful starting point because it automates the creation of professional infrastructure, CI/CD pipelines, and observability tools in seconds. This allows you to focus entirely on the agent’s unique intelligence while ensuring the underlying platform remains secure and scalable. By building on top of this generated boilerplate with AI assistance from &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://antigravity.google/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Antigravity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;, the development process is highly accelerated. &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The agent starter pack high level architecture:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-image_full_width"&gt;






  
    &lt;div class="article-module h-c-page"&gt;
      &lt;div class="h-c-grid"&gt;
  

    &lt;figure class="article-image--large
      
      
        h-c-grid__col
        h-c-grid__col--6 h-c-grid__col--offset-3
        
        
      "
      &gt;

      
      
        
        &lt;img
            src="https://storage.googleapis.com/gweb-cloudblog-publish/images/agentstarterpack.max-1000x1000.png"
        
          alt="agentstarterpack"&gt;
        
      
    &lt;/figure&gt;

  
      &lt;/div&gt;
    &lt;/div&gt;
  




&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;1. Initialize the Project&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for your project and initialize it. We'll use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, which is an extremely fast Python package manager.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;uv init dev-signal&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51940&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;2. Folder Structure&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our project will follow this structure. We will populate these files step-by-step.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;dev-signal/\r\n├── dev_signal_agent/\r\n│   ├── __init__.py\r\n│   ├── agent.py           # Agent logic &amp;amp; orchestration\r\n│   ├── fast_api_app.py    # Application server &amp;amp; memory connection\r\n│   ├── app_utils/         # Env Config\r\n│   │   └── env.py\r\n│   └── tools/             # External capabilities\r\n│       ├── __init__.py\r\n│       ├── mcp_config.py  # Tool configuration (Reddit, Docs)\r\n│       └── nano_banana_mcp/# Custom local image generation tool\r\n│           ├── __init__.py\r\n│           ├── main.py\r\n│           ├── nano_banana_pro.py\r\n│           ├── media_models.py\r\n│           ├── storage_utils.py\r\n│           └── requirements.txt\r\n├── deployment/\r\n│   └── terraform/         # Infrastructure as Code\r\n├── .env                   # Local secrets (API keys)\r\n├── Makefile               # Shortcuts for building/deploying\r\n├── Dockerfile             # Container definition\r\n└── pyproject.toml         # Dependencies&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51760&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;3. Define Dependencies&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Update your &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;pyproject.toml&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; with the necessary dependencies. We use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google-adk&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the agent framework and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google-genai&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for the model interaction.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;[project]\r\nname = &amp;quot;dev-signal&amp;quot;\r\nversion = &amp;quot;0.1.0&amp;quot;\r\ndescription = &amp;quot;A multi-agent system for monitoring and content creation.&amp;quot;\r\nreadme = &amp;quot;README.md&amp;quot;\r\nrequires-python = &amp;quot;&amp;gt;=3.12, &amp;lt;3.14&amp;quot;\r\ndependencies = [\r\n     &amp;quot;google-adk&amp;gt;=0.1.0&amp;quot;,\r\n    \xa0&amp;quot;google-genai&amp;gt;=1.0.0&amp;quot;,\r\n     &amp;quot;mcp&amp;gt;=1.0.0&amp;quot;,\r\n    \xa0&amp;quot;python-dotenv&amp;gt;=1.0.0&amp;quot;,\r\n     &amp;quot;fastapi&amp;gt;=0.110.0&amp;quot;,\r\n     &amp;quot;uvicorn&amp;gt;=0.29.0&amp;quot;,\r\n     &amp;quot;google-cloud-logging&amp;gt;=3.0.0&amp;quot;,\r\n     &amp;quot;google-cloud-aiplatform&amp;gt;=1.38.0&amp;quot;,\r\n    \xa0&amp;quot;fastmcp&amp;gt;=2.13.0&amp;quot;,\r\n     &amp;quot;google-cloud-storage&amp;gt;=3.6.0&amp;quot;,\r\n     &amp;quot;google-auth&amp;gt;=2.0.0&amp;quot;,\r\n     &amp;quot;google-cloud-secret-manager&amp;gt;=2.26.0&amp;quot;,\r\n]&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51040&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Run &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv sync&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to install everything.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Create a new directory for the agent code.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir dev_signal_agent\r\ncd dev_signal_agent&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51bb0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Building the agent capabilities: MCP tools &lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Our agent needs to interact with the outside world. We use the &lt;/span&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to standardize this. The &lt;/span&gt;&lt;strong style="vertical-align: baseline;"&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; is a universal standard for connecting AI agents to external data and tools. Instead of writing custom API wrappers, we use standard MCP servers. This allows us to connect to APIs (Reddit), Knowledge Bases (Google Cloud Docs), and even local scripts (Image Generation using Nano Banana Pro) using a common interface. Create a new directory for the agent tools.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir tools\r\ncd tools&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51d60&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Tools Configuration&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;We'll define our toolsets in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file defines the connection parameters for our three main tools.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Reddit&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a local stdio subprocess.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Developer Knowledge&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a remote HTTP endpoint.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Nano Banana&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: Connected via a local stdio subprocess (our custom Python script).&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Reddit Search (Discovery Tool)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://github.com/Arindam200/reddit-mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Reddit MCP server &lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;acts as a bridge to the Reddit API, allowing your agent to discover trending posts and analyze engagement without you having to write complex API wrappers. To ensure portability, the code uses a "find or fetch" strategy: it first checks for a local installation and, if missing, automatically uses &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;npx&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to download and run the server on demand.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Instead of a network connection, the agent launches the server as a local subprocess and communicates via standard input and output (stdio). Within the Google ADK, the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; class acts as a universal wrapper that standardizes these connections, enabling your agent to interact with various tools, from community resources to custom scripts like the Nano Banana image generator, using a common interface. Because API credentials are passed through the subprocess environment rather than hard-coded, secrets stay out of your source while these plug-and-play modules bridge the agent and external platforms.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import os\r\nimport shutil\r\nfrom mcp import StdioServerParameters\r\nfrom google.adk.tools import McpToolset\r\nfrom google.adk.tools.mcp_tool import StreamableHTTPConnectionParams, StdioConnectionParams\r\n\r\ndef get_reddit_mcp_toolset(client_id: str = &amp;quot;&amp;quot;, client_secret: str = &amp;quot;&amp;quot;, user_agent: str = &amp;quot;&amp;quot;):\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to the Reddit MCP server.\r\n    This server runs as a local subprocess (stdio) and proxies requests to the Reddit API.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    # Check if \&amp;#x27;reddit-mcp\&amp;#x27; is installed globally, otherwise use npx to run it\r\n    cmd = &amp;quot;reddit-mcp&amp;quot; if shutil.which(&amp;quot;reddit-mcp&amp;quot;) else &amp;quot;npx&amp;quot;\r\n    args = [] if shutil.which(&amp;quot;reddit-mcp&amp;quot;) else [&amp;quot;-y&amp;quot;, &amp;quot;--quiet&amp;quot;, &amp;quot;reddit-mcp&amp;quot;]\r\n    \r\n    # Inject secrets into the environment of the subprocess only\r\n    env = {\r\n        **os.environ, \r\n        &amp;quot;DOTENV_CONFIG_SILENT&amp;quot;: &amp;quot;true&amp;quot;, \r\n        &amp;quot;LANG&amp;quot;: &amp;quot;en_US.UTF-8&amp;quot;\r\n    }\r\n\r\n    if client_id: env[&amp;quot;REDDIT_CLIENT_ID&amp;quot;] = client_id\r\n    if client_secret: env[&amp;quot;REDDIT_CLIENT_SECRET&amp;quot;] = client_secret\r\n    if user_agent: env[&amp;quot;REDDIT_USER_AGENT&amp;quot;] = user_agent\r\n\r\n    return McpToolset(\r\n        connection_params=StdioConnectionParams(\r\n            server_params=StdioServerParameters(\r\n                command=cmd, \r\n                args=args, \r\n                env=env # Pass injected secrets directly to the subprocess\r\n            ),\r\n            timeout=120.0\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), 
(&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51b20&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
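The "find or fetch" launch decision above can be factored into a pure function, which makes the fallback behavior easy to see and test in isolation. This is a minimal sketch; the `resolve_launch` helper name is ours, not part of the ADK or the Reddit MCP server.

```python
import shutil
from typing import List, Optional, Tuple

# Pure-function sketch of the "find or fetch" strategy: prefer a globally
# installed binary, otherwise fall back to npx so the package is downloaded
# and run on demand. (resolve_launch is an illustrative helper name.)
def resolve_launch(package: str, installed_path: Optional[str]) -> Tuple[str, List[str]]:
    """Return the (command, args) pair for launching an npm-distributed MCP server."""
    if installed_path:
        return package, []
    return "npx", ["-y", "--quiet", package]

# shutil.which returns the binary's path when it is on PATH, else None.
cmd, args = resolve_launch("reddit-mcp", shutil.which("reddit-mcp"))
```

Keeping the decision in one place means the same resolver can serve any npm-packaged MCP server, not just Reddit's.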
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Google Cloud Docs (Knowledge Tool)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; provides grounding for your agent by allowing it to search the entire corpus of official Google Cloud documentation. Unlike the local Reddit server, this is a managed service hosted by Google and accessed as a remote endpoint over the internet. It exposes specialized tools like &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google_developer_documentation_search&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for semantic queries and &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;google_developer_documentation_fetch&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to retrieve full markdown content, ensuring that every technical claim the agent makes is supported by definitive, up-to-date facts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong style="vertical-align: baseline;"&gt;Note:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; You can also connect coding assistant tools such as &lt;/span&gt;&lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemini-cli-open-source-ai-agent/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Gemini CLI&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; or &lt;/span&gt;&lt;a href="https://antigravity.google/?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Antigravity&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to the Developer Knowledge MCP server to give them handy, up-to-date Google Cloud documentation. I used it when writing this blog!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To connect, the agent uses the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; class with &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;StreamableHTTPConnectionParams&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, pointing to a web URL instead of launching a local process. It securely authenticates using a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;DK_API_KEY&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (&lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;create your API key&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;) passed in the request headers, allowing the agent to perform a "comprehensive research sweep" across official docs, community sentiment, and broader web context through a single standardized interface.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def get_dk_mcp_toolset(api_key: str = &amp;quot;&amp;quot;):\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to Developer Knowledge (Google Cloud Docs).\r\n    This is a remote MCP server accessed via HTTP.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    headers = {}\r\n    if api_key:\r\n        headers[&amp;quot;X-Goog-Api-Key&amp;quot;] = api_key\r\n    else:\r\n        # Fallback to os.environ for local testing if not passed via API\r\n        headers[&amp;quot;X-Goog-Api-Key&amp;quot;] = os.getenv(&amp;quot;DK_API_KEY&amp;quot;, &amp;quot;&amp;quot;)\r\n\r\n    return McpToolset(\r\n        connection_params=StreamableHTTPConnectionParams(\r\n            url=&amp;quot;https://developerknowledge.googleapis.com/mcp&amp;quot;,\r\n            headers=headers\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51610&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
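Under the hood, the `McpToolset` speaks the MCP protocol for you: a tool invocation travels as a JSON-RPC 2.0 `tools/call` request. As a hedged sketch of what that wire message looks like (the query string below is illustrative, and you never build these messages by hand when using the ADK):

```python
import json

# Build an MCP "tools/call" JSON-RPC request, as an MCP client would send
# it over the streamable HTTP transport. The tool name comes from the
# server's advertised tool list; the query is a made-up example.
def build_tool_call(tool_name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request for an MCP server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

payload = build_tool_call(
    "google_developer_documentation_search",
    {"query": "Cloud Run concurrency settings"},
)
```

Seeing the raw shape makes it clear why MCP tools are interchangeable: every server, local or remote, answers the same `tools/call` envelope.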
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;The Image Generator (Nano Banana MCP)&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While we've used external MCP servers for Reddit and documentation, we can also build our own custom MCP server to wrap specific Python logic. In this case, we are creating an image generation tool powered by Gemini 3 Pro Image (also known as Nano Banana Pro). This demonstrates that any Python function can be standardized into a tool that any agent can understand.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;How the image generation works:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;a href="https://gofastmcp.com/getting-started/welcome" rel="noopener" target="_blank"&gt;&lt;strong style="text-decoration: underline; vertical-align: baseline;"&gt;FastMCP&lt;/strong&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;: We use the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;fastmcp&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; library to drastically simplify server creation, allowing us to register Python functions as tools with just a few lines of code.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Gemini Integration&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The server uses the Google GenAI SDK to call the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gemini-3-pro-image-preview&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; model, which converts the agent's descriptive prompts into raw image bytes.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GCS Upload &amp;amp; Hosting:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Because agent interfaces typically require a URL to display images, the server automatically uploads the generated bytes to Google Cloud Storage (GCS) and returns a public link.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To connect this local tool, we use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;StdioConnectionParams&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; because the server runs as a local subprocess communicating via standard input and output. This transport method directly matches the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;transport="stdio"&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; configuration we will define in our server entrypoint, ensuring a seamless connection for your custom local scripts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;The following code defines the MCP connection in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. We use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv run&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to ensure the server starts in an isolated environment with all its dependencies correctly installed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/mcp_config.py:&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;def get_nano_banana_mcp_toolset():\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    Connects to our local \&amp;#x27;Nano Banana\&amp;#x27; image generator.\r\n    This demonstrates how to wrap a local Python script as an MCP tool.\r\n    &amp;quot;&amp;quot;&amp;quot;\r\n    path = os.path.join(&amp;quot;dev_signal_agent&amp;quot;, &amp;quot;tools&amp;quot;, &amp;quot;nano_banana_mcp&amp;quot;, &amp;quot;main.py&amp;quot;)\r\n    bucket = os.getenv(&amp;quot;AI_ASSETS_BUCKET&amp;quot;)     \r\n    return McpToolset(\r\n        connection_params=StdioConnectionParams(\r\n            server_params=StdioServerParameters(\r\n                command=&amp;quot;uv&amp;quot;, \r\n                args=[&amp;quot;run&amp;quot;, path], \r\n                env={**os.environ, &amp;quot;AI_ASSETS_BUCKET&amp;quot;: bucket}\r\n            ),\r\n            timeout=600.0 # Image generation can take time\r\n        )\r\n    )&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa516d0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h3&gt;&lt;span style="vertical-align: baseline;"&gt;Implementing the Nano Banana Pro Server Logic&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Now, we will implement the actual logic for this server. This implementation is based on the &lt;/span&gt;&lt;a href="https://www.youtube.com/watch?v=XCGbDx7aSks&amp;amp;list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs&amp;amp;index=2" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Agent Factory&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; demo &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/a9a5f64a3394a4b5ecc64061f397bd5ed82927ee/ai-ml/agent-factory-antigravity-nano-banana-pro/mcp" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;code&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; by Remigiusz Samborski. While Remi's original code provides instructions for deploying the MCP server to Cloud Run, we will run it here as a local subprocess for faster development and testing.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To get started, create the directory for our new server:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;mkdir -p dev_signal_agent/tools/nano_banana_mcp\r\ncd dev_signal_agent/tools/nano_banana_mcp&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa513a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;The Server Entrypoint (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;main.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; )&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file acts as the "brain" that initializes and starts the MCP server.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;FastMCP Initialization: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;We use the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;FastMCP&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; library to create a server named "MediaGenerators" and register our &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;generate_image&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function as a tool&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Safe Logging: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;_initialize_console_logging&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; function is critical. It forces all logs to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sys.stderr&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;. This is because the MCP "stdio" transport uses &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;sys.stdout&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for communication between the agent and the tool; standard logs sent to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;stdout&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; would corrupt that protocol.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Execution&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;mcp.run(transport="stdio")&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; line starts the server as a local subprocess, allowing it to listen for requests from your agent via standard input.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/main.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import logging\r\nimport os\r\nimport sys\r\nfrom fastmcp import FastMCP\r\nfrom dotenv import load_dotenv\r\nfrom nano_banana_pro import generate_image\r\n\r\ndef _initialize_console_logging(min_level: int = logging.INFO):\r\n    # Ensure logs go to STDERR so they don\&amp;#x27;t break the MCP stdio protocol\r\n    handler = logging.StreamHandler(sys.stderr)\r\n    logging.basicConfig(level=min_level, handlers=[handler], force=True)\r\n\r\ntools = [generate_image]\r\nmcp = FastMCP(name=&amp;quot;MediaGenerators&amp;quot;, tools=tools)\r\n\r\nif __name__ == &amp;quot;__main__&amp;quot;:\r\n    load_dotenv()\r\n    _initialize_console_logging()\r\n    mcp.run(transport=&amp;quot;stdio&amp;quot;)&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa518b0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
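The stderr-only logging rule is easy to verify for yourself. This minimal sketch redirects stdout, logs a message, and confirms that only protocol traffic lands on stdout (the JSON string printed here is an illustrative stand-in for an MCP message):

```python
import io
import logging
import sys
from contextlib import redirect_stdout

# Mirror the server's safe-logging setup: every log record goes to
# sys.stderr, leaving sys.stdout free for the MCP stdio protocol.
handler = logging.StreamHandler(sys.stderr)
logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)

buf = io.StringIO()
with redirect_stdout(buf):
    logging.info("tool started")      # emitted on stderr, not captured here
    print('{"jsonrpc": "2.0"}')       # stand-in for protocol traffic on stdout

# Only the protocol line reached stdout; the log line did not corrupt it.
captured = buf.getvalue().strip()
```

If a dependency ever prints to stdout anyway (as some dotenv banners do), the subprocess env vars like `DOTENV_CONFIG_SILENT` in the toolset config are there to suppress it.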
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;The Generation Logic (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;nano_banana_pro.py)&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This is where the actual image generation happens using Gemini.&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;GenAI Client:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We initialize the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;genai.Client()&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to interact with Google's generative models.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Model Selection:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; It specifically targets the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gemini-3-pro-image-preview&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; model. We set the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;response_modalities&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to "IMAGE" to tell the model we want pixels, not just text.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Robustness&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: The code includes a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;MAX_RETRIES&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; loop (set to 5) to handle any transient generation errors, ensuring the agent has multiple attempts to get a valid image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Byte Processing: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;Once the model generates the image, it arrives as raw inline data. We extract these bytes and call our helper to move them to the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;URI Conversion:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Finally, it replaces the internal &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;gs://&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; path with a browser-accessible &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;https://&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; URL so the user can actually see the image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/nano_banana_pro.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import logging\r\nfrom typing import Literal, Optional\r\nfrom google import genai\r\nfrom google.genai import types\r\nfrom media_models import MediaAsset\r\nfrom storage_utils import upload_data_to_gcs\r\n\r\nAUTHORIZED_URI = &amp;quot;https://storage.mtls.cloud.google.com/&amp;quot;\r\nMAX_RETRIES = 5\r\n\r\nasync def generate_image(\r\n    prompt: str,\r\n    aspect_ratio: Literal[&amp;quot;16:9&amp;quot;, &amp;quot;9:16&amp;quot;] = &amp;quot;16:9&amp;quot;,\r\n) -&amp;gt; MediaAsset:\r\n    &amp;quot;&amp;quot;&amp;quot;Generates an image using Gemini 3 Image model.&amp;quot;&amp;quot;&amp;quot;\r\n    genai_client = genai.Client()\r\n    content = types.Content(parts=[types.Part.from_text(text=prompt)], role=&amp;quot;user&amp;quot;)\r\n    \r\n    logging.info(f&amp;quot;Starting image generation for prompt: {prompt[:50]}...&amp;quot;)\r\n    asset = MediaAsset(uri=&amp;quot;&amp;quot;)\r\n    \r\n    for _ in range(MAX_RETRIES):\r\n        response = genai_client.models.generate_content(\r\n            model=&amp;quot;gemini-3-pro-image-preview&amp;quot;,\r\n            contents=[content],\r\n            config=types.GenerateContentConfig(\r\n                response_modalities=[&amp;quot;IMAGE&amp;quot;],\r\n                image_config=types.ImageConfig(aspect_ratio=aspect_ratio)\r\n            )\r\n        )\r\n        if response and response.parts:\r\n            for part in response.parts:\r\n                if part.inline_data and part.inline_data.data:\r\n                    # Upload the raw bytes to GCS\r\n                    gcs_uri = await upload_data_to_gcs(\r\n                        &amp;quot;mcp-tools&amp;quot;,\r\n                        part.inline_data.data,\r\n                        part.inline_data.mime_type\r\n                    )\r\n                    asset = MediaAsset(uri=gcs_uri)\r\n                    break\r\n        if asset.uri: break\r\n\r\n 
   if not asset.uri:\r\n        asset.error = &amp;quot;No image was generated.&amp;quot;\r\n    else:\r\n        # Convert gs:// URI to an HTTP accessible URL if needed\r\n        asset.uri = asset.uri.replace(\&amp;#x27;gs://\&amp;#x27;, AUTHORIZED_URI)\r\n        logging.info(f&amp;quot;Image URL: {asset.uri}&amp;quot;)\r\n        \r\n    return asset&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa516a0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
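The final URI rewrite is just a scheme swap, which this pure-function sketch isolates (the bucket and object names below are illustrative, not from the tutorial project):

```python
# Sketch of the URI conversion performed at the end of generate_image:
# replace the gs:// scheme with a browser-reachable HTTPS prefix so the
# agent can hand the user a clickable link.
AUTHORIZED_URI = "https://storage.mtls.cloud.google.com/"

def to_public_url(gcs_uri: str) -> str:
    """Rewrite a gs://bucket/object URI into an HTTPS URL."""
    return gcs_uri.replace("gs://", AUTHORIZED_URI)

url = to_public_url("gs://my-bucket/assets/mcp-tools/abc123.png")
```

Note this relies on the bucket allowing authenticated browser access via the `storage.mtls.cloud.google.com` endpoint; a fully public bucket would use `storage.googleapis.com` instead.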
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;GCS Upload Helper (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;storage_utils.py)&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Since agents need a web link to display images, this utility handles the hosting on Google Cloud Storage (GCS).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Dynamic Bucket Selection&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;: It looks for a bucket name in your environment variables, falling back from &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;AI_ASSETS_BUCKET&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;LOGS_BUCKET_NAME&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to ensure it always has a place to save data.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Unique Filenames:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; We use an MD5 hash of the raw image data to create a unique filename. This prevents filename collisions and acts as a simple way to avoid duplicate uploads of the same image.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Cloud Upload: &lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt;The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;blob.upload_from_string&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; method pushes the raw image bytes directly to your GCS bucket.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/storage_utils.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;import hashlib\r\nimport mimetypes\r\nimport os\r\nfrom google.cloud.storage import Client, Blob\r\nfrom dotenv import load_dotenv\r\n\r\nload_dotenv()\r\nstorage_client = Client()\r\nai_bucket_name = os.environ.get(&amp;quot;AI_ASSETS_BUCKET&amp;quot;) or os.environ.get(&amp;quot;LOGS_BUCKET_NAME&amp;quot;)\r\nai_bucket = storage_client.bucket(ai_bucket_name)\r\n\r\nasync def upload_data_to_gcs(agent_id: str, data: bytes, mime_type: str) -&amp;gt; str:\r\n    file_name = hashlib.md5(data).hexdigest()\r\n    ext = mimetypes.guess_extension(mime_type) or &amp;quot;&amp;quot;\r\n    blob_name = f&amp;quot;assets/{agent_id}/{file_name}{ext}&amp;quot;\r\n    blob = Blob(bucket=ai_bucket, name=blob_name)\r\n    blob.upload_from_string(data, content_type=mime_type, client=storage_client)\r\n    return f&amp;quot;gs://{ai_bucket_name}/{blob_name}&amp;quot;&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51fd0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
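The naming scheme deserves a closer look, because it is what makes uploads idempotent. Here is the content-addressed naming logic on its own, testable without touching GCS (the agent id and bytes below are illustrative):

```python
import hashlib
import mimetypes

# Content-addressed blob naming, as in upload_data_to_gcs above: hashing
# the raw bytes yields a deterministic name, so re-uploading identical
# image data always maps to the same object.
def blob_name_for(agent_id: str, data: bytes, mime_type: str) -> str:
    """Build a deterministic GCS object name from the asset's bytes."""
    ext = mimetypes.guess_extension(mime_type) or ""
    return f"assets/{agent_id}/{hashlib.md5(data).hexdigest()}{ext}"

name = blob_name_for("mcp-tools", b"fake-image-bytes", "image/png")
```

MD5 is fine here because the hash is used only for naming and de-duplication, not for any security guarantee.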
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Data Model (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;media_models.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;)&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file ensures that our data follows a strict structure (Schema).&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;Structured Output:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; By using a Pydantic &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;BaseModel&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, we guarantee that the tool always returns a consistent JSON object containing a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uri&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; (the link) and an optional &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;error&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; message. This makes it much easier for the AI agent to understand and process the tool's result.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/media_models.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;:&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&amp;lt;ListValue: [StructValue([(&amp;#x27;code&amp;#x27;, &amp;#x27;from typing import Optional\r\nfrom pydantic import BaseModel\r\n\r\nclass MediaAsset(BaseModel):\r\n    uri: str\r\n    error: Optional[str] = None&amp;#x27;), (&amp;#x27;language&amp;#x27;, &amp;#x27;lang-py&amp;#x27;), (&amp;#x27;caption&amp;#x27;, &amp;lt;wagtail.rich_text.RichText object at 0x7f726fa51df0&amp;gt;)])]&amp;gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
&lt;div class="block-paragraph_advanced"&gt;&lt;h4&gt;&lt;span style="vertical-align: baseline;"&gt;Tool Dependencies (&lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;requirements.txt)&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;While we use &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; to run our code, a &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;requirements.txt&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; file remains essential: it pins the dependencies that &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; installs into the isolated environment before the Nano Banana server starts.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;This file lists the three core libraries required for this tool:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;google-cloud-storage&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Used for hosting the generated images on the cloud.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;google-genai&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; Provides the logic for the Gemini 3 Pro image generation.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li aria-level="1" style="list-style-type: disc; vertical-align: baseline;"&gt;
&lt;p role="presentation"&gt;&lt;strong style="vertical-align: baseline;"&gt;fastmcp&lt;/strong&gt;&lt;strong style="vertical-align: baseline;"&gt;:&lt;/strong&gt;&lt;span style="vertical-align: baseline;"&gt; The framework that turns our Python script into a standardized MCP tool.&lt;/span&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;Paste this code in &lt;/span&gt;&lt;span style="vertical-align: baseline;"&gt; &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;dev_signal_agent/tools/nano_banana_mcp/&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;requirements&lt;/code&gt;&lt;code style="vertical-align: baseline;"&gt;.txt&lt;/code&gt;&lt;/p&gt;&lt;/div&gt;
&lt;div class="block-code"&gt;&lt;dl&gt;
    &lt;dt&gt;code_block&lt;/dt&gt;
    &lt;dd&gt;&lt;pre&gt;&lt;code&gt;google-cloud-storage==3.6.*
google-genai==1.52.*
fastmcp==2.13.*&lt;/code&gt;&lt;/pre&gt;&lt;/dd&gt;
&lt;/dl&gt;&lt;/div&gt;
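&lt;div class="block-paragraph_advanced"&gt;&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;To illustrate how these "ingredients" are consumed, here is a hedged sketch of launching the MCP server with uv. The &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;server.py&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; filename is a placeholder for your Nano Banana server's actual entry point, and the command assumes a recent uv release that supports the &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;--with-requirements&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; flag.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;

```shell
# From dev_signal_agent/tools/nano_banana_mcp/, ask uv to build an
# isolated environment from requirements.txt and run the server in it.
# "server.py" is a hypothetical entry-point name for illustration.
uv run --with-requirements requirements.txt python server.py
```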
&lt;div class="block-paragraph_advanced"&gt;&lt;h2&gt;&lt;span style="vertical-align: baseline;"&gt;Summary&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;In this first part of our series, we focused on establishing the agent's core capabilities by standardizing its external integrations through the Model Context Protocol (MCP). We initialized the project using &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;uv&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt; for high-speed dependency management and successfully configured three critical toolsets: Reddit for trend discovery, Google Cloud Docs for technical grounding, and a custom Nano Banana MCP server for multimodal image generation. By utilizing the Google ADK’s &lt;/span&gt;&lt;code style="vertical-align: baseline;"&gt;McpToolset&lt;/code&gt;&lt;span style="vertical-align: baseline;"&gt;, we’ve abstracted away complex API logic into simple, plug-and-play modules, ensuring that our tools share a common interface that decouples integration from intelligence.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For a deeper look into our technical foundation, you can explore the &lt;/span&gt;&lt;a href="https://developers.google.com/knowledge/mcp?utm_campaign=CDR_0x91b1edb5_default_b485268863&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Developer Knowledge MCP server&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to learn more about knowledge grounding or visit the &lt;/span&gt;&lt;a href="https://github.com/google/adk-python" rel="noopener" target="_blank"&gt;&lt;span style="vertical-align: baseline;"&gt;Google ADK GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; to explore the framework's core capabilities&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;With our toolset fully configured and ready for action, we can now move to &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/multi-agent-architecture-and-long-term-memory-with-adk-mcp-and-cloud-run"&gt;Part 2&lt;/a&gt;, where we will build the multi-agent architecture and integrate the Vertex AI memory bank to orchestrate these capabilities. You can also jump ahead to &lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/create-expert-content-local-testing-of-a-multi-agent-system-with-memory"&gt;Part 3&lt;/a&gt;, where we will show you how to test the agent locally to verify these components on your workstation. If you’d like to dive ahead, you can explore the complete code for the entire series in our &lt;/span&gt;&lt;a href="https://github.com/GoogleCloudPlatform/devrel-demos/tree/main/ai-ml/dev-signal" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;GitHub repository&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;Special thanks to&lt;/span&gt;&lt;a href="https://www.linkedin.com/in/remigiusz-samborski/" rel="noopener" target="_blank"&gt;&lt;span style="font-style: italic; text-decoration: underline; vertical-align: baseline;"&gt; Remigiusz Samborski &lt;/span&gt;&lt;/a&gt;&lt;span style="font-style: italic; vertical-align: baseline;"&gt;  for the helpful review and feedback on this article.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="vertical-align: baseline;"&gt;For more content like this, follow me on &lt;/span&gt;&lt;a href="https://www.linkedin.com/in/shirmeirlador/" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;Linkedin&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt; and &lt;/span&gt;&lt;a href="https://x.com/shirmeir86?lang=en" rel="noopener" target="_blank"&gt;&lt;span style="text-decoration: underline; vertical-align: baseline;"&gt;X&lt;/span&gt;&lt;/a&gt;&lt;span style="vertical-align: baseline;"&gt;.&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;</description><pubDate>Wed, 18 Mar 2026 09:18:00 +0000</pubDate><guid>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</guid><category>Developers &amp; Practitioners</category><media:content height="540" url="https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png" width="540"></media:content><og xmlns:og="http://ogp.me/ns#"><type>article</type><title>Create Expert Content: Building Capabilities for a Multi-Agent System with Google ADK, MCP, and Cloud Run</title><description></description><image>https://storage.googleapis.com/gweb-cloudblog-publish/images/devsignalheroimage.max-600x600.png</image><site_name>Google</site_name><url>https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1/</url></og><author xmlns:author="http://www.w3.org/2005/Atom"><name>Shir Meir Lador</name><title>Head of AI, Product DevRel</title><department></department><company></company></author></item></channel></rss>