Objectives
Gemma 4 is Google's most efficient open-weight model family, delivering strong reasoning and agentic capabilities. Long context, multimodality, reasoning, and tool calling let Gemma 4 handle complex logic, multi-step planning, coding, and agentic workflows.
This guide shows how to run LLM inference on Cloud Run GPUs with Gemma and Ollama, and has the following objectives:
- Deploy Ollama with the Gemma 4 model on a GPU-enabled Cloud Run service.
- Send prompts to the Ollama service on its private endpoint.
To learn about an alternative way to deploy Gemma 4 open models on Cloud Run using a vLLM container, see Run Gemma 4 models on Cloud Run.
Costs
In this document, you use the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Cloud Run API.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
- Install and initialize the gcloud CLI.
- Request the Total Nvidia RTX Pro 6000 GPU allocation, in milli GPU, without zonal redundancy, per project per region quota under the Cloud Run Admin API on the Quotas and system limits page to complete this tutorial.
Required roles
To get the permissions that you need to complete the tutorial, ask your administrator to grant you the following IAM roles on your project:
- Cloud Run Admin (roles/run.admin)
- Project IAM Admin (roles/resourcemanager.projectIamAdmin)
- Service Account User (roles/iam.serviceAccountUser)
- Service Usage Consumer (roles/serviceusage.serviceUsageConsumer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
Grant the roles
Console
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address that is used to deploy the Cloud Run service.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
gcloud
To grant the required IAM roles to your account on your project:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member=PRINCIPAL \
    --role=ROLE
Replace:
- PROJECT_ID with your Google Cloud project ID.
- PRINCIPAL with the account you are adding the binding for. This is typically the email address that is used to deploy the Cloud Run service.
- ROLE with the role you are granting to the deployer account.
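Because each of the four roles needs its own policy binding, you can grant them in one pass with a small shell loop. A minimal sketch; the project ID and member address below are placeholder values you replace with your own:

```shell
# Grant each required role to the deployer account in turn.
# PROJECT_ID and MEMBER are placeholders; substitute your own values.
PROJECT_ID="example-project-id"
MEMBER="user:deployer@example.com"

for ROLE in \
    roles/run.admin \
    roles/resourcemanager.projectIamAdmin \
    roles/iam.serviceAccountUser \
    roles/serviceusage.serviceUsageConsumer; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="$MEMBER" \
      --role="$ROLE"
done
```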
Deploy the Ollama service for LLM inference
Deploy the service to Cloud Run:
gcloud beta run deploy SERVICE-NAME \
--image "ollama/ollama:latest" \
--project PROJECT_ID \
--region REGION \
--no-allow-unauthenticated \
--cpu 20 \
--memory 80Gi \
--gpu 1 \
--gpu-type nvidia-rtx-pro-6000 \
--no-gpu-zonal-redundancy \
--no-cpu-throttling \
--max-instances 1 \
--concurrency 16 \
--timeout 600 \
--set-env-vars=OLLAMA_NUM_PARALLEL=16 \
--set-env-vars=OLLAMA_HOST=0.0.0.0:8080 \
--set-env-vars=OLLAMA_DEBUG=false \
--set-env-vars=OLLAMA_KEEP_ALIVE=-1 \
--startup-probe tcpSocket.port=8080,initialDelaySeconds=240,failureThreshold=1,timeoutSeconds=240,periodSeconds=240 \
--command "bash" \
--args="-c,(sleep 15 && ollama pull MODEL_NAME) & ollama serve"
Replace:
- SERVICE-NAME with a unique name for the Cloud Run service.
- PROJECT_ID with your Google Cloud project ID.
- REGION with a Google Cloud region where nvidia-rtx-pro-6000 GPUs are supported for Cloud Run, such as us-central1. For a full list of supported regions for GPU-enabled deployments, see GPU configuration.
- MODEL_NAME with the full name of a Gemma 4 variant:
  - Gemma 4 E2B: gemma4:e2b
  - Gemma 4 E4B: gemma4:e4b
Gemma 4 26B and 31B require more advanced Cloud Run and vLLM configuration with Direct VPC Egress and Run:ai Model Streamer.
Note the following important flags in this command:
- --concurrency 16 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
- --gpu 1 with --gpu-type nvidia-rtx-pro-6000 assigns one NVIDIA RTX PRO 6000 Blackwell GPU to every Cloud Run instance in the service.
- --max-instances 1 specifies the maximum number of instances to scale to. It must be equal to or lower than your project's NVIDIA RTX Pro 6000 GPU (Total NVIDIA RTX Pro 6000 GPU allocation, in milli GPU, without zonal redundancy, per project per region) quota.
- --no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run's built-in Identity and Access Management (IAM) authentication for service-to-service communication. Refer to Managing access using IAM.
- --no-cpu-throttling is required for enabling GPU.
- --no-gpu-zonal-redundancy sets zonal redundancy options depending on your zonal failover requirements and available quota. See GPU zonal redundancy options for details.
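After the deploy completes, it can be useful to confirm the service is up and note its URL before sending traffic. A quick check, assuming the same SERVICE-NAME, PROJECT_ID, and REGION placeholders as the deploy command:

```shell
# Print the URL of the deployed service to verify the rollout succeeded.
gcloud run services describe SERVICE-NAME \
  --project PROJECT_ID \
  --region REGION \
  --format "value(status.url)"
```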
Concurrency settings for optimal performance
This section provides context on the recommended concurrency settings. For optimal
request latency, ensure the --concurrency setting is equal to Ollama's
OLLAMA_NUM_PARALLEL environment variable.
- OLLAMA_NUM_PARALLEL determines how many request slots are available per model to handle inference requests concurrently.
- --concurrency determines how many requests Cloud Run sends to an Ollama instance at the same time.
If --concurrency exceeds OLLAMA_NUM_PARALLEL, Cloud Run can send
more requests to a model in Ollama than it has available request slots for.
This leads to request queuing within Ollama, increasing request latency for the
queued requests. It also leads to less responsive auto scaling, as the queued
requests don't trigger Cloud Run to scale out and start new instances.
Ollama also supports serving multiple models from one GPU. To
avoid request queuing on the Ollama instance, set
--concurrency to match OLLAMA_NUM_PARALLEL.
Note that increasing OLLAMA_NUM_PARALLEL also increases the latency of each
parallel request, because concurrent requests share the same GPU.
Optimize GPU utilization
For optimal GPU utilization, increase --concurrency, keeping it within
twice the value of OLLAMA_NUM_PARALLEL. While this leads to request queuing in Ollama, it can help improve utilization: Ollama instances can immediately process requests from their queue, and the queues help absorb traffic spikes.
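For example, to raise --concurrency to twice OLLAMA_NUM_PARALLEL (16 to 32) without redeploying, the existing service can be updated in place. A sketch, assuming the SERVICE-NAME, PROJECT_ID, and REGION placeholders from the deploy command:

```shell
# Raise --concurrency to twice OLLAMA_NUM_PARALLEL (16 -> 32) to trade
# some request queuing in Ollama for higher GPU utilization.
gcloud run services update SERVICE-NAME \
  --project PROJECT_ID \
  --region REGION \
  --concurrency 32
```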
Test the deployed Ollama service with curl
Now that you have deployed the Ollama service, you can send requests to it. However,
if you send a request directly, Cloud Run responds with HTTP 401 Unauthorized.
This is intentional, because an LLM inference API is intended for other services to
call, such as a frontend application. For more information on service-to-service
authentication on Cloud Run, refer to Authenticating service-to-service.
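For a quick authenticated test without setting up another service, you can attach an OIDC identity token minted by gcloud for your own account. A sketch, where SERVICE_URL is a placeholder for your service's run.app URL (shown in the deploy output):

```shell
# Call the private service directly with an identity token in the
# Authorization header. SERVICE_URL is a placeholder value.
SERVICE_URL="https://example-service.run.app"
TOKEN="$(gcloud auth print-identity-token)"

curl -H "Authorization: Bearer ${TOKEN}" \
  "${SERVICE_URL}/api/generate" \
  -d '{"model": "MODEL_NAME", "prompt": "Hello", "stream": false}'
```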
To send requests to the Ollama service, add a header with a valid OIDC token to the requests, for example using the Cloud Run developer proxy:
- Start the proxy, and when prompted to install the cloud-run-proxy component, choose Y:
  gcloud run services proxy SERVICE-NAME \
    --project PROJECT_ID \
    --region REGION \
    --port=9090
- Send a request to it in a separate terminal tab, leaving the proxy running. Note that the proxy runs on localhost:9090:
  curl http://localhost:9090/api/generate -d '{
    "model": "MODEL_NAME",
    "prompt": "Why is the sky blue?",
    "stream": false
  }' | jq -r '.response'
  This command should provide output similar to the following:
This is one of the most beautiful and fundamental questions in physics! The reason the sky appears blue is due to a phenomenon called **Rayleigh Scattering**. ...
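The request above sets "stream": false so that the full answer arrives as a single JSON object. Ollama's generate endpoint also supports token streaming, where each line of output is a JSON object carrying a partial response. A variant of the same call through the proxy:

```shell
# Stream tokens as they are generated: the server returns one JSON
# object per chunk, each with a partial "response" field, and the
# final object has "done": true.
curl http://localhost:9090/api/generate -d '{
  "model": "MODEL_NAME",
  "prompt": "Why is the sky blue?",
  "stream": true
}'
```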
Clean up
To avoid additional charges to your Google Cloud account, delete all the resources you deployed with this tutorial.
Delete the project
If you created a new project for this tutorial, delete the project. If you used an existing project and need to keep it without the changes you added in this tutorial, delete resources that you created for the tutorial.
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete tutorial resources
Delete the Cloud Run service you deployed in this tutorial. Cloud Run services don't incur costs until they receive requests.
To delete your Cloud Run service, run the following command:
gcloud run services delete SERVICE-NAME
Replace SERVICE-NAME with the name of your service.
You can also delete Cloud Run services from the Google Cloud console.
- Remove the gcloud default region configuration you added during tutorial setup:
  gcloud config unset run/region
- Remove the project configuration:
  gcloud config unset project