The google-batch provider is in active development; some features may not yet work reliably while they continue to be improved.
The google-batch provider uses the Google Batch API to queue a request for the following sequence of events:
- Create a Google Compute Engine Virtual Machine (VM) instance.
- Create a Google Compute Engine Persistent Disk and mount it as a "data disk".
- Localize files from Google Cloud Storage to the data disk.
- Run your `--script` or `--command` in your Docker container.
- Delocalize files from the data disk to Google Cloud Storage.
- Destroy the VM.
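The localize → run → delocalize flow above can be sketched as a local simulation, using ordinary directories as stand-ins for Cloud Storage and the data disk (all paths, file names, and the command here are illustrative assumptions, not dsub internals):

```shell
# Local stand-ins: "gcs" plays the role of Cloud Storage buckets,
# "disk" plays the role of the VM's data disk.
set -euo pipefail

work=$(mktemp -d)
mkdir -p "$work/gcs/input" "$work/gcs/output" "$work/disk"

echo "hello" > "$work/gcs/input/in.txt"

# Localize: copy objects from "GCS" to the "data disk"
cp "$work/gcs/input/in.txt" "$work/disk/in.txt"

# Run the user command against files on the "data disk"
tr 'a-z' 'A-Z' < "$work/disk/in.txt" > "$work/disk/out.txt"

# Delocalize: copy results from the "data disk" back to "GCS"
cp "$work/disk/out.txt" "$work/gcs/output/out.txt"

cat "$work/gcs/output/out.txt"   # HELLO
```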
When the Batch `jobs.create()` API is called, it creates a Batch Job. The Batch API service will then create the VM and disk when your Cloud Project has sufficient Compute Engine quota.
Execution of dsub features is handled by a series of Docker containers on the
VM. The sequence of containers executed is:
- `logging`: copy logs to GCS; run in background
- `prepare`: prepare the data disk and save your script to it
- `localization`: copy GCS objects to the data disk
- `user-command`: execute the user command
- `delocalization`: copy files from the data disk to GCS
- `final_logging`: copy logs to GCS; always run
The prepare step does the following:
- Create runtime directories (`script`, `tmp`, `workingdir`).
- Write the user `--script` or `--command` to a file and make it executable.
- Create the directories for `--input` and `--output` parameters.
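A rough shell sketch of what the prepare step does, using a temporary directory as a stand-in for the data disk (the script file name and its contents are illustrative assumptions, not dsub's actual internals):

```shell
set -euo pipefail

DATA_DISK=$(mktemp -d)   # stands in for /mnt/disks/data

# Create runtime directories
mkdir -p "$DATA_DISK/script" "$DATA_DISK/tmp" "$DATA_DISK/workingdir"

# Write the user --script/--command to a file and make it executable
cat > "$DATA_DISK/script/user-script" <<'EOF'
#!/bin/bash
echo "Hello World" > "${OUT}"
EOF
chmod +x "$DATA_DISK/script/user-script"

# Create the directories for --input and --output parameters
mkdir -p "$DATA_DISK/input" "$DATA_DISK/output"
```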
The data disk path in the Docker containers is:
/mnt/disks/data
The `/mnt/disks/data` folder contains:

- `input`: location of localized `--input` and `--input-recursive` parameters.
- `output`: location for your script to write files to be delocalized for `--output` and `--output-recursive` parameters.
- `script`: location of your dsub `--script` or `--command` script.
- `tmp`: temporary directory for your script. `TMPDIR` is set to this directory.
- `workingdir`: the working directory set before your script runs.
The Batch API supports task states of:
| Task State | Description |
|---|---|
| STATE_UNSPECIFIED | Unknown state. |
| PENDING | The Task is created and waiting for resources. |
| ASSIGNED | The Task is assigned to at least one VM. |
| RUNNING | The Task is running. |
| FAILED | The Task has failed. |
| SUCCEEDED | The Task has succeeded. |
| UNEXECUTED | The Task has not been executed when the Job finishes. |
dsub interprets the above to provide task statuses of:
- RUNNING (`PENDING`, `ASSIGNED`, `RUNNING`)
- SUCCESS (`SUCCEEDED`)
- FAILURE (`FAILED`)
- CANCELED (`UNEXECUTED`)
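This mapping can be expressed as a small shell function (a sketch of the mapping described above, not dsub's actual implementation, which is internal to the provider):

```shell
# Map a Batch API task state to the corresponding dsub task status.
batch_state_to_dsub_status() {
  case "$1" in
    PENDING|ASSIGNED|RUNNING) echo "RUNNING" ;;
    SUCCEEDED)                echo "SUCCESS" ;;
    FAILED)                   echo "FAILURE" ;;
    UNEXECUTED)               echo "CANCELED" ;;
    *)                        echo "UNKNOWN" ;;
  esac
}

batch_state_to_dsub_status ASSIGNED    # prints: RUNNING
batch_state_to_dsub_status SUCCEEDED   # prints: SUCCESS
```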
The google-batch provider saves 3 log files to Cloud Storage every 5 minutes, to the `--logging` location specified to dsub:
- `[prefix].log`: log generated by all containers running on the VM
- `[prefix]-stdout.log`: stdout from your Docker container
- `[prefix]-stderr.log`: stderr from your Docker container
Logging paths and the `[prefix]` are discussed further in Logging.
The google-batch provider supports many resource-related
flags to configure the Compute Engine VMs that tasks run on, such as
`--machine-type` or `--min-cores` and `--min-ram`, as well as `--boot-disk-size`
and `--disk-size`. Additional provider-specific parameters are available
and documented below.
The Docker container launched by the Batch API uses the host VM boot
disk for the system services needed to orchestrate the set of Docker actions
defined by dsub. All other directories set up by dsub are on the
data disk, including `TMPDIR` (as discussed above). In general, end-users
should not need to change `--boot-disk-size`; setting `--disk-size` should be
sufficient. One known exception is when very large Docker images are used,
as such images must be pulled to the boot disk.
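For example, a job that needs a larger VM, data disk, and boot disk might be submitted as follows (the project, region, bucket, and sizes are placeholders to adapt to your setup):

```
dsub \
  --provider google-batch \
  --project my-cloud-project \
  --regions us-central1 \
  --logging gs://my-bucket/logging/ \
  --machine-type n1-standard-8 \
  --boot-disk-size 50 \
  --disk-size 500 \
  --command 'echo "Hello from a larger VM"'
```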
The steps for getting started are summarized below:
- Sign up for a Google account and create a project.

- Provide credentials so `dsub` can call Google APIs:

  ```
  gcloud auth application-default login
  ```

- Create a Google Cloud Storage bucket.

  The dsub logs and output files will be written to a bucket. Create a
  bucket using the storage browser or run the command-line utility
  gsutil, included in the Cloud SDK.

  ```
  gsutil mb gs://my-bucket
  ```

  Change `my-bucket` to a unique name that follows the bucket-naming conventions.

  (By default, the bucket will be in the US, but you can change or refine
  the location setting with the `-l` option.)

- Run a very simple "Hello World" `dsub` job and wait for completion.

  ```
  dsub \
    --provider google-batch \
    --project my-cloud-project \
    --regions us-central1 \
    --logging gs://my-bucket/logging/ \
    --output OUT=gs://my-bucket/output/out.txt \
    --command 'echo "Hello World" > "${OUT}"' \
    --wait
  ```

  Change `my-cloud-project` to your Google Cloud project, and `my-bucket`
  to the bucket you created above.

  The output of the script command will be written to the `OUT` file in
  Cloud Storage that you specify.

- View the output file.

  ```
  gsutil cat gs://my-bucket/output/out.txt
  ```
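If you submit a job without `--wait`, you can check its status with dstat, dsub's companion tool (the project is a placeholder, and `my-job-id` stands for the job id that dsub prints when it submits the job):

```
dstat \
  --provider google-batch \
  --project my-cloud-project \
  --jobs 'my-job-id' \
  --status '*'
```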