This sample shows how to train an ML model to predict molecular energy, including how to preprocess the raw data files.
The dataset for this sample comes from this Kaggle Dataset. However, that dataset contains only a small number of preprocessed JSON files, so this sample downloads the raw data files directly from this FTP source (in SDF format instead of JSON). Here's a more detailed description of the MDL/SDF file format.
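To give a feel for the format, here is a minimal, hand-written SDF record (a single water molecule). It is illustrative only, not taken from the dataset, and the property field name and value are made up:

```
water
  example  2D

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  3  1  0
M  END
> <ENERGY>
-0.5

$$$$
```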
These are the rough steps:
- data-extractor.py: extracts the data files
- preprocess.py: runs an Apache Beam pipeline for element-wise transformations and tf.Transform for full-pass transformations. This can be run on Google Cloud Dataflow.
- trainer/task.py: trains and evaluates the (TensorFlow) model. This can be run on Google Cloud ML Engine.
This model only does very simple preprocessing. It uses Apache Beam to parse the SDF files and count how many carbon, hydrogen, oxygen, and nitrogen atoms each molecule has. It then uses tf.Transform to normalize those counts to values between 0 and 1. Finally, the normalized counts are fed into a TensorFlow deep neural network. There are many more interesting features that could be extracted to make more accurate predictions.
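To make the pipeline concrete, here is a minimal sketch of those three stages. It is not the sample's actual code: the function names, the per-element feature names, and the layer sizes are illustrative assumptions (see preprocess.py and trainer/task.py for the real implementations).

```python
from collections import Counter

import tensorflow as tf
import tensorflow_transform as tft

ELEMENTS = ['C', 'H', 'O', 'N']

def count_atoms(atom_block_lines):
    """Element-wise step (run inside an Apache Beam transform).

    In a V2000 SDF atom block, the element symbol is the fourth column.
    """
    symbols = Counter(line.split()[3] for line in atom_block_lines)
    return {element: symbols[element] for element in ELEMENTS}

def preprocessing_fn(inputs):
    """Full-pass step (tf.Transform): scale each atom count to [0, 1]."""
    return {name: tft.scale_to_0_1(count) for name, count in inputs.items()}

def make_estimator(model_dir):
    """The normalized counts feed a simple deep neural network regressor."""
    return tf.estimator.DNNRegressor(
        feature_columns=[tf.feature_column.numeric_column(e) for e in ELEMENTS],
        hidden_units=[128, 64],  # illustrative layer sizes
        model_dir=model_dir)
```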
NOTE: This sample requires Python 2; Apache Beam does not currently support Python 3.
You can clone the GitHub repository and then navigate to the molecules sample directory. The rest of the instructions assume that you are in that directory.
git clone https://github.com/GoogleCloudPlatform/cloudml-samples.git
cd cloudml-samples/molecules

Using virtualenv to isolate your dependencies is recommended. To set up, make sure you have the virtualenv package installed.

pip install --user virtualenv

To create and activate a new virtual environment, run the following commands:
python -m virtualenv env
source env/bin/activate

To deactivate the virtual environment, run:

deactivate

See virtualenv for details.
You can use the requirements.txt file to install the dependencies.

pip install -r requirements.txt

By default, all the scripts store temporary data in /tmp/cloudml-samples/molecules/. Also by default, they use only 5 data files, each containing 250,000 molecules.
# Extract the data files
python data-extractor.py
# Preprocess the datasets
python preprocess.py
# Train and evaluate the model
python trainer/task.py
# Get the model path
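# (Each export lands in a timestamped subdirectory, so the reverse sort picks the most recent one.)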
MODEL_DIR=$(ls -d -1 /tmp/cloudml-samples/molecules/model/export/molecules/* | sort -r | head -n 1)
echo "Model: $MODEL_DIR"
# Make local predictions
gcloud ml-engine local predict \
--model-dir $MODEL_DIR \
--json-instances sample-requests.json

An end-to-end script has been included for your convenience; you can specify a different number of data files using the --total-data-files option, as well as a different working directory using the --work-dir option.
# Simple run
bash run-local
# Run in your home directory
bash run-local --work-dir ~/cloudml-samples/molecules

For reference, these are the real energy values for the sample-requests.json file:
PREDICTIONS
[37.801]
[44.1107]
[19.4085]
[-0.1086]

To run on Google Cloud, all the files must reside in Google Cloud Storage. We'll start by defining our work directory.
WORK_DIR=gs://<Your bucket name>/cloudml-samples/molecules

After specifying our work directory, we can then extract the data files, preprocess, and train in Google Cloud using that location.
# Extract the data files
DATA_DIR=$WORK_DIR/data
python data-extractor.py \
--data-dir $DATA_DIR \
--total-data-files 10
# Preprocess the datasets using Apache Beam's DataflowRunner
PROJECT=$(gcloud config get-value project)
TEMP_DIR=$WORK_DIR/temp
PREPROCESS_DATA=$WORK_DIR/PreprocessData
python preprocess.py \
--data-dir $DATA_DIR \
--temp-dir $TEMP_DIR \
--preprocess-data $PREPROCESS_DATA \
--runner DataflowRunner \
--project $PROJECT \
--temp_location $TEMP_DIR \
--setup_file ./setup.py
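Under the hood, flags like these are typically forwarded to Beam as pipeline options. Here is a hedged sketch of that wiring; the project and bucket names are placeholders, and preprocess.py's actual argument handling may differ.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; substitute your own project and bucket.
beam_options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    temp_location='gs://my-bucket/cloudml-samples/molecules/temp',
    setup_file='./setup.py',  # ships the local package to the Dataflow workers
)

with beam.Pipeline(options=beam_options) as pipeline:
    # The real script builds the SDF parsing/counting transforms and the
    # tf.Transform full-pass analysis here; this stand-in just keeps the
    # pipeline non-empty.
    _ = pipeline | beam.Create(['placeholder'])
```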
# Train and evaluate the model in Google ML Engine
JOB="cloudml_samples_molecules_$(date +%Y%m%d_%H%M%S)"
BUCKET=$(echo $WORK_DIR | egrep -o gs://[-_.a-z0-9]+)
EXPORT_DIR=$WORK_DIR/model
gcloud ml-engine jobs submit training $JOB \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket $BUCKET \
-- \
--preprocess-data $PREPROCESS_DATA \
--export-dir $EXPORT_DIR
# Get the model path
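# (As in the local run, the reverse sort picks the most recent timestamped export.)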
MODEL_DIR=$(gsutil ls -d $EXPORT_DIR/export/molecules/* | sort -r | head -n 1)
echo "Model: $MODEL_DIR"
# Create a model in Google Cloud ML Engine
MODEL=molecules
gcloud ml-engine models create $MODEL
# Create a model version
VERSION=$JOB
gcloud ml-engine versions create $VERSION \
--model $MODEL \
--origin $MODEL_DIR
# Make predictions
gcloud ml-engine predict \
--model $MODEL \
--version $VERSION \
--json-instances sample-requests.json

There's also an end-to-end script for a cloud run. You can also specify the number of data files with the --total-data-files option, and --work-dir must point to a Google Cloud Storage location.
WORK_DIR=gs://<Your bucket name>/cloudml-samples/molecules
# Simple run
bash run-cloud --work-dir $WORK_DIR
# Run using 10 data files
bash run-cloud --work-dir $WORK_DIR --total-data-files 10

For reference, these are the real energy values for the sample-requests.json file:
PREDICTIONS
[37.801]
[44.1107]
[19.4085]
[-0.1086]