# Benchmarking Python Feature Server

Here we provide tools for benchmarking the Python-based feature server with five online stores: Redis, AWS DynamoDB, GCP Datastore, Cassandra, and Astra DB. Follow the instructions below to reproduce the benchmarks.

_Tested with: `feast 0.25.1`_

## Prerequisites

You need to have the following installed:

* Python `3.8+`
* Feast `0.25+`
* Docker
* Docker Compose `v2.x`
* Vegeta
* `parquet-tools`

All these benchmarks are run on an EC2 instance (c5.4xlarge, 16 vCPU, 32 GiB memory) or a GCP GCE instance (c2-standard-16, 16 vCPU, 64 GiB memory), in the same region as the target online stores.

**Note**: see [here](cloud_machines.md) for details on how to provision the cloud instances to run the tests.

## Generate Data

For all of the following benchmarks, you'll need to generate the data using `data_generator.py` under the top-level directory of this repo. Just `cd` to the main directory and run `python data_generator.py`.

## Redis

1. Apply feature definitions to create a Feast repo.

```
cd python/feature_repos/redis
feast apply
```

2. Deploy Redis & feature servers using docker-compose

```
cd ../../docker/redis
docker-compose up -d
```

If everything goes well, you should see an output like this:

```
Creating redis_redis_1 ... done
Creating redis_feast_1 ... done
Creating redis_feast_2 ... done
Creating redis_feast_3 ... done
Creating redis_feast_4 ... done
Creating redis_feast_5 ... done
Creating redis_feast_6 ... done
Creating redis_feast_7 ... done
Creating redis_feast_8 ... done
Creating redis_feast_9 ... done
Creating redis_feast_10 ... done
Creating redis_feast_11 ... done
Creating redis_feast_12 ... done
Creating redis_feast_13 ... done
Creating redis_feast_14 ... done
Creating redis_feast_15 ... done
Creating redis_feast_16 ... done
```

3.
   Materialize data to Redis

```
cd ../../feature_repos/redis

# This is unfortunately necessary because inside docker feature servers resolve
# Redis host name as `redis`, but since we're running materialization from shell,
# Redis is accessible on localhost:
sed -i 's/redis:6379/localhost:6379/g' feature_store.yaml

feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

# Make sure to change this back, since it can interfere with the feature servers
# if you run another docker-compose command later:
sed -i 's/localhost:6379/redis:6379/g' feature_store.yaml
```

4. Check that feature servers are working & they have materialized data

```
cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an environment variable with:

```
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475 `)

Query the feature server with

```
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the `"values"` field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```

5. Run Benchmarks

```
cd python
./run-benchmark.sh
```

## AWS DynamoDB

For this benchmark, you'll need to have AWS credentials configured in `~/.aws/credentials`.

1. Apply feature definitions to create a Feast repo.

```
cd feature_repos/dynamo
feast apply
```

2. Deploy feature servers using docker-compose

```
cd ../../docker/dynamo
docker-compose up -d
```

If everything goes well, you should see an output like this: ``` Creating dynamo_feast_1 ...
done Creating dynamo_feast_2 ... done Creating dynamo_feast_3 ... done Creating dynamo_feast_4 ... done Creating dynamo_feast_5 ... done Creating dynamo_feast_6 ... done Creating dynamo_feast_7 ... done Creating dynamo_feast_8 ... done Creating dynamo_feast_9 ... done Creating dynamo_feast_10 ... done Creating dynamo_feast_11 ... done Creating dynamo_feast_12 ... done Creating dynamo_feast_13 ... done Creating dynamo_feast_14 ... done Creating dynamo_feast_15 ... done Creating dynamo_feast_16 ... done ```

3. Materialize data to DynamoDB

```
cd ../../feature_repos/dynamo
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
```

4. Check that feature servers are working & they have materialized data

```
cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an environment variable with:

```
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475 `)

Query the feature server with

```
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the `"values"` field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```

5. Run Benchmarks

```
cd python
./run-benchmark.sh
```

## GCP Datastore

For this benchmark, you need GCP credentials accessible. Here it is assumed that they are all in `${HOME}/.config/gcloud`, which will be made available to the Docker containers running the feature server (adjust as needed by inspecting the `docker-compose.yml`).

1. Apply feature definitions to create a Feast repo.

```
cd feature_repos/datastore
feast apply
```

2. Deploy feature servers using docker-compose

```
cd ../../docker/datastore
docker-compose up -d
```

If everything goes well, you should see an output like this:

```
Creating datastore_feast_1 ... done
Creating datastore_feast_2 ... done
Creating datastore_feast_3 ... done
Creating datastore_feast_4 ... done
Creating datastore_feast_5 ... done
Creating datastore_feast_6 ... done
Creating datastore_feast_7 ... done
Creating datastore_feast_8 ... done
Creating datastore_feast_9 ... done
Creating datastore_feast_10 ... done
Creating datastore_feast_11 ... done
Creating datastore_feast_12 ... done
Creating datastore_feast_13 ... done
Creating datastore_feast_14 ... done
Creating datastore_feast_15 ... done
Creating datastore_feast_16 ... done
```

> _Note_: The Python google package requires not only the credentials to be accessible
> (in read-write mode, as can be seen in the Datastore docker-compose.yml),
> but also the Google Cloud SDK to be installed.
> For this reason there is an additional step in the Dockerfile for Datastore,
> which handles the installation. [Reference](https://stackoverflow.com/questions/28372328/how-to-install-the-google-cloud-sdk-in-a-docker-image).

3. Materialize data to Datastore

```
cd ../../feature_repos/datastore
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
```

4. Check that feature servers are working & they have materialized data ``` cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 ```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an environment variable with:

```
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475 `)

Query the feature server with

```
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the `"values"` field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```

5. Run Benchmarks

```
cd python
./run-benchmark.sh
```

## Cassandra

This runs on a single-node Cassandra cluster running in Docker alongside the benchmarking containers.

1.
   Start the docker containers:

```
cd docker/cassandra
docker-compose up -d
```

If everything goes well, you should see an output like this:

```
⠿ Network cassandra_default Created 0.0s
⠿ Container cassandra-cassandra-1 Started 0.6s
⠿ Container cassandra-feast-16 Started 1.0s
⠿ Container cassandra-feast-1 Started 1.5s
⠿ Container cassandra-feast-8 Started 3.0s
⠿ Container cassandra-feast-4 Started 2.4s
⠿ Container cassandra-feast-2 Started 2.4s
⠿ Container cassandra-feast-14 Started 2.2s
⠿ Container cassandra-feast-5 Started 1.5s
⠿ Container cassandra-feast-3 Started 2.8s
⠿ Container cassandra-feast-13 Started 0.8s
⠿ Container cassandra-feast-9 Started 1.3s
⠿ Container cassandra-feast-11 Started 1.7s
⠿ Container cassandra-feast-15 Started 0.9s
⠿ Container cassandra-feast-6 Started 2.8s
⠿ Container cassandra-feast-12 Started 2.0s
⠿ Container cassandra-feast-7 Started 2.5s
⠿ Container cassandra-feast-10 Started 1.8s
```

Wait about 60-90 seconds for Cassandra to fully start. Then you can proceed (if Cassandra is not ready yet, the next command will fail and you can retry it a little later).

2. Create the destination keyspace in Cassandra. Check the output of this command to make sure `feast_test` is now listed.

```
docker exec -it cassandra-cassandra-1 cqlsh -e \
  "CREATE KEYSPACE feast_test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}; DESC KEYSPACES;"
```

3. From the host machine, provision the feature store:

```
cd ../../feature_repos/cassandra/

# This is unfortunately necessary because inside docker feature servers resolve
# Cassandra host name as `cassandra`, but since we're running materialization from shell,
# Cassandra is accessible on localhost:
sed -i 's/- cassandra/- localhost/g' feature_store.yaml

feast apply

# Make sure to change this back, since it can interfere with the feature servers
# if you run another docker-compose command later:
sed -i 's/- localhost/- cassandra/g' feature_store.yaml
```

4.
   Similarly, materialize from the host machine:

```
# This is unfortunately necessary because inside docker feature servers resolve
# Cassandra host name as `cassandra`, but since we're running materialization from shell,
# Cassandra is accessible on localhost:
sed -i 's/- cassandra/- localhost/g' feature_store.yaml

feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")

# Make sure to change this back, since it can interfere with the feature servers
# if you run another docker-compose command later:
sed -i 's/- localhost/- cassandra/g' feature_store.yaml
```

3b. A workaround for the Dockerized feast to work

The Docker containers have a _copy_ of the registry directory, including `data/registry.db`. But the image is built before the `apply` step above (this is unavoidable if we want to create the keyspace and keep Cassandra as part of the `docker-compose` setup), so the Dockerized `feast` instances do not have the updated `registry.db`.

For the time being, a workaround is as follows: ``` docker cp data/registry.db cassandra-feast-1:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-2:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-3:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-4:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-5:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-6:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-7:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-8:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-9:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-10:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-11:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-12:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-13:/feature_repo/data/registry.db docker cp data/registry.db
cassandra-feast-14:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-15:/feature_repo/data/registry.db docker cp data/registry.db cassandra-feast-16:/feature_repo/data/registry.db cd ../../docker/cassandra/ docker-compose restart cd ../../feature_repos/cassandra/ ```

5. Check that feature servers are working & they have materialized data

```
cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6
```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an environment variable with:

```
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475 `)

Query the feature server with

```
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the `"values"` field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```

6. Run Benchmarks

```
cd python
./run-benchmark.sh
```

## Astra DB

Ensure you have an Astra DB instance in the same AWS region as your benchmarking client. To connect to it you need the Client ID and the Client Secret from a database token, as well as the "secure connect bundle" zip file, which should be placed inside the `python/feature_repos/astra_db/` directory.

Adjust the `feature_store.yaml` file in that directory to reflect the Client ID, Client Secret, database keyspace name, AWS region name and secure connect bundle filename.
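
Before applying the feature definitions, it can help to confirm that the bundle is actually in the repo directory. The following is only a sketch: the helper name `find_bundle` is hypothetical, and it assumes the bundle keeps a `secure-connect-*.zip` file name. It prints each bundle it finds as a relative path and returns a non-zero status if there is none.

```shell
# Hypothetical helper (not part of this repo): list secure connect bundles in
# a directory. Assumes the bundle file is still named `secure-connect-*.zip`.
find_bundle() {
  dir="${1:-.}"
  found=1
  for f in "$dir"/secure-connect-*.zip; do
    # An unmatched glob stays a literal pattern, so check the file really exists.
    if [ -f "$f" ]; then
      echo "./$(basename "$f")"
      found=0
    fi
  done
  return "$found"
}

# Example, run from the `python` directory:
# find_bundle feature_repos/astra_db
```
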

**Note**: to share the same `feature_store.yaml` between the Dockerized `feast` instances and the one on the host machine, put the secure connect bundle in the `python/feature_repos/astra_db/` directory itself and refer to it as `./secure-connect-DATABASENAME.zip` (i.e. with a relative path).

1. Apply feature definitions to create a Feast repo.

```
cd feature_repos/astra_db
feast apply
```

2. Deploy feature servers using docker-compose

```
cd ../../docker/astra_db
docker-compose up -d
```

If everything goes well, you should see an output like this:

```
⠿ Network astra_db_default Created 0.0s
⠿ Container astra_db-feast-1 Started 2.7s
⠿ Container astra_db-feast-16 Started 2.8s
⠿ Container astra_db-feast-3 Started 2.4s
⠿ Container astra_db-feast-5 Started 1.4s
⠿ Container astra_db-feast-11 Started 1.8s
⠿ Container astra_db-feast-4 Started 1.6s
⠿ Container astra_db-feast-2 Started 1.2s
⠿ Container astra_db-feast-6 Started 0.8s
⠿ Container astra_db-feast-12 Started 2.1s
⠿ Container astra_db-feast-7 Started 3.0s
⠿ Container astra_db-feast-8 Started 0.8s
⠿ Container astra_db-feast-10 Started 2.8s
⠿ Container astra_db-feast-14 Started 1.2s
⠿ Container astra_db-feast-15 Started 2.9s
⠿ Container astra_db-feast-13 Started 1.8s
⠿ Container astra_db-feast-9 Started 2.3s
```

3. Materialize data to Astra DB

```
cd ../../feature_repos/astra_db
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
```

4. Check that feature servers are working & they have materialized data ``` cd ../../..
parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 ```

This should return something like this:

```
+----------+
|   entity |
|----------|
|       94 |
|     1992 |
|     4475 |
```

Put these numbers into an environment variable with:

```
TEST_ENTITY_IDS=`parquet-tools show --columns entity generated_data.parquet 2>/dev/null | head -n 6 | tail -n 3 | sed 's/|//g' | paste -d, -s`
echo $TEST_ENTITY_IDS
```

(which should output something like `94 , 1992 , 4475 `)

Query the feature server with

```
curl -X POST \
  "http://127.0.0.1:6566/get-online-features" \
  -H "accept: application/json" \
  -d "{
    \"feature_service\": \"feature_service_0\",
    \"entities\": {
      \"entity\": [$TEST_ENTITY_IDS]
    }
  }" | jq
```

In the output, make sure that the `"values"` field contains no null values. It should look something like this:

```
{
  "values": [
    4475,
    1551,
    9889,
```

5. Run Benchmarks

```
cd python
./run-benchmark.sh
```
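
The entity-ID extraction used in the verification step of every section can be wrapped in a small helper. This is only a sketch: the function name `extract_entity_ids` is hypothetical, and it assumes the table layout shown above (three header lines followed by one ID per row). Unlike the inline pipeline, it also strips the padding spaces.

```shell
# Hypothetical helper (not part of this repo): read a `parquet-tools show`
# table on stdin and print the first three entity IDs, comma-separated.
extract_entity_ids() {
  # head/tail: keep the first three data rows after the three header lines.
  # sed: drop the table borders and padding spaces.
  # paste: join the remaining rows with commas.
  head -n 6 | tail -n 3 | sed 's/|//g; s/ //g' | paste -d, -s -
}

# Example with a literal table in the same layout as the output above:
printf '%s\n' \
  '+----------+' \
  '|   entity |' \
  '|----------|' \
  '|       94 |' \
  '|     1992 |' \
  '|     4475 |' | extract_entity_ids
```

With the real data, the same function could be used as `TEST_ENTITY_IDS=$(parquet-tools show --columns entity generated_data.parquet 2>/dev/null | extract_entity_ids)`.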