Commit e00904e
chore: add internal benchmarking script (#760)
* chore: add internal benchmarking script and readme
* archive benchwrapper to subdirectory
* blacken lint
* fix typo
* update script with preconditions and upload from disk
* download to file
* update multiprocessing and readme
* clean up
* update benchmarking script
* update checksumming options and default num processes
* replace tempfile package usage
1 parent bf13a62 commit e00904e

File tree

7 files changed: +331 −12 lines changed
Lines changed: 36 additions & 12 deletions

# python-storage benchmarking

**This is not an officially supported Google product**

This benchmarking script is used by Storage client library maintainers to benchmark various workloads and collect metrics in order to improve the performance of the library.

Currently the benchmark runs a Write-1-Read-3 (W1R3) workload and measures the two usual QoS performance attributes: latency and throughput.

## Run example

This runs 10,000 iterations of Write-1-Read-3 on files of 5 KiB to 16 KiB, and writes output to a CSV file (default `benchmarking<TIMESTAMP>.csv`):

```bash
$ cd python-storage
$ pip install -e .  # install google.cloud.storage locally
$ cd tests/perf
$ python3 benchmarking.py --num_samples 10000 --max_size 16384
```

## CLI parameters

| Parameter | Description | Possible values | Default |
| --------- | ----------- | --------------- |:-------:|
| `--min_size` | minimum object size in bytes | any positive integer | `5120` (5 KiB) |
| `--max_size` | maximum object size in bytes | any positive integer | `2147483648` (2 GiB) |
| `--num_samples` | number of W1R3 iterations | any positive integer | `1000` |
| `--r` | bucket region for benchmarks | any GCS region | `US` |
| `--p` | number of processes (multiprocessing enabled) | any positive integer | `16` (recommended not to exceed 16) |
| `--o` | file to output results to | any file path | `benchmarking<TIMESTAMP>.csv` |
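The flags in the table correspond to a plain `argparse` parser. A minimal sketch for illustration only, with defaults taken from the table above (the real definitions live in `benchmarking.py`):

```python
import argparse
import time

TIMESTAMP = time.strftime("%Y%m%d-%H%M%S")


def build_parser():
    """Build a parser mirroring the CLI parameters documented above."""
    parser = argparse.ArgumentParser(description="W1R3 benchmarking sketch")
    parser.add_argument("--min_size", type=int, default=5120)
    parser.add_argument("--max_size", type=int, default=2147483648)
    parser.add_argument("--num_samples", type=int, default=1000)
    parser.add_argument("--r", type=str, default="US")
    parser.add_argument("--p", type=int, default=16)
    parser.add_argument("--o", type=str, default=f"benchmarking{TIMESTAMP}.csv")
    return parser


# Same invocation as the run example above
args = build_parser().parse_args(["--num_samples", "10000", "--max_size", "16384"])
```

Unspecified flags fall back to their defaults, so the run example benchmarks 5 KiB (`--min_size` default) to 16 KiB objects in the `US` region.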
## Workload definition and CSV headers

For each invocation of the benchmark, write a new object of a random size between `min_size` and `max_size`. After the successful write, download the object in full three times. For each of the four operations, record the following fields:

| Field | Description |
| ----- | ----------- |
| Op | the name of the operation (WRITE, READ[{0,1,2}]) |
| ObjectSize | the size of the object in bytes |
| LibBufferSize | configured to use the [library default of 100 MiB](https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/blob.py#L135) |
| Crc32cEnabled | bool: whether crc32c was computed for the operation |
| MD5Enabled | bool: whether MD5 was computed for the operation |
| ApiName | defaults to JSON |
| ElapsedTimeUs | the elapsed time in microseconds the operation took |
| Status | completion state of the operation [OK, FAIL] |
| RunID | timestamp from the benchmarking run |
| AppBufferSize | N/A |
| CpuTimeUs | N/A |
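Each of the four operations is timed individually and reported in microseconds. A minimal, self-contained sketch of the per-operation timing and row shape described above (`fake_op` is a stand-in for a real upload or download, and only a few of the CSV fields are shown):

```python
import time


def timed_us(op):
    """Run op() and return elapsed wall time in microseconds."""
    start = time.monotonic_ns()
    op()
    end = time.monotonic_ns()
    return round((end - start) / 1000)  # nanoseconds -> microseconds


def fake_op():
    time.sleep(0.01)  # stand-in for a real WRITE or READ against GCS


elapsed = timed_us(fake_op)
row = {
    "Op": "WRITE",
    "ObjectSize": 5120,
    "ElapsedTimeUs": elapsed,  # a 10 ms sleep yields at least ~10,000 µs
    "Status": "OK",
}
```

A monotonic clock is used rather than wall-clock time so the measurement is immune to system clock adjustments during a long run.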
Lines changed: 274 additions & 0 deletions

```python
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Performance benchmarking script. This is not an officially supported Google product."""

import argparse
import csv
import logging
import multiprocessing
import os
import random
import time
import uuid

from functools import partial, update_wrapper

from google.cloud import storage


##### DEFAULTS & CONSTANTS #####
HEADER = [
    "Op",
    "ObjectSize",
    "AppBufferSize",
    "LibBufferSize",
    "Crc32cEnabled",
    "MD5Enabled",
    "ApiName",
    "ElapsedTimeUs",
    "CpuTimeUs",
    "Status",
    "RunID",
]
CHECKSUM = ["md5", "crc32c", None]
TIMESTAMP = time.strftime("%Y%m%d-%H%M%S")
DEFAULT_API = "JSON"
DEFAULT_BUCKET_LOCATION = "US"
DEFAULT_MIN_SIZE = 5120  # 5 KiB
DEFAULT_MAX_SIZE = 2147483648  # 2 GiB
DEFAULT_NUM_SAMPLES = 1000
DEFAULT_NUM_PROCESSES = 16
DEFAULT_LIB_BUFFER_SIZE = 104857600  # https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/blob.py#L135
NOT_SUPPORTED = -1


def log_performance(func):
    """Log latency and throughput output per operation call."""
    # Holds benchmarking results for each operation
    res = {
        "ApiName": DEFAULT_API,
        "RunID": TIMESTAMP,
        "CpuTimeUs": NOT_SUPPORTED,
        "AppBufferSize": NOT_SUPPORTED,
        "LibBufferSize": DEFAULT_LIB_BUFFER_SIZE,
    }

    try:
        elapsed_time = func()
    except Exception as e:
        logging.exception(
            f"Caught an exception while running operation {func.__name__}\n {e}"
        )
        res["Status"] = "FAIL"
        elapsed_time = NOT_SUPPORTED
    else:
        res["Status"] = "OK"

    checksum = func.keywords.get("checksum")
    num = func.keywords.get("num", None)
    res["ElapsedTimeUs"] = elapsed_time
    res["ObjectSize"] = func.keywords.get("size")
    res["Crc32cEnabled"] = checksum == "crc32c"
    res["MD5Enabled"] = checksum == "md5"
    res["Op"] = func.__name__
    if res["Op"] == "READ":
        res["Op"] += f"[{num}]"

    return [
        res["Op"],
        res["ObjectSize"],
        res["AppBufferSize"],
        res["LibBufferSize"],
        res["Crc32cEnabled"],
        res["MD5Enabled"],
        res["ApiName"],
        res["ElapsedTimeUs"],
        res["CpuTimeUs"],
        res["Status"],
        res["RunID"],
    ]


def WRITE(bucket, blob_name, checksum, size, **kwargs):
    """Perform an upload and return latency."""
    blob = bucket.blob(blob_name)
    file_path = f"{os.getcwd()}/{uuid.uuid4().hex}"
    # Create a random file locally on disk
    with open(file_path, "wb") as file_obj:
        file_obj.write(os.urandom(size))

    start_time = time.monotonic_ns()
    blob.upload_from_filename(file_path, checksum=checksum, if_generation_match=0)
    end_time = time.monotonic_ns()

    elapsed_time = round(
        (end_time - start_time) / 1000
    )  # convert nanoseconds to microseconds

    # Clean up the local file
    cleanup_file(file_path)

    return elapsed_time


def READ(bucket, blob_name, checksum, **kwargs):
    """Perform a download and return latency."""
    blob = bucket.blob(blob_name)
    if not blob.exists():
        raise Exception("Blob does not exist. Previous WRITE failed.")

    file_path = f"{os.getcwd()}/{blob_name}"
    with open(file_path, "wb") as file_obj:
        start_time = time.monotonic_ns()
        blob.download_to_file(file_obj, checksum=checksum)
        end_time = time.monotonic_ns()

    elapsed_time = round(
        (end_time - start_time) / 1000
    )  # convert nanoseconds to microseconds

    # Clean up the local file
    cleanup_file(file_path)

    return elapsed_time


def cleanup_file(file_path):
    """Clean up local file on disk."""
    try:
        os.remove(file_path)
    except Exception as e:
        logging.exception(f"Caught an exception while deleting local file\n {e}")


def _wrapped_partial(func, *args, **kwargs):
    """Helper method to create a partial and propagate the name and doc of the original function."""
    partial_func = partial(func, *args, **kwargs)
    update_wrapper(partial_func, func)
    return partial_func


def _generate_func_list(bucket_name, min_size, max_size):
    """Generate Write-1-Read-3 workload."""
    # generate random size in bytes using a uniform distribution
    size = random.randrange(min_size, max_size)
    blob_name = f"{TIMESTAMP}-{uuid.uuid4().hex}"

    # generate random checksumming type: md5, crc32c or None
    idx_checksum = random.choice([0, 1, 2])
    checksum = CHECKSUM[idx_checksum]

    func_list = [
        _wrapped_partial(
            WRITE,
            storage.Client().bucket(bucket_name),
            blob_name,
            size=size,
            checksum=checksum,
        ),
        *[
            _wrapped_partial(
                READ,
                storage.Client().bucket(bucket_name),
                blob_name,
                size=size,
                checksum=checksum,
                num=i,
            )
            for i in range(3)
        ],
    ]
    return func_list


def benchmark_runner(args):
    """Run benchmarking iterations."""
    results = []
    for func in _generate_func_list(args.b, args.min_size, args.max_size):
        results.append(log_performance(func))

    return results


def main(args):
    # Create a storage bucket to run benchmarking, if it does not already exist
    client = storage.Client()
    bucket = client.bucket(args.b)
    if not bucket.exists():
        bucket = client.create_bucket(args.b, location=args.r)

    # Launch benchmark_runner using multiprocessing
    p = multiprocessing.Pool(args.p)
    pool_output = p.map(benchmark_runner, [args for _ in range(args.num_samples)])
    p.close()
    p.join()

    # Output to CSV file
    with open(args.o, "w") as file:
        writer = csv.writer(file)
        writer.writerow(HEADER)
        for result in pool_output:
            for row in result:
                writer.writerow(row)
    print(f"Successfully ran benchmarking. Please find your output log at {args.o}")

    # Clean up and delete the bucket
    try:
        bucket.delete(force=True)
    except Exception as e:
        logging.exception(f"Caught an exception while deleting bucket\n {e}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--min_size",
        type=int,
        default=DEFAULT_MIN_SIZE,
        help="Minimum object size in bytes",
    )
    parser.add_argument(
        "--max_size",
        type=int,
        default=DEFAULT_MAX_SIZE,
        help="Maximum object size in bytes",
    )
    parser.add_argument(
        "--num_samples",
        type=int,
        default=DEFAULT_NUM_SAMPLES,
        help="Number of iterations",
    )
    parser.add_argument(
        "--p",
        type=int,
        default=DEFAULT_NUM_PROCESSES,
        help="Number of processes (multiprocessing enabled)",
    )
    parser.add_argument(
        "--r", type=str, default=DEFAULT_BUCKET_LOCATION, help="Bucket location"
    )
    parser.add_argument(
        "--o",
        type=str,
        default=f"benchmarking{TIMESTAMP}.csv",
        help="File to output results to",
    )
    parser.add_argument(
        "--b",
        type=str,
        default=f"benchmarking{TIMESTAMP}",
        help="Storage bucket name",
    )
    args = parser.parse_args()

    main(args)
```
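The CSV the script writes can be post-processed with the standard library alone. A sketch, using a hypothetical two-row output in the script's format, that derives throughput in MiB/s from the `ObjectSize` and `ElapsedTimeUs` columns:

```python
import csv
import io

# Hypothetical benchmarking output rows in the script's CSV format.
sample = """Op,ObjectSize,AppBufferSize,LibBufferSize,Crc32cEnabled,MD5Enabled,ApiName,ElapsedTimeUs,CpuTimeUs,Status,RunID
WRITE,1048576,-1,104857600,True,False,JSON,500000,-1,OK,20220101-000000
READ[0],1048576,-1,104857600,True,False,JSON,250000,-1,OK,20220101-000000
"""


def throughput_mib_s(row):
    """ObjectSize bytes over ElapsedTimeUs microseconds, in MiB/s."""
    size_mib = int(row["ObjectSize"]) / (1024 * 1024)
    elapsed_s = int(row["ElapsedTimeUs"]) / 1_000_000
    return size_mib / elapsed_s


rows = list(csv.DictReader(io.StringIO(sample)))
speeds = {row["Op"]: throughput_mib_s(row) for row in rows}
# 1 MiB in 0.5 s -> 2.0 MiB/s; 1 MiB in 0.25 s -> 4.0 MiB/s
```

Rows with `Status` of `FAIL` carry an `ElapsedTimeUs` of `-1` and should be filtered out before aggregating.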
Lines changed: 21 additions & 0 deletions

# storage benchwrapper

main.py is a gRPC wrapper around the storage library for benchmarking purposes.

## Running

```bash
$ export STORAGE_EMULATOR_HOST=http://localhost:8080
$ pip install grpcio
$ cd storage
$ pip install -e .  # install google.cloud.storage locally
$ cd tests/perf
$ python3 benchwrapper.py --port 8081
```

## Re-generating protos

```bash
$ pip install grpcio-tools
$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. *.proto
```

packages/google-cloud-storage/tests/perf/benchwrapper.py renamed to packages/google-cloud-storage/tests/perf/benchwrapper/benchwrapper.py

File renamed without changes.

packages/google-cloud-storage/tests/perf/storage.proto renamed to packages/google-cloud-storage/tests/perf/benchwrapper/storage.proto

File renamed without changes.

packages/google-cloud-storage/tests/perf/storage_pb2.py renamed to packages/google-cloud-storage/tests/perf/benchwrapper/storage_pb2.py

File renamed without changes.

packages/google-cloud-storage/tests/perf/storage_pb2_grpc.py renamed to packages/google-cloud-storage/tests/perf/benchwrapper/storage_pb2_grpc.py

File renamed without changes.
