Skip to content

perf: add microbenchmarks for crc32c and MRD reads#17392

Draft
chandra-siri wants to merge 14 commits into
mainfrom
perf-checksum-benchmarks
Draft

perf: add microbenchmarks for crc32c and MRD reads#17392
chandra-siri wants to merge 14 commits into
mainfrom
perf-checksum-benchmarks

Conversation

@chandra-siri
Copy link
Copy Markdown
Contributor

Benchmarks for crc32c execution time and MRD reads with checksum enabled vs disabled.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces two new microbenchmark scripts: "benchmark_crc32c.py" to measure the execution time of "google_crc32c.value", and "benchmark_mrd_reads.py" to benchmark GCS Object Range Downloads using "AsyncMultiRangeDownloader" (MRD) with and without checksums. Feedback on "benchmark_mrd_reads.py" highlights two potential issues: hardcoding the object size to 10 GiB (which can cause out-of-range errors if a smaller object is specified) and a potential crash in "random.randint" if the requested download size exceeds the object size. Suggestions were provided to dynamically fetch the object size and add a guard clause.

Comment on lines +113 to +116
# 10 GiB in bytes
object_size_bytes = 10 * 1024 * 1024 * 1024

grpc_client = AsyncGrpcClient()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Hardcoding object_size_bytes to 10 GiB breaks the benchmark when a different --object is specified via the command line. If the specified object is smaller than 10 GiB, generating random offsets up to 10 GiB will result in out-of-range errors during download.

To fix this, dynamically retrieve the actual object size using the grpc_client.

Suggested change
# 10 GiB in bytes
object_size_bytes = 10 * 1024 * 1024 * 1024
grpc_client = AsyncGrpcClient()
grpc_client = AsyncGrpcClient()
try:
object_metadata = await grpc_client.get_object(args.bucket, args.object)
object_size_bytes = object_metadata.size
except Exception as e:
print(f"Warning: Could not retrieve object metadata ({e}). Defaulting to 10GiB.", file=sys.stderr)
object_size_bytes = 10 * 1024 * 1024 * 1024

Comment on lines +123 to +125
for size_str, size_bytes in sizes_to_test:
# Pre-generate random offsets so that both Enabled and Disabled configurations run on the exact same offsets
offsets = [random.randint(0, object_size_bytes - size_bytes) for _ in range(args.iterations)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If size_bytes is larger than object_size_bytes, random.randint(0, object_size_bytes - size_bytes) will raise a ValueError: empty range for randrange() and crash the script.

Adding a guard clause to skip sizes larger than the object size makes the benchmark much more robust.

    for size_str, size_bytes in sizes_to_test:
        if size_bytes > object_size_bytes:
            print(f"Skipping size {size_str} ({size_bytes} bytes) because it is larger than the object size ({object_size_bytes} bytes).", file=sys.stderr)
            continue

        # Pre-generate random offsets so that both Enabled and Disabled configurations run on the exact same offsets
        offsets = [random.randint(0, object_size_bytes - size_bytes) for _ in range(args.iterations)]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant