fix(caching): derive sampling seed from data to prevent deterministic hash collisions#14610
fix(caching): derive sampling seed from data to prevent deterministic hash collisions#146103em0 wants to merge 4 commits intostreamlit:developfrom
Conversation
… hash collisions Fixes two classes of cache key collision in st.cache_data / st.cache_resource: 1. Fixed-seed sampling (NumPy, Pandas, Polars) The sampling step that skips full-data hashing for large arrays/DataFrames used a hardcoded seed (RandomState(0) / random_state=0 / seed=0). Because the seed is constant and the source code is public, an attacker can pre-compute exactly which indices will be sampled and craft an adversarial input that matches a legitimate input at every sampled position while differing arbitrarily elsewhere — a guaranteed, zero-uncertainty collision. Fix: derive the sampling seed from a fast hash of the first element(s) of the data so that the sampled positions change with the data content. 2. PIL P-mode palette ignored (PIL.Image.Image) For palette-indexed (P-mode) PNG images, tobytes() returns only the palette index per pixel; the 768-byte RGB color table is not included. Two images with identical index arrays but different palettes produced identical cache keys despite showing visually distinct content. Fix: when mode is "P" and a palette is present, prepend the palette bytes to the pixel-index bytes before hashing. Three regression tests added to hashing_test.py cover all three collision classes (adversarial NumPy, adversarial Pandas, PIL P-mode palette). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for contributing to Streamlit! 🎈 Please make sure you have read our Contributing Guide. You can find additional information about Streamlit development in the wiki. The review process:
We're receiving many contributions and have limited review bandwidth — please expect some delay. We appreciate your patience! 🙏 |
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
There was a problem hiding this comment.
Summary
This PR hardens cache-key generation in @st.cache_data / @st.cache_resource by addressing two classes of hash collision:
-
Deterministic sampling seed: Replaces the hardcoded
random_state=0/seed=0used when sampling large data structures (Pandas, Polars, NumPy) with a seed derived from the data itself, preventing an attacker from pre-computing sampled indices and crafting collisions at non-sampled positions. -
PIL P-mode palette omission: Prepends palette bytes to pixel-index bytes when hashing palette-indexed (P-mode) PIL images, since
tobytes()alone returns only palette indices and ignores the color table.
All three reviewers (claude-4.6-opus-high-thinking, gemini-3.1-pro, gpt-5.3-codex-high) agree the fix direction is sound and the overall code quality is high. Two reviewers approved; one requested changes due to a blocking regression risk in the seed derivation path.
Code Quality
The implementation is clean, well-scoped, and consistent with the existing hashing.py patterns. Comments explain the security rationale rather than narrating the code. The same 4-line explanatory comment is duplicated across all Pandas/Polars branches — a minor DRY opportunity but not a blocker.
Blocking concern (all reviewers agree upon verification): The hash_pandas_object() calls for seed derivation are placed before the existing try/except TypeError blocks. For Series or DataFrames containing unhashable objects (e.g., list/dict values), the seed derivation will raise TypeError that bypasses the pickle fallback path — a behavioral regression from the current code where sample(random_state=0) never raises. This must be fixed before merge. Details are in the inline comments.
Test Coverage
Three well-constructed adversarial tests directly validate the fixes:
test_PIL_pmode_palette_collision_prevention— includes a precondition assert confirming the vulnerability trigger.test_pandas_large_dataframe_non_sampled_rows_differ— adversarial construction against the old fixed-seed sampler.test_numpy_large_array_non_sampled_elements_differ— same adversarial pattern for NumPy.
Existing tests cover hash stability (same data → same hash, different data → different hash), so no regression risk there. Gap: No adversarial tests for Pandas Series, Polars Series, or Polars DataFrame sampling paths (acceptable for this PR, could be added in a follow-up). A test for large objects with unhashable payloads should be added alongside the fix for the blocking issue.
Backwards Compatibility
Cache keys will change for: all large Pandas DataFrames/Series (≥50K rows), all large Polars DataFrames/Series (≥50K rows), all large NumPy arrays (≥500K elements), and all P-mode PIL images with palettes. Since st.cache_data and st.cache_resource caches are ephemeral (in-memory, cleared on restart), this means a one-time cold-cache miss on upgrade. No public API changes. No breaking changes.
Security & Risk
All three reviewers agree:
- Fixed-seed attacks are eliminated: The primary vulnerability — zero-knowledge collision via known sampling indices — is fully addressed.
- Prefix-matching attacks remain theoretically possible: Since seeds are derived from a small data prefix (1 row for Pandas/Polars, 64 elements for NumPy), an attacker who knows the target data can match the prefix, recompute the seed, and target sampled positions. This significantly raises the bar but does not fully prevent targeted collision construction. True unpredictability would require a server-side secret (breaking cache persistence across restarts), so this is a reasonable trade-off.
- No new attack surface: No new dependencies, network calls, or code execution paths are introduced.
- Polars hash stability: The Polars
.hash()and.hash_rows()methods used for seed derivation should be verified for consistency across architectures and Polars versions, as instability would cause unexpected cache misses.
External test recommendation
- Recommend external_test: No
- Triggered categories: None
- Evidence: Changes are limited to
lib/streamlit/runtime/caching/hashing.py(pure internal Python hashing logic) andlib/tests/streamlit/runtime/caching/hashing_test.py(unit tests only). No routing, auth, WebSocket, embedding, asset serving, CORS, security headers, storage, or runtime integration changes. - Suggested external_test focus areas: N/A
- Confidence: High
- Assumptions and gaps: None — changes are fully contained in the caching layer with no external-facing behavioral impact.
Accessibility
No frontend changes. Not applicable.
Recommendations
- Fix the TypeError regression (blocking): Move the seed derivation for Pandas Series and DataFrame into the existing
try/except TypeErrorblock, or wrap it in its own guard that falls back to the old fixed seed (random_state=0). Verify the same pattern for Polars branches. - Add a regression test for unhashable payloads: Cover large Pandas objects with unhashable content (e.g., object-dtype columns with lists/dicts) to ensure the pickle fallback path is preserved.
- Guard
getpalette()return value: Add aNonecheck onpil_obj.getpalette()to preventTypeErrorfrombytes(None)in edge cases. - Consider strided seed derivation (follow-up): Use a strided sample (e.g.,
series_obj.iloc[::len(series_obj)//100]orflat[::len(flat)//64]) instead of a prefix for seed derivation, distributing the seed input across the entire data and making prefix-matching attacks harder. - Soften the security comment: The inline comment overstates the guarantee ("unpredictable to an attacker"). Consider rewording to reflect the actual property: indices are data-dependent rather than globally fixed.
Verdict
CHANGES_REQUESTED: The fix direction is sound and well-tested, but the seed derivation calls placed outside the existing try/except TypeError blocks introduce a regression for data containing unhashable objects (raised by gpt-5.3-codex-high, verified and confirmed). This must be fixed before merge. Two of three reviewers approved; however, the blocking issue is a genuine correctness regression that warrants resolution.
Reviewer agreement matrix
| Topic | claude-4.6-opus | gemini-3.1-pro | gpt-5.3-codex |
|---|---|---|---|
| Fix direction sound | Agree | Agree | Agree |
| Code quality good | Agree | Agree | Agree |
| Tests well-written | Agree | Agree | Agree |
| Prefix-matching limitation | Noted | Noted | — |
| TypeError regression risk | — | — | Blocking |
| Polars hash stability | — | Noted | — |
| getpalette() None guard | Noted | — | — |
| Verdict | APPROVED | APPROVED | CHANGES_REQUESTED |
This is a consolidated automated AI review by claude-4.6-opus-high-thinking, synthesizing reviews from claude-4.6-opus-high-thinking, gemini-3.1-pro, and gpt-5.3-codex-high.
This review also includes 8 inline comment(s) on specific code lines.
| # indices are unpredictable to an attacker. A fixed seed (e.g. 0) would | ||
| # allow deterministic collision construction by targeting only the | ||
| # non-sampled rows. | ||
| sample_seed = int(hash_pandas_object(series_obj.iloc[:1]).sum()) & 0xFFFF_FFFF |
There was a problem hiding this comment.
issue: hash_pandas_object(series_obj.iloc[:1]) is called before the try/except TypeError block (line 435). If the Series contains unhashable values (e.g. lists, dicts), this raises TypeError that bypasses the existing pickle fallback, changing prior behavior from graceful degradation to a hard failure.
Please move the seed derivation inside the try block or wrap it in its own try/except TypeError that falls back to the old fixed seed (random_state=0). The same issue applies to the DataFrame branch (line 457).
|
|
||
| if len(obj) >= _PANDAS_ROWS_LARGE: | ||
| obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, seed=0) | ||
| # Derive the sampling seed from the data itself so that the sampled |
There was a problem hiding this comment.
question: Does obj[:1].hash(seed=0)[0] return consistent values across different CPU architectures and Polars versions? If Polars' internal hashing algorithm changes between releases or is platform-dependent, this would cause unexpected cache misses after upgrades. The same applies to obj[:1].hash_rows(seed=0)[0] in the DataFrame branch.
| # indices are unpredictable to an attacker. A fixed seed (e.g. 0) would | ||
| # allow deterministic collision construction by targeting only the | ||
| # non-sampled rows. | ||
| sample_seed = int(hash_pandas_object(series_obj.iloc[:1]).sum()) & 0xFFFF_FFFF |
There was a problem hiding this comment.
thought: The sampling seed is derived from only the first row (iloc[:1]). An attacker who knows the target data can match the first row, recompute the seed, predict sampled indices, and craft a collision at those positions. This still meaningfully raises the bar over a fixed seed, but consider deriving the seed from a strided sample (e.g. series_obj.iloc[::len(series_obj)//100]) to distribute seed input across the entire dataset. Could be addressed in a follow-up.
| # indices are unpredictable to an attacker. A fixed seed (e.g. 0) would | ||
| # allow deterministic collision construction by targeting only the | ||
| # non-sampled elements. | ||
| flat = np_obj.ravel() |
There was a problem hiding this comment.
thought: Same prefix-based seed concern as the Pandas path: the seed is derived from only the first 64 elements of the flattened array. Consider using a strided sample (e.g. flat[::len(flat)//64]) instead of flat[:64] to distribute the seed derivation across the entire array, making targeted collision construction harder.
| # But their palettes differ, so their hashes must differ after the fix. | ||
| assert get_hash(img_safe) != get_hash(img_malicious) | ||
|
|
||
| def test_pandas_large_dataframe_non_sampled_rows_differ(self): |
There was a problem hiding this comment.
suggestion: Add a regression test for large Pandas objects containing unhashable payloads (e.g. object-dtype columns with lists/dicts) to verify that the seed derivation does not bypass the intended pickle fallback path. This would lock in the expected behavior and catch the regression described in the inline issue on hashing.py:431.
| # identical hashes despite being visually distinct. Prepend the palette bytes | ||
| # to prevent this class of collision. | ||
| if pil_obj.mode == "P" and pil_obj.palette is not None: | ||
| palette_np = np.frombuffer(bytes(pil_obj.getpalette()), dtype="uint8") |
There was a problem hiding this comment.
suggestion: pil_obj.getpalette() can return None in edge cases even when pil_obj.palette is not None. If that happens, bytes(None) will raise TypeError. Consider adding a defensive guard:
palette_data = pil_obj.getpalette()
if palette_data is not None:
palette_np = np.frombuffer(bytes(palette_data), dtype="uint8")
index_np = np.frombuffer(pil_obj.tobytes(), dtype="uint8")
np_array = np.concatenate([palette_np, index_np])
else:
np_array = np.frombuffer(pil_obj.tobytes(), dtype="uint8")|
|
||
| if len(series_obj) >= _PANDAS_ROWS_LARGE: | ||
| series_obj = series_obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0) | ||
| # Derive the sampling seed from the data itself so that the sampled |
There was a problem hiding this comment.
nitpick: The comment says the sampled indices are "unpredictable to an attacker," but since seed derivation is deterministic and public, an attacker who controls the input can reproduce the seed by matching the prefix. Consider softening to: "not globally fixed, making collision construction dependent on the actual data content." Same comment appears at lines 453, 485, 511, and 547.
| sampled_idx = set( | ||
| np.random.RandomState(0).choice(np.arange(total), size=_NP_SAMPLE_SIZE) | ||
| ) | ||
| mask = np.array([i not in sampled_idx for i in range(total)], dtype=bool) |
There was a problem hiding this comment.
nitpick: The list comprehension [i not in sampled_idx for i in range(total)] iterates over 510K elements in Python. A vectorized approach would be faster:
mask = np.ones(total, dtype=bool)
mask[list(sampled_idx)] = False|
Hi maintainers 👋 This PR is a security bugfix for two hash collision vulnerabilities in Could someone with write access please add:
This will allow the CI label check to pass and unblock the review process. Happy to provide any additional context needed. Thank you! |
…guard, unhashable test - Extract _pandas_sample_seed() helper with its own try/except so that seed derivation for unhashable DataFrames/Series falls back to seed=0 rather than raising TypeError before the existing pickle fallback path - Add getpalette() None guard for PIL P-mode images with missing palette - Vectorize the non-sampled-elements mask in the numpy hashing test - Add test_pandas_large_dataframe_unhashable_payload_uses_pickle_fallback to prevent regression of the pickle fallback path for unhashable objects Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
140a170 to
dfd7edc
Compare
Response to AI Review (CHANGES_REQUESTED → addressed)Thank you for the detailed consolidated review. All blocking and non-blocking concerns from the previous commit have been addressed in Blocking: TypeError regression in seed derivation ✅ FixedThe Fix: Extracted def _pandas_sample_seed(first_row: Any) -> int:
try:
from pandas.util import hash_pandas_object
return int(hash_pandas_object(first_row).sum()) & 0xFFFF_FFFF
except TypeError:
return 0Verified manually: large DataFrame with list-valued column falls back to pickle and produces a stable, matching hash. Regression test for unhashable payloads ✅ Added
|
|
This pull request has had no activity for 14 days, so it has been marked as stale. If you still want to continue this work, please leave a comment or push a commit within 7 days. A maintainer can also apply the |
|
Hey @3em0 -- thanks for filing the original issue (#14622) and putting together this fix! The analysis of the fixed-seed collision vulnerability and the PIL P-mode palette issue was really thorough and clearly laid out. That said, we've got a competing PR (#14635) that addresses the same bugs with a very similar approach, and that one already has core team review engagement. To avoid duplicating effort and review cycles, we're going to close this one in favor of #14635. Your issue report was the catalyst for getting this fixed, and we appreciate the detailed reproduction cases -- they made it much easier for everyone to understand and verify the problem. Thanks again for the contribution! |
Summary
Fixes two classes of cache key collision in
@st.cache_data/@st.cache_resourceBug 1 — Fixed sampling seed (NumPy, Pandas, Polars)
The large-data sampling path used a hardcoded seed (
RandomState(0)/random_state=0/seed=0). Because the seed is constant and the source is public, an attacker can pre-compute exactly which indices will be sampled and craft an input that is identical to a legitimate input at every sampled position while differing arbitrarily elsewhere — a guaranteed, reproducible cache key collision.Fix: derive the sampling seed from a fast hash of the first element(s) of the data so that the sampled positions are unpredictable and change with the data.
Affected locations in
hashing.py:Series— line 427 (random_state=0)DataFrame— line 449 (random_state=0)Series— line 476 (seed=0)DataFrame— line 499 (seed=0)ndarray— lines 530–531 (RandomState(0))Bug 2 — PIL P-mode palette ignored
For palette-indexed (P-mode) PNG images,
tobytes()returns only the palette index per pixel; the 768-byte RGB color table is not included. Two images with identical index arrays but different palettes produced identical cache keys despite visually distinct content.Fix: when
mode == "P"and a palette is present, prepend the palette bytes to the pixel-index bytes before hashing.Affected location:
hashing.pylines 536–544.Test plan
test_PIL_pmode_palette_collision_prevention— two P-mode images with same indices but different palettes must hash differentlytest_pandas_large_dataframe_non_sampled_rows_differ— adversarially crafted DataFrame (identical at old sampled rows,999.0elsewhere) must hash differentlytest_numpy_large_array_non_sampled_elements_differ— adversarially crafted ndarray (identical at old sampled indices,255elsewhere) must hash differently🤖 Generated with Claude Code