fix(caching): derive sampling seed from data to prevent deterministic hash collisions by 3em0 · Pull Request #14610 · streamlit/streamlit

3em0 · 2026-04-02T03:10:18Z

Summary

Fixes two classes of cache key collision in @st.cache_data / @st.cache_resource

Bug 1 — Fixed sampling seed (NumPy, Pandas, Polars)

The large-data sampling path used a hardcoded seed (RandomState(0) / random_state=0 / seed=0). Because the seed is constant and the source is public, an attacker can pre-compute exactly which indices will be sampled and craft an input that is identical to a legitimate input at every sampled position while differing arbitrarily elsewhere — a guaranteed, reproducible cache key collision.

Fix: derive the sampling seed from a fast hash of the first element(s) of the data so that the sampled positions are unpredictable and change with the data.

Affected locations in hashing.py:

Pandas Series — line 427 (random_state=0)
Pandas DataFrame — line 449 (random_state=0)
Polars Series — line 476 (seed=0)
Polars DataFrame — line 499 (seed=0)
NumPy ndarray — lines 530–531 (RandomState(0))

Bug 2 — PIL P-mode palette ignored

For palette-indexed (P-mode) PNG images, tobytes() returns only the palette index per pixel; the 768-byte RGB color table is not included. Two images with identical index arrays but different palettes produced identical cache keys despite visually distinct content.

Fix: when mode == "P" and a palette is present, prepend the palette bytes to the pixel-index bytes before hashing.

Affected location: hashing.py lines 536–544.

Test plan

test_PIL_pmode_palette_collision_prevention — two P-mode images with same indices but different palettes must hash differently
test_pandas_large_dataframe_non_sampled_rows_differ — adversarially crafted DataFrame (identical at old sampled rows, 999.0 elsewhere) must hash differently
test_numpy_large_array_non_sampled_elements_differ — adversarially crafted ndarray (identical at old sampled indices, 255 elsewhere) must hash differently
Existing tests for identical objects still pass (hash stability)

🤖 Generated with Claude Code

… hash collisions Fixes two classes of cache key collision in st.cache_data / st.cache_resource: 1. Fixed-seed sampling (NumPy, Pandas, Polars) The sampling step that skips full-data hashing for large arrays/DataFrames used a hardcoded seed (RandomState(0) / random_state=0 / seed=0). Because the seed is constant and the source code is public, an attacker can pre-compute exactly which indices will be sampled and craft an adversarial input that matches a legitimate input at every sampled position while differing arbitrarily elsewhere — a guaranteed, zero-uncertainty collision. Fix: derive the sampling seed from a fast hash of the first element(s) of the data so that the sampled positions change with the data content. 2. PIL P-mode palette ignored (PIL.Image.Image) For palette-indexed (P-mode) PNG images, tobytes() returns only the palette index per pixel; the 768-byte RGB color table is not included. Two images with identical index arrays but different palettes produced identical cache keys despite showing visually distinct content. Fix: when mode is "P" and a palette is present, prepend the palette bytes to the pixel-index bytes before hashing. Three regression tests added to hashing_test.py cover all three collision classes (adversarial NumPy, adversarial Pandas, PIL P-mode palette). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-04-02T03:10:26Z

Thanks for contributing to Streamlit! 🎈

Please make sure you have read our Contributing Guide. You can find additional information about Streamlit development in the wiki.

The review process:

Initial triage: A maintainer will apply labels, approve CI to run, and trigger AI-assisted reviews. Your PR may be flagged with status:needs-product-approval if the feature requires product team sign-off.
Code review: A core maintainer will start reviewing your PR once:
- It is marked as 'ready for review', not 'draft'
- It has status:product-approved (or doesn't need it)
- All CI checks pass
- All AI review comments are addressed

We're receiving many contributions and have limited review bandwidth — please expect some delay. We appreciate your patience! 🙏

snyk-io · 2026-04-02T03:10:27Z

✅ Snyk checks have passed. No issues have been found so far.

Status	Scan Engine	Critical	High	Medium	Low	Total (0)
✅	Open Source Security	0	0	0	0	0 issues
✅	Licenses	0	0	0	0	0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

github-actions

Summary

This PR hardens cache-key generation in @st.cache_data / @st.cache_resource by addressing two classes of hash collision:

Deterministic sampling seed: Replaces the hardcoded random_state=0 / seed=0 used when sampling large data structures (Pandas, Polars, NumPy) with a seed derived from the data itself, preventing an attacker from pre-computing sampled indices and crafting collisions at non-sampled positions.
PIL P-mode palette omission: Prepends palette bytes to pixel-index bytes when hashing palette-indexed (P-mode) PIL images, since tobytes() alone returns only palette indices and ignores the color table.

All three reviewers (claude-4.6-opus-high-thinking, gemini-3.1-pro, gpt-5.3-codex-high) agree the fix direction is sound and the overall code quality is high. Two reviewers approved; one requested changes due to a blocking regression risk in the seed derivation path.

Code Quality

The implementation is clean, well-scoped, and consistent with the existing hashing.py patterns. Comments explain the security rationale rather than narrating the code. The same 4-line explanatory comment is duplicated across all Pandas/Polars branches — a minor DRY opportunity but not a blocker.

Blocking concern (all reviewers agree upon verification): The hash_pandas_object() calls for seed derivation are placed before the existing try/except TypeError blocks. For Series or DataFrames containing unhashable objects (e.g., list/dict values), the seed derivation will raise TypeError that bypasses the pickle fallback path — a behavioral regression from the current code where sample(random_state=0) never raises. This must be fixed before merge. Details are in the inline comments.

Test Coverage

Three well-constructed adversarial tests directly validate the fixes:

test_PIL_pmode_palette_collision_prevention — includes a precondition assert confirming the vulnerability trigger.
test_pandas_large_dataframe_non_sampled_rows_differ — adversarial construction against the old fixed-seed sampler.
test_numpy_large_array_non_sampled_elements_differ — same adversarial pattern for NumPy.

Existing tests cover hash stability (same data → same hash, different data → different hash), so no regression risk there. Gap: No adversarial tests for Pandas Series, Polars Series, or Polars DataFrame sampling paths (acceptable for this PR, could be added in a follow-up). A test for large objects with unhashable payloads should be added alongside the fix for the blocking issue.

Backwards Compatibility

Cache keys will change for: all large Pandas DataFrames/Series (≥50K rows), all large Polars DataFrames/Series (≥50K rows), all large NumPy arrays (≥500K elements), and all P-mode PIL images with palettes. Since st.cache_data and st.cache_resource caches are ephemeral (in-memory, cleared on restart), this means a one-time cold-cache miss on upgrade. No public API changes. No breaking changes.

Security & Risk

All three reviewers agree:

Fixed-seed attacks are eliminated: The primary vulnerability — zero-knowledge collision via known sampling indices — is fully addressed.
Prefix-matching attacks remain theoretically possible: Since seeds are derived from a small data prefix (1 row for Pandas/Polars, 64 elements for NumPy), an attacker who knows the target data can match the prefix, recompute the seed, and target sampled positions. This significantly raises the bar but does not fully prevent targeted collision construction. True unpredictability would require a server-side secret (breaking cache persistence across restarts), so this is a reasonable trade-off.
No new attack surface: No new dependencies, network calls, or code execution paths are introduced.
Polars hash stability: The Polars .hash() and .hash_rows() methods used for seed derivation should be verified for consistency across architectures and Polars versions, as instability would cause unexpected cache misses.

External test recommendation

Recommend external_test: No
Triggered categories: None
Evidence: Changes are limited to lib/streamlit/runtime/caching/hashing.py (pure internal Python hashing logic) and lib/tests/streamlit/runtime/caching/hashing_test.py (unit tests only). No routing, auth, WebSocket, embedding, asset serving, CORS, security headers, storage, or runtime integration changes.
Suggested external_test focus areas: N/A
Confidence: High
Assumptions and gaps: None — changes are fully contained in the caching layer with no external-facing behavioral impact.

Accessibility

No frontend changes. Not applicable.

Recommendations

Fix the TypeError regression (blocking): Move the seed derivation for Pandas Series and DataFrame into the existing try/except TypeError block, or wrap it in its own guard that falls back to the old fixed seed (random_state=0). Verify the same pattern for Polars branches.
Add a regression test for unhashable payloads: Cover large Pandas objects with unhashable content (e.g., object-dtype columns with lists/dicts) to ensure the pickle fallback path is preserved.
Guard getpalette() return value: Add a None check on pil_obj.getpalette() to prevent TypeError from bytes(None) in edge cases.
Consider strided seed derivation (follow-up): Use a strided sample (e.g., series_obj.iloc[::len(series_obj)//100] or flat[::len(flat)//64]) instead of a prefix for seed derivation, distributing the seed input across the entire data and making prefix-matching attacks harder.
Soften the security comment: The inline comment overstates the guarantee ("unpredictable to an attacker"). Consider rewording to reflect the actual property: indices are data-dependent rather than globally fixed.

Verdict

CHANGES_REQUESTED: The fix direction is sound and well-tested, but the seed derivation calls placed outside the existing try/except TypeError blocks introduce a regression for data containing unhashable objects (raised by gpt-5.3-codex-high, verified and confirmed). This must be fixed before merge. Two of three reviewers approved; however, the blocking issue is a genuine correctness regression that warrants resolution.

Reviewer agreement matrix

Topic	claude-4.6-opus	gemini-3.1-pro	gpt-5.3-codex
Fix direction sound	Agree	Agree	Agree
Code quality good	Agree	Agree	Agree
Tests well-written	Agree	Agree	Agree
Prefix-matching limitation	Noted	Noted	—
TypeError regression risk	—	—	Blocking
Polars hash stability	—	Noted	—
getpalette() None guard	Noted	—	—
Verdict	APPROVED	APPROVED	CHANGES_REQUESTED

This is a consolidated automated AI review by claude-4.6-opus-high-thinking, synthesizing reviews from claude-4.6-opus-high-thinking, gemini-3.1-pro, and gpt-5.3-codex-high.

This review also includes 8 inline comment(s) on specific code lines.

github-actions · 2026-04-02T16:08:17Z

+                # indices are unpredictable to an attacker. A fixed seed (e.g. 0) would
+                # allow deterministic collision construction by targeting only the
+                # non-sampled rows.
+                sample_seed = int(hash_pandas_object(series_obj.iloc[:1]).sum()) & 0xFFFF_FFFF


issue: hash_pandas_object(series_obj.iloc[:1]) is called before the try/except TypeError block (line 435). If the Series contains unhashable values (e.g. lists, dicts), this raises TypeError that bypasses the existing pickle fallback, changing prior behavior from graceful degradation to a hard failure.

Please move the seed derivation inside the try block or wrap it in its own try/except TypeError that falls back to the old fixed seed (random_state=0). The same issue applies to the DataFrame branch (line 457).

github-actions · 2026-04-02T16:08:18Z


            if len(obj) >= _PANDAS_ROWS_LARGE:
-                obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, seed=0)
+                # Derive the sampling seed from the data itself so that the sampled


question: Does obj[:1].hash(seed=0)[0] return consistent values across different CPU architectures and Polars versions? If Polars' internal hashing algorithm changes between releases or is platform-dependent, this would cause unexpected cache misses after upgrades. The same applies to obj[:1].hash_rows(seed=0)[0] in the DataFrame branch.

github-actions · 2026-04-02T16:08:18Z

+                # indices are unpredictable to an attacker. A fixed seed (e.g. 0) would
+                # allow deterministic collision construction by targeting only the
+                # non-sampled rows.
+                sample_seed = int(hash_pandas_object(series_obj.iloc[:1]).sum()) & 0xFFFF_FFFF


thought: The sampling seed is derived from only the first row (iloc[:1]). An attacker who knows the target data can match the first row, recompute the seed, predict sampled indices, and craft a collision at those positions. This still meaningfully raises the bar over a fixed seed, but consider deriving the seed from a strided sample (e.g. series_obj.iloc[::len(series_obj)//100]) to distribute seed input across the entire dataset. Could be addressed in a follow-up.

github-actions · 2026-04-02T16:08:18Z

+                # indices are unpredictable to an attacker. A fixed seed (e.g. 0) would
+                # allow deterministic collision construction by targeting only the
+                # non-sampled elements.
+                flat = np_obj.ravel()


thought: Same prefix-based seed concern as the Pandas path: the seed is derived from only the first 64 elements of the flattened array. Consider using a strided sample (e.g. flat[::len(flat)//64]) instead of flat[:64] to distribute the seed derivation across the entire array, making targeted collision construction harder.

github-actions · 2026-04-02T16:08:18Z

+        # But their palettes differ, so their hashes must differ after the fix.
+        assert get_hash(img_safe) != get_hash(img_malicious)
+
+    def test_pandas_large_dataframe_non_sampled_rows_differ(self):


suggestion: Add a regression test for large Pandas objects containing unhashable payloads (e.g. object-dtype columns with lists/dicts) to verify that the seed derivation does not bypass the intended pickle fallback path. This would lock in the expected behavior and catch the regression described in the inline issue on hashing.py:431.

github-actions · 2026-04-02T16:08:18Z

+            # identical hashes despite being visually distinct. Prepend the palette bytes
+            # to prevent this class of collision.
+            if pil_obj.mode == "P" and pil_obj.palette is not None:
+                palette_np = np.frombuffer(bytes(pil_obj.getpalette()), dtype="uint8")


suggestion: pil_obj.getpalette() can return None in edge cases even when pil_obj.palette is not None. If that happens, bytes(None) will raise TypeError. Consider adding a defensive guard:

palette_data = pil_obj.getpalette() if palette_data is not None: palette_np = np.frombuffer(bytes(palette_data), dtype="uint8") index_np = np.frombuffer(pil_obj.tobytes(), dtype="uint8") np_array = np.concatenate([palette_np, index_np]) else: np_array = np.frombuffer(pil_obj.tobytes(), dtype="uint8")

github-actions · 2026-04-02T16:08:18Z


            if len(series_obj) >= _PANDAS_ROWS_LARGE:
-                series_obj = series_obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
+                # Derive the sampling seed from the data itself so that the sampled


nitpick: The comment says the sampled indices are "unpredictable to an attacker," but since seed derivation is deterministic and public, an attacker who controls the input can reproduce the seed by matching the prefix. Consider softening to: "not globally fixed, making collision construction dependent on the actual data content." Same comment appears at lines 453, 485, 511, and 547.

github-actions · 2026-04-02T16:08:18Z

+        sampled_idx = set(
+            np.random.RandomState(0).choice(np.arange(total), size=_NP_SAMPLE_SIZE)
+        )
+        mask = np.array([i not in sampled_idx for i in range(total)], dtype=bool)


nitpick: The list comprehension [i not in sampled_idx for i in range(total)] iterates over 510K elements in Python. A vectorized approach would be faster:

mask = np.ones(total, dtype=bool) mask[list(sampled_idx)] = False

3em0 · 2026-04-03T05:38:07Z

Hi maintainers 👋

This PR is a security bugfix for two hash collision vulnerabilities in @st.cache_data / @st.cache_resource (CWE-345 / CWE-706). The do-not-merge label was auto-applied because the required triage labels are missing.

Could someone with write access please add:

change:bugfix
impact:users

This will allow the CI label check to pass and unblock the review process. Happy to provide any additional context needed. Thank you!

…guard, unhashable test - Extract _pandas_sample_seed() helper with its own try/except so that seed derivation for unhashable DataFrames/Series falls back to seed=0 rather than raising TypeError before the existing pickle fallback path - Add getpalette() None guard for PIL P-mode images with missing palette - Vectorize the non-sampled-elements mask in the numpy hashing test - Add test_pandas_large_dataframe_unhashable_payload_uses_pickle_fallback to prevent regression of the pickle fallback path for unhashable objects Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

3em0 · 2026-04-03T05:45:18Z

Response to AI Review (CHANGES_REQUESTED → addressed)

Thank you for the detailed consolidated review. All blocking and non-blocking concerns from the previous commit have been addressed in dfd7edcbb.

Blocking: TypeError regression in seed derivation ✅ Fixed

The hash_pandas_object() call was placed before the try/except TypeError block, which would have caused unhashable DataFrame/Series objects (e.g., columns containing lists/dicts) to raise instead of falling through to the pickle fallback.

Fix: Extracted _pandas_sample_seed() as a module-level helper with its own try/except TypeError. When the first-row hash raises, the helper returns 0 (equivalent to the previous random_state=0), so the downstream try/except TypeError block sees exactly the same behavior as before for unhashable payloads:

def _pandas_sample_seed(first_row: Any) -> int:
    try:
        from pandas.util import hash_pandas_object
        return int(hash_pandas_object(first_row).sum()) & 0xFFFF_FFFF
    except TypeError:
        return 0

Verified manually: large DataFrame with list-valued column falls back to pickle and produces a stable, matching hash.

Regression test for unhashable payloads ✅ Added

test_pandas_large_dataframe_unhashable_payload_uses_pickle_fallback asserts that hashing two identical large DataFrames with unhashable content (list-valued cells) returns the same hash — proving the pickle fallback path is preserved.

`getpalette()` None guard ✅ Added

Added a if palette_data is not None: check after pil_obj.getpalette() to prevent TypeError: bytes(None) in edge cases where a P-mode image has a palette attribute but getpalette() returns None.

Security comment wording ✅ Softened

Replaced "unpredictable to an attacker" with "not globally fixed, making collision construction dependent on the actual data content" across all Pandas, Polars, and NumPy branches. This accurately reflects the property without overclaiming.

Vectorized mask in NumPy test ✅ Applied

Replaced the Python list comprehension [i not in sampled_idx for i in range(total)] with a vectorized mask = np.ones(total, dtype=bool); mask[sampled_idx] = False.

Note on prefix-matching limitation: The reviewers correctly note that seeds derived from a data prefix still allow targeted collision if the attacker can match the prefix. This is a known trade-off acknowledged in the fix scope — true unpredictability would require a server-side secret (breaking cache persistence). The prefix approach significantly raises the bar for opportunistic attacks, which is the primary threat model here.

Note on do-not-merge label: This label was added automatically by require-labels.yml because change:bugfix and impact:users labels are absent. As an external contributor without label-write access, I'd appreciate if a maintainer could add these so the workflow requirement is satisfied.

…-palette

github-actions · 2026-04-22T07:07:47Z

This pull request has had no activity for 14 days, so it has been marked as stale. If you still want to continue this work, please leave a comment or push a commit within 7 days. A maintainer can also apply the never-stale label to opt out.

kmcgrady · 2026-04-22T22:54:28Z

Hey @3em0 -- thanks for filing the original issue (#14622) and putting together this fix! The analysis of the fixed-seed collision vulnerability and the PIL P-mode palette issue was really thorough and clearly laid out.

That said, we've got a competing PR (#14635) that addresses the same bugs with a very similar approach, and that one already has core team review engagement. To avoid duplicating effort and review cycles, we're going to close this one in favor of #14635.

Your issue report was the catalyst for getting this fixed, and we appreciate the detailed reproduction cases -- they made it much easier for everyone to understand and verify the problem. Thanks again for the contribution!

github-actions Bot reviewed Apr 2, 2026

View reviewed changes

github-actions Bot added the do-not-merge PR is blocked from merging label Apr 2, 2026

3em0 force-pushed the fix/cache-hash-deterministic-seed-and-pil-palette branch from 140a170 to dfd7edc Compare April 3, 2026 05:44

Merge branch 'develop' into fix/cache-hash-deterministic-seed-and-pil…

4b266a7

…-palette

3em0 mentioned this pull request Apr 3, 2026

bug: @st.cache_data hash collision via fixed sampling seed and PIL P-mode palette omission #14622

Open

Merge branch 'develop' into fix/cache-hash-deterministic-seed-and-pil…

cfb622e

…-palette

github-actions Bot added the stale label Apr 22, 2026

kmcgrady closed this Apr 22, 2026

Conversation

3em0 commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bug 1 — Fixed sampling seed (NumPy, Pandas, Polars)

Bug 2 — PIL P-mode palette ignored

Test plan

Uh oh!

github-actions Bot commented Apr 2, 2026

Uh oh!

snyk-io Bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Snyk checks have passed. No issues have been found so far.

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Summary

Code Quality

Test Coverage

Backwards Compatibility

Security & Risk

External test recommendation

Accessibility

Recommendations

Verdict

Uh oh!

github-actions Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

3em0 commented Apr 3, 2026

Uh oh!

3em0 commented Apr 3, 2026

Response to AI Review (CHANGES_REQUESTED → addressed)

Blocking: TypeError regression in seed derivation ✅ Fixed

Regression test for unhashable payloads ✅ Added

getpalette() None guard ✅ Added

Security comment wording ✅ Softened

Vectorized mask in NumPy test ✅ Applied

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

kmcgrady commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

3em0 commented Apr 2, 2026 •

edited

Loading

snyk-io Bot commented Apr 2, 2026 •

edited

Loading

`getpalette()` None guard ✅ Added