Skip to content

fix(caching): derive sampling seed from data to prevent deterministic hash collisions#14610

Closed
3em0 wants to merge 4 commits intostreamlit:developfrom
3em0:fix/cache-hash-deterministic-seed-and-pil-palette
Closed

fix(caching): derive sampling seed from data to prevent deterministic hash collisions#14610
3em0 wants to merge 4 commits intostreamlit:developfrom
3em0:fix/cache-hash-deterministic-seed-and-pil-palette

Conversation

@3em0
Copy link
Copy Markdown

@3em0 3em0 commented Apr 2, 2026

Summary

Fixes two classes of cache key collision in @st.cache_data / @st.cache_resource

Bug 1 — Fixed sampling seed (NumPy, Pandas, Polars)

The large-data sampling path used a hardcoded seed (RandomState(0) / random_state=0 / seed=0). Because the seed is constant and the source is public, an attacker can pre-compute exactly which indices will be sampled and craft an input that is identical to a legitimate input at every sampled position while differing arbitrarily elsewhere — a guaranteed, reproducible cache key collision.

Fix: derive the sampling seed from a fast hash of the first element(s) of the data so that the sampled positions are unpredictable and change with the data.

Affected locations in hashing.py:

  • Pandas Series — line 427 (random_state=0)
  • Pandas DataFrame — line 449 (random_state=0)
  • Polars Series — line 476 (seed=0)
  • Polars DataFrame — line 499 (seed=0)
  • NumPy ndarray — lines 530–531 (RandomState(0))

Bug 2 — PIL P-mode palette ignored

For palette-indexed (P-mode) PNG images, tobytes() returns only the palette index per pixel; the 768-byte RGB color table is not included. Two images with identical index arrays but different palettes produced identical cache keys despite visually distinct content.

Fix: when mode == "P" and a palette is present, prepend the palette bytes to the pixel-index bytes before hashing.

Affected location: hashing.py lines 536–544.

Test plan

  • test_PIL_pmode_palette_collision_prevention — two P-mode images with same indices but different palettes must hash differently
  • test_pandas_large_dataframe_non_sampled_rows_differ — adversarially crafted DataFrame (identical at old sampled rows, 999.0 elsewhere) must hash differently
  • test_numpy_large_array_non_sampled_elements_differ — adversarially crafted ndarray (identical at old sampled indices, 255 elsewhere) must hash differently
  • Existing tests for identical objects still pass (hash stability)

🤖 Generated with Claude Code

… hash collisions

Fixes two classes of cache key collision in st.cache_data / st.cache_resource:

1. Fixed-seed sampling (NumPy, Pandas, Polars)
   The sampling step that skips full-data hashing for large arrays/DataFrames
   used a hardcoded seed (RandomState(0) / random_state=0 / seed=0). Because
   the seed is constant and the source code is public, an attacker can
   pre-compute exactly which indices will be sampled and craft an adversarial
   input that matches a legitimate input at every sampled position while
   differing arbitrarily elsewhere — a guaranteed, zero-uncertainty collision.

   Fix: derive the sampling seed from a fast hash of the first element(s) of
   the data so that the sampled positions change with the data content.

2. PIL P-mode palette ignored (PIL.Image.Image)
   For palette-indexed (P-mode) PNG images, tobytes() returns only the palette
   index per pixel; the 768-byte RGB color table is not included. Two images
   with identical index arrays but different palettes produced identical cache
   keys despite showing visually distinct content.

   Fix: when mode is "P" and a palette is present, prepend the palette bytes
   to the pixel-index bytes before hashing.

Three regression tests added to hashing_test.py cover all three collision
classes (adversarial NumPy, adversarial Pandas, PIL P-mode palette).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

Thanks for contributing to Streamlit! 🎈

Please make sure you have read our Contributing Guide. You can find additional information about Streamlit development in the wiki.

The review process:

  1. Initial triage: A maintainer will apply labels, approve CI to run, and trigger AI-assisted reviews. Your PR may be flagged with status:needs-product-approval if the feature requires product team sign-off.

  2. Code review: A core maintainer will start reviewing your PR once:

    • It is marked as 'ready for review', not 'draft'
    • It has status:product-approved (or doesn't need it)
    • All CI checks pass
    • All AI review comments are addressed

We're receiving many contributions and have limited review bandwidth — please expect some delay. We appreciate your patience! 🙏

@snyk-io
Copy link
Copy Markdown
Contributor

snyk-io Bot commented Apr 2, 2026

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR hardens cache-key generation in @st.cache_data / @st.cache_resource by addressing two classes of hash collision:

  1. Deterministic sampling seed: Replaces the hardcoded random_state=0 / seed=0 used when sampling large data structures (Pandas, Polars, NumPy) with a seed derived from the data itself, preventing an attacker from pre-computing sampled indices and crafting collisions at non-sampled positions.

  2. PIL P-mode palette omission: Prepends palette bytes to pixel-index bytes when hashing palette-indexed (P-mode) PIL images, since tobytes() alone returns only palette indices and ignores the color table.

All three reviewers (claude-4.6-opus-high-thinking, gemini-3.1-pro, gpt-5.3-codex-high) agree the fix direction is sound and the overall code quality is high. Two reviewers approved; one requested changes due to a blocking regression risk in the seed derivation path.

Code Quality

The implementation is clean, well-scoped, and consistent with the existing hashing.py patterns. Comments explain the security rationale rather than narrating the code. The same 4-line explanatory comment is duplicated across all Pandas/Polars branches — a minor DRY opportunity but not a blocker.

Blocking concern (all reviewers agree upon verification): The hash_pandas_object() calls for seed derivation are placed before the existing try/except TypeError blocks. For Series or DataFrames containing unhashable objects (e.g., list/dict values), the seed derivation will raise TypeError that bypasses the pickle fallback path — a behavioral regression from the current code where sample(random_state=0) never raises. This must be fixed before merge. Details are in the inline comments.

Test Coverage

Three well-constructed adversarial tests directly validate the fixes:

  • test_PIL_pmode_palette_collision_prevention — includes a precondition assert confirming the vulnerability trigger.
  • test_pandas_large_dataframe_non_sampled_rows_differ — adversarial construction against the old fixed-seed sampler.
  • test_numpy_large_array_non_sampled_elements_differ — same adversarial pattern for NumPy.

Existing tests cover hash stability (same data → same hash, different data → different hash), so no regression risk there. Gap: No adversarial tests for Pandas Series, Polars Series, or Polars DataFrame sampling paths (acceptable for this PR, could be added in a follow-up). A test for large objects with unhashable payloads should be added alongside the fix for the blocking issue.

Backwards Compatibility

Cache keys will change for: all large Pandas DataFrames/Series (≥50K rows), all large Polars DataFrames/Series (≥50K rows), all large NumPy arrays (≥500K elements), and all P-mode PIL images with palettes. Since st.cache_data and st.cache_resource caches are ephemeral (in-memory, cleared on restart), this means a one-time cold-cache miss on upgrade. No public API changes. No breaking changes.

Security & Risk

All three reviewers agree:

  • Fixed-seed attacks are eliminated: The primary vulnerability — zero-knowledge collision via known sampling indices — is fully addressed.
  • Prefix-matching attacks remain theoretically possible: Since seeds are derived from a small data prefix (1 row for Pandas/Polars, 64 elements for NumPy), an attacker who knows the target data can match the prefix, recompute the seed, and target sampled positions. This significantly raises the bar but does not fully prevent targeted collision construction. True unpredictability would require a server-side secret (breaking cache persistence across restarts), so this is a reasonable trade-off.
  • No new attack surface: No new dependencies, network calls, or code execution paths are introduced.
  • Polars hash stability: The Polars .hash() and .hash_rows() methods used for seed derivation should be verified for consistency across architectures and Polars versions, as instability would cause unexpected cache misses.

External test recommendation

  • Recommend external_test: No
  • Triggered categories: None
  • Evidence: Changes are limited to lib/streamlit/runtime/caching/hashing.py (pure internal Python hashing logic) and lib/tests/streamlit/runtime/caching/hashing_test.py (unit tests only). No routing, auth, WebSocket, embedding, asset serving, CORS, security headers, storage, or runtime integration changes.
  • Suggested external_test focus areas: N/A
  • Confidence: High
  • Assumptions and gaps: None — changes are fully contained in the caching layer with no external-facing behavioral impact.

Accessibility

No frontend changes. Not applicable.

Recommendations

  1. Fix the TypeError regression (blocking): Move the seed derivation for Pandas Series and DataFrame into the existing try/except TypeError block, or wrap it in its own guard that falls back to the old fixed seed (random_state=0). Verify the same pattern for Polars branches.
  2. Add a regression test for unhashable payloads: Cover large Pandas objects with unhashable content (e.g., object-dtype columns with lists/dicts) to ensure the pickle fallback path is preserved.
  3. Guard getpalette() return value: Add a None check on pil_obj.getpalette() to prevent TypeError from bytes(None) in edge cases.
  4. Consider strided seed derivation (follow-up): Use a strided sample (e.g., series_obj.iloc[::len(series_obj)//100] or flat[::len(flat)//64]) instead of a prefix for seed derivation, distributing the seed input across the entire data and making prefix-matching attacks harder.
  5. Soften the security comment: The inline comment overstates the guarantee ("unpredictable to an attacker"). Consider rewording to reflect the actual property: indices are data-dependent rather than globally fixed.

Verdict

CHANGES_REQUESTED: The fix direction is sound and well-tested, but the seed derivation calls placed outside the existing try/except TypeError blocks introduce a regression for data containing unhashable objects (raised by gpt-5.3-codex-high, verified and confirmed). This must be fixed before merge. Two of three reviewers approved; however, the blocking issue is a genuine correctness regression that warrants resolution.

Reviewer agreement matrix
Topic claude-4.6-opus gemini-3.1-pro gpt-5.3-codex
Fix direction sound Agree Agree Agree
Code quality good Agree Agree Agree
Tests well-written Agree Agree Agree
Prefix-matching limitation Noted Noted
TypeError regression risk Blocking
Polars hash stability Noted
getpalette() None guard Noted
Verdict APPROVED APPROVED CHANGES_REQUESTED

This is a consolidated automated AI review by claude-4.6-opus-high-thinking, synthesizing reviews from claude-4.6-opus-high-thinking, gemini-3.1-pro, and gpt-5.3-codex-high.

This review also includes 8 inline comment(s) on specific code lines.

# indices are unpredictable to an attacker. A fixed seed (e.g. 0) would
# allow deterministic collision construction by targeting only the
# non-sampled rows.
sample_seed = int(hash_pandas_object(series_obj.iloc[:1]).sum()) & 0xFFFF_FFFF
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: hash_pandas_object(series_obj.iloc[:1]) is called before the try/except TypeError block (line 435). If the Series contains unhashable values (e.g. lists, dicts), this raises TypeError that bypasses the existing pickle fallback, changing prior behavior from graceful degradation to a hard failure.

Please move the seed derivation inside the try block or wrap it in its own try/except TypeError that falls back to the old fixed seed (random_state=0). The same issue applies to the DataFrame branch (line 457).


if len(obj) >= _PANDAS_ROWS_LARGE:
obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, seed=0)
# Derive the sampling seed from the data itself so that the sampled
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Does obj[:1].hash(seed=0)[0] return consistent values across different CPU architectures and Polars versions? If Polars' internal hashing algorithm changes between releases or is platform-dependent, this would cause unexpected cache misses after upgrades. The same applies to obj[:1].hash_rows(seed=0)[0] in the DataFrame branch.

# indices are unpredictable to an attacker. A fixed seed (e.g. 0) would
# allow deterministic collision construction by targeting only the
# non-sampled rows.
sample_seed = int(hash_pandas_object(series_obj.iloc[:1]).sum()) & 0xFFFF_FFFF
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: The sampling seed is derived from only the first row (iloc[:1]). An attacker who knows the target data can match the first row, recompute the seed, predict sampled indices, and craft a collision at those positions. This still meaningfully raises the bar over a fixed seed, but consider deriving the seed from a strided sample (e.g. series_obj.iloc[::len(series_obj)//100]) to distribute seed input across the entire dataset. Could be addressed in a follow-up.

# indices are unpredictable to an attacker. A fixed seed (e.g. 0) would
# allow deterministic collision construction by targeting only the
# non-sampled elements.
flat = np_obj.ravel()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: Same prefix-based seed concern as the Pandas path: the seed is derived from only the first 64 elements of the flattened array. Consider using a strided sample (e.g. flat[::len(flat)//64]) instead of flat[:64] to distribute the seed derivation across the entire array, making targeted collision construction harder.

# But their palettes differ, so their hashes must differ after the fix.
assert get_hash(img_safe) != get_hash(img_malicious)

def test_pandas_large_dataframe_non_sampled_rows_differ(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Add a regression test for large Pandas objects containing unhashable payloads (e.g. object-dtype columns with lists/dicts) to verify that the seed derivation does not bypass the intended pickle fallback path. This would lock in the expected behavior and catch the regression described in the inline issue on hashing.py:431.

# identical hashes despite being visually distinct. Prepend the palette bytes
# to prevent this class of collision.
if pil_obj.mode == "P" and pil_obj.palette is not None:
palette_np = np.frombuffer(bytes(pil_obj.getpalette()), dtype="uint8")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: pil_obj.getpalette() can return None in edge cases even when pil_obj.palette is not None. If that happens, bytes(None) will raise TypeError. Consider adding a defensive guard:

palette_data = pil_obj.getpalette()
if palette_data is not None:
    palette_np = np.frombuffer(bytes(palette_data), dtype="uint8")
    index_np = np.frombuffer(pil_obj.tobytes(), dtype="uint8")
    np_array = np.concatenate([palette_np, index_np])
else:
    np_array = np.frombuffer(pil_obj.tobytes(), dtype="uint8")


if len(series_obj) >= _PANDAS_ROWS_LARGE:
series_obj = series_obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
# Derive the sampling seed from the data itself so that the sampled
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: The comment says the sampled indices are "unpredictable to an attacker," but since seed derivation is deterministic and public, an attacker who controls the input can reproduce the seed by matching the prefix. Consider softening to: "not globally fixed, making collision construction dependent on the actual data content." Same comment appears at lines 453, 485, 511, and 547.

sampled_idx = set(
np.random.RandomState(0).choice(np.arange(total), size=_NP_SAMPLE_SIZE)
)
mask = np.array([i not in sampled_idx for i in range(total)], dtype=bool)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: The list comprehension [i not in sampled_idx for i in range(total)] iterates over 510K elements in Python. A vectorized approach would be faster:

mask = np.ones(total, dtype=bool)
mask[list(sampled_idx)] = False

@github-actions github-actions Bot added the do-not-merge PR is blocked from merging label Apr 2, 2026
@3em0
Copy link
Copy Markdown
Author

3em0 commented Apr 3, 2026

Hi maintainers 👋

This PR is a security bugfix for two hash collision vulnerabilities in @st.cache_data / @st.cache_resource (CWE-345 / CWE-706). The do-not-merge label was auto-applied because the required triage labels are missing.

Could someone with write access please add:

  • change:bugfix
  • impact:users

This will allow the CI label check to pass and unblock the review process. Happy to provide any additional context needed. Thank you!

…guard, unhashable test

- Extract _pandas_sample_seed() helper with its own try/except so that
  seed derivation for unhashable DataFrames/Series falls back to seed=0
  rather than raising TypeError before the existing pickle fallback path
- Add getpalette() None guard for PIL P-mode images with missing palette
- Vectorize the non-sampled-elements mask in the numpy hashing test
- Add test_pandas_large_dataframe_unhashable_payload_uses_pickle_fallback
  to prevent regression of the pickle fallback path for unhashable objects

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@3em0 3em0 force-pushed the fix/cache-hash-deterministic-seed-and-pil-palette branch from 140a170 to dfd7edc Compare April 3, 2026 05:44
@3em0
Copy link
Copy Markdown
Author

3em0 commented Apr 3, 2026

Response to AI Review (CHANGES_REQUESTED → addressed)

Thank you for the detailed consolidated review. All blocking and non-blocking concerns from the previous commit have been addressed in dfd7edcbb.

Blocking: TypeError regression in seed derivation ✅ Fixed

The hash_pandas_object() call was placed before the try/except TypeError block, which would have caused unhashable DataFrame/Series objects (e.g., columns containing lists/dicts) to raise instead of falling through to the pickle fallback.

Fix: Extracted _pandas_sample_seed() as a module-level helper with its own try/except TypeError. When the first-row hash raises, the helper returns 0 (equivalent to the previous random_state=0), so the downstream try/except TypeError block sees exactly the same behavior as before for unhashable payloads:

def _pandas_sample_seed(first_row: Any) -> int:
    try:
        from pandas.util import hash_pandas_object
        return int(hash_pandas_object(first_row).sum()) & 0xFFFF_FFFF
    except TypeError:
        return 0

Verified manually: large DataFrame with list-valued column falls back to pickle and produces a stable, matching hash.

Regression test for unhashable payloads ✅ Added

test_pandas_large_dataframe_unhashable_payload_uses_pickle_fallback asserts that hashing two identical large DataFrames with unhashable content (list-valued cells) returns the same hash — proving the pickle fallback path is preserved.

getpalette() None guard ✅ Added

Added a if palette_data is not None: check after pil_obj.getpalette() to prevent TypeError: bytes(None) in edge cases where a P-mode image has a palette attribute but getpalette() returns None.

Security comment wording ✅ Softened

Replaced "unpredictable to an attacker" with "not globally fixed, making collision construction dependent on the actual data content" across all Pandas, Polars, and NumPy branches. This accurately reflects the property without overclaiming.

Vectorized mask in NumPy test ✅ Applied

Replaced the Python list comprehension [i not in sampled_idx for i in range(total)] with a vectorized mask = np.ones(total, dtype=bool); mask[sampled_idx] = False.


Note on prefix-matching limitation: The reviewers correctly note that seeds derived from a data prefix still allow targeted collision if the attacker can match the prefix. This is a known trade-off acknowledged in the fix scope — true unpredictability would require a server-side secret (breaking cache persistence). The prefix approach significantly raises the bar for opportunistic attacks, which is the primary threat model here.

Note on do-not-merge label: This label was added automatically by require-labels.yml because change:bugfix and impact:users labels are absent. As an external contributor without label-write access, I'd appreciate if a maintainer could add these so the workflow requirement is satisfied.

@sfc-gh-nbellante sfc-gh-nbellante added feature:cache Related to `st.cache_data` and `st.cache_resource` change:bugfix PR contains bug fix implementation area:backend Related to Python backend impact:users PR changes affect end users feature:cache-hash-func Related to cache hashing functions area:security Related to security concerns and removed area:security Related to security concerns labels Apr 3, 2026
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has had no activity for 14 days, so it has been marked as stale. If you still want to continue this work, please leave a comment or push a commit within 7 days. A maintainer can also apply the never-stale label to opt out.

@github-actions github-actions Bot added the stale label Apr 22, 2026
@kmcgrady
Copy link
Copy Markdown
Collaborator

Hey @3em0 -- thanks for filing the original issue (#14622) and putting together this fix! The analysis of the fixed-seed collision vulnerability and the PIL P-mode palette issue was really thorough and clearly laid out.

That said, we've got a competing PR (#14635) that addresses the same bugs with a very similar approach, and that one already has core team review engagement. To avoid duplicating effort and review cycles, we're going to close this one in favor of #14635.

Your issue report was the catalyst for getting this fixed, and we appreciate the detailed reproduction cases -- they made it much easier for everyone to understand and verify the problem. Thanks again for the contribution!

@kmcgrady kmcgrady closed this Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:backend Related to Python backend change:bugfix PR contains bug fix implementation do-not-merge PR is blocked from merging feature:cache Related to `st.cache_data` and `st.cache_resource` feature:cache-hash-func Related to cache hashing functions impact:users PR changes affect end users stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants