Conversation
mythical-fred
left a comment
There was a problem hiding this comment.
Draft: two design questions, see inline.
datagenoriginal is v0.281.0 and roaring branch is the current platform runtime on 92e4a95. enable_roaring=true was set on all runs. Bench Summary
Filter Hit/Miss Comparison:
Takeaways:
postgresoriginal is v0.281.0 and roaring branch is the platform runtime on 92e4a95. I loaded all four Postgres tables to 200,000,000 rows each. It's very slow. Bench Summary
Filter Read
Takeaways:
deltalakeoriginal is v0.281.0 and roaring branch is the platform runtime on the current manager. These were 300-second fda bench runs, and all four runs had enable_roaring=true. Bench Summary
For fixed 300s runs, absolute storage at stop time is influenced by how much work got processed in that window. Filter Miss Table Miss rate is misses / (hits + misses), so higher is better.
Read
kafka
Bench Summary
Filter Miss Table
Main read:
|
a1e6733 to
84819c1
Compare
| } | ||
|
|
||
| let mut intermediate = factories.keys_factory().default_box(); | ||
| let mut merged_cursor = CursorList::new( |
There was a problem hiding this comment.
Needs some input from @ryzhyk that this CursorList here does the preserves the same semantics as the previous code which used SpineCursor
|
This is currently configured as an enabled by default feature with option to disable it in dev-tweaks. Thoughts? |
mythical-fred
left a comment
There was a problem hiding this comment.
Three blockers. The new dev_tweaks knob needs docs in docs.feldera.com, the roaring path should start default-off rather than enabled-by-default with a kill switch, and the [ci] apply automatic fixes commit needs to be squashed because dirty history is still a hard no.
| pub fn enable_roaring(&self) -> bool { | ||
| // Roaring is enabled by default, but `enable_roaring = false` remains | ||
| // available as a kill switch while the feature is still being tuned. | ||
| self.enable_roaring.unwrap_or(true) |
There was a problem hiding this comment.
On the feature-flag question from the PR thread: this should start default-off. An incompatible storage-format feature that is still being tuned wants a real rollout gate, not an opt-out kill switch after it is already on for everyone. enable_roaring = true should selectively enable it; the default should stay false until we have more production mileage.
0631ab4 to
73d0445
Compare
This PR adds three things:
1st commit: A bitmap filter based on the roaring crate with
2nd commit: A benchmark that validates the predictor function we use to choose between roaring and bloom filters (2k lines, this code can be mostly ignored for the PR review)
3rd commit: Code that adjusts storage files by setting an incompatible feature flag
Describe Manual Test Plan
Ran many pipelines
Checklist
Breaking Changes?
Ideally no
Describe Incompatible Changes
File format has new incompatible feature, old versions will refuse to read the files with a roaring filter.
Users should not downgrade once they go beyond this version without clearing storage.