Skip to content

Add checkpoint info to /stats API endpoint, add checkpoint info UI#6003

Open
Karakatiza666 wants to merge 6 commits intomainfrom
checkpoint-ui
Open

Add checkpoint info to /stats API endpoint, add checkpoint info UI#6003
Karakatiza666 wants to merge 6 commits intomainfrom
checkpoint-ui

Conversation

@Karakatiza666
Copy link
Copy Markdown
Contributor

@Karakatiza666 Karakatiza666 commented Apr 7, 2026

Testing:
Tested the UI using the mock API value simulator. Did not test the adapters and pipeline-manager changes

image image image image image

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two design questions, see inline.


/// Converts epoch milliseconds to a `DateTime<Utc>`, falling back to
/// `Utc::now()` if the value is 0 or invalid.
fn epoch_ms_to_datetime(epoch_ms: i64) -> DateTime<Utc> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If epoch_ms is 0/invalid, returning Utc::now() will make the UI show a fresh timestamp even when we don’t actually know the start time. Consider returning an Option (and omitting the field) or a sentinel so the UI can show "unknown" instead of a misleading "just now".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The epoch_ms should not happen at runtime; updated the comments and the code to reflect that. Since this is a hard rule I opted for an explicit crash if this occurs, LMK if you think it's better to bubble it up as an error

let epoch_ms = self.checkpoint_started_epoch_ms();
let started_at = epoch_ms_to_datetime(epoch_ms);
CheckpointActivity::InProgress {
sequence_number: 0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sequence_number: 0 looks like a placeholder. If we can’t plumb the real seq here, consider making this field optional or dropping it from CheckpointActivity::InProgress for now to avoid exposing bogus data.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped sequence_number because it is not available in all cases, and we don't have a business case to display it for an in-progress checkpoint anyway

@gz
Copy link
Copy Markdown
Contributor

gz commented Apr 7, 2026

please engage with anna for the UX

@ryzhyk
Copy link
Copy Markdown
Contributor

ryzhyk commented Apr 7, 2026

Some feedback on UI:

  • "2 checkpoints at 780MiB" -> "2 checkpoints, total size: 780MiB"
  • Checkpoint list: include the timestamp, including time zone (UTC), when the checkpoint was made with each checkpoint (can be extracted from ULID). This should be visible even in the collapsed state.
  • "Make checkpoint" - if we expose this, it shouldn't be hiding in the checkpoint list.

Copy link
Copy Markdown
Member

@blp blp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the Rust code and it seems OK. I'll leave approval for whoever reviews rest of the PR when it becomes ready for review.

/// Is the circuit thread still restoring from a checkpoint (this includes the journal replay phase)?
restoring: AtomicBool,

/// Wall-clock epoch millis when the checkpoint entered the Delayed or
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it is premature optimization to use AtomicI64 for these. I think that Mutex<Option<DateTime<Utc>>> would be cleaner.

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
…timestamp

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
…rted: Mutex<Option<DateTime<Utc>>>

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
@Karakatiza666 Karakatiza666 marked this pull request as ready for review April 13, 2026 21:26
@Karakatiza666
Copy link
Copy Markdown
Contributor Author

The current design is intended to be merged as-is; the nicer design from our design team will be implemented when ready and accepted.

@Karakatiza666
Copy link
Copy Markdown
Contributor Author

Currently the OSS release will have this displayed all the time:
image
LMK if this should be hidden.

&self,
suspend_error: Result<(), SuspendError>,
checkpoint_activity: CheckpointActivity,
permanent_checkpoint_errors: Option<Vec<PermanentSuspendError>>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this type is awfully specific, is this the only type of error possible?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PermanentSuspendError ? Yes, it covers a few possibilities. This is a type of errors that don't ever allow checkpointing the pipeline.

/// Wall-clock time when the checkpoint entered the Delayed or
/// Barriers state. `None` means "not delayed". Set by the circuit thread
/// in `set_checkpoint_coordination`, read by the HTTP thread.
checkpoint_delay_started: Mutex<Option<DateTime<Utc>>>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert in Rust concurrency, someone else will have to comment whether this is the right type to use here

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a reasonable choice.

next_seq: u64,

/// Sequence number of the currently in-flight checkpoint, if any.
/// This is only valid when exactly one checkpoint runs at a time.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for many checkpoints to be running at the same time? I thought it's not.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only write one checkpoint at a time.

///
/// This tracks transient checkpoint failures (e.g. I/O errors during
/// writing). A subsequent successful checkpoint will not clear this
/// field — it always reflects the *last* failure that occurred.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the field should be named "last_failure", but I don't know if you can rename it

Delayed {
/// Why the checkpoint cannot proceed yet.
reasons: Vec<TemporarySuspendError>,
/// When the delay started (ISO 8601).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What ISO 8601? It's a DateTime, 8601 is a serialization format. Maybe you want to say "serialized as ISO 8601"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is applied as a description in openapi.json, but adding "serialized as" makes it less ambiguous either way

Used storage: {humanSize(metrics.at(-1)?.s.toNumber() ?? 0)}
<div class="flex flex-nowrap items-center justify-between px-4 pb-2">
<span>
<span class="hidden sm:inline">Used storage:</span>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what these two spans are, in the pictures you attached I only see one

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One is shown on desktop screens, one (shorter) is shown when on mobile

if (anchorOffscreen) {
const focusOffscreen =
tracked.focus.row < visible.min || tracked.focus.row > visible.max
const focusOffscreen = tracked.focus.row < visible.min || tracked.focus.row > visible.max
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why these formatting changes?
Isn't the formatting made by a tool?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. This one probably was committed without running through the formatter


/**
* Optional settings for tweaking Feldera internals.
*
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks to me that some of these changes should have been in other PRs.
Is this a sign that our tooling for generating these gen.ts files is not complete?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When people update openapi.json they don't usually re-generate typescript bindings from it; so when I periodically re-generate it it often contains cumulative additions from multiple PRs

})
}
fetchCheckpoints()
const interval = setInterval(fetchCheckpoints, 2_000)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other polling loops that can be combined with this one?
I don't know if we tested scaling of the manager and ui with lots of pipelines and ui clients.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason to combine this loop with any other. This is the only access for this endpoint in web-console


/** Extracts the timestamp from a UUID v7 (first 48 bits = unix ms).
* Returns null for non-v7 UUIDs. */
const uuidV7Timestamp = (uuid: string): string | null => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't you have a library with utility functions?
this would belong there

Copy link
Copy Markdown
Contributor Author

@Karakatiza666 Karakatiza666 Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sometimes don't put utiliities there when there's only a single use for them. Will move

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two blockers.

  1. This PR changes behavior in both the adapters/API layer and the web-console, but it still adds no unit or integration coverage. The new checkpoint activity/status paths, the extra polling loop, and the new UI states all need tests before this is safe to merge. For the UI, please add Vitest + @testing-library/svelte coverage rather than relying only on manual screenshots.

  2. This is a user-visible API/UI change (new checkpoint fields, new checkpoint status presentation, new checkpoint actions), but there is still no corresponding docs update under docs.feldera.com/. That is a hard rule for user-visible changes.

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
@blp
Copy link
Copy Markdown
Member

blp commented Apr 14, 2026

Thanks for the improvements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants