Improve the metrics documentation autogenerator.

blp · blp · commit 546a58469a3f · 2025-08-06T16:45:19.000Z
The generator now reads the documentation text from a template file,
metrics.md.in, instead of embedding it in Python, which makes the text
easier to read and edit.  It also uses a regular expression, instead of
a prefix string, to choose the metrics for each section.  That makes it
possible to merge metrics that don't start with the same prefix into one
section, which in turn makes the documentation easier to understand.
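
As a rough illustration of the regular-expression selection described above, the sketch below expands a single template line such as `{{dbsp_|compaction_}}` into a Markdown table.  The `expand` helper and the sample `metrics` dictionary are hypothetical and exist only for this example; the real script writes straight to `metrics.md` and also removes each matched metric from its dictionary.

```python
import re

# Hypothetical sample of parsed metrics: name -> {"TYPE": ..., "HELP": ...}.
metrics = {
    "dbsp_steps_total": {
        "TYPE": "counter",
        "HELP": "Total number of DBSP steps executed.",
    },
    "compaction_stall_duration_seconds": {
        "TYPE": "counter",
        "HELP": "Time in seconds a worker was stalled waiting for more merges to complete.",
    },
    "records_input_total": {
        "TYPE": "counter",
        "HELP": "Total number of input records received from all connectors.",
    },
}


def expand(template_line, metrics):
    """Return the output lines for one template line (illustrative helper)."""
    m = re.match(r"\{\{(.*)\}\}", template_line)
    if not m:
        # Ordinary template text is copied through unchanged.
        return [template_line]
    # The text between the braces is a regex matched at the start of each name,
    # so "dbsp_|compaction_" selects metrics with either prefix.
    regex = re.compile(m.group(1))
    names = sorted(name for name in metrics if regex.match(name))
    rows = ["| Name | Type | Description |", "| :--- | :--- | :---------- |"]
    for name in names:
        rows.append(f"| `{name}` |{metrics[name]['TYPE']} | {metrics[name]['HELP']} |")
    return rows


table = expand("{{dbsp_|compaction_}}", metrics)
print("\n".join(table))
```

With a prefix string, `compaction_stall_duration_seconds` would have needed its own section; with the alternation it lands in the same table as the `dbsp_` metrics.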

Signed-off-by: Ben Pfaff <blp@feldera.com>
diff --git a/docs.feldera.com/docs/operations/metrics.md b/docs.feldera.com/docs/operations/metrics.md
@@ -1,14 +1,9 @@
-<!-- This file is automatically generated.  Do not edit!
-
-To regenerate this file, start a Feldera pipeline (any one will do)
-and obtain Prometheus metrics for it with, for example, `fda metrics
---format=prometheus` or `curl https://server/v0/metrics`, then run
-`metrics.py` from the same directory as this, like so:
+# Pipeline Metrics
 
-metrics.py < metrics.txt > metrics.md
--->
+<!-- This file is automatically generated.  Do not edit!
 
-# Metrics
+To update the documentation, please edit metrics.md.in instead and
+then regenerate this file using the instructions in that file. -->
 
 This reference lists all of the metrics that Feldera exports through
 its `/metrics` endpoint in [Prometheus exposition format].  It is
@@ -19,10 +14,15 @@ All of the metrics exported by a particular Feldera pipeline are
 labeled with the pipeline's UUID as `pipeline`.  Some metrics have
 additional labels, as documented below.
 
-[Prometheus exposition format]: https://prometheus.io/docs/instrumenting/exposition_formats
+See [Monitoring and Profiling] for a guide to setting up Prometheus
+and Grafana with Feldera.  The [Feldera template dashboard] is a
+sample Grafana dashboard for Feldera.
 
+[Prometheus exposition format]: https://prometheus.io/docs/instrumenting/exposition_formats
+[Monitoring and Profiling]: /tutorials/monitoring
+[Feldera template dashboard]: https://raw.githubusercontent.com/feldera/feldera/main/deploy/grafana_dashboard.json
 
-## Process Metrics
+# Process Metrics
 
 These metrics report statistics for a running Feldera pipeline
 process.  When a pipeline process is killed and restarts from a
@@ -64,18 +64,19 @@ which Feldera is built.
 
 | Name | Type | Description |
 | :--- | :--- | :---------- |
+| `compaction_stall_duration_seconds` |counter | Time in seconds a worker was stalled waiting for more merges to complete. |
 | `dbsp_operator_checkpoint_latency_seconds` |histogram | Latency of individual operator checkpoint operations in seconds. (Because checkpoints run in parallel across workers, these will not add to `feldera_checkpoint_latency_seconds`.) |
 | `dbsp_step_latency_seconds` |histogram | Latency of DBSP steps over the last 60 seconds or 1000 steps, whichever is less, in seconds |
 | `dbsp_steps_total` |counter | Total number of DBSP steps executed. |
 
 ## Record Processing
 
-These metrics report the status of record input, processing, and
-output as a whole.  They are maintained consistently across checkpoint
-and resume.
+These metrics report overall counts of records as they pass through
+the pipeline.  They accumulate across checkpoint and resume.
 
 | Name | Type | Description |
 | :--- | :--- | :---------- |
+| `output_buffered_batches` |gauge | Number of batches of records currently buffered by the output connector. |
 | `records_input_buffered` |gauge | Total number of records currently buffered by all endpoints. |
 | `records_input_total` |counter | Total number of input records received from all connectors. |
 | `records_late_total` |counter | Number of records dropped due to LATENESS annotations. |
@@ -88,6 +89,8 @@ to work with data larger than memory.
 
 | Name | Type | Description |
 | :--- | :--- | :---------- |
+| `files_created_total` |counter | Total number of files created. |
+| `files_deleted_total` |counter | Total number of files deleted. |
 | `storage_read_block_bytes` |histogram | Sizes in bytes of blocks read from storage. |
 | `storage_read_latency_seconds` |histogram | Read latency for storage blocks in seconds |
 | `storage_sync_latency_seconds` |histogram | Sync latency in seconds |
@@ -107,8 +110,8 @@ These metrics report the status of the pipeline.
 These metrics are per-input connector, labeled with `endpoint` set to
 the name of the input connector, which is either the name assigned in
 the SQL program or automatically generated as `unnamed-<number>`,
-where `<number>` is 1 for the first connector for a given table, 2 for
-the second, and so on.
+where `<number>` counts starting from 1 for the first connector for a
+given table.
 
 | Name | Type | Description |
 | :--- | :--- | :---------- |
@@ -123,8 +126,8 @@ the second, and so on.
 These metrics are per-output connector, labeled with `endpoint` set to
 the name of the output connector, which is either the name assigned in
 the SQL program or automatically generated as `unnamed-<number>`,
-where `<number>` is 1 for the first connector for a given view,
-2 for the second, and so on.
+where `<number>` counts starting from 1 for the first connector for a
+given view.
 
 | Name | Type | Description |
 | :--- | :--- | :---------- |
@@ -134,28 +137,3 @@ where `<number>` is 1 for the first connector for a given view,
 | `output_connector_errors_transport_total` |counter | Total number of errors encountered at the transport layer sending records. |
 | `output_connector_records_total` |counter | Total number of records sent by the output connector. |
 
-## Merge Status
-
-These metrics reports the status of the merger.
-
-| Name | Type | Description |
-| :--- | :--- | :---------- |
-| `compaction_stall_duration_seconds` |counter | Time in seconds a worker was stalled waiting for more merges to complete. |
-
-## Output Batches
-
-These metrics report output buffering status.
-
-| Name | Type | Description |
-| :--- | :--- | :---------- |
-| `output_buffered_batches` |gauge | Number of batches of records currently buffered by the output connector. |
-
-## File metrics
-
-These report use of files within Feldera storage.
-
-| Name | Type | Description |
-| :--- | :--- | :---------- |
-| `files_created_total` |counter | Total number of files created. |
-| `files_deleted_total` |counter | Total number of files deleted. |
-
diff --git a/docs.feldera.com/docs/operations/metrics.md.in b/docs.feldera.com/docs/operations/metrics.md.in
@@ -0,0 +1,99 @@
+<!-- This is a template file for metrics documentation.
+
+To regenerate the metrics documentation, start a Feldera pipeline (any
+one will do) and obtain Prometheus metrics for it with, for example,
+`fda metrics --format=prometheus` or `curl https://server/v0/metrics`.
+Save them to a file named `metrics.txt`, then run `metrics.py`
+from this directory, like so:
+
+metrics.py metrics.txt
+-->
+
+# Pipeline Metrics
+
+This reference lists all of the metrics that Feldera exports through
+its `/metrics` endpoint in [Prometheus exposition format].  It is
+automatically generated using the documentation embedded in Prometheus
+output.
+
+All of the metrics exported by a particular Feldera pipeline are
+labeled with the pipeline's UUID as `pipeline`.  Some metrics have
+additional labels, as documented below.
+
+See [Monitoring and Profiling] for a guide to setting up Prometheus
+and Grafana with Feldera.  The [Feldera template dashboard] is a
+sample Grafana dashboard for Feldera.
+
+[Prometheus exposition format]: https://prometheus.io/docs/instrumenting/exposition_formats
+[Monitoring and Profiling]: /tutorials/monitoring
+[Feldera template dashboard]: https://raw.githubusercontent.com/feldera/feldera/main/deploy/grafana_dashboard.json
+
+# Process Metrics
+
+These metrics report statistics for a running Feldera pipeline
+process.  When a pipeline process is killed and restarts from a
+checkpoint, the new process's metrics are for it alone, not cumulative
+with any previous instantiations.
+
+These metrics are intended to match the standard [Prometheus
+definitions].
+
+[Prometheus definitions]: https://prometheus.io/docs/instrumenting/writing_clientlibs/#process-metrics
+
+{{process_}}
+
+## Feldera metrics
+
+These metrics report statistics for Feldera operations.
+
+{{feldera_}}
+
+## DBSP metrics
+
+These metrics report statistics for [DBSP], the low-level mechanism on
+which Feldera is built.
+
+[DBSP]: https://docs.rs/dbsp/latest/dbsp/
+
+{{dbsp_|compaction_}}
+
+## Record Processing
+
+These metrics report overall counts of records as they pass through
+the pipeline.  They accumulate across checkpoint and resume.
+
+{{records_|output_buffered_}}
+
+## Storage Performance
+
+These metrics report the performance of storage, which allows Feldera
+to work with data larger than memory.
+
+{{storage_|files_}}
+
+## Pipeline Status
+
+These metrics report the status of the pipeline.
+
+{{pipeline_}}
+
+## Input Connectors
+
+These metrics are per-input connector, labeled with `endpoint` set to
+the name of the input connector, which is either the name assigned in
+the SQL program or automatically generated as `unnamed-<number>`,
+where `<number>` counts starting from 1 for the first connector for a
+given table.
+
+{{input_connector_}}
+
+## Output Connectors
+
+These metrics are per-output connector, labeled with `endpoint` set to
+the name of the output connector, which is either the name assigned in
+the SQL program or automatically generated as `unnamed-<number>`,
+where `<number>` counts starting from 1 for the first connector for a
+given view.
+
+{{output_connector_}}
+
diff --git a/docs.feldera.com/docs/operations/metrics.py b/docs.feldera.com/docs/operations/metrics.py
@@ -1,125 +1,60 @@
 #!/usr/bin/env python3
 
 import fileinput
+import re
 import sys
 
+# Read Prometheus metrics from the files provided on the command line,
+# or on stdin, and save their types and descriptions into `metrics`.
 metrics = {}
-for line in fileinput.input(encoding='utf-8'):
+for line in fileinput.input(encoding="utf-8"):
     comment, keyword, name, args = line.strip().split(maxsplit=3)
-    if comment == '#' and keyword in ['TYPE', 'HELP']:
+    if comment == "#" and keyword in ["TYPE", "HELP"]:
         metrics.setdefault(name, {})[keyword] = args
 
-print("""<!-- This file is automatically generated.  Do not edit!
-
-To regenerate this file, start a Feldera pipeline (any one will do)
-and obtain Prometheus metrics for it with, for example, `fda metrics
---format=prometheus` or `curl https://server/v0/metrics`, then run
-`metrics.py` from the same directory as this, like so:
-
-metrics.py < metrics.txt > metrics.md
--->
-
-# Metrics
-
-This reference lists all of the metrics that Feldera exports through
-its `/metrics` endpoint in [Prometheus exposition format].  It is
-automatically generated using the documentation embedded in Prometheus
-output.
-
-All of the metrics exported by a particular Feldera pipeline are
-labeled with the pipeline's UUID as `pipeline`.  Some metrics have
-additional labels, as documented below.
-
-[Prometheus exposition format]: https://prometheus.io/docs/instrumenting/exposition_formats
-
+# Read Markdown template from metrics.md.in, making some substitutions:
+#
+# - Delete lines up to the first `#`, and copy that line to the output.
+#
+# - Add a warning about the file being automatically generated, so
+#   that people don't edit it.  (This has to go *after* the first `#`
+#   because docusaurus won't skip the comment when it goes looking
+#   for the page title.)
+#
+# - Copy the rest of the file to the output, substituting lines that
+#   are bracketed by {{}} by autogenerated metrics documentation.
+template = open("metrics.md.in", "r")
+output = open("metrics.md", "w")
+for line in template:
+    line = line.rstrip()
+    if line.startswith("#"):
+        break
+output.write(f"""{line}
+
+<!-- This file is automatically generated.  Do not edit!
+
+To update the documentation, please edit metrics.md.in instead and
+then regenerate this file using the instructions in that file. -->
 """)
-
-def document_section(name, heading):
-    print(heading.strip())
-    print()
-
-    section_metrics = sorted([key for key in metrics.keys() if key.startswith(f"{name}_")])
-    assert section_metrics != []
-    print("| Name | Type | Description |")
-    print("| :--- | :--- | :---------- |")
-    for metric in section_metrics:
-        type_ = metrics[metric]['TYPE']
-        help = metrics[metric]['HELP']
-        print(f"| `{metric}` |{type_} | {help} |")
-        del metrics[metric]
-    print()
-
-document_section("process", """## Process Metrics
-
-These metrics report statistics for a running Feldera pipeline
-process.  When a pipeline process is killed and restarts from a
-checkpoint, the new process's metrics are for it alone, not cumulative
-with any previous instantiations.
-
-These metrics are intended to match the standard [Prometheus
-definitions].
-
-[Prometheus definitions]: https://prometheus.io/docs/instrumenting/writing_clientlibs/#process-metrics""")
-
-document_section("feldera", """## Feldera metrics
-
-These metrics report statistics for Feldera operations.""")
-
-document_section("dbsp", """## DBSP metrics
-
-These metrics report statistics for [DBSP], the low-level mechanism on
-which Feldera is built.
-
-[DBSP]: https://docs.rs/dbsp/latest/dbsp/""")
-
-document_section("records", """## Record Processing
-
-These metrics report the status of record input, processing, and
-output as a whole.  They are maintained consistently across checkpoint
-and resume.
-
-""")
-
-document_section("storage", """## Storage Performance
-
-These metrics report the performance of storage, which allows Feldera
-to work with data larger than memory.""")
-
-document_section("pipeline", """## Pipeline Status
-
-These metrics report the status of the pipeline.
-
-""")
-
-document_section("input_connector", """## Input Connectors
-
-These metrics are per-input connector, labeled with `endpoint` set to
-the name of the input connector, which is either the name assigned in
-the SQL program or automatically generated as `unnamed-<number>`,
-where `<number>` is 1 for the first connector for a given table, 2 for
-the second, and so on.""")
-
-document_section("output_connector", """## Output Connectors
-
-These metrics are per-output connector, labeled with `endpoint` set to
-the name of the output connector, which is either the name assigned in
-the SQL program or automatically generated as `unnamed-<number>`,
-where `<number>` is 1 for the first connector for a given view,
-2 for the second, and so on.""")
-
-document_section("compaction", """## Merge Status
-
-These metrics reports the status of the merger.
-""")
-
-document_section("output_buffered", """## Output Batches
-
-These metrics report output buffering status.""")
-
-document_section("files", """## File metrics
-
-These report use of files within Feldera storage.""")
+for line in template:
+    line = line.rstrip()
+    m = re.match(r"{{(.*)}}", line)
+    if m:
+        regex = re.compile(m.group(1))
+        matching_metrics = sorted([key for key in metrics.keys() if regex.match(key)])
+        assert matching_metrics != []
+        output.write("| Name | Type | Description |\n")
+        output.write("| :--- | :--- | :---------- |\n")
+        for metric in matching_metrics:
+            type_ = metrics[metric]["TYPE"]
+            help = metrics[metric]["HELP"]
+            output.write(f"| `{metric}` |{type_} | {help} |\n")
+            del metrics[metric]
+    else:
+        output.write(f"{line}\n")
 
 if len(metrics) > 0:
-    sys.stderr.write(f"error: the following metrics need documentation sections: {metrics.keys()}\n")
+    sys.stderr.write(
+        f"error: the following metrics need to be included in documentation: {metrics.keys()}\n"
+    )
     sys.exit(1)
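
The parsing loop in `metrics.py` unpacks exactly four fields from every input line.  In Python, a line with fewer whitespace-separated fields (a blank line, or a sample line such as `name{...} 42`) would raise `ValueError` at that unpacking.  The following is a minimal sketch of a more tolerant variant, assuming the input may contain such lines; `parse_metadata` and the sample input are hypothetical and not part of this commit.

```python
def parse_metadata(lines):
    """Collect TYPE and HELP text per metric, skipping every other line."""
    metrics = {}
    for line in lines:
        fields = line.strip().split(maxsplit=3)
        # Only "# TYPE <name> <type>" and "# HELP <name> <text>" lines have
        # the four fields we need; sample lines and blank lines are ignored.
        if len(fields) == 4 and fields[0] == "#" and fields[1] in ("TYPE", "HELP"):
            _, keyword, name, args = fields
            metrics.setdefault(name, {})[keyword] = args
    return metrics


# Hypothetical input in Prometheus exposition format.
sample = [
    "# HELP dbsp_steps_total Total number of DBSP steps executed.",
    "# TYPE dbsp_steps_total counter",
    'dbsp_steps_total{pipeline="example"} 42',
    "",
]
parsed = parse_metadata(sample)
print(parsed)
```
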
diff --git a/docs.feldera.com/docs/tutorials/monitoring/index.md b/docs.feldera.com/docs/tutorials/monitoring/index.md