[MISC] Add --active-only filter to backfill_metrics command#1824
[MISC] Add --active-only filter to backfill_metrics command#1824Deepak-Kesavan merged 6 commits intomainfrom
Conversation
Allow filtering organizations during backfill to only process those with active status. Adds graceful fallback for OSS mode. Also improves progress output with [n/N] counter, extracts org resolution into _resolve_org_ids(), and fixes logger formatting to use lazy %s instead of f-strings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary by CodeRabbit
WalkthroughAdds an Changes
sequenceDiagram
participant CLI as CLI
participant Cmd as BackfillCommand
participant Sub as SubscriptionModel
participant Org as OrganizationModel
participant Collector as MetricsCollector
participant Log as Logger
CLI->>Cmd: run backfill_metrics (--active-only?, org_id?)
Cmd->>Cmd: parse args
Cmd->>Sub: conditional import check (HAS_SUBSCRIPTION)
Cmd->>Cmd: _resolve_org_ids(org_id, active_only)
alt HAS_SUBSCRIPTION and active_only
Cmd->>Sub: query active subscriptions -> org ids
else fallback / no subscription support
Cmd->>Org: query all org ids (or validate single org)
end
Cmd->>Cmd: iterate resolved org ids [i/total]
loop per org
Cmd->>Collector: collect metrics for current_org_id
Collector-->>Cmd: success / errors
alt error
Cmd->>Log: log metric name + current_org_id + error
end
end
Cmd->>Log: final summary (errors/counts)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/management/commands/backfill_metrics.py`:
- Around line 202-207: The single-org branch bypasses the --active-only filter,
so when --org-id is provided inactive orgs can be processed; update the logic in
the function that returns org IDs (the branch using
Organization.objects.filter(id=org_id).exists()) to honor the --active-only
flag: either make --org-id and --active-only mutually exclusive in argument
parsing, or (preferred) when active_only is true, check
Organization.objects.filter(id=org_id, is_active=True).exists() and error/return
if not active; reference Organization and the code path that currently returns
[org_id] for single-org mode.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b5bf7e7e-0729-48ff-bc0a-a5698a69f465
📒 Files selected for processing (1)
backend/dashboard_metrics/management/commands/backfill_metrics.py
Organization.id is the auto-increment int PK, but Subscription.organization_id and all service query methods use Organization.organization_id (the Auth0 org identifier string). The set intersection was always empty due to this type mismatch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
for more information, see https://pre-commit.ci
Subscription.organization_id stores Auth0 org strings (org_*) while all downstream service queries and bulk upserts use Organization.id (int PK) via FK references. Map subscription org strings back to Organization PKs via organization_id__in lookup to ensure correct downstream behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
muhammad-ali-e
left a comment
There was a problem hiding this comment.
Code Review
Summary
PR adds --active-only flag to backfill_metrics management command, extracts org resolution into _resolve_org_ids(), adds progress counters, and fixes logger formatting. However, the refactoring introduces a critical data type mismatch: the new code passes Organization.organization_id (Auth0 string identifier) where the downstream code expects Organization.id (integer PK).
Automated Checks
- ruff check: PASS
- ruff format: PASS
Findings (confidence >= 80 only)
Critical [must fix]
| # | File:Line | Category | Confidence | Finding | Suggestion |
|---|---|---|---|---|---|
| 1 | backfill_metrics.py (_resolve_org_ids) | Bug | 98 | _resolve_org_ids() returns Organization.organization_id (Auth0 string like "org_abc123") but _collect_metrics() passes it to MetricsQueryService methods that expect Organization.id (integer PK). For example, get_pages_processed() does Organization.objects.get(id=organization_id) at services.py:141 — this will raise DoesNotExist or ValueError when given a string identifier instead of an integer PK. |
Return Organization PKs (values_list("id", flat=True)) from _resolve_org_ids(), not organization_id CharField values. Or update all MetricsQueryService methods to accept the Auth0 identifier — but that's a much larger change. |
| 2 | backfill_metrics.py (_resolve_org_ids) | Bug | 96 | Bulk upsert methods do EventMetricsHourly(organization_id=org_id, ...) where organization_id is the FK column (stores Organization PK). Passing Auth0 string identifiers will cause integrity errors or create orphaned records. |
Same fix as #1 — ensure _resolve_org_ids() returns Organization PKs. |
| 3 | backfill_metrics.py (_resolve_org_ids single-org) | Bug | 92 | Single-org mode: Organization.objects.filter(organization_id=org_id) validates using the Auth0 identifier, but returns it to be passed to _collect_metrics() which expects PK. The usage doc was changed from --org-id=12 to --org-id=org_abc123 but downstream code still expects PKs. This is a breaking change to the --org-id CLI semantics without updating the consumer. |
Either keep --org-id as PK and do Organization.objects.filter(id=org_id), or convert the Auth0 identifier back to PK before returning: Organization.objects.filter(organization_id=org_id).values_list("id", flat=True). |
Important [should fix]
| # | File:Line | Category | Confidence | Finding | Suggestion |
|---|---|---|---|---|---|
| 4 | backfill_metrics.py (handle loop) | Code Quality | 82 | Progress output only shows current_org_id (a raw string identifier) without the org display name. The old code showed org.display_name ({org_id_str}) which was more useful for operators. |
Fetch the Organization object (or at minimum the display_name) and include it in the progress message. |
Suggested Fix
Change _resolve_org_ids() to return PKs:
# All org mode
all_org_ids = set(Organization.objects.values_list("id", flat=True))
# Active filter - need to map Subscription.organization_id → Organization.id
if active_only and HAS_SUBSCRIPTION:
active_auth0_ids = set(
Subscription.objects.filter(is_active=True)
.values_list("organization_id", flat=True)
)
all_org_ids = set(
Organization.objects.filter(organization_id__in=active_auth0_ids)
.values_list("id", flat=True)
)
# Single org mode - decide: is --org-id a PK or Auth0 identifier? Document clearly.Confluence Checklist
- ✅ Appropriate PR title and description
- ✅ Code follows style guidelines (ruff passes)
⚠️ Self-review — the PK vs organization_id mismatch suggests the interaction with MetricsQueryService wasn't verified end-to-end- ❌ Tests — no test files added or modified
- ✅ No database migrations needed
- ✅ No new dependencies
- ✅ Backward compatible (when bug is fixed)
Verdict: REQUEST CHANGES
The core refactoring has a critical data type mismatch. _resolve_org_ids() returns Organization.organization_id (Auth0 string identifiers) but all downstream consumers — MetricsQueryService methods and bulk upsert calls — expect Organization.id (integer PK). This will cause runtime failures for every org processed.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
backend/dashboard_metrics/management/commands/backfill_metrics.py (1)
216-226: Push dedup/filter work to the database to reduce Python memory churn.
set(...)materialization here works, but this can be done with DB-leveldistinct()and subquery composition, which scales better for larger org/subscription tables.Refactor sketch
- active_org_id_strings = set( - Subscription.objects.filter(is_active=True).values_list( - "organization_id", flat=True - ) - ) - # Map org_* strings back to Organization PKs - all_org_ids = set( - Organization.objects.filter( - organization_id__in=active_org_id_strings - ).values_list("id", flat=True) - ) + active_org_id_strings = ( + Subscription.objects.filter(is_active=True) + .values_list("organization_id", flat=True) + .distinct() + ) + # Map org_* strings back to Organization PKs + all_org_ids = ( + Organization.objects.filter(organization_id__in=active_org_id_strings) + .values_list("id", flat=True) + .distinct() + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/management/commands/backfill_metrics.py` around lines 216 - 226, The current code materializes Subscription and Organization IDs into Python sets (active_org_id_strings and all_org_ids) which causes unnecessary memory churn; instead push deduping to the DB by using QuerySet.distinct() and subquery composition: get the distinct subscription organization_id via Subscription.objects.filter(is_active=True).values_list("organization_id", flat=True).distinct() (or wrap as a Subquery), then use that QuerySet/Subquery in Organization.objects.filter(organization_id__in=...) to fetch Organization IDs (and call .values_list("id", flat=True).distinct()) so no large set(...) materialization is done in Python and the database handles deduplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/management/commands/backfill_metrics.py`:
- Around line 206-210: The org lookup can raise on malformed --org-id; wrap the
Organization.objects.filter(id=org_id).first() call in a try/except that catches
ValueError and django.core.exceptions.ValidationError (and optionally
TypeError), write a friendly error to self.stderr via self.style.ERROR (e.g.,
"Invalid organization id: {org_id}") and return [] on exception, then proceed to
the existing "not found" path if lookup returns None; update imports if needed
to include ValidationError.
- Around line 32-37: The import try/except around "from
pluggable_apps.subscription_v2.models import Subscription" should only swallow
module-not-found for the pluggable_apps package path and re-raise other import
errors; change the except to catch ModuleNotFoundError (not ImportError) and
only set HAS_SUBSCRIPTION = False when exc.name is "pluggable_apps" or
startswith "pluggable_apps."; otherwise re-raise the exception so import-time
errors inside subscription_v2 aren't silently ignored (update the block that
defines Subscription and HAS_SUBSCRIPTION).
---
Nitpick comments:
In `@backend/dashboard_metrics/management/commands/backfill_metrics.py`:
- Around line 216-226: The current code materializes Subscription and
Organization IDs into Python sets (active_org_id_strings and all_org_ids) which
causes unnecessary memory churn; instead push deduping to the DB by using
QuerySet.distinct() and subquery composition: get the distinct subscription
organization_id via
Subscription.objects.filter(is_active=True).values_list("organization_id",
flat=True).distinct() (or wrap as a Subquery), then use that QuerySet/Subquery
in Organization.objects.filter(organization_id__in=...) to fetch Organization
IDs (and call .values_list("id", flat=True).distinct()) so no large set(...)
materialization is done in Python and the database handles deduplication.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1cbb1d23-21b8-4e3c-8d41-81d38f6c0169
📒 Files selected for processing (1)
backend/dashboard_metrics/management/commands/backfill_metrics.py
This was addressed in earlier commit |
Wrap Organization PK lookup in try/except so non-integer values (e.g. --org-id=abc) show a friendly "not found" error instead of crashing with an unhandled ValueError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test ResultsSummary
Runner Tests - Full Report
SDK1 Tests - Full Report
|
|
There was a problem hiding this comment.
♻️ Duplicate comments (1)
backend/dashboard_metrics/management/commands/backfill_metrics.py (1)
207-210:⚠️ Potential issue | 🟡 MinorCatch
ValidationErrorfor malformed--org-id.This lookup can still raise
django.core.exceptions.ValidationError(commonly with malformed UUID input), so the command may abort before reaching the friendly “not found” path.Patch suggestion
+from django.core.exceptions import ValidationError ... if org_id: try: org = Organization.objects.filter(id=org_id).first() - except (ValueError, TypeError): + except (ValueError, TypeError, ValidationError): org = None#!/bin/bash set -euo pipefail # Verify Organization PK type and current exception handling fd "models.py" | xargs rg -n --type=py -C3 "class Organization\\(|id\\s*=\\s*models\\.(UUIDField|AutoField|BigAutoField|IntegerField)" rg -n --type=py -C3 "_resolve_org_ids|except \\(ValueError, TypeError\\)|ValidationError" backend/dashboard_metrics/management/commands/backfill_metrics.pyExpected result: if
Organization.idis aUUIDField,ValidationErrorshould be handled in this branch.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/management/commands/backfill_metrics.py` around lines 207 - 210, The current try/except around Organization.objects.filter(id=org_id).first() only catches ValueError and TypeError but a malformed UUID can raise django.core.exceptions.ValidationError; update the exception handling in the block that resolves org_id (the call performing Organization.objects.filter(id=org_id).first()) to also catch ValidationError (imported as from django.core.exceptions import ValidationError) so malformed UUID inputs fall back to org = None and proceed to the existing “not found” handling.
🧹 Nitpick comments (2)
backend/dashboard_metrics/management/commands/backfill_metrics.py (2)
71-75: Clarify--org-idhelp text to indicate PK semantics.The current text is ambiguous with
Organization.organization_idalso in play. Explicitly stating “Organization.id (PK)” will reduce operator confusion.Patch suggestion
parser.add_argument( "--org-id", type=str, default=None, - help="Specific organization ID to backfill (default: all orgs)", + help="Specific organization PK (Organization.id) to backfill (default: all orgs)", )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/management/commands/backfill_metrics.py` around lines 71 - 75, Update the help string for the "--org-id" argument in the management command (inside add_arguments of Command in backfill_metrics.py) to explicitly state that it expects the Organization primary key; replace the ambiguous text "Specific organization ID to backfill (default: all orgs)" with something like "Specific Organization.id (primary key) to backfill; if omitted, backfills all organizations" so operators know to supply the PK rather than Organization.organization_id.
142-145: Optional: include org display name in progress output.For long backfills, printing
<org_id> (<name>)makes operator monitoring much easier.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/management/commands/backfill_metrics.py` around lines 142 - 145, Update the progress print in the backfill loop to include the org display name alongside the org id: inside the loop over org_ids (variables org_ids, current_org_id, i) fetch the Organization display name (e.g., via the Organization model or existing lookup used elsewhere in backfill_metrics.py) and change the self.stdout.write call to print "{org_id} ({display_name})" with a safe fallback to the id if the name is missing; ensure you reference org_ids/current_org_id and self.stdout.write when making the change so it works for long-running backfills.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@backend/dashboard_metrics/management/commands/backfill_metrics.py`:
- Around line 207-210: The current try/except around
Organization.objects.filter(id=org_id).first() only catches ValueError and
TypeError but a malformed UUID can raise django.core.exceptions.ValidationError;
update the exception handling in the block that resolves org_id (the call
performing Organization.objects.filter(id=org_id).first()) to also catch
ValidationError (imported as from django.core.exceptions import ValidationError)
so malformed UUID inputs fall back to org = None and proceed to the existing
“not found” handling.
---
Nitpick comments:
In `@backend/dashboard_metrics/management/commands/backfill_metrics.py`:
- Around line 71-75: Update the help string for the "--org-id" argument in the
management command (inside add_arguments of Command in backfill_metrics.py) to
explicitly state that it expects the Organization primary key; replace the
ambiguous text "Specific organization ID to backfill (default: all orgs)" with
something like "Specific Organization.id (primary key) to backfill; if omitted,
backfills all organizations" so operators know to supply the PK rather than
Organization.organization_id.
- Around line 142-145: Update the progress print in the backfill loop to include
the org display name alongside the org id: inside the loop over org_ids
(variables org_ids, current_org_id, i) fetch the Organization display name
(e.g., via the Organization model or existing lookup used elsewhere in
backfill_metrics.py) and change the self.stdout.write call to print "{org_id}
({display_name})" with a safe fallback to the id if the name is missing; ensure
you reference org_ids/current_org_id and self.stdout.write when making the
change so it works for long-running backfills.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ec1c5954-0894-473d-ab0d-5e883cbba909
📒 Files selected for processing (1)
backend/dashboard_metrics/management/commands/backfill_metrics.py



What
--active-onlyflag tobackfill_metricsmanagement command to filter organizations during backfill_resolve_org_ids()method[1/N]to org processing output%sformatting instead of f-stringsWhy
--active-onlyreduces the set to only active orgs, significantly cutting backfill timeHow
--active-onlyCLI argument with graceful import-based fallback for OSS compatibility_resolve_org_ids()to cleanly handle single-org, all-org, and filtered-org modes[i/N]progress output for each org processedCan this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
--active-only) is unchanged. The flag is additive and the import fallback ensures OSS mode works as before.Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
backfill_metrics --days=7 --dry-run— verify all orgs are listedbackfill_metrics --days=7 --active-only --dry-run— verify filtered org count--active-only— verify backward-compatible behavior--active-onlywith warning messageChecklist
I have read and understood the Contribution Guidelines.