improvement(seo): optimize sitemaps and robots.txt across sim and docs #4170

Open
emir-karabeg wants to merge 2 commits into staging from improvement/sitemap

Conversation

@emir-karabeg
Collaborator

Summary

  • Fix 6x duplicate URL bug in docs sitemap — convert from route handler to Next.js metadata convention with native hreflang and x-default
  • Add missing pages to sim sitemap: blog author pages, academy catalog/course pages
  • Remove changeFrequency/priority (Google ignores both), fix inaccurate lastModified timestamps
  • Consolidate 20+ redundant per-bot robots rules into single wildcard, add missing disallow paths

Context

SEO audit found several issues: the docs sitemap generated every page 6x (once per language) without hreflang alternates, the sim sitemap was missing public pages and using new Date() as lastModified on static content (which trains Google to distrust the signal), and robots.txt had 20+ identical bot-specific rules that added noise with no effect.

Changes

Sim sitemap (apps/sim/app/sitemap.ts)

  • Add blog author pages (/blog/authors/[id]) with lastModified derived from each author's latest post
  • Add academy pages (/academy, /academy/[courseSlug])
  • Fix lastModified accuracy — use real content dates for blog/models, omit for static JSON-derived pages
  • Remove changeFrequency and priority fields (confirmed ignored by Google)
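
One way to derive an author page's lastModified from that author's latest post is sketched below. This is only an illustration: the `PostMeta` and `SitemapEntry` shapes, the `buildAuthorEntries` helper, and the example base URL are assumptions made for this sketch, not the types or code used in apps/sim/app/sitemap.ts.

```typescript
// Hypothetical shapes for illustration; the real post metadata type is not shown in this PR.
interface PostMeta {
  slug: string
  date: string // ISO publish date
  updated?: string // ISO date of the last edit, if any
  authors: { id: string }[]
}

// Local stand-in for a single entry of Next.js's MetadataRoute.Sitemap.
interface SitemapEntry {
  url: string
  lastModified?: Date
}

/** Builds one sitemap entry per author, dated by that author's most recent post. */
function buildAuthorEntries(baseUrl: string, posts: PostMeta[]): SitemapEntry[] {
  const latestByAuthor: Record<string, number> = {}
  for (const post of posts) {
    // Prefer the edit date when present, otherwise fall back to the publish date.
    const time = new Date(post.updated ?? post.date).getTime()
    for (const author of post.authors) {
      const previous = latestByAuthor[author.id]
      if (previous === undefined || time > previous) latestByAuthor[author.id] = time
    }
  }
  // Authors with no posts never enter the map, so every entry has a real content date.
  return Object.keys(latestByAuthor).map((id) => ({
    url: `${baseUrl}/blog/authors/${id}`,
    lastModified: new Date(latestByAuthor[id]),
  }))
}
```

Because the date comes from content rather than from the generation time, it only changes when one of the author's posts is published or edited.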

Sim robots (apps/sim/app/robots.ts)

  • Replace 20+ identical per-bot rule blocks with single * wildcard
  • Add /form/ and /credential-account/ to disallow list
  • Reference image sitemap (/blog/sitemap-images.xml)
  • Remove deprecated host directive
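
A consolidated robots definition along these lines might look like the sketch below. The `RobotsConfig` interface is a local stand-in for the shape of Next.js's `MetadataRoute.Robots`, and `buildRobots`, the `allow: '/'` rule, and the example base URL are assumptions for illustration. Only the two disallow paths and the image sitemap path come from this PR description, and the real disallow list is presumably longer.

```typescript
// Local stand-in for the shape of Next.js's MetadataRoute.Robots, so the sketch type-checks on its own.
interface RobotsConfig {
  rules: { userAgent: string; allow?: string | string[]; disallow?: string | string[] }
  sitemap: string[]
}

// Only these two paths are named in the PR description; a real list would contain more.
const DISALLOWED_PATHS = ['/form/', '/credential-account/']

/** Returns a single wildcard rule instead of one identical block per crawler. */
function buildRobots(baseUrl: string): RobotsConfig {
  return {
    rules: { userAgent: '*', allow: '/', disallow: DISALLOWED_PATHS },
    // Advertise both the main sitemap and the blog image sitemap.
    sitemap: [`${baseUrl}/sitemap.xml`, `${baseUrl}/blog/sitemap-images.xml`],
  }
}
```

A single `*` group covers every crawler that has no more specific group, which is why identical per-bot blocks add nothing.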

Docs sitemap (apps/docs/app/sitemap.ts — new, replaces apps/docs/app/sitemap.xml/route.ts)

  • Convert from raw XML route handler to Next.js MetadataRoute.Sitemap convention
  • Use source.getLanguages() from Fumadocs to deduplicate pages by slug
  • Add proper alternates.languages with x-default for all 6 locales
  • Omit lastModified (no accurate source available without git plugin — absent is better than inaccurate)
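
The deduplication and hreflang steps above can be sketched roughly as follows. The `Page`, `LanguageEntry`, and `SitemapEntry` interfaces are local stand-ins for the Fumadocs and Next.js types, and the URL scheme (default locale `en` served without a prefix, other locales under `/<locale>/`) is an assumption for this sketch, not something the PR states.

```typescript
// Stand-ins for the Fumadocs and Next.js types, declared locally so the sketch is self-contained.
interface Page { slugs: string[] }
interface LanguageEntry { language: string; pages: Page[] }
interface SitemapEntry {
  url: string
  alternates?: { languages?: Record<string, string> }
}

// Assumption: the default locale is served without a path prefix.
const DEFAULT_LOCALE = 'en'

function pageUrl(baseUrl: string, locale: string, slugs: string[]): string {
  const prefix = locale === DEFAULT_LOCALE ? '' : `/${locale}`
  const path = slugs.length > 0 ? `/${slugs.join('/')}` : ''
  return `${baseUrl}${prefix}${path}`
}

/** Emits one entry per slug, with every locale listed as an hreflang alternate. */
function buildDocsSitemap(baseUrl: string, languages: LanguageEntry[]): SitemapEntry[] {
  // Group locales by slug so each page yields one entry instead of one per language.
  const localesBySlug: Record<string, string[]> = {}
  for (const { language, pages } of languages) {
    for (const page of pages) {
      const key = page.slugs.join('/')
      const locales = localesBySlug[key] ?? []
      locales.push(language)
      localesBySlug[key] = locales
    }
  }
  return Object.keys(localesBySlug).map((key) => {
    const slugs = key === '' ? [] : key.split('/')
    const alternates: Record<string, string> = {}
    for (const locale of localesBySlug[key] ?? []) {
      alternates[locale] = pageUrl(baseUrl, locale, slugs)
    }
    const canonical = pageUrl(baseUrl, DEFAULT_LOCALE, slugs)
    // x-default tells crawlers which URL to use when no locale matches.
    alternates['x-default'] = canonical
    return { url: canonical, alternates: { languages: alternates } }
  })
}
```

In the real file, the `languages` argument would presumably come from `source.getLanguages()`, and lastModified stays absent as described above.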

Docs robots (apps/docs/app/robots.txt/route.ts)

  • Move disallow rules before allow under User-agent: *
  • Extract hardcoded baseUrl to env variable with production fallback
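
As a rough sketch of the docs robots change, the body of the robots.txt response could be assembled as below. `NEXT_PUBLIC_DOCS_URL` is the variable named in this PR; the fallback URL, the `/api/` disallow path, and the `buildRobotsTxt` helper are placeholders invented for this example. The environment is passed in as a parameter so the function can be exercised without a Node.js runtime.

```typescript
// Hypothetical fallback; the PR does not state the production URL.
const FALLBACK_DOCS_URL = 'https://docs.example.com'

/** Builds the robots.txt body with Disallow rules listed before Allow. */
function buildRobotsTxt(env: Record<string, string | undefined>): string {
  const baseUrl = env['NEXT_PUBLIC_DOCS_URL'] ?? FALLBACK_DOCS_URL
  return [
    'User-agent: *',
    'Disallow: /api/', // hypothetical path; the real disallow list is not shown in the PR
    'Allow: /',
    '',
    `Sitemap: ${baseUrl}/sitemap.xml`,
  ].join('\n')
}
```

In a route handler, the returned string would typically be wrapped in a `Response` with a `text/plain` content type, with `process.env` passed as the argument.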

Type of Change

  • Bug fix
  • Improvement

Testing

  • Both apps pass TypeScript type-check with no errors
  • Pre-commit hooks (biome lint/format) pass
  • Verified all sitemap URLs reference existing routes
  • Verified blog canonical field uses absolute URLs (enforced by Zod z.string().url())
  • Verified authors field is always populated (throws if empty)
  • Verified integration slug, model href, and course slug values match their routes

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

- Add missing pages to sim sitemap: blog author pages, academy catalog and course pages
- Fix 6x duplicate URL bug in docs sitemap by deduplicating with source.getLanguages()
- Convert docs sitemap from route handler to Next.js metadata convention with native hreflang
- Add x-default hreflang alternate for docs multi-language pages
- Remove changeFrequency and priority fields (Google ignores both)
- Fix inaccurate lastModified timestamps — derive from real content dates, omit when unknown
- Consolidate 20+ redundant per-bot robots rules into single wildcard entry
- Add /form/ and /credential-account/ to sim robots disallow list
- Reference image sitemap in sim robots.txt
- Remove deprecated host directive from sim robots
- Move disallow rules before allow in docs robots for crawler compatibility
- Extract hardcoded docs baseUrl to env variable with production fallback
@vercel

vercel bot commented Apr 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Apr 15, 2026 2:56am |

@cursor

cursor bot commented Apr 15, 2026

PR Summary

Medium Risk
Changes sitemap/robots generation and URL discovery for both apps; mistakes could reduce crawling/indexing coverage or expose/disallow unintended routes. No auth/data-path logic is modified.

Overview
Updates SEO surface for both apps/docs and apps/sim. Docs now uses a Next.js MetadataRoute.Sitemap (apps/docs/app/sitemap.ts) instead of a custom sitemap.xml route, deduplicating pages by slug and emitting hreflang alternates (including x-default) while allowing the base URL to be configured via NEXT_PUBLIC_DOCS_URL.

Sim’s sitemap (apps/sim/app/sitemap.ts) now includes additional public URLs (blog author pages and academy course pages) and adjusts lastModified signals to be content-derived while dropping priority/changeFrequency. Sim’s robots rules (apps/sim/app/robots.ts) are consolidated to a single wildcard rule with expanded disallows and now advertise both the main sitemap and the blog image sitemap.

Reviewed by Cursor Bugbot for commit eeac0f8. Configure here.

@greptile-apps
Contributor

greptile-apps bot commented Apr 15, 2026

Greptile Summary

This PR fixes SEO issues across two apps: the docs sitemap had a 6× duplicate-URL bug (one entry per locale with no hreflang), and the sim sitemap was missing public pages while using new Date() as lastModified for all entries. The robots.txt files are cleaned up by consolidating redundant per-bot rules and adding missing disallow paths.

Confidence Score: 5/5

Safe to merge — all findings are P2 suggestions that don't block correctness in production.

The two remaining findings are both P2: the homepage still using new Date() is an inconsistency with the PR's stated goal but doesn't break anything, and the missing empty-array guard on latestModelDate is theoretical given the static data always contains models. All structural changes are correct.

apps/sim/app/sitemap.ts — homepage lastModified and latestModelDate guard.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/app/sitemap.ts | Adds blog author pages and academy pages; removes changeFrequency/priority; fixes most lastModified timestamps, but the homepage retains new Date() and latestModelDate lacks an empty-array guard. |
| apps/sim/app/robots.ts | Consolidates 20+ per-bot rules into a single wildcard, adds /form/ and /credential-account/ to the disallow list, and references the image sitemap. Clean simplification with no issues. |
| apps/docs/app/sitemap.ts | Converts from raw XML route handler to Next.js MetadataRoute.Sitemap convention; correctly deduplicates by slug and adds proper alternates.languages with x-default for all locales. |
| apps/docs/app/robots.txt/route.ts | Moves Disallow rules before Allow under User-agent: * (correct order for first-match parsers), extracts baseUrl to env variable, removes redundant multi-language comment block. |
| apps/docs/app/sitemap.xml/route.ts | Deleted; replaced by the new apps/docs/app/sitemap.ts using the Next.js MetadataRoute.Sitemap convention. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["sitemap()"] --> B["getAllPostMeta()"]
    B --> C{"posts.length > 0?"}
    C -- yes --> D["latestPostDate = max post date"]
    C -- no --> E["latestPostDate = undefined"]
    A --> F["latestModelDate = max model updatedAt (no guard)"]
    A --> G["staticPages: homepage uses new Date()"]
    A --> H["blogPages: per-post updated/date"]
    A --> I["authorPages: max post date per author"]
    A --> J["integrationPages: no lastModified"]
    A --> K["providerPages: max model date"]
    A --> L["modelEntries: per-model updatedAt"]
    A --> M["academyPages: no lastModified"]
    D --> G
    E --> G
    F --> G
    G --> N["MetadataRoute.Sitemap"]
    H --> N
    I --> N
    J --> N
    K --> N
    L --> N
    M --> N

Reviews (1): Last reviewed commit: "improvement(seo): optimize sitemaps and ..."

Comment on lines 27 to 29
       url: baseUrl,
-      lastModified: now,
-      changeFrequency: 'daily',
-      priority: 1.0,
+      lastModified: new Date(),
     },

P2 Homepage still uses new Date() as lastModified

The PR's stated rationale is that "new Date() as lastModified on static content trains Google to distrust the signal." All other static pages were either given accurate content-derived dates or had lastModified omitted — but the homepage still has new Date(), so it'll reflect the sitemap generation time rather than an actual page-change date. Without a revalidate export in this file, each ISR revalidation produces a new value.

Suggested change
url: baseUrl,
lastModified: now,
changeFrequency: 'daily',
priority: 1.0,
lastModified: new Date(),
},
{
url: baseUrl,
},

Or use a pinned date that reflects the last meaningful homepage change.

Comment on lines +17 to +23
const latestModelDate = new Date(
  Math.max(
    ...MODEL_PROVIDERS_WITH_CATALOGS.flatMap((provider) =>
      provider.models.map((model) => new Date(model.pricing.updatedAt).getTime())
    )
  )
)

P2 Missing empty-array guard — potential Invalid Date

latestPostDate is guarded with posts.length > 0, but latestModelDate is not. If MODEL_PROVIDERS_WITH_CATALOGS were empty (or all providers had zero models after the filter), flatMap produces [], Math.max(...[]) returns -Infinity, and new Date(-Infinity).toISOString() throws RangeError: Invalid time value, crashing sitemap generation. Applying the same guard pattern would be consistent:

Suggested change
-const latestModelDate = new Date(
-  Math.max(
-    ...MODEL_PROVIDERS_WITH_CATALOGS.flatMap((provider) =>
-      provider.models.map((model) => new Date(model.pricing.updatedAt).getTime())
-    )
-  )
-)
+const modelTimes = MODEL_PROVIDERS_WITH_CATALOGS.flatMap((provider) =>
+  provider.models.map((model) => new Date(model.pricing.updatedAt).getTime())
+)
+const latestModelDate = modelTimes.length > 0 ? new Date(Math.max(...modelTimes)) : undefined
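
The failure mode described in this comment can be reproduced in isolation. The snippet below is a standalone illustration, not code from the PR; `latestDate` is a hypothetical helper that applies the same guard to a plain array of timestamps.

```typescript
/** Returns the latest timestamp as a Date, or undefined when there are no timestamps. */
function latestDate(times: number[]): Date | undefined {
  return times.length > 0 ? new Date(Math.max(...times)) : undefined
}

// Without the guard: Math.max() with no arguments is -Infinity, which yields an Invalid Date.
const unguarded = new Date(Math.max(...([] as number[])))

// Serializing an Invalid Date is what actually throws.
let threwRangeError = false
try {
  unguarded.toISOString()
} catch (error) {
  threwRangeError = error instanceof RangeError
}
```

Returning `undefined` lets the caller omit lastModified entirely, which matches how the PR treats pages without a reliable date.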

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Changelog lastModified incorrectly derived from blog posts
    • Removed the incorrect lastModified from the changelog sitemap entry since it's driven by GitHub releases, not blog posts, matching the pattern used for /partners.

To push these changes, comment:

@cursor push 473032693b
Preview (473032693b)
diff --git a/apps/sim/app/sitemap.ts b/apps/sim/app/sitemap.ts
--- a/apps/sim/app/sitemap.ts
+++ b/apps/sim/app/sitemap.ts
@@ -37,7 +37,6 @@
     },
     {
       url: `${baseUrl}/changelog`,
-      lastModified: latestPostDate,
     },
     {
       url: `${baseUrl}/integrations`,

       url: `${baseUrl}/changelog`,
-      lastModified: now,
+      lastModified: latestPostDate,
     },

Changelog lastModified incorrectly derived from blog posts

Medium Severity

The /changelog entry uses latestPostDate as its lastModified, but the changelog page is driven by GitHub releases (fetched from api.github.com/repos/simstudioai/sim/releases), not blog posts. This gives search engines an inaccurate modification date that reflects the latest blog post update rather than when the changelog actually changed. Given the PR's stated goal of fixing inaccurate lastModified timestamps, this entry would be better off omitting lastModified entirely (like /partners) since no accurate source is available at sitemap-generation time.
