
feat(metadata): extract <link rel="alternate"> tags into metadata #3250

Open
firecrawl-spring[bot] wants to merge 1 commit into main from feat/metadata-alternate-links


firecrawl-spring[bot] commented Mar 30, 2026

Summary

  • Adds alternateLinks field to scrape response metadata that captures all <link rel="alternate"> tags from HTML <head>
  • Each entry includes href, type, title, and hreflang attributes
  • Enables RSS/Atom feed discovery and hreflang detection without requiring rawHtml + manual parsing
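
For the hreflang use case, a consumer might index the entries by language. The following is a minimal sketch, assuming the entry shape described above with every attribute optional; `byHreflang` is a hypothetical helper and not part of this PR.

```typescript
// Entry shape assumed from the fields listed in this PR; all optional
// because a <link> tag may omit any of these attributes.
interface AlternateLink {
  href?: string;
  type?: string;
  title?: string;
  hreflang?: string;
}

// Hypothetical helper: map each hreflang value to the first href seen for it.
// Language tags are compared case-insensitively, so keys are lowercased.
function byHreflang(links: AlternateLink[] = []): Map<string, string> {
  const out = new Map<string, string>();
  for (const link of links) {
    if (!link.hreflang || !link.href) continue; // skip feeds and incomplete entries
    const lang = link.hreflang.toLowerCase();
    if (!out.has(lang)) out.set(lang, link.href);
  }
  return out;
}

// Illustrative data only; example.com URLs are placeholders.
const languages = byHreflang([
  { href: "https://example.com/feed/", type: "application/rss+xml" },
  { href: "https://example.com/de/", hreflang: "de" },
  { href: "https://example.com/fr/", hreflang: "FR" },
]);
console.log(languages.get("de")); // "https://example.com/de/"
console.log(languages.get("fr")); // "https://example.com/fr/"
```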

Changes

  • Rust native extractor (html.rs): Added extraction of <link rel="alternate"> tags after DC terms metadata
  • TypeScript fallback (extractMetadata.ts): Added matching cheerio-based extraction
  • Types (v1/types.ts, v2/types.ts): Added alternateLinks field to Document metadata type

Example response

{
  "metadata": {
    "alternateLinks": [
      {
        "href": "https://www.saastr.com/feed/",
        "type": "application/rss+xml",
        "title": "SaaStr RSS Feed"
      },
      {
        "href": "https://www.saastr.com/feed/atom/",
        "type": "application/atom+xml",
        "title": "SaaStr Atom Feed"
      }
    ]
  }
}
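
A response like the one above could be consumed as follows. This is a sketch under stated assumptions: the `ScrapeMetadata` type is trimmed to the one field this PR adds, `findFeedUrls` is a hypothetical helper, and the set of MIME types treated as feeds is an illustrative choice rather than something the PR defines.

```typescript
interface AlternateLink {
  href?: string;
  type?: string;
  title?: string;
  hreflang?: string;
}

// Trimmed-down metadata type: only the field added by this PR is modeled.
interface ScrapeMetadata {
  alternateLinks?: AlternateLink[];
}

// Assumption for this sketch: these two MIME types identify feeds.
const FEED_TYPES = new Set(["application/rss+xml", "application/atom+xml"]);

// Hypothetical helper: collect hrefs of feed-typed alternate links.
// Handles the field being absent, which is the documented no-match behavior.
function findFeedUrls(metadata: ScrapeMetadata): string[] {
  const urls: string[] = [];
  for (const link of metadata.alternateLinks ?? []) {
    const type = link.type?.trim().toLowerCase();
    if (type && FEED_TYPES.has(type) && link.href) urls.push(link.href);
  }
  return urls;
}

const metadata: ScrapeMetadata = {
  alternateLinks: [
    { href: "https://www.saastr.com/feed/", type: "application/rss+xml", title: "SaaStr RSS Feed" },
    { href: "https://www.saastr.com/feed/atom/", type: "application/atom+xml", title: "SaaStr Atom Feed" },
  ],
};
console.log(findFeedUrls(metadata));
// [ "https://www.saastr.com/feed/", "https://www.saastr.com/feed/atom/" ]
```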

Context

Customer request: currently, users need to request rawHtml and parse <link rel="alternate"> tags themselves to discover RSS/Atom feeds or hreflang links. This change makes alternateLinks a first-class metadata field.

Test plan

  • Scrape saastr.com and verify alternateLinks contains RSS and Atom feed entries
  • Scrape a site with hreflang tags and verify hreflang attribute is captured
  • Scrape a site with no <link rel="alternate"> tags and verify field is absent (not empty array)
  • Verify Rust extractor produces same results as TypeScript fallback
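
The "absent, not empty array" expectation in the test plan can be pinned down with a small check. The sketch below is not the PR's implementation; `withAlternateLinks` and the trimmed `Metadata` type are hypothetical names used only to illustrate the omit-when-empty behavior.

```typescript
interface AlternateLink {
  href?: string;
  type?: string;
  title?: string;
  hreflang?: string;
}

// Hypothetical, trimmed-down metadata type for illustration.
interface Metadata {
  title?: string;
  alternateLinks?: AlternateLink[];
}

// Hypothetical helper: set the field only when at least one link was found,
// so the serialized response has no "alternateLinks" key otherwise.
function withAlternateLinks(metadata: Metadata, links: AlternateLink[]): Metadata {
  const result: Metadata = { ...metadata };
  if (links.length > 0) result.alternateLinks = links;
  return result;
}

const none = withAlternateLinks({ title: "No alternates" }, []);
const localized = withAlternateLinks({ title: "Localized" }, [
  { href: "https://example.com/de/", hreflang: "de" }, // placeholder URL
]);

console.log("alternateLinks" in none); // false
console.log(JSON.stringify(none)); // {"title":"No alternates"}
console.log(localized.alternateLinks?.[0]?.hreflang); // "de"
```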

Summary by cubic

Adds alternateLinks to document metadata by extracting all <link rel="alternate"> tags from the HTML head. This enables RSS/Atom feed discovery and hreflang detection without parsing rawHtml.

  • New Features
    • Captures href, type, title, and hreflang for each alternate link.
    • Implemented in the Rust native extractor and the TypeScript cheerio fallback; field is omitted when no matches exist.
    • Updated v1 and v2 Document types to include alternateLinks.

Written for commit 493f43f. Summary will update on new commits.

Add `alternateLinks` field to metadata that captures all
<link rel="alternate"> tags from HTML head, including href, type,
title, and hreflang attributes. This enables feed discovery (RSS/Atom)
and hreflang detection without requiring rawHtml + manual parsing.

Implemented in both the Rust native extractor and the TypeScript/cheerio
fallback, with corresponding type updates in v1 and v2 Document types.

Co-Authored-By: micahstairs <micah@sideguide.dev>
@firecrawl-spring firecrawl-spring Bot requested a review from mogery as a code owner March 30, 2026 13:48
@firecrawl-spring firecrawl-spring Bot requested a review from micahstairs March 30, 2026 13:48

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 4 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/api/src/scraper/scrapeURL/lib/extractMetadata.ts">

<violation number="1" location="apps/api/src/scraper/scrapeURL/lib/extractMetadata.ts:156">
P2: Match `rel` as a token, not an exact string. `rel` is space-separated; the current selector misses valid `<link rel="alternate ...">` tags that include additional tokens.</violation>
</file>


title?: string;
hreflang?: string;
}[] = [];
soup('link[rel="alternate"]').each((_, elem) => {

cubic-dev-ai[bot] commented Mar 30, 2026


P2: Match rel as a token, not an exact string. rel is space-separated; the current selector misses valid <link rel="alternate ..."> tags that include additional tokens.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/scraper/scrapeURL/lib/extractMetadata.ts, line 156:

<comment>Match `rel` as a token, not an exact string. `rel` is space-separated; the current selector misses valid `<link rel="alternate ...">` tags that include additional tokens.</comment>

<file context>
@@ -143,6 +146,32 @@ export async function extractMetadata(
+      title?: string;
+      hreflang?: string;
+    }[] = [];
+    soup('link[rel="alternate"]').each((_, elem) => {
+      const link: {
+        href?: string;
</file context>
Suggested change
- soup('link[rel="alternate"]').each((_, elem) => {
+ soup('link[rel~="alternate"]').each((_, elem) => {
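
To make the difference between the two selectors concrete, the sketch below mimics what each one tests against a `rel` value, without depending on cheerio. `relEquals` and `relHasToken` are hypothetical helpers written for illustration; they compare case-sensitively, which is a simplification.

```typescript
// Mirrors [rel="alternate"]: the whole attribute value must equal the string.
function relEquals(rel: string, value: string): boolean {
  return rel === value;
}

// Mirrors [rel~="alternate"]: the value is treated as a whitespace-separated
// list of words, and one of them must equal the token exactly.
function relHasToken(rel: string, token: string): boolean {
  return rel
    .split(/[ \t\n\f\r]+/)
    .filter(word => word.length > 0)
    .includes(token);
}

console.log(relEquals("alternate", "alternate")); // true
console.log(relEquals("alternate stylesheet", "alternate")); // false: exact match misses it
console.log(relHasToken("alternate stylesheet", "alternate")); // true
console.log(relHasToken("alternates", "alternate")); // false: substring is not a token
```

Token matching also accepts combinations such as rel="alternate stylesheet", so a consumer that only wants feeds or hreflang entries may still need to filter on type or hreflang afterwards.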