feat(metadata): extract <link rel="alternate"> tags into metadata #3250
Open
firecrawl-spring[bot] wants to merge 1 commit into main
Add `alternateLinks` field to metadata that captures all `<link rel="alternate">` tags from the HTML head, including the href, type, title, and hreflang attributes. This enables feed discovery (RSS/Atom) and hreflang detection without requiring rawHtml + manual parsing. Implemented in both the Rust native extractor and the TypeScript/cheerio fallback, with corresponding type updates in v1 and v2 Document types.

Co-Authored-By: micahstairs <micah@sideguide.dev>
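To make the described behaviour concrete, here is a minimal sketch of the per-tag logic, not the PR's actual code. `LinkAttrs`, `AlternateLink`, `KEPT_ATTRS`, and `collectAlternateLinks` are hypothetical names, and the input is a list of plain attribute maps rather than a cheerio or Rust DOM so that the snippet runs on its own. It assumes `rel` is checked as a whitespace-separated token list and that the field is left undefined when nothing matches.

```typescript
// Hypothetical names for illustration only; the real extractors walk a parsed DOM.
type LinkAttrs = Record<string, string | undefined>;

interface AlternateLink {
  href?: string;
  type?: string;
  title?: string;
  hreflang?: string;
}

// The four attributes the description says are captured.
const KEPT_ATTRS = ["href", "type", "title", "hreflang"] as const;

function collectAlternateLinks(links: LinkAttrs[]): AlternateLink[] | undefined {
  const out: AlternateLink[] = [];
  for (const attrs of links) {
    // Assumption: treat rel as a whitespace-separated token list.
    const rel = attrs.rel ?? "";
    if (!rel.split(/\s+/).includes("alternate")) continue;

    // Copy only the attributes that are actually present on the tag.
    const link: AlternateLink = {};
    for (const key of KEPT_ATTRS) {
      const value = attrs[key];
      if (value !== undefined) link[key] = value;
    }
    out.push(link);
  }
  // Leave the field undefined (absent) rather than returning an empty array.
  return out.length > 0 ? out : undefined;
}

const found = collectAlternateLinks([
  { rel: "stylesheet", href: "/main.css" },
  { rel: "alternate", type: "application/rss+xml", href: "/feed/", title: "Feed" },
]);
console.log(found);
```

Returning `undefined` instead of `[]` lets a serializer drop the key entirely, which matches the "absent, not empty array" expectation in the test plan.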
1 issue found across 4 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="apps/api/src/scraper/scrapeURL/lib/extractMetadata.ts">
<violation number="1" location="apps/api/src/scraper/scrapeURL/lib/extractMetadata.ts:156">
P2: Match `rel` as a token, not an exact string. `rel` is space-separated; the current selector misses valid `<link rel="alternate ...">` tags that include additional tokens.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
+ title?: string;
+ hreflang?: string;
+ }[] = [];
+ soup('link[rel="alternate"]').each((_, elem) => {
P2: Match `rel` as a token, not an exact string. `rel` is space-separated; the current selector misses valid `<link rel="alternate ...">` tags that include additional tokens.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/api/src/scraper/scrapeURL/lib/extractMetadata.ts, line 156:
<comment>Match `rel` as a token, not an exact string. `rel` is space-separated; the current selector misses valid `<link rel="alternate ...">` tags that include additional tokens.</comment>
<file context>
@@ -143,6 +146,32 @@ export async function extractMetadata(
+ title?: string;
+ hreflang?: string;
+ }[] = [];
+ soup('link[rel="alternate"]').each((_, elem) => {
+ const link: {
+ href?: string;
</file context>
Suggested change
- soup('link[rel="alternate"]').each((_, elem) => {
+ soup('link[rel~="alternate"]').each((_, elem) => {
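The difference between the two selectors can be illustrated without a DOM. The sketch below only mimics the matching rules of `[rel="alternate"]` (exact string) and `[rel~="alternate"]` (whitespace-separated token); it is not how cheerio implements them, and `relEquals` and `relHasToken` are hypothetical helper names.

```typescript
// Illustrative only: imitates the two CSS attribute-selector semantics on a raw rel string.

/** Exact match, as in [rel="alternate"]. */
function relEquals(rel: string, value: string): boolean {
  return rel === value;
}

/** Token match, as in [rel~="alternate"]. */
function relHasToken(rel: string, token: string): boolean {
  // Split on runs of whitespace and drop empty strings left by leading/trailing spaces.
  return rel.split(/\s+/).filter(Boolean).includes(token);
}

const samples = ["alternate", "alternate stylesheet", "stylesheet alternate", "alternative"];
for (const rel of samples) {
  // Columns: rel value, exact-match result, token-match result.
  console.log(JSON.stringify(rel), relEquals(rel, "alternate"), relHasToken(rel, "alternate"));
}
```

One consequence worth keeping in mind: a token match also accepts `rel="alternate stylesheet"`, which HTML uses for alternative style sheets rather than feeds, so a consumer may still want to filter on `type`.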
Summary
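- Adds `alternateLinks` field to scrape response metadata that captures all `<link rel="alternate">` tags from HTML `<head>`
- Includes `href`, `type`, `title`, and `hreflang` attributes
- Enables feed discovery (RSS/Atom) and hreflang detection without requiring `rawHtml` + manual parsing

Changes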
- Rust native extractor (`html.rs`): Added extraction of `<link rel="alternate">` tags after DC terms metadata
- TypeScript/cheerio fallback (`extractMetadata.ts`): Added matching cheerio-based extraction
- Types (`v1/types.ts`, `v2/types.ts`): Added `alternateLinks` field to Document metadata type

Example response
{
  "metadata": {
    "alternateLinks": [
      {
        "href": "https://www.saastr.com/feed/",
        "type": "application/rss+xml",
        "title": "SaaStr RSS Feed"
      },
      {
        "href": "https://www.saastr.com/feed/atom/",
        "type": "application/atom+xml",
        "title": "SaaStr Atom Feed"
      }
    ]
  }
}

Context
Customer request: currently users need to request `rawHtml` and parse `<link rel="alternate">` tags themselves to discover RSS/Atom feeds or hreflang links. This makes it a first-class metadata field.

Test plan
- Verify `alternateLinks` contains RSS and Atom feed entries
- Verify the `hreflang` attribute is captured
- Scrape a page without `<link rel="alternate">` tags and verify the field is absent (not an empty array)

Summary by cubic
Adds `alternateLinks` to document metadata by extracting all `<link rel="alternate">` tags from the HTML head. This enables RSS/Atom feed discovery and hreflang detection without parsing `rawHtml`.

- Captures `href`, `type`, `title`, and `hreflang` for each alternate link.
- Implemented in the Rust extractor and the `cheerio` fallback; field is omitted when no matches exist.
- Updated `v1` and `v2` `Document` types to include `alternateLinks`.

Written for commit 493f43f. Summary will update on new commits.
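For readers wondering how a client might consume the new field, here is a rough sketch. The `AlternateLink` and `ScrapeMetadata` interfaces are assumptions modelled on the example response in the Summary, not the actual v1/v2 `Document` types, and `feedUrls` and `hreflangMap` are hypothetical helpers.

```typescript
// Assumed shapes based on the example response; not the real Document types.
interface AlternateLink {
  href?: string;
  type?: string;
  title?: string;
  hreflang?: string;
}

interface ScrapeMetadata {
  alternateLinks?: AlternateLink[]; // absent when the page has no matching tags
}

const FEED_TYPES = new Set(["application/rss+xml", "application/atom+xml"]);

/** Collect hrefs of entries whose type looks like an RSS or Atom feed. */
function feedUrls(metadata: ScrapeMetadata): string[] {
  const urls: string[] = [];
  for (const link of metadata.alternateLinks ?? []) {
    const type = link.type?.trim().toLowerCase();
    if (link.href && type && FEED_TYPES.has(type)) urls.push(link.href);
  }
  return urls;
}

/** Map each hreflang value to its href; later duplicates overwrite earlier ones. */
function hreflangMap(metadata: ScrapeMetadata): Map<string, string> {
  const map = new Map<string, string>();
  for (const link of metadata.alternateLinks ?? []) {
    if (link.href && link.hreflang) map.set(link.hreflang, link.href);
  }
  return map;
}

const example: ScrapeMetadata = {
  alternateLinks: [
    { href: "https://www.saastr.com/feed/", type: "application/rss+xml", title: "SaaStr RSS Feed" },
    { href: "https://www.saastr.com/feed/atom/", type: "application/atom+xml", title: "SaaStr Atom Feed" },
  ],
};
console.log(feedUrls(example));
console.log(hreflangMap(example).size);
```

Because the field is omitted rather than empty when there are no matches, the `?? []` fallback keeps both helpers safe to call on any metadata object.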