fix: title normalization in duplicate detection (#948) #8654

Open
vonguyenkhang wants to merge 1 commit into FreshRSS:edge from vonguyenkhang:948-normalize-titles-duplicate-prevention

@vonguyenkhang

Standardize entry titles (HTML entities, case, whitespace) before checking for duplicates in feeds and categories.
Without this normalization, some users have reported "hit or miss" behaviour after enabling Mark an article as read… if an identical title already exists in the top n newest articles of the category (#948 (comment), #948 (comment), #948 (comment)).

Closes #

Changes proposed in this pull request:

  • Added FreshRSS_Entry::normalizeTitle() to decode HTML entities, collapse whitespace, and lowercase strings.
  • Applied this normalization to the $titlesAsRead lookup in both Feed and Category scopes.
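
To make the two changes above concrete, here is a minimal sketch of the idea in Python. It is not the PHP code from this pull request: `normalize_title`, `build_titles_as_read`, and `is_duplicate` are hypothetical names chosen for illustration, and the exact steps and their order in `FreshRSS_Entry::normalizeTitle()` may differ.

```python
import html
import re


def normalize_title(title: str) -> str:
    """Hypothetical stand-in for the normalization described above."""
    decoded = html.unescape(title)                    # decode HTML entities, e.g. "&amp;" -> "&"
    collapsed = re.sub(r"\s+", " ", decoded).strip()  # collapse whitespace runs, trim the ends
    return collapsed.lower()                          # compare case-insensitively


def build_titles_as_read(recent_titles):
    """Assumed shape of the lookup: a set of normalized titles of the newest articles."""
    return {normalize_title(t) for t in recent_titles}


def is_duplicate(title: str, titles_as_read) -> bool:
    """True if the normalized title already exists in the lookup."""
    return normalize_title(title) in titles_as_read


# Usage: two spellings of the same headline compare equal after normalization.
lookup = build_titles_as_read(["Tom &amp; Jerry"])
print(is_duplicate("tom  &  jerry", lookup))
```

The key point is that the same normalization is applied both when building the lookup and when testing a new title; normalizing only one side would reintroduce the mismatch.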

How to test the feature manually:

  1. Subscription management > Add a feed or category: enter "RNZ" under Add a category, then click Add.
  2. Click the gear icon ⚙️ to the left of the RNZ category. Under Filter actions, tick Mark an article as read… ☑️ if an identical title already exists in the top n newest articles of the category, enter 100, and click Submit.
  3. Back in the RNZ category, click ➕ Add a feed, then paste the URL of the "RNZ World Headlines" feed. Use the default settings for Article unicity criteria. Repeat for "RNZ New Zealand Headlines".
  4. Back on the FreshRSS main page, click the Show read button (open envelope icon) in the navigation menu. One of the two entries for the article "Iranian diaspora form human chain on Wellington waterfront" is correctly marked as read.
  5. Observe for some time to confirm that the "hit or miss" behaviour no longer occurs.

Pull request checklist:

  • clear commit messages
  • code manually tested
  • unit tests written (optional if too hard)
  • documentation updated

Additional information can be found in the documentation.

Alkarex added this to the 1.29.0 milestone on Mar 30, 2026
@Alkarex (Member) commented on Mar 30, 2026

Could you please try to provide:

  • An example of a situation with real-life feeds (URLs) where this patch is needed
  • Some XML code snippets showing those real-life variations
