Line normalization before diff #219

**Description**

The `DiffRowGenerator` class offers the `lineNormalizer` property. By default, it is used to replace `<` and `>` with their escaped versions `&lt;` and `&gt;`.

The `lineNormalizer` is applied to the input texts before the diff is calculated. While I see this as a useful feature, with the default settings it can be surprising that the resulting text is no longer properly HTML-escaped:

```java
final var generator = DiffRowGenerator.create() //
 .mergeOriginalRevised(true) //
 .showInlineDiffs(true) //
 .inlineDiffByWord(true) //
 .build();

final var rows = generator.generateDiffRows(List.of("hello <world>"), List.of("bye >world<"));

final var resultingText = rows.stream() //
 .map(DiffRow::getOldLine) //
 .collect(Collectors.joining(StringUtils.LF));
``` 

The resulting text is
```
hellobye &ltgt;world&gtlt;
``` 

Note that the part ` &` is treated as unchanged text, because both replacements, `&lt;` and `&gt;`, start with an ampersand. The resulting text is therefore no longer valid HTML.
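
To make the mechanism easier to follow, here is a minimal, library-free sketch of "escape first, then split into words". The class name `NormalizeThenSplit`, the helper names, and the pattern `\s+|[&;]` are assumptions made for illustration; the pattern is only a simplified stand-in for the library's `SPLIT_BY_WORD_PATTERN`, not a copy of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NormalizeThenSplit {
    // Assumption: simplified stand-in for the word pattern; it treats
    // whitespace, '&' and ';' as delimiters.
    static final Pattern WORD_SPLIT = Pattern.compile("\\s+|[&;]");

    // Mimics the default normalization described above: escape '<' and '>'.
    static String escape(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    // Splits a string into tokens and keeps the delimiters as tokens.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD_SPLIT.matcher(s);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                out.add(s.substring(pos, m.start()));
            }
            out.add(m.group());
            pos = m.end();
        }
        if (pos < s.length()) {
            out.add(s.substring(pos));
        }
        return out;
    }

    public static void main(String[] args) {
        // Both token lists contain "&" and ";" at the same positions, so a
        // word-level diff sees them as equal and only swaps "lt" and "gt".
        System.out.println(tokens(escape("hello <world>")));
        System.out.println(tokens(escape("bye >world<")));
    }
}
```

With this stand-in pattern, each entity is broken into the three tokens `&`, `lt` or `gt`, and `;`, which matches the shape of the broken output shown above.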

In order for this behaviour to be a problem, the following conditions must all be true:

1. `inlineDiffByWord` must be enabled
2. The default `lineNormalizer` must be used
3. The two provided texts must differ at a position which starts with a character that is replaced by the `lineNormalizer`
4. A release >= 4.15 must be used

**Workaround**
Override the `lineNormalizer`. E.g., by using the `SPLIT_BY_WORD_PATTERN` of release 4.12, in which [the ampersand was not considered a character that splits words](https://github.com/java-diff-utils/java-diff-utils/blob/0fd3bd8e061eed09dbb937c8ab9ba0969ba12264/java-diff-utils/src/main/java/com/github/difflib/text/DiffRowGenerator.java#L70).

**Solution approaches**
IMHO, the `SPLIT_BY_WORD_PATTERN` of release 4.15+ is fine and I do not consider it to be the problem.

The library could offer one of the following features:
1. a parameter which defines when the `lineNormalizer` should be applied (before diff-ing or after)
2. a second type of line-normalizer that is applied after diff-ing
3. an option to have the library apply the [`processDiffs` function](https://github.com/java-diff-utils/java-diff-utils/blob/637cb7b6a309d66ff5e0cec2b3ffea52f867edc7/java-diff-utils/src/main/java/com/github/difflib/text/DiffRowGenerator.java#L190) to non-diffs as well
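
To illustrate the idea behind approaches 1 and 2, here is a rough, library-free sketch in which the raw text is tokenized and compared first, and escaping is applied only when the output is assembled. Everything in it is hypothetical: the class `EscapeAfterDiff`, the pattern `\s+|[<>]`, and in particular the position-by-position comparison, which is a toy substitute for a real diff and only works for inputs with the same number of tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeAfterDiff {
    // Assumption: simplified word pattern working on the raw (unescaped) text.
    static final Pattern WORD_SPLIT = Pattern.compile("\\s+|[<>]");

    // Escapes '<' and '>' as the default normalizer is described to do.
    static String escape(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    // Splits a string into tokens and keeps the delimiters as tokens.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD_SPLIT.matcher(s);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                out.add(s.substring(pos, m.start()));
            }
            out.add(m.group());
            pos = m.end();
        }
        if (pos < s.length()) {
            out.add(s.substring(pos));
        }
        return out;
    }

    // Toy merge: compares tokens position by position (not a real diff) and
    // escapes every token only when it is written to the result.
    static String mergeThenEscape(String oldLine, String newLine) {
        List<String> a = tokens(oldLine);
        List<String> b = tokens(newLine);
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("toy comparison needs equal token counts");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.size(); i++) {
            sb.append(escape(a.get(i)));
            if (!a.get(i).equals(b.get(i))) {
                sb.append(escape(b.get(i)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mergeThenEscape("hello <world>", "bye >world<"));
    }
}
```

Because each token is escaped as a whole after the comparison, an entity such as `&lt;` can no longer be split into an "equal" and a "changed" part. A real implementation would also have to make sure that any inline tags added by the generator are not escaped along with the text.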
