Description
The DiffRowGenerator class offers the lineNormalizer property. By default, it is used to replace < and > with their escaped versions &lt; and &gt;.
The lineNormalizer is applied to the input texts before the diff is calculated. While I see this as a useful feature, with the default settings it might be surprising that the resulting text is no longer properly HTML-escaped:
final var generator = DiffRowGenerator.create() //
    .mergeOriginalRevised(true) //
    .showInlineDiffs(true) //
    .inlineDiffByWord(true) //
    .build();
final var rows = generator.generateDiffRows(List.of("hello <world>"), List.of("bye >world<"));
final var resultingText = rows.stream() //
    .map(DiffRow::getOldLine) //
    .collect(Collectors.joining(StringUtils.LF));
The resulting text is
<span class="editOldInline">hello</span><span class="editNewInline">bye</span> &<span class="editOldInline">lt</span><span class="editNewInline">gt</span>;world&<span class="editOldInline">gt</span><span class="editNewInline">lt</span>;
Note that the & is treated as unchanged text because both replacements, &lt; and &gt;, start with an ampersand. The same happens with the trailing semicolon. The resulting text is therefore no longer valid HTML.
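To illustrate why the entities get torn apart, here is a minimal, library-independent sketch. The class NormalizerSplitSketch, its methods, and the delimiter set are assumptions made for illustration; the library's actual SPLIT_BY_WORD_PATTERN may differ. The only property that matters here is that & and ; count as word delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NormalizerSplitSketch {

    // Hypothetical delimiter set, NOT the library's real pattern.
    // Assumption: '&' and ';' are treated as word delimiters.
    private static final Pattern DELIMITERS = Pattern.compile("[\\s&;<>]");

    // Mirrors the default normalization described above: escape '<' and '>'.
    static String normalize(String line) {
        return line.replace("<", "&lt;").replace(">", "&gt;");
    }

    // Splits a line into word tokens, keeping every delimiter as its own token.
    static List<String> splitWords(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = DELIMITERS.matcher(line);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                tokens.add(line.substring(pos, m.start()));
            }
            tokens.add(m.group());
            pos = m.end();
        }
        if (pos < line.length()) {
            tokens.add(line.substring(pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Normalization runs first, so the word diff sees the escaped text.
        System.out.println(splitWords(normalize("hello <world>")));
        System.out.println(splitWords(normalize("bye >world<")));
    }
}
```

Under these assumptions both token lists contain & and ; at the same positions, so a word-level diff marks only lt and gt as changed and wraps the span tags around the inside of each entity.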
In order for this behaviour to be a problem, the following conditions must all be true:
- inlineDiffByWord must be used
- The default lineNormalizer must be used
- The two provided texts must differ at a position which starts with a character that is replaced by the lineNormalizer
- A release >= 4.15 must be used
Workaround
Override the lineNormalizer, e.g. by using the SPLIT_BY_WORD_PATTERN of release 4.12, in which the ampersand was not considered a character that splits words.
Solution approaches
IMHO, the SPLIT_BY_WORD_PATTERN of release 4.15+ is fine and I do not consider it to be the problem.
The library could offer one of the following features:
- a parameter which defines when the lineNormalizer should be applied (before diff-ing or after)
- a second type of line-normalizer that is applied after diff-ing
- an option to have the library apply the processDiffs function to non-diffs as well
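The idea behind these approaches, escaping after the diff instead of before it, can be sketched independently of the library. Everything below (PostDiffEscapeSketch, Segment, escape, render) is hypothetical and not part of the java-diff-utils API; it only assumes that the diff yields a sequence of changed and unchanged text pieces from the raw, unescaped input.

```java
import java.util.List;

class PostDiffEscapeSketch {

    // Hypothetical stand-in for one piece of an inline diff; not a library type.
    static final class Segment {
        final String text;
        final String cssClass; // null means the text is unchanged

        Segment(String text, String cssClass) {
            this.text = text;
            this.cssClass = cssClass;
        }
    }

    // Escapes HTML special characters; '&' must be replaced first.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Escapes every segment, changed or not, and only then adds the span tags,
    // so a tag can never end up inside an entity.
    static String render(List<Segment> segments) {
        StringBuilder sb = new StringBuilder();
        for (Segment s : segments) {
            String escaped = escape(s.text);
            if (s.cssClass == null) {
                sb.append(escaped);
            } else {
                sb.append("<span class=\"").append(s.cssClass).append("\">")
                  .append(escaped).append("</span>");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Segments a word diff of the raw inputs from the example might produce.
        List<Segment> segments = List.of(
            new Segment("hello", "editOldInline"),
            new Segment("bye", "editNewInline"),
            new Segment(" ", null),
            new Segment("<", "editOldInline"),
            new Segment(">", "editNewInline"),
            new Segment("world", null),
            new Segment(">", "editOldInline"),
            new Segment("<", "editNewInline"));
        System.out.println(render(segments));
    }
}
```

Because each entity is produced inside a single segment, it stays intact in the output regardless of how the word splitter treats the ampersand.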