[core] CPD is always case sensitive #4396

@adangel

Affects PMD Version: 6.x

Description:

Some languages, like PL/SQL or the new T-SQL (#4390), are case-insensitive. Tokenizing handles this correctly, i.e. the lexers are agnostic to casing. JavaCC has a grammar option for this (IGNORE_CASE), and ANTLR has had one as well since 4.10 (caseInsensitive).

However, when we convert the original tokens into CPD TokenEntries, we don't seem to use the token kind. Instead we use the original token text, which preserves the original casing. It's therefore very easy to evade duplicate detection for these languages by just changing the casing:

echo 'select a, b, c, d, e, f from table where x = 1 and y = 2;' > file1.plsql
cp file1.plsql file2.plsql
echo 'sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;' > file3.plsql

run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file2.plsql

results correctly in:

Found a 1 line (23 tokens) duplication in the following files: 
Starting at line 1 of /home/andreas/temp/plsql/file1.plsql
Starting at line 1 of /home/andreas/temp/plsql/file2.plsql

select a, b, c, d, e, f from table where x = 1 and y = 2;

since file1.plsql and file2.plsql are identical.

However, comparing file1.plsql and file3.plsql, which differ only in casing, shows no duplications:

run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file3.plsql

I think this problem affects both JavaCC-based and ANTLR-based languages.
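
To illustrate the effect outside of PMD, here is a minimal sketch with a toy regex tokenizer. It is not PMD's implementation and uses none of PMD's classes. The class name CaseFoldingSketch, the tokenize method, and the foldCase flag are all hypothetical. It only shows that comparing raw token text misses the match between file1 and file3, while comparing case-folded token text finds it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseFoldingSketch {
    // Toy token pattern (an assumption, not a real PL/SQL lexer):
    // a word, a run of digits, or any single non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|\\d+|\\S");

    // Returns the token texts used for comparison. With foldCase = true,
    // every token text is lower-cased before it is stored.
    static List<String> tokenize(String source, boolean foldCase) {
        List<String> images = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String image = m.group();
            images.add(foldCase ? image.toLowerCase(Locale.ROOT) : image);
        }
        return images;
    }

    public static void main(String[] args) {
        String file1 = "select a, b, c, d, e, f from table where x = 1 and y = 2;";
        String file3 = "sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;";

        // Raw token text: the sequences differ, so no duplication is seen.
        System.out.println(tokenize(file1, false).equals(tokenize(file3, false)));
        // Case-folded token text: the sequences are identical.
        System.out.println(tokenize(file1, true).equals(tokenize(file3, true)));
    }
}
```

Note that this sketch folds every token. A real fix would probably need to leave string literals and quoted identifiers untouched, since their casing is significant.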

Labels: a:bug, in:cpd