[core] CPD is always case sensitive #4396

**Affects PMD Version:** 6.x

**Description:**

Some languages, like PL/SQL or the new T-SQL (#4390), are case-insensitive. Tokenizing handles this correctly, i.e. the lexers are agnostic to casing: JavaCC has a grammar option for this, and ANTLR has had one since 4.10.

However, when we convert the original tokens into CPD TokenEntries, we don't seem to use the token kind. Instead we use the original token text, which keeps the original casing. It's therefore very easy to evade duplication detection for these languages by just changing the casing:

```shell
echo 'select a, b, c, d, e, f from table where x = 1 and y = 2;' > file1.plsql
cp file1.plsql file2.plsql
echo 'sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;' > file3.plsql

run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file2.plsql
```

results correctly in:
```
Found a 1 line (23 tokens) duplication in the following files: 
Starting at line 1 of /home/andreas/temp/plsql/file1.plsql
Starting at line 1 of /home/andreas/temp/plsql/file2.plsql

select a, b, c, d, e, f from table where x = 1 and y = 2;
```
since file1.plsql and file2.plsql are identical.

However, comparing file1.plsql and file3.plsql, which differ only in casing, shows no duplications:

```shell
run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file3.plsql
```


I think this problem affects both JavaCC- and ANTLR-based languages.
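
To illustrate the idea, here is a minimal, self-contained sketch of what a fix could look like: fold the token text to one case before it is used for comparison, but only for languages flagged as case-insensitive. The class `CaseFoldingSketch` and the methods `normalizeImage` and `normalizeAll` are hypothetical names for this illustration and are not part of the PMD API. The token lists are shortened, hand-written stand-ins for the lexer output of file1.plsql and file3.plsql.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not PMD code: shows case folding of token text
// before comparison for case-insensitive languages.
public class CaseFoldingSketch {

    // Fold the token text only if the language is case-insensitive.
    // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
    static String normalizeImage(String image, boolean caseInsensitiveLanguage) {
        return caseInsensitiveLanguage ? image.toLowerCase(Locale.ROOT) : image;
    }

    // Apply the normalization to a whole token sequence.
    static List<String> normalizeAll(List<String> images, boolean caseInsensitiveLanguage) {
        List<String> result = new ArrayList<>();
        for (String image : images) {
            result.add(normalizeImage(image, caseInsensitiveLanguage));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the tokens of file1.plsql and file3.plsql.
        List<String> file1 = List.of("select", "a", "from", "table");
        List<String> file3 = List.of("sEleCt", "a", "frOm", "table");

        // With folding, both sequences compare equal: prints "true".
        System.out.println(normalizeAll(file1, true).equals(normalizeAll(file3, true)));
        // Without folding (the current behavior), they differ: prints "false".
        System.out.println(normalizeAll(file1, false).equals(normalizeAll(file3, false)));
    }
}
```

A real fix would probably need to exclude string literals and quoted identifiers from the folding, since their casing can be significant even in case-insensitive languages.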

