
Fix tokenization edge case where llama output does not start with a space #1375

Merged: abetlen merged 3 commits into abetlen:main from noamgat:patch-1 on May 4, 2024

Conversation

@noamgat noamgat commented Apr 23, 2024

Contributor

Created following investigation in this issue:
noamgat/lm-format-enforcer#92

See this notebook for a reproduction of the problem:
https://colab.research.google.com/drive/1Ooz11nFPk19zyJdMDx42CeesU8aWZMdI#scrollTo=oKpHw5PZ30uC

With the model TheBloke/tinyllama-1.1b-chat-v1.0-GGUF, the current implementation decodes the token sequence [6377] to `{"`, but decodes the token sequence [1,6377] to `"`. The Llama tokenizer does not add a leading space when decoding this sequence, but the llama-cpp-python code that wraps it assumes that it does, so it strips the first character of the output (here, the `{`).
This PR drops that assumption: it returns output[1:] instead of output only when the first character is a space.
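
A minimal sketch of the before/after behavior described above, for illustration only. The function names `strip_old` and `strip_new`, the `BOS` constant, and the use of `bytes` for the decoded output are assumptions made for this example; this is not the actual llama-cpp-python code.

```python
BOS = 1  # assumed BOS token id, matching the [1, 6377] example above


def strip_old(tokens: list[int], output: bytes) -> bytes:
    # Old behavior (as described): always drop the first byte
    # when the sequence starts with the BOS token.
    return output[1:] if tokens and tokens[0] == BOS else output


def strip_new(tokens: list[int], output: bytes) -> bytes:
    # New behavior: drop the first byte only if it really is a space.
    if tokens and tokens[0] == BOS and output[0:1] == b" ":
        return output[1:]
    return output


raw = b'{"'  # decoded text for [1, 6377], per the description above
print(strip_old([1, 6377], raw))  # b'"'
print(strip_new([1, 6377], raw))  # b'{"'
```

With the extra check, an output that does begin with a space is still trimmed, so the common case should be unaffected.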

Implementation note: I wrote the check as output[0:1] == ' ' rather than output[0] == ' ' to avoid edge cases where the output is empty (possibly when the first tokens are partial Unicode characters).
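
The difference between the two checks can be seen in isolation. This is a generic Python illustration rather than code from the PR; the helper name `starts_with_space` is hypothetical, and it uses `str` as in the note above.

```python
def starts_with_space(output: str) -> bool:
    # Slicing never raises: on an empty string, output[0:1] is just "".
    return output[0:1] == " "


print(starts_with_space(" hello"))  # True
print(starts_with_space(""))        # False

try:
    ""[0]  # indexing an empty string raises
except IndexError:
    print("IndexError")
```

If the output is a `bytes` object, slicing has a further benefit: `output[0:1]` is still `bytes`, whereas `output[0]` is an `int` and would never compare equal to a one-byte `bytes` value.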

noamgat commented May 3, 2024

Contributor Author

Is it possible to review this? I think it's a very straightforward fix.

abetlen commented May 3, 2024

Owner

@noamgat yup and thank you for looking into this

@abetlen abetlen merged commit e0d7674 into abetlen:main May 4, 2024