buffer: add buffer.isUtf8 for utf8 validation #45947
Merged
6 commits
be10b36 buffer: add buffer.isUtf8 for utf8 validation (anonrig)
bcb19ec fixup! buffer: add buffer.isUtf8 for utf8 validation (anonrig)
6c8ac38 fixup! buffer: add buffer.isUtf8 for utf8 validation (anonrig)
103e807 fixup! buffer: add buffer.isUtf8 for utf8 validation (anonrig)
b590f06 fixup! buffer: add buffer.isUtf8 for utf8 validation (anonrig)
e940f59 fixup! buffer: add buffer.isUtf8 for utf8 validation (anonrig)
fixup! buffer: add buffer.isUtf8 for utf8 validation
commit e940f594f7550aa1bf6f50916e84f427497704d5
Why does a zero length buffer return true? I would expect this to be false.
Because it does not include an invalid code point. Is there a similar Node function that has a different behavior?
But the stated description of the API is, "This function is used to check if input contains UTF-8 code points"... An empty buffer does not contain UTF-8 code points so it really can't return true. Other methods we have that accept ArrayBuffer or TypedArray, with the exception of Web Streams which have specifically defined handling for detached, will treat those as indistinguishable from a zero-length input.
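To make the point about detached inputs concrete, here is a minimal sketch of why a detached `ArrayBuffer` is hard to tell apart from a zero-length one. It assumes a Node.js version that provides the global `structuredClone` with the `transfer` option; the variable names are illustrative only and nothing here is taken from the PR's code.

```javascript
// Transferring an ArrayBuffer detaches the original.
const backing = new ArrayBuffer(8);
const view = new Uint8Array(backing);

const moved = structuredClone(backing, { transfer: [backing] });

// After the transfer, the original buffer and any view created on it
// report a length of zero, just like a genuinely empty input would.
const detachedByteLength = backing.byteLength;
const detachedViewLength = view.length;
const movedByteLength = moved.byteLength;

console.log({ detachedByteLength, detachedViewLength, movedByteLength });
```

A function that only inspects the length therefore cannot distinguish the two cases without an explicit detached check.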
Hm... That's correct. What do you recommend?
I would just follow up with an additional PR that returns false for zero-length input and removes the detached check and error entirely.
In my opinion it should not be changed. It should return `true`. It's like the empty string, which is valid UTF-8. I would be surprised if `isUtf8(encoder.encode(''))` returns `false`.
To avoid confusion, the documentation can be updated like this: "This function returns `false` if the input contains invalid UTF-8 code points, else `true`".
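The expectation in this comment can be checked with a small sketch. `isValidUtf8` below is a hypothetical helper built on the standard `TextDecoder` in fatal mode; it is a stand-in for the function under review, not its implementation, and it follows the "no invalid sequences" reading of the API.

```javascript
// Hypothetical stand-in: returns false only if decoding hits an invalid sequence.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

const encoder = new TextEncoder();

const emptyResult = isValidUtf8(encoder.encode(''));       // encoded empty string
const textResult = isValidUtf8(encoder.encode('héllo'));   // ordinary text
const brokenResult = isValidUtf8(Uint8Array.of(0xc3));     // truncated 2-byte sequence

console.log({ emptyResult, textResult, brokenResult });
```

Under this reading, the encoded empty string passes because there is nothing in it that could be invalid.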
The challenge there is that with that logic the empty buffer would pass any encoding check. isASCII? Yes. isUTF16le? Yes. IsUTF32be? Yes. Is Shift-JIS? Yes.... Which just simply isn't useful. If you want the inverse check, isInvalidUtf8() then implement that.
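The objection here is vacuous truth: any check of the form "every byte satisfies X" holds for an empty input. A minimal sketch, using a hypothetical `isAscii` helper (not a Node.js API referenced in this thread), shows the effect and one way a caller could opt into the stricter reading:

```javascript
// Hypothetical helper: true when every byte is in the 7-bit ASCII range.
const isAscii = (bytes) => bytes.every((b) => b < 0x80);

const empty = new Uint8Array(0);

// With no bytes to inspect, the "every" check passes trivially.
const emptyIsAscii = isAscii(empty);

// A caller who wants "contains at least one ASCII byte and nothing else"
// can add the length check explicitly.
const emptyIsNonEmptyAscii = empty.length > 0 && isAscii(empty);

console.log({ emptyIsAscii, emptyIsNonEmptyAscii });
```

Whether the length check belongs inside the validator or at the call site is exactly the design question being debated.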
I created a pull request: #45973
The UTF-8 RFC specifies...
Reference: https://www.rfc-editor.org/rfc/rfc3629
So UTF-8 explicitly, by its ABNF, includes the empty string.
Note that, in general, from a non-empty buffer alone, we cannot determine uniquely the character encoding. A BOM may help but UTF-8 is BOM-less.
A string of bytes may be interpreted under different encodings... and in some cases, it is by design. Thus, for example, ASCII buffers are always valid UTF-8 buffers and also valid Latin1 buffers (by design).
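As an illustration of the same bytes being read under different encodings, here is a short sketch using the Node.js `Buffer` global and its `toString` encodings. The sample byte values are chosen for this example and are not taken from the PR.

```javascript
// Pure ASCII bytes decode to the same text under UTF-8 and Latin1.
const asciiBytes = Buffer.from([0x68, 0x69]); // "hi"
const asciiAsUtf8 = asciiBytes.toString('utf8');
const asciiAsLatin1 = asciiBytes.toString('latin1');

// Non-ASCII bytes are ambiguous: the same two bytes are one character
// in UTF-8 but two characters in Latin1.
const ambiguousBytes = Buffer.from([0xc3, 0xa9]);
const ambiguousAsUtf8 = ambiguousBytes.toString('utf8');     // "é"
const ambiguousAsLatin1 = ambiguousBytes.toString('latin1'); // "Ã©"

console.log({ asciiAsUtf8, asciiAsLatin1, ambiguousAsUtf8, ambiguousAsLatin1 });
```

Both readings of the second buffer are well-formed, which is why the bytes alone cannot identify the encoding.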
Yes, I think it makes sense and that is how it works in some other popular programming languages.