doc: add documentation for invalid byte sequences #28249
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -165,38 +165,6 @@ console.log(Buffer.from('fhqwhgads', 'utf16le')); | |
| // Prints: <Buffer 66 00 68 00 71 00 77 00 68 00 67 00 61 00 64 00 73 00> | ||
| ``` | ||
|
|
||
| ### Evaluating legal code points for `'utf-8'` encoding | ||
|
|
||
| Byte sequences that do not have corresponding UTF-16 encodings and non-legal | ||
| Unicode values, along with their UTF-8 counterparts must be treated as | ||
| invalid byte sequences. | ||
|
|
||
| For cases regarding operations other than employing backward compatibility | ||
| for 7-bit (and [extended 8-bit](https://en.wikipedia.org/wiki/UTF-8#Description) | ||
| in rare cases) `'ascii'` data, and the valid [`UTF-8` code units](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout), | ||
| it should be noted that the replacement character (`�`) is returned, | ||
| and *no exception will be thrown*. | ||
|
|
||
| It should also be noted that a `U+FFFD` replacement value | ||
| (representing the aforementioned replacement character) will be returned | ||
| in case of decoding errors (invalid Unicode scalar values). | ||
|
|
||
| ```js | ||
| // Assuming an invalid byte sequence | ||
| const buf = Buffer.from([237, 166, 164]); | ||
|
|
||
| const buf_str = buf.toString('utf-8'); | ||
|
|
||
| console.log(buf_str); | ||
| // Prints: '�' | ||
|
|
||
| console.log(Buffer.byteLength(buf_str)); | ||
| // Prints a multiple of 3, since each U+FFFD occupies 3 bytes in UTF-8 | ||
|
|
||
| console.log(buf_str.codePointAt(0).toString(16)); | ||
| // Prints: 'fffd' | ||
| ``` | ||
|
|
||
| The character encodings currently supported by Node.js include: | ||
|
|
||
| * `'ascii'` - For 7-bit ASCII data only. This encoding is fast and will strip | ||
|
|
@@ -229,6 +197,38 @@ the WHATWG specification it is possible that the server actually returned | |
| `'win-1252'`-encoded data, and using `'latin1'` encoding may incorrectly decode | ||
| the characters. | ||
|
|
||
| ### Evaluating legal code points for `'utf-8'` encoding | ||
|
|
||
| Byte sequences that do not have corresponding UTF-16 encodings and non-legal | ||
| Unicode values, along with their UTF-8 counterparts must be treated as | ||
| invalid byte sequences. | ||
|
|
||
| For cases regarding operations other than employing backward compatibility | ||
| for 7-bit (and [extended 8-bit](https://en.wikipedia.org/wiki/UTF-8#Description) | ||
| in rare cases) `'ascii'` data, and the valid [`UTF-8` code units](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout), | ||
| it should be noted that the replacement character (`�`) is returned, | ||
| and *no exception will be thrown*. | ||
|
rexagod marked this conversation as resolved.
Outdated
|
||
|
|
||
| It should also be noted that a `U+FFFD` replacement value | ||
|
rexagod marked this conversation as resolved.
Outdated
|
||
| (representing the aforementioned replacement character) will be returned | ||
| in case of decoding errors (invalid Unicode scalar values). | ||
|
Member
To be honest, I don’t understand most of the text or its relevance here… the text basically says that invalid UTF-8 byte sequences will be decoded into U+FFFD. How do UTF-16 and ASCII relate to that? What does “non-legal Unicode value” mean? (I would guess that this refers to characters that would be beyond U+10FFFF – if that’s correct, can you clarify that in the text?)
||
|
|
||
| ```js | ||
| // Assuming an invalid byte sequence | ||
| const buf = Buffer.from([237, 166, 164]); | ||
|
|
||
| const buf_str = buf.toString('utf-8'); | ||
|
|
||
| console.log(buf_str); | ||
| // Prints: '�' | ||
|
|
||
| console.log(Buffer.byteLength(buf_str)); | ||
| // Prints a multiple of 3, since each U+FFFD occupies 3 bytes in UTF-8 | ||
|
|
||
| console.log(buf_str.codePointAt(0).toString(16)); | ||
| // Prints: 'fffd' | ||
| ``` | ||
|
|
||
| ## Buffers and TypedArray | ||
| <!-- YAML | ||
| changes: | ||
|
|
||
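The behaviour this section describes can be checked with a minimal sketch like the one below. It is not part of the patch, and the variable names (`invalid`, `lenient`, `strictError`) are illustrative only. It assumes a Node.js version where `TextDecoder` is a global and supports the `fatal` option (default builds with ICU do). The bytes `ED A6 A4` are the would-be UTF-8 form of the surrogate U+D9A4, which UTF-8 does not permit.

```js
// Bytes ED A6 A4 would encode the surrogate U+D9A4, which is not valid UTF-8.
const invalid = Buffer.from([0xed, 0xa6, 0xa4]);

// Lenient decoding: buf.toString() does not throw on malformed UTF-8;
// it substitutes U+FFFD for the invalid bytes.
const lenient = invalid.toString('utf8');

// The result should consist solely of U+FFFD. How many of them appear depends
// on how the decoder splits the invalid bytes, so the count is not assumed.
const onlyReplacement =
  lenient.length > 0 && [...lenient].every((ch) => ch === '\ufffd');
console.log(onlyReplacement);

// Strict decoding: a TextDecoder created with `fatal: true` throws instead
// of substituting replacement characters.
let strictError = null;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(invalid);
} catch (err) {
  strictError = err;
}
console.log(strictError instanceof TypeError);
```

If callers need to reject malformed input rather than silently accept replacement characters, the strict `TextDecoder` path is one option; `buf.toString()` itself offers no such switch.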