Skip to content
This repository was archived by the owner on Jul 31, 2018. It is now read-only.
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Updating
  • Loading branch information
jasnell committed Aug 11, 2016
commit 71b757cccbcfbfebef08aa5297e140b02e119763
95 changes: 24 additions & 71 deletions XXX-icu-module.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,123 +2,76 @@
|--------|-----------------------------|
| Author | @jasnell |
| Status | DRAFT |
| Date | 2016-04-20 |
| Date | 2016-08-11 |

## Description

The ICU4C library that we use for internationalization contains a significant
array of additional functionality not currently exposed by the EcmaScript 402
standard. Some of this additional functionality would be useful to expose via
a new `'icu'` or (`'unicode'`) module.
a new `'unicode'` module.

## Interface

Initially, the `'icu'` module would provide methods for the following:
Initially, the `'unicode'` module would provide methods for the following:

1. Character encoding detection. ICU includes code that is able to look at a
stream of bytes and apply heuristics to detect the character encoding in
use. This is not always an exact match but it does a reasonably good job.
We can tune this detection to only look for the character encodings we
support in Core (ascii, iso8859-1, utf8 and utf16-le). Two specific APIs
would be exposed by the `'icu'` module for this capability:

```js
const icu = require('icu');

// Detect the encoding for a given buffer or string.
// Returns a string with the most likely match.
icu.detectEncoding(myBuffer);
icu.detectEncoding(myString);

// Detect the encoding for a given buffer or string.
// Returns an object whose keys are the detected
// encodings and whose values are a confidence value
// provided by ICU. The higher the confidence value,
// the better the match.
const encs = icu.detectEncodings(myBuffer);
console.log(encs);
// Prints something like {'ascii': 90, 'utf8': 15}
```

This mechanism is useful when working with data that might be in multiple
character sets (such as filenames on Linux, or reading through multiple
files in a directory).

```
const data = getDataSomehow();
const buffer = Buffer.from(data, icu.detectEncoding(data));
```

2. One-Shot and Streaming Buffer re-encoding. ICU includes code for converting
1. One-Shot and Streaming Buffer re-encoding. ICU includes code for converting
from one encoding to another. This is similar to what is provided by `iconv`
but it is built in to ICU4C. The `'icu'` module would include converters for
*only* the character encodings directly supported by core. Developers would
continue to use `iconv` or `iconv-lite` for more exotic things.
but it is built in to ICU4C. The `'unicode'` module would include converters
for *only* the character encodings directly supported by core. Developers
would continue to use `iconv` or `iconv-lite` (or similar) for more exotic
things.

```js
const icu = require('icu');
const unicode = require('unicode');

// One-shot conversion. Converts the entire Buffer in one go.
// Assumes that the Buffer is properly aligned on UFT-8 boundaries
const myBuffer = Buffer.from(getUtf8DataSomehow(), 'utf8');
const newBuffer = icu.reencode(myBuffer, 'utf8', 'ucs2');

const newBuffer = unicode.transcode(myBuffer, 'utf8', 'ucs2');

// Streaming conversion
const convertStream = icu.createConverter('utf8', 'ucs2');
convertStream.on('data', (chunk) => {
const transcodeStream = icu.createTranscoder('utf8', 'ucs2');
transcodeStream.on('data', (chunk) => {
// chunk is a UTF-16 (ucs2) encoded buffer
});
// Writing UTF-8 data
convertStream.write(getUtf8DataSomehow());
```

Additional convenience methods would be attached to `Buffer.prototype`:

```
const myBuffer = Buffer.from(getUtf8DataShow(), 'uf8');
const newBuffer = myBuffer.reencode('utf8', 'ucs2');
transcodeStream.write(getUtf8DataSomehow());
```

Again, this would ONLY support the encodings for which we already have built-in
support in core (acsii, iso8859-1, utf8 and utf16). This does not expand the
encoding support in core so `iconv` and `iconv-lite` would still be necessary.
support in core (acsii, iso8859-1, utf8 and utf16le).

3. UTF-8 and UTF-16 aware `codePointAt()` and `charAt()` methods for `Buffer`.
2. UTF-8 and UTF-16 aware `codePointAt()` and `charAt()` methods for `Buffer`.
This one is pretty straightforward. They would return either the Unicode
codepoint or the character at the given byte offset even if the byte offset
is not on a UTF-8 or UTF-16 lead byte. These are intended to be symmetrical
with `String.prototype.codePointAt()` and `String.prototype.charAt()`

```
const icu = require('icu');
```js
const unicode = require('unicode');

const myBuffer = Buffer.from('a€bc', 'utf8');

console.log(icu.codePointAt(myBuffer, 1, 'utf8'));
// or
console.log(myBuffer.codePointAt(1, 'utf8'));
console.log(unicode.codePointAt(myBuffer, 1, 'utf8'));

console.log(icu.charAt(myBuffer, 1, 'utf8'));
// or
console.log(myBuffer.charAt(1, 'utf8'));
console.log(unicode.charAt(myBuffer, 1, 'utf8'));
```

4. UTF-16 and UTF-8 aware `slice()` for `Buffer`. This is similar to the
3. UTF-16 and UTF-8 aware `slice()` for `Buffer`. This is similar to the
existing `Buffer.prototype.slice()` except that, rather than byte offsets,
the `start` and `end` are codepoint/character offsets This would make it
symmetrical with `String.prototype.slice()` but for Buffers. The advantage
is that this allows the Buffer to be sliced in a way that ensures proper
alignment with UTF-8 or UTF-16 encodings.

```
const icu = require('icu');
```js
const unicode = require('unicode');

const myBuffer = Buffer.from('a€bc', 'utf8');

icu.slice(myBuffer, 'utf8', 1, 2); // returns a Buffer with €
// or
Buffer.slice(1, 2, 'utf8'); // returns a Buffer with €
unicode.slice(myBuffer, 'utf8', 1, 3); // returns a Buffer with the utf8
// encoding of €b
```

*Passing in either `ascii` or `binary` would fallback to the current
Expand Down