description: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

text.UnicodeCharTokenizer

Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

text.UnicodeCharTokenizer()

Resulting tokens are integers (Unicode code points). A scalar input produces a `Tensor` containing the code points. Non-scalar tensor inputs produce `RaggedTensor` outputs.

Example:

>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize("abc")
>>> print(tokens)
tf.Tensor([97 98 99], shape=(3,), dtype=int32)

>>> tokens = tokenizer.tokenize(["abc", "de"])
>>> print(tokens)
<tf.RaggedTensor [[97, 98, 99], [100, 101]]>

Note: Illegal or special code points that remain in the input string (such as BOM characters) are not treated specially by the tokenizer and appear in the output tokens. Normalize them out before or after tokenization if the application does not want them.

>>> t = ["abc" + chr(0xfffe) + chr(0x1fffe) ]
>>> tokens = tokenizer.tokenize(t)
>>> print(tokens.to_list())
[[97, 98, 99, 65534, 131070]]
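
As a rough illustration of normalizing such values out after tokenization, the pure-Python sketch below filters a nested list of code points such as the one returned by `tokens.to_list()`. The helpers `is_noncharacter` and `drop_unwanted` are hypothetical names introduced here and are not part of the library. Which code points count as unwanted is an application decision; this sketch assumes the BOM (U+FEFF) and the Unicode noncharacters.

```python
# Hypothetical helpers for illustration; not part of tensorflow_text.
BOM = 0xFEFF

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and any code point whose low 16 bits are FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def drop_unwanted(codepoints):
    """Remove the BOM and noncharacters from one row of code points."""
    return [cp for cp in codepoints if cp != BOM and not is_noncharacter(cp)]

# Same values as the tokenizer output shown above.
tokens = [[97, 98, 99, 65534, 131070]]
cleaned = [drop_unwanted(row) for row in tokens]
print(cleaned)  # [[97, 98, 99]]
```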

Passing malformed UTF-8 will result in unpredictable behavior. Make sure inputs conform to UTF-8.
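
One way to guard against malformed input is to check byte strings in plain Python before they reach the tokenizer. This is a minimal sketch under that assumption. `is_valid_utf8` is a hypothetical helper, not a library function, and it relies only on Python's strict UTF-8 decoder.

```python
# Hypothetical pre-check; not part of tensorflow_text.
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

inputs = [b"abc", "a\u20ac".encode("utf-8"), b"\xff\xfeabc"]
valid = [s for s in inputs if is_valid_utf8(s)]
print(len(valid))  # 2 (the last entry starts with bytes that never occur in UTF-8)
```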

Methods

`detokenize`

detokenize(
    input, name=None
)

Detokenizes input codepoints (integers) to UTF-8 strings.

Example:

>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize(["abc", "de"])
>>> s = tokenizer.detokenize(tokens)
>>> print(s)
tf.Tensor([b'abc' b'de'], shape=(2,), dtype=string)

Args
`input`	A `RaggedTensor` or `Tensor` of codepoints (ints) with a rank of at least 1.
`name`	The name argument that is passed to the op function.

Returns
An (N-1)-dimensional string tensor containing the UTF-8 text for the code points in the input, where N is the rank of `input`.
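
The rank reduction can be pictured with plain Python lists: each innermost row of code points collapses into one UTF-8 byte string. The sketch below is an analogy rather than the library's implementation, and `detokenize_row` is a hypothetical name.

```python
# Pure-Python analogy; not the tensorflow_text implementation.
def detokenize_row(codepoints):
    """Join one row of code points into a UTF-8 encoded byte string."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")

rows = [[97, 98, 99], [100, 101]]            # rank 2: one row per string
strings = [detokenize_row(r) for r in rows]  # rank 1: one byte string per row
print(strings)  # [b'abc', b'de']
```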

`split`

split(
    input
)

Alias for Tokenizer.tokenize.

`split_with_offsets`

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

`tokenize`

tokenize(
    input
)

Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

Input strings are split on character boundaries using `tf.strings.unicode_decode_with_offsets`.

Args
`input`	A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.

Returns
A `RaggedTensor` of tokenized text (a `Tensor` for scalar input). The returned shape is the shape of the input tensor with an added ragged dimension for the tokens (characters) of each string.
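
The shape rule can be sketched with nested Python lists standing in for tensors: every string, at whatever depth, is replaced by a list of its code points, which adds one variable-length dimension. `tokenize_nested` is a hypothetical helper used only for this analogy. It does not call the library.

```python
# Pure-Python analogy for the output shape; not the tensorflow_text implementation.
def tokenize_nested(value):
    """Replace every string in a nested list with its list of code points."""
    if isinstance(value, str):
        return [ord(ch) for ch in value]
    return [tokenize_nested(v) for v in value]

print(tokenize_nested("abc"))                   # [97, 98, 99]
print(tokenize_nested([["abc", "de"], ["f"]]))  # [[[97, 98, 99], [100, 101]], [[102]]]
```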

`tokenize_with_offsets`

tokenize_with_offsets(
    input
)

Tokenizes a tensor of UTF-8 strings to Unicode characters.

Example:

>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize_with_offsets("a"+chr(8364)+chr(10340))
>>> print(tokens[0])
tf.Tensor([   97  8364 10340], shape=(3,), dtype=int32)
>>> print(tokens[1])
tf.Tensor([0 1 4], shape=(3,), dtype=int64)
>>> print(tokens[2])
tf.Tensor([1 4 7], shape=(3,), dtype=int64)

The `start_offsets` and `end_offsets` are byte indices into the original string, and each end offset is exclusive. When called with multiple input strings, the offsets are relative to each individual source string:

>>> toks = tokenizer.tokenize_with_offsets(["a"+chr(8364), "b"+chr(10300) ])
>>> print(toks[0])
<tf.RaggedTensor [[97, 8364], [98, 10300]]>
>>> print(toks[1])
<tf.RaggedTensor [[0, 1], [0, 1]]>
>>> print(toks[2])
<tf.RaggedTensor [[1, 4], [1, 4]]>
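
To see where these byte offsets come from, the sketch below recomputes them in plain Python by measuring the UTF-8 width of each character. It is an illustration rather than the library's implementation, and `char_byte_offsets` is a hypothetical name. The last lines show that slicing the encoded bytes with each `(start, end)` pair recovers the original characters.

```python
# Pure-Python illustration of byte offsets; not the tensorflow_text implementation.
def char_byte_offsets(text: str):
    """Return (code points, start byte offsets, exclusive end byte offsets)."""
    codepoints, starts, ends = [], [], []
    pos = 0
    for ch in text:
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character
        codepoints.append(ord(ch))
        starts.append(pos)
        ends.append(pos + width)
        pos += width
    return codepoints, starts, ends

text = "a" + chr(8364) + chr(10340)
cps, starts, ends = char_byte_offsets(text)
print(cps, starts, ends)  # [97, 8364, 10340] [0, 1, 4] [1, 4, 7]

# Slicing the raw bytes with the offsets recovers each character.
raw = text.encode("utf-8")
pieces = [raw[s:e].decode("utf-8") for s, e in zip(starts, ends)]
print(pieces == list(text))  # True
```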

Args
`input`	A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.

Returns
A tuple `(tokens, start_offsets, end_offsets)` where:
`tokens`	A `RaggedTensor` of code points (integer type).
`start_offsets`	A `RaggedTensor` of the tokens' starting byte offsets.
`end_offsets`	A `RaggedTensor` of the tokens' ending byte offsets.
