description: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
Inherits From: `TokenizerWithOffsets`, `Tokenizer`, `SplitterWithOffsets`, `Splitter`, `Detokenizer`
text.UnicodeCharTokenizer()
Resulting tokens are integers (Unicode code points). A scalar input produces a
`Tensor` containing the code points. A tensor input produces a
`RaggedTensor`.
>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize("abc")
>>> print(tokens)
tf.Tensor([97 98 99], shape=(3,), dtype=int32)
>>> tokens = tokenizer.tokenize(["abc", "de"])
>>> print(tokens)
<tf.RaggedTensor [[97, 98, 99], [100, 101]]>
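The integer tokens shown above are the same values that Python's built-in `ord` returns for each character. The following is a minimal pure-Python sketch of that mapping, independent of TensorFlow Text; the helper name `char_codepoints` is hypothetical and is not part of the library.

```python
def char_codepoints(s):
    """Return the Unicode code point of every character in s."""
    return [ord(ch) for ch in s]


# A single string maps to a flat list of code points.
single = char_codepoints("abc")

# A batch maps to one list per string; rows may differ in length,
# which is why the library returns a RaggedTensor for tensor inputs.
batch = [char_codepoints(s) for s in ["abc", "de"]]

print(single)  # [97, 98, 99]
print(batch)   # [[97, 98, 99], [100, 101]]
```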
Note: The tokenizer does not treat illegal or special characters (such as BOM characters) specially. Any that remain in the input string appear in the output tokens. Normalize them out before or after tokenization if the application does not want them.
>>> t = ["abc" + chr(0xfffe) + chr(0x1fffe) ]
>>> tokens = tokenizer.tokenize(t)
>>> print(tokens.to_list())
[[97, 98, 99, 65534, 131070]]
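One way to remove such code points after tokenization is to filter the integer output. The sketch below is plain Python and assumes the unwanted set is the BOM (U+FEFF) plus the Unicode noncharacters (U+FDD0 to U+FDEF, and any code point ending in FFFE or FFFF). The helper names `is_noncharacter` and `drop_unwanted` are hypothetical, and an application may want a different set.

```python
BOM = 0xFEFF  # byte order mark / zero width no-break space


def is_noncharacter(cp):
    """True for Unicode noncharacters: U+FDD0..U+FDEF and U+xxFFFE / U+xxFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def drop_unwanted(codepoints):
    """Remove the BOM and noncharacters from a list of code points."""
    return [cp for cp in codepoints if cp != BOM and not is_noncharacter(cp)]


# The token row from the example above: "abc" followed by U+FFFE and U+1FFFE.
cleaned = drop_unwanted([97, 98, 99, 65534, 131070])
print(cleaned)  # [97, 98, 99]
```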
Passing malformed UTF-8 will result in unpredictable behavior. Make sure inputs conform to UTF-8.
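A simple guard is to check that each byte string decodes as UTF-8 before it reaches the tokenizer. This is a hedged sketch in plain Python; `is_valid_utf8` is a hypothetical helper, and the sample inputs are made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (strict error handling)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw_inputs = [
    b"abc",                      # plain ASCII, valid
    "a\u20ac".encode("utf-8"),   # "a" plus the euro sign, valid
    b"\xff\xfeabc",              # 0xFF can never appear in UTF-8
    b"\xe2\x82",                 # truncated three-byte sequence
]

# Keep only the inputs that are safe to pass on.
valid_inputs = [b for b in raw_inputs if is_valid_utf8(b)]
print(valid_inputs)  # [b'abc', b'a\xe2\x82\xac']
```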
detokenize(
input, name=None
)
Detokenizes input codepoints (integers) to UTF-8 strings.
>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize(["abc", "de"])
>>> s = tokenizer.detokenize(tokens)
>>> print(s)
tf.Tensor([b'abc' b'de'], shape=(2,), dtype=string)
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of codepoints (ints) with a rank of at least 1. |
| `name` | The name argument that is passed to the op function. |
| Returns | |
|---|---|
| An (N-1)-dimensional string tensor of the UTF-8 text corresponding to the code points in the input. |
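The drop in rank can be illustrated without TensorFlow: each innermost row of code points collapses into a single UTF-8 byte string. The sketch below uses plain Python lists, and `codepoints_to_utf8` is a hypothetical helper rather than a library function.

```python
def codepoints_to_utf8(codepoints):
    """Join a row of code points into one UTF-8 encoded byte string."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")


# Rank-2 input (rows of code points) becomes a rank-1 list of strings.
rows = [[97, 98, 99], [100, 101]]
strings = [codepoints_to_utf8(row) for row in rows]
print(strings)  # [b'abc', b'de']
```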
split(
input
)
Alias for `Tokenizer.tokenize`.
split_with_offsets(
input
)
Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
tokenize(
input
)
Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.
Input strings are split on character boundaries using `tf.strings.unicode_decode_with_offsets`.
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |
| Returns | |
|---|---|
| A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens (characters) of each string. |
tokenize_with_offsets(
input
)
Tokenizes a tensor of UTF-8 strings to Unicode characters.
>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize_with_offsets("a"+chr(8364)+chr(10340))
>>> print(tokens[0])
tf.Tensor([ 97 8364 10340], shape=(3,), dtype=int32)
>>> print(tokens[1])
tf.Tensor([0 1 4], shape=(3,), dtype=int64)
>>> print(tokens[2])
tf.Tensor([1 4 7], shape=(3,), dtype=int64)
The `start_offsets` and `end_offsets` are byte indices into the original
string. When called with multiple input strings, the offsets are
relative to each individual source string:
>>> toks = tokenizer.tokenize_with_offsets(["a"+chr(8364), "b"+chr(10300) ])
>>> print(toks[0])
<tf.RaggedTensor [[97, 8364], [98, 10300]]>
>>> print(toks[1])
<tf.RaggedTensor [[0, 1], [0, 1]]>
>>> print(toks[2])
<tf.RaggedTensor [[1, 4], [1, 4]]>
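The offsets above follow from the UTF-8 width of each character: `"a"` takes 1 byte, while `chr(8364)` and `chr(10340)` take 3 bytes each. The plain-Python sketch below reproduces that arithmetic. `char_byte_offsets` is a hypothetical helper, and treating the end offset as exclusive is an inference from the example output.

```python
def char_byte_offsets(s):
    """Return (codepoints, start_offsets, end_offsets) with byte offsets into UTF-8."""
    codepoints, starts, ends = [], [], []
    pos = 0
    for ch in s:
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character
        codepoints.append(ord(ch))
        starts.append(pos)
        ends.append(pos + width)         # end offset is exclusive
        pos += width
    return codepoints, starts, ends


text = "a" + chr(8364) + chr(10340)
cps, starts, ends = char_byte_offsets(text)
print(cps)     # [97, 8364, 10340]
print(starts)  # [0, 1, 4]
print(ends)    # [1, 4, 7]

# Slicing the encoded bytes with an offset pair recovers the character.
encoded = text.encode("utf-8")
second_char = encoded[starts[1]:ends[1]].decode("utf-8")
```

Because each string is measured from its own first byte, running the helper per string in a batch gives offsets that restart at 0 for every row.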
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |
| Returns | |
|---|---|
A tuple `(tokens, start_offsets, end_offsets)` where:
|