description: Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

text.UnicodeCharTokenizer

View source

Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

text.UnicodeCharTokenizer()

Resulting tokens are integers (Unicode codepoints). A scalar input produces a Tensor containing the codepoints. Tensor inputs produce RaggedTensor outputs.

Example:

>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize("abc")
>>> print(tokens)
tf.Tensor([97 98 99], shape=(3,), dtype=int32)
>>> tokens = tokenizer.tokenize(["abc", "de"])
>>> print(tokens)
<tf.RaggedTensor [[97, 98, 99], [100, 101]]>

Note: the tokenizer does not treat illegal or special characters (such as BOM characters) in the input string specially, so they will show up in the output tokens. If they are unwanted in the application, normalize them out before or after tokenization.

>>> t = ["abc" + chr(0xfffe) + chr(0x1fffe) ]
>>> tokens = tokenizer.tokenize(t)
>>> print(tokens.to_list())
[[97, 98, 99, 65534, 131070]]
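One way to normalize such characters out after tokenization is to filter the codepoint lists. The following is a minimal plain-Python sketch (no TensorFlow), applied to the `to_list()` output shown above. The helper names `is_unwanted` and `strip_unwanted` are hypothetical, and treating the BOM plus all Unicode noncharacters as unwanted is an assumption that may not match every application.

```python
# Plain-Python sketch: drop unwanted codepoints after tokenization.
# `is_unwanted` and `strip_unwanted` are hypothetical helpers, not part of tf_text.

BOM = 0xFEFF  # byte order mark / zero width no-break space


def is_unwanted(cp: int) -> bool:
    """Return True for the BOM and for Unicode noncharacters."""
    if cp == BOM:
        return True
    # U+FDD0..U+FDEF is a contiguous block of noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two codepoints of every plane (U+xFFFE, U+xFFFF) are noncharacters.
    return (cp & 0xFFFE) == 0xFFFE


def strip_unwanted(rows):
    """Filter each row of codepoints, keeping the row structure."""
    return [[cp for cp in row if not is_unwanted(cp)] for row in rows]


tokens = [[97, 98, 99, 65534, 131070]]  # e.g. tokens.to_list()
cleaned = strip_unwanted(tokens)
print(cleaned)  # [[97, 98, 99]]
```

Filtering after tokenization keeps the input strings untouched, but any byte offsets computed earlier still refer to the unfiltered strings.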

Passing malformed UTF-8 will result in unpredictable behavior. Make sure inputs conform to UTF-8.
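A simple guard is to check that each input is well-formed UTF-8 before it reaches the tokenizer. The sketch below uses only the Python standard library; `is_valid_utf8` is a hypothetical helper name, and dropping invalid inputs (rather than repairing or rejecting the whole batch) is just one possible policy.

```python
# Plain-Python sketch: keep only byte strings that are well-formed UTF-8.
# `is_valid_utf8` is a hypothetical helper, not part of tf_text.


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


inputs = [b"abc", "a\u20ac".encode("utf-8"), b"\xff\xfe"]
valid = [s for s in inputs if is_valid_utf8(s)]
print(len(valid))  # 2
```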

Methods

detokenize

View source

detokenize(
    input, name=None
)

Detokenizes input codepoints (integers) to UTF-8 strings.

Example:

>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize(["abc", "de"])
>>> s = tokenizer.detokenize(tokens)
>>> print(s)
tf.Tensor([b'abc' b'de'], shape=(2,), dtype=string)

Args
`input` A `RaggedTensor` or `Tensor` of codepoints (ints) with a rank of at least 1.
`name` The name argument that is passed to the op function.
Returns
A string tensor of rank N-1 containing the UTF-8 text for the codepoints in `input`, where N is the rank of `input`.
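To make the rank reduction concrete, here is a plain-Python sketch of the same idea on nested lists: the innermost list of codepoints collapses into a single UTF-8 byte string, so a rank-2 input yields a rank-1 result. `detokenize_rows` is a hypothetical helper for illustration only, not the library's implementation.

```python
# Plain-Python sketch: codepoints -> UTF-8 byte strings, one per row.
# `detokenize_rows` is a hypothetical helper, not part of tf_text.


def detokenize_rows(rows):
    """Join each row of codepoints into one UTF-8 encoded byte string."""
    return ["".join(chr(cp) for cp in row).encode("utf-8") for row in rows]


strings = detokenize_rows([[97, 98, 99], [100, 101]])
print(strings)  # [b'abc', b'de']
```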

split

View source

split(
    input
)

Alias for Tokenizer.tokenize.

split_with_offsets

View source

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

View source

tokenize(
    input
)

Tokenizes a tensor of UTF-8 strings on Unicode character boundaries.

Input strings are split on character boundaries using unicode_decode_with_offsets.

Args
`input` A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.
Returns
A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens (characters) of each string.
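The shape rule can be pictured with nested Python lists: every string is replaced by a list of its codepoints, so the result has one more level of nesting than the input, and that innermost level varies in length. The sketch below is plain Python under that analogy; `tokenize_nested` is a hypothetical helper, not the library's implementation.

```python
# Plain-Python sketch: each string becomes a variable-length list of codepoints,
# adding one ragged level to the input's nesting.
# `tokenize_nested` is a hypothetical helper, not part of tf_text.


def tokenize_nested(value):
    """Recursively replace every string with its list of codepoints."""
    if isinstance(value, str):
        return [ord(ch) for ch in value]
    return [tokenize_nested(v) for v in value]


batch = [["ab", "c"], ["def", ""]]  # 2 x 2 input
result = tokenize_nested(batch)
print(result)  # [[[97, 98], [99]], [[100, 101, 102], []]]
```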

tokenize_with_offsets

View source

tokenize_with_offsets(
    input
)

Tokenizes a tensor of UTF-8 strings to Unicode characters.

Example:

>>> tokenizer = tf_text.UnicodeCharTokenizer()
>>> tokens = tokenizer.tokenize_with_offsets("a"+chr(8364)+chr(10340))
>>> print(tokens[0])
tf.Tensor([   97  8364 10340], shape=(3,), dtype=int32)
>>> print(tokens[1])
tf.Tensor([0 1 4], shape=(3,), dtype=int64)
>>> print(tokens[2])
tf.Tensor([1 4 7], shape=(3,), dtype=int64)

The start_offsets and end_offsets are byte indices into the original string, not character indices. When calling with multiple string inputs, the offsets are relative to each individual source string:

>>> toks = tokenizer.tokenize_with_offsets(["a"+chr(8364), "b"+chr(10300) ])
>>> print(toks[0])
<tf.RaggedTensor [[97, 8364], [98, 10300]]>
>>> print(toks[1])
<tf.RaggedTensor [[0, 1], [0, 1]]>
>>> print(toks[2])
<tf.RaggedTensor [[1, 4], [1, 4]]>

Args
`input` A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.
Returns
A tuple `(tokens, start_offsets, end_offsets)` where:
  • tokens: A RaggedTensor of code points (integer type).
  • start_offsets: A RaggedTensor of the tokens' starting byte offset.
  • end_offsets: A RaggedTensor of the tokens' ending byte offset.
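Because the offsets count bytes rather than characters, they can be used to slice the UTF-8 encoded string and recover each token. The following plain-Python sketch recomputes the offsets from the first example above and checks that slicing round-trips. `char_tokens_with_offsets` is a hypothetical helper that mirrors the documented behavior for a single string; it is not the library's implementation.

```python
# Plain-Python sketch: per-character byte offsets for one string.
# `char_tokens_with_offsets` is a hypothetical helper, not part of tf_text.


def char_tokens_with_offsets(text: str):
    """Return (codepoints, start byte offsets, end byte offsets) for `text`."""
    tokens, starts, ends = [], [], []
    pos = 0
    for ch in text:
        width = len(ch.encode("utf-8"))  # 1 to 4 bytes per character
        tokens.append(ord(ch))
        starts.append(pos)
        ends.append(pos + width)
        pos += width
    return tokens, starts, ends


text = "a" + chr(8364) + chr(10340)
tokens, starts, ends = char_tokens_with_offsets(text)
print(tokens)  # [97, 8364, 10340]
print(starts)  # [0, 1, 4]
print(ends)    # [1, 4, 7]

# Slicing the encoded bytes with the offsets recovers each character.
raw = text.encode("utf-8")
pieces = [raw[s:e].decode("utf-8") for s, e in zip(starts, ends)]
```

Here `"a"` occupies one byte while the other two characters occupy three bytes each, which is why the second and third tokens start at bytes 1 and 4.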