Tokenizes a tensor of UTF-8 strings on whitespaces.
Inherits From: `TokenizerWithOffsets`, `Tokenizer`, `SplitterWithOffsets`, `Splitter`
text.WhitespaceTokenizer()
split(input)
Alias for `Tokenizer.tokenize`.
split_with_offsets(input)
Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
tokenize(input)
Tokenizes a tensor of UTF-8 strings on whitespaces.
The strings are split on ICU-defined whitespace characters, which are dropped from the output.
>>> WhitespaceTokenizer().tokenize("small medium large")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'small', b'medium',
b'large'], dtype=object)>
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |
| Returns | |
|---|---|
| A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string. |
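The shape rule in the Returns table can be illustrated with a pure-Python stand-in that uses nested lists instead of tensors. This is only a sketch: `tokenize_nested` is a hypothetical helper, not part of the library, and it relies on Python's `str.split()` notion of whitespace, which may differ from the ICU definition in edge cases.

```python
def tokenize_nested(value):
    """Hypothetical stand-in: split every string leaf of a nested list on whitespace."""
    if isinstance(value, str):
        # str.split() with no arguments splits on runs of whitespace and drops them.
        # Assumption: Python's whitespace set approximates ICU's for this illustration.
        return [token.encode("utf-8") for token in value.split()]
    # Recurse so the input's nesting is preserved and one token dimension is added.
    return [tokenize_nested(item) for item in value]


batch = ["small medium large", "tiny", ""]
print(tokenize_nested(batch))
# [[b'small', b'medium', b'large'], [b'tiny'], []]
```

Each input string becomes a list of tokens whose length varies per string, which is why the added dimension is ragged.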
tokenize_with_offsets(input)
Tokenizes a tensor of UTF-8 strings on whitespaces.
The strings are split on ICU-defined whitespace characters, which are dropped from the output.
>>> splitter = WhitespaceTokenizer()
>>> pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
>>> print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]
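To make the offsets in the example above concrete, here is a minimal pure-Python sketch of the same idea for a single string. It is not the library's implementation: `whitespace_tokenize_with_offsets` is a hypothetical helper, it assumes the offsets count bytes of the UTF-8 encoding with an exclusive end offset, and it uses `str.isspace()` as an approximation of ICU whitespace.

```python
def whitespace_tokenize_with_offsets(text):
    """Hypothetical sketch: return (tokens, start_offsets, end_offsets) for one string."""
    tokens, starts, ends = [], [], []
    current = []        # characters of the token being built
    token_start = 0     # byte offset where the current token began
    byte_pos = 0        # byte offset of the character being examined

    for ch in text:
        if ch.isspace():  # assumption: approximates ICU-defined whitespace
            if current:
                # A token just ended; its end offset is exclusive.
                tokens.append("".join(current).encode("utf-8"))
                starts.append(token_start)
                ends.append(byte_pos)
                current = []
        else:
            if not current:
                token_start = byte_pos
            current.append(ch)
        # Advance by the character's encoded width, not by one.
        byte_pos += len(ch.encode("utf-8"))

    if current:  # flush a token that runs to the end of the string
        tokens.append("".join(current).encode("utf-8"))
        starts.append(token_start)
        ends.append(byte_pos)
    return tokens, starts, ends


print(whitespace_tokenize_with_offsets("a bb ccc"))
# ([b'a', b'bb', b'ccc'], [0, 2, 5], [1, 4, 8])
```

Under the byte-offset assumption, a multi-byte character shifts later offsets by its encoded width: in this sketch, `"é b"` yields starts `[0, 3]` and ends `[2, 4]`.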
| Args | |
|---|---|
| `input` | A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape. |
| Returns | |
|---|---|
A tuple `(tokens, start_offsets, end_offsets)` where:
|