WhitespaceTokenizer.md

>>> WhitespaceTokenizer().tokenize("small medium large")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'small', b'medium',
b'large'], dtype=object)>

Args
`input`	A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.

Returns
A `RaggedTensor` of tokenized text. The returned shape is the shape of the input tensor with an added ragged dimension for tokens of each string.
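
The shape rule above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the library's implementation: `tokenize_nested` is a hypothetical helper, nested lists stand in for tensors, and Python's `str.split()` stands in for the ICU whitespace definition, which may differ for some characters.

```python
def tokenize_nested(value):
    """Split every string in an arbitrarily nested list on whitespace.

    Hypothetical helper for illustration only: nested lists play the role
    of tensors, and each string is replaced by a list of its tokens, which
    adds one (variable-length) dimension to the structure.
    """
    if isinstance(value, str):
        # str.split() with no arguments splits on runs of whitespace
        # and drops the whitespace itself.
        return [tok.encode("utf-8") for tok in value.split()]
    return [tokenize_nested(item) for item in value]


# A scalar string gains one dimension: shape () becomes shape (3,).
print(tokenize_nested("small medium large"))
# [b'small', b'medium', b'large']

# A batch of two strings gains a ragged inner dimension: (2,) becomes (2, None).
batch = ["small medium large", "a bb"]
print(tokenize_nested(batch))
# [[b'small', b'medium', b'large'], [b'a', b'bb']]
```

The inner lists have different lengths (3 and 2 tokens), which is why the added dimension is ragged.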

`tokenize_with_offsets`

tokenize_with_offsets(
    input
)

Tokenizes a tensor of UTF-8 strings on whitespace and returns each token's byte offsets.

The strings are split on ICU-defined whitespace characters. These whitespace characters are dropped from the output.

Example:

>>> splitter = WhitespaceTokenizer()
>>> pieces, starts, ends = splitter.tokenize_with_offsets("a bb ccc")
>>> print(pieces.numpy(), starts.numpy(), ends.numpy())
[b'a' b'bb' b'ccc'] [0 2 5] [1 4 8]

Args
`input`	A `RaggedTensor` or `Tensor` of UTF-8 strings with any shape.

Returns
A tuple `(tokens, start_offsets, end_offsets)` where:
`tokens`	A `RaggedTensor` of tokenized text.
`start_offsets`	A `RaggedTensor` of the tokens' starting byte offsets.
`end_offsets`	A `RaggedTensor` of the tokens' ending byte offsets.
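
Because the offsets are counted in bytes rather than characters, they diverge from character positions as soon as the input contains multi-byte UTF-8 characters. The sketch below is a pure-Python illustration of that bookkeeping, not the library's implementation: `whitespace_spans` is a hypothetical helper, and `str.isspace()` is assumed as a stand-in for the ICU whitespace definition. End offsets are treated as exclusive, which is consistent with the `[1 4 8]` output in the example above.

```python
def whitespace_spans(text):
    """Return (tokens, starts, ends) with offsets counted in UTF-8 bytes.

    Hypothetical helper for illustration only. End offsets are exclusive,
    so data[start:end] recovers each token from the encoded input.
    """
    starts, ends = [], []
    byte_pos = 0        # byte offset of the current character
    token_start = None  # byte offset where the open token began, if any
    for ch in text:
        if ch.isspace():
            if token_start is not None:
                # Whitespace closes the open token; the whitespace is dropped.
                starts.append(token_start)
                ends.append(byte_pos)
                token_start = None
        elif token_start is None:
            token_start = byte_pos
        # Advance by the encoded width of the character, not by 1.
        byte_pos += len(ch.encode("utf-8"))
    if token_start is not None:
        # Close a token that runs to the end of the string.
        starts.append(token_start)
        ends.append(byte_pos)
    data = text.encode("utf-8")
    tokens = [data[s:e] for s, e in zip(starts, ends)]
    return tokens, starts, ends


print(whitespace_spans("a bb ccc"))
# ([b'a', b'bb', b'ccc'], [0, 2, 5], [1, 4, 8])

# "é" occupies two bytes in UTF-8, so the second token starts at byte 3, not 2.
print(whitespace_spans("é bb"))
# ([b'\xc3\xa9', b'bb'], [0, 3], [2, 5])
```

Slicing the encoded string with a start and end pair recovers the token, which is the usual reason to keep the offsets, for example to map model predictions back onto the original text.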
