Commit cd1254b

zoyahav authored and tf-transform-team committed

Generate TFT docs

PiperOrigin-RevId: 275863009
1 parent 92cb544 commit cd1254b

16 files changed: 117 additions & 50 deletions

docs/api_docs/python/tft.md

Lines changed: 1 addition & 1 deletion

@@ -59,7 +59,7 @@ Init module for TF.Transform.
 
 [`ngrams(...)`](./tft/ngrams.md): Create a `SparseTensor` of n-grams.
 
-[`pca(...)`](./tft/pca.md): Computes pca on the dataset using biased covariance.
+[`pca(...)`](./tft/pca.md): Computes PCA on the dataset using biased covariance.
 
 [`ptransform_analyzer(...)`](./tft/ptransform_analyzer.md): Applies a user-provided PTransform over the whole dataset.
 

docs/api_docs/python/tft/apply_buckets_with_interpolation.md

Lines changed: 2 additions & 1 deletion

@@ -31,7 +31,8 @@ distance relationships in the raw data are not necessarily preserved (data
 points that close to each other in the raw feature space may not be equally
 close in the transformed feature space). This means that unlike linear
 normalization methods, correlations between features may be distorted by the
-transformation.
+transformation. This scaling method may help with stability and minimize
+exploding gradients in neural networks.
 
 #### Args:
 
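The interpolation behavior described in this diff can be sketched in plain Python. This is a hypothetical helper illustrating the idea, not the TFT implementation: each value is located within its quantile bucket and mapped linearly into [0, 1].

```python
def apply_buckets_with_interpolation(values, boundaries):
    """Map each value into [0, 1] by linear interpolation between
    quantile boundaries (simplified sketch of the documented behavior)."""
    n = len(boundaries)
    out = []
    for v in values:
        if v <= boundaries[0]:
            out.append(0.0)
        elif v >= boundaries[-1]:
            out.append(1.0)
        else:
            for i in range(n - 1):
                lo, hi = boundaries[i], boundaries[i + 1]
                if lo <= v < hi:
                    # Position inside bucket i, normalized over n - 1 buckets.
                    frac = (v - lo) / (hi - lo)
                    out.append((i + frac) / (n - 1))
                    break
    return out

print(apply_buckets_with_interpolation([0, 5, 10, 20], [0, 10, 20]))
# → [0.0, 0.25, 0.5, 1.0]
```

Because the output is bounded in [0, 1] regardless of the raw feature scale, downstream activations cannot grow with outliers, which is the stability property the added sentence alludes to.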

docs/api_docs/python/tft/bucketize.md

Lines changed: 8 additions & 7 deletions

@@ -12,7 +12,7 @@ tft.bucketize(
     epsilon=None,
     weights=None,
     elementwise=False,
-    always_return_num_quantiles=False,
+    always_return_num_quantiles=True,
     name=None
 )
 ```
@@ -26,10 +26,11 @@ Returns a bucketized column, with a bucket index assigned to each input.
   in the quantiles computation, and the result of `bucketize` will be a
   `SparseTensor` with non-missing values mapped to buckets.
 * <b>`num_buckets`</b>: Values in the input `x` are divided into approximately
-  equal-sized buckets, where the number of buckets is num_buckets.
-  This is a hint. The actual number of buckets computed can be
-  less or more than the requested number. Use the generated metadata to
-  find the computed number of buckets.
+  equal-sized buckets, where the number of buckets is `num_buckets`. By
+  default, the exact number will be available to `bucketize`. If
+  `always_return_num_quantiles` is False, the actual number of
+  buckets computed can be less or more than the requested number. Use the
+  generated metadata to find the computed number of buckets.
 * <b>`epsilon`</b>: (Optional) Error tolerance, typically a small fraction close to
   zero. If a value is not specified by the caller, a suitable value is
   computed based on experimental results. For `num_buckets` less
@@ -44,8 +45,8 @@ Returns a bucketized column, with a bucket index assigned to each input.
 * <b>`elementwise`</b>: (Optional) If true, bucketize each element of the tensor
   independently.
 * <b>`always_return_num_quantiles`</b>: (Optional) A bool that determines whether the
-  exact num_buckets should be returned (defaults to False for now, but will
-  be changed to True in an imminent update).
+  exact num_buckets should be returned. If False, `num_buckets` will be
+  treated as a suggestion.
 * <b>`name`</b>: (Optional) A name for this operation.
 
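The quantile bucketing that `bucketize` documents can be illustrated with a naive in-memory sketch (hypothetical helper names; the real analyzer streams over the dataset with an approximate algorithm whose accuracy is tuned by `epsilon`):

```python
def quantile_boundaries(values, num_buckets):
    """Return num_buckets - 1 boundaries splitting sorted values into
    approximately equal-sized buckets (exact, in-memory sketch only)."""
    s = sorted(values)
    return [s[(len(s) * i) // num_buckets] for i in range(1, num_buckets)]

def bucketize(values, boundaries):
    """Bucket index = number of boundaries the value is >= to."""
    return [sum(1 for b in boundaries if v >= b) for v in values]

bounds = quantile_boundaries(list(range(100)), 4)
print(bounds)                              # → [25, 50, 75]
print(bucketize([3, 30, 60, 99], bounds))  # → [0, 1, 2, 3]
```

Note how `num_buckets` buckets need only `num_buckets - 1` boundaries, which is why the changed default (`always_return_num_quantiles=True`) lets callers rely on an exact bucket count.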

docs/api_docs/python/tft/compute_and_apply_vocabulary.md

Lines changed: 13 additions & 3 deletions

@@ -47,7 +47,9 @@ operation.
   absolute frequency is >= to the supplied threshold. If set to None, the
   full vocabulary is generated. Absolute frequency means the number of
   occurences of the element in the dataset, as opposed to the proportion of
-  instances that contain that element.
+  instances that contain that element. If labels are provided and the vocab
+  is computed using mutual information, tokens are filtered if their mutual
+  information with the label is < the supplied threshold.
 * <b>`num_oov_buckets`</b>: Any lookup of an out-of-vocabulary token will return a
   bucket ID based on its hash if `num_oov_buckets` is greater than zero.
   Otherwise it is assigned the `default_value`.
@@ -60,8 +62,16 @@ operation.
   downstream component.
 * <b>`weights`</b>: (Optional) Weights `Tensor` for the vocabulary. It must have the
   same shape as x.
-* <b>`labels`</b>: (Optional) Labels `Tensor` for the vocabulary. It must have dtype
-  int64, have values 0 or 1, and have the same shape as x.
+* <b>`labels`</b>: (Optional) A `Tensor` of labels for the vocabulary. If provided,
+  the vocabulary is calculated based on mutual information with the label,
+  rather than frequency. The labels must have the same batch dimension as x.
+  If x is sparse, labels should be a 1D tensor reflecting row-wise labels.
+  If x is dense, labels can either be a 1D tensor of row-wise labels, or
+  a dense tensor of the identical shape as x (i.e. element-wise labels).
+  Labels should be a discrete integerized tensor (If the label is numeric,
+  it should first be bucketized; If the label is a string, an integer
+  vocabulary should first be applied). Note: `SparseTensor` labels are not
+  yet supported (b/134931826).
 * <b>`use_adjusted_mutual_info`</b>: If true, use adjusted mutual information.
 * <b>`min_diff_from_avg`</b>: Mutual information of a feature will be adjusted to zero
 whenever the difference between count of the feature with any label and
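The mutual-information ranking this diff introduces can be sketched with a toy in-memory scorer (hypothetical helper; TFT uses streaming counts and, optionally, an adjusted variant): each token is scored by how much its presence tells you about a binary label.

```python
import math
from collections import Counter

def mi_scores(rows):
    """Score each token by its mutual-information contribution with a
    binary label. `rows` is a list of (token, label) pairs. Toy sketch,
    not the TFT implementation."""
    n = len(rows)
    tok = Counter(t for t, _ in rows)
    lab = Counter(y for _, y in rows)
    joint = Counter(rows)
    scores = {}
    for (t, y), c in joint.items():
        p_ty = c / n
        p_t, p_y = tok[t] / n, lab[y] / n
        # Sum p(t, y) * log(p(t, y) / (p(t) * p(y))) over labels y.
        scores[t] = scores.get(t, 0.0) + p_ty * math.log(p_ty / (p_t * p_y))
    return scores

rows = [("cat", 1), ("cat", 1), ("dog", 0), ("dog", 0), ("the", 1), ("the", 0)]
s = mi_scores(rows)
# "cat"/"dog" predict the label perfectly and score > 0; "the" appears
# equally under both labels and scores 0, so a threshold would filter it.
```

Filtering tokens whose score falls below `frequency_threshold` is the behavior the updated docstring describes.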

docs/api_docs/python/tft/pca.md

Lines changed: 33 additions & 28 deletions

@@ -14,62 +14,67 @@ tft.pca(
 )
 ```
 
-Computes pca on the dataset using biased covariance.
+Computes PCA on the dataset using biased covariance.
 
-The pca analyzer computes output_dim orthonormal vectors that capture
+The PCA analyzer computes output_dim orthonormal vectors that capture
 directions/axes corresponding to the highest variances in the input vectors of
-x. The output vectors are returned as a rank-2 tensor with shape
-(input_dim, output_dim), where the 0th dimension are the components of each
+`x`. The output vectors are returned as a rank-2 tensor with shape
+`(input_dim, output_dim)`, where the 0th dimension are the components of each
 output vector, and the 1st dimension are the output vectors representing
 orthogonal directions in the input space, sorted in order of decreasing
 variances.
 
 The output rank-2 tensor (matrix) serves a useful transform purpose. Formally,
 the matrix can be used downstream in the transform step by multiplying it to
-the input tensor x. This transform reduces the dimension of input vectors to
+the input tensor `x`. This transform reduces the dimension of input vectors to
 output_dim in a way that retains the maximal variance.
 
 NOTE: To properly use PCA, input vector components should be converted to
 similar units of measurement such that the vectors represent a Euclidean
 space. If no such conversion is available (e.g. one element represents time,
 another element distance), the canonical approach is to first apply a
 transformation to the input data to normalize numerical variances, i.e.
-tft.scale_to_z_score(). Normalization allows PCA to choose output axes that
+`tft.scale_to_z_score()`. Normalization allows PCA to choose output axes that
 help decorrelate input axes.
 
 Below are a couple intuitive examples of PCA.
 
 Consider a simple 2-dimensional example:
 
-Input x is a series of vectors [e, e] where e is Gaussian with mean 0,
+Input x is a series of vectors `[e, e]` where `e` is Gaussian with mean 0,
 variance 1. The two components are perfectly correlated, and the resulting
 covariance matrix is
+
+```
 [[1 1],
 [1 1]].
-Applying PCA with output_dim = 1 would discover the first principal component
-[1 / sqrt(2), 1 / sqrt(2)]. When multipled to the original example, each
-vector [e, e] would be mapped to a scalar sqrt(2) * e. The second principal
-component would be [-1 / sqrt(2), 1 / sqrt(2)] and would map [e, e] to 0,
-which indicates that the second component captures no variance at all. This
-agrees with our intuition since we know that the two axes in the input are
-perfectly correlated and can be fully explained by a single scalar e.
+```
+
+Applying PCA with `output_dim = 1` would discover the first principal
+component `[1 / sqrt(2), 1 / sqrt(2)]`. When multipled to the original
+example, each vector `[e, e]` would be mapped to a scalar `sqrt(2) * e`. The
+second principal component would be `[-1 / sqrt(2), 1 / sqrt(2)]` and would
+map `[e, e]` to 0, which indicates that the second component captures no
+variance at all. This agrees with our intuition since we know that the two
+axes in the input are perfectly correlated and can be fully explained by a
+single scalar `e`.
 
 Consider a 3-dimensional example:
 
-Input x is a series of vectors [a, a, b], where a is a zero-mean, unit
-variance Gaussian. b is a zero-mean, variance 4 Gaussian and is independent of
-a. The first principal component of the unnormalized vector would be [0, 0, 1]
-since b has a much larger variance than any linear combination of the first
-two components. This would map [a, a, b] onto b, asserting that the axis with
-highest energy is the third component. While this may be the desired
-output if a and b correspond to the same units, it is not statistically
-desireable when the units are irreconciliable. In such a case, one should
-first normalize each component to unit variance first, i.e. b := b / 2.
-The first principal component of a normalized vector would yield
-[1 / sqrt(2), 1 / sqrt(2), 0], and would map [a, a, b] to sqrt(2) * a. The
-second component would be [0, 0, 1] and map [a, a, b] to b. As can be seen,
-the benefit of normalization is that PCA would capture highly correlated
-components first and collapse them into a lower dimension.
+Input `x` is a series of vectors `[a, a, b]`, where `a` is a zero-mean, unit
+variance Gaussian and `b` is a zero-mean, variance 4 Gaussian and is
+independent of `a`. The first principal component of the unnormalized vector
+would be `[0, 0, 1]` since `b` has a much larger variance than any linear
+combination of the first two components. This would map `[a, a, b]` onto `b`,
+asserting that the axis with highest energy is the third component. While this
+may be the desired output if `a` and `b` correspond to the same units, it is
+not statistically desireable when the units are irreconciliable. In such a
+case, one should first normalize each component to unit variance first, i.e.
+`b := b / 2`. The first principal component of a normalized vector would yield
+`[1 / sqrt(2), 1 / sqrt(2), 0]`, and would map `[a, a, b]` to `sqrt(2) * a`.
+The second component would be `[0, 0, 1]` and map `[a, a, b]` to `b`. As can
+be seen, the benefit of normalization is that PCA would capture highly
+correlated components first and collapse them into a lower dimension.
 
 #### Args:
 
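The 2-dimensional example in the pca.md docstring can be verified numerically. Below is a closed-form sketch for a symmetric 2x2 covariance matrix (a hypothetical helper, not tft code; it assumes the off-diagonal term is nonzero):

```python
import math

def top_eigenvector_2x2(m):
    """Largest-eigenvalue eigenvector of a symmetric 2x2 matrix
    [[a, b], [b, d]], via the closed-form quadratic. Assumes b != 0."""
    (a, b), (_, d) = m
    # Larger root of the characteristic polynomial.
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)
    # Solve (a - lam) x + b y = 0  ->  direction (b, lam - a).
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm, lam

# Covariance of the perfectly correlated [e, e] example from the docs:
vx, vy, lam = top_eigenvector_2x2([[1, 1], [1, 1]])
# First principal component is [1/sqrt(2), 1/sqrt(2)] with variance 2,
# matching the docstring's claim.
```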

docs/api_docs/python/tft/quantiles.md

Lines changed: 8 additions & 6 deletions

@@ -12,7 +12,7 @@ tft.quantiles(
     epsilon,
     weights=None,
     reduce_instance_dims=True,
-    always_return_num_quantiles=False,
+    always_return_num_quantiles=True,
     name=None
 )
 ```
@@ -28,9 +28,11 @@ See go/squawd for details, and how to control the error due to approximation.
 
 * <b>`x`</b>: An input `Tensor`.
 * <b>`num_buckets`</b>: Values in the `x` are divided into approximately equal-sized
-  buckets, where the number of buckets is num_buckets. This is a hint. The
-  actual number of buckets computed can be less or more than the requested
-  number. Use the generated metadata to find the computed number of buckets.
+  buckets, where the number of buckets is `num_buckets`. By default, the
+  exact number will be returned, minus one (boundary count is one less).
+  If `always_return_num_quantiles` is False, the actual number of buckets
+  computed can be less or more than the requested number. Use the generated
+  metadata to find the computed number of buckets.
 * <b>`epsilon`</b>: Error tolerance, typically a small fraction close to zero (e.g.
   0.01). Higher values of epsilon increase the quantile approximation, and
   hence result in more unequal buckets, but could improve performance,
@@ -52,8 +54,8 @@ See go/squawd for details, and how to control the error due to approximation.
   to arrive at a single output vector. If False, only collapses the batch
   dimension and outputs a vector of the same shape as the input.
 * <b>`always_return_num_quantiles`</b>: (Optional) A bool that determines whether the
-  exact num_buckets should be returned (defaults to False for now, but will
-  be changed to True in an imminent update).
+  exact num_buckets should be returned. If False, `num_buckets` will be
+  treated as a suggestion.
 * <b>`name`</b>: (Optional) A name for this operation.
 
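The "minus one" boundary count noted in the diff above can be sketched directly: with `always_return_num_quantiles=True`-style behavior, interpolating percentile positions always yields exactly `num_buckets - 1` boundaries, even on skewed data with repeated values (hypothetical helper, not TFT code):

```python
def exact_quantile_boundaries(values, num_buckets):
    """Always return exactly num_buckets - 1 boundaries by linear
    interpolation over the sorted data."""
    s = sorted(values)
    n = len(s)
    bounds = []
    for i in range(1, num_buckets):
        pos = i * (n - 1) / num_buckets   # fractional index of the quantile
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        bounds.append(s[lo] * (1 - frac) + s[hi] * frac)
    return bounds

# Highly repetitive data still yields num_buckets - 1 boundaries
# (here they coincide, which duplicate-collapsing hint mode would not keep).
print(len(exact_quantile_boundaries([1, 1, 1, 1, 5], 3)))  # → 2
```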

docs/api_docs/python/tft_beam/AnalyzeAndTransformDataset.md

Lines changed: 5 additions & 0 deletions

@@ -218,7 +218,12 @@ from_runner_api(
 get_type_hints()
 ```
 
+Gets and/or initializes type hints for this object.
 
+If type hints have not been set, attempts to initialize type hints in this
+order:
+- Using self.default_type_hints().
+- Using self.__class__ type hints.
 
 <h3 id="get_windowing"><code>get_windowing</code></h3>
 
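The lookup order documented in the added `get_type_hints` docstring (instance hints, then `default_type_hints()`, then class-level hints) can be sketched as follows. This is a hypothetical stand-in, not the Apache Beam implementation:

```python
class TypeHintedTransform:
    """Sketch of the documented get_type_hints() fallback chain."""

    _class_hints = {"output": "Any"}  # stands in for __class__ type hints

    def __init__(self):
        self._hints = None  # no instance-level hints set yet

    def default_type_hints(self):
        return None  # this subclass declares no defaults

    def get_type_hints(self):
        # 1. already-set hints, 2. default_type_hints(), 3. class hints
        if self._hints is None:
            self._hints = self.default_type_hints() or self._class_hints
        return self._hints

print(TypeHintedTransform().get_type_hints())  # → {'output': 'Any'}
```

The same docstring is added to `AnalyzeDataset` and `AnalyzeDatasetWithCache` below, since all three inherit the method from their Beam `PTransform` base.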

docs/api_docs/python/tft_beam/AnalyzeDataset.md

Lines changed: 5 additions & 0 deletions

@@ -203,7 +203,12 @@ from_runner_api(
 get_type_hints()
 ```
 
+Gets and/or initializes type hints for this object.
 
+If type hints have not been set, attempts to initialize type hints in this
+order:
+- Using self.default_type_hints().
+- Using self.__class__ type hints.
 
 <h3 id="get_windowing"><code>get_windowing</code></h3>
 

docs/api_docs/python/tft_beam/AnalyzeDatasetWithCache.md

Lines changed: 6 additions & 0 deletions

@@ -51,6 +51,7 @@ will write out cache for statistics that it does compute whenever possible.
 
 * <b>`preprocessing_fn`</b>: A function that accepts and returns a dictionary from
   strings to `Tensor` or `SparseTensor`s.
+* <b>`pipeline`</b>: (Optional) a beam Pipeline.
 
 <h2 id="__init__"><code>__init__</code></h2>
 
@@ -203,7 +204,12 @@ from_runner_api(
 get_type_hints()
 ```
 
+Gets and/or initializes type hints for this object.
 
+If type hints have not been set, attempts to initialize type hints in this
+order:
+- Using self.default_type_hints().
+- Using self.__class__ type hints.
 
 <h3 id="get_windowing"><code>get_windowing</code></h3>
 

docs/api_docs/python/tft_beam/Context.md

Lines changed: 1 addition & 1 deletion

@@ -20,7 +20,7 @@ Context manager for tensorflow-transform.
 
 All the attributes in this context are kept on a thread local state.
 
-#### Args:
+#### Attributes:
 
 * <b>`temp_dir`</b>: (Optional) The temporary directory used within in this block.
 * <b>`desired_batch_size`</b>: (Optional) A batch size to batch elements by. If not

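The thread-local behavior the Context docs describe can be sketched in plain Python (a hypothetical stand-in, not the tft_beam implementation): attributes set inside the `with` block are visible to code on the same thread and restored on exit.

```python
import threading

class Context:
    """Sketch of a context manager whose attributes live in
    thread-local state."""
    _state = threading.local()

    def __init__(self, temp_dir=None, desired_batch_size=None):
        self._attrs = {"temp_dir": temp_dir,
                       "desired_batch_size": desired_batch_size}

    def __enter__(self):
        # Save the enclosing context so nested blocks restore correctly.
        self._prev = getattr(Context._state, "attrs", None)
        Context._state.attrs = self._attrs
        return self

    def __exit__(self, *exc):
        Context._state.attrs = self._prev

    @classmethod
    def current(cls):
        return getattr(cls._state, "attrs", None)

with Context(temp_dir="/tmp/tft", desired_batch_size=128):
    print(Context.current()["desired_batch_size"])  # → 128
```

Because `_state` is a `threading.local`, each thread sees only the context it entered itself, which is what "kept on a thread local state" means above.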