Commit 2a12958

zoyahav authored and tf-transform-team committed
Update docs
PiperOrigin-RevId: 248344586
1 parent d148f9c commit 2a12958

12 files changed

Lines changed: 107 additions & 19 deletions

docs/api_docs/python/_toc.yaml

Lines changed: 2 additions & 0 deletions
@@ -66,6 +66,8 @@ toc:
 path: /tfx/transform/api_docs/python/tft/scale_to_0_1
 - title: scale_to_z_score
 path: /tfx/transform/api_docs/python/tft/scale_to_z_score
+- title: scale_to_z_score_per_key
+path: /tfx/transform/api_docs/python/tft/scale_to_z_score_per_key
 - title: segment_indices
 path: /tfx/transform/api_docs/python/tft/segment_indices
 - title: size

docs/api_docs/python/index.md

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@
 * <a href="./tft/scale_by_min_max.md"><code>tft.scale_by_min_max</code></a>
 * <a href="./tft/scale_to_0_1.md"><code>tft.scale_to_0_1</code></a>
 * <a href="./tft/scale_to_z_score.md"><code>tft.scale_to_z_score</code></a>
+* <a href="./tft/scale_to_z_score_per_key.md"><code>tft.scale_to_z_score_per_key</code></a>
 * <a href="./tft/segment_indices.md"><code>tft.segment_indices</code></a>
 * <a href="./tft/size.md"><code>tft.size</code></a>
 * <a href="./tft/sparse_tensor_to_dense_with_shape.md"><code>tft.sparse_tensor_to_dense_with_shape</code></a>

docs/api_docs/python/tft.md

Lines changed: 2 additions & 0 deletions
@@ -83,6 +83,8 @@ Init module for TF.Transform.

 [`scale_to_z_score(...)`](./tft/scale_to_z_score.md): Returns a standardized column with mean 0 and variance 1.

+[`scale_to_z_score_per_key(...)`](./tft/scale_to_z_score_per_key.md): Returns a standardized column with mean 0 and variance 1, grouped per key.
+
 [`segment_indices(...)`](./tft/segment_indices.md): Returns a `Tensor` of indices within each segment.

 [`size(...)`](./tft/size.md): Computes the total size of instances in a `Tensor` over the whole dataset.

docs/api_docs/python/tft/WeightedMeanAndVarCombiner.md

Lines changed: 2 additions & 2 deletions
@@ -114,7 +114,7 @@ Converts an accumulator into the output (mean, var) tuple.

 #### Returns:

-A 2-tuple composed of (mean, var) or None if accumulator is None.
+A 2-tuple composed of (mean, var).

 <h3 id="merge_accumulators"><code>merge_accumulators</code></h3>

@@ -126,7 +126,7 @@ Merges several `_WeightedMeanAndVarAccumulator`s to a single accumulator.

 #### Args:

-* <b>`accumulators`</b>: A list of `_WeightedMeanAndVarAccumulator`s and/or Nones.
+* <b>`accumulators`</b>: A list of `_WeightedMeanAndVarAccumulator`s.


 #### Returns:
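
Note on the hunks above: with `None` no longer accepted, merging reduces to folding a list of well-formed accumulators. The following is a minimal sketch of that idea, not the library's code. It assumes an unweighted accumulator holding `(count, mean, variance)` with biased variance; the names `Acc`, `merge_two`, `merge_accumulators` and `extract_output` are hypothetical and only mirror the documented method names. The pairwise update is the standard parallel mean/variance merge.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical stand-in for _WeightedMeanAndVarAccumulator (fields are an assumption).
Acc = namedtuple("Acc", ["count", "mean", "variance"])


def merge_two(a, b):
    """Merge two accumulators using the pairwise mean/variance update."""
    total = a.count + b.count
    if total == 0:
        return Acc(0.0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / total
    # Sum of squared deviations of the union, built from the two partial sums.
    m2 = (a.variance * a.count + b.variance * b.count
          + delta ** 2 * a.count * b.count / total)
    return Acc(total, mean, m2 / total)


def merge_accumulators(accumulators):
    """Fold a list of accumulators (no None entries) into a single one."""
    return reduce(merge_two, accumulators, Acc(0.0, 0.0, 0.0))


def extract_output(acc):
    """Return the (mean, var) 2-tuple."""
    return (acc.mean, acc.variance)


# Accumulators for the partitions [1, 3] and [10, 30].
merged = merge_accumulators([Acc(2, 2.0, 1.0), Acc(2, 20.0, 100.0)])
mean, var = extract_output(merged)
print(mean, var)
```

The merged result matches the mean and biased variance of the combined values `[1, 3, 10, 30]`, which is why partial results from different workers can be combined in any grouping.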

docs/api_docs/python/tft/apply_buckets_with_interpolation.md

Lines changed: 15 additions & 4 deletions
@@ -22,15 +22,26 @@ interpolated values are normalized to the range [0, 1]. Values that are
 less than or equal to the lowest boundary, or greater than or equal to the
 highest boundary, will be mapped to 0 and 1 respectively.

+This is a non-linear approach to normalization that is less sensitive to
+outliers than min-max or z-score scaling. When outliers are present, standard
+forms of normalization can leave the majority of the data compressed into a
+very small segment of the output range, whereas this approach tends to spread
+out the more frequent values (if quantile buckets are used). Note that
+distance relationships in the raw data are not necessarily preserved (data
+points that close to each other in the raw feature space may not be equally
+close in the transformed feature space). This means that unlike linear
+normalization methods, correlations between features may be distorted by the
+transformation.
+
 #### Args:

-* <b>`x`</b>: A numeric input `Tensor` (tf.float32, tf.float64, tf.int32, tf.int64).
+* <b>`x`</b>: A numeric input `Tensor`/`SparseTensor` (tf.float[32|64], tf.int[32|64])
 * <b>`bucket_boundaries`</b>: Sorted bucket boundaries as a rank-2 `Tensor`.
 * <b>`name`</b>: (Optional) A name for this operation.


 #### Returns:

-A `Tensor` of the same shape as `x`, normalized to the range [0, 1]. If the
-input x is tf.float64, the returned values will be tf.float64.
-Otherwise, returned values are tf.float32.
+A `Tensor` or `SparseTensor` of the same shape as `x`, normalized to the
+range [0, 1]. If the input x is tf.float64, the returned values will be
+tf.float64. Otherwise, returned values are tf.float32.
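
To make the new paragraph about outliers concrete, here is a small pure-Python sketch of bucket interpolation. It is an illustration only and does not call TensorFlow Transform. It assumes each bucket occupies an equal share of the [0, 1] output range and that values are interpolated linearly inside their bucket; `interpolate_buckets` and the sample boundaries are hypothetical.

```python
import bisect


def interpolate_buckets(x, boundaries):
    """Map x to [0, 1] by linear interpolation within its bucket.

    `boundaries` is a sorted list of n + 1 values delimiting n buckets.
    """
    num_buckets = len(boundaries) - 1
    if x <= boundaries[0]:
        return 0.0
    if x >= boundaries[-1]:
        return 1.0
    # Index of the bucket whose lower boundary is the last one <= x.
    i = bisect.bisect_right(boundaries, x) - 1
    lo, hi = boundaries[i], boundaries[i + 1]
    return (i + (x - lo) / (hi - lo)) / num_buckets


# Quantile-like boundaries for data with a long tail up to 100.
boundaries = [0.0, 1.0, 2.0, 100.0]
for value in (-5.0, 0.5, 1.5, 51.0, 200.0):
    print(value, interpolate_buckets(value, boundaries))
```

With these boundaries, min-max scaling would place 0.5 and 1.5 at 0.005 and 0.015, while the bucket mapping places them roughly a third of the output range apart. That spread of frequent values is what the added paragraph describes, and it is also why distances between raw values are not preserved.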

docs/api_docs/python/tft/bucketize_per_key.md

Lines changed: 3 additions & 3 deletions
@@ -22,9 +22,9 @@ Returns a bucketized column, with a bucket index assigned to each input.
 * <b>`x`</b>: A numeric input `Tensor` or `SparseTensor` with rank 1, whose values
 should be mapped to buckets. `SparseTensor`s will have their non-missing
 values mapped and missing values left as missing.
-* <b>`key`</b>: A Tensor with the same shape as `x` and dtype tf.string. If `x` is
-a `SparseTensor`, `key` must exactly match `x` in everything except
-values, i.e. indices and dense_shape must be identical.
+* <b>`key`</b>: A Tensor or `SparseTensor` with the same shape as `x` and dtype
+tf.string. If `x` is a `SparseTensor`, `key` must exactly match `x` in
+everything except values, i.e. indices and dense_shape must be identical.
 * <b>`num_buckets`</b>: Values in the input `x` are divided into approximately
 equal-sized buckets, where the number of buckets is num_buckets.
 * <b>`epsilon`</b>: (Optional) see `bucketize`

docs/api_docs/python/tft/ptransform_analyzer.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ collection by the caller.
 #### Args:

 * <b>`inputs`</b>: A list of input `Tensor`s.
-* <b>`output_dtypes`</b>: The list of dtypes of the output of the analyzer.
+* <b>`output_dtypes`</b>: The list of TensorFlow dtypes of the output of the analyzer.
 * <b>`output_shapes`</b>: The list of shapes of the output of the analyzer. Must have
 the same length as output_dtypes.
 * <b>`ptransform`</b>: A Beam PTransform that accepts a Beam PCollection where each

docs/api_docs/python/tft/scale_to_z_score_per_key.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+<div itemscope itemtype="http://developers.google.com/ReferenceObject">
+<meta itemprop="name" content="tft.scale_to_z_score_per_key" />
+<meta itemprop="path" content="Stable" />
+</div>
+
+# tft.scale_to_z_score_per_key
+
+``` python
+tft.scale_to_z_score_per_key(
+    x,
+    key=None,
+    elementwise=False,
+    name=None,
+    output_dtype=None
+)
+```
+
+Returns a standardized column with mean 0 and variance 1, grouped per key.
+
+Scaling to z-score subtracts out the mean and divides by standard deviation.
+Note that the standard deviation computed here is based on the biased variance
+(0 delta degrees of freedom), as computed by analyzers.var.
+
+#### Args:
+
+* <b>`x`</b>: A numeric `Tensor` or `SparseTensor`.
+* <b>`key`</b>: A Tensor or `SparseTensor` of dtype tf.string. If `x` is a
+`SparseTensor`, `key` must exactly match `x` in everything except
+values.
+* <b>`elementwise`</b>: If true, scales each element of the tensor independently;
+otherwise uses the mean and variance of the whole tensor.
+Currently, not supported for per-key operations.
+* <b>`name`</b>: (Optional) A name for this operation.
+* <b>`output_dtype`</b>: (Optional) If not None, casts the output tensor to this type.
+
+
+#### Returns:
+
+A `Tensor` or `SparseTensor` containing the input column scaled to mean 0
+and variance 1 (standard deviation 1), grouped per key.
+That is, for all keys k: (x - mean(x)) / std_dev(x) for all x with key k.
+If `x` is floating point, the mean will have the same type as `x`. If `x` is
+integral, the output is cast to tf.float32.
+
+Note that TFLearn generally permits only tf.int64 and tf.float32, so casting
+this scaler's output may be necessary.
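
The per-key formula in the new page, (x - mean(x)) / std_dev(x) for all x with key k, can be checked by hand with a small pure-Python sketch. This is not the `tft` implementation and runs outside any Beam pipeline. The helper name `z_score_per_key` is hypothetical, and returning 0.0 for a key with zero variance is a choice made for this sketch rather than documented behaviour.

```python
import math
from collections import defaultdict


def z_score_per_key(values, keys):
    """Scale each value using the mean and biased std dev of its own key."""
    groups = defaultdict(list)
    for value, key in zip(values, keys):
        groups[key].append(value)

    stats = {}
    for key, group in groups.items():
        mean = sum(group) / len(group)
        # Biased variance: divide by N (0 delta degrees of freedom).
        variance = sum((v - mean) ** 2 for v in group) / len(group)
        stats[key] = (mean, math.sqrt(variance))

    scaled = []
    for value, key in zip(values, keys):
        mean, std = stats[key]
        # Assumption for this sketch: a constant group maps to 0.0.
        scaled.append((value - mean) / std if std > 0 else 0.0)
    return scaled


values = [1.0, 3.0, 10.0, 30.0]
keys = ["a", "a", "b", "b"]
scaled = z_score_per_key(values, keys)
print(scaled)  # [-1.0, 1.0, -1.0, 1.0]
```

Although the two keys have very different raw scales, both groups end up with mean 0 and variance 1, which a single global z-score would not achieve.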

docs/api_docs/python/tft/vocabulary.md

Lines changed: 11 additions & 7 deletions
@@ -76,7 +76,7 @@ within each vocabulary entry (b/117796748).
 * <b>`frequency_threshold`</b>: Limit the generated vocabulary only to elements whose
 absolute frequency is >= to the supplied threshold. If set to None, the
 full vocabulary is generated. Absolute frequency means the number of
-occurences of the element in the dataset, as opposed to the proportion of
+occurrences of the element in the dataset, as opposed to the proportion of
 instances that contain that element.
 * <b>`vocab_filename`</b>: The file name for the vocabulary file. If none, the
 "uniques" scope name in the context of this graph will be used as the file
@@ -90,12 +90,16 @@ within each vocabulary entry (b/117796748).
 will be of the form 'frequency word'.
 * <b>`weights`</b>: (Optional) Weights `Tensor` for the vocabulary. It must have the
 same shape as x.
-* <b>`labels`</b>: (Optional) Labels `Tensor` for the vocabulary. It must have dtype
-int64, have values 0 or 1, and have the same shape as x.
-* <b>`use_adjusted_mutual_info`</b>: If true, use adjusted mutual information.
-* <b>`min_diff_from_avg`</b>: Mutual information of a feature will be adjusted to zero
-whenever the difference between count of the feature with any label and
-its expected count is lower than min_diff_from_average.
+* <b>`labels`</b>: (Optional) Labels `Tensor` for the vocabulary. It must have the same
+shape as x and be a discrete integerized tensor (If the label is numeric,
+it should first be bucketized; If the label is a string, an integer
+vocabulary should first be applied).
+* <b>`use_adjusted_mutual_info`</b>: If true, and labels are provided, calculate
+vocabulary using adjusted rather than raw mutual information.
+* <b>`min_diff_from_avg`</b>: MI (or AMI) of a feature x label will be adjusted to zero
+whenever the difference between count and the expected (average) count is
+lower than min_diff_from_average. This can be thought of as a regularizing
+parameter that pushes small MI/AMI values to zero.
 * <b>`coverage_top_k`</b>: (Optional), (Experimental) The minimum number of elements
 per key to be included in the vocabulary.
 * <b>`coverage_frequency_threshold`</b>: (Optional), (Experimental) Limit the coverage

docs/api_docs/python/tft_beam/AnalyzeAndTransformDataset.md

Lines changed: 4 additions & 0 deletions
@@ -41,13 +41,17 @@

 Combination of AnalyzeDataset and TransformDataset.

+```python
 transformed, transform_fn = AnalyzeAndTransformDataset(
 preprocessing_fn).expand(dataset)
+```

 should be equivalent to

+```python
 transform_fn = AnalyzeDataset(preprocessing_fn).expand(dataset)
 transformed = TransformDataset().expand((dataset, transform_fn))
+```

 but may be more efficient since it avoids multiple passes over the data.
