Commit cd1254b

zoyahav authored and tf-transform-team committed

Generate TFT docs

PiperOrigin-RevId: 275863009
1 parent 92cb544 commit cd1254b

16 files changed: 117 additions & 50 deletions

docs/api_docs/python/tft.md

Lines changed: 1 addition & 1 deletion

@@ -59,7 +59,7 @@ Init module for TF.Transform.
 
 [`ngrams(...)`](./tft/ngrams.md): Create a `SparseTensor` of n-grams.
 
-[`pca(...)`](./tft/pca.md): Computes pca on the dataset using biased covariance.
+[`pca(...)`](./tft/pca.md): Computes PCA on the dataset using biased covariance.
 
 [`ptransform_analyzer(...)`](./tft/ptransform_analyzer.md): Applies a user-provided PTransform over the whole dataset.
 

docs/api_docs/python/tft/apply_buckets_with_interpolation.md

Lines changed: 2 additions & 1 deletion

@@ -31,7 +31,8 @@ distance relationships in the raw data are not necessarily preserved (data
 points that close to each other in the raw feature space may not be equally
 close in the transformed feature space). This means that unlike linear
 normalization methods, correlations between features may be distorted by the
-transformation.
+transformation. This scaling method may help with stability and minimize
+exploding gradients in neural networks.
 
 #### Args:
 
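The interpolation behavior described in this diff can be sketched in plain Python. This is a hypothetical helper illustrating the idea, not the TFT implementation: each value is located within its quantile bucket and mapped linearly into [0, 1].

```python
def apply_buckets_with_interpolation(values, boundaries):
    """Map each value into [0, 1] by linear interpolation between
    quantile boundaries (simplified sketch of the documented behavior)."""
    n = len(boundaries)
    out = []
    for v in values:
        if v <= boundaries[0]:
            out.append(0.0)
        elif v >= boundaries[-1]:
            out.append(1.0)
        else:
            for i in range(n - 1):
                lo, hi = boundaries[i], boundaries[i + 1]
                if lo <= v < hi:
                    # Position inside bucket i, normalized over n - 1 buckets.
                    frac = (v - lo) / (hi - lo)
                    out.append((i + frac) / (n - 1))
                    break
    return out

print(apply_buckets_with_interpolation([0, 5, 10, 20], [0, 10, 20]))
# → [0.0, 0.25, 0.5, 1.0]
```

Because the output is bounded in [0, 1] regardless of the raw feature scale, downstream activations cannot grow with outliers, which is the stability property the added sentence alludes to.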

docs/api_docs/python/tft/bucketize.md

Lines changed: 8 additions & 7 deletions

@@ -12,7 +12,7 @@ tft.bucketize(
     epsilon=None,
     weights=None,
     elementwise=False,
-    always_return_num_quantiles=False,
+    always_return_num_quantiles=True,
     name=None
 )
 ```
@@ -26,10 +26,11 @@ Returns a bucketized column, with a bucket index assigned to each input.
   in the quantiles computation, and the result of `bucketize` will be a
   `SparseTensor` with non-missing values mapped to buckets.
 * <b>`num_buckets`</b>: Values in the input `x` are divided into approximately
-  equal-sized buckets, where the number of buckets is num_buckets.
-  This is a hint. The actual number of buckets computed can be
-  less or more than the requested number. Use the generated metadata to
-  find the computed number of buckets.
+  equal-sized buckets, where the number of buckets is `num_buckets`. By
+  default, the exact number will be available to `bucketize`. If
+  `always_return_num_quantiles` is False, the actual number of
+  buckets computed can be less or more than the requested number. Use the
+  generated metadata to find the computed number of buckets.
 * <b>`epsilon`</b>: (Optional) Error tolerance, typically a small fraction close to
   zero. If a value is not specified by the caller, a suitable value is
   computed based on experimental results. For `num_buckets` less
@@ -44,8 +45,8 @@ Returns a bucketized column, with a bucket index assigned to each input.
 * <b>`elementwise`</b>: (Optional) If true, bucketize each element of the tensor
   independently.
 * <b>`always_return_num_quantiles`</b>: (Optional) A bool that determines whether the
-  exact num_buckets should be returned (defaults to False for now, but will
-  be changed to True in an imminent update).
+  exact num_buckets should be returned. If False, `num_buckets` will be
+  treated as a suggestion.
 * <b>`name`</b>: (Optional) A name for this operation.
 
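The quantile bucketing that `bucketize` documents can be illustrated with a naive in-memory sketch (hypothetical helper names; the real analyzer streams over the dataset with an approximate algorithm whose accuracy is tuned by `epsilon`):

```python
def quantile_boundaries(values, num_buckets):
    """Return num_buckets - 1 boundaries splitting sorted values into
    approximately equal-sized buckets (exact, in-memory sketch only)."""
    s = sorted(values)
    return [s[(len(s) * i) // num_buckets] for i in range(1, num_buckets)]

def bucketize(values, boundaries):
    """Bucket index = number of boundaries the value is >= to."""
    return [sum(1 for b in boundaries if v >= b) for v in values]

bounds = quantile_boundaries(list(range(100)), 4)
print(bounds)                              # → [25, 50, 75]
print(bucketize([3, 30, 60, 99], bounds))  # → [0, 1, 2, 3]
```

Note how `num_buckets` buckets need only `num_buckets - 1` boundaries, which is why the changed default (`always_return_num_quantiles=True`) lets callers rely on an exact bucket count.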

docs/api_docs/python/tft/compute_and_apply_vocabulary.md

Lines changed: 13 additions & 3 deletions

@@ -47,7 +47,9 @@ operation.
   absolute frequency is >= to the supplied threshold. If set to None, the
   full vocabulary is generated. Absolute frequency means the number of
   occurences of the element in the dataset, as opposed to the proportion of
-  instances that contain that element.
+  instances that contain that element. If labels are provided and the vocab
+  is computed using mutual information, tokens are filtered if their mutual
+  information with the label is < the supplied threshold.
 * <b>`num_oov_buckets`</b>: Any lookup of an out-of-vocabulary token will return a
   bucket ID based on its hash if `num_oov_buckets` is greater than zero.
   Otherwise it is assigned the `default_value`.
@@ -60,8 +62,16 @@ operation.
   downstream component.
 * <b>`weights`</b>: (Optional) Weights `Tensor` for the vocabulary. It must have the
   same shape as x.
-* <b>`labels`</b>: (Optional) Labels `Tensor` for the vocabulary. It must have dtype
-  int64, have values 0 or 1, and have the same shape as x.
+* <b>`labels`</b>: (Optional) A `Tensor` of labels for the vocabulary. If provided,
+  the vocabulary is calculated based on mutual information with the label,
+  rather than frequency. The labels must have the same batch dimension as x.
+  If x is sparse, labels should be a 1D tensor reflecting row-wise labels.
+  If x is dense, labels can either be a 1D tensor of row-wise labels, or
+  a dense tensor of the identical shape as x (i.e. element-wise labels).
+  Labels should be a discrete integerized tensor (If the label is numeric,
+  it should first be bucketized; If the label is a string, an integer
+  vocabulary should first be applied). Note: `SparseTensor` labels are not
+  yet supported (b/134931826).
 * <b>`use_adjusted_mutual_info`</b>: If true, use adjusted mutual information.
 * <b>`min_diff_from_avg`</b>: Mutual information of a feature will be adjusted to zero
 whenever the difference between count of the feature with any label and
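The mutual-information ranking this diff introduces can be sketched with a toy in-memory scorer (hypothetical helper; TFT uses streaming counts and, optionally, an adjusted variant): each token is scored by how much its presence tells you about a binary label.

```python
import math
from collections import Counter

def mi_scores(rows):
    """Score each token by its mutual-information contribution with a
    binary label. `rows` is a list of (token, label) pairs. Toy sketch,
    not the TFT implementation."""
    n = len(rows)
    tok = Counter(t for t, _ in rows)
    lab = Counter(y for _, y in rows)
    joint = Counter(rows)
    scores = {}
    for (t, y), c in joint.items():
        p_ty = c / n
        p_t, p_y = tok[t] / n, lab[y] / n
        # Sum p(t, y) * log(p(t, y) / (p(t) * p(y))) over labels y.
        scores[t] = scores.get(t, 0.0) + p_ty * math.log(p_ty / (p_t * p_y))
    return scores

rows = [("cat", 1), ("cat", 1), ("dog", 0), ("dog", 0), ("the", 1), ("the", 0)]
s = mi_scores(rows)
# "cat"/"dog" predict the label perfectly and score > 0; "the" appears
# equally under both labels and scores 0, so a threshold would filter it.
```

Filtering tokens whose score falls below `frequency_threshold` is the behavior the updated docstring describes.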

docs/api_docs/python/tft/pca.md

Lines changed: 33 additions & 28 deletions

@@ -14,62 +14,67 @@ tft.pca(
 )
 ```
 
-Computes pca on the dataset using biased covariance.
+Computes PCA on the dataset using biased covariance.
 
-The pca analyzer computes output_dim orthonormal vectors that capture
+The PCA analyzer computes output_dim orthonormal vectors that capture
 directions/axes corresponding to the highest variances in the input vectors of
-x. The output vectors are returned as a rank-2 tensor with shape
-(input_dim, output_dim), where the 0th dimension are the components of each
+`x`. The output vectors are returned as a rank-2 tensor with shape
+`(input_dim, output_dim)`, where the 0th dimension are the components of each
 output vector, and the 1st dimension are the output vectors representing
 orthogonal directions in the input space, sorted in order of decreasing
 variances.
 
 The output rank-2 tensor (matrix) serves a useful transform purpose. Formally,
 the matrix can be used downstream in the transform step by multiplying it to
-the input tensor x. This transform reduces the dimension of input vectors to
+the input tensor `x`. This transform reduces the dimension of input vectors to
 output_dim in a way that retains the maximal variance.
 
 NOTE: To properly use PCA, input vector components should be converted to
 similar units of measurement such that the vectors represent a Euclidean
 space. If no such conversion is available (e.g. one element represents time,
 another element distance), the canonical approach is to first apply a
 transformation to the input data to normalize numerical variances, i.e.
-tft.scale_to_z_score(). Normalization allows PCA to choose output axes that
+`tft.scale_to_z_score()`. Normalization allows PCA to choose output axes that
 help decorrelate input axes.
 
 Below are a couple intuitive examples of PCA.
 
 Consider a simple 2-dimensional example:
 
-Input x is a series of vectors [e, e] where e is Gaussian with mean 0,
+Input x is a series of vectors `[e, e]` where `e` is Gaussian with mean 0,
 variance 1. The two components are perfectly correlated, and the resulting
 covariance matrix is
+
+```
 [[1 1],
 [1 1]].
-Applying PCA with output_dim = 1 would discover the first principal component
-[1 / sqrt(2), 1 / sqrt(2)]. When multipled to the original example, each
-vector [e, e] would be mapped to a scalar sqrt(2) * e. The second principal
-component would be [-1 / sqrt(2), 1 / sqrt(2)] and would map [e, e] to 0,
-which indicates that the second component captures no variance at all. This
-agrees with our intuition since we know that the two axes in the input are
-perfectly correlated and can be fully explained by a single scalar e.
+```
+
+Applying PCA with `output_dim = 1` would discover the first principal
+component `[1 / sqrt(2), 1 / sqrt(2)]`. When multipled to the original
+example, each vector `[e, e]` would be mapped to a scalar `sqrt(2) * e`. The
+second principal component would be `[-1 / sqrt(2), 1 / sqrt(2)]` and would
+map `[e, e]` to 0, which indicates that the second component captures no
+variance at all. This agrees with our intuition since we know that the two
+axes in the input are perfectly correlated and can be fully explained by a
+single scalar `e`.
 
 Consider a 3-dimensional example:
 
-Input x is a series of vectors [a, a, b], where a is a zero-mean, unit
-variance Gaussian. b is a zero-mean, variance 4 Gaussian and is independent of
-a. The first principal component of the unnormalized vector would be [0, 0, 1]
-since b has a much larger variance than any linear combination of the first
-two components. This would map [a, a, b] onto b, asserting that the axis with
-highest energy is the third component. While this may be the desired
-output if a and b correspond to the same units, it is not statistically
-desireable when the units are irreconciliable. In such a case, one should
-first normalize each component to unit variance first, i.e. b := b / 2.
-The first principal component of a normalized vector would yield
-[1 / sqrt(2), 1 / sqrt(2), 0], and would map [a, a, b] to sqrt(2) * a. The
-second component would be [0, 0, 1] and map [a, a, b] to b. As can be seen,
-the benefit of normalization is that PCA would capture highly correlated
-components first and collapse them into a lower dimension.
+Input `x` is a series of vectors `[a, a, b]`, where `a` is a zero-mean, unit
+variance Gaussian and `b` is a zero-mean, variance 4 Gaussian and is
+independent of `a`. The first principal component of the unnormalized vector
+would be `[0, 0, 1]` since `b` has a much larger variance than any linear
+combination of the first two components. This would map `[a, a, b]` onto `b`,
+asserting that the axis with highest energy is the third component. While this
+may be the desired output if `a` and `b` correspond to the same units, it is
+not statistically desireable when the units are irreconciliable. In such a
+case, one should first normalize each component to unit variance first, i.e.
+`b := b / 2`. The first principal component of a normalized vector would yield
+`[1 / sqrt(2), 1 / sqrt(2), 0]`, and would map `[a, a, b]` to `sqrt(2) * a`.
+The second component would be `[0, 0, 1]` and map `[a, a, b]` to `b`. As can
+be seen, the benefit of normalization is that PCA would capture highly
+correlated components first and collapse them into a lower dimension.
 
 #### Args:
 
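The 2-dimensional example in the pca.md docstring can be verified numerically. Below is a closed-form sketch for a symmetric 2x2 covariance matrix (a hypothetical helper, not tft code; it assumes the off-diagonal term is nonzero):

```python
import math

def top_eigenvector_2x2(m):
    """Largest-eigenvalue eigenvector of a symmetric 2x2 matrix
    [[a, b], [b, d]], via the closed-form quadratic. Assumes b != 0."""
    (a, b), (_, d) = m
    # Larger root of the characteristic polynomial.
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)
    # Solve (a - lam) x + b y = 0  ->  direction (b, lam - a).
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm, lam

# Covariance of the perfectly correlated [e, e] example from the docs:
vx, vy, lam = top_eigenvector_2x2([[1, 1], [1, 1]])
# First principal component is [1/sqrt(2), 1/sqrt(2)] with variance 2,
# matching the docstring's claim.
```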

docs/api_docs/python/tft/quantiles.md

Lines changed: 8 additions & 6 deletions

@@ -12,7 +12,7 @@ tft.quantiles(
     epsilon,
     weights=None,
     reduce_instance_dims=True,
-    always_return_num_quantiles=False,
+    always_return_num_quantiles=True,
     name=None
 )
 ```
@@ -28,9 +28,11 @@ See go/squawd for details, and how to control the error due to approximation.
 
 * <b>`x`</b>: An input `Tensor`.
 * <b>`num_buckets`</b>: Values in the `x` are divided into approximately equal-sized
-  buckets, where the number of buckets is num_buckets. This is a hint. The
-  actual number of buckets computed can be less or more than the requested
-  number. Use the generated metadata to find the computed number of buckets.
+  buckets, where the number of buckets is `num_buckets`. By default, the
+  exact number will be returned, minus one (boundary count is one less).
+  If `always_return_num_quantiles` is False, the actual number of buckets
+  computed can be less or more than the requested number. Use the generated
+  metadata to find the computed number of buckets.
 * <b>`epsilon`</b>: Error tolerance, typically a small fraction close to zero (e.g.
   0.01). Higher values of epsilon increase the quantile approximation, and
   hence result in more unequal buckets, but could improve performance,
@@ -52,8 +54,8 @@ See go/squawd for details, and how to control the error due to approximation.
   to arrive at a single output vector. If False, only collapses the batch
   dimension and outputs a vector of the same shape as the input.
 * <b>`always_return_num_quantiles`</b>: (Optional) A bool that determines whether the
-  exact num_buckets should be returned (defaults to False for now, but will
-  be changed to True in an imminent update).
+  exact num_buckets should be returned. If False, `num_buckets` will be
+  treated as a suggestion.
 * <b>`name`</b>: (Optional) A name for this operation.
 
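The "minus one" boundary count noted in the diff above can be sketched directly: with `always_return_num_quantiles=True`-style behavior, interpolating percentile positions always yields exactly `num_buckets - 1` boundaries, even on skewed data with repeated values (hypothetical helper, not TFT code):

```python
def exact_quantile_boundaries(values, num_buckets):
    """Always return exactly num_buckets - 1 boundaries by linear
    interpolation over the sorted data."""
    s = sorted(values)
    n = len(s)
    bounds = []
    for i in range(1, num_buckets):
        pos = i * (n - 1) / num_buckets   # fractional index of the quantile
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        bounds.append(s[lo] * (1 - frac) + s[hi] * frac)
    return bounds

# Highly repetitive data still yields num_buckets - 1 boundaries
# (here they coincide, which duplicate-collapsing hint mode would not keep).
print(len(exact_quantile_boundaries([1, 1, 1, 1, 5], 3)))  # → 2
```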

docs/api_docs/python/tft_beam/AnalyzeAndTransformDataset.md

Lines changed: 5 additions & 0 deletions

@@ -218,7 +218,12 @@ from_runner_api(
 get_type_hints()
 ```
 
+Gets and/or initializes type hints for this object.
 
+If type hints have not been set, attempts to initialize type hints in this
+order:
+- Using self.default_type_hints().
+- Using self.__class__ type hints.
 
 <h3 id="get_windowing"><code>get_windowing</code></h3>
 
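The lookup order documented in the added `get_type_hints` docstring (instance hints, then `default_type_hints()`, then class-level hints) can be sketched as follows. This is a hypothetical stand-in, not the Apache Beam implementation:

```python
class TypeHintedTransform:
    """Sketch of the documented get_type_hints() fallback chain."""

    _class_hints = {"output": "Any"}  # stands in for __class__ type hints

    def __init__(self):
        self._hints = None  # no instance-level hints set yet

    def default_type_hints(self):
        return None  # this subclass declares no defaults

    def get_type_hints(self):
        # 1. already-set hints, 2. default_type_hints(), 3. class hints
        if self._hints is None:
            self._hints = self.default_type_hints() or self._class_hints
        return self._hints

print(TypeHintedTransform().get_type_hints())  # → {'output': 'Any'}
```

The same docstring is added to `AnalyzeDataset` and `AnalyzeDatasetWithCache` below, since all three inherit the method from their Beam `PTransform` base.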

docs/api_docs/python/tft_beam/AnalyzeDataset.md

Lines changed: 5 additions & 0 deletions

@@ -203,7 +203,12 @@ from_runner_api(
 get_type_hints()
 ```
 
+Gets and/or initializes type hints for this object.
 
+If type hints have not been set, attempts to initialize type hints in this
+order:
+- Using self.default_type_hints().
+- Using self.__class__ type hints.
 
 <h3 id="get_windowing"><code>get_windowing</code></h3>
 

docs/api_docs/python/tft_beam/AnalyzeDatasetWithCache.md

Lines changed: 6 additions & 0 deletions

@@ -51,6 +51,7 @@ will write out cache for statistics that it does compute whenever possible.
 
 * <b>`preprocessing_fn`</b>: A function that accepts and returns a dictionary from
   strings to `Tensor` or `SparseTensor`s.
+* <b>`pipeline`</b>: (Optional) a beam Pipeline.
 
 <h2 id="__init__"><code>__init__</code></h2>
 
@@ -203,7 +204,12 @@ from_runner_api(
 get_type_hints()
 ```
 
+Gets and/or initializes type hints for this object.
 
+If type hints have not been set, attempts to initialize type hints in this
+order:
+- Using self.default_type_hints().
+- Using self.__class__ type hints.
 
 <h3 id="get_windowing"><code>get_windowing</code></h3>
 

docs/api_docs/python/tft_beam/Context.md

Lines changed: 1 addition & 1 deletion

@@ -20,7 +20,7 @@ Context manager for tensorflow-transform.
 
 All the attributes in this context are kept on a thread local state.
 
-#### Args:
+#### Attributes:
 
 * <b>`temp_dir`</b>: (Optional) The temporary directory used within in this block.
 * <b>`desired_batch_size`</b>: (Optional) A batch size to batch elements by. If not

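The thread-local behavior the Context docs describe can be sketched in plain Python (a hypothetical stand-in, not the tft_beam implementation): attributes set inside the `with` block are visible to code on the same thread and restored on exit.

```python
import threading

class Context:
    """Sketch of a context manager whose attributes live in
    thread-local state."""
    _state = threading.local()

    def __init__(self, temp_dir=None, desired_batch_size=None):
        self._attrs = {"temp_dir": temp_dir,
                       "desired_batch_size": desired_batch_size}

    def __enter__(self):
        # Save the enclosing context so nested blocks restore correctly.
        self._prev = getattr(Context._state, "attrs", None)
        Context._state.attrs = self._attrs
        return self

    def __exit__(self, *exc):
        Context._state.attrs = self._prev

    @classmethod
    def current(cls):
        return getattr(cls._state, "attrs", None)

with Context(temp_dir="/tmp/tft", desired_batch_size=128):
    print(Context.current()["desired_batch_size"])  # → 128
```

Because `_state` is a `threading.local`, each thread sees only the context it entered itself, which is what "kept on a thread local state" means above.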