Update docs

zoyahav · tf-transform-team · commit 2a1295897063 · 2019-05-15T09:12:40.000-07:00
PiperOrigin-RevId: 248344586
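
The new `tft.scale_to_z_score_per_key` page added in this change describes standardizing each value with the mean and biased standard deviation (0 delta degrees of freedom) of its own key group. A minimal pure-Python sketch of that arithmetic follows. It is not the TF.Transform implementation: the helper name `z_score_per_key` is hypothetical, and mapping a zero-variance group to 0.0 is an assumption made here to avoid division by zero.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def z_score_per_key(values, keys):
    """Scale each value by the mean and population std dev of its key group."""
    # Collect the values that belong to each key.
    groups = defaultdict(list)
    for value, key in zip(values, keys):
        groups[key].append(value)

    # pstdev is the population (biased, ddof=0) standard deviation.
    stats = {k: (fmean(vs), pstdev(vs)) for k, vs in groups.items()}

    scaled = []
    for value, key in zip(values, keys):
        mean, std = stats[key]
        # Assumption: a constant group is mapped to 0.0 instead of dividing by 0.
        scaled.append((value - mean) / std if std > 0 else 0.0)
    return scaled


scaled = z_score_per_key(
    [1.0, 3.0, 10.0, 20.0, 30.0],
    ["a", "a", "b", "b", "b"],
)
```

Each key group in `scaled` then has mean 0 and population variance 1, independent of the scale of the other groups.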
diff --git a/docs/api_docs/python/_toc.yaml b/docs/api_docs/python/_toc.yaml
@@ -66,6 +66,8 @@ toc:
       path: /tfx/transform/api_docs/python/tft/scale_to_0_1
     - title: scale_to_z_score
       path: /tfx/transform/api_docs/python/tft/scale_to_z_score
+    - title: scale_to_z_score_per_key
+      path: /tfx/transform/api_docs/python/tft/scale_to_z_score_per_key
     - title: segment_indices
       path: /tfx/transform/api_docs/python/tft/segment_indices
     - title: size
diff --git a/docs/api_docs/python/index.md b/docs/api_docs/python/index.md
@@ -38,6 +38,7 @@
 *  <a href="./tft/scale_by_min_max.md"><code>tft.scale_by_min_max</code></a>
 *  <a href="./tft/scale_to_0_1.md"><code>tft.scale_to_0_1</code></a>
 *  <a href="./tft/scale_to_z_score.md"><code>tft.scale_to_z_score</code></a>
+*  <a href="./tft/scale_to_z_score_per_key.md"><code>tft.scale_to_z_score_per_key</code></a>
 *  <a href="./tft/segment_indices.md"><code>tft.segment_indices</code></a>
 *  <a href="./tft/size.md"><code>tft.size</code></a>
 *  <a href="./tft/sparse_tensor_to_dense_with_shape.md"><code>tft.sparse_tensor_to_dense_with_shape</code></a>
diff --git a/docs/api_docs/python/tft.md b/docs/api_docs/python/tft.md
@@ -83,6 +83,8 @@ Init module for TF.Transform.
 
 [`scale_to_z_score(...)`](./tft/scale_to_z_score.md): Returns a standardized column with mean 0 and variance 1.
 
+[`scale_to_z_score_per_key(...)`](./tft/scale_to_z_score_per_key.md): Returns a standardized column with mean 0 and variance 1, grouped per key.
+
 [`segment_indices(...)`](./tft/segment_indices.md): Returns a `Tensor` of indices within each segment.
 
 [`size(...)`](./tft/size.md): Computes the total size of instances in a `Tensor` over the whole dataset.
diff --git a/docs/api_docs/python/tft/WeightedMeanAndVarCombiner.md b/docs/api_docs/python/tft/WeightedMeanAndVarCombiner.md
@@ -114,7 +114,7 @@ Converts an accumulator into the output (mean, var) tuple.
 
 #### Returns:
 
-A 2-tuple composed of (mean, var) or None if accumulator is None.
+A 2-tuple composed of (mean, var).
 
 <h3 id="merge_accumulators"><code>merge_accumulators</code></h3>
 
@@ -126,7 +126,7 @@ Merges several `_WeightedMeanAndVarAccumulator`s to a single accumulator.
 
 #### Args:
 
-* <b>`accumulators`</b>: A list of `_WeightedMeanAndVarAccumulator`s and/or Nones.
+* <b>`accumulators`</b>: A list of `_WeightedMeanAndVarAccumulator`s.
 
 
 #### Returns:
diff --git a/docs/api_docs/python/tft/apply_buckets_with_interpolation.md b/docs/api_docs/python/tft/apply_buckets_with_interpolation.md
@@ -22,15 +22,26 @@ interpolated values are normalized to the range [0, 1]. Values that are
 less than or equal to the lowest boundary, or greater than or equal to the
 highest boundary, will be mapped to 0 and 1 respectively.
 
+This is a non-linear approach to normalization that is less sensitive to
+outliers than min-max or z-score scaling. When outliers are present, standard
+forms of normalization can leave the majority of the data compressed into a
+very small segment of the output range, whereas this approach tends to spread
+out the more frequent values (if quantile buckets are used). Note that
+distance relationships in the raw data are not necessarily preserved (data
+points that are close to each other in the raw feature space may not be equally
+close in the transformed feature space). This means that unlike linear
+normalization methods, correlations between features may be distorted by the
+transformation.
+
 #### Args:
 
-* <b>`x`</b>: A numeric input `Tensor` (tf.float32, tf.float64, tf.int32, tf.int64).
+* <b>`x`</b>: A numeric input `Tensor`/`SparseTensor` (tf.float[32|64], tf.int[32|64]).
 * <b>`bucket_boundaries`</b>: Sorted bucket boundaries as a rank-2 `Tensor`.
 * <b>`name`</b>: (Optional) A name for this operation.
 
 
 #### Returns:
 
-A `Tensor` of the same shape as `x`, normalized to the range [0, 1]. If the
-  input x is tf.float64, the returned values will be tf.float64.
-  Otherwise, returned values are tf.float32.
+A `Tensor` or `SparseTensor` of the same shape as `x`, normalized to the
+  range [0, 1]. If the input x is tf.float64, the returned values will be
+  tf.float64. Otherwise, returned values are tf.float32.
diff --git a/docs/api_docs/python/tft/bucketize_per_key.md b/docs/api_docs/python/tft/bucketize_per_key.md
@@ -22,9 +22,9 @@ Returns a bucketized column, with a bucket index assigned to each input.
 * <b>`x`</b>: A numeric input `Tensor` or `SparseTensor` with rank 1, whose values
     should be mapped to buckets.  `SparseTensor`s will have their non-missing
     values mapped and missing values left as missing.
-* <b>`key`</b>: A Tensor with the same shape as `x` and dtype tf.string.  If `x` is
-    a `SparseTensor`, `key` must exactly match `x` in everything except
-    values, i.e. indices and dense_shape must be identical.
+* <b>`key`</b>: A `Tensor` or `SparseTensor` with the same shape as `x` and dtype
+    tf.string.  If `x` is a `SparseTensor`, `key` must exactly match `x` in
+    everything except values, i.e. indices and dense_shape must be identical.
 * <b>`num_buckets`</b>: Values in the input `x` are divided into approximately
     equal-sized buckets, where the number of buckets is num_buckets.
 * <b>`epsilon`</b>: (Optional) see `bucketize`
diff --git a/docs/api_docs/python/tft/ptransform_analyzer.md b/docs/api_docs/python/tft/ptransform_analyzer.md
@@ -26,7 +26,7 @@ collection by the caller.
 #### Args:
 
 * <b>`inputs`</b>: A list of input `Tensor`s.
-* <b>`output_dtypes`</b>: The list of dtypes of the output of the analyzer.
+* <b>`output_dtypes`</b>: The list of TensorFlow dtypes of the output of the analyzer.
 * <b>`output_shapes`</b>: The list of shapes of the output of the analyzer.  Must have
     the same length as output_dtypes.
 * <b>`ptransform`</b>: A Beam PTransform that accepts a Beam PCollection where each
diff --git a/docs/api_docs/python/tft/scale_to_z_score_per_key.md b/docs/api_docs/python/tft/scale_to_z_score_per_key.md
@@ -0,0 +1,46 @@
+<div itemscope itemtype="http://developers.google.com/ReferenceObject">
+<meta itemprop="name" content="tft.scale_to_z_score_per_key" />
+<meta itemprop="path" content="Stable" />
+</div>
+
+# tft.scale_to_z_score_per_key
+
+``` python
+tft.scale_to_z_score_per_key(
+    x,
+    key=None,
+    elementwise=False,
+    name=None,
+    output_dtype=None
+)
+```
+
+Returns a standardized column with mean 0 and variance 1, grouped per key.
+
+Scaling to z-score subtracts out the mean and divides by standard deviation.
+Note that the standard deviation computed here is based on the biased variance
+(0 delta degrees of freedom), as computed by analyzers.var.
+
+#### Args:
+
+* <b>`x`</b>: A numeric `Tensor` or `SparseTensor`.
+* <b>`key`</b>: A `Tensor` or `SparseTensor` of dtype tf.string.  If `x` is a
+      `SparseTensor`, `key` must exactly match `x` in everything except
+      values.
+* <b>`elementwise`</b>: If true, scales each element of the tensor independently;
+      otherwise uses the mean and variance of the whole tensor.
+      Currently, not supported for per-key operations.
+* <b>`name`</b>: (Optional) A name for this operation.
+* <b>`output_dtype`</b>: (Optional) If not None, casts the output tensor to this type.
+
+
+#### Returns:
+
+A `Tensor` or `SparseTensor` containing the input column scaled to mean 0
+and variance 1 (standard deviation 1), grouped per key.
+That is, for all keys k: (x - mean(x)) / std_dev(x) for all x with key k.
+If `x` is floating point, the mean will have the same type as `x`. If `x` is
+integral, the output is cast to tf.float32.
+
+Note that TFLearn generally permits only tf.int64 and tf.float32, so casting
+this scaler's output may be necessary.
diff --git a/docs/api_docs/python/tft/vocabulary.md b/docs/api_docs/python/tft/vocabulary.md
@@ -76,7 +76,7 @@ within each vocabulary entry (b/117796748).
 * <b>`frequency_threshold`</b>: Limit the generated vocabulary only to elements whose
     absolute frequency is >= to the supplied threshold. If set to None, the
     full vocabulary is generated.  Absolute frequency means the number of
-    occurences of the element in the dataset, as opposed to the proportion of
+    occurrences of the element in the dataset, as opposed to the proportion of
     instances that contain that element.
 * <b>`vocab_filename`</b>: The file name for the vocabulary file. If none, the
     "uniques" scope name in the context of this graph will be used as the file
@@ -90,12 +90,16 @@ within each vocabulary entry (b/117796748).
     will be of the form 'frequency word'.
 * <b>`weights`</b>: (Optional) Weights `Tensor` for the vocabulary. It must have the
     same shape as x.
-* <b>`labels`</b>: (Optional) Labels `Tensor` for the vocabulary. It must have dtype
-    int64, have values 0 or 1, and have the same shape as x.
-* <b>`use_adjusted_mutual_info`</b>: If true, use adjusted mutual information.
-* <b>`min_diff_from_avg`</b>: Mutual information of a feature will be adjusted to zero
-    whenever the difference between count of the feature with any label and
-    its expected count is lower than min_diff_from_average.
+* <b>`labels`</b>: (Optional) Labels `Tensor` for the vocabulary. It must have the same
+    shape as x and be a discrete integerized tensor (if the label is numeric,
+    it should first be bucketized; if the label is a string, an integer
+    vocabulary should first be applied).
+* <b>`use_adjusted_mutual_info`</b>: If true, and labels are provided, calculate
+    vocabulary using adjusted rather than raw mutual information.
+* <b>`min_diff_from_avg`</b>: MI (or AMI) of a feature x label will be adjusted to zero
+    whenever the difference between count and the expected (average) count is
+    lower than min_diff_from_avg. This can be thought of as a regularizing
+    parameter that pushes small MI/AMI values to zero.
 * <b>`coverage_top_k`</b>: (Optional), (Experimental) The minimum number of elements
     per key to be included in the vocabulary.
 * <b>`coverage_frequency_threshold`</b>: (Optional), (Experimental) Limit the coverage
diff --git a/docs/api_docs/python/tft_beam/AnalyzeAndTransformDataset.md b/docs/api_docs/python/tft_beam/AnalyzeAndTransformDataset.md
@@ -41,13 +41,17 @@
 
 Combination of AnalyzeDataset and TransformDataset.
 
+```python
 transformed, transform_fn = AnalyzeAndTransformDataset(
     preprocessing_fn).expand(dataset)
+```
 
 should be equivalent to
 
+```python
 transform_fn = AnalyzeDataset(preprocessing_fn).expand(dataset)
 transformed = TransformDataset().expand((dataset, transform_fn))
+```
 
 but may be more efficient since it avoids multiple passes over the data.
 
diff --git a/docs/api_docs/python/tft_beam/analyzer_cache/ReadAnalysisCacheFromFS.md b/docs/api_docs/python/tft_beam/analyzer_cache/ReadAnalysisCacheFromFS.md
@@ -46,11 +46,19 @@ Reads cache from the FS written by WriteAnalysisCacheToFS.
 ``` python
 __init__(
     cache_base_dir,
-    dataset_keys
+    dataset_keys,
+    source=None
 )
 ```
 
+Init method.
 
+#### Args:
+
+* <b>`cache_base_dir`</b>: A string, the path that the cache should be stored in.
+* <b>`dataset_keys`</b>: An iterable of strings.
+* <b>`source`</b>: (Optional) A PTransform class that takes a path argument in its
+    constructor, and is used to read the cache.
 
 
 
diff --git a/docs/api_docs/python/tft_beam/analyzer_cache/WriteAnalysisCacheToFS.md b/docs/api_docs/python/tft_beam/analyzer_cache/WriteAnalysisCacheToFS.md
@@ -44,10 +44,20 @@ Writes a cache object that can be read by ReadAnalysisCacheFromFS.
 <h2 id="__init__"><code>__init__</code></h2>
 
 ``` python
-__init__(cache_base_dir)
+__init__(
+    cache_base_dir,
+    sink=None
+)
 ```
 
+Init method.
+
+#### Args:
 
+* <b>`cache_base_dir`</b>: A string, the path that the cache should be stored in.
+* <b>`sink`</b>: (Optional) A PTransform class that takes a path, and optional
+    file_name_suffix arguments in its constructor, and is used to write the
+    cache.