
Commit 687e31d

tf-transform-team and zoyahav authored and committed
Project import generated by Copybara.
PiperOrigin-RevId: 230419331
1 parent 4fe846a commit 687e31d

28 files changed: 970 additions and 1219 deletions

RELEASE.md

Lines changed: 19 additions & 0 deletions
@@ -4,6 +4,13 @@
 * Performance improvements for vocabulary generation when using top_k.
 * New optimized highly experimental API for analyzing a dataset was added,
   `AnalyzeDatasetWithCache`, which allows reading and writing analyzer cache.
+* Update `DatasetMetadata` to be a wrapper around the
+  `tensorflow_metadata.proto.v0.schema_pb2.Schema` proto. TensorFlow Metadata
+  will be the schema used to define data parsing across TFX. The serialized
+  `DatasetMetadata` is now the `Schema` proto in ascii format, but the previous
+  format can still be read.
+* Change `ApplySavedModel` implementation to use `tf.Session.make_callable`
+  instead of `tf.Session.run` for improved performance.

 ## Bug Fixes and Other Changes
 * `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support filtering
@@ -21,8 +28,20 @@
   `tf.Session.run` for improved performance.
 * ExampleProtoCoder now also supports non-serialized Example representations.
 * `tft.tfidf` now accepts a scalar Tensor as `vocab_size`.
+* `assertItemsEqual` in unit tests is replaced by `assertCountEqual`.
+* `NumPyCombiner` now outputs TF dtypes in output_tensor_infos instead of numpy
+  dtypes.
+* Adds function `tft.apply_pyfunc` that provides limited support for
+  `tf.py_func`. Note that this is incompatible with serving. See documentation
+  for more details.

 ## Breaking changes
+* `ColumnSchema` and related classes (`Domain`, `Axis` and
+  `ColumnRepresentation` and their subclasses) have been removed. In order to
+  create a schema, use `from_feature_spec`. In order to inspect a schema
+  use the `as_feature_spec` and `domains` methods of `Schema`. The
+  constructors of these classes are replaced by functions that still work when
+  creating a `Schema` but this usage is deprecated.

 ## Deprecations


docs/api_docs/python/tft/NumPyCombiner.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Combines the PCollection only on the 0th dimension using nparray.
 __init__(
     fn,
     output_dtypes,
-    output_shapes=None
+    output_shapes
 )
 ```
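The diff above makes `output_shapes` a required constructor argument rather than one defaulting to `None`. As a rough plain-Python sketch of the combiner contract this `__init__` belongs to (the `SketchCombiner` name and method bodies are illustrative, not the real tf.transform implementation):

``` python
class SketchCombiner(object):
    """Illustrative combiner following the create/add/merge/extract contract."""

    def __init__(self, fn, output_dtypes, output_shapes):
        # output_shapes no longer defaults to None; callers must pass it
        # explicitly so output tensor infos can always be constructed.
        self._fn = fn
        self._output_dtypes = output_dtypes
        self._output_shapes = output_shapes

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, batch_values):
        accumulator.append(batch_values)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged

    def extract_output(self, accumulator):
        # Reduce the collected inputs position-wise with fn.
        return [self._fn(values) for values in zip(*accumulator)]


combiner = SketchCombiner(fn=sum, output_dtypes=['int64'], output_shapes=[()])
acc = combiner.create_accumulator()
acc = combiner.add_input(acc, [1])
acc = combiner.add_input(acc, [2])
print(combiner.extract_output(acc))  # -> [3]
```

Making the shapes explicit at construction time avoids guessing output shapes later from accumulated data.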

docs/api_docs/python/tft/TFTransformOutput.md

Lines changed: 49 additions & 0 deletions
@@ -1,16 +1,23 @@
 <div itemscope itemtype="http://developers.google.com/ReferenceObject">
 <meta itemprop="name" content="tft.TFTransformOutput" />
 <meta itemprop="path" content="Stable" />
+<meta itemprop="property" content="post_transform_statistics_path"/>
+<meta itemprop="property" content="pre_transform_statistics_path"/>
+<meta itemprop="property" content="raw_metadata"/>
 <meta itemprop="property" content="transform_savedmodel_dir"/>
 <meta itemprop="property" content="transformed_metadata"/>
 <meta itemprop="property" content="__init__"/>
 <meta itemprop="property" content="load_transform_graph"/>
 <meta itemprop="property" content="num_buckets_for_transformed_feature"/>
+<meta itemprop="property" content="raw_feature_spec"/>
 <meta itemprop="property" content="transform_raw_features"/>
 <meta itemprop="property" content="transformed_feature_spec"/>
 <meta itemprop="property" content="vocabulary_by_name"/>
 <meta itemprop="property" content="vocabulary_file_by_name"/>
 <meta itemprop="property" content="vocabulary_size_by_name"/>
+<meta itemprop="property" content="POST_TRANSFORM_FEATURE_STATS_PATH"/>
+<meta itemprop="property" content="PRE_TRANSFORM_FEATURE_STATS_PATH"/>
+<meta itemprop="property" content="RAW_METADATA_DIR"/>
 <meta itemprop="property" content="TRANSFORMED_METADATA_DIR"/>
 <meta itemprop="property" content="TRANSFORM_FN_DIR"/>
 </div>
@@ -39,6 +46,30 @@ __init__(transform_output_dir)

 ## Properties

+<h3 id="post_transform_statistics_path"><code>post_transform_statistics_path</code></h3>
+
+Returns the path to the post-transform datum statistics.
+
+Note: post_transform_statistics is not guaranteed to exist in the output of
+tf.transform and hence using this could fail, if post_transform statistics
+is not present in TFTransformOutput.
+
+<h3 id="pre_transform_statistics_path"><code>pre_transform_statistics_path</code></h3>
+
+Returns the path to the pre-transform datum statistics.
+
+Note: pre_transform_statistics is not guaranteed to exist in the output of
+tf.transform and hence using this could fail, if pre_transform statistics is
+not present in TFTransformOutput.
+
+<h3 id="raw_metadata"><code>raw_metadata</code></h3>
+
+A DatasetMetadata.
+
+Note: raw_metadata is not guaranteed to exist in the output of tf.transform
+and hence using this could fail, if raw_metadata is not present in
+TFTransformOutput.
+
 <h3 id="transform_savedmodel_dir"><code>transform_savedmodel_dir</code></h3>

 A python str.
@@ -71,6 +102,18 @@ num_buckets_for_transformed_feature(name)

 Returns the number of buckets for an integerized transformed feature.

+<h3 id="raw_feature_spec"><code>raw_feature_spec</code></h3>
+
+``` python
+raw_feature_spec()
+```
+
+Returns a feature_spec for the raw features.
+
+#### Returns:
+
+A dict from feature names to FixedLenFeature/SparseFeature/VarLenFeature.
+
 <h3 id="transform_raw_features"><code>transform_raw_features</code></h3>

 ``` python
@@ -142,6 +185,12 @@ Like vocabulary_file_by_name, but returns the size of vocabulary.

 ## Class Members

+<h3 id="POST_TRANSFORM_FEATURE_STATS_PATH"><code>POST_TRANSFORM_FEATURE_STATS_PATH</code></h3>
+
+<h3 id="PRE_TRANSFORM_FEATURE_STATS_PATH"><code>PRE_TRANSFORM_FEATURE_STATS_PATH</code></h3>
+
+<h3 id="RAW_METADATA_DIR"><code>RAW_METADATA_DIR</code></h3>
+
<h3 id="TRANSFORMED_METADATA_DIR"><code>TRANSFORMED_METADATA_DIR</code></h3>

 <h3 id="TRANSFORM_FN_DIR"><code>TRANSFORM_FN_DIR</code></h3>
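The new statistics-path properties are derived from class-level directory constants under the transform output directory. A hedged sketch of that pattern (class name and constant values here are illustrative assumptions, not the actual tf.transform values):

``` python
import os


class TransformOutputPaths(object):
    """Illustrative stand-in for TFTransformOutput's path properties."""

    # Hypothetical relative locations; the real constants live on
    # TFTransformOutput and may differ.
    PRE_TRANSFORM_FEATURE_STATS_PATH = os.path.join(
        'pre_transform_feature_stats', 'FeatureStats.pb')
    POST_TRANSFORM_FEATURE_STATS_PATH = os.path.join(
        'post_transform_feature_stats', 'FeatureStats.pb')

    def __init__(self, transform_output_dir):
        self._transform_output_dir = transform_output_dir

    @property
    def pre_transform_statistics_path(self):
        # The file is not guaranteed to exist; callers should handle absence.
        return os.path.join(self._transform_output_dir,
                            self.PRE_TRANSFORM_FEATURE_STATS_PATH)

    @property
    def post_transform_statistics_path(self):
        return os.path.join(self._transform_output_dir,
                            self.POST_TRANSFORM_FEATURE_STATS_PATH)


paths = TransformOutputPaths('/tmp/transform_output')
print(paths.pre_transform_statistics_path)
```

As the doc notes above stress, these paths point at files that may not have been produced, so existence checks belong on the caller side.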

tensorflow_transform/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -20,4 +20,5 @@
 from tensorflow_transform.mappers import *
 from tensorflow_transform.output_wrapper import TFTransformOutput
 from tensorflow_transform.pretrained_models import *
+from tensorflow_transform.py_func.api import apply_pyfunc
 # pylint: enable=wildcard-import

tensorflow_transform/analyzer_nodes.py

Lines changed: 114 additions & 35 deletions
@@ -160,6 +160,8 @@ def accumulator_coder(self):
 class CacheCoder(object):
   """A coder interface for encoding and decoding cache items."""

+  __metaclass__ = abc.ABCMeta
+
   def __repr__(self):
     return '<{}>'.format(self.__class__.__name__)

@@ -327,32 +329,92 @@ def output_tensor_infos(self):
 ] + self.combiner.output_tensor_infos()


-class Vocabulary(
-    collections.namedtuple(
-        'Vocabulary',
-        [
-            'top_k',
-            'frequency_threshold',
-            'vocab_filename',
-            'store_frequency',
-            'vocab_ordering_type',
-            'use_adjusted_mutual_info',
-            'min_diff_from_avg',
-            'coverage_top_k',
-            'coverage_frequency_threshold',
-            'key_fn',
-            'label'
-        ]),
-    AnalyzerDef):
-  """OperationDef for computing a vocabulary of unique values.
+class VocabularyAccumulate(
+    collections.namedtuple('VocabularyAccumulate',
+                           ['vocab_ordering_type', 'label']),
+    nodes.OperationDef):
+  """An operation that accumulates unique words with their frequency or weight.

-  This analyzer computes a vocabulary composed of the unique values present in
-  the input elements. It selects a subset of the unique elements based on the
-  provided parameters. It may also accept a label and weight as input
-  depending on the parameters.
+  This operation is implemented by
+  `tensorflow_transform.beam.analyzer_impls.VocabularyAccumulateImpl`.
+  """

-  This analyzer is implemented by
-  `tensorflow_transform.beam.analyzer_impls.VocabularyImpl`.
+  def __new__(cls, vocab_ordering_type, label=None):
+    if label is None:
+      scope = tf.get_default_graph().get_name_scope()
+      label = '{}[{}]'.format(cls.__name__, scope)
+    return super(VocabularyAccumulate, cls).__new__(
+        cls, vocab_ordering_type=vocab_ordering_type, label=label)
+
+  @property
+  def num_outputs(self):
+    return 1
+
+  @property
+  def is_partitionable(self):
+    return True
+
+  @property
+  def cache_coder(self):
+    return _VocabularyAccumulatorCoder()
+
+
+class _VocabularyAccumulatorCoder(CacheCoder):
+  """Coder for vocabulary accumulators."""
+
+  def encode_cache(self, accumulator):
+    # Need to wrap in np.array and call tolist to make it JSON serializable.
+    word, count = accumulator
+    accumulator = (word.decode('utf-8'), count)
+    return tf.compat.as_bytes(
+        json.dumps(np.array(accumulator, dtype=object).tolist()))
+
+  def decode_cache(self, encoded_accumulator):
+    return np.array(json.loads(encoded_accumulator), dtype=object)
+
+
+class VocabularyMerge(
+    collections.namedtuple('VocabularyMerge', [
+        'vocab_ordering_type', 'use_adjusted_mutual_info', 'min_diff_from_avg',
+        'label'
+    ]), nodes.OperationDef):
+  """An operation that merges the accumulators produced by VocabularyAccumulate.
+
+  This operation operates on the output of VocabularyAccumulate and is
+  implemented by `tensorflow_transform.beam.analyzer_impls.VocabularyMergeImpl`.
+
+  See `tft.vocabulary` for a description of the parameters.
+  """
+
+  def __new__(cls,
+              vocab_ordering_type,
+              use_adjusted_mutual_info,
+              min_diff_from_avg,
+              label=None):
+    if label is None:
+      scope = tf.get_default_graph().get_name_scope()
+      label = '{}[{}]'.format(cls.__name__, scope)
+    return super(VocabularyMerge, cls).__new__(
+        cls,
+        vocab_ordering_type=vocab_ordering_type,
+        use_adjusted_mutual_info=use_adjusted_mutual_info,
+        min_diff_from_avg=min_diff_from_avg,
+        label=label)
+
+  @property
+  def num_outputs(self):
+    return 1
+
+
+class VocabularyOrderAndFilter(
+    collections.namedtuple('VocabularyOrderAndFilter', [
+        'top_k', 'frequency_threshold', 'coverage_top_k',
+        'coverage_frequency_threshold', 'key_fn', 'label'
+    ]), nodes.OperationDef):
+  """An operation that filters and orders a computed vocabulary.
+
+  This operation operates on the output of VocabularyMerge and is implemented by
+  `tensorflow_transform.beam.analyzer_impls.VocabularyOrderAndFilterImpl`.

   See `tft.vocabulary` for a description of the parameters.
   """
@@ -361,32 +423,49 @@ def __new__(
       cls,
       top_k,
       frequency_threshold,
-      vocab_filename,
-      store_frequency,
-      vocab_ordering_type,
-      use_adjusted_mutual_info,
-      min_diff_from_avg,
       coverage_top_k,
       coverage_frequency_threshold,
       key_fn,
       label=None):
     if label is None:
       scope = tf.get_default_graph().get_name_scope()
       label = '{}[{}]'.format(cls.__name__, scope)
-    return super(Vocabulary, cls).__new__(
+    return super(VocabularyOrderAndFilter, cls).__new__(
         cls,
         top_k=top_k,
         frequency_threshold=frequency_threshold,
-        vocab_filename=vocab_filename,
-        store_frequency=store_frequency,
-        vocab_ordering_type=vocab_ordering_type,
-        use_adjusted_mutual_info=use_adjusted_mutual_info,
-        min_diff_from_avg=min_diff_from_avg,
         coverage_top_k=coverage_top_k,
         coverage_frequency_threshold=coverage_frequency_threshold,
         key_fn=key_fn,
         label=label)

+  @property
+  def num_outputs(self):
+    return 1
+
+
+class VocabularyWrite(
+    collections.namedtuple('VocabularyWrite',
+                           ['vocab_filename', 'store_frequency', 'label']),
+    AnalyzerDef):
+  """An analyzer that writes vocabulary files from an accumulator.
+
+  This operation operates on the output of VocabularyOrderAndFilter and is
+  implemented by `tensorflow_transform.beam.analyzer_impls.VocabularyWriteImpl`.
+
+  See `tft.vocabulary` for a description of the parameters.
+  """
+
+  def __new__(cls, vocab_filename, store_frequency, label=None):
+    if label is None:
+      scope = tf.get_default_graph().get_name_scope()
+      label = '{}[{}]'.format(cls.__name__, scope)
+    return super(VocabularyWrite, cls).__new__(
+        cls,
+        vocab_filename=vocab_filename,
+        store_frequency=store_frequency,
+        label=label)
+
   @property
   def output_tensor_infos(self):
     return [TensorInfo(tf.string, [], True)]
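The cache round trip in `_VocabularyAccumulatorCoder` above can be sketched in standalone form, with the numpy and `tf.compat` helpers replaced by plain-Python equivalents for illustration:

``` python
import json


def encode_cache(accumulator):
    """Encode a (word, count) accumulator as JSON bytes."""
    word, count = accumulator
    # Bytes are not JSON serializable, so decode the word to text first.
    return json.dumps([word.decode('utf-8'), count]).encode('utf-8')


def decode_cache(encoded_accumulator):
    """Decode JSON bytes back into a (word, count) pair."""
    word, count = json.loads(encoded_accumulator.decode('utf-8'))
    return (word, count)


encoded = encode_cache((b'hello', 7))
print(decode_cache(encoded))  # -> ('hello', 7)
```

The key property, which the real coder preserves via `np.array(...).tolist()` and `tf.compat.as_bytes`, is that encode followed by decode recovers the accumulator.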

tensorflow_transform/analyzers.py

Lines changed: 24 additions & 8 deletions
@@ -142,7 +142,7 @@ class NumPyCombiner(analyzer_nodes.Combiner):
     output_shapes: The shapes of the outputs.
   """

-  def __init__(self, fn, output_dtypes, output_shapes=None):
+  def __init__(self, fn, output_dtypes, output_shapes):
     self._fn = fn
     self._output_dtypes = output_dtypes
     self._output_shapes = output_shapes
@@ -186,7 +186,7 @@ def extract_output(self, accumulator):

   def output_tensor_infos(self):
     return [
-        analyzer_nodes.TensorInfo(dtype, shape, False)
+        analyzer_nodes.TensorInfo(tf.as_dtype(dtype), shape, False)
         for dtype, shape in zip(self._output_dtypes, self._output_shapes)
     ]

@@ -820,19 +820,35 @@ def vocabulary(
   assert none_counts is None
   analyzer_inputs = [unique_inputs]

-  (vocab_filename,) = apply_analyzer(
-      analyzer_nodes.Vocabulary,
-      *analyzer_inputs,
+  input_values_node = analyzer_nodes.get_input_tensors_value_nodes(
+      analyzer_inputs)
+
+  accumulate_output_value_node = nodes.apply_operation(
+      analyzer_nodes.VocabularyAccumulate, input_values_node,
+      vocab_ordering_type=vocab_ordering_type)
+
+  merge_output_value_node = nodes.apply_operation(
+      analyzer_nodes.VocabularyMerge, accumulate_output_value_node,
       use_adjusted_mutual_info=use_adjusted_mutual_info,
       min_diff_from_avg=min_diff_from_avg,
+      vocab_ordering_type=vocab_ordering_type)
+
+  filtered_value_node = nodes.apply_operation(
+      analyzer_nodes.VocabularyOrderAndFilter,
+      merge_output_value_node,
       coverage_top_k=coverage_top_k,
       coverage_frequency_threshold=coverage_frequency_threshold,
       key_fn=key_fn,
       top_k=top_k,
-      frequency_threshold=frequency_threshold,
+      frequency_threshold=frequency_threshold)
+
+  vocab_filename_node = nodes.apply_operation(
+      analyzer_nodes.VocabularyWrite,
+      filtered_value_node,
       vocab_filename=vocab_filename,
-      store_frequency=store_frequency,
-      vocab_ordering_type=vocab_ordering_type)
+      store_frequency=store_frequency)
+
+  vocab_filename = analyzer_nodes.wrap_as_tensor(vocab_filename_node)
   return vocab_filename
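The rewritten `tft.vocabulary` above splits the old monolithic `Vocabulary` analyzer into four chained graph operations. The data flow they implement can be sketched in plain Python, with the Beam graph nodes replaced by ordinary functions for illustration (only the stage names come from the diff; the bodies here are simplified assumptions):

``` python
from collections import Counter


def accumulate(batch):
    # VocabularyAccumulate: per-shard unique-value counts (partitionable,
    # which is what makes the accumulator cacheable).
    return Counter(batch)


def merge(accumulators):
    # VocabularyMerge: combine shard accumulators into one global count.
    merged = Counter()
    for acc in accumulators:
        merged.update(acc)
    return merged


def order_and_filter(counts, top_k=None, frequency_threshold=0):
    # VocabularyOrderAndFilter: drop rare values, keep the most frequent,
    # ordered by descending frequency (ties broken alphabetically here).
    kept = [(w, c) for w, c in counts.items() if c >= frequency_threshold]
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return kept[:top_k] if top_k is not None else kept


def write(entries, store_frequency=False):
    # VocabularyWrite: render one vocabulary entry per line.
    if store_frequency:
        return '\n'.join('{} {}'.format(c, w) for w, c in entries)
    return '\n'.join(w for w, _ in entries)


shards = [['a', 'b', 'a'], ['b', 'a', 'c']]
vocab = write(order_and_filter(merge(accumulate(s) for s in shards),
                               top_k=2, frequency_threshold=2))
print(vocab)  # -> "a\nb"
```

Splitting accumulation from merging is what enables the `AnalyzeDatasetWithCache` feature mentioned in RELEASE.md: the per-shard accumulate outputs can be cached and reused across runs before the merge stage.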
