1- # Data generators for T2T models.
1+ # T2T Problems.
22
3- This directory contains data generators for a number of problems. We use a
4- naming scheme for the problems, they have names of the form
3+ This directory contains `Problem` specifications for a number of problems. We
4+ use a naming scheme for the problems: they have names of the form
55`[task-family]_[task]_[specifics]`. Data for all currently supported problems
66can be generated by calling the main generator binary (`t2t-datagen`). For
77example:
@@ -20,53 +20,51 @@ All tasks produce TFRecord files of `tensorflow.Example` protocol buffers.
2020
2121## Adding a new problem
2222
23- 1. Implement and register a Python generator for the dataset
24- 1. Add a problem specification to `problem_hparams.py` specifying input and
25- output modalities
23+ To add a new problem, subclass
24+ [`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
25+ and register it with `@registry.register_problem`. See
26+ [`WMTEnDeTokens8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
27+ for an example.
2628
27- To add a new problem, you first need to create python generators for training
28- and development data for the problem. The python generators should yield
29- dictionaries with string keys and values being lists of {int, float, str}.
30- Here is a very simple generator for a data-set where inputs are lists of 1s with
31- length upto 100 and targets are lists of length 1 with an integer denoting the
32- length of the input list.
29+ `Problem`s support data generation, training, and decoding.
30+
31+ Data generation is handled by `Problem.generate_data`, which should produce two
32+ datasets, training and dev, named according to
33+ `Problem.training_filepaths` and `Problem.dev_filepaths`.
34+ `Problem.generate_data` should also produce any other files that may be required
35+ for training/decoding, e.g. a vocabulary file.
36+
37+ A particularly easy way to implement `Problem.generate_data` for your dataset is
38+ to create 2 Python generators, one for the training data and another for the
39+ dev data, and pass them to `generator_utils.generate_dataset_and_shuffle`. See
40+ [`WMTEnDeTokens8k.generate_data`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
41+ for an example of usage.
42+
43+ The generators should yield dictionaries with string keys and values being lists
44+ of {int, float, str}. Here is a very simple generator for a dataset where
45+ inputs are lists of 2s with length up to 100 and targets are lists of length 1
46+ with an integer denoting the length of the input list.
3347
3448```
3549def length_generator(nbr_cases):
3650  for _ in xrange(nbr_cases):
3751    length = np.random.randint(100) + 1
38-     yield {"inputs": [1] * length, "targets": [length]}
52+     yield {"inputs": [2] * length, "targets": [length]}
3953```
4054
41- Note that our data reader uses 0 for padding, so it is a good idea to never
42- generate 0s, except if all your examples have the same size (in which case
43- they'll never be padded anyway) or if you're doing padding on your own (in which
44- case please use 0s for padding). When adding the python generator function,
45- please also add unit tests to check if the code runs.
55+ Note that our data reader uses 0 for padding and other parts of the code assume
56+ end-of-string (EOS) is 1, so it is a good idea to never generate 0s or 1s,
57+ except if all your examples have the same size (in which case they'll never be
58+ padded anyway) or if you're doing padding on your own (in which case please use
59+ 0s for padding). When adding the Python generator function, please also add unit
60+ tests to check that the code runs.
4661
4762The generator can do arbitrary setup before beginning to yield examples - for
4863example, downloading data, generating vocabulary files, etc.
4964
5065Some examples:
5166
52- * [Algorithmic generators](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic.py)
67+ * [Algorithmic problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic.py)
5368 and their [unit tests](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic_test.py)
54- * [WMT generators](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
69+ * [WMT problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
5570 and their [unit tests](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt_test.py)
56-
57- When your python generator is ready and tested, add it to the
58- `_SUPPORTED_PROBLEM_GENERATORS` dictionary in the
59- [data
60- generator](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/bin/t2t-datagen).
61- The keys are problem names, and the values are pairs of (training-set-generator
62- function, dev-set-generator function). For the generator above, one could add
63- the following lines:
64-
65- ```
66- "algorithmic_length_upto100":
67-     (lambda: algorithmic.length_generator(10000),
68-      lambda: algorithmic.length_generator(1000)),
69- ```
70-
71- Note the lambdas above: we don't want to call the generators too early.
72-
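
As a minimal, self-contained sketch of the generator contract described in the README text above (dictionaries with string keys and list values, and no 0 or 1 ids in variable-length fields), here is a Python 3 variant of `length_generator`. It swaps `xrange` and `np.random.randint` for the standard library so that it runs without NumPy. The `max_length` and `seed` parameters and the `reserved_ids_in` helper are hypothetical names introduced for illustration; they are not part of tensor2tensor.

```python
import random

PAD_ID = 0  # id the README says the data reader uses for padding
EOS_ID = 1  # id the README says other code assumes is end-of-string


def length_generator(nbr_cases, max_length=100, seed=None):
  """Yields examples: input is a list of 2s, target is that list's length."""
  rng = random.Random(seed)  # `seed` is an illustrative addition for repeatability
  for _ in range(nbr_cases):
    length = rng.randint(1, max_length)  # inclusive on both ends
    yield {"inputs": [2] * length, "targets": [length]}


def reserved_ids_in(ids):
  """Returns the sorted reserved ids (PAD or EOS) found in `ids` (hypothetical helper)."""
  return sorted(set(ids) & {PAD_ID, EOS_ID})


# The kind of check a unit test for a new generator might make.
examples = list(length_generator(1000, seed=0))
for example in examples:
  assert set(example) == {"inputs", "targets"}
  assert example["targets"] == [len(example["inputs"])]
  assert reserved_ids_in(example["inputs"]) == []
```

Only `inputs` is checked for reserved ids here: `targets` always has length 1, so it falls under the README's exception for fields whose examples all have the same size and are never padded.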