data_generators

Data generators for T2T models.

This directory contains data generators for a number of problems. Problems are named according to the scheme [task-family]_[task]_[specifics]. Data for all currently supported problems can be generated by calling the main generator binary (t2t-datagen). For example:

t2t-datagen \
  --problem=algorithmic_identity_binary40 \
  --data_dir=/tmp

will generate training and development data for the algorithmic copy task, written to /tmp/algorithmic_identity_binary40-dev-00000-of-00001 and /tmp/algorithmic_identity_binary40-train-00000-of-00001. All tasks produce TFRecord files of tensorflow.Example protocol buffers.
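
The two output paths above suggest a [problem]-[split]-[shard]-of-[total] naming pattern. The following is a minimal sketch of that pattern, inferred only from those two filenames. `shard_filenames` is a hypothetical helper, not a function from this directory, and the real generator may build its paths differently.

```python
import os


def shard_filenames(data_dir, problem, split, num_shards):
  """Hypothetical helper: builds names like <problem>-train-00000-of-00001."""
  return [
      os.path.join(data_dir,
                   "%s-%s-%05d-of-%05d" % (problem, split, shard, num_shards))
      for shard in range(num_shards)
  ]


# Reproduce the two paths mentioned above (one shard per split).
for split in ("dev", "train"):
  print(shard_filenames("/tmp", "algorithmic_identity_binary40", split, 1))
```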

Adding a new problem

1. Implement and register a Python generator for the dataset.
2. Add a problem specification to problem_hparams.py specifying input and output modalities.

To add a new problem, you first need to create Python generators for the problem's training and development data. The generators should yield dictionaries with string keys whose values are lists of {int, float, str}. Here is a very simple generator for a dataset where inputs are lists of 1s with length up to 100 and targets are lists of length 1 containing an integer that denotes the length of the input list.

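import numpy as np

def length_generator(nbr_cases):
  """Yields nbr_cases examples: a list of 1s and a target holding its length."""
  for _ in range(nbr_cases):
    # randint(100) returns a value in [0, 99], so length is in [1, 100].
    length = np.random.randint(100) + 1
    yield {"inputs": [1] * length, "targets": [length]}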
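
The expected shape of an example (string keys, list values holding int, float, or str) can be checked mechanically. The following is a minimal sketch of such a check. `validate_example` is a hypothetical helper, not part of this directory, and it does not enforce that every element of a list has the same type, since the text does not say whether that is required.

```python
def validate_example(example):
  """Hypothetical check: string keys, values are lists of int, float or str."""
  if not isinstance(example, dict):
    raise TypeError("example must be a dict")
  for key, values in example.items():
    if not isinstance(key, str):
      raise TypeError("key %r is not a string" % (key,))
    if not isinstance(values, list):
      raise TypeError("value for %r is not a list" % (key,))
    for value in values:
      if not isinstance(value, (int, float, str)):
        raise TypeError("unsupported element %r under %r" % (value, key))
  return example


# An example in the same shape as the ones yielded by the generator above.
validate_example({"inputs": [1, 1, 1], "targets": [3]})
```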

Note that our data reader uses 0 for padding, so it is a good idea to never generate 0s, except if all your examples have the same size (in which case they'll never be padded anyway) or if you're doing padding on your own (in which case please use 0s for padding). When adding the Python generator function, please also add unit tests to check that the code runs.
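
A unit test for a generator like the one above might look like the following sketch. To keep the block self-contained, it redefines the generator with the standard library `random` module instead of NumPy. That substitution, the test class name, and the specific checks are assumptions made for illustration, not the tests that ship with this directory.

```python
import random
import unittest


def length_generator(nbr_cases):
  """Stand-in for the generator above, using random instead of NumPy."""
  for _ in range(nbr_cases):
    length = random.randint(1, 100)  # inclusive on both ends
    yield {"inputs": [1] * length, "targets": [length]}


class LengthGeneratorTest(unittest.TestCase):

  def test_examples_are_well_formed(self):
    examples = list(length_generator(50))
    self.assertEqual(len(examples), 50)
    for example in examples:
      # The target must equal the input length.
      self.assertEqual(example["targets"], [len(example["inputs"])])
      # 0 is reserved for padding, so it must never be generated.
      self.assertNotIn(0, example["inputs"])
      self.assertNotIn(0, example["targets"])
```

Saved in a test module, this can be run with `python -m unittest`.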

The generator can do arbitrary setup before it begins to yield examples, such as downloading data or generating vocabulary files.
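
As an illustration of setup before yielding, the sketch below builds a small vocabulary in memory and only then starts producing examples. `token_id_generator` and its whitespace tokenization are hypothetical and chosen for brevity. A real generator would more likely download data or write a vocabulary file at this stage.

```python
def token_id_generator(sentences):
  """Hypothetical generator: builds a vocabulary, then yields id lists.

  `sentences` must be a list (it is traversed twice).
  """
  # Setup phase: runs once, when the first example is requested.
  vocab = {}
  for sentence in sentences:
    for token in sentence.split():
      # Ids start at 1 so that 0 stays reserved for padding.
      vocab.setdefault(token, len(vocab) + 1)

  # Yield phase: one example per sentence.
  for sentence in sentences:
    ids = [vocab[token] for token in sentence.split()]
    yield {"inputs": ids, "targets": ids}


for example in token_id_generator(["a b a", "b c"]):
  print(example)
```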

Some examples:

Algorithmic generators and their unit tests
WMT generators and their unit tests

When your Python generator is ready and tested, add it to the _SUPPORTED_PROBLEM_GENERATORS dictionary in the data generator. The keys are problem names, and the values are pairs of (training-set generator function, dev-set generator function). For the generator above, one could add the following lines:

  "algorithmic_length_upto100":
  (lambda: algorithmic.length_generator(10000),
   lambda: algorithmic.length_generator(1000)),

Note the lambdas above: they defer each call until the data is actually requested, because we don't want to call the generators too early.
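
The effect of wrapping each entry in a lambda can be sketched with a toy registry. Everything here is hypothetical: `SUPPORTED`, `load_examples`, and the problem name stand in for the real dictionary and generators, and the loader is deliberately eager so that the moment it runs is visible.

```python
calls = []  # records when the hypothetical loader actually runs


def load_examples(split, nbr_cases):
  """Hypothetical eager loader: does all of its work as soon as it is called."""
  calls.append(split)
  return [{"inputs": [1] * n, "targets": [n]} for n in range(1, nbr_cases + 1)]


# Hypothetical registry in the same shape as the dictionary described above:
# problem name -> (training-set function, dev-set function).
SUPPORTED = {
    "toy_length": (lambda: load_examples("train", 3),
                   lambda: load_examples("dev", 2)),
}

# Building the dictionary did not invoke the loader.
calls_at_registration = list(calls)

train_fn, dev_fn = SUPPORTED["toy_length"]
train_examples = list(train_fn())  # the loader runs only now
```

Without the lambdas, both loaders would run while the dictionary is being built, for every registered problem, regardless of which problem was requested.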
