new_problem.md

T2T: Train on Your Own Data

Let's add a new dataset together and train the transformer model. We'll be learning to define English words by training the transformer to "translate" between English words and their definitions on a character level.

About the Problem

For each problem we want to tackle we create a new problem class and register it. Let's call our problem Word2def.

Since many text2text problems share similar methods, there's already a class called Text2TextProblem that extends the base problem class, Problem (both found in problem.py).

For our problem, we can go ahead and create the file word2def.py in the data_generators folder and add our new problem, Word2def, which extends Text2TextProblem. Let's also register it while we're at it so we can specify the problem through flags.

@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""
  @property
  def is_character_level(self):
    ...

We need to implement the following methods from Text2TextProblem in our new class:

is_character_level
targeted_vocab_size
generator
input_space_id
target_space_id
num_shards
vocab_name
use_subword_tokenizer

Let's tackle them one by one:

input_space_id, target_space_id, is_character_level, targeted_vocab_size, use_subword_tokenizer:

SpaceIDs tell Tensor2Tensor what sort of space the input and target tensors are in. Examples include EN_CHR (English character), EN_TOK (English token), AUDIO_WAV (audio waveform), IMAGE, and DNA (genetic bases). The complete list can be found in the class SpaceID in data_generators/problem.py.

Since we're generating definitions and feeding in words at the character level, we set is_character_level to True and use the same SpaceID, EN_CHR, for both input and target. Additionally, since we aren't using tokens, we don't need to give a targeted_vocab_size, and use_subword_tokenizer can simply return False.

vocab_name:

vocab_name will be used to name your vocabulary files. We can call ours 'vocab.word2def.en'.

num_shards:

The number of shards to break data files into.

@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False
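To make num_shards more concrete, here is a minimal, standalone sketch of what sharding amounts to: one output file name per shard, with examples dealt out in round-robin order. The helper names (shard_filenames, assign_round_robin) and the exact file-name pattern are assumptions made for illustration, not Tensor2Tensor's own code.

```python
def shard_filenames(problem_name, split, num_shards):
  """Return one (assumed) file name per shard for the given split."""
  return [
      "%s-%s-%05d-of-%05d" % (problem_name, split, shard, num_shards)
      for shard in range(num_shards)
  ]


def assign_round_robin(examples, num_shards):
  """Deal examples out to shards one at a time, wrapping around."""
  shards = [[] for _ in range(num_shards)]
  for index, example in enumerate(examples):
    shards[index % num_shards].append(example)
  return shards


names = shard_filenames("word2def", "train", 100)
shards = assign_round_robin(["cat", "dog", "emu", "fox", "gnu"], 2)
```

With 100 shards, each file holds roughly one hundredth of the examples, which keeps individual files small and lets readers consume them in parallel.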

generator:

We're almost done. generator generates the training and evaluation data and stores them in files like "word2def_train.lang1" in your DATA_DIR. Thankfully, several commonly used methods like character_generator and token_generator are already written in the file wmt.py. We will import character_generator and text_encoder, and pass the end-of-sequence id that text_encoder exposes as EOS_ID (called EOS below), to write:

  def generator(self, data_dir, tmp_dir, train):
    character_vocab = text_encoder.ByteTextEncoder()
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    return character_generator(datasets[0], datasets[1], character_vocab, EOS)
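If you're curious what a character-level generator does with the two files, the sketch below approximates it using only the standard library. It assumes that byte values are shifted past two reserved ids (padding and end-of-sequence) and that each example is a dict with "inputs" and "targets". The names encode_bytes and parallel_character_generator are hypothetical, and the real character_generator may differ in its details.

```python
import os
import tempfile

EOS_ID = 1  # Assumed reserved id for end-of-sequence.
NUM_RESERVED = 2  # Assumed count of reserved ids (padding and EOS).


def encode_bytes(text):
  """Encode a string as UTF-8 byte values shifted past the reserved ids."""
  return [byte + NUM_RESERVED for byte in text.encode("utf-8")]


def parallel_character_generator(source_path, target_path, eos=None):
  """Yield one {"inputs", "targets"} dict per pair of aligned lines."""
  eos_list = [] if eos is None else [eos]
  with open(source_path, encoding="utf-8") as source_file, \
       open(target_path, encoding="utf-8") as target_file:
    for source, target in zip(source_file, target_file):
      yield {
          "inputs": encode_bytes(source.strip()) + eos_list,
          "targets": encode_bytes(target.strip()) + eos_list,
      }


# Tiny demonstration with throwaway files.
demo_dir = tempfile.mkdtemp()
words_path = os.path.join(demo_dir, "words_demo.txt")
definitions_path = os.path.join(demo_dir, "definitions_demo.txt")
with open(words_path, "w", encoding="utf-8") as f:
  f.write("cat\n")
with open(definitions_path, "w", encoding="utf-8") as f:
  f.write("a small feline\n")

examples = list(
    parallel_character_generator(words_path, definitions_path, EOS_ID))
```

The key point is that line N of the words file is paired with line N of the definitions file, so the two files must stay aligned.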

Now the Word2def class in our word2def.py file looks like this:

@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""
  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  def generator(self, data_dir, tmp_dir, train):
    character_vocab = text_encoder.ByteTextEncoder()
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    return character_generator(datasets[0], datasets[1], character_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False

Data:

Now we need to tell Tensor2Tensor where our data is located.

I've gone ahead and split all words into a train and test set and saved them in files called words_train.txt, words_test.txt, definitions_train.txt, and definitions_test.txt in a directory called LOCATION_OF_DATA/. Let's tell T2T where these files are:

# English Word2def datasets
_WORD2DEF_TRAIN_DATASETS = [
    LOCATION_OF_DATA + 'words_train.txt',
    LOCATION_OF_DATA + 'definitions_train.txt'
]

_WORD2DEF_TEST_DATASETS = [
    LOCATION_OF_DATA + 'words_test.txt',
    LOCATION_OF_DATA + 'definitions_test.txt'
]
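If you still need to produce these four files yourself, one possible way is sketched below. It shuffles (word, definition) pairs and writes line-aligned train and test files using the same file names as above. The function write_split, the test_fraction default, and the toy pairs are all placeholders for illustration, not part of Tensor2Tensor.

```python
import os
import random
import tempfile


def write_split(pairs, out_dir, test_fraction=0.1, seed=0):
  """Shuffle pairs and write line-aligned train/test files to out_dir."""
  pairs = list(pairs)
  random.Random(seed).shuffle(pairs)
  num_test = max(1, int(len(pairs) * test_fraction))
  splits = {"test": pairs[:num_test], "train": pairs[num_test:]}
  for split, rows in splits.items():
    words_path = os.path.join(out_dir, "words_%s.txt" % split)
    defs_path = os.path.join(out_dir, "definitions_%s.txt" % split)
    with open(words_path, "w", encoding="utf-8") as words_file, \
         open(defs_path, "w", encoding="utf-8") as defs_file:
      for word, definition in rows:
        # Line N of each file belongs to the same pair.
        words_file.write(word + "\n")
        defs_file.write(definition + "\n")
  return {split: len(rows) for split, rows in splits.items()}


# Made-up toy pairs, only to exercise the function.
toy_pairs = [
    ("cat", "a small feline"),
    ("dog", "a domesticated canine"),
    ("emu", "a large flightless bird"),
    ("fox", "a small wild canine"),
]
demo_dir = tempfile.mkdtemp()
counts = write_split(toy_pairs, demo_dir, test_fraction=0.25)
```

Whatever tool you use, make sure the words file and the definitions file of each split have the same number of lines in the same order.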

Putting it all together

Now our word2def.py file looks like:

""" Problem definition for word to dictionary definition.
"""

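from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators.wmt import character_generator

from tensor2tensor.utils import registry

# End-of-sequence marker appended to every input and target.
EOS = text_encoder.EOS_ID

# Placeholder: point this at the directory holding the four data files
# (keep the trailing slash, since the paths below are built by concatenation).
LOCATION_OF_DATA = "LOCATION_OF_DATA/"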

# English Word2def datasets
_WORD2DEF_TRAIN_DATASETS = [
    LOCATION_OF_DATA+'words_train.txt',
    LOCATION_OF_DATA+'definitions_train.txt'
]

_WORD2DEF_TEST_DATASETS = [
    LOCATION_OF_DATA+'words_test.txt',
    LOCATION_OF_DATA+'definitions_test.txt'
]

@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""
  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  def generator(self, data_dir, tmp_dir, train):
    character_vocab = text_encoder.ByteTextEncoder()
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    return character_generator(datasets[0], datasets[1], character_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False

Hyperparameters

All hyperparameters inherit from _default_hparams() in problem.py. If you would like to customize your hyperparameters, register a new hyperparameter set in word2def.py like the example provided in the walkthrough. For example:

from tensor2tensor.models import transformer

@registry.register_hparams
def word2def_hparams():
    hparams = transformer.transformer_base_single_gpu()  # Or whatever you'd like to build off.
    hparams.batch_size = 1024
    return hparams

Run the problem

Now that we've gotten our problem set up, let's train a model and generate definitions.

We specify our problem name, the model, and hparams.

PROBLEM=word2def
MODEL=transformer
HPARAMS=word2def_hparams
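Note that the problem name is the snake_case form of the class name Word2def. The sketch below shows one common way to do such a conversion. The function camel_to_snake is hypothetical, and the registry's own conversion may treat edge cases (for example, runs of capitals) differently.

```python
import re


def camel_to_snake(name):
  """Insert '_' before each interior capital letter, then lowercase."""
  return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


problem_name = camel_to_snake("Word2def")
other_name = camel_to_snake("MyNewProblem")
```

Since Word2def has no interior capitals, it simply becomes word2def, which is the value we pass as PROBLEM.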

The rest of the steps are as given in the walkthrough.

What if we wanted to train a model to generate words given definitions? In T2T, we can change the problem name to be PROBLEM=word2def_rev.
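Conceptually, the reversed problem feeds the same examples with inputs and targets swapped. Here is a minimal sketch of that idea on a single example dict. The function reverse_example and the ids in the sample are illustrative assumptions, not how T2T implements the _rev suffix internally.

```python
def reverse_example(example):
  """Return a copy of the example with inputs and targets swapped."""
  return {"inputs": example["targets"], "targets": example["inputs"]}


# Illustrative ids only; real examples come from the generator.
forward = {"inputs": [101, 99, 118, 1], "targets": [99, 34, 117, 1]}
backward = reverse_example(forward)
```

So a model trained on word2def_rev reads a definition and learns to produce the word.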

All done. Let us know what definitions your model generated.
