Let's add a new dataset together and train the transformer model. We'll be learning to define English words by training the transformer to "translate" between English words and their definitions on a character level.
For each problem we want to tackle we create a new problem class and register it. Let's call our problem Word2def.
Since many text2text problems share similar methods, there's already a class
called Text2TextProblem that extends the base problem class, Problem
(both found in
problem.py).
For our problem, we can go ahead and create the file word2def.py in the
data_generators
folder and add our new problem, Word2def, which extends
Text2TextProblem.
Let's also register it while we're at it so we can specify the problem through
flags.
@registry.register_problem
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def is_character_level(self):
    ...

We need to implement the following methods from Text2TextProblem in our new class:
- is_character_level
- targeted_vocab_size
- generator
- input_space_id
- target_space_id
- num_shards
- vocab_name
- use_subword_tokenizer
Let's tackle them one by one:
input_space_id, target_space_id, is_character_level, targeted_vocab_size, use_subword_tokenizer:
SpaceIDs tell Tensor2Tensor what sort of space the input and target tensors are
in, for example EN_CHR (English character), EN_TOK (English token),
AUDIO_WAV (audio waveform), IMAGE, and DNA (genetic bases). The complete list can be
found in the class SpaceID in data_generators/problem.py.
Since we're generating definitions and feeding in words at the character level, we set is_character_level to True and use the same SpaceID, EN_CHR, for both input and target. Additionally, since we aren't using tokens, we don't need to give a targeted_vocab_size, and use_subword_tokenizer can simply return False.
vocab_name:
vocab_name will be used to name your vocabulary files. We can call ours 'vocab.word2def.en'.
num_shards:
The number of shards to break data files into.
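To picture what num_shards controls, here is a minimal sketch that spreads examples over a fixed number of shard files in round-robin order. The helper names (shard_filenames, assign_to_shards) and the file naming pattern are assumptions made for illustration; they are not T2T's own functions.

```python
def shard_filenames(problem_name, split, num_shards):
  """Build one illustrative file name per shard (the pattern is an assumption)."""
  return [
      "%s-%s-%05d-of-%05d" % (problem_name, split, i, num_shards)
      for i in range(num_shards)
  ]


def assign_to_shards(examples, num_shards):
  """Distribute examples round-robin so shard sizes differ by at most one."""
  shards = [[] for _ in range(num_shards)]
  for index, example in enumerate(examples):
    shards[index % num_shards].append(example)
  return shards


names = shard_filenames("word2def", "train", 100)
shards = assign_to_shards(range(250), 100)
print(names[0])
print(len(shards[0]), len(shards[99]))
```

With 250 examples and 100 shards, the first 50 shards hold three examples each and the rest hold two, which is why a shard count far larger than the dataset is rarely useful.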
@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False

generator:
We're almost done. generator generates the training and evaluation data and
stores them in files like "word2def_train.lang1" in your DATA_DIR. Thankfully
several commonly used methods like character_generator, and token_generator
are already written in the file
translate.py.
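To make concrete what a character-level generator produces, here is a simplified stand-in written in plain Python. It reads two line-aligned files and yields one dictionary of integer IDs per line pair. The function names, the reserved IDs (0 for padding, 1 for end-of-sequence) and the offset of 2 are assumptions for this sketch, not a copy of the library's implementation.

```python
import os
import tempfile

PAD_ID = 0  # assumed reserved id for padding
EOS_ID = 1  # assumed reserved id for end-of-sequence
NUM_RESERVED_IDS = 2


def encode_bytes(text):
  """Map each UTF-8 byte to an integer id, shifted past the reserved ids."""
  return [b + NUM_RESERVED_IDS for b in text.encode("utf-8")]


def toy_character_generator(source_path, target_path, eos=None):
  """Yield one {"inputs", "targets"} dict per pair of aligned lines."""
  eos_list = [] if eos is None else [eos]
  with open(source_path, encoding="utf-8") as source_file:
    with open(target_path, encoding="utf-8") as target_file:
      for source, target in zip(source_file, target_file):
        yield {
            "inputs": encode_bytes(source.strip()) + eos_list,
            "targets": encode_bytes(target.strip()) + eos_list,
        }


# Write two tiny parallel files so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
words_path = os.path.join(tmp_dir, "words.txt")
definitions_path = os.path.join(tmp_dir, "definitions.txt")
with open(words_path, "w", encoding="utf-8") as f:
  f.write("cat\ndog\n")
with open(definitions_path, "w", encoding="utf-8") as f:
  f.write("a small feline\na loyal canine\n")

examples = list(toy_character_generator(words_path, definitions_path, EOS_ID))
print(examples[0]["inputs"])
```

The key requirement this illustrates is that line N of the words file must correspond to line N of the definitions file, since the two files are read in lockstep.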
We will import character_generator and
text_encoder
to write:
def generator(self, data_dir, tmp_dir, train):
  character_vocab = text_encoder.ByteTextEncoder()
  datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
  # EOS is the end-of-sequence id (text_encoder.EOS_ID).
  return character_generator(datasets[0], datasets[1], character_vocab, EOS)

Now our word2def.py file looks like the below:
@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  def generator(self, data_dir, tmp_dir, train):
    character_vocab = text_encoder.ByteTextEncoder()
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    return character_generator(datasets[0], datasets[1], character_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False

Now we need to tell Tensor2Tensor where our data is located.
I've gone ahead and split all words into a train and test set and saved them in files called words_train.txt, words_test.txt,
definitions_train.txt, and definitions_test.txt in a directory called LOCATION_OF_DATA/. Let's tell T2T where these files are:
# English Word2def datasets
_WORD2DEF_TRAIN_DATASETS = [
    LOCATION_OF_DATA + 'words_train.txt',
    LOCATION_OF_DATA + 'definitions_train.txt'
]
_WORD2DEF_TEST_DATASETS = [
    LOCATION_OF_DATA + 'words_test.txt',
    LOCATION_OF_DATA + 'definitions_test.txt'
]

Now our word2def.py file looks like:
""" Problem definition for word to dictionary definition.
"""
import os

from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
# The text above refers to this module as translate.py; import from whichever
# module provides character_generator in your T2T version.
from tensor2tensor.data_generators.wmt import character_generator
from tensor2tensor.utils import registry

# End-of-sequence id appended to every encoded example.
EOS = text_encoder.EOS_ID

# English Word2def datasets
# LOCATION_OF_DATA is a placeholder: define it as the path to your data directory.
_WORD2DEF_TRAIN_DATASETS = [
    LOCATION_OF_DATA + 'words_train.txt',
    LOCATION_OF_DATA + 'definitions_train.txt'
]
_WORD2DEF_TEST_DATASETS = [
    LOCATION_OF_DATA + 'words_test.txt',
    LOCATION_OF_DATA + 'definitions_test.txt'
]


@registry.register_problem()
class Word2def(problem.Text2TextProblem):
  """Problem spec for English word to dictionary definition."""

  @property
  def is_character_level(self):
    return True

  @property
  def vocab_name(self):
    return "vocab.word2def.en"

  def generator(self, data_dir, tmp_dir, train):
    # Encode both files byte by byte and append EOS to each example.
    character_vocab = text_encoder.ByteTextEncoder()
    datasets = _WORD2DEF_TRAIN_DATASETS if train else _WORD2DEF_TEST_DATASETS
    return character_generator(datasets[0], datasets[1], character_vocab, EOS)

  @property
  def input_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def target_space_id(self):
    return problem.SpaceID.EN_CHR

  @property
  def num_shards(self):
    return 100

  @property
  def use_subword_tokenizer(self):
    return False

All hyperparameters inherit from _default_hparams() in problem.py. If you would like to customize your hyperparameters, register a new hyperparameter set in word2def.py like the example provided in the walkthrough. For example:
from tensor2tensor.models import transformer


@registry.register_hparams
def word2def_hparams():
  # Start from an existing hyperparameter set and override what you need.
  hparams = transformer.transformer_base_single_gpu()  # Or whatever you'd like to build off.
  hparams.batch_size = 1024
  return hparams

Now that we've gotten our problem set up, let's train a model and generate definitions.
We specify our problem name, the model, and hparams.
PROBLEM=word2def
MODEL=transformer
HPARAMS=word2def_hparams

The rest of the steps are as given in the walkthrough.
What if we wanted to train a model to generate words given definitions? In T2T, we can change the problem name to be PROBLEM=word2def_rev.
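As a rough mental model of how a registered name and the _rev suffix could fit together, the sketch below uses a toy registry: a class is stored under a snake_case version of its name, and a lookup for a name ending in _rev returns the same class plus a flag that swaps inputs and targets. Everything here (the _PROBLEMS dict, snake_case, lookup, maybe_reverse) is a hypothetical illustration, not T2T's actual registry code.

```python
import re

_PROBLEMS = {}  # toy registry: snake_case name -> problem class


def snake_case(name):
  """Convert a CamelCase class name such as Word2def to word2def."""
  return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def register_problem(cls):
  """Class decorator that stores cls under its snake_case name."""
  _PROBLEMS[snake_case(cls.__name__)] = cls
  return cls


def lookup(problem_name):
  """Return (problem class, was_reversed) for names like word2def_rev."""
  was_reversed = problem_name.endswith("_rev")
  base_name = problem_name[:-len("_rev")] if was_reversed else problem_name
  return _PROBLEMS[base_name], was_reversed


def maybe_reverse(example, was_reversed):
  """Swap inputs and targets when the reversed problem was requested."""
  if not was_reversed:
    return example
  return {"inputs": example["targets"], "targets": example["inputs"]}


@register_problem
class Word2def(object):
  """Stand-in for the real problem class."""


example = {"inputs": "cat", "targets": "a small feline"}
problem_cls, was_reversed = lookup("word2def_rev")
print(problem_cls.__name__, maybe_reverse(example, was_reversed))
```

The point of the design is that the reversed task needs no second class or second dataset: the same word/definition pairs are reused with their roles exchanged.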
All done. Let us know what definitions your model generated.