Commit 5979357

tomsiadev authored and nv-kkudrynski committed
[SIM/TF2] README updates. BYOD bugfix.
1 parent 9cb7dd0 commit 5979357

2 files changed

Lines changed: 57 additions & 4 deletions

TensorFlow2/Recommendation/SIM/README.md

Lines changed: 54 additions & 1 deletion
@@ -32,6 +32,9 @@ This repository provides a script and recipe to train the SIM model to achieve s
3232
* [Channel definitions and requirements](#channel-definitions-and-requirements)
3333
* [Training process](#training-process)
3434
* [Inference process](#inference-process)
35+
* [Log format](#log-format)
36+
* [Training log data](#training-log-data)
37+
* [Inference log data](#inference-log-data)
3538
- [Performance](#performance)
3639
* [Benchmarking](#benchmarking)
3740
* [Training performance benchmark](#training-performance-benchmark)
@@ -413,7 +416,10 @@ To train your model using mixed or TF32 precision with Tensor Cores or using FP3
413416

414417
5. Start preprocessing.
415418

416-
For details of the required file format and certain preprocessing parameters (for example, `${NUMBER_OF_USER_FEATURES}` refer to [BYO dataset](#byo-dataset))
419+
For details of the required file format and certain preprocessing parameters refer to [BYO dataset](#byo-dataset).
420+
421+
422+
`${NUMBER_OF_USER_FEATURES}` defines how many user-specific features are present in the dataset. If you use the default Amazon Books dataset and the `sim_preprocessing` script (as shown below), set this parameter to <b>1</b>: in this case, the only user-specific feature is <b>user_id</b>, and all other features are item-specific.
417423

418424
```bash
419425
python preprocessing/sim_preprocessing.py \
@@ -452,6 +458,8 @@ To train your model using mixed or TF32 precision with Tensor Cores or using FP3
452458
--amp
453459
```
454460

461+
For an explanation of the output logs, refer to the [Log format](#log-format) section.
462+
455463
Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark your performance to [Training performance benchmark](#training-performance-results), or [Inference performance benchmark](#inference-performance-results). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
456464

457465
## Advanced
@@ -705,6 +713,51 @@ Inference can be run using `main.py` script by specifying the `--mode inference
705713

706714
Example usage of training and inference are demonstrated in [Quick Start Guide](#quick-start-guide).
707715

716+
### Log format
717+
718+
There are three types of log lines during model execution. Each of them has a `step` value, but its format depends on the type of log:
719+
- <b>step log</b> - step value is in format `[epoch, step]`:
720+
721+
DLLL {"timestamp": ..., "datetime": ..., "elapsedtime": ..., "type": ..., `"step": [2, 79]`, "data": ...}
722+
723+
- <b>end of epoch log</b> - step value is in format `[epoch]`:
724+
725+
DLLL {"timestamp": ..., "datetime": ..., "elapsedtime": ..., "type": ..., `"step": [2]`, "data": ...}
726+
727+
- <b>summary log</b> - logged once at the end of script execution. Step value is in format `[]`:
728+
729+
DLLL {"timestamp": ..., "datetime": ..., "elapsedtime": ..., "type": ..., `"step": []`, "data": ...}
730+
731+
In these logs, the `data` field contains a dictionary of the form `{metric: value}`. The metrics logged differ based on the log type (step, end of epoch, summary) and the model mode (training, inference).
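
The three log types above can be told apart by the length of the `step` list. The following is a minimal sketch, not part of the repository: it assumes that each log line is the literal prefix `DLLL ` followed by one valid JSON object, and the function name `classify_log_line` is hypothetical.

```python
import json

DLLL_PREFIX = "DLLL "  # prefix shown in the log examples above

# Length of the "step" list -> log type, following the description above.
KIND_BY_STEP_LENGTH = {2: "step", 1: "end_of_epoch", 0: "summary"}


def classify_log_line(line):
    """Return (kind, record) for a DLLL line, or None for any other line.

    Assumption: everything after the prefix is a single JSON object.
    """
    line = line.strip()
    if not line.startswith(DLLL_PREFIX):
        return None
    record = json.loads(line[len(DLLL_PREFIX):])
    step = record.get("step")
    if not isinstance(step, list):
        # Anything that does not carry a list-valued step is left unclassified.
        return "other", record
    return KIND_BY_STEP_LENGTH.get(len(step), "other"), record


# Illustrative line only; the values are made up.
example = 'DLLL {"type": "LOG", "step": [2, 79], "data": {"total_loss": 0.7}}'
kind, record = classify_log_line(example)
print(kind, record["data"])
```

Lines that do not start with the prefix (for example, framework warnings) are skipped by returning `None`.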
732+
733+
#### Training log data
734+
- <b> step log </b>
735+
- classification_loss - loss at the final output of the model.
736+
- dien_aux_loss - loss at the output of auxiliary model.
737+
- total_loss - sum of the above.
738+
- samples/s - estimated throughput in samples per second.
739+
- <b> end of epoch log </b>
740+
- throughput - average throughput during epoch in samples/s.
741+
- time - epoch time in seconds.
742+
- train_auc - AUC during evaluation on train set.
743+
- test_auc - AUC during evaluation on test set.
744+
- train_loss - loss during evaluation on train set.
745+
- test_loss - loss during evaluation on test set.
746+
- latency_[mean, p90, p95, p99] - latencies in milliseconds.
747+
- <b> summary log </b>
748+
- time_to_train - total training time in seconds.
749+
- train_auc, test_auc, train_loss, test_loss - results from the last epoch (see above).
750+
751+
#### Inference log data
752+
- <b> step log </b>
753+
- samples/s - estimated throughput in samples per second.
754+
- <b> end of epoch log is not present</b>
755+
- <b> summary log </b>
756+
- throughput - average throughput during epoch in samples/s.
757+
- time - total execution time in seconds.
758+
- latency_[mean, p90, p95, p99] - latencies in milliseconds.
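
As a usage illustration for the metric lists above, the sketch below collects the end-of-epoch metrics per epoch and the final summary from a sequence of log lines. It is an assumption-based example rather than a tool shipped with the model: it assumes the `DLLL ` prefix is followed by valid JSON, the helper name `read_metrics` is hypothetical, and the sample values are invented.

```python
import json

PREFIX = "DLLL "


def read_metrics(lines):
    """Return (per_epoch, summary) dictionaries built from DLLL log lines."""
    per_epoch = {}  # epoch number -> {metric: value} from end-of-epoch logs
    summary = {}    # {metric: value} from the single summary log
    for line in lines:
        line = line.strip()
        if not line.startswith(PREFIX):
            continue
        record = json.loads(line[len(PREFIX):])
        step = record.get("step")
        data = record.get("data", {})
        if not isinstance(step, list):
            continue
        if len(step) == 1:    # end of epoch log: step is [epoch]
            per_epoch[step[0]] = data
        elif len(step) == 0:  # summary log: step is []
            summary = data
    return per_epoch, summary


# Invented sample lines in training mode.
sample = [
    'DLLL {"step": [0, 10], "data": {"total_loss": 0.9, "samples/s": 1000.0}}',
    'DLLL {"step": [0], "data": {"train_auc": 0.70, "test_auc": 0.68}}',
    'DLLL {"step": [1], "data": {"train_auc": 0.75, "test_auc": 0.71}}',
    'DLLL {"step": [], "data": {"time_to_train": 120.0, "test_auc": 0.71}}',
]
per_epoch, summary = read_metrics(sample)
for epoch in sorted(per_epoch):
    print(epoch, per_epoch[epoch]["test_auc"])
print(summary["time_to_train"])
```

In inference mode, `per_epoch` would stay empty because no end of epoch log is written, and `summary` would hold the throughput, time, and latency metrics.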
759+
760+
708761
## Performance
709762

710763
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).

TensorFlow2/Recommendation/SIM/main.py

Lines changed: 3 additions & 3 deletions
@@ -624,8 +624,8 @@ def main(
624624

625625
feature_spec = FeatureSpec.from_yaml(dataset_dir / feature_spec)
626626

627-
# since all features must be included in each tfrecord file, therefore we can select only first file of each chunk
628-
train_files = [dataset_dir / chunk[FILES_SELECTOR][0] for chunk in feature_spec.source_spec[TRAIN_MAPPING]]
627+
# since each tfrecord file must include all of the features, it is enough to read first chunk for each split.
628+
train_files = [dataset_dir / file for file in feature_spec.source_spec[TRAIN_MAPPING][0][FILES_SELECTOR]]
629629

630630
if prefetch_train_size < 0:
631631
prefetch_train_size = train_dataset_size // global_batch_size
@@ -637,7 +637,7 @@ def main(
637637
)
638638

639639
if mode == "train":
640-
test_files = [dataset_dir / chunk[FILES_SELECTOR][0] for chunk in feature_spec.source_spec[TEST_MAPPING]]
640+
test_files = [dataset_dir / file for file in feature_spec.source_spec[TEST_MAPPING][0][FILES_SELECTOR]]
641641
data_iterator_test = get_data_iterator(
642642
test_files, feature_spec, batch_size, num_gpus, long_seq_length,
643643
amp=amp, disable_cache=disable_cache, prefetch_size=prefetch_test_size
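
To make the effect of the `main.py` change concrete, here is a small sketch that applies the old and the new list comprehension to a toy source spec. It uses plain dictionaries instead of the repository's `FeatureSpec` class, and the `"files"` key value and the file names are hypothetical placeholders.

```python
from pathlib import Path

FILES_SELECTOR = "files"  # hypothetical key; the real constant lives in the repository


def first_file_of_each_chunk(dataset_dir, chunks):
    """Old behaviour: one file (the first) from every chunk."""
    return [dataset_dir / chunk[FILES_SELECTOR][0] for chunk in chunks]


def all_files_of_first_chunk(dataset_dir, chunks):
    """New behaviour: every file listed in the first chunk only."""
    return [dataset_dir / file for file in chunks[0][FILES_SELECTOR]]


dataset_dir = Path("/data")
# Toy stand-in for feature_spec.source_spec[TRAIN_MAPPING].
train_chunks = [
    {FILES_SELECTOR: ["train/part_0.tfrecord", "train/part_1.tfrecord"]},
    {FILES_SELECTOR: ["train/extra_0.tfrecord"]},
]

old_files = first_file_of_each_chunk(dataset_dir, train_chunks)
new_files = all_files_of_first_chunk(dataset_dir, train_chunks)
print([f.name for f in old_files])  # ['part_0.tfrecord', 'extra_0.tfrecord']
print([f.name for f in new_files])  # ['part_0.tfrecord', 'part_1.tfrecord']
```

With the old selection, a chunk that lists several files contributes only its first file, so the remaining files in that chunk are never read. The new selection reads every file of the first chunk, which matches the updated code comment.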
