[SIM/TF2] README updates. BYOD bugfix.

tomsiadev · nv-kkudrynski · commit 597935798c1e · 2022-05-23T10:43:53.000-07:00
diff --git a/TensorFlow2/Recommendation/SIM/README.md b/TensorFlow2/Recommendation/SIM/README.md
@@ -32,6 +32,9 @@ This repository provides a script and recipe to train the SIM model to achieve s
             * [Channel definitions and requirements](#channel-definitions-and-requirements)
     * [Training process](#training-process)
     * [Inference process](#inference-process)
+    * [Log format](#log-format)
+        * [Training log data](#training-log-data)
+        * [Inference log data](#inference-log-data)
 - [Performance](#performance)
     * [Benchmarking](#benchmarking)
         * [Training performance benchmark](#training-performance-benchmark)
@@ -413,7 +416,10 @@ To train your model using mixed or TF32 precision with Tensor Cores or using FP3
 
 5. Start preprocessing.
 
-    For details of the required file format and certain preprocessing parameters (for example, `${NUMBER_OF_USER_FEATURES}` refer to [BYO dataset](#byo-dataset))
+    For details of the required file format and certain preprocessing parameters, refer to [BYO dataset](#byo-dataset).
+    
+    
+    `${NUMBER_OF_USER_FEATURES}` defines how many user-specific features are present in the dataset. If you use the default Amazon Books dataset and the `sim_preprocessing` script (as shown below), set this parameter to <b>1</b>: in this case, the only user-specific feature is <b>user_id</b>, and all other features are item-specific.
 
    ```bash
    python preprocessing/sim_preprocessing.py \
@@ -452,6 +458,8 @@ To train your model using mixed or TF32 precision with Tensor Cores or using FP3
     --amp
    ```
 
+For an explanation of the output logs, refer to the [Log format](#log-format) section.
+
 Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark your performance to [Training performance benchmark](#training-performance-results), or [Inference performance benchmark](#inference-performance-results). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
 
 ## Advanced
@@ -705,6 +713,51 @@ Inference  can be run using `main.py` script by specifying the `--mode inference
 
 Example usage of training and inference are demonstrated in [Quick Start Guide](#quick-start-guide).
 
+### Log format
+
+There are three types of log lines during model execution. Each of them has a `step` value, but its format depends on the type of log:
+- <b>step log</b> - step value is in format `[epoch, step]`:
+
+DLLL {"timestamp": ..., "datetime": ..., "elapsedtime": ..., "type": ..., `"step": [2, 79]`, "data": ...}
+
+- <b>end of epoch log</b> - step value is in format `[epoch]`:
+
+DLLL {"timestamp": ..., "datetime": ..., "elapsedtime": ..., "type": ..., `"step": [2]`, "data": ...}
+
+- <b>summary log</b> - logged once at the end of script execution. Step value is in format `[]`:
+
+DLLL {"timestamp": ..., "datetime": ..., "elapsedtime": ..., "type": ..., `"step": []`, "data": ...}
+
+In these logs, the `data` field contains a dictionary of the form `{metric: value}`. The metrics logged differ based on log type (step, end of epoch, summary) and model mode (training, inference).
+
+#### Training log data
+- <b> step log </b>
+  - classification_loss - loss at the final output of the model.
+  - dien_aux_loss - loss at the output of auxiliary model.
+  - total_loss - sum of the above.
+  - samples/s - estimated throughput in samples per second.
+- <b> end of epoch log </b>
+  - throughput - average throughput during epoch in samples/s.
+  - time - epoch time in seconds.
+  - train_auc - AUC during evaluation on train set.
+  - test_auc - AUC during evaluation on test set.
+  - train_loss - loss during evaluation on train set.
+  - test_loss - loss during evaluation on test set.
+  - latency_[mean, p90, p95, p99] - latencies in milliseconds.
+- <b> summary log </b>
+  - time_to_train - total training time in seconds.
+  - train_auc, test_auc, train_loss, test_loss - results from the last epoch (see above).
+
+#### Inference log data
+- <b> step log </b>
+  - samples/s - estimated throughput in samples per second.
+- <b> end of epoch log is not present</b>
+- <b> summary log </b>
+  - throughput - average throughput during epoch in samples/s.  
+  - time - total execution time in seconds.
+  - latency_[mean, p90, p95, p99] - latencies in milliseconds.
+
+
 ## Performance
 
 The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
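
Note on the new "Log format" section: the three log types it describes can be told apart by the length of the `step` list alone. The following is a minimal sketch of that idea, not part of this commit. It assumes that each log line is the literal prefix `DLLL ` followed by a JSON object, as the README examples suggest. The helper names `parse_dlll_line` and `log_kind` and the sample line are hypothetical, and the sample omits the timestamp fields for brevity.

```python
import json

DLLL_PREFIX = "DLLL "  # assumed prefix, taken from the README examples


def parse_dlll_line(line):
    """Return the decoded JSON payload of a DLLL line, or None for any other line."""
    if not line.startswith(DLLL_PREFIX):
        return None
    return json.loads(line[len(DLLL_PREFIX):])


def log_kind(step):
    """Classify a record by the length of its step list.

    [epoch, step] -> step log, [epoch] -> end of epoch log, [] -> summary log.
    """
    return {2: "step", 1: "end of epoch", 0: "summary"}[len(step)]


# Hypothetical sample line; the metric value is made up for illustration.
sample = 'DLLL {"step": [2, 79], "data": {"total_loss": 0.5}}'
record = parse_dlll_line(sample)
print(log_kind(record["step"]), record["data"])
```

Lines without the prefix return `None`, so the same helper can be run over a mixed stdout capture that also contains non-DLLL output.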
diff --git a/TensorFlow2/Recommendation/SIM/main.py b/TensorFlow2/Recommendation/SIM/main.py
@@ -624,8 +624,8 @@ def main(
 
     feature_spec = FeatureSpec.from_yaml(dataset_dir / feature_spec)
 
-    # since all features must be included in each tfrecord file, therefore we can select only first file of each chunk
-    train_files = [dataset_dir / chunk[FILES_SELECTOR][0] for chunk in feature_spec.source_spec[TRAIN_MAPPING]]
+    # Since each tfrecord file must include all of the features, it is enough to read the first chunk of each split.
+    train_files = [dataset_dir / file for file in feature_spec.source_spec[TRAIN_MAPPING][0][FILES_SELECTOR]]
 
     if prefetch_train_size < 0:
         prefetch_train_size = train_dataset_size // global_batch_size
@@ -637,7 +637,7 @@ def main(
     )
 
     if mode == "train":
-        test_files = [dataset_dir / chunk[FILES_SELECTOR][0] for chunk in feature_spec.source_spec[TEST_MAPPING]]
+        test_files = [dataset_dir / file for file in feature_spec.source_spec[TEST_MAPPING][0][FILES_SELECTOR]]
         data_iterator_test = get_data_iterator(
             test_files, feature_spec, batch_size, num_gpus, long_seq_length,
             amp=amp, disable_cache=disable_cache, prefetch_size=prefetch_test_size
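
Note on the BYOD fix in `main.py`: the old expression took the first file of every chunk, while the new one takes every file of the first chunk. The sketch below is only an illustration and does not use the repository's `FeatureSpec` class. It reproduces both list comprehensions on a plain list of dictionaries. The key name `"files"`, the directory `/data`, and the file names are assumptions made up for this example.

```python
from pathlib import Path

FILES_SELECTOR = "files"  # assumed key name, for illustration only

# Hypothetical train mapping: two chunks, the first one holding two files.
train_chunks = [
    {FILES_SELECTOR: ["train/part_0.tfrecord", "train/part_1.tfrecord"]},
    {FILES_SELECTOR: ["train_extra/part_0.tfrecord"]},
]
dataset_dir = Path("/data")  # hypothetical dataset location


def first_file_of_each_chunk(dataset_dir, chunks):
    """Selection used before this commit."""
    return [dataset_dir / chunk[FILES_SELECTOR][0] for chunk in chunks]


def all_files_of_first_chunk(dataset_dir, chunks):
    """Selection used after this commit."""
    return [dataset_dir / file for file in chunks[0][FILES_SELECTOR]]


old_selection = first_file_of_each_chunk(dataset_dir, train_chunks)
new_selection = all_files_of_first_chunk(dataset_dir, train_chunks)
print(old_selection)
print(new_selection)
```

With this input, the old selection skips `train/part_1.tfrecord` and picks up a file from the second chunk, whereas the new selection reads the complete first chunk and nothing else.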