Correct training script filename in README and fix step running time display bug (deepspeedai#815)

UEFI-code · lekurile · web-flow · commit dd0f181bad81 · 2023-12-11T17:05:40.000-06:00
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
diff --git a/applications/DeepSpeed-Chat/README.md b/applications/DeepSpeed-Chat/README.md
@@ -136,7 +136,7 @@ pip install -e .
 If you only have around **1-2 hour** for coffee or lunch break, you can also try to train a small/toy model with DeepSpeed-Chat. For example, we prepared a training example for a **1.3B** model with a single dataset to test our framework on your consumer-grade GPUs. The best part is that you will have your model checkpoint ready to play with when you are back from your lunch break!
 
   ```bash
-  python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
+  python e2e_rlhf.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
   ```
 
   See the following table for the E2E time breakdown for training a 1.3 billion parameter ChatGPT model via DeepSpeed-Chat on a single commodity NVIDIA A6000 GPU with 48GB memory.
@@ -156,7 +156,7 @@ If you only have around **1-2 hour** for coffee or lunch break, you can also try
 If you only have around **half a day** and only a single server node, we suggest using an example of pretrained **OPT-13B** as the actor model and OPT-350M as the reward model in the following single script to generate a final 13B ChatGPT-style model:
 
   ```bash
-  python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
+  python e2e_rlhf.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
   ```
 
   See the following table for the E2E time breakdown for training a 13 billion parameter ChatGPT model via DeepSpeed-Chat on a single DGX node with 8 NVIDIA A100-40G GPUs.
@@ -175,7 +175,7 @@ If you only have around **half a day** and only a single server node, we suggest
 Want to try different model sizes and configurations? You got it! With DeepSpeed-Chat, users can easily do that. For example, if you have access to multi-nodes cluster or cloud resources and prefer to train a larger and higher-quality model for your research or business, you can simply use a similar script with your desired model sizes, e.g., **66B** and GPU counts=64:
 
   ```bash
-  python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node
+  python e2e_rlhf.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node
   ```
 
   See the following table for E2E time breakdown for training a 66 billion parameter ChatGPT model via DeepSpeed-Chat on 8 DGX nodes with 8 NVIDIA A100-80G GPUs/node.
diff --git a/applications/DeepSpeed-Chat/e2e_rlhf.py b/applications/DeepSpeed-Chat/e2e_rlhf.py
@@ -4,27 +4,27 @@
 # DeepSpeed Team
 """
 Run all steps with default settings:
-$ python3 train.py
+$ python3 e2e_rlhf.py
 
 Change the model used for each step:
-$ python3 train.py --actor-model 350m --reward-model 1.3b
+$ python3 e2e_rlhf.py --actor-model 350m --reward-model 1.3b
 
 Change the ZeRO stage used for actor/reward models:
-$ python3 train.py --actor-zero-stage 1 --reward-zero-stage 3
+$ python3 e2e_rlhf.py --actor-zero-stage 1 --reward-zero-stage 3
 
 Run a subset of the steps:
-$ python3 train.py --step 1 2
+$ python3 e2e_rlhf.py --step 1 2
 
 Note: Step 3 relies on models trained in Steps 1 & 2. If you have already
 trained these models, you can run just Step 3 and select which models from
 Steps 1 & 2 to use. For example, let's train models for Steps 1 & 2 using
 125m and 350m models:
-$ python3 train.py --step 1 2 --actor-model 125m --reward-model 125m
-$ python3 train.py --step 1 2 --actor-model 350m --reward-model 350m
+$ python3 e2e_rlhf.py --step 1 2 --actor-model 125m --reward-model 125m
+$ python3 e2e_rlhf.py --step 1 2 --actor-model 350m --reward-model 350m
 
 Now we can run Step 3 with any combination of these models:
-$ python3 train.py --step 3 --actor-model 125m --reward-model 350m
-$ python3 train.py --step 3 --actor-model 350m --reward-model 125m
+$ python3 e2e_rlhf.py --step 3 --actor-model 125m --reward-model 350m
+$ python3 e2e_rlhf.py --step 3 --actor-model 350m --reward-model 125m
 """
 
 import argparse
@@ -33,6 +33,7 @@
 import os
 import datetime
 import time
+import sys
 
 step_dirs = {
     1: "training/step1_supervised_finetuning",
@@ -144,7 +145,7 @@ def verify_model(args, step_num):
     model_file = os.path.join(output_dir, "pytorch_model.bin")
     if not os.path.isfile(model_file):
         error_str = f"Step {step_num} model has not been trained. Train it with:\n"
-        error_str += f"python3 train.py --step {step_num}"
+        error_str += f"{sys.executable.split('/')[-1]} {sys.argv[0]} --step {step_num}"
         error_str += f" --{model_type[step_num]}-model {model_size}"
         raise RuntimeError(error_str)
 
@@ -194,7 +195,7 @@ def main(args):
         cmd = get_cmd(args, step_num)
         launch_cmd(args, step_num, cmd)
 
-        step_time = int(time.time() - start_time)
+        step_time = int(time.time() - step_start_time)
         time_str = str(datetime.timedelta(seconds=step_time))
         print(f"---=== Finished Step {step_num} in {time_str} ===---")
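
The last hunk changes the subtraction from `start_time` to `step_start_time`, so each "Finished Step" message reports that step's own duration rather than the time elapsed since the whole run began. The loop around that line is not shown in the diff, so the following is only a minimal sketch of the idea under assumptions: `run_steps`, `run_step`, and the injectable `clock` parameter are hypothetical names introduced for illustration and are not part of `e2e_rlhf.py`.

```python
import datetime
import time


def run_steps(steps, run_step, clock=time.time):
    """Run each step and return per-step and total durations as H:MM:SS strings.

    Hypothetical helper for illustration; not part of e2e_rlhf.py.
    """
    durations = {}
    start_time = clock()  # start of the whole run
    for step_num in steps:
        step_start_time = clock()  # reset at the start of every step
        run_step(step_num)
        # Subtract the per-step start, not the run start, so that later
        # steps do not also count the time spent in earlier ones.
        step_time = int(clock() - step_start_time)
        durations[step_num] = str(datetime.timedelta(seconds=step_time))
    total_time = int(clock() - start_time)
    durations["total"] = str(datetime.timedelta(seconds=total_time))
    return durations
```

Had the loop subtracted `start_time` inside the body, the second step's reported time would also include the first step's duration, which appears to be the display bug this commit addresses.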