Commit dd0f181

UEFI-code and lekurile authored
Correction training script filename in README and Fix Bug for Step Running Time Display (deepspeedai#815)
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
1 parent 0e10c4b commit dd0f181

2 files changed

Lines changed: 14 additions & 13 deletions

applications/DeepSpeed-Chat/README.md

Lines changed: 3 additions & 3 deletions
@@ -136,7 +136,7 @@ pip install -e .
 If you only have around **1-2 hour** for coffee or lunch break, you can also try to train a small/toy model with DeepSpeed-Chat. For example, we prepared a training example for a **1.3B** model with a single dataset to test our framework on your consumer-grade GPUs. The best part is that you will have your model checkpoint ready to play with when you are back from your lunch break!

 ```bash
-python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
+python e2e_rlhf.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
 ```

 See the following table for the E2E time breakdown for training a 1.3 billion parameter ChatGPT model via DeepSpeed-Chat on a single commodity NVIDIA A6000 GPU with 48GB memory.
@@ -156,7 +156,7 @@ If you only have around **1-2 hour** for coffee or lunch break, you can also try
 If you only have around **half a day** and only a single server node, we suggest using an example of pretrained **OPT-13B** as the actor model and OPT-350M as the reward model in the following single script to generate a final 13B ChatGPT-style model:

 ```bash
-python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
+python e2e_rlhf.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
 ```

 See the following table for the E2E time breakdown for training a 13 billion parameter ChatGPT model via DeepSpeed-Chat on a single DGX node with 8 NVIDIA A100-40G GPUs.
@@ -175,7 +175,7 @@ If you only have around **half a day** and only a single server node, we suggest
 Want to try different model sizes and configurations? You got it! With DeepSpeed-Chat, users can easily do that. For example, if you have access to multi-nodes cluster or cloud resources and prefer to train a larger and higher-quality model for your research or business, you can simply use a similar script with your desired model sizes, e.g., **66B** and GPU counts=64:

 ```bash
-python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node
+python e2e_rlhf.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node
 ```

 See the following table for E2E time breakdown for training a 66 billion parameter ChatGPT model via DeepSpeed-Chat on 8 DGX nodes with 8 NVIDIA A100-80G GPUs/node.

applications/DeepSpeed-Chat/e2e_rlhf.py

Lines changed: 11 additions & 10 deletions
@@ -4,27 +4,27 @@
 # DeepSpeed Team
 """
 Run all steps with default settings:
-$ python3 train.py
+$ python3 e2e_rlhf.py

 Change the model used for each step:
-$ python3 train.py --actor-model 350m --reward-model 1.3b
+$ python3 e2e_rlhf.py --actor-model 350m --reward-model 1.3b

 Change the ZeRO stage used for actor/reward models:
-$ python3 train.py --actor-zero-stage 1 --reward-zero-stage 3
+$ python3 e2e_rlhf.py --actor-zero-stage 1 --reward-zero-stage 3

 Run a subset of the steps:
-$ python3 train.py --step 1 2
+$ python3 e2e_rlhf.py --step 1 2

 Note: Step 3 relies on models trained in Steps 1 & 2. If you have already
 trained these models, you can run just Step 3 and select which models from
 Steps 1 & 2 to use. For example, let's train models for Steps 1 & 2 using
 125m and 350m models:
-$ python3 train.py --step 1 2 --actor-model 125m --reward-model 125m
-$ python3 train.py --step 1 2 --actor-model 350m --reward-model 350m
+$ python3 e2e_rlhf.py --step 1 2 --actor-model 125m --reward-model 125m
+$ python3 e2e_rlhf.py --step 1 2 --actor-model 350m --reward-model 350m

 Now we can run Step 3 with any combination of these models:
-$ python3 train.py --step 3 --actor-model 125m --reward-model 350m
-$ python3 train.py --step 3 --actor-model 350m --reward-model 125m
+$ python3 e2e_rlhf.py --step 3 --actor-model 125m --reward-model 350m
+$ python3 e2e_rlhf.py --step 3 --actor-model 350m --reward-model 125m
 """

 import argparse
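
The usage examples in this docstring imply a multi-value `--step` flag plus per-model size flags. As a rough illustration only, a command line of that shape could be parsed with `argparse` as sketched here. The flag names come from the docstring; the types, defaults, choices, and the helper name `parse_args` are assumptions and are not shown in this diff.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical parser: flag names are taken from the docstring,
    # but the types, defaults, and choices below are assumptions.
    parser = argparse.ArgumentParser(description="Sketch of an end-to-end RLHF launcher CLI")
    parser.add_argument("--step", type=int, nargs="+", choices=(1, 2, 3),
                        default=[1, 2, 3], help="Which training steps to run")
    parser.add_argument("--actor-model", default="1.3b", help="Actor model size")
    parser.add_argument("--reward-model", default="350m", help="Reward model size")
    return parser.parse_args(argv)


# Equivalent to: python3 e2e_rlhf.py --step 1 2 --actor-model 125m
args = parse_args(["--step", "1", "2", "--actor-model", "125m"])
print(args.step, args.actor_model, args.reward_model)  # [1, 2] 125m 350m
```

Using `nargs="+"` is what lets `--step 1 2` arrive as a list, which matches the "run a subset of the steps" usage in the docstring.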
@@ -33,6 +33,7 @@
 import os
 import datetime
 import time
+import sys

 step_dirs = {
     1: "training/step1_supervised_finetuning",
@@ -144,7 +145,7 @@ def verify_model(args, step_num):
     model_file = os.path.join(output_dir, "pytorch_model.bin")
     if not os.path.isfile(model_file):
         error_str = f"Step {step_num} model has not been trained. Train it with:\n"
-        error_str += f"python3 train.py --step {step_num}"
+        error_str += f"{sys.executable.split('/')[-1]} {sys.argv[0]} --step {step_num}"
         error_str += f" --{model_type[step_num]}-model {model_size}"
         raise RuntimeError(error_str)
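
The change in this hunk builds the retraining hint from the running interpreter and script name instead of a hard-coded `python3 train.py`. A minimal sketch of the same idea follows, with one deliberate difference: it uses `os.path.basename` rather than `split('/')`, since a `/` split would leave a backslash-separated Windows path intact. The function name `retrain_hint` and its parameters are hypothetical and not part of the commit.

```python
import os
import sys


def retrain_hint(step_num, model_type, model_size, executable=None, script=None):
    # executable/script default to the running interpreter and script,
    # the same values the diff reads from sys.executable and sys.argv[0].
    executable = executable if executable is not None else sys.executable
    script = script if script is not None else sys.argv[0]
    # Keep only the interpreter's file name, e.g. "python3".
    interpreter = os.path.basename(executable)
    return (f"{interpreter} {script} --step {step_num}"
            f" --{model_type}-model {model_size}")


print(retrain_hint(1, "actor", "1.3b",
                   executable="/usr/bin/python3", script="e2e_rlhf.py"))
# python3 e2e_rlhf.py --step 1 --actor-model 1.3b
```

Deriving the names at run time means the hint stays correct if the script is renamed again, which is the failure this commit fixes in the README and docstring.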

@@ -194,7 +195,7 @@ def main(args):
         cmd = get_cmd(args, step_num)
         launch_cmd(args, step_num, cmd)

-        step_time = int(time.time() - start_time)
+        step_time = int(time.time() - step_start_time)
         time_str = str(datetime.timedelta(seconds=step_time))
         print(f"---=== Finished Step {step_num} in {time_str} ===---")
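
The bug fixed in this hunk is that each step's duration was measured from `start_time`, so every step after the first would also include the time of the steps before it. The sketch below illustrates the corrected pattern. It assumes `step_start_time` is reset at the top of each loop iteration, which this hunk does not show; `run_steps`, `run_step`, and the injectable `clock` are hypothetical names added so the behavior can be checked without waiting.

```python
import datetime
import time


def run_steps(steps, run_step, clock=time.time):
    # Returns per-step durations keyed by step number, plus the overall "total".
    durations = {}
    start_time = clock()              # taken once, for the whole run
    for step_num in steps:
        step_start_time = clock()     # reset for every step (assumed placement)
        run_step(step_num)
        step_time = int(clock() - step_start_time)
        durations[step_num] = str(datetime.timedelta(seconds=step_time))
    total_time = int(clock() - start_time)
    durations["total"] = str(datetime.timedelta(seconds=total_time))
    return durations


# Fake clock: step 1 takes 60 s, step 2 takes 3665 s.
ticks = iter([0, 0, 60, 60, 3725, 3725])
print(run_steps([1, 2], run_step=lambda n: None, clock=lambda: next(ticks)))
# {1: '0:01:00', 2: '1:01:05', 'total': '1:02:05'}
```

With the pre-fix expression, step 2 in this example would have been reported as `1:02:05`, the same as the total, instead of `1:01:05`.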
