
Commit d570b2c

mrwyattii and yaozhewei authored
Change DS-Chat script flags for deployment type (deepspeedai#291)
* refactor num-gpus flag to deployment-type
* update docs
* improve error message

---------

Co-authored-by: Zhewei Yao <zheweiy@berkeley.edu>
1 parent d22b82b commit d570b2c

2 files changed

Lines changed: 23 additions & 31 deletions


applications/DeepSpeed-Chat/README.md

Lines changed: 3 additions & 3 deletions
@@ -116,7 +116,7 @@ pip install -r requirements.txt
 If you only have around **1-2 hour** for coffee or lunch break, you can also try to train a small/toy model with DeepSpeed-Chat. For example, we prepared a training example for a **1.3B** model with a single dataset to test our framework on your consumer-grade GPUs. The best part is that you will have your model checkpoint ready to play with when you are back from your lunch break!
 
 ```bash
-python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1
+python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
 ```
 
 See the following table for the E2E time breakdown for training a 1.3 billion parameter ChatGPT model via DeepSpeed-Chat on a single commodity NVIDIA A6000 GPU with 48GB memory.
@@ -136,7 +136,7 @@ If you only have around **1-2 hour** for coffee or lunch break, you can also try
 If you only have around **half a day** and only a single server node, we suggest to use an example of pretrained **OPT-13B** as the actor model and OPT-350M as the reward model in the following single script to generate a final 13B ChatGPT-style model:
 
 ```bash
-python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8
+python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
 ```
 
 See the following table for the E2E time breakdown for training a 13 billion parameter ChatGPT model via DeepSpeed-Chat on a single DGX node with 8 NVIDIA A100-40G GPUs.
@@ -155,7 +155,7 @@ If you only have around **half a day** and only a single server node, we suggest
 Want to try different model sizes and configurations? You got it! With DeepSpeed-Chat, users can easily do that. For example, if you have access to multi-nodes cluster or cloud resources and prefer to train a larger and higher-quality model for your research or business, you can simply use a similar script with your desired model sizes, e.g., **66B** and GPU counts=64:
 
 ```bash
-python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --num-gpus 64
+python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --deployment-type multi_node
 ```
 
 See the following table for E2E time breakdown for training a 66 billion parameter ChatGPT model via DeepSpeed-Chat on 8 DGX nodes with 8 NVIDIA A100-80G GPUs/node.
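
For anyone updating existing launch commands, the README changes above amount to a fixed substitution: each previously accepted `--num-gpus` value corresponds to one `--deployment-type` value, following the `if`/`elif` chain this commit removes from `parse_args()`. The sketch below illustrates that substitution. It is not part of the commit, and the helper names `deployment_type_for` and `migrate_argv` are hypothetical.

```python
# Hypothetical migration helper; NOT part of train.py or this commit.
# The mapping mirrors the if/elif chain removed from parse_args().
NUM_GPUS_TO_DEPLOYMENT_TYPE = {
    1: "single_gpu",
    8: "single_node",
    64: "multi_node",
}


def deployment_type_for(num_gpus):
    """Return the --deployment-type value that replaces an old --num-gpus value."""
    try:
        return NUM_GPUS_TO_DEPLOYMENT_TYPE[num_gpus]
    except KeyError:
        raise ValueError(
            f"--num-gpus {num_gpus} has no equivalent; "
            "the old flag accepted only 1, 8, or 64"
        ) from None


def migrate_argv(argv):
    """Rewrite `--num-gpus N` or `--num-gpus=N` into `--deployment-type T`."""
    out = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--num-gpus" and i + 1 < len(argv):
            # Flag and value are separate tokens: consume both.
            out += ["--deployment-type", deployment_type_for(int(argv[i + 1]))]
            i += 2
        elif arg.startswith("--num-gpus="):
            # Flag and value are joined with '='.
            value = int(arg.split("=", 1)[1])
            out += ["--deployment-type", deployment_type_for(value)]
            i += 1
        else:
            out.append(arg)
            i += 1
    return out


if __name__ == "__main__":
    old = ["--actor-model", "facebook/opt-13b",
           "--reward-model", "facebook/opt-350m", "--num-gpus", "8"]
    print(" ".join(["python", "train.py"] + migrate_argv(old)))
```

Because the old flag only ever accepted 1, 8, or 64, any other value raises an error here rather than being mapped to a guess.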

applications/DeepSpeed-Chat/train.py

Lines changed: 20 additions & 28 deletions
@@ -40,16 +40,7 @@
     3: "training/step3_rlhf_finetuning",
 }
 model_type = {1: "actor", 2: "reward", 3: "step3"}
-default_zero_stage = {
-    "single_node": {
-        "1.3b": 2,
-        "6.7b": 3,
-        "13b": 3
-    },
-    "multi_node": {
-        "66b": 3
-    },
-}
+dse_url = "https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/"
 
 
 def parse_args():
@@ -97,10 +88,10 @@ def parse_args():
         help="Directory for output of each step",
     )
     parser.add_argument(
-        "--num-gpus",
-        type=int,
-        default=1,
-        choices=(1, 8, 64),
+        "--deployment-type",
+        type=str,
+        default="single_gpu",
+        choices=("single_gpu", "single_node", "multi_node"),
         help="Number of GPUs to run the actor/reward models on",
     )
     args = parser.parse_args()
@@ -110,15 +101,6 @@ def parse_args():
             "Non-default zero stages may result in OOM errors or worse performance."
         )
 
-    if args.num_gpus == 1:
-        args.script_type = "single_gpu"
-    elif args.num_gpus == 8:
-        args.script_type = "single_node"
-    elif args.num_gpus == 64:
-        args.script_type = "multi_node"
-    else:
-        raise NotImplementedError(
-            f"{args.num_gpus} GPUs not supported by this script")
     return args
 
 
@@ -146,7 +128,7 @@ def get_script(args, step_num):
         os.getcwd(),
         step_dirs[step_num],
         "training_scripts",
-        args.script_type,
+        args.deployment_type,
         f"run_{model_size}.sh",
     )
     assert os.path.isfile(
@@ -184,13 +166,23 @@ def get_cmd(args, step_num):
     return cmd
 
 
-def launch_cmd(cmd, step_num):
+def launch_cmd(args, step_num, cmd):
     working_dir = step_dirs[step_num]
+    print(f"Running:\n{cmd}")
     p = subprocess.Popen(cmd, cwd=working_dir, shell=True)
     p.wait()
     if p.returncode != 0:
-        raise RuntimeError(
-            f"Step {step_num} exited with non-zero status {p.returncode}")
+        raise RuntimeError('\n\n'.join((
+            f"Step {step_num} exited with non-zero status {p.returncode}",
+            f"Launch command: {cmd}",
+            f"Log output: {os.path.join(get_output_dir(args, step_num), 'training.log')}",
+            f"Please see our tutorial at {dse_url}{step_dirs[step_num]}",
+            "Please check that you have installed our requirements: `pip install -r requirements.txt`",
+            f"If you are seeing an OOM error, try modifying {get_script(args, step_num)}:",
+            " - Reduce `--per_device_*_batch_size`",
+            " - Increase `--zero_stage {0,1,2,3}` on multi-gpu setups",
+            " - Enable `--gradient_checkpointing` or `--only_optimizer_lora`"
+        )))
 
 
 def main(args):
def main(args):
@@ -200,7 +192,7 @@ def main(args):
         step_start_time = time.time()
 
         cmd = get_cmd(args, step_num)
-        launch_cmd(cmd, step_num)
+        launch_cmd(args, step_num, cmd)
 
         step_time = int(time.time() - start_time)
         time_str = str(datetime.timedelta(seconds=step_time))
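
Taken together, the `train.py` hunks do three things: `--deployment-type` becomes a string flag restricted by `choices`, its value is used directly as a path component when locating the per-step training script, and the failure message is assembled by joining several hints with blank lines. The following standalone sketch shows that flow under simplifying assumptions. The `root` and `model_size` parameters and the function names `parse_deployment_type`, `script_path`, and `failure_message` are illustrative and do not appear in the commit, and `step_dirs` contains only the one entry visible in the diff.

```python
import argparse
import os

# Assumed stand-in for the step_dirs table in train.py; only the step 3
# entry is visible in the diff above.
step_dirs = {3: "training/step3_rlhf_finetuning"}


def parse_deployment_type(argv):
    """Parse only the flag introduced by this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--deployment-type",
        type=str,
        default="single_gpu",
        choices=("single_gpu", "single_node", "multi_node"),
    )
    return parser.parse_args(argv)


def script_path(args, step_num, model_size, root="."):
    """Build the script path with the same components get_script() joins.

    Assumption: `root` replaces os.getcwd() and `model_size` is passed in
    directly, so the sketch does not depend on the rest of train.py.
    """
    return os.path.join(
        root,
        step_dirs[step_num],
        "training_scripts",
        args.deployment_type,  # the flag value is the directory name
        f"run_{model_size}.sh",
    )


def failure_message(step_num, returncode, cmd, hints):
    """Join the status line, the command, and any hints with blank lines."""
    return "\n\n".join((
        f"Step {step_num} exited with non-zero status {returncode}",
        f"Launch command: {cmd}",
        *hints,
    ))


if __name__ == "__main__":
    args = parse_deployment_type(["--deployment-type", "single_node"])
    print(script_path(args, 3, "13b"))
    print(failure_message(3, 1, "bash run_13b.sh", ["Check requirements.txt"]))
```

Since the flag value doubles as the directory name, an invalid value is rejected by `argparse` at parse time, so it never reaches the path lookup.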
