Traceback (most recent call last):
File "/workspace/bert/run_pretraining.py", line 678, in <module>
args, final_loss, train_time_raw, global_step = main()
File "/workspace/bert/run_pretraining.py", line 506, in main
model, optimizer, lr_scheduler, checkpoint, global_step, criterion = prepare_model_and_optimizer(args, device)
File "/workspace/bert/run_pretraining.py", line 409, in prepare_model_and_optimizer
optimizer.load_state_dict(checkpoint['optimizer']) # , strict=False)
File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 111, in load_state_dict
raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups
bash scripts/docker/build.sh
bash scripts/docker/launch.sh
cd /workspace && \
apt update && apt install -y build-essential devscripts debhelper fakeroot && \
apt purge -y libnccl2 libnccl-dev && \
cd /workspace && \
git clone https://github.com/NVIDIA/nccl.git && \
cd nccl/ && \
git fetch && \
git checkout v2.7.6-1 && \
make -j NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" pkg.debian.build && \
dpkg -i build/pkg/deb/*.deb && \
cd /workspace/bert
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_large_pyt_amp_ckpt_pretraining_lamb/versions/1/zip -O bert_large_pyt_amp_ckpt_pretraining_lamb_1.zip
#scripts/run_pretraining.sh
echo "Container nvidia build = " $NVIDIA_BUILD_ID
train_batch_size=${1:-8192}
learning_rate=${2:-"6e-3"}
precision=${3:-"fp16"}
num_gpus=${4:-16}
warmup_proportion=${5:-"0.2843"}
train_steps=${6:-7038}
save_checkpoint_steps=${7:-200}
resume_training=${8:-"true"}
create_logfile=${9:-"true"}
accumulate_gradients=${10:-"true"}
gradient_accumulation_steps=${11:-128}
seed=${12:-42}
job_name=${13:-"bert_lamb_pretraining"}
allreduce_post_accumulation=${14:-"true"}
allreduce_post_accumulation_fp16=${15:-"true"}
train_batch_size_phase2=${16:-4096}
learning_rate_phase2=${17:-"4e-3"}
warmup_proportion_phase2=${18:-"0.128"}
train_steps_phase2=${19:-1563}
gradient_accumulation_steps_phase2=${20:-512}
DATASET=hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en # change this for other datasets
DATA_DIR_PHASE1=${21:-$BERT_PREP_WORKING_DIR/${DATASET}/}
BERT_CONFIG=bert_config.json
DATASET2=hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en # change this for other datasets
DATA_DIR_PHASE2=${22:-$BERT_PREP_WORKING_DIR/${DATASET2}/}
CODEDIR=${23:-"/workspace/bert"}
init_checkpoint=${24:-"/workspace/bert/DLE_BERT_FP16_PyT_LAMB_92_hard_scaling_node.pt"}
RESULTS_DIR=$CODEDIR/results
CHECKPOINTS_DIR=$RESULTS_DIR/checkpoints
mkdir -p $CHECKPOINTS_DIR
if [ ! -d "$DATA_DIR_PHASE1" ] ; then
echo "Warning! $DATA_DIR_PHASE1 directory missing. Training cannot start"
fi
if [ ! -d "$RESULTS_DIR" ] ; then
echo "Error! $RESULTS_DIR directory missing."
exit -1
fi
if [ ! -d "$CHECKPOINTS_DIR" ] ; then
echo "Warning! $CHECKPOINTS_DIR directory missing."
echo "Checkpoints will be written to $RESULTS_DIR instead."
CHECKPOINTS_DIR=$RESULTS_DIR
fi
if [ ! -f "$BERT_CONFIG" ] ; then
echo "Error! BERT large configuration file not found at $BERT_CONFIG"
exit -1
fi
PREC=""
if [ "$precision" = "fp16" ] ; then
PREC="--fp16"
elif [ "$precision" = "fp32" ] ; then
PREC=""
elif [ "$precision" = "tf32" ] ; then
PREC=""
else
echo "Unknown <precision> argument"
exit -2
fi
ACCUMULATE_GRADIENTS=""
if [ "$accumulate_gradients" == "true" ] ; then
ACCUMULATE_GRADIENTS="--gradient_accumulation_steps=$gradient_accumulation_steps"
fi
CHECKPOINT=""
if [ "$resume_training" == "true" ] ; then
CHECKPOINT="--resume_from_checkpoint"
fi
ALL_REDUCE_POST_ACCUMULATION=""
if [ "$allreduce_post_accumulation" == "true" ] ; then
ALL_REDUCE_POST_ACCUMULATION="--allreduce_post_accumulation"
fi
ALL_REDUCE_POST_ACCUMULATION_FP16=""
if [ "$allreduce_post_accumulation_fp16" == "true" ] ; then
ALL_REDUCE_POST_ACCUMULATION_FP16="--allreduce_post_accumulation_fp16"
fi
INIT_CHECKPOINT=""
if [ "$init_checkpoint" != "None" ] ; then
INIT_CHECKPOINT="--init_checkpoint=$init_checkpoint"
fi
echo $DATA_DIR_PHASE1
INPUT_DIR=$DATA_DIR_PHASE1
CMD=" $CODEDIR/run_pretraining.py"
CMD+=" --input_dir=$DATA_DIR_PHASE1"
CMD+=" --output_dir=$CHECKPOINTS_DIR"
CMD+=" --config_file=$BERT_CONFIG"
CMD+=" --bert_model=bert-large-uncased"
CMD+=" --train_batch_size=$train_batch_size"
CMD+=" --max_seq_length=128"
CMD+=" --max_predictions_per_seq=20"
CMD+=" --max_steps=$train_steps"
CMD+=" --warmup_proportion=$warmup_proportion"
CMD+=" --num_steps_per_checkpoint=$save_checkpoint_steps"
CMD+=" --learning_rate=$learning_rate"
CMD+=" --seed=$seed"
CMD+=" $PREC"
CMD+=" $ACCUMULATE_GRADIENTS"
CMD+=" $CHECKPOINT"
CMD+=" $ALL_REDUCE_POST_ACCUMULATION"
CMD+=" $ALL_REDUCE_POST_ACCUMULATION_FP16"
CMD+=" $INIT_CHECKPOINT"
CMD+=" --do_train"
CMD+=" --json-summary ${RESULTS_DIR}/dllogger.json "
CMD="python3 -m torch.distributed.launch --nproc_per_node=$num_gpus $CMD"
if [ "$create_logfile" = "true" ] ; then
export GBS=$(expr $train_batch_size \* $num_gpus)
printf -v TAG "pyt_bert_pretraining_phase1_%s_gbs%d" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULTS_DIR/$job_name.$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee $LOGFILE
fi
set +x
echo "finished pretraining"
#Start Phase2
PREC=""
if [ "$precision" = "fp16" ] ; then
PREC="--fp16"
elif [ "$precision" = "fp32" ] ; then
PREC=""
elif [ "$precision" = "tf32" ] ; then
PREC=""
else
echo "Unknown <precision> argument"
exit -2
fi
ACCUMULATE_GRADIENTS=""
if [ "$accumulate_gradients" == "true" ] ; then
ACCUMULATE_GRADIENTS="--gradient_accumulation_steps=$gradient_accumulation_steps_phase2"
fi
ALL_REDUCE_POST_ACCUMULATION=""
if [ "$allreduce_post_accumulation" == "true" ] ; then
ALL_REDUCE_POST_ACCUMULATION="--allreduce_post_accumulation"
fi
ALL_REDUCE_POST_ACCUMULATION_FP16=""
if [ "$allreduce_post_accumulation_fp16" == "true" ] ; then
ALL_REDUCE_POST_ACCUMULATION_FP16="--allreduce_post_accumulation_fp16"
fi
echo $DATA_DIR_PHASE2
INPUT_DIR=$DATA_DIR_PHASE2
CMD=" $CODEDIR/run_pretraining.py"
CMD+=" --input_dir=$DATA_DIR_PHASE2"
CMD+=" --output_dir=$CHECKPOINTS_DIR"
CMD+=" --config_file=$BERT_CONFIG"
CMD+=" --bert_model=bert-large-uncased"
CMD+=" --train_batch_size=$train_batch_size_phase2"
CMD+=" --max_seq_length=512"
CMD+=" --max_predictions_per_seq=80"
CMD+=" --max_steps=$train_steps_phase2"
CMD+=" --warmup_proportion=$warmup_proportion_phase2"
CMD+=" --num_steps_per_checkpoint=$save_checkpoint_steps"
CMD+=" --learning_rate=$learning_rate_phase2"
CMD+=" --seed=$seed"
CMD+=" $PREC"
CMD+=" $ACCUMULATE_GRADIENTS"
CMD+=" $CHECKPOINT"
CMD+=" $ALL_REDUCE_POST_ACCUMULATION"
CMD+=" $ALL_REDUCE_POST_ACCUMULATION_FP16"
CMD+=" --do_train --phase2 --resume_from_checkpoint --phase1_end_step=$train_steps"
CMD+=" --json-summary ${RESULTS_DIR}/dllogger.json "
CMD="python3 -m torch.distributed.launch --nproc_per_node=$num_gpus $CMD"
if [ "$create_logfile" = "true" ] ; then
export GBS=$(expr $train_batch_size_phase2 \* $num_gpus)
printf -v TAG "pyt_bert_pretraining_phase2_%s_gbs%d" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULTS_DIR/$job_name.$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee $LOGFILE
fi
set +x
echo "finished phase2"
bash scripts/run_pretraining.sh
root@anonymous:/workspace/bert# bash scripts/run_pretraining.sh
Container nvidia build = 13419386
/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/
Logs written to /workspace/bert/results/bert_lamb_pretraining.pyt_bert_pretraining_phase1_fp16_gbs131072.200814191701.log
+ '[' -z /workspace/bert/results/bert_lamb_pretraining.pyt_bert_pretraining_phase1_fp16_gbs131072.200814191701.log ']'
+ tee /workspace/bert/results/bert_lamb_pretraining.pyt_bert_pretraining_phase1_fp16_gbs131072.200814191701.log
+ python3 -m torch.distributed.launch --nproc_per_node=16 /workspace/bert/run_pretraining.py --input_dir=/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/ --output_dir=/workspace/bert/results/checkpoints --config_file=bert_config.json --bert_model=bert-large-uncased --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=6e-3 --seed=42 --fp16 --gradient_accumulation_steps=128 --resume_from_checkpoint --allreduce_post_accumulation --allreduce_post_accumulation_fp16 --init_checkpoint=/workspace/bert/DLE_BERT_FP16_PyT_LAMB_92_hard_scaling_node.pt --do_train --json-summary /workspace/bert/results/dllogger.json
device: cuda:9 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:10 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:15 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:12 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:11 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:4 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:6 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:14 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:13 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:5 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:3 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:7 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:8 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
DLL 2020-08-14 19:17:19.159906 - PARAMETER Config : ["Namespace(allreduce_post_accumulation=True, allreduce_post_accumulation_fp16=True, amp=False, bert_model='bert-large-uncased', checkpoint_activations=False, config_file='bert_config.json', disable_progress_bar=False, do_train=True, fp16=True, gradient_accumulation_steps=128, init_checkpoint='/workspace/bert/DLE_BERT_FP16_PyT_LAMB_92_hard_scaling_node.pt', init_loss_scale=1048576, input_dir='/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/', json_summary='/workspace/bert/results/dllogger.json', learning_rate=0.006, local_rank=0, log_freq=1.0, loss_scale=0.0, max_predictions_per_seq=20, max_seq_length=128, max_steps=7038.0, n_gpu=1, num_steps_per_checkpoint=200, num_train_epochs=3.0, output_dir='/workspace/bert/results/checkpoints', phase1_end_step=7038, phase2=False, resume_from_checkpoint=True, resume_step=-1, seed=42, skip_checkpoint=False, steps_this_run=7038.0, train_batch_size=64, use_env=False, warmup_proportion=0.2843)"]
resume step from -1
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Traceback (most recent call last):
File "/workspace/bert/run_pretraining.py", line 678, in <module>
args, final_loss, train_time_raw, global_step = main()
File "/workspace/bert/run_pretraining.py", line 506, in main
model, optimizer, lr_scheduler, checkpoint, global_step, criterion = prepare_model_and_optimizer(args, device)
File "/workspace/bert/run_pretraining.py", line 409, in prepare_model_and_optimizer
optimizer.load_state_dict(checkpoint['optimizer']) # , strict=False)
File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 111, in load_state_dict
raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups
Related to Model/Framework(s)
Describe the bug
Loading one of the pre-trained models from NGC results in
ValueError: loaded state dict has a different number of parameter groups
when doing BERT pre-training. I'm currently running on A100 and will update with details for V100.
Pre-trained model being used: https://ngc.nvidia.com/catalog/models/nvidia:bert_large_pyt_amp_ckpt_pretraining_lamb
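For anyone trying to narrow this down, here is a minimal sketch of how the mismatch could be inspected. It assumes the checkpoint stores the optimizer state under the 'optimizer' key (as the checkpoint['optimizer'] call in the traceback suggests) and that both state dicts follow the usual torch.optim layout with a param_groups list. The helper names group_sizes and compare_param_groups are hypothetical and not part of run_pretraining.py, and the toy dictionaries only stand in for the real state dicts.

```python
def group_sizes(state_dict):
    """Number of parameters in each param group of an optimizer state dict."""
    return [len(group["params"]) for group in state_dict["param_groups"]]


def compare_param_groups(saved_state, current_state):
    """Describe mismatches between a saved and a freshly built optimizer state dict."""
    saved = group_sizes(saved_state)
    current = group_sizes(current_state)
    problems = []
    if len(saved) != len(current):
        problems.append(
            "group count differs: checkpoint has {}, optimizer has {}".format(
                len(saved), len(current)
            )
        )
    # zip stops at the shorter list, so only groups present on both sides are compared.
    for index, (n_saved, n_current) in enumerate(zip(saved, current)):
        if n_saved != n_current:
            problems.append(
                "group {} size differs: checkpoint has {}, optimizer has {}".format(
                    index, n_saved, n_current
                )
            )
    return problems


# Toy stand-ins (assumption). With the real run these would be something like
#   saved_state = torch.load(path, map_location="cpu")["optimizer"]
#   current_state = optimizer.state_dict()
saved_state = {"param_groups": [{"params": [0, 1, 2]}]}
current_state = {"param_groups": [{"params": [0, 1]}, {"params": [2]}]}

for problem in compare_param_groups(saved_state, current_state):
    print(problem)
```

An empty list would mean the group layout matches and the cause lies elsewhere. A non-empty list shows how the checkpoint's grouping differs from the one the script builds.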
To Reproduce
Data downloaded and pre-processed using create_datasets_from_start.sh.
(Optional) When running on A100 instance on GCP, update NCCL:
scripts/run_pretraining.sh:
Results:
Expected behavior
Should just work.
Environment
Please provide at least: