[BERT/PyTorch] ValueError: loaded state dict has a different number of parameter groups

Related to **Model/Framework(s)**: BERT/PyTorch

Loading one of the pre-trained checkpoints from NGC fails with `ValueError: loaded state dict has a different number of parameter groups` during BERT pre-training. I'm currently running on A100 and will update with details for V100.

**Describe the bug**

Pre-trained model being used: https://ngc.nvidia.com/catalog/models/nvidia:bert_large_pyt_amp_ckpt_pretraining_lamb

```
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 678, in <module>
    args, final_loss, train_time_raw, global_step = main()
  File "/workspace/bert/run_pretraining.py", line 506, in main
    model, optimizer, lr_scheduler, checkpoint, global_step, criterion = prepare_model_and_optimizer(args, device)
  File "/workspace/bert/run_pretraining.py", line 409, in prepare_model_and_optimizer
    optimizer.load_state_dict(checkpoint['optimizer'])  # , strict=False)
  File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 111, in load_state_dict
    raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups
```
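The message comes from the optimizer's `load_state_dict`, which compares the parameter groups of the freshly built optimizer with those stored in the checkpoint. To narrow down where the two diverge, something like the following could be used. It is a minimal sketch that works on plain state dicts; the helper names `summarize_param_groups` and `explain_mismatch` are hypothetical and not part of the repository.

```python
def summarize_param_groups(optimizer_state):
    """Return the number of parameters in each param group of an optimizer state dict."""
    return [len(group["params"]) for group in optimizer_state["param_groups"]]


def explain_mismatch(current_state, saved_state):
    """Describe the first difference between two optimizer state dicts, or return None."""
    current = summarize_param_groups(current_state)
    saved = summarize_param_groups(saved_state)
    # A differing group count is the condition reported in the traceback above.
    if len(current) != len(saved):
        return (f"group count differs: optimizer has {len(current)}, "
                f"checkpoint has {len(saved)}")
    # Same number of groups, but a group may still hold a different number of params.
    for index, (n_current, n_saved) in enumerate(zip(current, saved)):
        if n_current != n_saved:
            return (f"group {index} size differs: optimizer has {n_current} params, "
                    f"checkpoint has {n_saved}")
    return None
```

With PyTorch, `current_state` would be `optimizer.state_dict()` and `saved_state` would be `torch.load(path, map_location="cpu")["optimizer"]`, where `"optimizer"` is the key that appears in the traceback.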


**To Reproduce**

1. Build and launch Docker:

```
bash scripts/docker/build.sh
bash scripts/docker/launch.sh
```

2. Download and pre-process the data using `create_datasets_from_start.sh`.

3. (Optional) When running on an A100 instance on GCP, update NCCL:

```
cd /workspace && \
  apt update && apt install -y build-essential devscripts debhelper fakeroot && \
  apt purge -y libnccl2 libnccl-dev && \
  cd /workspace && \
  git clone https://github.com/NVIDIA/nccl.git && \
  cd nccl/ && \
  git fetch && \
  git checkout v2.7.6-1 && \
  make -j NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" pkg.debian.build && \
  dpkg -i build/pkg/deb/*.deb && \
  cd /workspace/bert
```

4. Download pre-trained checkpoint:

```
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_large_pyt_amp_ckpt_pretraining_lamb/versions/1/zip -O bert_large_pyt_amp_ckpt_pretraining_lamb_1.zip
```

5. Modify `scripts/run_pretraining.sh`:

```
#scripts/run_pretraining.sh
echo "Container nvidia build = " $NVIDIA_BUILD_ID
train_batch_size=${1:-8192}
learning_rate=${2:-"6e-3"}
precision=${3:-"fp16"}
num_gpus=${4:-16}
warmup_proportion=${5:-"0.2843"}
train_steps=${6:-7038}
save_checkpoint_steps=${7:-200}
resume_training=${8:-"true"}
create_logfile=${9:-"true"}
accumulate_gradients=${10:-"true"}
gradient_accumulation_steps=${11:-128}
seed=${12:-42}
job_name=${13:-"bert_lamb_pretraining"}
allreduce_post_accumulation=${14:-"true"}
allreduce_post_accumulation_fp16=${15:-"true"}
train_batch_size_phase2=${16:-4096}
learning_rate_phase2=${17:-"4e-3"}
warmup_proportion_phase2=${18:-"0.128"}
train_steps_phase2=${19:-1563}
gradient_accumulation_steps_phase2=${20:-512}
DATASET=hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en  # change this for other datasets
DATA_DIR_PHASE1=${21:-$BERT_PREP_WORKING_DIR/${DATASET}/}
BERT_CONFIG=bert_config.json
DATASET2=hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en  # change this for other datasets
DATA_DIR_PHASE2=${22:-$BERT_PREP_WORKING_DIR/${DATASET2}/}
CODEDIR=${23:-"/workspace/bert"}
init_checkpoint=${24:-"/workspace/bert/DLE_BERT_FP16_PyT_LAMB_92_hard_scaling_node.pt"}
RESULTS_DIR=$CODEDIR/results
CHECKPOINTS_DIR=$RESULTS_DIR/checkpoints

mkdir -p $CHECKPOINTS_DIR


if [ ! -d "$DATA_DIR_PHASE1" ] ; then
   echo "Warning! $DATA_DIR_PHASE1 directory missing. Training cannot start"
fi
if [ ! -d "$RESULTS_DIR" ] ; then
   echo "Error! $RESULTS_DIR directory missing."
   exit -1
fi
if [ ! -d "$CHECKPOINTS_DIR" ] ; then
   echo "Warning! $CHECKPOINTS_DIR directory missing."
   echo "Checkpoints will be written to $RESULTS_DIR instead."
   CHECKPOINTS_DIR=$RESULTS_DIR
fi
if [ ! -f "$BERT_CONFIG" ] ; then
   echo "Error! BERT large configuration file not found at $BERT_CONFIG"
   exit -1
fi

PREC=""
if [ "$precision" = "fp16" ] ; then
   PREC="--fp16"
elif [ "$precision" = "fp32" ] ; then
   PREC=""
elif [ "$precision" = "tf32" ] ; then
   PREC=""
else
   echo "Unknown <precision> argument"
   exit -2
fi

ACCUMULATE_GRADIENTS=""
if [ "$accumulate_gradients" == "true" ] ; then
   ACCUMULATE_GRADIENTS="--gradient_accumulation_steps=$gradient_accumulation_steps"
fi

CHECKPOINT=""
if [ "$resume_training" == "true" ] ; then
   CHECKPOINT="--resume_from_checkpoint"
fi

ALL_REDUCE_POST_ACCUMULATION=""
if [ "$allreduce_post_accumulation" == "true" ] ; then
   ALL_REDUCE_POST_ACCUMULATION="--allreduce_post_accumulation"
fi

ALL_REDUCE_POST_ACCUMULATION_FP16=""
if [ "$allreduce_post_accumulation_fp16" == "true" ] ; then
   ALL_REDUCE_POST_ACCUMULATION_FP16="--allreduce_post_accumulation_fp16"
fi

INIT_CHECKPOINT=""
if [ "$init_checkpoint" != "None" ] ; then
   INIT_CHECKPOINT="--init_checkpoint=$init_checkpoint"
fi

echo $DATA_DIR_PHASE1
INPUT_DIR=$DATA_DIR_PHASE1
CMD=" $CODEDIR/run_pretraining.py"
CMD+=" --input_dir=$DATA_DIR_PHASE1"
CMD+=" --output_dir=$CHECKPOINTS_DIR"
CMD+=" --config_file=$BERT_CONFIG"
CMD+=" --bert_model=bert-large-uncased"
CMD+=" --train_batch_size=$train_batch_size"
CMD+=" --max_seq_length=128"
CMD+=" --max_predictions_per_seq=20"
CMD+=" --max_steps=$train_steps"
CMD+=" --warmup_proportion=$warmup_proportion"
CMD+=" --num_steps_per_checkpoint=$save_checkpoint_steps"
CMD+=" --learning_rate=$learning_rate"
CMD+=" --seed=$seed"
CMD+=" $PREC"
CMD+=" $ACCUMULATE_GRADIENTS"
CMD+=" $CHECKPOINT"
CMD+=" $ALL_REDUCE_POST_ACCUMULATION"
CMD+=" $ALL_REDUCE_POST_ACCUMULATION_FP16"
CMD+=" $INIT_CHECKPOINT"
CMD+=" --do_train"
CMD+=" --json-summary ${RESULTS_DIR}/dllogger.json "

CMD="python3 -m torch.distributed.launch --nproc_per_node=$num_gpus $CMD"


if [ "$create_logfile" = "true" ] ; then
  export GBS=$(expr $train_batch_size \* $num_gpus)
  printf -v TAG "pyt_bert_pretraining_phase1_%s_gbs%d" "$precision" $GBS
  DATESTAMP=`date +'%y%m%d%H%M%S'`
  LOGFILE=$RESULTS_DIR/$job_name.$TAG.$DATESTAMP.log
  printf "Logs written to %s\n" "$LOGFILE"
fi

set -x
if [ -z "$LOGFILE" ] ; then
   $CMD
else
   (
     $CMD
   ) |& tee $LOGFILE
fi

set +x

echo "finished pretraining"

#Start Phase2

PREC=""
if [ "$precision" = "fp16" ] ; then
   PREC="--fp16"
elif [ "$precision" = "fp32" ] ; then
   PREC=""
elif [ "$precision" = "tf32" ] ; then
   PREC=""
else
   echo "Unknown <precision> argument"
   exit -2
fi

ACCUMULATE_GRADIENTS=""
if [ "$accumulate_gradients" == "true" ] ; then
   ACCUMULATE_GRADIENTS="--gradient_accumulation_steps=$gradient_accumulation_steps_phase2"
fi

ALL_REDUCE_POST_ACCUMULATION=""
if [ "$allreduce_post_accumulation" == "true" ] ; then
   ALL_REDUCE_POST_ACCUMULATION="--allreduce_post_accumulation"
fi

ALL_REDUCE_POST_ACCUMULATION_FP16=""
if [ "$allreduce_post_accumulation_fp16" == "true" ] ; then
   ALL_REDUCE_POST_ACCUMULATION_FP16="--allreduce_post_accumulation_fp16"
fi

echo $DATA_DIR_PHASE2
INPUT_DIR=$DATA_DIR_PHASE2
CMD=" $CODEDIR/run_pretraining.py"
CMD+=" --input_dir=$DATA_DIR_PHASE2"
CMD+=" --output_dir=$CHECKPOINTS_DIR"
CMD+=" --config_file=$BERT_CONFIG"
CMD+=" --bert_model=bert-large-uncased"
CMD+=" --train_batch_size=$train_batch_size_phase2"
CMD+=" --max_seq_length=512"
CMD+=" --max_predictions_per_seq=80"
CMD+=" --max_steps=$train_steps_phase2"
CMD+=" --warmup_proportion=$warmup_proportion_phase2"
CMD+=" --num_steps_per_checkpoint=$save_checkpoint_steps"
CMD+=" --learning_rate=$learning_rate_phase2"
CMD+=" --seed=$seed"
CMD+=" $PREC"
CMD+=" $ACCUMULATE_GRADIENTS"
CMD+=" $CHECKPOINT"
CMD+=" $ALL_REDUCE_POST_ACCUMULATION"
CMD+=" $ALL_REDUCE_POST_ACCUMULATION_FP16"
CMD+=" --do_train --phase2 --resume_from_checkpoint --phase1_end_step=$train_steps"
CMD+=" --json-summary ${RESULTS_DIR}/dllogger.json "

CMD="python3 -m torch.distributed.launch --nproc_per_node=$num_gpus $CMD"

if [ "$create_logfile" = "true" ] ; then
  export GBS=$(expr $train_batch_size_phase2 \* $num_gpus)
  printf -v TAG "pyt_bert_pretraining_phase2_%s_gbs%d" "$precision" $GBS
  DATESTAMP=`date +'%y%m%d%H%M%S'`
  LOGFILE=$RESULTS_DIR/$job_name.$TAG.$DATESTAMP.log
  printf "Logs written to %s\n" "$LOGFILE"
fi

set -x
if [ -z "$LOGFILE" ] ; then
   $CMD
else
   (
     $CMD
   ) |& tee $LOGFILE
fi

set +x

echo "finished phase2"
```

6. Kick off pre-training script:

```
bash scripts/run_pretraining.sh
```
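Note that step 4 downloads a zip archive, while `init_checkpoint` in step 5 points at a `.pt` file, so the archive has to be unpacked in between. A minimal sketch of that step, assuming the archive contains the `.pt` checkpoint directly; the function name `extract_checkpoints` is hypothetical:

```python
import zipfile
from pathlib import Path


def extract_checkpoints(archive_path, dest_dir):
    """Extract every .pt member of the archive into dest_dir and return the paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # Only checkpoint files are of interest; other members are skipped.
            if name.endswith(".pt"):
                extracted.append(Path(archive.extract(name, path=dest)))
    return extracted
```

For the setup above this would be called as `extract_checkpoints("bert_large_pyt_amp_ckpt_pretraining_lamb_1.zip", "/workspace/bert")`, after which the returned path can be compared with the `init_checkpoint` default in the script.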

Results:

```
root@anonymous:/workspace/bert# bash scripts/run_pretraining.sh 
Container nvidia build =  13419386
/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/
Logs written to /workspace/bert/results/bert_lamb_pretraining.pyt_bert_pretraining_phase1_fp16_gbs131072.200814191701.log
+ '[' -z /workspace/bert/results/bert_lamb_pretraining.pyt_bert_pretraining_phase1_fp16_gbs131072.200814191701.log ']'
+ tee /workspace/bert/results/bert_lamb_pretraining.pyt_bert_pretraining_phase1_fp16_gbs131072.200814191701.log
+ python3 -m torch.distributed.launch --nproc_per_node=16 /workspace/bert/run_pretraining.py --input_dir=/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/ --output_dir=/workspace/bert/results/checkpoints --config_file=bert_config.json --bert_model=bert-large-uncased --train_batch_size=8192 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=7038 --warmup_proportion=0.2843 --num_steps_per_checkpoint=200 --learning_rate=6e-3 --seed=42 --fp16 --gradient_accumulation_steps=128 --resume_from_checkpoint --allreduce_post_accumulation --allreduce_post_accumulation_fp16 --init_checkpoint=/workspace/bert/DLE_BERT_FP16_PyT_LAMB_92_hard_scaling_node.pt --do_train --json-summary /workspace/bert/results/dllogger.json
device: cuda:9 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:10 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:15 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:12 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:11 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:1 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:4 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:2 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:6 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:14 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:13 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:5 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:3 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:7 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:8 n_gpu: 1, distributed training: True, 16-bits training: True
device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: True
DLL 2020-08-14 19:17:19.159906 - PARAMETER Config : ["Namespace(allreduce_post_accumulation=True, allreduce_post_accumulation_fp16=True, amp=False, bert_model='bert-large-uncased', checkpoint_activations=False, config_file='bert_config.json', disable_progress_bar=False, do_train=True, fp16=True, gradient_accumulation_steps=128, init_checkpoint='/workspace/bert/DLE_BERT_FP16_PyT_LAMB_92_hard_scaling_node.pt', init_loss_scale=1048576, input_dir='/workspace/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/', json_summary='/workspace/bert/results/dllogger.json', learning_rate=0.006, local_rank=0, log_freq=1.0, loss_scale=0.0, max_predictions_per_seq=20, max_seq_length=128, max_steps=7038.0, n_gpu=1, num_steps_per_checkpoint=200, num_train_epochs=3.0, output_dir='/workspace/bert/results/checkpoints', phase1_end_step=7038, phase2=False, resume_from_checkpoint=True, resume_step=-1, seed=42, skip_checkpoint=False, steps_this_run=7038.0, train_batch_size=64, use_env=False, warmup_proportion=0.2843)"] 
resume step from  -1
Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 678, in <module>
    args, final_loss, train_time_raw, global_step = main()
  File "/workspace/bert/run_pretraining.py", line 506, in main
    model, optimizer, lr_scheduler, checkpoint, global_step, criterion = prepare_model_and_optimizer(args, device)
  File "/workspace/bert/run_pretraining.py", line 409, in prepare_model_and_optimizer
    optimizer.load_state_dict(checkpoint['optimizer'])  # , strict=False)
  File "/opt/conda/lib/python3.6/site-packages/torch/optim/optimizer.py", line 111, in load_state_dict
    raise ValueError("loaded state dict has a different number of "
ValueError: loaded state dict has a different number of parameter groups
```
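The parsed arguments in the log show `train_batch_size=64` even though `--train_batch_size=8192` was passed, and the log file name carries `gbs131072`. Both values are consistent with the arithmetic below. That `run_pretraining.py` divides the batch size by the accumulation steps is an assumption inferred from the log, and the helper names are hypothetical.

```python
def micro_batch_size(train_batch_size, gradient_accumulation_steps):
    """Per-GPU batch processed in one forward/backward pass (assumed from the log)."""
    if train_batch_size % gradient_accumulation_steps != 0:
        raise ValueError("train_batch_size must be divisible by gradient_accumulation_steps")
    return train_batch_size // gradient_accumulation_steps


def global_batch_size(train_batch_size, num_gpus):
    """Value the launch script computes as GBS for the log file name."""
    return train_batch_size * num_gpus


# Phase 1 defaults from scripts/run_pretraining.sh
print(micro_batch_size(8192, 128))   # 64, matches train_batch_size in the Namespace
print(global_batch_size(8192, 16))   # 131072, matches gbs131072 in the log name
```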

**Expected behavior**
The optimizer state from the NGC checkpoint should load without error and pre-training should start.

**Environment**
* Container version: nvcr.io/nvidia/pytorch-py3:20.06
* GPUs in the system: 16x A100 on GCP
* CUDA driver version: 450.51.05

