Skip to content

Save the checkpoint folder and meta only on the main process#2132

Merged
mravanelli merged 11 commits intospeechbrain:developfrom
pplantinga:bugfix/checkpoint-folder-on-main
Aug 18, 2023
Merged

Save the checkpoint folder and meta only on the main process#2132
mravanelli merged 11 commits intospeechbrain:developfrom
pplantinga:bugfix/checkpoint-folder-on-main

Conversation

@pplantinga
Copy link
Copy Markdown
Collaborator

PR #2059 has an issue where checkpoint folders are created on all processes, this PR fixes the issue by saving them only on the main process.

@pplantinga pplantinga added the bug Something isn't working label Aug 15, 2023
@pplantinga pplantinga self-assigned this Aug 15, 2023
Comment thread speechbrain/utils/checkpoints.py Outdated
)

# Communicate ckpt_dir to all procs
torch.distributed.broadcast_object_list([ckpt_dir], src=0)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little unsure if this works when using multiple machines, perhaps we would need to use the concept of process groups and only communicate within the group somehow.

Comment thread speechbrain/utils/checkpoints.py
@Adel-Moumen
Copy link
Copy Markdown
Collaborator

One note: with DDP we also have some little issues with our Loggers e.g. wandb. Each process is running the init of the logger while it should only be on the main process. CC @Chaanks

@pplantinga
Copy link
Copy Markdown
Collaborator Author

One note: with DDP we also have some little issues with our Loggers e.g. wandb. Each process is running the init of the logger while it should only be on the main process.

For any methods/functions that should only run on main, we have a decorator: @main_process_only that can be added. We could open a new PR to fix the loggers if that's the issue.

Copy link
Copy Markdown
Collaborator

@Gastron Gastron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you document the way this works in the docstring for save_checkpoint (or for the whole class). Other than that, looks good to me.

As discussed privately, we should look into running multi-process tests. Perhaps there could multi-process unittests for classess like the Checkpointer.

Comment thread speechbrain/utils/checkpoints.py
@pplantinga
Copy link
Copy Markdown
Collaborator Author

I attempted to write a multi-process test but found it difficult due to the way my computer is set up. I'm wondering if it may be hard to write a general enough multi-process test that works on most machines. Anyone else have ideas about how this could be done?

@Gastron
Copy link
Copy Markdown
Collaborator

Gastron commented Aug 17, 2023

Looks good to me, I recommend we merging soon as this is an important fix.

Thanks for looking into multi-process tests! I guess we should take a look at what others are doing, perhaps someone has writetn something on multi-process unittests that we should read. Let's leave that as a separate quest.

This is marked as a draft though, if there is something you wish to change go ahead, but otherwise just merge I guess.

Copy link
Copy Markdown
Collaborator

@Adel-Moumen Adel-Moumen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Is it ready to be tested or merged?

@pplantinga
Copy link
Copy Markdown
Collaborator Author

Sounds good to me. Is it ready to be tested or merged?

Unfortunately I don't have a great way right now to test this. If you're able, could you try it and see if it works for you?

@mravanelli
Copy link
Copy Markdown
Collaborator

mravanelli commented Aug 17, 2023 via email

@pplantinga
Copy link
Copy Markdown
Collaborator Author

The trouble I ran into with trying to run DDP on my machine (which may or may not also be the case in the GitHub test environment) was that to initialize DDP you have to specify a web address (e.g. "localhost") and a port (e.g. 12355), but these were blocked on my computer. I could add the test I mocked up and just see if it works without testing locally.

@mravanelli
Copy link
Copy Markdown
Collaborator

mravanelli commented Aug 17, 2023 via email

Comment thread tests/unittests/test_checkpoints.py Outdated
Comment thread speechbrain/utils/checkpoints.py Outdated
Comment thread speechbrain/utils/checkpoints.py Outdated
@pplantinga pplantinga marked this pull request as ready for review August 17, 2023 15:13
@pplantinga
Copy link
Copy Markdown
Collaborator Author

pplantinga commented Aug 17, 2023

Is it normal for this test to hang sometimes? test_categorical_encoder.py

or perhaps is this the last one to complete before the one that hangs?

@mravanelli
Copy link
Copy Markdown
Collaborator

I re-ran the tests, and unfortunately, the same issue persists. I'm uncertain about the cause (I can attempt to reproduce it on my local environment as well). It might be worthwhile to investigate whether this issue exists in the latest development branch as well. There's a possibility that the problem comes from updates in certain dependencies.

On another note, I'd like to discuss the possibility of upgrading the Python versions we test against. Currently, we're testing on relatively outdated Python versions. Could we consider expanding our testing to include Python 3.9, 3.10, and 3.11? This would help ensure compatibility with more modern Python environments.

@mravanelli
Copy link
Copy Markdown
Collaborator

I tried with a local environment with python 3.8 and it looks like the tests are passing. I'm not sure why that unittest hanging here. My suggestion is to change the CI such that we test with python 3.9-3.11 (I don't remember what we need to change). Then, we can merge this PR.

@mravanelli
Copy link
Copy Markdown
Collaborator

ok, that's annoying. It looks like now everything gets stuck with python 3.10. This should be something related to this PR as the tests on dev branch are passing. Any idea?

@pplantinga
Copy link
Copy Markdown
Collaborator Author

Wondering if it is related to this: pytorch/pytorch#75097

If so, I need to investigate if its okay for us to just leave off this part.

@mravanelli
Copy link
Copy Markdown
Collaborator

Interesting. That looks related. Curiously, on my local computer I'm able to run all the tests successfully.

Comment thread tests/unittests/test_checkpoints.py Outdated
Comment thread tests/unittests/test_checkpoints.py Outdated
@pplantinga
Copy link
Copy Markdown
Collaborator Author

Interesting. That looks related. Curiously, on my local computer I'm able to run all the tests successfully.

I think this is due to all 3 versions sharing the same port. I've edited the DDP to use file-based synchronization instead, which should avoid resource contention.

Comment thread tests/unittests/test_checkpoints.py Outdated
@pplantinga
Copy link
Copy Markdown
Collaborator Author

Tests are passing, so I guess this is good to go. The error did indeed seem to be related to destroy_process_group() which might be necessary for some cases but doesn't seem necessary here when the machine is shutting down after tests are done.

@mravanelli mravanelli self-requested a review August 18, 2023 22:44
@mravanelli
Copy link
Copy Markdown
Collaborator

Thank you @pplantinga. It looks like now all the tests pass. I think we can merge it!

@mravanelli mravanelli merged commit 815a025 into speechbrain:develop Aug 18, 2023
@pplantinga pplantinga deleted the bugfix/checkpoint-folder-on-main branch August 19, 2023 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants