Save the checkpoint folder and meta only on the main process#2132
Save the checkpoint folder and meta only on the main process#2132mravanelli merged 11 commits intospeechbrain:developfrom
Conversation
…unicate to all procs
| ) | ||
|
|
||
| # Communicate ckpt_dir to all procs | ||
| torch.distributed.broadcast_object_list([ckpt_dir], src=0) |
There was a problem hiding this comment.
I'm a little unsure if this works when using multiple machines, perhaps we would need to use the concept of process groups and only communicate within the group somehow.
|
One note: with DDP we also have some little issues with our Loggers e.g. wandb. Each process is running the init of the logger while it should only be on the main process. CC @Chaanks |
For any methods/functions that should only run on main, we have a decorator: |
Gastron
left a comment
There was a problem hiding this comment.
Could you document the way this works in the docstring for save_checkpoint (or for the whole class). Other than that, looks good to me.
As discussed privately, we should look into running multi-process tests. Perhaps there could multi-process unittests for classess like the Checkpointer.
|
I attempted to write a multi-process test but found it difficult due to the way my computer is set up. I'm wondering if it may be hard to write a general enough multi-process test that works on most machines. Anyone else have ideas about how this could be done? |
|
Looks good to me, I recommend we merging soon as this is an important fix. Thanks for looking into multi-process tests! I guess we should take a look at what others are doing, perhaps someone has writetn something on multi-process unittests that we should read. Let's leave that as a separate quest. This is marked as a draft though, if there is something you wish to change go ahead, but otherwise just merge I guess. |
Adel-Moumen
left a comment
There was a problem hiding this comment.
Sounds good to me. Is it ready to be tested or merged?
Unfortunately I don't have a great way right now to test this. If you're able, could you try it and see if it works for you? |
|
I think we need to test it before merging it. Do you think we can create
some unittest?) as well?
…On Thu, Aug 17, 2023, 8:01 AM Peter Plantinga ***@***.***> wrote:
Sounds good to me. Is it ready to be tested or merged?
Unfortunately I don't have a great way right now to test this. If you're
able, could you try it and see if it works for you?
—
Reply to this email directly, view it on GitHub
<#2132 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEA2ZVUKVA5NH72KOIBU3PLXVYBZFANCNFSM6AAAAAA3RFUOWY>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
|
The trouble I ran into with trying to run DDP on my machine (which may or may not also be the case in the GitHub test environment) was that to initialize DDP you have to specify a web address (e.g. "localhost") and a port (e.g. 12355), but these were blocked on my computer. I could add the test I mocked up and just see if it works without testing locally. |
|
that makes sense.
…On Thu, Aug 17, 2023, 8:16 AM Peter Plantinga ***@***.***> wrote:
The trouble I ran into with trying to run DDP on my machine (which may or
may not also be the case in the GitHub test environment) was that to
initialize DDP you have to specify a web address (e.g. "localhost") and a
port (e.g. 12355), but these were blocked on my computer. I could add the
test I mocked up and just see if it works without testing locally.
—
Reply to this email directly, view it on GitHub
<#2132 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEA2ZVWSQZKOTR3UV4Q7EM3XVYDRPANCNFSM6AAAAAA3RFUOWY>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
|
Is it normal for this test to hang sometimes? or perhaps is this the last one to complete before the one that hangs? |
|
I re-ran the tests, and unfortunately, the same issue persists. I'm uncertain about the cause (I can attempt to reproduce it on my local environment as well). It might be worthwhile to investigate whether this issue exists in the latest development branch as well. There's a possibility that the problem comes from updates in certain dependencies. On another note, I'd like to discuss the possibility of upgrading the Python versions we test against. Currently, we're testing on relatively outdated Python versions. Could we consider expanding our testing to include Python 3.9, 3.10, and 3.11? This would help ensure compatibility with more modern Python environments. |
|
I tried with a local environment with python 3.8 and it looks like the tests are passing. I'm not sure why that unittest hanging here. My suggestion is to change the CI such that we test with python 3.9-3.11 (I don't remember what we need to change). Then, we can merge this PR. |
|
ok, that's annoying. It looks like now everything gets stuck with python 3.10. This should be something related to this PR as the tests on dev branch are passing. Any idea? |
|
Wondering if it is related to this: pytorch/pytorch#75097 If so, I need to investigate if its okay for us to just leave off this part. |
|
Interesting. That looks related. Curiously, on my local computer I'm able to run all the tests successfully. |
I think this is due to all 3 versions sharing the same port. I've edited the DDP to use file-based synchronization instead, which should avoid resource contention. |
|
Tests are passing, so I guess this is good to go. The error did indeed seem to be related to |
|
Thank you @pplantinga. It looks like now all the tests pass. I think we can merge it! |
PR #2059 has an issue where checkpoint folders are created on all processes, this PR fixes the issue by saving them only on the main process.