Skip to content

DPL: enable shmem (O2-1434)#3667

Merged
ktf merged 2 commits into
AliceO2Group:devfrom
ktf:shmem-on
May 30, 2020
Merged

DPL: enable shmem (O2-1434)#3667
ktf merged 2 commits into
AliceO2Group:devfrom
ktf:shmem-on

Conversation

@ktf
Copy link
Copy Markdown
Member

@ktf ktf commented May 27, 2020

This enables shared memory when on the same host and makes sure that the --shm-segment-size option is the same and the largest specified value for every device. If no device requires a special size, the default of 2GB is used.

ktf added 2 commits May 27, 2020 23:41
* If no value is specified, use the default, i.e. 2GB.
* If one of the subworkflows in the pipe chain specifies a value,
  use that value for everyone.
* If more than one subworkflow in the pipe chain specifies a value,
  use the largest value for everyone.
@ktf ktf requested a review from a team as a code owner May 27, 2020 21:45
@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 27, 2020

@davidrohr this should give you the behaviour you requested. Can you please test it before I merge it?

@davidrohr
Copy link
Copy Markdown
Collaborator

OK, thanks, I will try it. Before we merge this, we should wait for #3552 and alisw/alidist#2262 to have proper buffer sizes requested, and propper error handling when the buffer size is exceeded.

info.cmdLineArgs.erase(it, it + 2);
}
if (segmentSize == 0) {
segmentSize = 2000000000LL;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll eventually want to back the shm segments with huge pages, where the size become important (multiple of page size). Perhaps this should be (2ULL << 30)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want GB huge pages or 2/4 MB? GB huge pages need to be allocated at boot time if I remember correctly?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 2M would be fine for us. Don't think there would be much benefit going to 1G. There are several reservation methods with different flexibility: kernel boot parameter, early during boot, etc. We'll have to see what works best for us.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that that's the current default in FairMQ, but I agree a 1ULL <<31 is probably better in any case. This could should actually be handled at FairMQ level, IMHO: always snap and align to page boundaries...

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 28, 2020

@ironMann do you understand why StfSender dies?

@ironMann
Copy link
Copy Markdown
Contributor

@kft @davidrohr AliceO2Group/DataDistribution@a18de74 ?

is this not needed here anymore?

@davidrohr
Copy link
Copy Markdown
Collaborator

@ironMann : Yes, I assume so. AMD ROCm is still building with CC7 devtoolset 7. Has this workaround been removed, or why is it failing now?

@ironMann
Copy link
Copy Markdown
Contributor

No it's still there. I think this problem spread to other binaries. It's weird, because the boost program_options is done in all binaries... Looks like linking of StfBuilder fails this time.
Can you rerun the builder with DataDistribution/dev ?

@davidrohr
Copy link
Copy Markdown
Collaborator

I cannot do it easily, perhaps @ktf can. But I am quite sure it is the same problem. So I'd just merge the change in DataDistribution and then retry here.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 28, 2020

I cannot do it easily because it would require doing so in some sort of alidist / gpu builder, which at the moment we do not have (that's why it trickled through the CI).

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 28, 2020

Anyways, I think it's unrelated to the PR and more to the 0.7.5 bump I did yesterday. @davidrohr can you confirm this behaves as expected for you?

@ironMann
Copy link
Copy Markdown
Contributor

Then each test would require alidist bump?
Can you instead build DataDistribution/dev in the O2-gpu builder?

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 28, 2020

Also, I suggest we do the change for 2ULL << 30 in a separate PR once FairMQ changes it's default, otherwise people might be puzzled that the help says one thing and then we do another.

@rbx any objection to change the default in FairMQ?

@davidrohr
Copy link
Copy Markdown
Collaborator

I can build it manually in the gpucibuilder container, but I'll have to build from scratch since I don't have a setup anymore and don't have much time now. I can do it over the weekend.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 28, 2020

@ironMann the correct thing would be to have a CI for AliceO2Group/DataDistribution which builds any PR entering there with the current master of alidist. I can set it up, but not in the next few hours.

@davidrohr
Copy link
Copy Markdown
Collaborator

Anyways, I think it's unrelated to the PR and more to the 0.7.5 bump I did yesterday. @davidrohr can you confirm this behaves as expected for you?

Which 0.7.5 bump? Will test it asap.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 28, 2020

Which 0.7.5 bump? Will test it asap.

alisw/alidist#2267

@ironMann
Copy link
Copy Markdown
Contributor

Which 0.7.5 bump? Will test it asap.

This already tried to build 0.7.5 of DataDistribution (in alidist). However the changes in between versions broke the gpu builder.

@ironMann
Copy link
Copy Markdown
Contributor

@ironMann the correct thing would be to have a CI for AliceO2Group/DataDistribution which builds any PR entering there with the current master of alidist. I can set it up, but not in the next few hours.

This would be cool. Let me know if you need anything from me.
But, also run the GPU builder for alidist PRs?

@davidrohr
Copy link
Copy Markdown
Collaborator

We anyway decided in WP3 to merge the builders for checkcode / gpu / and linux o2.
I'll take care of this as soon as AMD supports a proper compiler instead of devtoolset7.
They have been delaying this since March. Let's see...
Then we shouldn't need additional builder or extensions of the current GPU builder.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 28, 2020

GPU builder for alidist PRs

yeah, probably we should... The problem is that we are juggling too many builds at the moment and I need to do some reshuffling in the CI code before I am confident it will not break everything. For DataDistribution it's easy, because it's a small package at the tip of the dependency chain...

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 28, 2020

I will let you know if I need anything, but I should have superpowers.

@davidrohr
Copy link
Copy Markdown
Collaborator

OK, since this is urgent as O2 GPU CI is broken generally, I started a local build in the container now. Will report back in 2 hours when it is hopefully done.

@davidrohr
Copy link
Copy Markdown
Collaborator

@ktf : Building locally, it failed with arrow (see attached log).
I could fix it by adding an explicit ARROW_ROOT to alidist/o2.sh.
a.gz

@davidrohr
Copy link
Copy Markdown
Collaborator

@ironMann : The below patch fixes the DataDistribution compilation in the GPU CI for me (as expected). Could you add it to DataDistribution and bump it in alidist?

diff --git a/src/StfBuilder/runStfBuilderDevice.cxx b/src/StfBuilder/runStfBuilderDevice.cxx
index 623e6fd..916267f 100644
--- a/src/StfBuilder/runStfBuilderDevice.cxx
+++ b/src/StfBuilder/runStfBuilderDevice.cxx
@@ -22,6 +22,7 @@
 #include <runFairMQDevice.h>
 
 namespace bpo = boost::program_options;
+template class std::basic_string<char, std::char_traits<char>, std::allocator<char> >; // Workaround for bug in CC7 devtoolset7
 
 void addCustomOptions(bpo::options_description& options)
 {

@rbx
Copy link
Copy Markdown
Contributor

rbx commented May 28, 2020

@rbx any objection to change the default in FairMQ?

No objection, will change it to 2ULL << 30.

@ironMann
Copy link
Copy Markdown
Contributor

@davidrohr , I added the same for each executable. alisw/alidist#2273

@davidrohr
Copy link
Copy Markdown
Collaborator

@ktf : It seems applying the option works, but e.g.

o2-its-reco-workflow --disable-root-output | o2-tpc-reco-workflow --tpc-digit-reader '--infile tpcdigits.root' --input-type digits --output-type clusters,tracks,disable-writer --ca-clusterer | o2-tpcits-match-workflow --disable-root-input --disable-root-output | o2-tof-reco-workflow --disable-root-input --disable-root-output | o2-tpc-scdcalib-interpolation-workflow --disable-root-input --shm-segment-size 10000000

fails with

[ERROR] error while setting up workflow: Error while parsing serialised workflow

Shouldn't I get at some point an error that the memory size was exceeded?

@davidrohr
Copy link
Copy Markdown
Collaborator

@ktf : I tried to reduce the shared memory size because I wanted to see the memory allocation failing, but instead I god the message that the workflow failed to initialize. Dump attached.
dump.gz
log.gz

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 29, 2020

ah, ok.... I think it's unrelated from this change. You get that message when something goes wrong with the merging of the workflow. I will try to see if I can create a better error message.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 29, 2020

I suggest we nevertheless merge this, objections?

@davidrohr
Copy link
Copy Markdown
Collaborator

So this seems to work for me.
Can be merged after #3552 is merged, otherwise the TPC workflow fails at default settings.

Just one comment: If I set the shared segment size too small, the workflow fails to initialize. I get errors like [ERROR] Unable to pass configuration to children at startup. I assume this is because the shared memory is also used to pass the configuration? Perhaps it would be good to have an error indicating one runs out of memory instead of that generic error?

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 30, 2020

No, the configuration is actually passed via stdout / stdin between the parent process and the children. I suspect when the memory is too low some printout scrambles the json passed between the two. I need to check.

@ktf ktf changed the title DPL: enable shmem DPL: enable shmem (O2-1434) May 30, 2020
@davidrohr
Copy link
Copy Markdown
Collaborator

@ktf : The TPC change is done, you can merge this as soon as you want.

@ktf ktf merged commit 37a1ad6 into AliceO2Group:dev May 30, 2020
@ktf ktf deleted the shmem-on branch May 30, 2020 11:21
@shahor02
Copy link
Copy Markdown
Collaborator

With this PR I processingFcn of some devices not called.
For instance, o2-its-reco-workflow runs correctly digits reader, clusterizer, cluster-writer but not the tracker. I see its init() and endOfStream() called but not run(). No error message is printed.

If I revert a5815c3, everything works.

I've rebuild everything with fresh alidist, does not help.

Problem (ubuntu18.04) can be reproduced by

o2-sim -m PIPE ITS -n 10 -g pythia8
o2-sim-digitizer-workflow
o2-its-reco-workflow

with a5815c3 reverted there are lot of messages like

[16454:its-cooked-tracker]:  - Vertexer initialisation completed in: 0.004946 ms
[16454:its-cooked-tracker]:  - Tracklet finding completed in: 0.00351 ms
[16454:its-cooked-tracker]:  - Adjacent tracklets validation completed in: 0.000945 ms

and o2trac_its.root with a tree is written. W/o reverting there are no log and the output file is empty (created in init() and closed in endOfStream(), the

void CookedTrackerDPL::run(ProcessingContext& pc)
is not called.

@ktf any hint how to debug this?

@ktf
Copy link
Copy Markdown
Member Author

ktf commented May 31, 2020

I've some issues as well on macOS. I suspect it's a regression in FairMQ 1.4.18, but have no further clue, for now. I made #3680 to revert.

@davidrohr
Copy link
Copy Markdown
Collaborator

@ktf : I tried @shahor02 commands above and I can reproduce the issue he sees.
I went back to FairMQ 1.4.16, and also there I see the issue, so it is not a regression in 1.4.18.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 2, 2020

Ok, I will have a look.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 4, 2020

So, I can now confirm there were two problems:

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 4, 2020

I confirm that without transport=shmem, input.payload is non null.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 4, 2020

@matthiasrichter does it ring any bell?

@shahor02
Copy link
Copy Markdown
Collaborator

shahor02 commented Jun 4, 2020

@ktf yes, ITSMFT clusterized at the moment sends an empty vector ITS/CLUSTERS, this is a legacy feature which will be suppressed once MFT will be ready to use COMPCLUSTERS

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 4, 2020

ok, but why does "empty vector" mean nullptr in the shmem case?

@davidrohr
Copy link
Copy Markdown
Collaborator

Actually I remember there was a similar issue for the TPC. When there was no data in a sector, the event was lost before, and this was fixed by @matthiasrichter

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 4, 2020

Yes, I remember. Let me find matthias fix. Maybe it was in a separate part of the code.

@davidrohr
Copy link
Copy Markdown
Collaborator

* One is the one being discussed in https://alice.its.cern.ch/jira/browse/O2-1485. This was preventing me to reproduce the issue in a consistent manner. We therefore need [FairRootGroup/FairMQ#276](https://github.com/FairRootGroup/FairMQ/pull/276) or equivalent fix.

Is that only relevant for mac or also for linux?

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 4, 2020

I think it's relevant for both. What might mitigate it on linux is that maybe boost::hash(<uid> + dpl_<pid>) has more entropy due to the fact <uid> and <pid> are larger (e.g. on lxplus or simply because systemd uses more <pid>s). This might actually explain some random failures we have, as pids wrap around. Anyways, the best solution would be to have sha256 or some other cryptographic hash.

@davidrohr
Copy link
Copy Markdown
Collaborator

Anyways, the best solution would be to have sha256 or some other cryptographic hash.

Fully agree, so what prevents us from using sha2?

@matthiasrichter
Copy link
Copy Markdown
Collaborator

Indeed, its a similar problem like I have fixed for the InputRecord in #3578, we should have looked for other places immediately.

Empty vector means zero-length message, and such messages have been handled differently in FairMQ. While ZMQ transport provides a valid pointer and message length zero, the SHM transport returns nullptr for zero length messages.

@rbx
Copy link
Copy Markdown
Contributor

rbx commented Jun 4, 2020

I also feel crypto hash would be the best. My only hesitation is requiring openssl just for that. Another option is to use crc32, it's included in boost and gives different output for small/similar inputs (just tested on mac and linux). While not as robust as SHA, it may suffice for this use case. It is also short enough that it doesn't need the truncation.

@rbx
Copy link
Copy Markdown
Contributor

rbx commented Jun 4, 2020

Regarding nullptr, our intention is to change zmq transport to also return nullptr for empty messages.

@davidrohr
Copy link
Copy Markdown
Collaborator

Hm, but what exactly is the problem with openssl? I mean, it should be available everywhere, and for just SHA2, there should be no strict version requirement. On the other hand, for SHA2, we could also just use an ad-hoc implementation. Should not be that much code.

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 4, 2020

@ktf
Copy link
Copy Markdown
Member Author

ktf commented Jun 4, 2020

@matthiasrichter There is one more place where the payload is checked for nullptr. It should not matter (because it's in the timers path and timers always have a payload) however I will remove it, also in light of @rbx comment about the API evolution.

@rbx
Copy link
Copy Markdown
Contributor

rbx commented Jun 5, 2020

FairMQ v1.4.19 is now available with the fix for the shmem alignment (via MemoryResource), the nullptr return value for empty messages and the improved hashing algorithm (SHA2) to generate the shared memory id.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants