DPL: enable shmem (O2-1434) by ktf · Pull Request #3667 · AliceO2Group/AliceO2

ktf · 2020-05-27T21:45:23Z

This enables shared memory when on the same host and makes sure that the --shm-segment-size option is the same and the largest specified value for every device. If no device requires a special size, the default of 2GB is used.

* If no value is specified, use the default, i.e. 2GB. * If one of the subworkflows in the pipe chain specifies a value, use that value for everyone. * If more than one subworkflow in the pipe chain specifies a value, use the largest value for everyone.

ktf · 2020-05-27T21:46:25Z

@davidrohr this should give you the behaviour you requested. Can you please test it before I merge it?

davidrohr · 2020-05-28T07:25:08Z

OK, thanks, I will try it. Before we merge this, we should wait for #3552 and alisw/alidist#2262 to have proper buffer sizes requested, and propper error handling when the buffer size is exceeded.

ironMann · 2020-05-28T08:01:37Z

+    info.cmdLineArgs.erase(it, it + 2);
+  }
+  if (segmentSize == 0) {
+    segmentSize = 2000000000LL;


We'll eventually want to back the shm segments with huge pages, where the size become important (multiple of page size). Perhaps this should be (2ULL << 30)

Do we want GB huge pages or 2/4 MB? GB huge pages need to be allocated at boot time if I remember correctly?

I think 2M would be fine for us. Don't think there would be much benefit going to 1G. There are several reservation methods with different flexibility: kernel boot parameter, early during boot, etc. We'll have to see what works best for us.

Notice that that's the current default in FairMQ, but I agree a 1ULL <<31 is probably better in any case. This could should actually be handled at FairMQ level, IMHO: always snap and align to page boundaries...

ktf · 2020-05-28T08:30:54Z

@ironMann do you understand why StfSender dies?

ironMann · 2020-05-28T08:35:59Z

@kft @davidrohr AliceO2Group/DataDistribution@a18de74 ?

is this not needed here anymore?

davidrohr · 2020-05-28T08:40:28Z

@ironMann : Yes, I assume so. AMD ROCm is still building with CC7 devtoolset 7. Has this workaround been removed, or why is it failing now?

ironMann · 2020-05-28T08:55:41Z

No it's still there. I think this problem spread to other binaries. It's weird, because the boost program_options is done in all binaries... Looks like linking of StfBuilder fails this time.
Can you rerun the builder with DataDistribution/dev ?

davidrohr · 2020-05-28T08:57:52Z

I cannot do it easily, perhaps @ktf can. But I am quite sure it is the same problem. So I'd just merge the change in DataDistribution and then retry here.

ktf · 2020-05-28T09:13:14Z

I cannot do it easily because it would require doing so in some sort of alidist / gpu builder, which at the moment we do not have (that's why it trickled through the CI).

ktf · 2020-05-28T09:16:37Z

Anyways, I think it's unrelated to the PR and more to the 0.7.5 bump I did yesterday. @davidrohr can you confirm this behaves as expected for you?

ironMann · 2020-05-28T09:19:42Z

Then each test would require alidist bump?
Can you instead build DataDistribution/dev in the O2-gpu builder?

ktf · 2020-05-28T09:20:14Z

Also, I suggest we do the change for 2ULL << 30 in a separate PR once FairMQ changes it's default, otherwise people might be puzzled that the help says one thing and then we do another.

@rbx any objection to change the default in FairMQ?

davidrohr · 2020-05-28T09:21:33Z

I can build it manually in the gpucibuilder container, but I'll have to build from scratch since I don't have a setup anymore and don't have much time now. I can do it over the weekend.

ktf · 2020-05-28T09:22:58Z

@ironMann the correct thing would be to have a CI for AliceO2Group/DataDistribution which builds any PR entering there with the current master of alidist. I can set it up, but not in the next few hours.

davidrohr · 2020-05-28T09:23:33Z

Anyways, I think it's unrelated to the PR and more to the 0.7.5 bump I did yesterday. @davidrohr can you confirm this behaves as expected for you?

Which 0.7.5 bump? Will test it asap.

ktf · 2020-05-28T09:24:41Z

Which 0.7.5 bump? Will test it asap.

alisw/alidist#2267

ironMann · 2020-05-28T09:25:28Z

Which 0.7.5 bump? Will test it asap.

This already tried to build 0.7.5 of DataDistribution (in alidist). However the changes in between versions broke the gpu builder.

ironMann · 2020-05-28T09:29:23Z

@ironMann the correct thing would be to have a CI for AliceO2Group/DataDistribution which builds any PR entering there with the current master of alidist. I can set it up, but not in the next few hours.

This would be cool. Let me know if you need anything from me.
But, also run the GPU builder for alidist PRs?

davidrohr · 2020-05-28T09:31:51Z

We anyway decided in WP3 to merge the builders for checkcode / gpu / and linux o2.
I'll take care of this as soon as AMD supports a proper compiler instead of devtoolset7.
They have been delaying this since March. Let's see...
Then we shouldn't need additional builder or extensions of the current GPU builder.

ktf · 2020-05-28T09:32:31Z

GPU builder for alidist PRs

yeah, probably we should... The problem is that we are juggling too many builds at the moment and I need to do some reshuffling in the CI code before I am confident it will not break everything. For DataDistribution it's easy, because it's a small package at the tip of the dependency chain...

ktf · 2020-05-28T09:32:51Z

I will let you know if I need anything, but I should have superpowers.

davidrohr · 2020-05-28T09:37:14Z

OK, since this is urgent as O2 GPU CI is broken generally, I started a local build in the container now. Will report back in 2 hours when it is hopefully done.

davidrohr · 2020-05-28T12:22:19Z

@ktf : Building locally, it failed with arrow (see attached log).
I could fix it by adding an explicit ARROW_ROOT to alidist/o2.sh.
a.gz

davidrohr · 2020-05-28T12:35:12Z

@ironMann : The below patch fixes the DataDistribution compilation in the GPU CI for me (as expected). Could you add it to DataDistribution and bump it in alidist?

diff --git a/src/StfBuilder/runStfBuilderDevice.cxx b/src/StfBuilder/runStfBuilderDevice.cxx
index 623e6fd..916267f 100644
--- a/src/StfBuilder/runStfBuilderDevice.cxx
+++ b/src/StfBuilder/runStfBuilderDevice.cxx
@@ -22,6 +22,7 @@
 #include <runFairMQDevice.h>
 
 namespace bpo = boost::program_options;
+template class std::basic_string<char, std::char_traits<char>, std::allocator<char> >; // Workaround for bug in CC7 devtoolset7
 
 void addCustomOptions(bpo::options_description& options)
 {

rbx · 2020-05-28T13:09:06Z

@rbx any objection to change the default in FairMQ?

No objection, will change it to 2ULL << 30.

ironMann · 2020-05-28T16:51:57Z

@davidrohr , I added the same for each executable. alisw/alidist#2273

davidrohr · 2020-05-28T19:40:28Z

@ktf : It seems applying the option works, but e.g.

o2-its-reco-workflow --disable-root-output | o2-tpc-reco-workflow --tpc-digit-reader '--infile tpcdigits.root' --input-type digits --output-type clusters,tracks,disable-writer --ca-clusterer | o2-tpcits-match-workflow --disable-root-input --disable-root-output | o2-tof-reco-workflow --disable-root-input --disable-root-output | o2-tpc-scdcalib-interpolation-workflow --disable-root-input --shm-segment-size 10000000

fails with

[ERROR] error while setting up workflow: Error while parsing serialised workflow

Shouldn't I get at some point an error that the memory size was exceeded?

davidrohr · 2020-05-29T05:04:34Z

@ktf : I tried to reduce the shared memory size because I wanted to see the memory allocation failing, but instead I god the message that the workflow failed to initialize. Dump attached.
dump.gz
log.gz

ktf · 2020-05-29T06:21:31Z

ah, ok.... I think it's unrelated from this change. You get that message when something goes wrong with the merging of the workflow. I will try to see if I can create a better error message.

ktf · 2020-05-29T11:59:56Z

I suggest we nevertheless merge this, objections?

davidrohr · 2020-05-30T09:29:39Z

So this seems to work for me.
Can be merged after #3552 is merged, otherwise the TPC workflow fails at default settings.

Just one comment: If I set the shared segment size too small, the workflow fails to initialize. I get errors like [ERROR] Unable to pass configuration to children at startup. I assume this is because the shared memory is also used to pass the configuration? Perhaps it would be good to have an error indicating one runs out of memory instead of that generic error?

ktf · 2020-05-30T09:54:56Z

No, the configuration is actually passed via stdout / stdin between the parent process and the children. I suspect when the memory is too low some printout scrambles the json passed between the two. I need to check.

davidrohr · 2020-05-30T11:20:29Z

@ktf : The TPC change is done, you can merge this as soon as you want.

shahor02 · 2020-05-31T00:13:29Z

With this PR I processingFcn of some devices not called.
For instance, o2-its-reco-workflow runs correctly digits reader, clusterizer, cluster-writer but not the tracker. I see its init() and endOfStream() called but not run(). No error message is printed.

If I revert a5815c3, everything works.

I've rebuild everything with fresh alidist, does not help.

Problem (ubuntu18.04) can be reproduced by

o2-sim -m PIPE ITS -n 10 -g pythia8
o2-sim-digitizer-workflow
o2-its-reco-workflow

with a5815c3 reverted there are lot of messages like

[16454:its-cooked-tracker]:  - Vertexer initialisation completed in: 0.004946 ms
[16454:its-cooked-tracker]:  - Tracklet finding completed in: 0.00351 ms
[16454:its-cooked-tracker]:  - Adjacent tracklets validation completed in: 0.000945 ms

and o2trac_its.root with a tree is written. W/o reverting there are no log and the output file is empty (created in init() and closed in endOfStream(), the

AliceO2/Detectors/ITSMFT/ITS/workflow/src/CookedTrackerSpec.cxx

Line 90 in ba09524

void CookedTrackerDPL::run(ProcessingContext& pc)

is not called.

@ktf any hint how to debug this?

ktf · 2020-05-31T01:05:20Z

I've some issues as well on macOS. I suspect it's a regression in FairMQ 1.4.18, but have no further clue, for now. I made #3680 to revert.

davidrohr · 2020-06-02T13:32:57Z

@ktf : I tried @shahor02 commands above and I can reproduce the issue he sees.
I went back to FairMQ 1.4.16, and also there I see the issue, so it is not a regression in 1.4.18.

ktf · 2020-06-02T13:56:13Z

Ok, I will have a look.

ktf · 2020-06-04T13:32:31Z

So, I can now confirm there were two problems:

One is the one being discussed in https://alice.its.cern.ch/jira/browse/O2-1485. This was preventing me to reproduce the issue in a consistent manner. We therefore need Try to improve session name hashing FairRootGroup/FairMQ#276 or equivalent fix.
The one which affects this PR seems to be that in the forwarding the payload of ITS/CLUSTERS is nullptr when using shared memory (this is in DataProcessingDevice.cxx:forwardInputs(...)) and the check for nullptr silently skips it. Notice that the input.header is correct.

ktf · 2020-06-04T13:35:24Z

I confirm that without transport=shmem, input.payload is non null.

ktf · 2020-06-04T13:35:58Z

@matthiasrichter does it ring any bell?

shahor02 · 2020-06-04T13:46:03Z

@ktf yes, ITSMFT clusterized at the moment sends an empty vector ITS/CLUSTERS, this is a legacy feature which will be suppressed once MFT will be ready to use COMPCLUSTERS

ktf · 2020-06-04T13:47:11Z

ok, but why does "empty vector" mean nullptr in the shmem case?

davidrohr · 2020-06-04T13:49:44Z

Actually I remember there was a similar issue for the TPC. When there was no data in a sector, the event was lost before, and this was fixed by @matthiasrichter

ktf · 2020-06-04T13:50:15Z

Yes, I remember. Let me find matthias fix. Maybe it was in a separate part of the code.

davidrohr · 2020-06-04T14:02:37Z

* One is the one being discussed in https://alice.its.cern.ch/jira/browse/O2-1485. This was preventing me to reproduce the issue in a consistent manner. We therefore need [FairRootGroup/FairMQ#276](https://github.com/FairRootGroup/FairMQ/pull/276) or equivalent fix.

Is that only relevant for mac or also for linux?

ktf · 2020-06-04T14:25:25Z

I think it's relevant for both. What might mitigate it on linux is that maybe boost::hash(<uid> + dpl_<pid>) has more entropy due to the fact <uid> and <pid> are larger (e.g. on lxplus or simply because systemd uses more <pid>s). This might actually explain some random failures we have, as pids wrap around. Anyways, the best solution would be to have sha256 or some other cryptographic hash.

davidrohr · 2020-06-04T14:30:47Z

Anyways, the best solution would be to have sha256 or some other cryptographic hash.

Fully agree, so what prevents us from using sha2?

matthiasrichter · 2020-06-04T14:57:16Z

Indeed, its a similar problem like I have fixed for the InputRecord in #3578, we should have looked for other places immediately.

Empty vector means zero-length message, and such messages have been handled differently in FairMQ. While ZMQ transport provides a valid pointer and message length zero, the SHM transport returns nullptr for zero length messages.

rbx · 2020-06-04T15:09:21Z

I also feel crypto hash would be the best. My only hesitation is requiring openssl just for that. Another option is to use crc32, it's included in boost and gives different output for small/similar inputs (just tested on mac and linux). While not as robust as SHA, it may suffice for this use case. It is also short enough that it doesn't need the truncation.

rbx · 2020-06-04T15:10:00Z

Regarding nullptr, our intention is to change zmq transport to also return nullptr for empty messages.

davidrohr · 2020-06-04T15:41:47Z

Hm, but what exactly is the problem with openssl? I mean, it should be available everywhere, and for just SHA2, there should be no strict version requirement. On the other hand, for SHA2, we could also just use an ad-hoc implementation. Should not be that much code.

ktf · 2020-06-04T17:36:30Z

https://github.com/martynafford/sha-2/blob/master/src/sha2.hpp actually...

ktf · 2020-06-04T17:39:04Z

@matthiasrichter There is one more place where the payload is checked for nullptr. It should not matter (because it's in the timers path and timers always have a payload) however I will remove it, also in light of @rbx comment about the API evolution.

rbx · 2020-06-05T16:27:31Z

FairMQ v1.4.19 is now available with the fix for the shmem alignment (via MemoryResource), the nullptr return value for empty messages and the improved hashing algorithm (SHA2) to generate the shared memory id.

ktf added 2 commits May 27, 2020 23:41

DPL: make shmem default transport when running on same node

a5815c3

ktf requested a review from a team as a code owner May 27, 2020 21:45

ironMann reviewed May 28, 2020

View reviewed changes

ktf changed the title ~~DPL: enable shmem~~ DPL: enable shmem (O2-1434) May 30, 2020

ktf merged commit 37a1ad6 into AliceO2Group:dev May 30, 2020

ktf deleted the shmem-on branch May 30, 2020 11:21

Conversation

ktf commented May 27, 2020

Uh oh!

ktf commented May 27, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

ironMann May 28, 2020

Choose a reason for hiding this comment

Uh oh!

davidrohr May 28, 2020

Choose a reason for hiding this comment

Uh oh!

ironMann May 28, 2020

Choose a reason for hiding this comment

Uh oh!

ktf May 28, 2020

Choose a reason for hiding this comment

Uh oh!

ktf commented May 28, 2020

Uh oh!

ironMann commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

ironMann commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

ktf commented May 28, 2020

Uh oh!

ktf commented May 28, 2020

Uh oh!

ironMann commented May 28, 2020

Uh oh!

ktf commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

ktf commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

ktf commented May 28, 2020

Uh oh!

ironMann commented May 28, 2020

Uh oh!

ironMann commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

ktf commented May 28, 2020

Uh oh!

ktf commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

rbx commented May 28, 2020

Uh oh!

ironMann commented May 28, 2020

Uh oh!

davidrohr commented May 28, 2020

Uh oh!

davidrohr commented May 29, 2020

Uh oh!

ktf commented May 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ktf commented May 29, 2020

Uh oh!

davidrohr commented May 30, 2020

Uh oh!

ktf commented May 30, 2020

Uh oh!

davidrohr commented May 30, 2020

Uh oh!

ktf commented May 29, 2020 •

edited

Loading

ktf commented Jun 4, 2020 •

edited

Loading