Updated format for transport of TPCNativeCluster by matthiasrichter · Pull Request #3663 · AliceO2Group/AliceO2

matthiasrichter · 2020-05-27T10:05:41Z

This PR is sketching the work for moving to an updated transport format as it is produced by the CAclusterer. The format consists of a header with the index of cluster counts for the full TPC and one monolithic buffer for all cluster objects. The cluster offsets per {sector,row} coordinates corresponds to the accumulated cluster counts. The preliminary version seems to work when MC labels are disabled.

The main changes are:

the ClusterDecoderRaw directly writes the updated format
ClusterNativeHelper can read the updated format
handling of the MC labels, this is a bit convoluted with multiple copies, needs to be sorted out
set ClusterNativeAccess from monolithic buffer without copy
publish the CA clusterer output in the updated format
test with subsequent workflows (ITS-TPC matching, TPC interpolation)
make the writing and reading of ClusterNative files working for all cases

matthiasrichter · 2020-05-27T11:13:08Z

I found a bug in the copying of the MCLabels on the sender side, but seems there is one more problem reading the labels.

matthiasrichter · 2020-05-28T13:54:43Z

Just updated the PR with a first working version. Have tested it including the ITTPS matching

o2-its-reco-workflow --disable-root-output | o2-tpc-reco-workflow --infile tpcdigits.root --input-type digits --output-type clusters,tracks,disable-writer --ca-clusterer | o2-tpcits-match-workflow --disable-root-input

Also without the ca-clusterer, the number of matches is slightly different then.

The ca clusters are copied to the output message right now, this can be changed in a follow up taking the relevant code from #3552 . No copy in the subscribing processes if the ca-clusterer is used.

I'm starting now to clean-up. Have to check how to get the writing and reading of the new transport format can be handled with the RootFileWriter. Changing the configuration of the writer is probably trivial, the reader needs to be adjusted to support reading from both sector branches and from the single branch.

shahor02 · 2020-05-28T15:44:14Z

@matthiasrichter: a small difference with or w/o --ca-clusterer is expected, sine the both the TPC clusters and tracks are different. On the level of matching in 10 pp events I see 62 vs 61 matches difference, I think it is ok.

But, when running TPC reco with MC ON, I see an error in the RootTreeWriter:

o2-tpc-reco-workflow  --tpc-digit-reader '--infile tpcdigits.root' --input-type digits --output-type tracks,clusters,encoded-clusters --tpc-track-writer "--treename events --track-branch-name Tracks"
...
[29673:tpc-native-cluster-writer]: [METRIC] data_relayer/1140,0 0 1590676271135 hostname=alicers02,name=o2-tpc-reco-workError in <TMessage::ReadClass>: The on-file class is "'o2::dataformats::MCTruthContainer<o2::MCCompLabel>" which is not compatible with the requested class: "vector<o2::dataformats::MCTruthContainer<o2::MCCompLabel> >"
[29673:tpc-native-cluster-writer]: Error in <TMessage::ReadObject>: got object of wrong class! requested vector<o2::dataformats::MCTruthContainer<o2::MCCompLabel> > but got o2::dataformats::MCTruthContainer<o2::MCCompLabel>
....

Apart from that, cluster loading in the TPCInterpolationSpec.cxx needs the same changes as in the TPCITSMatchingSpec, right now it produces

[3758:tpc-track-interpolation]: [17:25:00][ERROR] Exception caught: Expecting data for single sectors

when running with the raw input (with TPC reco using --ca-clusterer)

and

[15694:tpc-track-interpolation]: [17:39:32][ERROR] Exception caught: Did not receive TPC clusters data for all sectors

when running with ca-clusterer from digits.

o2-its-reco-workflow  --disable-root-output --disable-mc | o2-tpc-reco-workflow  --ca-clusterer --tpc-digit-reader '--infile tpcdigits.root' --input-type digits --output-type clusters,tracks,disable-writer  --disable-mc| o2-tpcits-match-workflow  --disable-root-input --disable-root-output  --disable-mc| o2-tof-reco-workflow --disable-root-input --disable-root-output  --disable-mc| o2-tpc-scdcalib-interpolation-workflow --disable-root-input

Doing the same with external clusterer:

o2-its-reco-workflow  --disable-root-output | o2-tpc-reco-workflow  --tpc-digit-reader '--infile tpcdigits.root' --input-type digits --output-type clusters,tracks,disable-writer | o2-tpcits-match-workflow  --disable-root-input --disable-root-output | o2-tof-reco-workflow --disable-root-input --disable-root-output | o2-tpc-scdcalib-interpolation-workflow --disable-root-input

works fine

matthiasrichter · 2020-05-28T18:45:29Z

thanks, yes, only the TPCITS matching was expected to work, I was not clear in my post.
The errors you have seen are simply related to the things not yet implemented. Doing this now.

matthiasrichter · 2020-05-29T05:22:31Z

Hi @shahor02 , @davidrohr , next iteration is pushed, now the writing/reading from/to file is tested with the TPC reco workflow, e.g. native cluster file produced with and without CA clusterer as input to the CA tracker. Also the TPC interpolation workflow is adjusted. Please check with the reader in the TPCITS matching and TPC interpolation workflows.

davidrohr · 2020-05-29T05:40:11Z

I had only a quick look. Most cases seem to work. I am not so sure about the below one, seem the track interpolation doesn't run in this case:

o2-its-reco-workflow --disable-root-output | o2-tpc-reco-workflow --tpc-digit-reader '--infile tpcdigits.root' --ca-clusterer --input-type digits --output-type clusters,tracks | o2-tpcits-match-workflow --disable-root-input --disable-root-output | o2-tof-reco-workflow --disable-root-input --disable-root-output | o2-tpc-scdcalib-interpolation-workflow --disable-root-input

Will have a closer look later today.

shahor02 · 2020-05-29T11:16:51Z

For me all modes work, including the one mentioned by @davidrohr (both with and w/o disable-writer in the TPC w/flow).

…kingWorkflows In the future, we can drop the logic for checking valid and active sectors, this is done by the completion policy. We keep it as an extra check during the heavy development phase.

matthiasrichter · 2020-05-29T11:54:15Z

Thanks @shahor02 for checking. The PR is now cleaned up, so we can merge when the CI is ready.

The format is produced by the CA clusterer and consists of an index header ClusterCountIndex with cluster counts per {sector,row} for the full TPC followed by a linear buffer of ClusterNative objects. The offsets within the linear buffer are derived from the cluster counts. The MCLabelContainer corresponds to the sequence of clusters. Main changes: - Reworking HardwareClusterDecoder to use cluster index - Adjusting ClusterNativeHelper to read updated ClusterNative transport format Further notes: Still the helper can read the cluster index from a list of inputs, but if only one input is provided (data and optional MC labels), the index is set directly from the single buffer without copy. If multiple inputs are available a copy is needed to create linear buffers for clusters and labels In the updated format, a full sector or even the full TPC is sent so we do not need the vector of MCLabelContainers to transport all the MCLabelContainers for the padrows

Sending out the updated transport format which will allow to transport the data between devices without copy. For the moment, the CA tracker copies the data into the output message, but the GPU worker will be changed to write directly to the provided buffer. Further improvements to the input handling in the CA tracker spec: - Get rid of the fixed arrays for collecting the input objects - Adjusting cluster completeness check in the TPC CA tracker to use the sector bitmask from the sector header The check can be dropped in the future as it is the task of the completion policy to provide complete data sets. We keep it as a check for a while during the hot development phase.

With the updated transport format for ClusterNative data, the full TPC data is shipped in one block and also written to a single branch. Adjusting writer for native clusters to use one single MCLabelContainer Only one block of clusters and labels is produced by the CA clusterer and the writer is configured accordingly. Added a runtime check to the PublisherSpec to detect whether a fill is using Full or Sector mode. If only one branch of the configured branch name is found without the sector number suffix, the publisher switches to SectorMode::Full. We still need to define all output routes because the dynamic adjustment is at a later stage than the definition of routes. The full block will be published on one route, while zero-length data is created on all the other routes. By this we ensure that a completion policy waiting for input on all data can fire.

Allocating the buffer for cluster output before calling the GPU worker. It can be filled directly from the worker instead of copying after the worker.

matthiasrichter · 2020-05-29T13:49:41Z

+      // a byte size resizable vector object, the DataAllocator returns reference to internal object
+      // initialize optional pointer to the vector object
+      using ClusterOutputChunkType = std::decay_t<decltype(pc.outputs().make<std::vector<char>>(Output{"", "", 0}))>;
+      ClusterOutputChunkType* clusterOutput = nullptr;


@davidrohr , providing the buffer for cluster output now before calling the GPU worker, so you can use it later to directly write to the vector. It has zero length now, although created there.

davidrohr

Tried all combinations I could think of and in the newest version they seem to work.

matthiasrichter · 2020-05-29T16:17:28Z

trying to find out what's the problem with the checkcode CI but I can't find a hint that the problem is related.

shahor02 · 2020-05-29T17:07:12Z

For some reason it looks for non-existent directory, but not clear why (perhaps @aphecetche can suggest):

CMake Error at Detectors/MUON/MCH/Tracking/cmake_install.cmake:64 (file):
  file INSTALL cannot find
  "/mnt/mesos/sandbox/sandbox/sw/SOURCES/O2/build_o2checkcode_o2/0/Detectors/MUON/MCH/Tracking/include/MCHTracking":
  No such file or directory.
Call Stack (most recent call first):
  Detectors/MUON/MCH/cmake_install.cmake:67 (include)
  Detectors/MUON/cmake_install.cmake:42 (include)
  Detectors/cmake_install.cmake:102 (include)
  cmake_install.cmake:81 (include)


gmake: *** [install] Error 1

aphecetche · 2020-05-29T17:28:05Z

@ktf looks like another PR interference ? with #3648 (that did move those includes)

rpezzi · 2020-05-30T00:40:19Z

@ktf looks like another PR interference ? with #3648 (that did move those includes)

o2checkcode is misbehaving on #3648 as well. I am pretty sure that o2checkcode completed OK earlier today after a new push. It was a dummy commit created with the sole intent of triggering checks. However, it is now showing as failed with signs of a very broken build, as this PR does.

matthiasrichter requested review from davidrohr, shahor02 and wiechula as code owners May 27, 2020 10:05

matthiasrichter force-pushed the dev-clusternative-format branch 2 times, most recently from 957ff9b to 90ead9c Compare May 28, 2020 13:42

matthiasrichter force-pushed the dev-clusternative-format branch from 90ead9c to f6a1539 Compare May 28, 2020 23:28

shahor02 previously approved these changes May 29, 2020

View reviewed changes

matthiasrichter added 2 commits May 29, 2020 13:23

Adjusting the TPC sector completion policy to use the bit mask

e5d7c17

Adjusting the completeness check for TPC cluster inputs in GlobalTrac…

fb2ac21

…kingWorkflows In the future, we can drop the logic for checking valid and active sectors, this is done by the completion policy. We keep it as an extra check during the heavy development phase.

matthiasrichter dismissed shahor02’s stale review via a74e400 May 29, 2020 11:48

matthiasrichter force-pushed the dev-clusternative-format branch from f6a1539 to a74e400 Compare May 29, 2020 11:48

matthiasrichter changed the title ~~WIP: Updated format for transport of TPCNativeCluster~~ Updated format for transport of TPCNativeCluster May 29, 2020

matthiasrichter added 3 commits May 29, 2020 15:31

matthiasrichter force-pushed the dev-clusternative-format branch from a74e400 to 13bc183 Compare May 29, 2020 13:31

Rearranging the allocation of the cluster output buffer

3ddf80b

Allocating the buffer for cluster output before calling the GPU worker. It can be filled directly from the worker instead of copying after the worker.

matthiasrichter commented May 29, 2020

View reviewed changes

davidrohr approved these changes May 29, 2020

View reviewed changes

davidrohr merged commit 7f6eb41 into AliceO2Group:dev May 30, 2020

matthiasrichter deleted the dev-clusternative-format branch June 2, 2020 11:21

Conversation

matthiasrichter commented May 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

matthiasrichter commented May 27, 2020

Uh oh!

matthiasrichter commented May 28, 2020

Uh oh!

shahor02 commented May 28, 2020

Uh oh!

matthiasrichter commented May 28, 2020

Uh oh!

matthiasrichter commented May 29, 2020

Uh oh!

davidrohr commented May 29, 2020

Uh oh!

shahor02 commented May 29, 2020

Uh oh!

matthiasrichter commented May 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

matthiasrichter May 29, 2020

Choose a reason for hiding this comment

Uh oh!

davidrohr left a comment

Choose a reason for hiding this comment

Uh oh!

matthiasrichter commented May 29, 2020

Uh oh!

shahor02 commented May 29, 2020

Uh oh!

aphecetche commented May 29, 2020

Uh oh!

rpezzi commented May 30, 2020

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

matthiasrichter commented May 27, 2020 •

edited

Loading

matthiasrichter commented May 29, 2020 •

edited

Loading