Skip to content

Updated format for transport of TPCNativeCluster#3663

Merged
davidrohr merged 6 commits into
AliceO2Group:devfrom
matthiasrichter:dev-clusternative-format
May 30, 2020
Merged

Updated format for transport of TPCNativeCluster#3663
davidrohr merged 6 commits into
AliceO2Group:devfrom
matthiasrichter:dev-clusternative-format

Conversation

@matthiasrichter
Copy link
Copy Markdown
Collaborator

@matthiasrichter matthiasrichter commented May 27, 2020

This PR is sketching the work for moving to an updated transport format as it is produced by the CAclusterer. The format consists of a header with the index of cluster counts for the full TPC and one monolithic buffer for all cluster objects. The cluster offsets per {sector,row} coordinates corresponds to the accumulated cluster counts. The preliminary version seems to work when MC labels are disabled.

The main changes are:

  • the ClusterDecoderRaw directly writes the updated format

  • ClusterNativeHelper can read the updated format

  • handling of the MC labels, this is a bit convoluted with multiple copies, needs to be sorted out

  • set ClusterNativeAccess from monolithic buffer without copy

  • publish the CA clusterer output in the updated format

  • test with subsequent workflows (ITS-TPC matching, TPC interpolation)

  • make the writing and reading of ClusterNative files working for all cases

@matthiasrichter
Copy link
Copy Markdown
Collaborator Author

I found a bug in the copying of the MCLabels on the sender side, but seems there is one more problem reading the labels.

@matthiasrichter matthiasrichter force-pushed the dev-clusternative-format branch 2 times, most recently from 957ff9b to 90ead9c Compare May 28, 2020 13:42
@matthiasrichter
Copy link
Copy Markdown
Collaborator Author

Just updated the PR with a first working version. Have tested it including the ITTPS matching

o2-its-reco-workflow --disable-root-output | o2-tpc-reco-workflow --infile tpcdigits.root --input-type digits --output-type clusters,tracks,disable-writer --ca-clusterer | o2-tpcits-match-workflow --disable-root-input

Also without the ca-clusterer, the number of matches is slightly different then.

The ca clusters are copied to the output message right now, this can be changed in a follow up taking the relevant code from #3552 . No copy in the subscribing processes if the ca-clusterer is used.

I'm starting now to clean-up. Have to check how to get the writing and reading of the new transport format can be handled with the RootFileWriter. Changing the configuration of the writer is probably trivial, the reader needs to be adjusted to support reading from both sector branches and from the single branch.

@shahor02
Copy link
Copy Markdown
Collaborator

@matthiasrichter: a small difference with or w/o --ca-clusterer is expected, sine the both the TPC clusters and tracks are different. On the level of matching in 10 pp events I see 62 vs 61 matches difference, I think it is ok.

But, when running TPC reco with MC ON, I see an error in the RootTreeWriter:

o2-tpc-reco-workflow  --tpc-digit-reader '--infile tpcdigits.root' --input-type digits --output-type tracks,clusters,encoded-clusters --tpc-track-writer "--treename events --track-branch-name Tracks"
...
[29673:tpc-native-cluster-writer]: [METRIC] data_relayer/1140,0 0 1590676271135 hostname=alicers02,name=o2-tpc-reco-workError in <TMessage::ReadClass>: The on-file class is "'o2::dataformats::MCTruthContainer<o2::MCCompLabel>" which is not compatible with the requested class: "vector<o2::dataformats::MCTruthContainer<o2::MCCompLabel> >"
[29673:tpc-native-cluster-writer]: Error in <TMessage::ReadObject>: got object of wrong class! requested vector<o2::dataformats::MCTruthContainer<o2::MCCompLabel> > but got o2::dataformats::MCTruthContainer<o2::MCCompLabel>
....

Apart from that, cluster loading in the TPCInterpolationSpec.cxx needs the same changes as in the TPCITSMatchingSpec, right now it produces

[3758:tpc-track-interpolation]: [17:25:00][ERROR] Exception caught: Expecting data for single sectors

when running with the raw input (with TPC reco using --ca-clusterer)

and

[15694:tpc-track-interpolation]: [17:39:32][ERROR] Exception caught: Did not receive TPC clusters data for all sectors

when running with ca-clusterer from digits.

o2-its-reco-workflow  --disable-root-output --disable-mc | o2-tpc-reco-workflow  --ca-clusterer --tpc-digit-reader '--infile tpcdigits.root' --input-type digits --output-type clusters,tracks,disable-writer  --disable-mc| o2-tpcits-match-workflow  --disable-root-input --disable-root-output  --disable-mc| o2-tof-reco-workflow --disable-root-input --disable-root-output  --disable-mc| o2-tpc-scdcalib-interpolation-workflow --disable-root-input

Doing the same with external clusterer:

o2-its-reco-workflow  --disable-root-output | o2-tpc-reco-workflow  --tpc-digit-reader '--infile tpcdigits.root' --input-type digits --output-type clusters,tracks,disable-writer | o2-tpcits-match-workflow  --disable-root-input --disable-root-output | o2-tof-reco-workflow --disable-root-input --disable-root-output | o2-tpc-scdcalib-interpolation-workflow --disable-root-input

works fine

@matthiasrichter
Copy link
Copy Markdown
Collaborator Author

thanks, yes, only the TPCITS matching was expected to work, I was not clear in my post.
The errors you have seen are simply related to the things not yet implemented. Doing this now.

@matthiasrichter matthiasrichter force-pushed the dev-clusternative-format branch from 90ead9c to f6a1539 Compare May 28, 2020 23:28
@matthiasrichter
Copy link
Copy Markdown
Collaborator Author

Hi @shahor02 , @davidrohr , next iteration is pushed, now the writing/reading from/to file is tested with the TPC reco workflow, e.g. native cluster file produced with and without CA clusterer as input to the CA tracker. Also the TPC interpolation workflow is adjusted. Please check with the reader in the TPCITS matching and TPC interpolation workflows.

@davidrohr
Copy link
Copy Markdown
Collaborator

I had only a quick look. Most cases seem to work. I am not so sure about the below one, seem the track interpolation doesn't run in this case:

o2-its-reco-workflow --disable-root-output | o2-tpc-reco-workflow --tpc-digit-reader '--infile tpcdigits.root' --ca-clusterer --input-type digits --output-type clusters,tracks | o2-tpcits-match-workflow --disable-root-input --disable-root-output | o2-tof-reco-workflow --disable-root-input --disable-root-output | o2-tpc-scdcalib-interpolation-workflow --disable-root-input

Will have a closer look later today.

@shahor02
Copy link
Copy Markdown
Collaborator

For me all modes work, including the one mentioned by @davidrohr (both with and w/o disable-writer in the TPC w/flow).

shahor02
shahor02 previously approved these changes May 29, 2020
…kingWorkflows

In the future, we can drop the logic for checking valid and active sectors,
this is done by the completion policy. We keep it as an extra check during the
heavy development phase.
@matthiasrichter
Copy link
Copy Markdown
Collaborator Author

matthiasrichter commented May 29, 2020

Thanks @shahor02 for checking. The PR is now cleaned up, so we can merge when the CI is ready.

@matthiasrichter matthiasrichter changed the title WIP: Updated format for transport of TPCNativeCluster Updated format for transport of TPCNativeCluster May 29, 2020
The format is produced by the CA clusterer and consists of an index
header ClusterCountIndex with cluster counts per {sector,row} for the
full TPC followed by a linear buffer of ClusterNative objects. The
offsets within the linear buffer are derived from the cluster counts.
The MCLabelContainer corresponds to the sequence of clusters.

Main changes:
- Reworking HardwareClusterDecoder to use cluster index
- Adjusting ClusterNativeHelper to read updated ClusterNative transport format

Further notes:
Still the helper can read the cluster index from a list of inputs, but if
only one input is provided (data and optional MC labels), the index is set directly
from the single buffer without copy. If multiple inputs are available a copy is needed
to create linear buffers for clusters and labels

In the updated format, a full sector or even the full TPC is sent so we do not need the
vector of MCLabelContainers to transport all the MCLabelContainers for the padrows
Sending out the updated transport format which will allow to transport
the data between devices without copy. For the moment, the CA tracker
copies the data into the output message, but the GPU worker will be
changed to write directly to the provided buffer.

Further improvements to the input handling in the CA tracker spec:
- Get rid of the fixed arrays for collecting the input objects
- Adjusting cluster completeness check in the TPC CA tracker to use
  the sector bitmask from the sector header

The check can be dropped in the future as it is the task of the
completion policy to provide complete data sets. We keep it as a check
for a while during the hot development phase.
With the updated transport format for ClusterNative data, the full TPC data
is shipped in one block and also written to a single branch.

Adjusting writer for native clusters to use one single MCLabelContainer
Only one block of clusters and labels is produced by the CA clusterer and
the writer is configured accordingly.

Added a runtime check to the PublisherSpec to detect whether a fill is using
Full or Sector mode. If only one branch of the configured branch name is found
without the sector number suffix, the publisher switches to SectorMode::Full.

We still need to define all output routes because the dynamic adjustment is
at a later stage than the definition of routes. The full block will be published
on one route, while zero-length data is created on all the other routes. By this
we ensure that a completion policy waiting for input on all data can fire.
@matthiasrichter matthiasrichter force-pushed the dev-clusternative-format branch from a74e400 to 13bc183 Compare May 29, 2020 13:31
Allocating the buffer for cluster output before calling the GPU worker.
It can be filled directly from the worker instead of copying after the
worker.
// a byte size resizable vector object, the DataAllocator returns reference to internal object
// initialize optional pointer to the vector object
using ClusterOutputChunkType = std::decay_t<decltype(pc.outputs().make<std::vector<char>>(Output{"", "", 0}))>;
ClusterOutputChunkType* clusterOutput = nullptr;
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidrohr , providing the buffer for cluster output now before calling the GPU worker, so you can use it later to directly write to the vector. It has zero length now, although created there.

Copy link
Copy Markdown
Collaborator

@davidrohr davidrohr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried all combinations I could think of and in the newest version they seem to work.

@matthiasrichter
Copy link
Copy Markdown
Collaborator Author

trying to find out what's the problem with the checkcode CI but I can't find a hint that the problem is related.

@shahor02
Copy link
Copy Markdown
Collaborator

For some reason it looks for non-existent directory, but not clear why (perhaps @aphecetche can suggest):

CMake Error at Detectors/MUON/MCH/Tracking/cmake_install.cmake:64 (file):
  file INSTALL cannot find
  "/mnt/mesos/sandbox/sandbox/sw/SOURCES/O2/build_o2checkcode_o2/0/Detectors/MUON/MCH/Tracking/include/MCHTracking":
  No such file or directory.
Call Stack (most recent call first):
  Detectors/MUON/MCH/cmake_install.cmake:67 (include)
  Detectors/MUON/cmake_install.cmake:42 (include)
  Detectors/cmake_install.cmake:102 (include)
  cmake_install.cmake:81 (include)


gmake: *** [install] Error 1

@aphecetche
Copy link
Copy Markdown
Collaborator

@ktf looks like another PR interference ? with #3648 (that did move those includes)

@rpezzi
Copy link
Copy Markdown
Collaborator

rpezzi commented May 30, 2020

@ktf looks like another PR interference ? with #3648 (that did move those includes)

o2checkcode is misbehaving on #3648 as well. I am pretty sure that o2checkcode completed OK earlier today after a new push. It was a dummy commit created with the sole intent of triggering checks. However, it is now showing as failed with signs of a very broken build, as this PR does.

@davidrohr davidrohr merged commit 7f6eb41 into AliceO2Group:dev May 30, 2020
@matthiasrichter matthiasrichter deleted the dev-clusternative-format branch June 2, 2020 11:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants