Commit 2607277

DPL Analysis: allow writing tables to files (#3197)
1 parent 2347eba commit 2607277

14 files changed

Lines changed: 714 additions & 181 deletions

Framework/Core/ANALYSIS.md

Lines changed: 73 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@ defineDataProcessing() {
3535

3636
> **Implementation details**: `AnalysisTask` is simply a `struct`. Since `struct` default inheritance policy is `public`, we can omit specifying it when declaring MyTask.
3737
>
38-
> `AnalysisTask` will not actually provide any virtual method, as the `adaptAnalysis` helper relyes on template argument matching to discover the properties of the task. It will come clear in the next paragraph how this allow is used to avoid the proliferation of data subscription methods.
38+
> `AnalysisTask` will not actually provide any virtual method, as the `adaptAnalysis` helper relies on template argument matching to discover the properties of the task. It will come clear in the next paragraph how this allow is used to avoid the proliferation of data subscription methods.
3939

4040
## Processing data
4141

@@ -55,7 +55,7 @@ struct MyTask : AnalysisTask {
5555
};
5656
```
5757
58-
will allow you to get a per timeframe collection of tracks. You can then iterate on the tracks using the syntax:
58+
will allow you to get a per time frame collection of tracks. You can then iterate on the tracks using the syntax:
5959
6060
```cpp
6161
for (auto &track : tracks) {
@@ -77,7 +77,7 @@ This has the advantage that you might be able to benefit from vectorization / pa
7777
7878
> **Implementation notes**: as mentioned before, the arguments of the process method are inspected using template argument matching. This way the system knows at compile time what data types are requested by a given `process` method and can create the relevant DPL data descriptions.
7979
>
80-
> The distiction between `Tracks` and `Track` above is simply that one refers to the whole collection, while the second is an alias to `Tracks::iterator`. Notice that we assume that each collection is of type `o2::soa::Table` which carries metadata about the dataOrigin and dataDescription to be used by DPL to subscribe to the associated data stream.
80+
> The distinction between `Tracks` and `Track` above is simply that one refers to the whole collection, while the second is an alias to `Tracks::iterator`. Notice that we assume that each collection is of type `o2::soa::Table` which carries meta data about the dataOrigin and dataDescription to be used by DPL to subscribe to the associated data stream.
8181
8282
### Navigating data associations
8383
@@ -89,7 +89,7 @@ void process(o2::aod::Collision const& collision, o2::aod::Tracks &tracks) {
8989
}
9090
```
9191

92-
the above will be called once per collision found in the timeframe, and `tracks` will allow you to iterate on all the tracks associated to the given collision.
92+
the above will be called once per collision found in the time frame, and `tracks` will allow you to iterate on all the tracks associated to the given collision.
9393

9494
Alternatively, you might not require to have all the tracks at once and you could do with:
9595

@@ -176,7 +176,7 @@ struct MyTask : AnalysisTask {
176176
};
177177
```
178178

179-
the `etaphi` object is a functor that will effectively act as a cursor which allows to populate the `EtaPhi` table. Each invokation of the functor will create a new row in the table, using the arguments as contents of the given column. By default the arguments must be given in order, but one can give them in any order by using the correct column type. E.g. in the example above:
179+
the `etaphi` object is a functor that will effectively act as a cursor which allows to populate the `EtaPhi` table. Each invocation of the functor will create a new row in the table, using the arguments as contents of the given column. By default the arguments must be given in order, but one can give them in any order by using the correct column type. E.g. in the example above:
180180

181181
```cpp
182182
etaphi(track::Phi(calculatePhi(track), track::Eta(calculateEta(track)));
@@ -186,7 +186,7 @@ etaphi(track::Phi(calculatePhi(track), track::Eta(calculateEta(track)));
186186
187187
Sometimes columns are not backed by actual persisted data, but they are merely
188188
derived from it. For example you might want to have different representations
189-
(e.g. spherical, cylindrical) for a given persisten representation. You can
189+
(e.g. spherical, cylindrical) for a given persistent representation. You can
190190
do that by using the `DECLARE_SOA_DYNAMIC_COLUMN` macro.
191191
192192
```cpp
@@ -200,10 +200,10 @@ DECLARE_SOA_DYNAMIC_COLUMN(R2, r2, [](float x, float y) { return x*x + y+y; });
200200
DECLARE_SOA_TABLE(Point, "MISC", "POINT", X, Y, (R2<X,Y>));
201201
```
202202

203-
Notice how the dynamic column is defined as a standalone column and binds to X and Y
203+
Notice how the dynamic column is defined as a stand alone column and binds to X and Y
204204
only when you attach it as part of a table.
205205

206-
### Executing a finalisation method, post run
206+
### Executing a finalization method, post run
207207

208208
Sometimes it's handy to perform an action when all the data has been processed, for example executing a fit on a histogram we filled during the processing. This can be done by implementing the postRun method.
209209

@@ -335,7 +335,7 @@ struct MyTask : AnalysisTask {
335335
### Getting combinations (pairs, triplets, ...)
336336
To get combinations of distinct tracks, helper functions from `ASoAHelpers.h` can be used. Presently, there are 3 combinations policies available: strictly upper, upper and full.
337337
338-
The number of elements in a combination is deduced from the number of arguments pased to `combinations()` call. For example, to get pairs of tracks from the same source, one must specify `tracks` table twice:
338+
The number of elements in a combination is deduced from the number of arguments passed to `combinations()` call. For example, to get pairs of tracks from the same source, one must specify `tracks` table twice:
339339
340340
```cpp
341341
struct MyTask : AnalysisTask {
@@ -362,7 +362,7 @@ struct MyTask : AnalysisTask {
362362
};
363363
```
364364
365-
It will be possible to specify a filter for a combination as a whole, and only matching combinations will be then outputed. Currently, the filter is applied to each element separately. Note that for filter version the input tables are mentioned twice, both in policy constructor and in `combinations()` call itself.
365+
It will be possible to specify a filter for a combination as a whole, and only matching combinations will be then output. Currently, the filter is applied to each element separately. Note that for filter version the input tables are mentioned twice, both in policy constructor and in `combinations()` call itself.
366366
367367
```cpp
368368
struct MyTask : AnalysisTask {
@@ -384,6 +384,69 @@ combinations(tracks, tracks); // equivalent to combinations(CombinationsStrictly
384384
combinations(filter, tracks, covs); // equivalent to combinations(CombinationsFullIndexPolicy(tracks, covs), filter, tracks, covs);
385385
```
386386
387+
### Saving tables to file
388+
389+
Produced tables can be saved to file as TTrees. This process is customized by the command line option `--keep` (of the internal-dpl-AOD-writer). **Please be aware that the format of the `keep` option as described here is preliminary and might change in the future.**
390+
391+
`keep` is a comma-separated list of `DataOutputDescriptions`.
392+
393+
`keep`
394+
```csh
395+
DataOutputDescription1,DataOutputDescription2, ...
396+
```
397+
398+
Each `DataOutputDescription` is a colon-separated list of 4 items
399+
400+
`DataOutputDescription`
401+
```csh
402+
table:tree:columns:file
403+
```
404+
and instructs the internal-dpl-AOD-writer to save the columns `columns` of table `table` as TTree `tree` into files `file_x.root`, where `x` is an incremental number. The selected columns are saved as separate TBranches of TTree `tree`.
405+
406+
By default `x` is incremented with every time frame. This behavior can be modified with the command line option `--ntfmerge`. The value of `ntfmerge` specifies the number of time frames to merge into one file.
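
As a rough illustration of the naming scheme, the sketch below derives a file name from a time frame counter. It assumes that counting starts at 0 and that `x` is the integer quotient of the counter and `ntfmerge`; neither detail is stated above, and `fileIndex` and `outputFileName` are hypothetical helpers, not part of the framework.

```cpp
#include <string>

// Assumption: time frames 0..ntfmerge-1 go to file 0, the next ntfmerge
// time frames go to file 1, and so on. The real numbering may differ.
int fileIndex(int timeFrame, int ntfmerge)
{
  if (ntfmerge < 1) {
    ntfmerge = 1; // treat invalid values as "one time frame per file"
  }
  return timeFrame / ntfmerge;
}

// Compose `file_x.root` from a base name and the file index.
std::string outputFileName(std::string const& base, int timeFrame, int ntfmerge)
{
  return base + "_" + std::to_string(fileIndex(timeFrame, ntfmerge)) + ".root";
}
```

Under these assumptions, with `--ntfmerge 50` the first 50 time frames would share one file and the 51st would open the next one.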
407+
408+
The first item of a `DataOutputDescription` is mandatory; if it is missing, the `DataOutputDescription` is ignored. The other three items are optional and are filled with default values if missing.
409+
410+
The format of `table` is
411+
412+
`table`
413+
```csh
414+
AOD/tablename/0
415+
```
416+
`tablename` is the name of the table as defined in the workflow definition.
417+
418+
The format of `tree` is a simple string which names the TTree the table will be saved to. If `tree` is not specified then `tablename` will be used as TTree name.
419+
420+
`columns` is a slash (/)-separated list of column names, e.g.
421+
422+
`columns`
423+
```csh
424+
col1/col2/col3
425+
```
426+
The column names are expected to match column names of table `tablename` as defined in the respective workflow. Non-matching columns are ignored. The selected table columns are saved as separate TBranches with the same names as the corresponding table columns. If `columns` is not specified then all table columns will be saved.
427+
428+
`file` finally specifies the base name of the files the tables are saved to. The actual file names are composed as `file`_`x`.root, where 'x' is an incremental number. If `file` is not specified the default file name is used. The default file name can be set with the command line option `--res-file`. However, if `res-file` is missing then the default file name is set to `AnalysisResults`.
429+
430+
#### Valid example command line options
431+
432+
```csh
433+
--keep AOD/UNO/0
434+
# save all columns of table 'UNO' to TTree 'UNO' in files 'AnalysisResults'_x.root
435+
436+
--keep AOD/UNO/0::c2/c4:unoresults
437+
# save columns 'c2' and 'c4' of table 'UNO' to TTree 'UNO' in files 'unoresults'_x.root
438+
439+
--res-file myskim --ntfmerge 50 --keep AOD/UNO/0:trsel1:c1/c2,AOD/DUE/0:trsel2:c6/c7/c8
440+
# save columns 'c1' and 'c2' of table 'UNO' to TTree 'trsel1' in files 'myskim'_x.root and
441+
# save columns 'c6', 'c7' and 'c8' of table 'DUE' to TTree 'trsel2' in files 'myskim'_x.root.
442+
# Merge 50 time frames in each file.
443+
444+
```
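
To make the defaulting rules concrete, here is a minimal stand-alone sketch of how a `--keep` value could be split into its items. It is not the framework's parser: `KeepItem`, `splitKeepEmpty` and `parseKeep` are hypothetical names, whitespace is not stripped, and the default file name is passed in by the caller.

```cpp
#include <string>
#include <vector>

// Hypothetical plain-data stand-in for one `table:tree:columns:file` item.
struct KeepItem {
  std::string table;
  std::string tree;
  std::vector<std::string> columns; // empty means "all columns"
  std::string file;
};

// Split `s` on the character `sep`, keeping empty fields.
std::vector<std::string> splitKeepEmpty(std::string const& s, char sep)
{
  std::vector<std::string> out;
  std::string cur;
  for (char c : s) {
    if (c == sep) {
      out.push_back(cur);
      cur.clear();
    } else {
      cur += c;
    }
  }
  out.push_back(cur);
  return out;
}

std::vector<KeepItem> parseKeep(std::string const& keep,
                                std::string const& defaultFile = "AnalysisResults")
{
  std::vector<KeepItem> items;
  for (auto const& descr : splitKeepEmpty(keep, ',')) {
    auto parts = splitKeepEmpty(descr, ':');
    parts.resize(4); // missing optional items become empty strings
    if (parts[0].empty()) {
      continue; // the table item is mandatory
    }
    KeepItem item;
    item.table = parts[0];
    // The table name is the middle field of AOD/tablename/0.
    auto fields = splitKeepEmpty(parts[0], '/');
    std::string tablename = fields.size() > 1 ? fields[1] : parts[0];
    item.tree = parts[1].empty() ? tablename : parts[1];
    for (auto const& col : splitKeepEmpty(parts[2], '/')) {
      if (!col.empty()) {
        item.columns.push_back(col);
      }
    }
    item.file = parts[3].empty() ? defaultFile : parts[3];
    items.push_back(item);
  }
  return items;
}
```

For `AOD/UNO/0::c2/c4:unoresults` this yields tree `UNO`, columns `c2` and `c4`, and file base name `unoresults`, matching the second example above.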
445+
446+
#### Limitations
447+
448+
If the provided `--keep` option contains two `DataOutputDescriptions` with the same combination of `tree` and `file`, the processing is stopped. It is not possible to save two trees with the same name to a given file.
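
A check of this kind can be sketched in a few lines. The function below is only an illustration under the assumption that each description has already been reduced to a (tree, file) pair; `hasDuplicateTreeFile` is a hypothetical name and does not reflect how the writer actually implements the check.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// Returns true if any (tree, file) pair occurs more than once.
bool hasDuplicateTreeFile(std::vector<std::pair<std::string, std::string>> const& treeFilePairs)
{
  std::set<std::pair<std::string, std::string>> seen;
  for (auto const& p : treeFilePairs) {
    // insert() reports via .second whether the pair was new
    if (!seen.insert(p).second) {
      return true;
    }
  }
  return false;
}
```

Two descriptions that write different trees into the same file pass this check; only an identical tree and file combination is rejected.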
449+
387450
### Possible ideas
388451

389452
We could add a template `<typename C...> reshuffle()` method to the Table class which allows you to reduce the number of columns or attach new dynamic columns. A template wrapper could

Framework/Core/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -86,6 +86,7 @@ o2_add_library(Framework
8686
src/TMessageSerializer.cxx
8787
src/TableBuilder.cxx
8888
src/TableConsumer.cxx
89+
src/DataOutputDirector.cxx
8990
src/Task.cxx
9091
src/TextControlService.cxx
9192
src/Variant.cxx

Framework/Core/include/Framework/CommonDataProcessors.h

Lines changed: 4 additions & 5 deletions
@@ -12,6 +12,7 @@
1212

1313
#include "Framework/DataProcessorSpec.h"
1414
#include "Framework/InputSpec.h"
15+
#include "Framework/DataOutputDirector.h"
1516
#include "TTree.h"
1617

1718
#include <vector>
@@ -34,11 +35,9 @@ struct CommonDataProcessors {
3435
/// not going to be used by the returned DataProcessorSpec.
3536
static DataProcessorSpec getGlobalFileSink(std::vector<InputSpec> const& danglingInputs,
3637
std::vector<InputSpec>& unmatched);
37-
/// Helper function to create and write TTree
38-
static void table2tree(TTree* tout,
39-
std::shared_ptr<arrow::Table> table,
40-
bool tupdate);
41-
static DataProcessorSpec getGlobalAODSink(std::vector<InputSpec> const& danglingInputs);
38+
/// writes inputs of kind AOD to file
39+
static DataProcessorSpec getGlobalAODSink(std::vector<InputSpec> const& OutputInputs,
40+
std::vector<bool> const& isdangling);
4241
/// @return a dummy DataProcessorSpec which requires all the passed @a InputSpec
4342
/// and simply discards them.
4443
static DataProcessorSpec getDummySink(std::vector<InputSpec> const& danglingInputs);

Framework/Core/include/Framework/DataDescriptorQueryBuilder.h

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,7 @@
1515
#include <string>
1616
#include <vector>
1717
#include <memory>
18+
#include <regex>
1819

1920
namespace o2
2021
{
@@ -66,6 +67,9 @@ struct DataDescriptorQueryBuilder {
6667
///
6768
/// Example for config: TPC/CLUSTER/0;ITS/TRACKS/1
6869
static DataDescriptorQuery buildFromKeepConfig(std::string const& config);
70+
static DataDescriptorQuery buildFromExtendedKeepConfig(std::string const& config);
71+
static std::unique_ptr<data_matcher::DataDescriptorMatcher> buildNode(std::string const& nodeString);
72+
static std::smatch getTokens(std::string const& nodeString);
6973
};
7074

7175
} // namespace framework
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
1+
// Copyright CERN and copyright holders of ALICE O2. This software is
2+
// distributed under the terms of the GNU General Public License v3 (GPL
3+
// Version 3), copied verbatim in the file "COPYING".
4+
//
5+
// See http://alice-o2.web.cern.ch/license for full licensing information.
6+
//
7+
// In applying this license CERN does not waive the privileges and immunities
8+
// granted to it by virtue of its status as an Intergovernmental Organization
9+
// or submit itself to any jurisdiction.
10+
#ifndef o2_framework_DataOutputDirector_H_INCLUDED
11+
#define o2_framework_DataOutputDirector_H_INCLUDED
12+
13+
#include "TFile.h"
14+
15+
#include "Framework/DataDescriptorQueryBuilder.h"
16+
#include "Framework/DataDescriptorMatcher.h"
17+
#include "Framework/DataSpecUtils.h"
18+
#include "Framework/InputSpec.h"
19+
20+
#include <regex>
21+
22+
namespace o2
23+
{
24+
namespace framework
25+
{
26+
namespace data_matcher
27+
{
28+
29+
struct DataOutputDescriptor {
30+
/// Holds information concerning the writing of aod tables.
31+
/// The information includes the table specification, treename,
32+
/// columns to save, and the file name
33+
34+
std::string tablename = "";
35+
std::string treename = "";
36+
std::string filename = "";
37+
std::vector<std::string> colnames;
38+
std::unique_ptr<DataDescriptorMatcher> matcher;
39+
40+
DataOutputDescriptor(std::string sin)
41+
{
42+
// sin is an item consisting of 4 parts which are separated by a ':'
43+
// "origin/description/subSpec:treename:col1/col2/col3:filename"
44+
// the 1st part is used to create a DataDescriptorMatcher
45+
// the other parts are used to fill treename, colnames, and filename
46+
47+
// remove all spaces
48+
auto s = remove_ws(sin);
49+
50+
// reset
51+
treename = "";
52+
colnames.clear();
53+
filename = "";
54+
55+
// analyze the parts of the input string
56+
static const std::regex delim1(":");
57+
std::sregex_token_iterator end;
58+
std::sregex_token_iterator iter1(s.begin(),
59+
s.end(),
60+
delim1,
61+
-1);
62+
63+
// create the DataDescriptorMatcher
64+
if (iter1 == end)
65+
return;
66+
auto a = iter1->str();
67+
matcher = DataDescriptorQueryBuilder::buildNode(a);
68+
69+
// get the table name
70+
auto m = DataDescriptorQueryBuilder::getTokens(a);
71+
if (!std::string(m[2]).empty())
72+
tablename = m[2];
73+
74+
// get the tree name
75+
// default tree name is the table name
76+
treename = tablename;
77+
++iter1;
78+
if (iter1 == end)
79+
return;
80+
if (!iter1->str().empty())
81+
treename = iter1->str();
82+
83+
// get column names
84+
++iter1;
85+
if (iter1 == end)
86+
return;
87+
if (!iter1->str().empty()) {
88+
auto cns = iter1->str();
89+
90+
static const std::regex delim2("/");
91+
std::sregex_token_iterator iter2(cns.begin(),
92+
cns.end(),
93+
delim2,
94+
-1);
95+
for (; iter2 != end; ++iter2)
96+
if (!iter2->str().empty())
97+
colnames.emplace_back(iter2->str());
98+
}
99+
100+
// get the filename
101+
++iter1;
102+
if (iter1 == end)
103+
return;
104+
if (!iter1->str().empty())
105+
filename = iter1->str();
106+
}
107+
108+
void setFilename(std::string fn) { filename = fn; }
109+
110+
void printOut()
111+
{
112+
LOG(INFO) << "DataOutputDescriptor";
113+
LOG(INFO) << " table name: " << tablename.c_str();
114+
LOG(INFO) << " file name : " << filename.c_str();
115+
LOG(INFO) << " tree name : " << treename.c_str();
116+
LOG(INFO) << " columns : " << colnames.size();
117+
for (auto cn : colnames)
118+
LOG(INFO) << " " << cn.c_str();
119+
}
120+
121+
std::string remove_ws(const std::string& s)
122+
{
123+
std::string s_wns;
124+
for (auto c : s)
125+
if (!std::isspace(c))
126+
s_wns += c;
127+
return s_wns;
128+
}
129+
};
130+
131+
struct DataOutputDirector {
132+
133+
int ndod = 0;
134+
std::string defaultfname;
135+
std::vector<DataOutputDescriptor*> dodescrs;
136+
137+
std::vector<std::string> tnfns;
138+
139+
std::vector<std::string> fnames;
140+
std::vector<int> fcnts;
141+
std::vector<TFile*> fouts;
142+
143+
DataOutputDirector();
144+
void reset();
145+
146+
// fill the DataOutputDirector with information from a
147+
// keep-string
148+
void readString(std::string const& keepString);
149+
150+
// fill the DataOutputDirector with information from a
151+
// list of InputSpec
152+
void readSpecs(std::vector<InputSpec> inputs);
153+
154+
// fill the DataOutputDirector with information from a json file
155+
//readJson (std::string const& fnjson) {};
156+
157+
// get matching DataOutputDescriptors
158+
std::vector<DataOutputDescriptor*> getDataOutputDescriptors(header::DataHeader dh);
159+
std::vector<DataOutputDescriptor*> getDataOutputDescriptors(InputSpec spec);
160+
161+
// get the matching TFile
162+
TFile* getDataOutputFile(DataOutputDescriptor* dod,
163+
int ntf, int ntfmerge, std::string filemode);
164+
void closeDataOutputFiles();
165+
166+
void setDefaultfname(std::string dfn) { defaultfname = dfn; }
167+
168+
void printOut();
169+
};
170+
171+
} // namespace data_matcher
172+
} // namespace framework
173+
} // namespace o2
174+
175+
#endif // o2_framework_DataOutputDirector_H_INCLUDED
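
The `DataOutputDescriptor` constructor above relies on `std::sregex_token_iterator` with submatch index `-1`, which returns the text between delimiter matches. The stand-alone sketch below isolates that technique; `tokenize` is a hypothetical helper written for illustration and is not part of the header.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split `s` at every match of the regular expression `delimiter`.
// Submatch index -1 selects the parts between matches, so two adjacent
// delimiters produce an empty token in between.
std::vector<std::string> tokenize(std::string const& s, std::string const& delimiter)
{
  std::regex const delim(delimiter);
  std::sregex_token_iterator iter(s.begin(), s.end(), delim, -1);
  std::sregex_token_iterator end;
  std::vector<std::string> tokens;
  for (; iter != end; ++iter) {
    tokens.push_back(iter->str());
  }
  return tokens;
}
```

Splitting `AOD/UNO/0::c2/c4:unoresults` on `:` gives four tokens, the second of which is empty. This is why the constructor tests each token with `empty()` before overriding a default value.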
