# Transforms 

[All transforms](../python/src/data_processing/transform/abstract_transform.py)
are generalized to operate on generically typed `DATA.`
[Ray](ray-runtime.md) and [Python](python-runtime.md) runtimes 
currently support `DATA` as both byte arrays 
and [pyarrow Tables](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html). 
The [Spark runtime](spark-runtime.md) currently supports the native Spark `DataFrame`.

All transforms convert their input `DATA` to a list of transformed `DATA` objects
and optional metadata about the transformation of the `DATA` instance.
The Transform itself need only be concerned with the conversion 
of one `DATA` instance at a time. 
Transforms, where possible, should be implemented without regard to the
runtime it will run in or where its configuration originates.  

In the discussion that follows, we'll focus on the transformation of pyarrow Tables
using the `AbstractTableTransform` class (see below), supported by both
the Ray and Python runtimes.
Mapping from this tutorial to a Spark runtime can be done by using 
`data-prep-kit-spark`'s [AbstractSparkTransform](../spark/src/data_processing_spark/runtime/spark/spark_transform.py)
which operates on a Spark DataFrame instead of a pyarrow Table.


#### AbstractTableTransform class
[AbstractTableTransform](../python/src/data_processing/transform/table_transform.py) 
is expected to be extended when implementing a transform of pyarrow Tables.
In general, when possible a transform should be independent of the runtime in
which it runs, and the mechanism used to define its configuration
(e.g., the `TransformConfiguration` class below, or other mechanism).
That said, some transforms may require facilities provided by the runtime
(shared memory, distribution, etc.), but as a starting point, think of the
transform as an independent operator.

The following methods are defined:

* ```__init__(self, config:dict)``` - an initializer through which the transform can be created 
with implementation-specific configuration.  For example, the location of a model, maximum number of
rows in a table, column(s) to use, etc.   Error checking of configuration should be done here.
* ```transform(self, table:pyarrow.Table) -> tuple(list[pyarrow.Table], dict)``` - this method is responsible
for the actual transformation of a given table to zero or more output tables, and optional 
metadata regarding the transformation applied.  Zero tables might be returned when
merging tables across calls to `transform()` and more than 1 table might be returned
when splitting tables by size or other criteria.
  * _output tables list_ - the RayWork handles the various number of returned tables as follows: 
    * 0 - no file will be written out and the input file name will not be used in the output directory.
    * 1 - one parquet file will be written to the output directory with 
    * N - N parquet files are written to the output with `_<index>` appended to the base file name
  * _dict_ - is a dictionary of transform-specific data keyed to numeric values.  A statistics component will
         accumulate/add dictionaries across all calls to transform across all calls to all transforms running
         in a given _runtime_ (see below). As an example, a
         transform might wish to track the number of instances of PII entities detected and might return 
         this as `{ "entities" : 1234 }`.
* ```flush() -> tuple(list[pyarrow.Table], dict)``` - this is provided for transforms that
make use of buffering (e.g. to resize the tables) across calls 
to `transform()` and need to be flushed of all buffered data at the end of processing of input tables.  
The return values are handled the same waa as the return values for `transform()`.  Since most transforms will likely
not need this feature, a default implementation is provided to return an empty list and empty dictionary.
 
#### TransformConfiguration class
The [TransformConfiguration](../python/src/data_processing/transform/transform_configuration.py)
serves as an interface and must be implemented by the any `AbstractTableTransform`
implementation to enable running within and runtime or from a command line to capture 
transform configuration.  It provides the following configuration:

* the transform class to be used,
* command line arguments used to initialize the Transform Runtime and generally, the Transform.
* Transform Runtime class to use
* transform short name 

It is expected that transforms are initialized with a fixed name, the class of its corresponding
`AbstractTableTransform` implementation and optionally the configuration keys that should not
be exposed as metadata for a run.
To support command line configuration, the `TransformConfiguration` extends the
[CLIArgumentProvider](../python/src/data_processing/utils/cli_utils.py) class.
The set of methods of interest are

* ```__init__(self, name:str, transform_class:type[AbstractTableTransform], list[str]:remove_from_metadata )``` - sets the required fields
* ```add_input_params(self, parser:ArgumentParser)``` - adds transform-specific command line options that will
be made available in the dictionary provided to the transform's initializer.
* ```apply_input_params(self, args: argparse.Namespace)``` - verifies  and captures the relevant transform parameters.
* ```get_input_params(self ) -> dict[str,Anny]``` - returns the dictionary of configuration values that
should be used to initialize the transform.