
Commit d87588c: doc updates for python sklearn pipelines
1 parent: ea1b4a2

31 files changed: 3381 additions & 2763 deletions

feature-pipeline.md (38 additions & 117 deletions)

@@ -8,15 +8,15 @@ title: Feature Pipelines
 [setup server](/seldon-server-setup.html) --> [events](prediction-api.html) --> **feature extraction pipeline** --> [offline model](offline-prediction-models.html) --> [runtime scorer](/runtime-prediction.html) --> [microservice scorer](/pluggable-prediction-algorithms.html) --> [predictions](prediction-api.html)
 
 
-# Feature Extraction Pipelines
+# Predictive Pipelines
 
 Feature extraction pipelines allow you to define a repeatable process to transform a set of input features before you build a machine learning model on a final set of features. When the resulting model is put into production, the feature pipeline will need to be rerun on each input feature set before it is passed to the model for scoring.
 
 Seldon feature pipelines are presently available in Python. We plan to provide Spark-based pipelines in the future.
 
-# Python modules
-Seldon provides a set of [python modules](python-package.html) to help construct feature pipelines for use inside Seldon.
+## Python modules
+Seldon provides a set of [python modules](python-package.html) to help construct feature pipelines for use inside Seldon. We use [scikit-learn](http://scikit-learn.org/stable/) pipelines and [Pandas](http://pandas.pydata.org/). For feature extraction and transformation we provide a starter set of Python scikit-learn Transformers that take Pandas dataframes as input, apply some transformations, and output Pandas dataframes.
 
-A pipeline consists of a series of Feature_Transforms. The currently available transforms are:
+The currently available example transforms are:
 
 * **Include_features_transform** : include a subset of features
 * **Exclude_features_transform** : exclude some subset of features
@@ -27,141 +27,62 @@ A pipeline consists of a series of Feature_Transforms. The currently available t
 * **Feature_id_transform** : create an id feature from some input feature
 * **Tfidf_transform** : create TFIDF features from an input feature
 * **Auto_transform** : attempt to automatically normalize and create numeric, categorical and date features
+* **sklearn_transform** : apply a [sklearn Transformer](http://scikit-learn.org/stable/data_transforms.html) to a Pandas Dataframe
 
-# Simple Example
-An example pipeline to do very simple extraction on the Iris dataset is contained within the code at `external/predictor/python/docker/examples/iris`. This contains a pipeline that extends Seldon's Docker pipeline base with the following python pipeline:
+## Creating a Machine Learning model
+As the final stage of a pipeline you would usually add a scikit-learn Estimator. We provide three built-in Estimators that accept Pandas dataframes as input, and a general Estimator that can wrap any scikit-learn compatible estimator.
+
+* **XGBoostClassifier** : XGBoost classifier which allows Pandas Dataframes as input
+* **VWClassifier** : VW classifier which allows Pandas Dataframes as input
+* **KerasClassifier** : Keras classifier which allows Pandas Dataframes as input
+* **SKLearnClassifier** : general classifier that runs any [sklearn classifier](http://scikit-learn.org/stable/supervised_learning.html) taking Pandas dataframes as input
+
+## Simple Predictive Pipeline using the Iris Dataset
+An example pipeline to do very simple extraction on the Iris dataset is contained within the code at `external/predictor/python/docker/examples/iris`. This contains pipelines that utilize Seldon's Docker pipeline and create the following Python pipelines:
 
 1. Create an id feature from the name feature
 1. Create an SVMLight feature from the four core predictive features
+1. Create a model with either XGBoost, Vowpal Wabbit or Keras
+
+The pipeline utilizing XGBoost is shown below:
 
 {% highlight python %}
 import sys, getopt, argparse
 import seldon.pipeline.basic_transforms as bt
-import seldon.pipeline.pipelines as pl
+import seldon.pipeline.util as sutl
+import seldon.pipeline.auto_transforms as pauto
+from sklearn.pipeline import Pipeline
+import seldon.xgb as xg
 import sys
 
-def run_pipeline(events,features,models):
-    p = pl.Pipeline(input_folders=events,output_folder=features,local_models_folder="models_tmp",models_folder=models)
-
-    tNameId = bt.Feature_id_transform(min_size=0,exclude_missing=True)
-    tNameId.set_input_feature("name")
-    tNameId.set_output_feature("nameId")
-    svmTransform = bt.Svmlight_transform(included=["f1","f2","f3","f4"])
-    svmTransform.set_output_feature("svmfeatures")
-    p.add(tNameId)
-    p.add(svmTransform)
-    p.fit_transform()
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(prog='bbm_pipeline')
-    parser.add_argument('--events', help='events folder', required=True)
-    parser.add_argument('--features', help='output features folder', required=True)
-    parser.add_argument('--models', help='output models folder', required=True)
-
-    args = parser.parse_args()
-    opts = vars(args)
-
-    run_pipeline([args.events],args.features,args.models)
-{% endhighlight %}
-
-The example code provides methods to download the Iris dataset and create the JSON events. Then the pipeline can be run as:
+def run_pipeline(events,models):
 
-{% highlight bash %}
-docker run --rm -t -v ${PWD}/data:/data seldonio/iris_pipeline bash -c "python /pipeline/iris_pipeline.py --events /data/iris/events/1 --features /data/iris/features/1 --models /data/iris/models/1"
-{% endhighlight %}
+    tNameId = bt.Feature_id_transform(min_size=0,exclude_missing=True,zero_based=True,input_feature="name",output_feature="nameId")
+    tAuto = pauto.Auto_transform(max_values_numeric_categorical=2,exclude=["nameId","name"])
+    xgb = xg.XGBoostClassifier(target="nameId",target_readable="name",excluded=["name"],learning_rate=0.1,silent=0)
 
-The first few lines of the input events JSON look like:
+    transformers = [("tName",tNameId),("tAuto",tAuto),("xgb",xgb)]
+    p = Pipeline(transformers)
 
-{% highlight json %}
-{"f1": 6.1, "f2": 3.0, "f3": 4.9, "f4": 1.8, "name": "Iris-virginica"}
-{"f1": 5.0, "f2": 3.2, "f3": 1.2, "f4": 0.2, "name": "Iris-setosa"}
-{"f1": 5.7, "f2": 2.9, "f3": 4.2, "f4": 1.3, "name": "Iris-versicolor"}
-{"f1": 5.2, "f2": 2.7, "f3": 3.9, "f4": 1.4, "name": "Iris-versicolor"}
-{% endhighlight %}
-
-This is transformed by the pipeline into:
-
-{% highlight json %}
-{"f1": 6.1, "f2": 3.0, "f3": 4.9, "f4": 1.8, "name": "Iris-virginica", "nameId": 1, "svmfeatures": {"1": 6.1, "2": 3.0, "3": 4.9, "4": 1.8}}
-{"f1": 5.0, "f2": 3.2, "f3": 1.2, "f4": 0.2, "name": "Iris-setosa", "nameId": 2, "svmfeatures": {"1": 5.0, "2": 3.2, "3": 1.2, "4": 0.2}}
-{"f1": 5.7, "f2": 2.9, "f3": 4.2, "f4": 1.3, "name": "Iris-versicolor", "nameId": 3, "svmfeatures": {"1": 5.7, "2": 2.9, "3": 4.2, "4": 1.3}}
-{"f1": 5.2, "f2": 2.7, "f3": 3.9, "f4": 1.4, "name": "Iris-versicolor", "nameId": 3, "svmfeatures": {"1": 5.2, "2": 2.7, "3": 3.9, "4": 1.4}}
-{% endhighlight %}
-
-# Advanced Example
-The following reasonably complex example creates a pipeline from training data and runs the same series of transformations on test data. The input events files should contain JSON data.
-
-The pipeline does the following transformations:
-
-1. Keeps only rows containing a likeids feature
-1. Filters features to only a subset using the Include_features_transform
-1. Splits 3 textual features to make a feature containing tokens
-1. Creates an id feature from the "group" feature, keeping only those that appear at least 200 times in the training data
-1. Creates an id feature from the "category" feature
-1. Creates a TFIDF feature, doing a chi-squared test against the groupId feature created above
-1. Creates an SVM feature from the like TFIDF feature
-
-For the testing data the pipeline is reloaded and run against the test data to perform the same transformations.
-
-{% highlight python %}
-import sys, getopt, argparse
-import seldon.pipeline.basic_transforms as bt
-import seldon.pipeline.tfidf_transform as ptfidf
-import seldon.pipeline.pipelines as pl
-
-def training_pipeline(events,features,models,awskey,awssecret):
-    p = pl.Pipeline(input_folders=events,output_folder=features,local_models_folder="models_train",models_folder=models,aws_key=awskey,aws_secret=awssecret)
-
-    tExist = bt.Exist_features_transform(included=['likeids'])
-    tFilter = bt.Include_features_transform(included=["likeids","group","category","utm_source","utm_medium","utm_campaign","friend_uuids"])
-    tSplit = bt.Split_transform(split_expression=" |\-|\_",ignore_numbers=True,input_features=["utm_source","utm_medium","utm_campaign"])
-    tSplit.set_output_feature("campaign")
-    tGroupId = bt.Feature_id_transform(min_size=200,exclude_missing=True)
-    tGroupId.set_input_feature("group")
-    tGroupId.set_output_feature("groupId")
-    tCatId = bt.Feature_id_transform(min_size=0)
-    tCatId.set_input_feature("category")
-    tCatId.set_output_feature("categoryId")
-    tTfidf = ptfidf.Tfidf_transform(select_features=True,target_feature="groupId")
-    tTfidf.set_input_feature("likeids")
-    tTfidf.set_output_feature("likeid_tfidf")
-    svmTransform = bt.Svmlight_transform(included=["likeid_tfidf"])
-    svmTransform.set_output_feature("svmfeatures")
-    tFilterFinal = bt.Include_features_transform(included=["likeid_tfidf","svmfeatures","groupId","group"])
-    p.add(tExist)
-    p.add(tFilter)
-    p.add(tSplit)
-    p.add(tGroupId)
-    p.add(tCatId)
-    p.add(tTfidf)
-    p.add(svmTransform)
-    p.add(tFilterFinal)
-    p.fit_transform()
-
-def testing_pipeline(events,features,models,awskey,awssecret):
-    p = pl.Pipeline(input_folders=events,output_folder=features,local_models_folder="models_test",models_folder=models,aws_key=awskey,aws_secret=awssecret)
-    p.transform()
-    p.store_features()
+    pw = sutl.Pipeline_wrapper()
+    df = pw.create_dataframe(events)
+    df2 = p.fit(df)
+    pw.save_pipeline(p,models)
 
 
 if __name__ == '__main__':
-    parser = argparse.ArgumentParser(prog='pipeline_example')
-    parser.add_argument('--train_events', help='training events folder', required=True)
-    parser.add_argument('--train_features', help='output features folder', required=True)
-    parser.add_argument('--test_events', help='testing events folder', required=True)
-    parser.add_argument('--test_features', help='output features folder', required=True)
-    parser.add_argument('--train_models', help='output models folder', required=True)
-    parser.add_argument('--aws_key', help='aws key', default=None)
-    parser.add_argument('--aws_secret', help='aws secret', default=None)
+    parser = argparse.ArgumentParser(prog='xgb_pipeline')
+    parser.add_argument('--events', help='events folder', required=True)
+    parser.add_argument('--models', help='output models folder', required=True)
 
     args = parser.parse_args()
     opts = vars(args)
 
-    training_pipeline([args.train_events],args.train_features,args.train_models,args.aws_key,args.aws_secret)
-    testing_pipeline([args.test_events],args.test_features,args.train_models,args.aws_key,args.aws_secret)
+    run_pipeline([args.events],args.models)
+{% endhighlight %}
 
+The example is explained in more detail [here](iris-demo.html).
 
-{% endhighlight %}
 
img/predictive-data-pipelines.png (binary image, -5.55 KB)
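For readers without the Seldon modules installed, the named-step pattern used in the new `run_pipeline` (a list of `(name, step)` pairs ending in an estimator, passed to `sklearn.pipeline.Pipeline`) can be tried with stock scikit-learn. This is a generic sketch of the chaining style, not the commit's code, and it substitutes `LogisticRegression` for the XGBoost estimator:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Same (name, step) list style as transformers = [("tName",...), ("xgb",...)]
iris = load_iris()
p = Pipeline([("scale", StandardScaler()),
              ("clf", LogisticRegression())])
p.fit(iris.data, iris.target)
print(p.score(iris.data, iris.target))  # training accuracy on Iris
```

Calling `fit` runs each transformer's `fit_transform` in order and finally fits the estimator, which is exactly the behavior the Seldon dataframe transformers rely on.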
