
Commit d87588c: doc updates for python sklearn pipelines
1 parent: ea1b4a2

31 files changed: 3381 additions & 2763 deletions

feature-pipeline.md (38 additions & 117 deletions)

@@ -8,15 +8,15 @@ title: Feature Pipelines
 [setup server](/seldon-server-setup.html) --> [events](prediction-api.html) --> **feature extraction pipeline** --> [offline model](offline-prediction-models.html) --> [runtime scorer](/runtime-prediction.html) --> [microservice scorer](/pluggable-prediction-algorithms.html) --> [predictions](prediction-api.html)
 
 
-# Feature Extraction Pipelines
+# Predictive Pipelines
 
 Feature extraction pipelines allow you to define a repeatable process to transform a set of input features before you build a machine learning model on a final set of features. When the resulting model is put into production, the feature pipeline will need to be rerun on each input feature set before it is passed to the model for scoring.
 
 Seldon feature pipelines are presently available in Python. We plan to provide Spark-based pipelines in the future.
 
-# Python modules
-Seldon provides a set of [python modules](python-package.html) to help construct feature pipelines for use inside Seldon.
+## Python modules
+Seldon provides a set of [python modules](python-package.html) to help construct feature pipelines for use inside Seldon. We use [scikit-learn](http://scikit-learn.org/stable/) pipelines and [Pandas](http://pandas.pydata.org/). For feature extraction and transformation we provide a starter set of Python scikit-learn Transformers that take Pandas dataframes as input, apply some transformations, and output Pandas dataframes.
 
-A pipeline consists of a series of Feature_Transforms. The currently available transforms are:
+The currently available example transforms are:
 
 * **Include_features_transform** : include a subset of features
 * **Exclude_features_transform** : exclude some subset of features
@@ -27,141 +27,62 @@ A pipeline consists of a series of Feature_Transforms. The currently available t
 * **Feature_id_transform** : create an id feature from some input feature
 * **Tfidf_transform** : create TFIDF features from an input feature
 * **Auto_transform** : attempt to automatically normalize and create numeric, categorical and date features
+* **sklearn_transform** : apply a [sklearn Transformer](http://scikit-learn.org/stable/data_transforms.html) to a Pandas Dataframe
 
-# Simple Example
-An example pipeline to do very simple extraction on the Iris dataset is contained within the code at `external/predictor/python/docker/examples/iris`. This contains a pipeline that extends Seldon's Docker pipeline base with the following python pipeline:
+## Creating a Machine Learning model
+As the final stage of a pipeline you would usually add a scikit-learn Estimator. We provide three built-in Estimators that accept Pandas dataframes as input, and a general Estimator that can wrap any scikit-learn compatible estimator.
+
+* **XGBoostClassifier** : XGBoost classifier which allows Pandas Dataframes as input
+* **VWClassifier** : VW classifier which allows Pandas Dataframes as input
+* **KerasClassifier** : Keras classifier which allows Pandas Dataframes as input
+* **SKLearnClassifier** : general classifier that runs any [sklearn classifier](http://scikit-learn.org/stable/supervised_learning.html) taking Pandas dataframes as input
+
+## Simple Predictive Pipeline using the Iris Dataset
+An example pipeline to do very simple extraction on the Iris dataset is contained within the code at `external/predictor/python/docker/examples/iris`. This contains pipelines that utilize Seldon's Docker pipeline and create the following Python pipelines:
 
 1. Create an id feature from the name feature
 1. Create an SVMLight feature from the four core predictive features
+1. Create a model with either XGBoost, Vowpal Wabbit or Keras
+
+The pipeline utilizing XGBoost is shown below:
 
 {% highlight python %}
 import sys, getopt, argparse
 import seldon.pipeline.basic_transforms as bt
-import seldon.pipeline.pipelines as pl
+import seldon.pipeline.util as sutl
+import seldon.pipeline.auto_transforms as pauto
+from sklearn.pipeline import Pipeline
+import seldon.xgb as xg
 import sys
 
-def run_pipeline(events,features,models):
-    p = pl.Pipeline(input_folders=events,output_folder=features,local_models_folder="models_tmp",models_folder=models)
-
-    tNameId = bt.Feature_id_transform(min_size=0,exclude_missing=True)
-    tNameId.set_input_feature("name")
-    tNameId.set_output_feature("nameId")
-    svmTransform = bt.Svmlight_transform(included=["f1","f2","f3","f4"])
-    svmTransform.set_output_feature("svmfeatures")
-    p.add(tNameId)
-    p.add(svmTransform)
-    p.fit_transform()
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(prog='bbm_pipeline')
-    parser.add_argument('--events', help='events folder', required=True)
-    parser.add_argument('--features', help='output features folder', required=True)
-    parser.add_argument('--models', help='output models folder', required=True)
-
-    args = parser.parse_args()
-    opts = vars(args)
-
-    run_pipeline([args.events],args.features,args.models)
-{% endhighlight %}
-
-The example code provides methods to download the Iris dataset and create the JSON events. Then the pipeline can be run as:
+def run_pipeline(events,models):
 
-{% highlight bash %}
-docker run --rm -t -v ${PWD}/data:/data seldonio/iris_pipeline bash -c "python /pipeline/iris_pipeline.py --events /data/iris/events/1 --features /data/iris/features/1 --models /data/iris/models/1"
-{% endhighlight %}
+    tNameId = bt.Feature_id_transform(min_size=0,exclude_missing=True,zero_based=True,input_feature="name",output_feature="nameId")
+    tAuto = pauto.Auto_transform(max_values_numeric_categorical=2,exclude=["nameId","name"])
+    xgb = xg.XGBoostClassifier(target="nameId",target_readable="name",excluded=["name"],learning_rate=0.1,silent=0)
 
-The first few lines of the input events JSON look like:
+    transformers = [("tName",tNameId),("tAuto",tAuto),("xgb",xgb)]
+    p = Pipeline(transformers)
 
-{% highlight json %}
-{"f1": 6.1, "f2": 3.0, "f3": 4.9, "f4": 1.8, "name": "Iris-virginica"}
-{"f1": 5.0, "f2": 3.2, "f3": 1.2, "f4": 0.2, "name": "Iris-setosa"}
-{"f1": 5.7, "f2": 2.9, "f3": 4.2, "f4": 1.3, "name": "Iris-versicolor"}
-{"f1": 5.2, "f2": 2.7, "f3": 3.9, "f4": 1.4, "name": "Iris-versicolor"}
-{% endhighlight %}
-
-This is transformed by the pipeline into:
-
-{% highlight json %}
-{"f1": 6.1, "f2": 3.0, "f3": 4.9, "f4": 1.8, "name": "Iris-virginica", "nameId": 1, "svmfeatures": {"1": 6.1, "2": 3.0, "3": 4.9, "4": 1.8}}
-{"f1": 5.0, "f2": 3.2, "f3": 1.2, "f4": 0.2, "name": "Iris-setosa", "nameId": 2, "svmfeatures": {"1": 5.0, "2": 3.2, "3": 1.2, "4": 0.2}}
-{"f1": 5.7, "f2": 2.9, "f3": 4.2, "f4": 1.3, "name": "Iris-versicolor", "nameId": 3, "svmfeatures": {"1": 5.7, "2": 2.9, "3": 4.2, "4": 1.3}}
-{"f1": 5.2, "f2": 2.7, "f3": 3.9, "f4": 1.4, "name": "Iris-versicolor", "nameId": 3, "svmfeatures": {"1": 5.2, "2": 2.7, "3": 3.9, "4": 1.4}}
-{% endhighlight %}
-
-# Advanced Example
-The following reasonably complex example creates a pipeline from training data and runs the same series of transformations on test data. The input events files should contain JSON data.
-
-The pipeline does the following transformations:
-
-1. Keeps only rows containing a likeids feature
-1. Filters features to only a subset using the Include_features_transform
-1. Splits 3 textual features to make a feature containing tokens
-1. Creates an id feature from the "group" feature, keeping only those that appear at least 200 times in the training data
-1. Creates an id feature from the "category" feature
-1. Creates a TFIDF feature, doing a chi-squared test against the groupId feature created above
-1. Creates an SVM feature from the like TFIDF feature
-
-For the testing data the pipeline is reloaded and run against the test data to perform the same transformations.
-
-{% highlight python %}
-import sys, getopt, argparse
-import seldon.pipeline.basic_transforms as bt
-import seldon.pipeline.tfidf_transform as ptfidf
-import seldon.pipeline.pipelines as pl
-
-def training_pipeline(events,features,models,awskey,awssecret):
-    p = pl.Pipeline(input_folders=events,output_folder=features,local_models_folder="models_train",models_folder=models,aws_key=awskey,aws_secret=awssecret)
-
-    tExist = bt.Exist_features_transform(included=['likeids'])
-    tFilter = bt.Include_features_transform(included=["likeids","group","category","utm_source","utm_medium","utm_campaign","friend_uuids"])
-    tSplit = bt.Split_transform(split_expression=" |\-|\_",ignore_numbers=True,input_features=["utm_source","utm_medium","utm_campaign"])
-    tSplit.set_output_feature("campaign")
-    tGroupId = bt.Feature_id_transform(min_size=200,exclude_missing=True)
-    tGroupId.set_input_feature("group")
-    tGroupId.set_output_feature("groupId")
-    tCatId = bt.Feature_id_transform(min_size=0)
-    tCatId.set_input_feature("category")
-    tCatId.set_output_feature("categoryId")
-    tTfidf = ptfidf.Tfidf_transform(select_features=True,target_feature="groupId")
-    tTfidf.set_input_feature("likeids")
-    tTfidf.set_output_feature("likeid_tfidf")
-    svmTransform = bt.Svmlight_transform(included=["likeid_tfidf"])
-    svmTransform.set_output_feature("svmfeatures")
-    tFilterFinal = bt.Include_features_transform(included=["likeid_tfidf","svmfeatures","groupId","group"])
-    p.add(tExist)
-    p.add(tFilter)
-    p.add(tSplit)
-    p.add(tGroupId)
-    p.add(tCatId)
-    p.add(tTfidf)
-    p.add(svmTransform)
-    p.add(tFilterFinal)
-    p.fit_transform()
-
-def testing_pipeline(events,features,models,awskey,awssecret):
-    p = pl.Pipeline(input_folders=events,output_folder=features,local_models_folder="models_test",models_folder=models,aws_key=awskey,aws_secret=awssecret)
-    p.transform()
-    p.store_features()
+    pw = sutl.Pipeline_wrapper()
+    df = pw.create_dataframe(events)
+    df2 = p.fit(df)
+    pw.save_pipeline(p,models)
 
 
 if __name__ == '__main__':
-    parser = argparse.ArgumentParser(prog='pipeline_example')
-    parser.add_argument('--train_events', help='training events folder', required=True)
-    parser.add_argument('--train_features', help='output features folder', required=True)
-    parser.add_argument('--test_events', help='testing events folder', required=True)
-    parser.add_argument('--test_features', help='output features folder', required=True)
-    parser.add_argument('--train_models', help='output models folder', required=True)
-    parser.add_argument('--aws_key', help='aws key', default=None)
-    parser.add_argument('--aws_secret', help='aws secret', default=None)
+    parser = argparse.ArgumentParser(prog='xgb_pipeline')
+    parser.add_argument('--events', help='events folder', required=True)
+    parser.add_argument('--models', help='output models folder', required=True)
 
     args = parser.parse_args()
     opts = vars(args)
 
-    training_pipeline([args.train_events],args.train_features,args.train_models,args.aws_key,args.aws_secret)
-    testing_pipeline([args.test_events],args.test_features,args.train_models,args.aws_key,args.aws_secret)
+    run_pipeline([args.events],args.models)
+{% endhighlight %}
 
+The example is explained in more detail [here](iris-demo.html).
 
-{% endhighlight %}
 
img/predictive-data-pipelines.png (binary image, -5.55 KB)
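For readers without the Seldon modules installed, the named-step pattern used in the new `run_pipeline` (a list of `(name, step)` pairs ending in an estimator, passed to `sklearn.pipeline.Pipeline`) can be tried with stock scikit-learn. This is a generic sketch of the chaining style, not the commit's code, and it substitutes `LogisticRegression` for the XGBoost estimator:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Same (name, step) list style as transformers = [("tName",...), ("xgb",...)]
iris = load_iris()
p = Pipeline([("scale", StandardScaler()),
              ("clf", LogisticRegression())])
p.fit(iris.data, iris.target)
print(p.score(iris.data, iris.target))  # training accuracy on Iris
```

Calling `fit` runs each transformer's `fit_transform` in order and finally fits the estimator, which is exactly the behavior the Seldon dataframe transformers rely on.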
