Feature extraction pipelines allow you to define a repeatable process for transforming a set of input features before you build a machine learning model on the final feature set. When the resulting model is put into production, the feature pipeline must be rerun on each incoming feature set before it is passed to the model for scoring.
Seldon feature pipelines are presently available in Python. We plan to provide Spark-based pipelines in the future.
## Python modules
Seldon provides a set of [python modules](python-package.html) to help construct feature pipelines for use inside Seldon. We use [scikit-learn](http://scikit-learn.org/stable/) pipelines and [Pandas](http://pandas.pydata.org/). For feature extraction and transformation we provide a starter set of python scikit-learn Transformers that take Pandas dataframes as input, apply some transformations, and output Pandas dataframes.
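
To make that contract concrete, here is a minimal sketch of such a transformer. It is not part of the Seldon package; the class name, parameters and example column are purely illustrative.

{% highlight python %}
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LowercaseTransform(BaseEstimator, TransformerMixin):
    """Illustrative only: lower-cases one text column of a dataframe."""

    def __init__(self, input_feature=None, output_feature=None):
        self.input_feature = input_feature
        self.output_feature = output_feature

    def fit(self, df, y=None):
        return self  # nothing to learn for this transform

    def transform(self, df):
        df = df.copy()
        df[self.output_feature] = df[self.input_feature].str.lower()
        return df

df = pd.DataFrame({"name": ["Iris-Setosa", "Iris-Virginica"]})
print(LowercaseTransform(input_feature="name", output_feature="name_lc").transform(df))
{% endhighlight %}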
The currently available example transforms are listed below (a short usage sketch follows the list):
* **Include_features_transform** : include a subset of features
* **Exclude_features_transform** : exclude some subset of features
* **Feature_id_transform** : create an id feature from some input feature
* **Tfidf_transform** : create TFIDF features from an input feature
* **Auto_transform** : attempt to automatically normalize and create numeric, categorical and date features
* **sklearn_transform** : apply a [sklearn Transformer](http://scikit-learn.org/stable/data_transforms.html) to a Pandas Dataframe
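
The sketch below applies two of these transforms to a small dataframe. The constructor arguments shown (`included`, `input_feature`, `output_feature`) follow the style of the iris example later on this page but are assumptions here; check them against the [python modules](python-package.html) documentation.

{% highlight python %}
import pandas as pd
import seldon.pipeline.basic_transforms as bt

df = pd.DataFrame([{"name": "setosa", "f1": 5.1, "junk": "a"},
                   {"name": "virginica", "f1": 6.3, "junk": "b"}])

# Keep only the features of interest (argument name assumed)
t_include = bt.Include_features_transform(included=["name", "f1"])
df2 = t_include.fit_transform(df)

# Map the textual "name" feature to a numeric id feature
t_id = bt.Feature_id_transform(input_feature="name", output_feature="nameId")
df3 = t_id.fit_transform(df2)
{% endhighlight %}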
## Creating a Machine Learning model
As the final stage of any pipeline you would usually add a scikit-learn estimator. We provide three built-in estimators that accept Pandas dataframes as input, plus a general estimator that can wrap any scikit-learn compatible estimator (a usage sketch follows the list):
* **XGBoostClassifier** : XGBoost classifier which allows Pandas Dataframes as input
* **VWClassifier** : VW classifier which allows Pandas Dataframes as input
* **KerasClassifier** : Keras classifier which allows Pandas Dataframes as input
* **SKLearnClassifier** : General classifier that runs any [sklearn classifier](http://scikit-learn.org/stable/supervised_learning.html) taking Pandas dataframes as input.
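
Each of these estimators is designed to sit as the last step of a standard scikit-learn `Pipeline`, after the dataframe transforms. A minimal sketch, assuming (as in the iris example below) that the `target` constructor argument names the label column:

{% highlight python %}
from sklearn.pipeline import Pipeline
import seldon.pipeline.basic_transforms as bt
import seldon.xgb as xg

# Transforms first, estimator last; "target" naming the label column
# is an assumption based on the iris example below.
p = Pipeline([
    ("ids", bt.Feature_id_transform(input_feature="name", output_feature="nameId")),
    ("clf", xg.XGBoostClassifier(target="nameId")),
])
{% endhighlight %}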
## Simple Predictive Pipeline using the Iris Dataset
Example pipelines that perform very simple feature extraction on the Iris dataset are contained in the code at `external/predictor/python/docker/examples/iris`. These pipelines build on Seldon's Docker pipeline base and:
1. Create an id feature from the name feature
1. Create an SVMLight feature from the four core predictive features
1. Create a model with either XGBoost, Vowpal Wabbit or Keras
The pipeline utilizing XGBoost is shown below:
{% highlight python %}
import sys, getopt, argparse
import seldon.pipeline.basic_transforms as bt
import seldon.pipeline.util as sutl
import seldon.pipeline.auto_transforms as pauto
from sklearn.pipeline import Pipeline
import seldon.xgb as xg

def run_pipeline(events, models):
    # NOTE: the body below is a sketch that completes the truncated example;
    # the exact transform parameters live in examples/iris in the repo.
    tNameId = bt.Feature_id_transform(min_size=0, exclude_missing=True,
                                      input_feature="name", output_feature="nameId")
    tAuto = pauto.Auto_transform(exclude=["name", "nameId"])
    xgb = xg.XGBoostClassifier(target="nameId", target_readable="name", excluded=["name"])
    p = Pipeline([("tName", tNameId), ("tAuto", tAuto), ("xgb", xgb)])

    pw = sutl.Pipeline_wrapper()
    df = pw.create_dataframe(events)  # load the input events into a dataframe
    p.fit(df)                         # run the transforms and train the classifier
    pw.save_pipeline(p, models)       # persist pipeline and model for later scoring
{% endhighlight %}
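
The Vowpal Wabbit and Keras pipelines in the same examples directory follow the same structure, swapping the final XGBoost stage for the `VWClassifier` or `KerasClassifier` estimator.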