Merged
Changes from 1 commit
88 commits
f1919e1
Using sklearn docstring as flow descriptions for sklearn flows
Neeratyoy Aug 5, 2019
0b5137f
Extracting parameter type and descriptions
Neeratyoy Aug 5, 2019
b0ad048
Handling certain edge cases
Neeratyoy Aug 6, 2019
d90f333
More robust failure checks + improved docstrings
Neeratyoy Aug 7, 2019
6dc4345
Trimming of all strings to be uploaded
Neeratyoy Aug 7, 2019
64fa568
Re-enable unit test as server issue is resolved.
PGijsbers Aug 13, 2019
80e5b33
pass skipna=False explicitly
TwsThomas Aug 19, 2019
3880d9a
Sync master and development (#768)
mfeurer Aug 20, 2019
4a6c980
Bump version number (#769)
mfeurer Aug 20, 2019
3d08c2d
Mark unit test as flaky (#770)
mfeurer Aug 20, 2019
58a6609
Fixing edge cases to pass tests
Neeratyoy Aug 24, 2019
41549b0
Fixing PEP8
Neeratyoy Aug 25, 2019
235ded8
Leaner implementation for parameter docstring
Neeratyoy Aug 26, 2019
1c9f64d
Add #737 (#772)
sahithyaravi Sep 2, 2019
9b5d382
Making suggested changes
Neeratyoy Sep 2, 2019
7cbf428
add missing whitespace in error message
amueller Sep 3, 2019
33db051
Merge pull request #776 from amueller/whitespace_typo
mfeurer Sep 4, 2019
27521ac
Merge pull request #766 from TwsThomas/patch-1
mfeurer Sep 4, 2019
43bf02d
Version handling and warning log
Neeratyoy Sep 5, 2019
579498a
Debugging
Neeratyoy Sep 5, 2019
52cbdb7
Debugging phase 2
Neeratyoy Sep 5, 2019
3b44e86
Fixing test cases
Neeratyoy Sep 9, 2019
6710b40
Handling different sklearn versions in unit testing
Neeratyoy Sep 9, 2019
7d685e1
Replace logging.info by logging.warning
mfeurer Sep 13, 2019
c39b9f7
Merge pull request #756 from openml/fix_175
mfeurer Sep 13, 2019
afc7445
Merge pull request #761 from openml/reenable_unittest
mfeurer Sep 13, 2019
5cc1638
FIX assign study's id to study_id for uniformity. (#782)
PGijsbers Sep 20, 2019
fe218bc
raise a warning, not an error, when not matching version exactly (#744)
amueller Sep 26, 2019
dcac17e
store predictions_url in runs (#783)
amueller Sep 26, 2019
8eac076
[WIP] Restructuring the examples section (#785)
ArlindKadra Sep 30, 2019
de0335c
Fix 779 (#787)
PGijsbers Sep 30, 2019
4e03906
Instructions to publish new extensions (#778)
Neeratyoy Sep 30, 2019
f461732
Add username (#790)
sahithyaravi Oct 1, 2019
8cc302d
Add example (#791)
mfeurer Oct 2, 2019
5a2830c
added example strang, and more filter options (#793)
janvanrijn Oct 2, 2019
4020c1e
Add manual task iteration tutorial (#788)
mfeurer Oct 7, 2019
04a6b65
Improve the usage of dataframes in examples (#789)
mfeurer Oct 7, 2019
f241cde
Address comment from Arlind (#802)
mfeurer Oct 7, 2019
1dd54bf
#799: fix mistake in the docs of openml.datasets.functions (#801)
mfeurer Oct 7, 2019
382959f
Add new convenience function get_flow_id (#792)
mfeurer Oct 7, 2019
20a7b62
Replace %-formatting by f-strings in code examples (#798)
konrad Oct 8, 2019
a32f556
Rename argument to be more intuitive (#796)
mfeurer Oct 8, 2019
e1b1652
extended
janvanrijn Oct 11, 2019
3e23a3b
Add example rijn (#803)
janvanrijn Oct 11, 2019
9041dc6
strang example update
janvanrijn Oct 11, 2019
1e85bb6
[WIP] An example that loads and visualizes the iris dataset (#808)
ArlindKadra Oct 11, 2019
2f11939
Fix failing simple_datasets_tutorial example (#812)
ArlindKadra Oct 11, 2019
77cd94b
Merge pull request #807 from openml/extend_example_strang
janvanrijn Oct 11, 2019
24c4821
make output of rijn example a bit nicer
amueller Oct 14, 2019
5f86908
Unit test enabled for list_runs (#817)
prabhant Oct 14, 2019
9467ed4
Add additional part of OpenML error message to exception message (#811)
mfeurer Oct 14, 2019
b259a34
maybe fix link (#816)
amueller Oct 14, 2019
3e14267
make sure repr workes with blank / fresh datasets (#820)
amueller Oct 14, 2019
b96c564
fix issue #305 by not requiring external version in the flow xml (#818)
mfeurer Oct 14, 2019
ef3e4d1
add validation for strings in datasets (#822)
amueller Oct 14, 2019
4853d7c
Example for study and suite (#810)
mfeurer Oct 14, 2019
5b0d4dc
only check strings for new datasets (#824)
amueller Oct 15, 2019
23d4e6f
Fixing fetching of categorical sparse data (#823)
Neeratyoy Oct 15, 2019
29a023c
don't warn if we can convert to dataframe (#829)
amueller Oct 15, 2019
2796b9a
Adding Perrone example for building surrogate
Neeratyoy Oct 15, 2019
17657ab
Merge pull request #815 from amueller/rijn_example_cleanup
janvanrijn Oct 16, 2019
40799f9
warn if there's an empty flow description (#831)
amueller Oct 16, 2019
1a3f456
Intermediate changes; pipeline additions remain
Neeratyoy Oct 16, 2019
6395cd7
Adding list_evaluations_setups() to API docs
Neeratyoy Oct 16, 2019
78e7032
also check dependencies for sklearn string (#830)
amueller Oct 16, 2019
e35262c
Merge pull request #840 from openml/neeratyoy-patch-1
amueller Oct 16, 2019
34d784a
Better error message (#837)
mfeurer Oct 16, 2019
c40e474
add new example regarding svm hyperparameter plotting (#834)
mfeurer Oct 16, 2019
43596e0
Create OpenMLBase, have most OpenML objects derive from it (#828)
PGijsbers Oct 17, 2019
547901f
Fix typos and grammatical errors in docs and examples. (#845)
tashay Oct 17, 2019
35dd7d3
Replace code health by appveyor badge (#843)
mfeurer Oct 17, 2019
c59c3b8
Fix 838 (#846)
sahithyaravi Oct 17, 2019
b1dae0b
Improve SVM test (#848)
mfeurer Oct 17, 2019
cfba39d
Finishing the whole example design
Neeratyoy Oct 17, 2019
9ca9d87
Making pandas related changes suggested by Matthias
Neeratyoy Oct 17, 2019
a5b35e6
Allow datasets without qualities to be downloaded. (#847)
PGijsbers Oct 17, 2019
cd3ba29
minor reformatting
mfeurer Oct 17, 2019
f6a2a95
add a print statement
mfeurer Oct 17, 2019
56fa7f9
Merge pull request #832 from openml/transfer_learning_example
Neeratyoy Oct 18, 2019
2a25ed3
Remove OpenMLDemo unit tests. (#850)
PGijsbers Oct 18, 2019
f74b73a
Put shared logic of Publish into OpenMLBase (#849)
PGijsbers Oct 18, 2019
433f1e7
Optimizing Perrone example (#853)
Neeratyoy Oct 23, 2019
1c025db
Convert non-str column names to str when creating a dataset. (#851)
PGijsbers Oct 23, 2019
d321aba
Add long description (#856)
PGijsbers Oct 24, 2019
d312da0
MAINT prepare new release (#855)
mfeurer Oct 25, 2019
4a13100
redirect test to live server (#859)
mfeurer Oct 29, 2019
882b06b
Add debug output (#860)
mfeurer Nov 4, 2019
34d54d9
Fix736 (#861)
PGijsbers Nov 5, 2019
Fixing edge cases to pass tests
Neeratyoy committed Aug 24, 2019
commit 58a66097456bed82ed7b5ff8fabb81c42ae99fd2
196 changes: 105 additions & 91 deletions openml/extensions/sklearn/extension.py
@@ -501,6 +501,8 @@ def _get_sklearn_description(self, model: Any, char_lim: int = 1024) -> str:
def match_format(s):
return "{}\n{}\n".format(s, len(s) * '-')
s = inspect.getdoc(model)
if s is None:
return ''
if len(s) <= char_lim:
# if the fetched docstring is smaller than char_lim, no trimming required
return s.strip()
@@ -528,6 +530,105 @@ def match_format(s):
s = "{}...".format(s[:char_lim - 3])
return s.strip()
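The trimming behaviour added above can be sketched as a standalone helper (hypothetical names, not part of the openml API; the real method also cuts at numpydoc section headings before truncating):

```python
import inspect


def get_description(model, char_lim=1024):
    """Sketch: fetch an object's docstring, capped at char_lim characters."""
    s = inspect.getdoc(model)
    if s is None:
        # objects without any docstring yield an empty description
        return ''
    if len(s) <= char_lim:
        # short docstrings pass through untrimmed
        return s.strip()
    # hard cap with an ellipsis; OpenML string fields hold 1024 characters
    return "{}...".format(s[:char_lim - 3]).strip()


class TinyEstimator:
    """A tiny estimator used only for illustration."""


print(get_description(TinyEstimator()))  # A tiny estimator used only for illustration.
```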

def _extract_sklearn_parameter_docstring(self, model) -> Union[None, str]:
'''Extracts the part of sklearn docstring containing parameter information

Fetches the entire docstring and trims just the Parameter section.
The assumption is that 'Parameters' is the first section in sklearn docstrings,
followed by other sections titled 'Attributes', 'See also', 'Note', 'References',
appearing in that order if defined.
Returns None if no 'Parameters' section can be found in the docstring.

Parameters
----------
model : sklearn model

Returns
-------
str, or None
'''
def match_format(s):
return "{}\n{}\n".format(s, len(s) * '-')
s = inspect.getdoc(model)
if s is None:
return None
try:
index1 = s.index(match_format("Parameters"))
except ValueError as e:
# when sklearn docstring has no 'Parameters' section
print("{} {}".format(match_format("Parameters"), e))
return None

headings = ["Attributes", "Notes", "See also", "Note", "References"]
for h in headings:
try:
# to find end of Parameters section
index2 = s.index(match_format(h))
break
except ValueError:
print("{} not available in docstring".format(h))
continue
else:
# in the case only 'Parameters' exist, trim till end of docstring
index2 = len(s)
s = s[index1:index2]
return s.strip()
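In isolation, the section-slicing logic reads like this (a minimal sketch with hypothetical names; the heading list mirrors the one used above):

```python
def heading(title):
    # numpydoc renders a section heading as the title underlined with dashes
    return "{}\n{}\n".format(title, len(title) * '-')


def parameters_section(doc):
    """Sketch: slice the 'Parameters' section out of a numpydoc docstring."""
    if doc is None:
        return None
    try:
        start = doc.index(heading("Parameters"))
    except ValueError:
        return None  # the docstring has no 'Parameters' section
    end = len(doc)  # if no later section exists, keep everything to the end
    for h in ("Attributes", "Notes", "See also", "Note", "References"):
        try:
            end = doc.index(heading(h))
            break
        except ValueError:
            continue
    return doc[start:end].strip()


doc = """Toy estimator.

Parameters
----------
alpha : float
    Regularisation strength.

Attributes
----------
coef_ : ndarray
"""
print(parameters_section(doc))
```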

def _extract_sklearn_param_info(self, model, char_lim=1024) -> Union[None, Dict]:
'''Parses parameter type and description from sklearn docstring

Parameters
----------
model : sklearn model
char_lim : int
Specifying the max length of the returned string.
OpenML servers have a constraint of 1024 characters string fields.

Returns
-------
Dict, or None
'''
docstring = self._extract_sklearn_parameter_docstring(model)
if docstring is None:
# when sklearn docstring has no 'Parameters' section
return None

n = re.compile("[.]*\n", flags=IGNORECASE)
lines = n.split(docstring)
p = re.compile("[a-z0-9_ ]+ : [a-z0-9_']+[a-z0-9_ ]*", flags=IGNORECASE)
parameter_docs = OrderedDict() # type: Dict
description = [] # type: List

# collecting parameters and their descriptions
for i, s in enumerate(lines):
param = p.findall(s)
if param != []:
if len(description) > 0:
description[-1] = '\n'.join(description[-1]).strip()
if len(description[-1]) > char_lim:
description[-1] = "{}...".format(description[-1][:char_lim - 3])
description.append([])
else:
if len(description) > 0:
description[-1].append(s)
description[-1] = '\n'.join(description[-1]).strip()
if len(description[-1]) > char_lim:
description[-1] = "{}...".format(description[-1][:char_lim - 3])

# collecting parameters and their types
matches = p.findall(docstring)
for i, param in enumerate(matches):
key, value = param.split(':')
parameter_docs[key.strip()] = [value.strip(), description[i]]

# to avoid KeyError for missing parameters
param_list_true = list(model.get_params().keys())
param_list_found = list(parameter_docs.keys())
for param in list(set(param_list_true) - set(param_list_found)):
parameter_docs[param] = [None, None]

return parameter_docs
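The regular expression above recognises numpydoc parameter declarations of the form `name : type`. A reduced sketch of the name/type half of the parsing (descriptions omitted, helper names hypothetical):

```python
import re
from collections import OrderedDict

# same pattern as above: a parameter name, ' : ', then a type token
PARAM_LINE = re.compile(r"[a-z0-9_ ]+ : [a-z0-9_']+[a-z0-9_ ]*", flags=re.IGNORECASE)


def parse_param_types(parameters_block):
    """Sketch: map each declared parameter name to its declared type."""
    types = OrderedDict()
    for match in PARAM_LINE.findall(parameters_block):
        # description lines carry no ' : ' separator and are never matched
        name, type_ = match.split(':')
        types[name.strip()] = type_.strip()
    return types


block = """Parameters
----------
alpha : float
    Regularisation strength.
fit_intercept : bool
    Whether to fit an intercept term.
"""
print(parse_param_types(block))
```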

def _serialize_model(self, model: Any) -> OpenMLFlow:
"""Create an OpenMLFlow.

@@ -656,97 +757,6 @@ def _check_multiple_occurence_of_component_in_flow(
known_sub_components.add(visitee.name)
to_visit_stack.extend(visitee.components.values())

def _extract_sklearn_parameter_docstring(self, model) -> Union[None, str]:
'''Extracts the part of sklearn docstring containing parameter information

Fetches the entire docstring and trims just the Parameter section.
The assumption is that 'Parameters' is the first section in sklearn docstrings,
followed by other sections titled 'Attributes', 'See also', 'Note', 'References',
appearing in that order if defined.
Returns None if no 'Parameters' section can be found in the docstring.

Parameters
----------
model : sklearn model

Returns
-------
str, or None
'''
def match_format(s):
return "{}\n{}\n".format(s, len(s) * '-')
s = inspect.getdoc(model)
try:
index1 = s.index(match_format("Parameters"))
except ValueError as e:
# when sklearn docstring has no 'Parameters' section
print("{} {}".format(match_format("Parameters"), e))
return None

headings = ["Attributes", "Notes", "See also", "Note", "References"]
for h in headings:
try:
# to find end of Parameters section
index2 = s.index(match_format(h))
break
except ValueError:
print("{} not available in docstring".format(h))
continue
else:
# in the case only 'Parameters' exist, trim till end of docstring
index2 = len(s)
s = s[index1:index2]
return s.strip()

def _extract_sklearn_param_info(self, model, char_lim=1024) -> Union[None, Dict]:
'''Parses parameter type and description from sklearn docstring

Parameters
----------
model : sklearn model
char_lim : int
Specifying the max length of the returned string.
OpenML servers have a constraint of 1024 characters string fields.

Returns
-------
Dict, or None
'''
docstring = self._extract_sklearn_parameter_docstring(model)
if docstring is None:
# when sklearn docstring has no 'Parameters' section
return None

n = re.compile("[.]*\n", flags=IGNORECASE)
lines = n.split(docstring)
p = re.compile("[a-z0-9_ ]+ : [a-z0-9_']+[a-z0-9_ ]*", flags=IGNORECASE)
parameter_docs = OrderedDict() # type: Dict
description = [] # type: List

# collecting parameters and their descriptions
for i, s in enumerate(lines):
param = p.findall(s)
if param != []:
if len(description) > 0:
description[-1] = '\n'.join(description[-1]).strip()
if len(description[-1]) > char_lim:
description[-1] = "{}...".format(description[-1][:char_lim - 3])
description.append([])
else:
if len(description) > 0:
description[-1].append(s)
description[-1] = '\n'.join(description[-1]).strip()
if len(description[-1]) > char_lim:
description[-1] = "{}...".format(description[-1][:char_lim - 3])

# collecting parameters and their types
matches = p.findall(docstring)
for i, param in enumerate(matches):
key, value = param.split(':')
parameter_docs[key.strip()] = [value.strip(), description[i]]

return parameter_docs

def _extract_information_from_model(
self,
model: Any,
@@ -890,6 +900,10 @@ def flatten_all(list_):
parameters[k] = None

if parameters_docs is not None:
# print(type(model))
# print(sorted(parameters_docs.keys()))
# print(sorted(model_parameters.keys()))
# print()
data_type, description = parameters_docs[k]
parameters_meta_info[k] = OrderedDict((('description', description),
('data_type', data_type)))
31 changes: 31 additions & 0 deletions openml/flows/functions.py
@@ -366,6 +366,10 @@ def assert_flows_equal(flow1: OpenMLFlow, flow2: OpenMLFlow,
ignore_custom_name_if_none)
elif key == '_extension':
continue
elif key == 'description':
# ignore matching of descriptions, since sklearn-based flows may have
# differing docstrings which are not guaranteed to be consistent
continue
else:
if key == 'parameters':
if ignore_parameter_values or \
@@ -397,6 +401,33 @@ def assert_flows_equal(flow1: OpenMLFlow, flow2: OpenMLFlow,
# Helps with backwards compatibility as `custom_name` is now auto-generated, but
# before it used to be `None`.
continue
elif key == 'parameters_meta_info':
# this value is a dictionary where each key is a parameter name, containing another
# dictionary with keys specifying the parameter's 'description' and 'data_type'
# check of descriptions can be ignored since that might change
# data type check can be ignored if one of them is not defined, i.e., None
params1 = set(flow1.parameters_meta_info.keys())
params2 = set(flow2.parameters_meta_info.keys())
if params1 != params2:
raise ValueError('Parameter list in meta info for parameters differ in the two flows.')
# iterating over the parameter's meta info list
for param in params1:
if isinstance(flow1.parameters_meta_info[param], Dict) and \
isinstance(flow2.parameters_meta_info[param], Dict) and \
'data_type' in flow1.parameters_meta_info[param] and \
'data_type' in flow2.parameters_meta_info[param]:
value1 = flow1.parameters_meta_info[param]['data_type']
value2 = flow2.parameters_meta_info[param]['data_type']
else:
value1 = flow1.parameters_meta_info[param]
value2 = flow2.parameters_meta_info[param]
if value1 is None or value2 is None:
continue
elif value1 != value2:
raise ValueError("Flow {}: data type for parameter {} in parameters_meta_info "
"differs: {}\nvs\n{}".format(flow1.name, param, value1, value2))
# the continue is to avoid the 'attr != attr2' check at end of function
continue
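The relaxed comparison introduced here can be summarised in a standalone sketch (hypothetical helper, assuming plain dicts for the meta info): parameter sets must match, descriptions are ignored, and a data type is only compared when both flows define one.

```python
def data_types_match(meta1, meta2):
    """Sketch of the relaxed parameters_meta_info comparison."""
    if set(meta1) != set(meta2):
        raise ValueError("Parameter list in meta info differs between the flows.")
    for param in meta1:
        t1 = meta1[param].get('data_type')
        t2 = meta2[param].get('data_type')
        if t1 is None or t2 is None:
            continue  # an undefined type on either side is not a mismatch
        if t1 != t2:
            raise ValueError("Data type for parameter {} differs: {} vs {}"
                             .format(param, t1, t2))
    return True


meta_a = {'alpha': {'description': 'old text', 'data_type': 'float'},
          'tol': {'description': None, 'data_type': None}}
meta_b = {'alpha': {'description': 'new text', 'data_type': 'float'},
          'tol': {'description': 'tolerance', 'data_type': 'float'}}
print(data_types_match(meta_a, meta_b))  # True
```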

if attr1 != attr2:
raise ValueError("Flow %s: values for attribute '%s' differ: "
@@ -75,7 +75,7 @@ def test_serialize_model(self):

fixture_name = 'sklearn.tree.tree.DecisionTreeClassifier'
fixture_short_name = 'sklearn.DecisionTreeClassifier'
- fixture_description = 'Automatically created scikit-learn flow.'
+ fixture_description = self.extension._get_sklearn_description(model)
version_fixture = 'sklearn==%s\nnumpy>=1.6.1\nscipy>=0.9' \
% sklearn.__version__
# min_impurity_decrease has been introduced in 0.20
@@ -143,7 +143,7 @@ def test_serialize_model_clustering(self):

fixture_name = 'sklearn.cluster.k_means_.KMeans'
fixture_short_name = 'sklearn.KMeans'
- fixture_description = 'Automatically created scikit-learn flow.'
+ fixture_description = self.extension._get_sklearn_description(model)
version_fixture = 'sklearn==%s\nnumpy>=1.6.1\nscipy>=0.9' \
% sklearn.__version__
# n_jobs default has changed to None in 0.20
@@ -207,10 +207,10 @@ def test_serialize_model_with_subcomponent(self):
'(base_estimator=sklearn.tree.tree.DecisionTreeClassifier)'
fixture_class_name = 'sklearn.ensemble.weight_boosting.AdaBoostClassifier'
fixture_short_name = 'sklearn.AdaBoostClassifier'
- fixture_description = 'Automatically created scikit-learn flow.'
+ fixture_description = self.extension._get_sklearn_description(model)
fixture_subcomponent_name = 'sklearn.tree.tree.DecisionTreeClassifier'
fixture_subcomponent_class_name = 'sklearn.tree.tree.DecisionTreeClassifier'
- fixture_subcomponent_description = 'Automatically created scikit-learn flow.'
+ fixture_subcomponent_description = self.extension._get_sklearn_description(model.base_estimator)
fixture_structure = {
fixture_name: [],
'sklearn.tree.tree.DecisionTreeClassifier': ['base_estimator']
@@ -264,7 +264,7 @@ def test_serialize_pipeline(self):
'scaler=sklearn.preprocessing.data.StandardScaler,' \
'dummy=sklearn.dummy.DummyClassifier)'
fixture_short_name = 'sklearn.Pipeline(StandardScaler,DummyClassifier)'
- fixture_description = 'Automatically created scikit-learn flow.'
+ fixture_description = self.extension._get_sklearn_description(model)
fixture_structure = {
fixture_name: [],
'sklearn.preprocessing.data.StandardScaler': ['scaler'],
@@ -353,7 +353,7 @@ def test_serialize_pipeline_clustering(self):
'scaler=sklearn.preprocessing.data.StandardScaler,' \
'clusterer=sklearn.cluster.k_means_.KMeans)'
fixture_short_name = 'sklearn.Pipeline(StandardScaler,KMeans)'
- fixture_description = 'Automatically created scikit-learn flow.'
+ fixture_description = self.extension._get_sklearn_description(model)
fixture_structure = {
fixture_name: [],
'sklearn.preprocessing.data.StandardScaler': ['scaler'],
@@ -445,7 +445,7 @@ def test_serialize_column_transformer(self):
'numeric=sklearn.preprocessing.data.StandardScaler,' \
'nominal=sklearn.preprocessing._encoders.OneHotEncoder)'
fixture_short_name = 'sklearn.ColumnTransformer'
- fixture_description = 'Automatically created scikit-learn flow.'
+ fixture_description = self.extension._get_sklearn_description(model)
fixture_structure = {
fixture: [],
'sklearn.preprocessing.data.StandardScaler': ['numeric'],
@@ -504,7 +504,7 @@ def test_serialize_column_transformer_pipeline(self):
fixture_name: [],
}

- fixture_description = 'Automatically created scikit-learn flow.'
+ fixture_description = self.extension._get_sklearn_description(model)
serialization = self.extension.model_to_flow(model)
structure = serialization.get_structure('name')
self.assertEqual(serialization.name, fixture_name)
1 change: 0 additions & 1 deletion tests/test_flows/test_flow_functions.py
@@ -95,7 +95,6 @@ def test_are_flows_equal(self):
# Test most important values that can be set by a user
openml.flows.functions.assert_flows_equal(flow, flow)
for attribute, new_value in [('name', 'Tes'),
- ('description', 'Test flo'),
('external_version', '2'),
('language', 'english'),
('dependencies', 'ab'),