Merged
56 commits
b45f6f2
Adding importable helper functions
Neeratyoy Oct 29, 2020
8e7ea0b
Changing import of cat, cont
Neeratyoy Oct 29, 2020
102a084
Merge branch 'develop' into fix_773
Neeratyoy Oct 29, 2020
18a2dba
Better docstrings
Neeratyoy Oct 30, 2020
381c267
Adding unit test to check ColumnTransformer
Neeratyoy Oct 30, 2020
5dbff2e
Refinements from @mfeurer
Neeratyoy Nov 2, 2020
fc4ec73
Editing example to support both NumPy and Pandas
Neeratyoy Nov 2, 2020
8d5cad9
Merge branch 'develop' into fix_773
Neeratyoy Nov 3, 2020
3d66404
Merge branch 'develop' into fix_773
Neeratyoy Nov 4, 2020
90c8de6
Unit test fix to mark for deletion
Neeratyoy Nov 4, 2020
e0af15e
Making some unit tests work
Neeratyoy Nov 10, 2020
14aa11d
Waiting for dataset to be processed
Neeratyoy Nov 16, 2020
31d48d8
Minor test collection fix
Neeratyoy Nov 16, 2020
431447c
Template to handle missing tasks
Neeratyoy Nov 30, 2020
cc3199e
Accounting for more missing tasks:
Neeratyoy Nov 30, 2020
8a29668
Fixing some more unit tests
Neeratyoy Nov 30, 2020
405e03c
Simplifying check_task_existence
Neeratyoy Nov 30, 2020
caf4f46
black changes
Neeratyoy Dec 4, 2020
b308e71
Minor formatting
Neeratyoy Dec 8, 2020
436a9fe
Handling task exists check
Neeratyoy Dec 9, 2020
ddd8b04
Testing edited check task func
Neeratyoy Dec 14, 2020
74ae622
Merge branch 'fix_unit_tests' of https://github.com/openml/openml-pyt…
Neeratyoy Dec 14, 2020
50ce90e
Flake fix
Neeratyoy Dec 15, 2020
aea2832
Updating with fixed unit tests from PR #1000
Neeratyoy Dec 15, 2020
56cd639
More retries on connection error
Neeratyoy Dec 16, 2020
8e8ea2e
Adding max_retries to config default
Neeratyoy Dec 17, 2020
d518beb
Update database retry unit test
Neeratyoy Dec 17, 2020
37d9f6b
Print to debug hash exception
Neeratyoy Dec 17, 2020
9bd4892
Fixing checksum unit test
Neeratyoy Dec 17, 2020
dc41b5d
Retry on _download_text_file
Neeratyoy Dec 18, 2020
396cb8d
Update datasets_tutorial.py
mfeurer Dec 21, 2020
8f380de
Update custom_flow_tutorial.py
mfeurer Dec 21, 2020
bc1745e
Update test_study_functions.py
mfeurer Dec 21, 2020
d95b5e6
Update test_dataset_functions.py
mfeurer Dec 21, 2020
d58ca5a
Merge branch 'fix_unit_tests' into fix_773
Neeratyoy Dec 21, 2020
91c6cf5
more retries, but also more time between retries
mfeurer Dec 21, 2020
b43a0e0
Merge branch 'fix_unit_tests' of https://github.com/openml/openml-pyt…
Neeratyoy Dec 21, 2020
a9430b3
allow for even more retries on get calls
mfeurer Dec 21, 2020
e9cfba8
Catching failed get task
Neeratyoy Dec 21, 2020
c13f6ce
Merge branch 'fix_unit_tests' of https://github.com/openml/openml-pyt…
Neeratyoy Dec 21, 2020
3d7abc2
undo stupid change
mfeurer Dec 21, 2020
94576b1
Merge branch 'fix_unit_tests' of https://github.com/openml/openml-pyt…
Neeratyoy Dec 21, 2020
b5e1242
fix one more test
mfeurer Dec 21, 2020
d764aad
Merge branch 'fix_unit_tests' into fix_773
Neeratyoy Dec 21, 2020
f5e4a3e
Refactoring md5 hash check inside _send_request
Neeratyoy Dec 21, 2020
c065dfc
Merge branch 'fix_unit_tests' into fix_773
Neeratyoy Dec 21, 2020
07ce722
Fixing a fairly common unit test fail
Neeratyoy Dec 22, 2020
82e1b72
Reverting loose check on unit test
Neeratyoy Dec 23, 2020
936c252
Merge branch 'fix_unit_tests' into fix_773
Neeratyoy Dec 23, 2020
fc8b464
Merge branch 'develop' into fix_773
PGijsbers Dec 24, 2020
46ab043
Fixing integer type check to allow np.integer
Neeratyoy Jan 22, 2021
1be82c3
Trying to loosen check on unit test as fix
Neeratyoy Jan 25, 2021
dfbf5e5
Examples support for pandas=1.2.1
Neeratyoy Jan 27, 2021
b611f9f
pandas indexing as iloc
Neeratyoy Jan 27, 2021
93833c3
fix example: actually load the different tasks
mfeurer Jan 28, 2021
f6aa7ed
Renaming custom flow to disable tutorial (#1019)
Neeratyoy Jan 28, 2021
11 changes: 2 additions & 9 deletions examples/30_extended/run_setup_tutorial.py
Original file line number Diff line number Diff line change
@@ -34,6 +34,8 @@

 import numpy as np
 import openml
+from openml.extensions.sklearn import cat, cont
+
 from sklearn.pipeline import make_pipeline, Pipeline
 from sklearn.compose import ColumnTransformer
 from sklearn.impute import SimpleImputer
@@ -57,15 +59,6 @@
 # easy as you want it to be


-# Helper functions to return required columns for ColumnTransformer
-def cont(X):
-    return X.dtypes != "category"
-
-
-def cat(X):
-    return X.dtypes == "category"
-
-
 cat_imp = make_pipeline(
     SimpleImputer(strategy="most_frequent"),
     OneHotEncoder(handle_unknown="ignore", sparse=False),
9 changes: 4 additions & 5 deletions openml/datasets/functions.py
@@ -183,7 +183,7 @@ def list_datasets(
     status: Optional[str] = None,
     tag: Optional[str] = None,
     output_format: str = "dict",
-    **kwargs
+    **kwargs,
 ) -> Union[Dict, pd.DataFrame]:

     """
@@ -251,7 +251,7 @@ def list_datasets(
         size=size,
         status=status,
         tag=tag,
-        **kwargs
+        **kwargs,
     )


@@ -334,8 +334,7 @@ def _load_features_from_file(features_file: str) -> Dict:


 def check_datasets_active(
-    dataset_ids: List[int],
-    raise_error_if_not_exist: bool = True,
+    dataset_ids: List[int], raise_error_if_not_exist: bool = True,
 ) -> Dict[int, bool]:
     """
     Check if the dataset ids provided are active.
@@ -363,7 +362,7 @@
         dataset = dataset_list.get(did, None)
         if dataset is None:
             if raise_error_if_not_exist:
-                raise ValueError(f'Could not find dataset {did} in OpenML dataset list.')
+                raise ValueError(f"Could not find dataset {did} in OpenML dataset list.")
         else:
             active[did] = dataset["status"] == "active"
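The `check_datasets_active` logic changed above can be sketched as a plain function. A minimal offline sketch, assuming `dataset_list` is a dict mapping dataset id to a metadata dict with a `"status"` field; that parameter is an assumption introduced here to stand in for the result of `openml.datasets.list_datasets`, so the sketch runs without a server:

```python
def check_datasets_active(dataset_ids, dataset_list, raise_error_if_not_exist=True):
    # dataset_list: hypothetical stand-in for the id -> metadata mapping
    # that openml.datasets.list_datasets would return.
    active = {}
    for did in dataset_ids:
        dataset = dataset_list.get(did, None)
        if dataset is None:
            if raise_error_if_not_exist:
                raise ValueError(f"Could not find dataset {did} in OpenML dataset list.")
        else:
            # A dataset is reported active only when its status says so.
            active[did] = dataset["status"] == "active"
    return active


# Mirrors test_check_datasets_active below: id 79 is absent from the
# listing, so with raise_error_if_not_exist=False it is silently skipped.
fake_list = {2: {"status": "active"}, 17: {"status": "deactivated"}}
active = check_datasets_active([2, 17, 79], fake_list, raise_error_if_not_exist=False)
```

Note the contract this encodes: missing ids either raise or are omitted from the result, so callers must use `active.get(did)` rather than `active[did]` when they opted out of the error.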
28 changes: 28 additions & 0 deletions openml/extensions/sklearn/__init__.py
@@ -7,3 +7,31 @@
 __all__ = ["SklearnExtension"]

 register_extension(SklearnExtension)
+
+
+def cont(X):
+    """Returns True for all non-categorical columns, False for the rest.
+
+    This function is required to work with default OpenML datasets as DataFrames allowing
+    mixed data types. To build sklearn models on mixed data types, a ColumnTransformer is
+    required to process each type of column separately.
+    This function allows transformations meant for continuous/numeric columns to access the
+    continuous/numeric columns, given the dataset as a DataFrame.
+    """
+    if not hasattr(X, "dtypes"):
+        raise AttributeError("Not a Pandas DataFrame with 'dtypes' as attribute!")
+    return X.dtypes != "category"
+
+
+def cat(X):
+    """Returns True for all categorical columns, False for the rest.
+
+    This function is required to work with default OpenML datasets as DataFrames allowing
+    mixed data types. To build sklearn models on mixed data types, a ColumnTransformer is
+    required to process each type of column separately.
+    This function allows transformations meant for categorical columns to access the
+    categorical columns, given the dataset as a DataFrame.
+    """
+    if not hasattr(X, "dtypes"):
+        raise AttributeError("Not a Pandas DataFrame with 'dtypes' as attribute!")
+    return X.dtypes == "category"
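The `cat` and `cont` helpers added here act as column selectors: `ColumnTransformer` accepts a callable that receives the input DataFrame and returns a boolean mask over its columns. A small sketch of how those masks behave; the example DataFrame is invented for illustration, and only the mask logic (not the full sklearn pipeline) is exercised:

```python
import pandas as pd


def cont(X):
    # True for every non-categorical column
    return X.dtypes != "category"


def cat(X):
    # True for every categorical column
    return X.dtypes == "category"


# Invented example frame with one numeric and one categorical column
X = pd.DataFrame(
    {
        "age": [23.0, 41.0, 35.0],
        "color": pd.Series(["red", "blue", "red"], dtype="category"),
    }
)

num_mask = cont(X)  # age -> True, color -> False
cat_mask = cat(X)   # age -> False, color -> True
```

Inside a `ColumnTransformer` these callables are passed as the third element of each transformer tuple, as the new unit test below does with `("num", StandardScaler(), cont)` and `("cat", OneHotEncoder(handle_unknown="ignore"), cat)`.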
10 changes: 1 addition & 9 deletions openml/testing.py
@@ -267,12 +267,4 @@ class CustomImputer(SimpleImputer):
     pass


-def cont(X):
-    return X.dtypes != "category"
-
-
-def cat(X):
-    return X.dtypes == "category"
-
-
-__all__ = ["TestBase", "SimpleImputer", "CustomImputer", "cat", "cont"]
+__all__ = ["TestBase", "SimpleImputer", "CustomImputer"]
5 changes: 1 addition & 4 deletions tests/test_datasets/test_dataset_functions.py
@@ -227,10 +227,7 @@ def test_list_datasets_empty(self):
     def test_check_datasets_active(self):
         # Have to test on live because there is no deactivated dataset on the test server.
         openml.config.server = self.production_server
-        active = openml.datasets.check_datasets_active(
-            [2, 17, 79],
-            raise_error_if_not_exist=False,
-        )
+        active = openml.datasets.check_datasets_active([2, 17, 79], raise_error_if_not_exist=False,)
         self.assertTrue(active[2])
         self.assertFalse(active[17])
         self.assertIsNone(active.get(79))
@@ -40,7 +40,8 @@
 from openml.flows import OpenMLFlow
 from openml.flows.functions import assert_flows_equal
 from openml.runs.trace import OpenMLRunTrace
-from openml.testing import TestBase, SimpleImputer, CustomImputer, cat, cont
+from openml.testing import TestBase, SimpleImputer, CustomImputer
+from openml.extensions.sklearn import cat, cont


 this_directory = os.path.dirname(os.path.abspath(__file__))
@@ -2183,16 +2184,6 @@ def test_failed_serialization_of_custom_class(self):
             # for lower versions
             from sklearn.preprocessing import Imputer as SimpleImputer

-        class CustomImputer(SimpleImputer):
-            pass
-
-        def cont(X):
-            return X.dtypes != "category"
-
-        def cat(X):
-            return X.dtypes == "category"
-
         import sklearn.metrics
         import sklearn.tree
         from sklearn.pipeline import Pipeline, make_pipeline
         from sklearn.compose import ColumnTransformer
@@ -2215,3 +2206,37 @@ def cat(X):
                 raise AttributeError(e)
             else:
                 raise Exception(e)

+    @unittest.skipIf(
+        LooseVersion(sklearn.__version__) < "0.20",
+        reason="columntransformer introduction in 0.20.0",
+    )
+    def test_setupid_with_column_transformer(self):
+        """Test to check if inclusion of ColumnTransformer in a pipeline is treated as a new
+        flow each time.
+        """
+        import sklearn.compose
+        from sklearn.svm import SVC
+
+        def column_transformer_pipe(task_id):
+            task = openml.tasks.get_task(task_id)
+            # make columntransformer
+            preprocessor = sklearn.compose.ColumnTransformer(
+                transformers=[
+                    ("num", StandardScaler(), cont),
+                    ("cat", OneHotEncoder(handle_unknown="ignore"), cat),
+                ]
+            )
+            # make pipeline
+            clf = SVC(gamma="scale", random_state=1)
+            pipe = make_pipeline(preprocessor, clf)
+            # run task
+            run = openml.runs.run_model_on_task(pipe, task, avoid_duplicate_runs=True)
Collaborator:
This seems to fail on several jobs (but strangely not all-but-one, perhaps due to race conditions?). And shouldn't that be correct? There is no part of this pipeline that has a dynamic setup, which means that after the first run upload it is correct to raise an error, isn't it?

Contributor Author:
That's a good point that I missed. I think for now the correct approach might be to mark the run/setup to be removed using the _mark_entity_for_removal feature. I'll push that change, watch the tests, and then ask for a review if they pass!

Contributor Author:
Also, @mfeurer is cleaning the test servers, and he recommended waiting this week out before judging whether this error is a result of those changes. Nevertheless, I made the push with the change, which should be there in any case.

Collaborator:
Okay, we'll wait.

"if this error is a result of those changes."

I imagine the times it did pass are those where a previous upload had been deleted by the clean-up script in between.

Collaborator:
I just had a look at this PR, and it appears that this error and an error building the docs are the only things holding back merging it. @Neeratyoy could you please have a look into this?
+            run.publish()
+            new_run = openml.runs.get_run(run.run_id)
+            return new_run.setup_id
+
+        setup1 = column_transformer_pipe(23)
+        setup2 = column_transformer_pipe(230)
+
+        self.assertEqual(setup1, setup2)
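The test above asserts that the same pipeline run on two different tasks resolves to one OpenML setup id: a setup is a flow together with one concrete hyperparameter assignment. The sketch below illustrates that idea only; `setup_key` is a hypothetical helper invented here, not the OpenML server's actual matching algorithm, and it shows why neither the task nor the order in which parameters are listed can produce a new setup:

```python
import hashlib
import json


def setup_key(flow_name, params):
    # Hypothetical illustration: canonicalise the (flow, parameters) pair so
    # that logically identical setups map to one key. json.dumps with
    # sort_keys=True makes the serialisation independent of dict order.
    canonical = json.dumps({"flow": flow_name, "params": params}, sort_keys=True)
    return hashlib.md5(canonical.encode()).hexdigest()


# Same flow and same parameter values, listed in different orders,
# as when column_transformer_pipe runs on task 23 and task 230.
a = setup_key("Pipeline(ColumnTransformer, SVC)", {"svc__gamma": "scale", "svc__random_state": 1})
b = setup_key("Pipeline(ColumnTransformer, SVC)", {"svc__random_state": 1, "svc__gamma": "scale"})
```

This is also why the reviewers' point above holds: a pipeline with no dynamic setup uploads the identical setup on re-run, so `avoid_duplicate_runs=True` is expected to raise after the first upload.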
3 changes: 2 additions & 1 deletion tests/test_runs/test_run_functions.py
@@ -20,7 +20,8 @@
 import pandas as pd

 import openml.extensions.sklearn
-from openml.testing import TestBase, SimpleImputer, CustomImputer, cat, cont
+from openml.testing import TestBase, SimpleImputer, CustomImputer
+from openml.extensions.sklearn import cat, cont
 from openml.runs.functions import _run_task_get_arffcontent, run_exists, format_prediction
 from openml.runs.trace import OpenMLRunTrace
 from openml.tasks import TaskType
3 changes: 2 additions & 1 deletion tests/test_study/test_study_examples.py
@@ -1,6 +1,7 @@
 # License: BSD 3-Clause

-from openml.testing import TestBase, SimpleImputer, CustomImputer, cat, cont
+from openml.testing import TestBase, SimpleImputer, CustomImputer
+from openml.extensions.sklearn import cat, cont

 import sklearn
 import unittest