Merged
20 changes: 19 additions & 1 deletion openml/datasets/dataset.py
@@ -456,6 +456,10 @@ def _create_pickle_in_cache(self, data_file: str) -> Tuple[str, str, str]:
# The file is likely corrupt, see #780.
# We deal with this when loading the data in `_load_data`.
return data_pickle_file, data_feather_file, feather_attribute_file
except Exception:
Collaborator
Could you specifically catch ModuleNotFoundError and any others that are expected? I would prefer not to use a catch-all, since if new issues arise we could then just fix them instead.

Collaborator
You can add a check for the error from #898 too, and document in a one-sentence comment why each error is expected. Including the issue number is good, but since they are easily summarized, it is nice to have the documentation within the code as well, e.g.

except ModuleNotFoundError:
    # 780: Pickled dataframe is likely of pandas<1.0 while attempting to load with pandas>=1.0
    return ...
except ValueError:
    # (maybe check for the specific message)
    # 898: Dataframe pickled with protocol 5 (Py3.8), but loaded with protocol 4 (<Py3.8).
    return ...
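The pattern suggested above can be sketched as a small runnable helper. This is a minimal illustration only, not the actual OpenML-Python code; the `load_cached_pickle` name and the convention of signalling a cache miss by returning `None` are assumptions for the sketch:

```python
import pickle


def load_cached_pickle(path):
    """Sketch: map known, expected failure modes to a cache miss (None)
    rather than using a bare catch-all."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except ModuleNotFoundError:
        # #780-style failure: the file was pickled against a library whose
        # module layout has since changed, so unpickling cannot import it.
        return None
    except ValueError as e:
        if "unsupported pickle protocol" in str(e):
            # #898-style failure: pickled with a newer pickle protocol
            # than this Python version can read.
            return None
        raise  # an unexpected ValueError: surface it so it can be fixed
```

Checking the exception message keeps the `ValueError` handler narrow, so genuinely new problems still propagate instead of being treated as cache misses.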

Collaborator Author

Good point, I just updated the code.

# There was some issue loading the file, see #918
# We deal with this when loading the data in `_load_data`.
return data_pickle_file, data_feather_file, feather_attribute_file

# Between v0.8 and v0.9 the format of pickled data changed from
# np.ndarray to pd.DataFrame. This breaks some backwards compatibility,
@@ -473,6 +477,10 @@ def _create_pickle_in_cache(self, data_file: str) -> Tuple[str, str, str]:
# The file is likely corrupt, see #780.
# We deal with this when loading the data in `_load_data`.
return data_pickle_file, data_feather_file, feather_attribute_file
except Exception:
# There was some issue loading the file, see #918
# We deal with this when loading the data in `_load_data`.
return data_pickle_file, data_feather_file, feather_attribute_file

logger.debug("Data feather file already exists and is up to date.")
return data_pickle_file, data_feather_file, feather_attribute_file
@@ -529,7 +537,7 @@ def _load_data(self):
"Detected a corrupt cache file loading dataset %d: '%s'. "
"We will continue loading data from the arff-file, "
"but this will be much slower for big datasets. "
- "Please manually delete the cache file if you want openml-python "
+ "Please manually delete the cache file if you want OpenML-Python "
"to attempt to reconstruct it."
"" % (self.dataset_id, self.data_pickle_file)
)
Expand All @@ -539,6 +547,16 @@ def _load_data(self):
"Cannot find a pickle file for dataset {} at "
"location {} ".format(self.name, self.data_pickle_file)
)
except Exception as e:
logger.warning(
"Encountered error message when loading cached dataset %d: '%s'. "
"Error message was: %s. "
"We will continue loading data from the arff-file, "
"but this will be much slower for big datasets. "
"Please manually delete the cache file if you want OpenML-Python "
"to attempt to reconstruct it."
"" % (self.dataset_id, self.data_pickle_file, e.args[0]),
)

return data, categorical, attribute_names
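The warn-and-fall-back flow that `_load_data` follows can be sketched generically. This is a hypothetical helper for illustration, assuming the cache and source loaders are passed in as callables; it is not the signature used by OpenML-Python:

```python
import logging

logger = logging.getLogger(__name__)


def load_with_fallback(load_cache, load_source, dataset_id, cache_path):
    """Sketch: try the fast cached path first; on any cache problem,
    log a warning and fall back to the slower source parse."""
    try:
        return load_cache()
    except FileNotFoundError:
        logger.warning(
            "Cannot find a pickle file for dataset %s at location %s",
            dataset_id, cache_path,
        )
    except Exception as e:
        logger.warning(
            "Error loading cached dataset %s ('%s'): %s. "
            "Falling back to the source file; delete the cache file "
            "to let it be reconstructed.",
            dataset_id, cache_path, e,
        )
    return load_source()
```

Catching `FileNotFoundError` before the broad `except Exception` keeps the expected "no cache yet" case distinct from genuine corruption in the log output.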
