Still not able to use non-sklearn estimators without wrapping them in a pipeline #734

Closed
hp2500 opened this issue Jul 11, 2019 · 7 comments · Fixed by #742

hp2500 commented Jul 11, 2019

Hi there, I raised this in issue #724. I have been trying to run experiments with a fairly new sklearn-extra classifier (https://github.com/Alex7Li/scikit-learn-extra/tree/master/sklearn_extra). The classifier runs fine on a local dataset, but when I try to run it on an OpenML task, I get an error.

Here is a minimal example:

import openml
from sklearn_extra.fast_kernel import FKC_EigenPro

# define classifier
clf = FKC_EigenPro()
# get task
task = openml.tasks.get_task(3)
# run model on task
run = openml.runs.run_model_on_task(clf, task)
# publish run on openml
run.publish()

AttributeError Traceback (most recent call last)
in
4 task = openml.tasks.get_task(3)
5 # run model on task
----> 6 run = openml.runs.run_model_on_task(clf, task)
7 # publish run on openml
8 run.publish()

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_model_on_task(model, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow, return_flow)
104 seed=seed,
105 add_local_measures=add_local_measures,
--> 106 upload_flow=upload_flow,
107 )
108 if return_flow:

/miniconda3/lib/python3.7/site-packages/openml/runs/functions.py in run_flow_on_task(flow, task, avoid_duplicate_runs, flow_tags, seed, add_local_measures, upload_flow)
172 task, flow = flow, task
173
--> 174 flow.model = flow.extension.seed_model(flow.model, seed=seed)
175
176 # We only need to sync with the server right now if we want to upload the flow,

AttributeError: 'NoneType' object has no attribute 'seed_model'

You mentioned that this should be fixed via #722, but I am still encountering the same error.

@amueller

@amueller
Contributor

Easier way to reproduce:

import openml
from sklearn.linear_model import LogisticRegression

# there needs to be a version specified but this works lol.
__version__ = 0.1

class MyLR(LogisticRegression):
    pass

clf = MyLR()
# get task
task = openml.tasks.get_task(3)
# run model on task
run = openml.runs.run_model_on_task(clf, task)
# publish run on openml
run.publish()

RuntimeError: No extension could be found for flow None: __main__.MyLR

So get_extension_by_model returns the sklearn extension because isinstance(MyLR(), BaseEstimator) is true, which is also not the correct test btw, but whatever.

The problem is that the flow created from that model is not recognized as an sklearn extension flow. That lookup goes through get_extension_by_flow, and the sklearn extension module doesn't include sklearn in the flow's external_version; it only includes the sklearn version in the tags:

flow = OpenMLFlow(name=name,

There are two obvious fixes:
a) When creating the flow, allow setting the extension directly, because we know what the extension is supposed to be.

b) Include the sklearn version in the external version.

I feel we should be doing both possibly?
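
To make option (b) concrete, here is a minimal sketch of the idea, not openml-python's actual code. The function names and the comma-separated `package==version` layout of the external version string are assumptions made for illustration only.

```python
# Stand-in sketch: none of these names are openml-python's real API.

def build_external_version(model_package, model_version, sklearn_version):
    """Join 'package==version' entries, always adding sklearn (option b)."""
    entries = {
        f"{model_package}=={model_version}",
        f"sklearn=={sklearn_version}",
    }
    return ",".join(sorted(entries))


def sklearn_extension_can_handle(external_version):
    """Assumed detection rule: claim the flow only if sklearn is listed."""
    packages = [entry.split("==")[0] for entry in external_version.split(",")]
    return "sklearn" in packages


# Without the fix, only the model's own package is recorded,
# so the flow-based lookup cannot tell that sklearn should handle it.
old_style = "__main__==0.1"

# With the fix, sklearn is recorded next to the model's package.
new_style = build_external_version("__main__", "0.1", "0.21.2")
```

With this rule, `old_style` is rejected and `new_style` is accepted, which mirrors the mismatch between the model-based and flow-based lookups described above.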

@mfeurer
Collaborator

mfeurer commented Jul 23, 2019

> which is also not the correct test btw but whatever.

what would you test for? The interface?

> include the sklearn version in the external version

that would definitely be helpful

> When creating the flow, allow setting the extension directly, because we know what the extension is supposed to be.

I hope that this won't be necessary, but we should keep it in mind in case this problem persists.

mfeurer assigned amueller and unassigned Neeratyoy Jul 23, 2019
amueller reopened this Oct 14, 2019
@amueller
Contributor

This issue still persists with older flows:

import openml
openml.flows.get_flow(7660, reinstantiate=True)

I would really like to run flow 7777 because it's used in the definition of CC-18, but I can't because it contains the ConditionalImputer, which can't be reinstantiated with current openml (I tried to use older openml and failed as well).

@mfeurer
Collaborator

mfeurer commented Oct 14, 2019

Two questions:

  1. Do you have the ConditionalImputer installed? If yes, could you please paste the error?
  2. What's the behavior that you expect? That the pipeline is partially instantiated, except for the ConditionalImputer?

@amueller
Contributor

  1. Yes. The error is that we can only instantiate sklearn flows. This was fixed for new flows in #742 ("add sklearn version to external version in sklearn flows"), but this is an old flow and doesn't have sklearn in the external version, so the sklearn extension can't detect that it handles it.

  2. That the pipeline is instantiated. The rest can be instantiated.

Either we decide to basically abandon all third-party flows that were created before #742, or we need to change how an extension detects whether it can handle a flow. Hacky solution: add study14 to the list of modules the extension checks for when deciding whether it can handle a flow.
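
A rough sketch of what such a fallback check could look like. This is an illustration under assumptions, not the extension's real implementation: the function name, the prefix tuple, the `package==version` layout of the external version, and the class path used in the usage note are all hypothetical.

```python
# Hypothetical allow-list; "study14" is the module name mentioned above.
KNOWN_MODULE_PREFIXES = ("sklearn", "study14")


def can_handle_flow(external_version, class_name):
    """Accept a flow if sklearn appears in its external version, or,
    as a fallback for older flows, if its class path starts with a
    known top-level module."""
    entries = (external_version or "").split(",")
    packages = [entry.split("==")[0] for entry in entries]
    if "sklearn" in packages:
        return True
    # Fallback for flows created before sklearn was recorded.
    top_level_module = class_name.split(".")[0]
    return top_level_module in KNOWN_MODULE_PREFIXES
```

For example, `can_handle_flow("openml==0.9.0", "study14.preprocessing.ConditionalImputer")` would be accepted through the fallback, while a flow from an unrelated package with no sklearn entry would still be rejected. The trade-off is that the allow-list has to be maintained by hand.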

@amueller
Contributor

cc @janvanrijn

@amueller
Contributor

fixed in #830
