Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transition to plugins for dedupe variables. #1085

Closed
fgregg opened this issue Aug 17, 2022 · 7 comments · Fixed by #1193
Closed

transition to plugins for dedupe variables. #1085

fgregg opened this issue Aug 17, 2022 · 7 comments · Fixed by #1193

Comments

@fgregg
Copy link
Contributor

fgregg commented Aug 17, 2022

newer versions of pip/setuptools are now confused by the old-style namespace packages.

i previously tried to migrate to the implicit style namespace packages, but the problem with that i don't think there's a way to do that this not backwards incompatible.

specifically, i don't see how you could use the implicit style namespaces and have something like from dedupe import Dedupe work.

you could do from dedupe.api import Dedupe but that's not what we have guaranteed.

so, then maybe we have to move to the src-layout

@fgregg fgregg mentioned this issue Aug 17, 2022
@NickCrews
Copy link
Contributor

I don't see a problem with just using a src layout?

@NickCrews
Copy link
Contributor

Also, what was "backwards incompatible" with it? Like any custom variables that got registered with dedupe by being implicitly part of the dedupe package wouldn't get registered anymore?

I would say for that I don't currently like how that registration is implicit via subclassing. I think it would be much better to have explicit @dedupe.register_variable decorator that people could put on their own classes, or even use a plugin system like pluggy

@fgregg
Copy link
Contributor Author

fgregg commented Sep 2, 2022

with implicit namespaces you can't have an init.py in the top directory as far as i understand.

so everything that's under dedupe right now would have to go into a subdirectory like "core" whatever.

the reasons why we use subclassing is that was the main way to do it when we first wrote the code way back.

there's not really any path forward that's not going to be a breaking change for existing plugins, so i think we should feel free to do what we want. i'm very open to more modern approaches here.

@fgregg fgregg changed the title transition to implicit namespaces or src-layout. transition to plugins for dedupe variables. Dec 10, 2022
@fgregg
Copy link
Contributor Author

fgregg commented Dec 10, 2022

@NickCrews, do you have an example of a use of pluggy that would be a good fit for dedupe variables? most of what i'm finding is about hooks, which doesn't seem quite right.

@NickCrews
Copy link
Contributor

NickCrews commented Dec 10, 2022

I don't. And thinking about it more, pluggy seems a bit overkill.

What about we do what pandas does with their plotting backend? https://plotly.com/python/pandas-backend/

Eg they say pandas.options.plotting.backend = "plotly" and pandas goes and tries to import a package named "plotly" and thn use a function called "plot" in that package.

So if you specify the type of a variable as
"mylib.foo:ZipCodeVariable" then dedupe imports mylib.foo and looks for an attribute named ZipCodeVariable. Using a colon as a separator as is used as a convention elsewhere in python eg setuptools entry points.

This is better because it is explicit where to go looking for the source code, there isn't some hidden registration behind the scenes like there is now.

It is simpler for both variable authors and for dedupe maintainers than pluggy or some other system.

This string spec is still not backwards compatible for plugin authors, so no additional benefit there.

@fgregg
Copy link
Contributor Author

fgregg commented Dec 10, 2022 via email

@NickCrews
Copy link
Contributor

I looked at #1121, some specific comments there.

Check out my solution at #1122.

Another benefit I realized of using spec-strings instead of pluggy is that it is way more testable. Your PR didn't have tests, I'm guessing because they would be hard to write :) Mine was actually pretty easy to add some tests.

fgregg added a commit that referenced this issue Jun 27, 2024
set up data models like

```python
[
  dedupe.variables.String("name"),
  dedupe.variables.Exact("address")
 ]
 ```

instead of 

```
[
  {"field", "name", "type": "String"},
  {"field", "address", "type": "String"},
 ]
```

supersedes #1122, #1121. will close #1085
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants