Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Workflow.transform to be run with a DataFrame type #1777

Merged
merged 6 commits into from
Mar 21, 2023

Conversation

oliverholworthy
Copy link
Member

Enable Workflow.transform to be run with a DataFrame type.

@oliverholworthy oliverholworthy added the enhancement New feature or request label Mar 10, 2023
@oliverholworthy oliverholworthy added this to the Merlin 23.03 milestone Mar 10, 2023
@oliverholworthy oliverholworthy self-assigned this Mar 10, 2023
@oliverholworthy oliverholworthy force-pushed the workflow-transform-dataframe branch from 3f4bcf4 to d65e9ce Compare March 10, 2023 19:25
nvtabular/workflow/workflow.py Outdated Show resolved Hide resolved
nvtabular/workflow/workflow.py Outdated Show resolved Hide resolved
@karlhigley
Copy link
Contributor

It looks like the failing test is an unrelated Horovod issue—seems like we’ve been seeing a lot of failures from those tests lately, not sure why.

Beyond the failing test, what else would prevent us from merging this as is?

@oliverholworthy oliverholworthy marked this pull request as ready for review March 13, 2023 12:53
@oliverholworthy
Copy link
Member Author

I haven't looked into the details of why the horovod tests are failing regularly either. Seems like we have an unstable test there.

From a functionality perspective, I think the PR achieves the goal of being able to run the transform on a dataframe.

I'm on the fence about whether the use of singledispatch here is clearer to read than the if statements I added initially to the body of the transform method, I suppose if makes it easier to add more types in future if we anticipate more we'd like to add.

@karlhigley
Copy link
Contributor

Yeah, I’m with you on readability. Two things I can think of:

  • I think transform_df is only used in one place and could be inlined so the code requires less jumping around to understand. I don’t think transform_impl can, because it exists as a workaround for…something.
  • In the interest of quasi-consistency, maybe our guideline could be that we use singledispatch for internal APIs (e.g. in Core) and if/else for public APIs?

@oliverholworthy
Copy link
Member Author

I started to revert back toward the if/else style and I had the feeling that, on balance, sticking with singledistpatchmethod might be ok in the end in terms of readability here. Makes it easier to reason about the inputs and outputs to each registered transform implementation. Either way the docstring of transform will be the same I think which is the main thing people will be reading.

Once we've merged this use of the private method _transform_df can be updated to call transform instead

@karlhigley karlhigley merged commit f14a6a2 into main Mar 21, 2023
@rnyak rnyak deleted the workflow-transform-dataframe branch May 4, 2023 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants