visualization of pyjanitor chained method calls #1176

Closed
asmirnov69 opened this issue Oct 16, 2022 · 11 comments

Comments

@asmirnov69
Contributor

Brief Description

I'd like to propose some new features for pyjanitor, focused on visualization and data organization. I've made a proof-of-concept repo to explain exactly what the proposal contains: https://github.com/asmirnov69/pyjviz-poc

To be really brief: this example of pyjanitor code with the corresponding diagram

Examples

More examples are here

A bit more about the goal beyond the immediate focus

The immediate focus is to get PNG diagram generation working, using rdflib with the provided SPARQL queries and Graphviz. The bigger idea is to store RDF logs and similarly collected data in a graph database. This would be a database of research and production activity that uses SPARQL and/or openCypher to connect the collected data to other knowledge graph systems (e.g. Obsidian).
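To make the log-to-diagram idea concrete, here is a minimal, dependency-free sketch. It is not the pyjviz-poc implementation: plain tuples stand in for an rdflib graph, a simple predicate filter stands in for a SPARQL SELECT, and the `ex:` predicate names and the helper functions `log_chain`, `match`, and `to_dot` are hypothetical names chosen for illustration.

```python
# Hypothetical vocabulary for this sketch; not taken from pyjviz-poc.
NEXT = "ex:next"    # links a step to the step that follows it
LABEL = "ex:label"  # attaches a method name to a step


def log_chain(method_names):
    """Return (subject, predicate, object) triples describing a method chain."""
    triples = []
    for i, name in enumerate(method_names):
        step = f"ex:step{i}"
        triples.append((step, LABEL, name))
        if i > 0:
            triples.append((f"ex:step{i - 1}", NEXT, step))
    return triples


def match(triples, predicate):
    """Return (subject, object) pairs for one predicate (stand-in for a SPARQL query)."""
    return [(s, o) for s, p, o in triples if p == predicate]


def to_dot(triples):
    """Render the logged chain as Graphviz DOT text."""
    lines = ["digraph chain {"]
    for step, name in match(triples, LABEL):
        lines.append(f'  "{step}" [label="{name}"];')
    for src, dst in match(triples, NEXT):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


triples = log_chain(["clean_names", "remove_empty", "rename_column"])
dot = to_dot(triples)
print(dot)
```

Saving the printed text to a file such as chain.dot and running `dot -Tpng chain.dot -o chain.png` should produce the PNG, assuming Graphviz is installed.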

@asmirnov69
Contributor Author

I was unable to find a way to attach labels. My intent was to label this issue as 'discussion-needed'.

@asmirnov69 changed the title from "visualization of pyjanitor chained methods pipes" to "visualization of pyjanitor chained method pipes" on Oct 17, 2022
@ericmjl
Member

ericmjl commented Nov 14, 2022

Hi @asmirnov69! Thanks for posting this issue. It's a really cool idea! I especially like being able to visualize both pandas and janitor functions simultaneously.

Because the POC implementation is a tad complex, would you be kind enough to point out where the key changes were needed in order to enable keeping track of which function/method calls were made? I read through the code, but was confused, as I didn't see something like a globally-instantiated NetworkX graph object (or analogous thing).

Additionally, I saw the use of a new ChainedMethodPipe object. Is the intent with the ChainedMethodPipe to keep transformations visually isolated from one another?

I'd love to see this functionality implemented. We could probably house this in pyjanitor; if so, I can foresee the PR review process needing to be a bit longer than usual so that there's enough time for other maintainers to be brought up-to-speed on the implementation; we'd probably also need time to have developer/maintainer documentation written, focused particularly on:

  1. what are the most likely ways this code would break in the future, and what's the most likely place to go and fix, and
  2. what are desired possible future improvements not covered in this PR, and where should the code changes be made?
  3. answering questions that we bring up during the review process, e.g. architecture docs answering the questions I raised above.
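
For readers following the question about where call tracking could live, here is one possible sketch. It is not how pyjviz-poc does it: `CallRecorder` is a hypothetical wrapper class, and a plain string stands in for a DataFrame so the example has no dependencies. The wrapper forwards attribute access to the wrapped object and appends each method name to a shared list, which plays the role of the "global graph" asked about above.

```python
class CallRecorder:
    """Wrap any object and record the name of each method called on it."""

    def __init__(self, target, calls=None):
        self._target = target
        # All wrappers in one chain share the same list of recorded names.
        self.calls = [] if calls is None else calls

    def __getattr__(self, name):
        # Only reached for names not found on CallRecorder itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            self.calls.append(name)
            result = attr(*args, **kwargs)
            # Re-wrap results of the same type so the chain stays tracked.
            if isinstance(result, type(self._target)):
                return CallRecorder(result, self.calls)
            return result

        return wrapper


recorded = CallRecorder("  Hello World  ").strip().lower().replace(" ", "_")
print(recorded.calls)
```

A proxy like this leaves the wrapped class untouched, at the cost of hiding the real type from `isinstance` checks; patching methods in place is the usual alternative.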

@Zeroto521
Member

There is a reference from sklearn's Pipeline

https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_display.html

@asmirnov69
Contributor Author

asmirnov69 commented Nov 17, 2022

Hi @ericmjl
Thanks for expressing interest in the idea. I am going to start on further steps. Below are answers to some of your questions.

I'd love to see this functionality implemented. We could probably house this in pyjanitor;

As for housing viz in pyjanitor: I think it makes sense to consider that the main approach, since visualization is supposed to come as a set of out-of-the-box features for all users. In any case, it will be up to you to decide, based on the progress of some sort of parallel development.

if so, I can foresee the PR review process needing to be a bit longer than usual so that there's enough time for other maintainers to be brought up-to-speed on the implementation

The PR approach seems to work well for bug fixes and small feature additions. I think we can plan to use PRs. However, I expect we will need to agree on the long-term existence of some parallel feature branches with separate limited releases.

we'd probably also need time to have developer/maintainer documentation written,

After your revelation elsewhere that you use Obsidian, I decided to take a second look at that system. As a result, all my notes, for both office work and home, are now in various Obsidian vaults.
Do you want to experiment with using Obsidian for the viz features project documentation? I will do some initial setup in the pyjviz-poc repo in the next few days to show how it might look. Hopefully that will be enough to arrive at an informed decision.

I will provide more answers a bit later. Some of the answers actually belong in the project documentation, so clarity on the documentation approach would be nice to have.

Thanks again for your support, I really appreciate that.

@asmirnov69
Contributor Author

@Zeroto521 thanks for posting about the viz available in sklearn. The pyjviz-poc proposal goes further in this direction. The main idea is to use RDF as the data format to capture the details needed for visualization and other uses. As a result, the visualization itself can be built as an independent component.
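
A rough sketch of that decoupling, under stated assumptions: the producer writes the log as N-Triples-style text lines, and a separate consumer parses them back without knowing anything about the code that produced them. The namespace `http://example.org/`, the `label` and `next` predicates, and the functions `write_log` and `read_log` are hypothetical; a real implementation would more likely use rdflib, and this toy parser does not handle escaped quotes in literals.

```python
import re

EX = "http://example.org/"  # hypothetical namespace, for illustration only


def write_log(steps):
    """Producer side: turn method names into N-Triples-style lines."""
    lines = []
    for i, name in enumerate(steps):
        lines.append(f'<{EX}step{i}> <{EX}label> "{name}" .')
        if i > 0:
            lines.append(f"<{EX}step{i - 1}> <{EX}next> <{EX}step{i}> .")
    return "\n".join(lines)


# Subject and predicate are IRIs; the object is either an IRI or a plain literal.
LINE = re.compile(r'<([^>]+)> <([^>]+)> (?:<([^>]+)>|"([^"]*)") \.')


def read_log(text):
    """Consumer side: parse lines back into (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        m = LINE.fullmatch(line)
        if m:
            s, p, iri, literal = m.groups()
            triples.append((s, p, iri if iri is not None else literal))
    return triples


log_text = write_log(["clean_names", "remove_empty"])
parsed = read_log(log_text)
print(parsed)
```

Because the only contract between the two sides is the text format, the consumer could just as well be a renderer, a graph database loader, or another tool.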

@asmirnov69
Contributor Author

@ericmjl Hi, I put additional answers into https://github.com/asmirnov69/pyjviz-poc/blob/main/docs/pyjviz-poc/Q%26A.md
Also take a look here for code highlights: https://github.com/asmirnov69/pyjviz-poc/blob/main/docs/pyjviz-poc/README.md#code-highlights
These files are part of an Obsidian vault, so they may be easier to read with that tool.

One thing to mention here: I renamed ChainedMethodsPipe to ChainedMethodsCall. That was actually the original name, which somehow didn't make it into the proposal's final commit. I think "pipe" is overused as a term, so ChainedMethodsCall looks better and more relevant. Let me know if you think we should still use ChainedMethodsPipe.

@asmirnov69 changed the title from "visualization of pyjanitor chained method pipes" to "visualization of pyjanitor chained method calls" on Dec 3, 2022
@asmirnov69
Contributor Author

Hi @ericmjl
After some additional thinking, I realized there is a way to proceed with the proposal as a separate Python module, instead of attempting to modify pyjanitor to include viz features.

I suggest introducing a Python module, pyjviz, which will have pyjanitor as a required dependency. I've made some initial experiments, and it looks like it is possible to do it this way. pyjviz will give users the ability to visualize pyjanitor code in a manner close to the POC examples.

The way visualization looks from the code perspective will change, which will make the current examples and docs useless. All new examples and docs will be in pyjviz. The POC repo pyjviz-poc will be archived in the next few days.

I am starting work on pyjviz and will report on progress. I hope to get something done this week so we can resume discussions on how this new module will look.

How exactly can we communicate about pyjviz dev efforts? Right now the only available option is this issue. Would it make sense to move further discussion elsewhere?

@ericmjl
Member

ericmjl commented Dec 12, 2022

@asmirnov69 thank you for giving it thought! I think it's a good idea. I apologize for silence on my side, we've been busy with physical health issues and our 2nd baby's arrival.

I've been thinking about the development and pyjviz, and having a separate repo with an independent space makes a lot of sense. In that way, you can really drive forward and own the problem space without being too burdened by the existing code base.

Would you like a repo space under the pyjanitor-devs umbrella? I know I'd be happy for your development work to be recognized as part of a growing ecosystem of tooling to cleanly clean data, and I think the rest of the @pyjanitor-devs/core-devs would be excited to see it too. I'm happy to add you to the pyjanitor dev team as well. Please let me know.

@asmirnov69
Contributor Author

asmirnov69 commented Dec 13, 2022

@ericmjl Best wishes for your new baby and your family!

Would you like a repo space under the pyjanitor-devs umbrella?

Yes, a new repo in pyjanitor-devs for the visualization project would be best. I also agree to join the pyjanitor dev team as an open source contributor.

Feel free to pick a better name than pyjviz. Let me know when I can start using the new repo and what the requirements are.

I plan to close this issue and archive the pyjviz-poc repo.

@ericmjl
Member

ericmjl commented Dec 13, 2022

Thank you, @asmirnov69!

I have added you as a core dev, you should be getting an invite shortly.

I'm not sure what a better name would be, so we can go with pyjviz if you'd like.

One thing I hope to encourage as a standard is the use of continual testing, linting, and automated publishing of docs. We can slowly make that happen; there's enough of a great pattern accumulated by other contributors over the years in the pyjanitor repo that we can copy over.

The delivery is today, I will be offline for a bit. In the meantime, could you send me an email at ericmajinglong@gmail.com? (Short-whale, which I usually use, is down.) I will also send you a link to join our discord chat room.

@asmirnov69
Contributor Author

The new repo pyjviz will be used for further development of the ideas described above.


4 participants