Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🪲Bug Hunt🪲: dill goes to jupyter. what gets stupyder? #301

Closed
OwenPriceSkelly opened this issue Oct 11, 2023 · 5 comments · Fixed by #353
Closed

🪲Bug Hunt🪲: dill goes to jupyter. what gets stupyder? #301

OwenPriceSkelly opened this issue Oct 11, 2023 · 5 comments · Fixed by #353
Assignees

Comments

@OwenPriceSkelly
Copy link
Member

OwenPriceSkelly commented Oct 11, 2023

Our new publication workflow takes an "everything but the kitchen sink" approach to serializing user code, which significantly reduces the hoops our sdk and the users need to jump through in order to publish their work. The tradeoff is that the new approach leans heavily on dill's ability to reliably save a python interpreter session to a file, and opens us up to far more edge cases. Now that we're going to try and serialize arbitrary jupyter notebooks with dill, we need to know what kinds of jupyter notebook state dill won't be able to handle as advertised.

At a high level, this is the process:

  • Given some jupyter notebook, convert it to a plain python script with nbconvert
  • Run that script in a fresh python interpreter (bringing user definitions, imports etc into scope)
    • Ideally, this should reach exactly the same interpreter state as running each cell in order manually would. When might this not be true?
  • Save the state of that python process to a file using dill.dump_session
  • Later, give that file to dill.load_session in another fresh process in order to call one of the user-defined functions.

What we are not worried about:

  • Problems due to a mismatch between python, jupyter, dill versions or OS. Assume both sides of the serialize/deserialize process are properly containerized, just different python processes
    • NB: Any version-related bugs or incompatibilities that keep the environment fixed on both sides still count!

What we are worried about:

  • jupyter notebook magic/state that is lost when run as a plain python script
  • notebook state that would be preserved but is un-pickleable and causes immediate problems on the serialization side
  • notebook state that seems to pickle fine but causes problems when deserializing later

Acceptance Criteria

  • Description of the bug with as much relevant detail as you have
  • Step-by-step instructions (or, even better, a script) for reproducing the issue.
  • Bonus points for any ideas on workarounds, but not necessary

Deadline and Prize for Best Bug

  • Best Bug is subjective (but we have to be able to reproduce it!). More than open to argument on the basis of size, technical interest, relevance to users, fun factor, style points, etc.
  • Bug description and repro steps should be linked under this issue before November 21
  • the Best Bug submitted before the deadline takes home this lil guy:
    image (1)
@WillEngler
Copy link
Collaborator

Quick thoughts from lunch with Globus Compute engineers:

  • Do classes defined in the notebook get transferred properly?
  • Someone could probably shoot themself in the foot by trying to import something from an adjacent file. I don't think we want to support this, but it would be good to make good error messages for it or make it impossible to do.

@isaac-darling
Copy link
Contributor

isaac-darling commented Nov 8, 2023

I found a bug (unrelated to dill) while syncing with Owen today that prevented users (me) from converting notebooks with magics in them. Apparently, in order for nbconvert to properly handle ipython magics (!pip install and the like) during conversion, ipython must be installed. Since ipython was not in the pyproject.toml, my session did not have it installed and neither would the users' (necessarily). The fix was quickly merged, adding the proper ipython version to the pyproject.toml.

@MaxTuecke
Copy link
Contributor

Not a jupyter issue, but our version of dill cannot serialize classes that inherit from abc. For example

from abc import ABC, abstractmethod
import pandas as pd
import dill

class Preprocessor(ABC):
    @abstractmethod
    def preprocess(self, input_df: pd.DataFrame) -> pd.DataFrame:
        pass
        
class Preprocessor1(Preprocessor):
    def preprocess(self, input_df: pd.DataFrame) -> pd.DataFrame:
        input_df.fillna(0, inplace=True)
        return input_df

p = Preprocessor1()

dill.dump_session("./test.pkl") #fails

will fail to serialize.

Dill will not fail to serialize this on version 0.3.7
Alternatively, dill can serialize this if any classes inheriting from ABC or ABCMeta are in another file. For example

main.py

from a_module import Preprocessor1
import dill

p = Preprocessor1()

dill.dump_session("./test.pkl") #works

a_module.py

from abc import ABC, abstractmethod
import pandas as pd

class Preprocessor(ABC):
    @abstractmethod
    def preprocess(self, input_df: pd.DataFrame) -> pd.DataFrame:
        pass
        
class Preprocessor1(Preprocessor):
    def preprocess(self, input_df: pd.DataFrame) -> pd.DataFrame:
        input_df.fillna(0, inplace=True)
        return input_df

Dill issue for this can be found here: uqfoundation/dill#332

@OwenPriceSkelly
Copy link
Member Author

mistakenly closed by GitHub keywords

@WillEngler
Copy link
Collaborator

All prizes from the bug hunt have been distributed, closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants