
[Notes] Some notes on multiprocessing (in particular in the context of tpcp) #119

AKuederle opened this issue Aug 1, 2024

  • Assume multiprocessing is done with joblib.

Global Vars

In all child processes, global variables are reset to their import-time values (which makes sense, as code that may have modified them is not re-executed in the child process). This means that global config and runtime modifications to objects (e.g. applying a decorator to a function or class after import) will be reset. This also affects sklearn and pandas, which both have global configuration. Sklearn ships a custom fix for its internal uses of parallel processing; however, wrapping sklearn models with Parallel yourself still triggers the issue. tpcp has a custom workaround that "patches" delayed (first implementation here: https://github.com/mad-lab-fau/tpcp/pull/65/files). This is used internally in tpcp, but can also be used in any parallel context by using the tpcp delayed implementation. It requires you to configure custom callbacks that get the global config in the parent process and then set it (or run any other side effect) in the new process. A rough sketch of the idea is shown below the related issues.

Related issues: joblib/joblib#1071, scikit-learn/scikit-learn#17634
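
As an illustration (not tpcp's actual API), a minimal sketch of the "patched delayed" idea could look like this, here restoring sklearn's global config in the worker; the name `delayed_with_sklearn_config` is purely illustrative:

```python
import joblib
from sklearn import get_config, set_config


def delayed_with_sklearn_config(func):
    """Sketch of a "patched delayed": capture sklearn's global config in the parent
    process and re-apply it in the child before running ``func``."""

    def call(*args, **kwargs):
        config = get_config()  # evaluated in the parent process at submission time

        def wrapped(*a, **kw):
            set_config(**config)  # restore the parent's config inside the child
            return func(*a, **kw)

        # joblib.Parallel expects (function, args, kwargs) tuples, like joblib.delayed.
        # The closure ``wrapped`` is serialized via joblib's cloudpickle fallback.
        return wrapped, args, kwargs

    return call


# Usage:
# results = joblib.Parallel(n_jobs=2)(
#     delayed_with_sklearn_config(my_func)(item) for item in data
# )
```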

Process Pool

When using the loky backend of joblib, a process pool is created on the first parallel call. This process pool is reused within one Parallel processing context in case you have more items to process than you spawned processes. However (and this was new to me!), the worker processes are also reused in subsequent calls to Parallel. This means that if your jobs modify global variables in the child process, this information might "leak" into subsequent parallel calls. In particular in combination with the Global Vars workaround, this can lead to surprising results.
If you need to shut down the process pool immediately, you can use get_reusable_executor().shutdown(wait=True). This might be important in tests that require a "clean" process; otherwise the order in which tests are executed might matter. A small demonstration is shown below.
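
A rough sketch of the leak and the shutdown call (assuming the two functions live in an importable module so they are pickled by reference; names are illustrative):

```python
from joblib import Parallel, delayed
from joblib.externals.loky import get_reusable_executor

_STATE = "initial"  # reset to this value whenever a fresh worker process starts


def mutate_state():
    global _STATE
    _STATE = "modified"  # modifies the global only inside the worker process
    return _STATE


def read_state():
    return _STATE


# First call: the workers set their copy of _STATE to "modified".
Parallel(n_jobs=2)(delayed(mutate_state)() for _ in range(2))

# Second call: the workers from the first call may be reused, so this can
# return ["modified", "modified"] instead of the import-time value.
print(Parallel(n_jobs=2)(delayed(read_state)() for _ in range(2)))

# Force a fresh pool, e.g. in tests that need clean worker state.
get_reusable_executor().shutdown(wait=True)
```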

Serialization

Note: I don't think I fully understand this yet myself, so some details might be wrong

One part that makes multiprocessing in Python tricky is that you need to serialize all relevant objects and send them to the executing process. For complicated datatypes like custom classes and even functions (that should be executed in the child processes), this can become complicated. Joblib (and other multiprocessing tools) use pickle (or pickle derivatives) for this serialization.
The way pickle works is that objects are passed by "reference". In the case of global values (global variables, global functions, ...), it stores the import path and the name of the object. The child process then attempts to import the object under the specified name and import path.
For objects that are not globals (e.g. instances of classes), pickle also needs the state of the object. In these cases, it stores a reference to the parent type and the state of the object obtained by calling __getstate__ (which defaults to returning __dict__), and then uses the imported parent type plus a call to __setstate__ to recreate the object in the child process. A minimal example of this protocol is shown below.
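
For illustration, a minimal example of that protocol (the class and attribute names are made up):

```python
import pickle


class Counter:
    """Pickle stores a reference to this (globally importable) type plus the
    instance state, and rebuilds the instance from both in the receiving process."""

    def __init__(self):
        self.count = 0

    def __getstate__(self):
        # Called while pickling; the default implementation returns self.__dict__.
        return {"count": self.count}

    def __setstate__(self, state):
        # Called after the type was re-imported in the receiving process.
        self.count = state["count"]


c = Counter()
c.count = 5
restored = pickle.loads(pickle.dumps(c))
assert restored.count == 5
```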

This has a number of severe limitations.

  1. Instances of types where the type is not defined globally cannot be pickled, for example when a class was defined within a function or created dynamically and not assigned to a global variable.
  2. Objects without a __qualname__, even if defined globally, e.g. functions wrapped by functools.partial.
  3. Functions not stored under a global reference (e.g. inline-defined lambdas).
  4. (The surprising one) Objects defined in __main__, i.e. the file that you are currently executing. The reason is that __main__ is a dynamic namespace and is defined differently in the child processes. The fix is usually as simple as moving the relevant functions and classes to a different module so that they have a fixed import path.
    Sometimes this is not possible, because you only have the required information on how to create the object in the __main__ scope, e.g. when your object is defined based on your run config. In this case it can help to move the creation of your object into the child process and only pass the config through (see the sketch after this list). This way the dynamically created object (e.g. function or class) does not need to be serialized.
  5. Global objects that are "replaced"/modified before serialization: if you apply runtime modifications to a class and then attempt to serialize it, the deserialization in the child process will sometimes fail with a "could not find x as x"-style error.
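
A minimal sketch of the workaround from point 4, building the object inside the child process instead of serializing it (build_model, process_item, and the run_config layout are illustrative placeholders):

```python
from joblib import Parallel, delayed


def build_model(run_config):
    # The dynamically created object (here a closure) is built inside the child
    # process, so it never has to be serialized.
    def model(x):
        return x * run_config["factor"]

    return model


def process_item(item, run_config):
    model = build_model(run_config)
    return model(item)


# Only the plain config dict is shipped to the workers.
run_config = {"factor": 2}
results = Parallel(n_jobs=2)(
    delayed(process_item)(item, run_config) for item in range(4)
)
```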

If you run into any of the above cases, it can be really frustrating, as they are hard to debug and require a deep understanding of what is going on.
Luckily, you don't run into them that often, as joblib uses cloudpickle as a fallback when serialization of an object fails.
Cloudpickle attempts to pass objects by value and not by reference,
i.e. it tries to serialize the full object including its entire code structure.
This makes it much slower, but allows serializing objects that plain pickle cannot handle.
For example, this covers many dynamically created objects (in particular cases 1 and 3). However, if cloudpickle also fails to serialize your object, the error messages only get more obscure and harder to understand. Let's hope this does not happen. A small comparison of the two is shown below.
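
For example, a lambda illustrates the difference (cloudpickle is installable standalone and also vendored by joblib as joblib.externals.cloudpickle):

```python
import pickle

import cloudpickle  # also available as joblib.externals.cloudpickle

square = lambda x: x ** 2  # no stable global import path for pickle to reference

try:
    pickle.dumps(square)
except (pickle.PicklingError, AttributeError) as e:
    print("plain pickle fails:", e)

# cloudpickle serializes the function by value (including its code object), and
# its output is regular pickle data, so the round-trip works even for lambdas.
restored = pickle.loads(cloudpickle.dumps(square))
print(restored(3))  # 9
```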

General takeaways:

  • If you get a pickle error that says something about __main__, move stuff to a dedicated module.
  • If that is not possible, because objects are generated based on runtime values, try to move the object creation into the child process so that only the config information needs to be serialized.
  • If nothing works, maybe just use single-core processing. It's probably faster than you spending a week debugging the issue...

Imports

As objects are passed by reference to the child processes, all relevant imports need to be resolved in every process of the pool.
This can create a huge time overhead.
For example, tensorflow can take multiple seconds to import. In cases where these dependencies are actually optional (e.g. you only sometimes use tensorflow), it might make sense to delay the import until the point where you actually need it (see this diff as an example; a sketch of the pattern is shown below).
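
A sketch of such a lazy import (the function name and flag are made up for illustration):

```python
def extract_features(data, use_tensorflow=False):
    if use_tensorflow:
        # Imported only in the code path (and hence only in the workers) that
        # actually needs it, instead of at module import time.
        import tensorflow as tf

        return tf.convert_to_tensor(data)
    return data
```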
