- There are now five different projects with Pandas-like APIs (maybe more?): Pandas, Modin, Dask, cuDF, Koalas.
- The Pandas API is huge.
- It would therefore be great to have some way of saying "you can switch to Modin/Dask/etc. with no trouble".
The goal of this repository is to think through how to do this, and prototype an implementation or implementations.
There are multiple goals, for different audiences:
USER-GOAL-CAN-SWITCH
: Application users of Pandas might want to switch to another library like Modin. Will their code work? Are all the APIs supported? It's not clear. Some projects list compatible API calls, but this is still insufficient given how complex some Pandas APIs are in terms of allowed inputs.

MAINTAINER-GOAL-ADDRESS-INCOMPAT
: Maintainers of libraries like Modin want to know what edge cases in the Pandas API they aren't supporting at all, or aren't supporting correctly, so they can fix those libraries.
The Data APIs Consortium has a similar but somewhat different project to figure out the intersection of APIs across multiple libraries. However:
- It also includes non-Pandas-compatible dataframes like Vaex.
- It's about the minimal compatible subset.
- It's aimed at library users.
The goal here is not "what is the intersection of all dataframe APIs", but rather "does Dask/Koalas/Modin/cuDF actually support the same API as Pandas", perhaps globally, perhaps on a per-application basis.
UX-JUST-WORK
: The ideal user experience is to take some Pandas code, switch an import or two, and have it Just Work.
The next best outcome is to be told where specifically this particular piece of code is incompatible with the alternative library. This can be done in multiple ways:
UX-STATIC-MISMATCH-NOTIFICATION
: Via static analysis, perhaps aided by type hints.

UX-RUNTIME-MISMATCH-NOTIFICATION
: At runtime. For example, getting a `StillUnsupportedError` exception when doing `df[x] = 1` for a certain value of `x`. This is more accurate than static analysis, maybe, but also less useful: it matters a lot whether it's 1 API that's missing or 50.
The next, lower level of user experience is to be presented with a chart of supported APIs, and the user can then manually validate whether an API is supported:
UX-LIST-OF-SUPPORTED-METHODS-DETAILED
: A list of supported methods/APIs, including all the various possible inputs (Pandas has so many!).

UX-LIST-OF-SUPPORTED-METHODS
: As above, but just a list of supported methods. This is e.g. the current default in Modin, and it's quite insufficient since sometimes certain input variants don't work, even if others do.
As a maintainer wanting to make sure all APIs are supported, the equivalent of UX-LIST-OF-SUPPORTED-METHODS-DETAILED
should suffice.
UX-STATIC-MISMATCH-NOTIFICATION
or UX-RUNTIME-MISMATCH-NOTIFICATION
seem like the best options.
In terms of information they require, they need the same information as UX-LIST-OF-SUPPORTED-METHODS-DETAILED
, and they just require additional coding work to utilize that information.
Per the above, the initial bottleneck is being able to say in detail which APIs a particular library does or does not support. Therefore, good next steps would be:
- Prototype some way to gather the detailed API compatibility, probably for Modin.
- Then, prototype using that information for static analysis, or perhaps runtime errors.
The fundamental API atoms, the things that can be supported or not supported, are specific functions/methods being called with specific types. Or rather:
- For some method arguments, the type isn't meaningful: any type will do.
  For example, `Series.add` will take anything that can be added, and one expects the underlying implementation to devolve to a plain `+`. Put another way, the same logic handles all types.
- For other arguments, the behavior is different based on the type.
  For example, `df[x] = 1` has different behavior depending on whether `x` is a string, a `DataFrame`, a list, a boolean `Index`, etc. (see the sketch below). In typing terms, this would be `Union[str, DataFrame, Index, list]`, although at the moment I'm not sure one can express a boolean `Index`.
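To make the second point concrete, here is a small illustrative sketch (plain Pandas, not part of any compatibility tooling) of how `df[x] = 1` dispatches on the type of `x`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df["c"] = 1            # str key: create or overwrite a single column
df[["a", "b"]] = 1     # list of str: assign to multiple columns at once
df[df["a"] > 1] = 1    # boolean mask: assign to every column in the matching rows
```

Each of these goes through the same `__setitem__` entry point but exercises a different code path, which is exactly what a per-method list of supported APIs fails to capture.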
The information we need is then:
- Supported method/function argument variations for the Pandas API.
- Supported method/function argument variations for the other library's API.
Combined, one can determine compatibility.
There is also, of course, the semantics: even if the same signatures are supported, the results might be different.
Information about compatibility can come from multiple sources:
- Type annotations of the Pandas API. Insofar as they're missing, they could be expanded to be more complete.
- Type annotations of the emulating API, e.g. Modin.
- Pandas test suite, which to some extent is a generic test of the API.
- The emulating library's test suite.
Let's assume a complete and thorough type annotation for Pandas is available. For example:
```python
class DataFrame:
    # ...
    def __setitem__(self, key: Union[DataFrame, Index, ndarray, List[str], str, slice, Callable], value: Any):
        # ...
```
That's not actually complete; the callable needs to only return the other items in the Union, and there are more variants.
What's more, the annotation needs to be precise as well as complete: if one just said `Iterable` or something similarly broad, that wouldn't allow for good checking.
We can then look at type annotations for e.g. Modin:
```python
class DataFrame:
    # ...
    def __setitem__(self, key: Union[str, List[str]], value: Any):
        # ...
```
Obviously, there are inputs that aren't supported and an automated tool could extract those.
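As an illustration, here is a rough sketch (names and structure are assumptions, not an existing tool) of how such a tool might compare the `key` parameter of `__setitem__` between the two libraries using typing introspection:

```python
import typing

def union_members(annotation) -> set:
    """Return the names of the types inside a Union (or of the single type)."""
    args = typing.get_args(annotation) or (annotation,)
    return {getattr(t, "__name__", str(t)) for t in args}

def missing_key_types(pandas_cls, other_cls) -> set:
    """Key types Pandas' __setitem__ accepts that the other library's doesn't."""
    pandas_key = typing.get_type_hints(pandas_cls.__setitem__)["key"]
    other_key = typing.get_type_hints(other_cls.__setitem__)["key"]
    return union_members(pandas_key) - union_members(other_key)
```

Run over the whole API surface and all parameters rather than just `key`, this kind of diff is essentially the information UX-LIST-OF-SUPPORTED-METHODS-DETAILED calls for.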
Now, Modin could lie about what it supports, of course, either on purpose (unlikely) or, more likely, by accident. Some ways of dealing with this:
- Manual self-discipline, adding type annotations if and only if they have a corresponding test.
- In addition, or as an alternative, some validation of the supported types must be done based on the Modin test suite. There are tools that generate type annotations from running code; this is essentially the opposite, but probably some of that machinery could be reused (a rough sketch follows below).
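One possible shape for that "opposite direction" validation, with all names here being assumptions: wrap a method during the test run, record the argument types that are actually exercised, and afterwards compare the result against the declared annotation.

```python
import functools
from collections import defaultdict

observed_key_types = defaultdict(set)

def record_key_types(method, qualname):
    """Wrap a __setitem__-style method and note the runtime type of each key."""
    @functools.wraps(method)
    def wrapper(self, key, value):
        observed_key_types[qualname].add(type(key).__name__)
        return method(self, key, value)
    return wrapper

# During the test run (e.g. from a conftest.py):
#   DataFrame.__setitem__ = record_key_types(DataFrame.__setitem__, "DataFrame.__setitem__")
# Afterwards, anything declared in the annotation but never observed is
# untested; anything observed but not declared is an annotation gap.
```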
The benefits of this approach are that each project can work in isolation and in parallel. It also should work with less-exact emulations like Dask's lazy approach.
The downside is that it only validates signatures, not semantics. However, if coupled with good testing discipline on the part of the reimplementation libraries, that's OK (albeit a duplication of work).
Pandas has a test suite. Some of that test suite is testing the public API.
This test suite could be modified to allow running against multiple libraries, so one could see which tests Modin fails.
The benefit is that you get to see whether semantics are identical across libraries.
Some problems:
- Test failures don't necessarily easily match one-to-one to API calls. This would need to be annotated.
- Easier to solve, but annotations of "expected to fail" would need to be maintained outside the tests by the reimplementing libraries.
- Would have to figure out how to sync up generic test repo with Pandas repo, unless somehow Pandas tests are made generic.
Instead of reusing Pandas' test suite, one could create a completely new one.
This sounds like a lot of work.
It's possible to run the same code twice, e.g. on Modin and Pandas, record intermediate and final values, and then compare them. This would allow checking whether it's possible to switch semantically.
For example:
```python
from pandas_comparator import pandas, record_value

def f():
    x = pandas.DataFrame(...)
    record_value("x", x)
    y = lalalala(x)
    record_value("y", y)

f()
```
And then:
```
$ PANDAS=pandas python mycode.py
$ PANDAS=modin python mycode.py
$ pandas-comparator-diff
```
In practice this will be more complex, e.g. probably want different virtualenvs for different runs.
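To make this concrete, here is a minimal sketch of what the `pandas_comparator` module above might look like internally; the module name, environment variable, and file layout are all assumptions rather than an existing package:

```python
import atexit
import os
import pickle

# Pick the backend based on the PANDAS environment variable.
_backend = os.environ.get("PANDAS", "pandas")
if _backend == "modin":
    import modin.pandas as pandas
else:
    import pandas

_records = {}

def record_value(name, value):
    """Remember a labeled intermediate value for later comparison."""
    _records[name] = value

def _dump():
    # Write the recorded values to a backend-specific file that a
    # pandas-comparator-diff tool could later load and compare.
    with open(f"recorded-{_backend}.pickle", "wb") as f:
        pickle.dump(_records, f)

atexit.register(_dump)
```

A real version would also have to cope with values that don't pickle cleanly and keep separate runs apart, which is part of the "more complex in practice" caveat above.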
1. Pandas gains highly-specific type annotations.

By highly-specific, I mean e.g. using `@overload` to specify multiple different input/output pairs, using types like `Index[bool]` if a boolean Index is special, etc.
This is a bunch of work, but it's a good documentation practice and will also benefit people who are just using Pandas. So seems like an easy sell.
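As an illustration (the signatures are invented for this sketch, not Pandas' actual stubs), the `__setitem__` example from earlier could be spelled out with `@overload`:

```python
from typing import Any, List, overload

class DataFrame:
    @overload
    def __setitem__(self, key: str, value: Any) -> None: ...
    @overload
    def __setitem__(self, key: List[str], value: Any) -> None: ...
    @overload
    def __setitem__(self, key: "DataFrame", value: Any) -> None: ...
    def __setitem__(self, key, value):
        ...  # single runtime implementation, dispatching on the type of key
```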
2. Modin etc. add equally specific type annotations of their own.

- Modeled on Pandas.
- Careful to only add type annotations for things that are actually tested.
This is a bunch of work, but it's a good documentation practice and will also benefit people who are just using Modin.
3. Users of Pandas can use static analysis (mypy etc.) to validate that switching to Modin will work
Simply by having items 1 and 2, switching the import from Pandas to Modin will allow type checking to show whether the APIs being used are compatible.
No additional work needed by maintainers.
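For example, the only change in user code might be the import; a type checker such as mypy could then flag calls whose argument types Modin's annotations don't accept (the flagged line below is just an illustration):

```python
# import pandas as pd           # before
import modin.pandas as pd       # after: intended as a drop-in replacement

df = pd.DataFrame({"a": [1, 2, 3]})
df[df["a"] > 1] = 0  # mypy would flag this if Modin's __setitem__ annotation
                     # doesn't include a boolean mask among the accepted keys
```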
4. Modin etc. can optionally have a runtime checking mode for users attempting to switch from Pandas to Modin.
Static analysis may be difficult for some users.
So using e.g. typeguard, maintainers of Modin etc. can enable runtime checking with some sort of API flag or environment variable, so there's no cost by default.
This is a small amount of work, probably.
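One possible shape for this, using typeguard's `typechecked` decorator; the environment variable name and the decorator placement are assumptions:

```python
import os
from typeguard import typechecked

def maybe_typechecked(func):
    """Apply runtime type checking only when the user opts in."""
    if os.environ.get("MODIN_RUNTIME_TYPECHECKS"):
        return typechecked(func)
    return func  # default: no wrapping, so no overhead

class DataFrame:
    @maybe_typechecked
    def __setitem__(self, key: str, value: object) -> None:
        ...
```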
5. A new tool can generate diffs between type annotations for Pandas and type annotations for Modin etc., for documentation purposes and for MAINTAINER-GOAL-ADDRESS-INCOMPAT.
This will require some software development, but seems like a nicely scoped project.
While it's a good idea, in the short term this isn't going to help, because adding type annotations is going to be a long process.
For example:
- Pandas depends on NumPy types. There is no released version of NumPy with type annotations; the next release might have them (as of Oct 2020), but Pandas might not want to rely on the absolute latest version.
- Pandas still hasn't figured out how to make its own annotations public.
- Pandas is missing testing infrastructure.
- Partial type annotations are less useful.
Now, Modin etc. could start their own annotations in parallel, but they might diverge semantically, making them less useful.
So worth pursuing, but it's not an immediate solution.
Running the same code and (limited) data on two or more variants of Pandas/Pandas-alike has some nice properties:
- It's also useful for testing whether you can upgrade to the latest version of Pandas: you can compare old Pandas to new Pandas.
- It can be run as an independent project, iterating quickly.
A minimal version would allow running the same code on two versions of Pandas or a Pandas-alike, recording values; the user could then look at the differences.
What counts as a semantic difference is a large problem space, and likely user-specific, so once the basic framework is done, the likely work will be on making the comparison more flexible and more informative. For example, minor floating point differences might not matter to most users, sorting may or may not matter, and when things do differ semantically, summarizing the differences will be important.
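As a sketch of the kind of knobs this implies, the comparison step could lean on Pandas' own testing helpers, with tolerance and ordering exposed as user options (the function name and defaults here are assumptions):

```python
import pandas.testing as pdt

def compare_frames(name, old, new, rtol=1e-9, ignore_order=False):
    """Report whether two recorded DataFrames differ in a way the user cares about."""
    try:
        pdt.assert_frame_equal(
            old, new,
            check_exact=False, rtol=rtol,  # tolerate tiny floating point drift
            check_like=ignore_order,       # optionally ignore row/column ordering
        )
    except AssertionError as error:
        print(f"{name} differs:\n{error}")
```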