This repository has been archived by the owner on Nov 22, 2022. It is now read-only.

Add missing methods #464

Closed · wants to merge 14 commits

Conversation


@Fokko Fokko commented Aug 27, 2020

When adding the files to the Spark project, we got errors because it was unable to find some methods.

@@ -29,7 +29,8 @@ T = TypeVar("T")
U = TypeVar("U", bound=SupportsIAdd)

import socketserver as SocketServer
from typing import Any

_accumulatorRegistry: Dict = {}
zero323 (Owner):

If we annotate this, could we be more precise?

zero323 (Owner):

Just wondering ‒ what is the reason for including this? I'm not strictly against it, but I haven't seen any need for this one so far, and in general we seem to lean towards excluding internal APIs unless strictly necessary.

Fokko (Author):

The _accumulatorRegistry is used by other files, so we have to include it. Otherwise, mypy complains that it can't find the dict.

zero323 (Owner):

Interesting. Could you provide some details on the test setup? I haven't seen any mypy failures in apache/spark#29591 (still work in progress, though I don't expect it will make a huge impact on type checks).

Fokko (Author):

I'll whip up an example first thing tomorrow morning!

zero323 (Owner):

Thanks. Have a good night.

Fokko (Author):

This should be easily reproducible on your machine as well. I'm working on my branch from apache/spark#29180:

MacBook-Pro-van-Fokko:spark fokkodriesprong$ git branch
* SPARK-17333

Right now everything passes:

MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python 
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks passed.

The sphinx-build command was not found. Skipping Sphinx build for now.


all lint-python tests passed!

If we remove the _accumulatorRegistry:

MacBook-Pro-van-Fokko:spark fokkodriesprong$ nano python/pyspark/accumulators.pyi 
MacBook-Pro-van-Fokko:spark fokkodriesprong$ git diff
diff --git a/python/pyspark/accumulators.pyi b/python/pyspark/accumulators.pyi
index f60de25704..6eafe46a46 100644
--- a/python/pyspark/accumulators.pyi
+++ b/python/pyspark/accumulators.pyi
@@ -30,7 +30,7 @@ U = TypeVar("U", bound=SupportsIAdd)
 
 import socketserver as SocketServer
 
-_accumulatorRegistry: Dict[int, Accumulator]
+# _accumulatorRegistry: Dict[int, Accumulator]
 
 class Accumulator(Generic[T]):
     aid: int

Then it fails:

MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python 
starting python compilation test...
python compilation succeeded.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

starting mypy test...
mypy checks failed:
python/pyspark/worker.py:34: error: Module 'pyspark.accumulators' has no attribute '_accumulatorRegistry'
Found 1 error in 1 file (checked 185 source files)
1

It looks like _accumulatorRegistry is private, since it starts with an underscore, but it is actually imported in worker.py.
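The underscore is only a naming convention, which a small sketch makes concrete (using a hypothetical stand-in module, since pyspark is not importable here): Python happily resolves the import at runtime, so only the type checker needs the stub entry.

```python
import types

# Hypothetical stand-in for pyspark.accumulators; names are illustrative.
accumulators = types.ModuleType("accumulators")
accumulators._accumulatorRegistry = {}  # "private" by convention only

# worker.py effectively does
#     from pyspark.accumulators import _accumulatorRegistry
# which Python allows at runtime; the leading underscore is advisory.
registry = getattr(accumulators, "_accumulatorRegistry")
registry[1] = "accumulator with aid 1"
print(accumulators._accumulatorRegistry)  # {1: 'accumulator with aid 1'}
```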

zero323 (Owner):

> This should be easily reproducible on your machine as well

Now I see... This won't happen on my side, as worker.py has a dynamic stub.

> It looks like _accumulatorRegistry is private, since it starts with an underscore, but it is actually imported in worker.py.

There are quite a few of these imports all over the place. In general I tend to use specific ignores for these, as they're not user-facing API.

If you don't mind, I'll keep this open for now and finish syncing things for SPARK-32714, as it might require some further discussion about the scope of annotations.
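As a sketch of the "specific ignores" approach: mypy's error-code suffix suppresses exactly one class of diagnostic on one line. The worker.py form is shown in a comment (it needs pyspark to run); the runnable line below demonstrates the same mechanism with a deliberate mismatch.

```python
# For the worker.py import, a targeted ignore would look like:
#     from pyspark.accumulators import _accumulatorRegistry  # type: ignore[attr-defined]
# The comment is inert at runtime; only mypy reads it, and the [error-code]
# suffix keeps it from masking unrelated errors on the same line.

# Runnable illustration of the same mechanism with a deliberate mismatch:
x: int = "not an int"  # type: ignore[assignment]
print(type(x).__name__)  # str
```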

Fokko (Author):

Sure, let me know if I can help somewhere.

third_party/3/pyspark/broadcast.pyi (outdated; resolved)
@Fokko (Author) commented Sep 1, 2020

Sorry for the late reply. Caught up in daily work :) I'm also installing pandas and numpy, since they are imported in the new PyArrow example: https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py


@zero323 (Owner) commented Oct 9, 2020

I think we already picked everything that was required for the project port. I didn't experience any problems with the remaining bits, so let's keep these out for the time being and include them directly upstream if it ever proves necessary.

Thanks for all your work @Fokko!

@zero323 zero323 closed this Oct 9, 2020