@@ -29,7 +29,8 @@ T = TypeVar("T")
U = TypeVar("U", bound=SupportsIAdd)

import socketserver as SocketServer
from typing import Any

_accumulatorRegistry: Dict = {}
If we annotate this, could we be more precise?
Yes, it holds the Accumulators, where the int is a key: https://github.com/apache/spark/blob/7deb67c28f948cca4e768317ade6d68d2534408f/python/pyspark/context.py#L902
Just wondering: what is the reason for including this? I am not strictly against it, but I haven't seen any need for this one so far, and in general we seem to lean towards excluding internal APIs unless strictly necessary.
The _accumulatorRegistry is used by other files, therefore we have to include it. Otherwise, mypy will complain that it can't find the dict.
Interesting. Could you provide some details on the test setup? I haven't seen any mypy failures in apache/spark#29591 (still work in progress, though I don't expect it will make a huge impact on type checks).
I'll whip up an example first thing tomorrow morning!
Thanks. Have a good night.
This should be easy to reproduce on your machine as well. I'm working on my branch of apache/spark#29180:
MacBook-Pro-van-Fokko:spark fokkodriesprong$ git branch
* SPARK-17333
Right now everything passes:
MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python
starting python compilation test...
python compilation succeeded.
starting pycodestyle test...
pycodestyle checks passed.
starting flake8 test...
flake8 checks passed.
starting mypy test...
flake8 checks passed.
The sphinx-build command was not found. Skipping Sphinx build for now.
all lint-python tests passed!
If we remove the _accumulatorRegistry:
MacBook-Pro-van-Fokko:spark fokkodriesprong$ nano python/pyspark/accumulators.pyi
MacBook-Pro-van-Fokko:spark fokkodriesprong$ git diff
diff --git a/python/pyspark/accumulators.pyi b/python/pyspark/accumulators.pyi
index f60de25704..6eafe46a46 100644
--- a/python/pyspark/accumulators.pyi
+++ b/python/pyspark/accumulators.pyi
@@ -30,7 +30,7 @@ U = TypeVar("U", bound=SupportsIAdd)
import socketserver as SocketServer
-_accumulatorRegistry: Dict[int, Accumulator]
+# _accumulatorRegistry: Dict[int, Accumulator]
class Accumulator(Generic[T]):
aid: int
Then it fails:
MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python
starting python compilation test...
python compilation succeeded.
starting pycodestyle test...
pycodestyle checks passed.
starting flake8 test...
flake8 checks passed.
starting mypy test...
mypy checks failed:
python/pyspark/worker.py:34: error: Module 'pyspark.accumulators' has no attribute '_accumulatorRegistry'
Found 1 error in 1 file (checked 185 source files)
1
It looks like the _accumulatorRegistry is private, since it starts with a _, but it is actually being imported in worker.py.
This should be easy to reproduce on your machine as well
Now I see... This won't happen on my side, as worker.py has a dynamic stub.
It looks like the _accumulatorRegistry is private, since it starts with a _, but it is actually being imported in worker.py.
There are quite a few of these imports all over the place. In general I tend to use specific ignores for these, as they're not user-facing API.
If you don't mind, I'll keep this open for now and finish syncing things for SPARK-32714, as it might require some further discussion about the scope of annotations.
Sure, let me know if I can help somewhere
Sorry for the late reply. Caught up in daily work :) I'm also installing pandas and numpy. They are being imported in the new PyArrow example: https://github.com/apache/spark/blob/master/examples/src/main/python/sql/arrow.py
I think that we already picked everything that was required for the project port. I didn't experience any problems with the remaining bits, so let's keep these out for the time being and include them directly upstream if it ever proves necessary. Thanks for all your work @Fokko!
When adding the files to the Spark project, we got errors because mypy was unable to find some methods.