Use numpy.array instead of array in doctests (PR 23571) #5

Merged
srowen merged 1 commit into srowen:SPARK-26640 from HyukjinKwon:pr-23571 on Jan 17, 2019

Conversation

@HyukjinKwon commented Jan 17, 2019

What changes were proposed in this pull request?

`from numpy import array` is actually used in the doctests, so this is a false positive from lgtm.com: the import is passed into the doctests through the module's namespace (`__dict__`) when doctest runs.

Still, it looks better to use NumPy's `array` explicitly within each doctest, since `array.array` and `numpy.array` are easy to confuse anyway.
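
For illustration, here is a minimal, self-contained sketch of the pattern (the function and values are hypothetical, not taken from the PySpark sources): the doctest names `numpy.array` explicitly instead of relying on a module-level `from numpy import array` that doctest would pick up through the module namespace.

```
# Illustrative only: the doctest references numpy.array explicitly, so it does
# not depend on `from numpy import array` being injected into the module
# namespace, and cannot be confused with the standard-library array.array.
import doctest
import numpy


def sum_features(features):
    """Sum the entries of a feature vector.

    >>> sum_features(numpy.array([1.0, 2.0, 3.0]))
    6.0
    """
    return float(features.sum())


if __name__ == "__main__":
    doctest.testmod()
```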

How was this patch tested?

Manually via

```
cd python
./run-tests --python-executables=python
```

@HyukjinKwon (Author)

@srowen, I tested and made this change during the review of that PR, but please ignore it if the change is already in your local branch or if it's easier for you to push it directly to your branch yourself. Either way works :D.

@srowen (Owner) commented Jan 17, 2019

Oh yeah, I was watching out for false positives here but missed this one. It is clearer to prefix with `numpy` anyway. Yes, I'll add this to my PR, thank you.

srowen merged commit 73e0a25 into srowen:SPARK-26640 on Jan 17, 2019
srowen pushed a commit that referenced this pull request Aug 26, 2019
## What changes were proposed in this pull request?
This PR aims to improve the way physical plans are explained in Spark.

Currently, the explain output for a physical plan can look very cluttered: each operator's
string representation can be very wide and wraps around in the display, making it hard
to follow. This especially happens when explaining a query that 1) operates on wide tables
or 2) has complex expressions.

This PR attempts to split the output into two sections. In the header section, we display
the basic operator tree with a number associated with each operator. In this section, we strictly
control what we output for each operator. In the footer section, each operator is verbosely
displayed. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be
correlated by the originating expression id from its parent plan.

To illustrate, here is a simple plan displayed the old way vs. the new way.

Example query1 :
```
EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0
```

Old :
```
*(2) Project [key#2, max(val)apache#15]
+- *(2) Filter (isnotnull(max(val#3)apache#18) AND (max(val#3)apache#18 > 0))
   +- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)apache#15, max(val#3)apache#18])
      +- Exchange hashpartitioning(key#2, 200)
         +- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21])
            +- *(1) Project [key#2, val#3]
               +- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
                  +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
```
New :
```
Project (8)
+- Filter (7)
   +- HashAggregate (6)
      +- Exchange (5)
         +- HashAggregate (4)
            +- Project (3)
               +- Filter (2)
                  +- Scan parquet default.explain_temp1 (1)

(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]

(2) Filter [codegen id : 1]
Input     : [key#2, val#3]
Condition : (isnotnull(key#2) AND (key#2 > 0))

(3) Project [codegen id : 1]
Output    : [key#2, val#3]
Input     : [key#2, val#3]

(4) HashAggregate [codegen id : 1]
Input: [key#2, val#3]

(5) Exchange
Input: [key#2, max#11]

(6) HashAggregate [codegen id : 2]
Input: [key#2, max#11]

(7) Filter [codegen id : 2]
Input     : [key#2, max(val)#5, max(val#3)apache#8]
Condition : (isnotnull(max(val#3)apache#8) AND (max(val#3)apache#8 > 0))

(8) Project [codegen id : 2]
Output    : [key#2, max(val)#5]
Input     : [key#2, max(val)#5, max(val#3)apache#8]
```

Example Query2 (subquery):
```
SELECT * FROM   explain_temp1 WHERE  KEY = (SELECT Max(KEY) FROM   explain_temp2 WHERE  KEY = (SELECT Max(KEY) FROM   explain_temp3 WHERE  val > 0) AND val = 2) AND val > 3
```
Old:
```
*(1) Project [key#2, val#3]
+- *(1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) AND (val#3 > 3))
   :  +- Subquery scalar-subquery#39
   :     +- *(2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)apache#45])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47])
   :              +- *(1) Project [key#26]
   :                 +- *(1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2))
   :                    :  +- Subquery scalar-subquery#38
   :                    :     +- *(2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)apache#43])
   :                    :        +- Exchange SinglePartition
   :                    :           +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49])
   :                    :              +- *(1) Project [key#28]
   :                    :                 +- *(1) Filter (isnotnull(val#29) AND (val#29 > 0))
   :                    :                    +- *(1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int>
   :                    +- *(1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int>
   +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int>
```
New:
```
Project (3)
+- Filter (2)
   +- Scan parquet default.explain_temp1 (1)

(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]

(2) Filter [codegen id : 1]
Input     : [key#2, val#3]
Condition : (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3))

(3) Project [codegen id : 1]
Output    : [key#2, val#3]
Input     : [key#2, val#3]
===== Subqueries =====

Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23
HashAggregate (9)
+- Exchange (8)
   +- HashAggregate (7)
      +- Project (6)
         +- Filter (5)
            +- Scan parquet default.explain_temp2 (4)

(4) Scan parquet default.explain_temp2 [codegen id : 1]
Output: [key#26, val#27]

(5) Filter [codegen id : 1]
Input     : [key#26, val#27]
Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2))

(6) Project [codegen id : 1]
Output    : [key#26]
Input     : [key#26, val#27]

(7) HashAggregate [codegen id : 1]
Input: [key#26]

(8) Exchange
Input: [max#35]

(9) HashAggregate [codegen id : 2]
Input: [max#35]

Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22
HashAggregate (15)
+- Exchange (14)
   +- HashAggregate (13)
      +- Project (12)
         +- Filter (11)
            +- Scan parquet default.explain_temp3 (10)

(10) Scan parquet default.explain_temp3 [codegen id : 1]
Output: [key#28, val#29]

(11) Filter [codegen id : 1]
Input     : [key#28, val#29]
Condition : (isnotnull(val#29) AND (val#29 > 0))

(12) Project [codegen id : 1]
Output    : [key#28]
Input     : [key#28, val#29]

(13) HashAggregate [codegen id : 1]
Input: [key#28]

(14) Exchange
Input: [max#37]

(15) HashAggregate [codegen id : 2]
Input: [max#37]
```

Note:
I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow
and will not be able to incorporate the feedback immediately; I will start working on it as
soon as I can. Also, this PR currently provides the basic infrastructure for the explain
enhancement. The details for individual operators will be implemented in follow-up PRs.
## How was this patch tested?
Added a new test, `explain.sql`, that covers basic scenarios. More tests need to be added.
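
As a hedged sketch of how the two-section output above can be reproduced from PySpark (assuming Spark 3.0+, where this formatted mode is exposed as `EXPLAIN FORMATTED` in SQL and `df.explain(mode="formatted")` in Python; the view name and query below are illustrative, not part of this commit):

```
# Minimal sketch assuming Spark 3.0+ with the formatted explain mode available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("explain-formatted-demo").getOrCreate()

# Build a tiny table standing in for explain_temp1.
spark.range(100).selectExpr("id AS key", "id % 7 AS val") \
    .createOrReplaceTempView("explain_temp1")

# Prints the numbered operator tree (header) followed by per-operator details (footer).
spark.sql(
    "SELECT key, MAX(val) FROM explain_temp1 "
    "WHERE key > 0 GROUP BY key HAVING MAX(val) > 0"
).explain(mode="formatted")
```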

Closes apache#24759 from dilipbiswal/explain_feature.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon deleted the pr-23571 branch on March 3, 2020 01:20
srowen pushed a commit that referenced this pull request Nov 16, 2021
### What changes were proposed in this pull request?

This PR aims to support K8s image building with Java 17.
Please note that more effort is needed before all tests run successfully.

### Why are the changes needed?

The `OpenJDK` Docker Hub image switched the underlying OS from `Debian` to `OracleLinux` starting with Java 12,
so `java_image_tag` no longer works.

**BEFORE**
```
$ bin/docker-image-tool.sh -n -b java_image_tag=17 build
[+] Building 0.8s (6/17)
 => [internal] load build definition from Dockerfile                                                                                                 0.0s
 => => transferring dockerfile: 37B                                                                                                                  0.0s
 => [internal] load .dockerignore                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                      0.0s
 => [internal] load metadata for docker.io/library/openjdk:17                                                                                        0.4s
 => CACHED [ 1/13] FROM docker.io/library/openjdk:17@sha256:c7fffc2024948e6d75922025a17b7d81cb747fd0fe0167fef13c6fcfc72e4144                         0.0s
 => [internal] load build context                                                                                                                    0.1s
 => => transferring context: 69.25kB                                                                                                                 0.0s
 => ERROR [ 2/13] RUN set -ex &&     sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list &&     apt-get update &&     ln -s /li  0.2s
------
 > [ 2/13] RUN set -ex &&     sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list &&     apt-get update &&     ln -s /lib /lib64 &&     apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps &&     mkdir -p /opt/spark &&     mkdir -p /opt/spark/examples &&     mkdir -p /opt/spark/work-dir &&     touch /opt/spark/RELEASE &&     rm /bin/sh &&     ln -sv /bin/bash /bin/sh &&     echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su &&     chgrp root /etc/passwd && chmod ug+rw /etc/passwd &&     rm -rf /var/cache/apt/*:
#5 0.230 + sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list
#5 0.232 sed: can't read /etc/apt/sources.list: No such file or directory
------
executor failed running [/bin/sh -c set -ex &&     sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list &&     apt-get update &&     ln -s /lib /lib64 &&     apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps &&     mkdir -p /opt/spark &&     mkdir -p /opt/spark/examples &&     mkdir -p /opt/spark/work-dir &&     touch /opt/spark/RELEASE &&     rm /bin/sh &&     ln -sv /bin/bash /bin/sh &&     echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su &&     chgrp root /etc/passwd && chmod ug+rw /etc/passwd &&     rm -rf /var/cache/apt/*]: exit code: 2
Failed to build Spark JVM Docker image, please refer to Docker build output for details.
```

**AFTER (This PR with `-f` option)**
```
$ bin/docker-image-tool.sh -n -f kubernetes/dockerfiles/spark/Dockerfile.java17 build
[+] Building 29.3s (19/19) FINISHED
 => [internal] load build definition from Dockerfile.java17                                                                                          0.0s
 => => transferring dockerfile: 2.49kB                                                                                                               0.0s
 => [internal] load .dockerignore                                                                                                                    0.0s
 => => transferring context: 2B                                                                                                                      0.0s
 => [internal] load metadata for docker.io/library/debian:bullseye-slim                                                                              1.5s
 => [auth] library/debian:pull token for registry-1.docker.io                                                                                        0.0s
 => [internal] load build context                                                                                                                    0.1s
 => => transferring context: 80.54kB                                                                                                                 0.0s
 => CACHED [ 1/13] FROM docker.io/library/debian:bullseye-slim@sha256:dddc0f5f01db7ca3599fd8cf9821ffc4d09ec9d7d15e49019e73228ac1eee7f9               0.0s
 => [ 2/13] RUN set -ex &&     apt-get update &&     ln -s /lib /lib64 &&     apt install -y bash tini libc6 libpam-modules krb5-user libnss3 proc  25.5s
 => [ 3/13] COPY jars /opt/spark/jars                                                                                                                0.4s
 => [ 4/13] COPY bin /opt/spark/bin                                                                                                                  0.0s
 => [ 5/13] COPY sbin /opt/spark/sbin                                                                                                                0.0s
 => [ 6/13] COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/                                                                                    0.0s
 => [ 7/13] COPY kubernetes/dockerfiles/spark/decom.sh /opt/                                                                                         0.0s
 => [ 8/13] COPY examples /opt/spark/examples                                                                                                        0.0s
 => [ 9/13] COPY kubernetes/tests /opt/spark/tests                                                                                                   0.0s
 => [10/13] COPY data /opt/spark/data                                                                                                                0.0s
 => [11/13] WORKDIR /opt/spark/work-dir                                                                                                              0.0s
 => [12/13] RUN chmod g+w /opt/spark/work-dir                                                                                                        0.2s
 => [13/13] RUN chmod a+x /opt/decom.sh                                                                                                              0.2s
 => exporting to image                                                                                                                               1.3s
 => => exporting layers                                                                                                                              1.3s
 => => writing image sha256:ec961d957826c9b7eb4d00e900262130fc1708aef6cb51298b627d4bc91f834b                                                         0.0s
 => => naming to docker.io/library/spark                                                                                                             0.0s

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
```

### Does this PR introduce _any_ user-facing change?

Yes, this adds a new Dockerfile that is exposed to users.

### How was this patch tested?

Passes the K8s IT image build.

Closes apache#34586 from dongjoon-hyun/SPARK-37319.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>