
Palantir Spark Diff on v3.0.2 #737

Closed · wants to merge 22 commits (branch jc/pt-diff-on-v3.0.2)

Conversation

@jdcasale commented Mar 3, 2021

This PR rebases palantir/spark on top of upstream 3.0.2. It has the Palantir essentials, a few changes not merged upstream, and a few cherry-picks that weren't released yet; details on that below. The general suggestion is to look at each commit in isolation, as each should represent an atomic change that can be reviewed independently of the others. This is a big diff (sorry), but we don't really have a better way to manage this change.

This code has already been tested internally on f-s and downstream products.

What happens to our master

The diff you're looking at is comparing against a branch tracking upstream's v3.0.2 tag. If you approve, we'll apply this set of Palantir commits on top of v3.0.2, and then do a 3-way merge with master where we throw away master's changes but keep its commit history (like here).

Changes

Obviously we're on Spark 3 now instead of mostly Spark 2. While fixing merge conflicts we tried to change as little as we could. No tests were dropped and we're not expecting any behavioural changes.

However, we dropped some "logical items" listed in our FORK.md, either because they were taken upstream or because we no longer need the change internally. Below is the annotated list showing what we did with each item. Items not listed there were probably dropped.

Items listed in FORK.md

Differences with upstream

  • SPARK-21195 - Kept: Automatically register new metrics from sources and wire default registry
  • SPARK-20952 - Kept: ParquetFileFormat should forward TaskContext to its forkjoinpool
  • SPARK-20001 / SPARK-13587 - Kept: Support PythonRunner executing inside a Conda env (and R)
  • SPARK-17059 - Dropped: (fixed with DSv2) Allow FileFormat to specify partition pruning strategy via splits
  • SPARK-24345 - Dropped: (merged upstream) Improve ParseError stop location when offending symbol is a token
  • SPARK-23795 - Dropped: (no longer needed) Make AbstractLauncher#self() protected
  • SPARK-18079 - Kept: CollectLimitExec.executeToIterator should perform per-partition limits
  • SPARK-15777 - Kept: Catalog federation changes
  • SPARK-17091 - Dropped: (merged upstream) Better pushdown for IN expressions in parquet via UserDefinedPredicate
  • SPARK-26626 - Kept: Limited the maximum size of repeatedly substituted aliases
  • Kept: SafeLogging implemented in a bunch of files

Additions

All these are kept:

  • Gradle plugin to easily create custom docker images for use with k8s
  • Filter rLibDir by exists so that daemon.R references the correct file (#460)
  • Add pre-installed conda configuration and use to find rlib directory (#700)
  • Supports Arrow-serialization of Python 2 strings (#678)
  • SPARK-26200 - Allow setting HADOOP_CONF_DIR as a spark config
  • K8s local file mounting
  • K8s local deploy mode

Reverts

These were breaking changes we had reverted. Most of these were later undone upstream, so we no longer need to track reverts.

  • SPARK-25908 - Removal of monotonically_increasing_id, toDegree, toRadians, approxCountDistinct, unionAll
  • SPARK-25862 - Removal of unboundedPreceding, unboundedFollowing, currentRow
  • SPARK-26127 - Removal of deprecated setters from tree regression and classification models
  • SPARK-25867 - Removal of KMeans computeCost
  • SPARK-26216 - Change to UserDefinedFunction type
    • SPARK-26323 - Scala UDF null checking
    • SPARK-26580 - Bring back scala 2.11 behaviour of primitive types null behaviour
  • SPARK-26133 - Old OneHotEncoder (accepting a break)
  • SPARK-11215 - StringIndexer multi column support (accepting a break here from 3.1)
  • SPARK-26616 - No document frequency in IDFModel (not a break)

Comment on lines +136 to +143
#- restore_cache:
# keys:
# - build-maven-{{ .Branch }}-{{ .BuildNum }}
# - build-maven-{{ .Branch }}-
# - build-maven-master-

We disabled this cache for this branch. It was causing puzzling compile errors and we couldn't get to the bottom of it.

- build-binaries-{{ checksum "build/mvn" }}-{{ checksum "build/sbt" }}
- build-binaries-
- run:
command: ./build/mvn -DskipTests -Psparkr -Phadoop-palantir install

Added the -Phadoop-palantir profile here. Our pom previously declared hadoop-cloud and dists/palantir-pom/bom as default modules. I've now moved those under the hadoop-palantir profile for a tidier diff that makes clear we're building those modules as dependencies of our "Palantir build".

Comment on lines +7 to +10
Some highlights include:

- predicate pushdown additions, including a patched version of Parquet
- various misc bugfixes

Think we can leave this out.

@jdcasale (author) replied:

Will remove this and squash it into 2a2df72 once we're ready to merge.

-set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
+set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.1-src.zip;%PYTHONPATH%

We jumped ahead here and took the bump from 3.1. 3.0.2 is only on 0.10.9.

The problem we found during testing is that with a dependency declared on 0.10.9, Conda would resolve 0.10.9.1. That difference trips up Py4j: when launching its gateway, the Python side takes its own version (0.10.9.1) and tries to find a JAR of exactly that version (not a prefix match). But in Spark we declared 0.10.9, and that's the JAR we packaged. So Py4j fails to find the JAR and throws. A sketch of the failure mode follows.
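To illustrate, here is a minimal Python sketch of an exact-version JAR lookup. This is not Py4j's actual code; the function and JAR naming are assumptions for illustration only.

```python
import os

# Illustrative sketch, not Py4j's real lookup: the Python side derives the
# expected JAR name from its own exact version string, so a 0.10.9.1 client
# never matches a packaged 0.10.9 JAR -- there is no prefix matching.
def find_py4j_jar(jar_dir, python_side_version):
    expected = "py4j-{}.jar".format(python_side_version)  # e.g. "py4j-0.10.9.1.jar"
    candidate = os.path.join(jar_dir, expected)
    if not os.path.exists(candidate):  # "py4j-0.10.9.jar" in jar_dir is ignored
        raise FileNotFoundError("Could not find {} in {}".format(expected, jar_dir))
    return candidate
```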

* logged, and suppressed. Therefore exceptions when executing this method won't
* make the job fail.
*
* @since 3.1.0

Took a clean cherry-pick from upstream 3.1 here. @fsamuel-bs was kind enough to get this merged upstream and uses it internally.

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${project.version}</version>

On master, we didn't have this <version>${project.version}</version>. I don't know if there was a reason for that, but my understanding is that module dependencies should match the project version.

<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-dist_2.12-hadoop-palantir-bom</artifactId>
<version>3.0.2</version>

This was 3.0.0-SNAPSHOT on master. It obviously doesn't matter for publishing because we override the version, but I wondered if it's good practice to update these versions between releases; that would be 3.0.3-SNAPSHOT in our case.

Comment on lines +196 to +200
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-azure-datalake</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>

We had a bunch of other extra dependencies in hadoop-cloud that we could remove without breaking applications. But this Azure dependency was stubborn and did cause errors, so in the interest of time I left it in. No good reason for it as far as I can tell.

@rshkv commented Mar 3, 2021:

For convenience, this is hadoop-cloud/pom.xml compared with master (GH). Give GH a while to load before jumping to the diff.

Comment on lines +3144 to +3151
<profile>
<id>hadoop-palantir</id>
<modules>
<module>hadoop-cloud</module>
<module>dists/hadoop-palantir-bom</module>
<module>dists/hadoop-palantir</module>
</modules>
</profile>

For convenience, pom.xml compared against master (give it a while to load).

Comment on lines +45 to +49
# Palantir: Don't allow skipping these tests
#
# @unittest.skipIf(
# not have_pandas or not have_pyarrow,
# pandas_requirement_message or pyarrow_requirement_message)

We accidentally didn't bump pyarrow and pandas in our python/Dockerfile, so all these tests ended up being skipped without us noticing. To avoid this the next time we take an Arrow bump, I disabled the test skipping here; you now get a friendly error message about the version mismatch instead. A sketch of the effect is below.
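As a sketch of the intended effect (the module path and helper names assume pyspark 3.0's test-support layout), the requirement checks can be invoked directly so a version mismatch raises instead of silently skipping:

```python
# Assumed pyspark 3.0 module layout; these helpers raise ImportError with a
# descriptive message when pandas/pyarrow are missing or too old.
from pyspark.sql.pandas.utils import (
    require_minimum_pandas_version,
    require_minimum_pyarrow_version,
)

require_minimum_pandas_version()   # fails loudly rather than skipping the test
require_minimum_pyarrow_version()  # same for pyarrow
```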

Comment on lines +431 to +436
val KUBERNETES_SECRET_FILE_MOUNT_ENABLED =
ConfigBuilder("spark.kubernetes.file.secretMount.enabled")
.doc("Mount spark-submit files as base64-encoded k8s secret instead of uploading to " +
s"${KUBERNETES_FILE_UPLOAD_PATH.key}")
.booleanConf
.createWithDefault(false)
@rshkv commented Mar 3, 2021:

We had to re-do local file mounting, and the important resulting difference is that wherever we use secret-based file mounting internally (i.e. everywhere), we now have to set:

spark.kubernetes.file.secretMount.enabled: true

We can argue about the default. My preference was to stay in-line with upstream on the default behaviour.

The reason I had to touch this is that Spark 3 introduced its own behaviour for mounting local files into drivers by uploading them to an HCFS (apache#23546). You activate that just by adding a local file to "spark.files", and it breaks for us because we don't have an HCFS and use secret-based mounting instead. So the behaviour now is: if you enable secret-based volume mounting, we disable HCFS-based mounting. A configuration sketch follows.
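For illustration only, a minimal PySpark sketch of opting in; the config key comes from this diff, while the app name and file path are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("secret-mount-example")  # placeholder
    # Opt in to secret-based mounting of spark-submit files; per the comment
    # above, this also disables the upstream HCFS upload path for local files.
    .config("spark.kubernetes.file.secretMount.enabled", "true")
    .config("spark.files", "/local/path/app.conf")  # placeholder local file
    .getOrCreate()
)
```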

Comment on lines +160 to +161
// (Palantir) If file-mounting with secrets is enabled, don't upload files here
if (conf.get(KUBERNETES_SECRET_FILE_MOUNT_ENABLED)) return additionalProps.toMap

That's new, as explained above. We want to be able to mount files using secrets without having to upload to an HCFS as well.

* Spark's out-of-the-box solution is in [[BasicDriverFeatureStep]] and serves local files by
* uploading them to an HCFS and serving them from there.
*/
private[spark] class MountLocalDriverFilesFeatureStep(conf: KubernetesDriverConf)

The change here is that we no longer mount files into executors. That stopped working because the volumes in executors and drivers would refer to different secret names (previously they referred to the same one), and KubernetesClientApplication only uploads secrets declared in the driver, not in the executor. We could ensure they reference the same secret, but mounting into executors is actually unnecessary in the first place: once you mount "spark.files" into the driver, the driver makes the files available to executors via SparkContext#addFile, as the sketch below shows.
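A small PySpark sketch of that distribution path, using only public APIs (the paths are placeholders):

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()
sc.addFile("/opt/conf/app.conf")  # placeholder: file already mounted into the driver

def read_conf(_):
    # Executors fetch the file from the driver on first use.
    with open(SparkFiles.get("app.conf")) as f:
        return f.readline()

print(sc.parallelize([0]).map(read_conf).collect())
```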

@@ -31,9 +31,8 @@ RUN set -ex && \
sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
apt-get update && \
ln -s /lib /lib64 && \
-    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 && \
+    apt-get install -y bash tini libc6 libpam-modules krb5-user libnss3 && \

#723 fixes warnings like

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Comment on lines +125 to +126
// TODO(rshkv): Change to match default once upstream picks it up
conf.set(ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED, "false")

Matching what we have on master here. Not sure if we want this, but IIRC we had failing tests without it.

Comment on lines +189 to +192
// TODO(palantir): Assertion below differs from upstream
// palantir/parquet-mr always returns statistics
assert(oneFooter.getFileMetaData.getCreatedBy.contains("impala") ^
oneBlockColumnMeta.getStatistics.hasNonNullValue)

Matching palantir/spark's master here.

 var tableStats = getTableStats(tblName)
-assert(tableStats.sizeInBytes == 601)
+assert(tableStats.sizeInBytes == 650)

Again, matching palantir/spark's master.

@jdcasale force-pushed the jc/pt-diff-on-v3.0.2 branch from 2621724 to 15df9c0 on March 4, 2021 at 02:57
@rshkv force-pushed the jc/pt-diff-on-v3.0.2 branch 11 times, most recently from ff1060a to 55a3395 on March 4, 2021 at 14:34
@jdcasale requested a review from robert3005 on March 5, 2021 at 17:32
@rshkv force-pushed the jc/pt-diff-on-v3.0.2 branch 4 times, most recently from d16e32f to e5cffbf on March 8, 2021 at 20:39
jdcasale and others added 9 commits March 9, 2021 01:15
Co-authored-by: Yifei Huang <yifeih@palantir.com>
Co-authored-by: Josh Casale <jcasale@palantir.com>
Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
Palantir's mechanism to mount local files into pods using
secret-populated volumes. Used internally to provide pods with config
files.

Co-authored-by: mccheah <mcheah@palantir.com>
Co-authored-by: Josh Casale <jcasale@palantir.com>
Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
Co-authored-by: Onur Satici <onursatici@gmail.com>
Co-authored-by: Josh Casale <jcasale@palantir.com>
Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
Co-authored-by: Dan Sanduleac <dsanduleac@palantir.com>
Co-authored-by: Josh Casale <jcasale@palantir.com>
Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
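To see the type inference in question, here is a small sketch using pandas and pyarrow. It runs under Python 3 with bytes standing in for Python 2 str, so treat it as an approximation of the Python 2 behaviour described above:

```python
import pandas as pd
import pyarrow as pa

# Under Python 2, str is a bytes type, so this is what Arrow sees for string
# columns: it infers binary, not string, unless a type is forced explicitly.
df = pd.DataFrame({"col": [b"a", b"b"]})
print(pa.Table.from_pandas(df).schema)  # col: binary
```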
This PR upgrades Py4J from 0.10.9 to 0.10.9.1, which contains some bug fixes and improvements. It contains one bug fix (py4j/py4j@4152353).

### Why are the changes needed?

To leverage fixes from the upstream in Py4J.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Jenkins build and GitHub Actions will test it out.

Closes apache#31009 from HyukjinKwon/SPARK-33984.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…d metrics in sources

Co-authored-by: Robert Kruszewski <robertk@palantir.com>
Co-authored-by: Josh Casale <jcasale@palantir.com>
Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
The upstream default is INT96 but INT96 is considered deprecated by
Parquet [1] and we rely internally on the default being INT64
(TIMESTAMP_MICROS).

INT64 reduces the size of Parquet files and avoids unnecessary conversions
of microseconds to nanoseconds, see [2].

Apache went down the same route in [2] but then reverted to remain
compatible with Hive and Presto in [3].

[1] https://issues.apache.org/jira/browse/PARQUET-323
[2] apache#24425
[3] apache#28450
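As a sketch, the relevant public config key is spark.sql.parquet.outputTimestampType; setting it explicitly documents the choice regardless of the fork's default (the output path below is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# This fork defaults to INT64 (TIMESTAMP_MICROS); upstream 3.0 defaults to INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark.range(1).selectExpr("current_timestamp() AS ts") \
    .write.mode("overwrite").parquet("/tmp/ts-example")  # placeholder path
```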
@rshkv force-pushed the jc/pt-diff-on-v3.0.2 branch from e5cffbf to f830c35 on March 9, 2021 at 01:17
@rshkv commented Mar 9, 2021:

Checked that everything we're expecting to be log-safe still is (using this list for reference).

viirya and others added 3 commits March 9, 2021 01:43
…ver contains sensitive attributes should be redacted

### What changes were proposed in this pull request?

Make sure sensitive attributes are redacted in the history server log. This is a backport of the original PR apache#30446.

### Why are the changes needed?

We found that secure attributes like passwords in SparkListenerJobStart and SparkListenerStageSubmitted events were not redacted, so sensitive attributes could be viewed directly.

A screenshot can be viewed in the attachment of JIRA SPARK-33504.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#31631 from viirya/SPARK-33504-3.0.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…in log

### What changes were proposed in this pull request?

Redact the SparkListenerEnvironmentUpdate event in the log when its processing time exceeds logSlowEventThreshold.

This is the backport of apache#31335 to branch-3.0.

### Why are the changes needed?

Credentials could be exposed when the event's processing time exceeds logSlowEventThreshold.

As this is a security issue, it is better to backport it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested in original PR.

Closes apache#31634 from viirya/SPARK-34232-3.0.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
… blocks

### What changes were proposed in this pull request?

Fix a problem that can lead to data correctness issues after partial block retries in `OneForOneBlockFetcher` when using `FetchShuffleBlocks`.

### Why are the changes needed?
This is a data correctness bug. There was no problem when the old protocol sent `OpenBlocks` before fetching chunks in `OneForOneBlockFetcher`.
In the latest branch, `OpenBlocks` has been replaced by `FetchShuffleBlocks`. However, `FetchShuffleBlocks` reads shuffle blocks in a different order than `blockIds` in `OneForOneBlockFetcher`; the `blockIds` list is used to match block IDs with shuffle data and index files, and it is now out of order.
This leads to reading the wrong block chunk when some blocks fail to fetch in `OneForOneBlockFetcher`, which retries the remaining blocks based on the order of `blockIds`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes apache#31643 from seayoun/yuhaiyang_fix_use_FetchShuffleBlocks_order.

Lead-authored-by: yuhaiyang <yuhaiyang@yuhaiyangs-MacBook-Pro.local>
Co-authored-by: yuhaiyang <yuhaiyang@172.19.25.126>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4e43819)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@rshkv force-pushed the jc/pt-diff-on-v3.0.2 branch from a2bd672 to 6d71135 on March 9, 2021 at 01:45
@rshkv commented Mar 9, 2021:

Cherry-picked some scary looking fixes from branch-3.0.

@rshkv commented Mar 9, 2021:

I merged our current master into this branch using --strategy=ours, so:

(jc/pt-diff-on-v3.0.2) $ git merge master --strategy=ours

I'd later just merge this branch into master to keep our old commit history but discard its changes.

@rshkv force-pushed the jc/pt-diff-on-v3.0.2 branch from 1a307d2 to a2bd672 on March 9, 2021 at 16:46
@rshkv commented Mar 9, 2021:

Never mind, the above introduced 50 participants and grew the commit count from 20 to 1700+. I undid that. We can discuss what happens to master after this later.

@robert3005 commented:
LGTM, the changes I had in mind are all there.

@rshkv commented Mar 10, 2021:

Thank you, @robert3005

@rshkv commented Mar 10, 2021:

We created a new branch called palantir-3.0.2 from this one. It's the new repo default branch and we released 3.0.2-palantir.0 from it.

Going to close this.
