Palantir Spark Diff on v3.0.2 #737
Conversation
#- restore_cache:
#  keys:
#  - build-maven-{{ .Branch }}-{{ .BuildNum }}
#  - build-maven-{{ .Branch }}-
#  - build-maven-master-
We disabled this cache for this branch. It was causing puzzling compile errors and we couldn't get to the bottom of it.
- build-binaries-{{ checksum "build/mvn" }}-{{ checksum "build/sbt" }}
- build-binaries-
- run:
    command: ./build/mvn -DskipTests -Psparkr -Phadoop-palantir install
Added the -Phadoop-palantir profile here. Our pom previously declared hadoop-cloud and dists/palantir-pom/bom as default modules. I now moved those under the hadoop-palantir profile for a more orderly diff that makes clear we're building those modules as dependencies of our "Palantir build".
Some highlights include:
- predicate pushdown additions, including a patched version of Parquet
- various misc bugfixes
Think we can leave this out.
Will remove this and squash it into 2a2df72 once we're ready to merge.
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.1-src.zip;%PYTHONPATH%
We jumped ahead here and took the bump from 3.1; 3.0.2 is only on 0.10.9.

The problem we found during testing is that with a dependency declared on 0.10.9, Conda would resolve 0.10.9.1. And that difference would trip up Py4J: when launching its gateway, the Python side would take its own version (0.10.9.1) and try to find a JAR of the exact same version (not a prefix) (here). But in Spark we declared 0.10.9 and that's the JAR we packaged. So Py4J fails to find the JAR and throws.
* logged, and suppressed. Therefore exceptions when executing this method won't
* make the job fail.
*
* @since 3.1.0
Took a clean cherry-pick from upstream 3.1 here. @fsamuel-bs was kind enough to get this merged upstream and uses it internally.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_${scala.binary.version}</artifactId>
  <version>${project.version}</version>
On master, we didn't have this <version>${project.version}</version>. I don't know if there was a reason for that, but I understand module dependencies should match the project version.
<parent>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-dist_2.12-hadoop-palantir-bom</artifactId>
  <version>3.0.2</version>
This was 3.0.0-SNAPSHOT on master. It obviously doesn't matter for publishing because we override the version. But I wondered if it's good practice to update these versions between releases; that would be 3.0.3-SNAPSHOT in our case.
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-azure-datalake</artifactId>
  <scope>${hadoop.deps.scope}</scope>
</dependency>
We had a bunch of other extra dependencies in hadoop-cloud that we could remove without breaking applications. But this Azure dependency was stubborn and did cause errors. So in the interest of time I left it. No good reason for it as far as I can tell.
For convenience, this is hadoop-cloud/pom.xml compared with master (GH). Give GH a while to load before jumping to the diff.
<profile>
  <id>hadoop-palantir</id>
  <modules>
    <module>hadoop-cloud</module>
    <module>dists/hadoop-palantir-bom</module>
    <module>dists/hadoop-palantir</module>
  </modules>
</profile>
For convenience, pom.xml compared against master (give it a while to load).
# Palantir: Don't allow skipping these tests
#
# @unittest.skipIf(
#     not have_pandas or not have_pyarrow,
#     pandas_requirement_message or pyarrow_requirement_message)
We accidentally didn't bump pyarrow and pandas in our python/Dockerfile, so all these tests ended up being skipped without us noticing. To avoid this the next time we take an Arrow bump, I disabled the test skipping here; you now get a friendly error message about the version mismatch instead.
val KUBERNETES_SECRET_FILE_MOUNT_ENABLED =
  ConfigBuilder("spark.kubernetes.file.secretMount.enabled")
    .doc("Mount spark-submit files as base64-encoded k8s secret instead of uploading to " +
      s"${KUBERNETES_FILE_UPLOAD_PATH.key}")
    .booleanConf
    .createWithDefault(false)
We had to re-do local file mounting, and the important resulting difference here is that wherever we use secret-based file mounting internally (i.e. everywhere) we have to set:
spark.kubernetes.file.secretMount.enabled: true
We can argue about the default. My preference was to stay in line with upstream on the default behaviour.
The reason I had to touch this is that Spark 3 introduced its own behaviour to mount local files into drivers by uploading them to an HCFS (apache#23546). You activate that just by adding a local file to "spark.files", and it breaks for us because we don't have an HCFS and use secret-based mounting instead. So the behaviour now is: if you enable secret-based volume mounting, we disable HCFS-based mounting.
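For illustration, a minimal sketch (not taken from this diff) of what a submission-time configuration could look like with the new flag; the object name and file path are hypothetical:

```scala
import org.apache.spark.SparkConf

object SecretMountConfSketch {
  // With the flag from this PR enabled, local files listed in spark.files are mounted
  // into the driver from a secret-populated volume instead of being uploaded to
  // spark.kubernetes.file.upload.path (upstream's HCFS mechanism).
  val submitConf: SparkConf = new SparkConf()
    .set("spark.kubernetes.file.secretMount.enabled", "true")
    .set("spark.files", "/opt/app/conf/service.yml") // hypothetical driver-local file
}
```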
// (Palantir) If file-mounting with secrets is enabled, don't upload files here
if (conf.get(KUBERNETES_SECRET_FILE_MOUNT_ENABLED)) return additionalProps.toMap
That's new as explained above. We want to be able to mount files using secrets and not have to upload to an HCFS as well.
* Spark's out-of-the-box solution is in [[BasicDriverFeatureStep]] and serves local files by
* uploading them to an HCFS and serving them from there.
*/
private[spark] class MountLocalDriverFilesFeatureStep(conf: KubernetesDriverConf)
The change here is that we no longer mount files into executors. It stopped working because the volumes in executors and drivers would refer to different secret names (previously they referred to the same one). But KubernetesClientApplication only uploads secrets declared in the driver, not in the executor. We could ensure they reference the same secret, but it's actually unnecessary to mount into executors in the first place. Once you mount "spark.files" into the driver, the driver makes the files available to executors via SparkContext#addFile.
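To make that last point concrete, here is a minimal sketch (names hypothetical, not code from this PR) of the addFile/SparkFiles round trip that makes executor-side mounting unnecessary:

```scala
import java.io.File
import org.apache.spark.{SparkContext, SparkFiles}

object DriverFileShareSketch {
  // Assumes `driverLocalPath` exists only on the driver, e.g. mounted from a secret volume.
  def shareWithExecutors(sc: SparkContext, driverLocalPath: String): Unit = {
    sc.addFile(driverLocalPath) // the driver registers and serves the file
    val fileName = new File(driverLocalPath).getName
    sc.parallelize(1 to 4).foreach { _ =>
      // Each executor fetches the file from the driver on first access
      println(s"Executor-local copy: ${SparkFiles.get(fileName)}")
    }
  }
}
```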
@@ -31,9 +31,8 @@ RUN set -ex && \
    sed -i 's/http:\/\/deb.\(.*\)/https:\/\/deb.\1/g' /etc/apt/sources.list && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 && \
    apt-get install -y bash tini libc6 libpam-modules krb5-user libnss3 && \
#723 fixes warnings like
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
// TODO(rshkv): Change to match default once upstream picks it up
conf.set(ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED, "false")
Matching what we have on master here. Not sure if we want this, but IIRC we had failing tests without it.
// TODO(palantir): Assertion below differs from upstream
// palantir/parquet-mr always returns statistics
assert(oneFooter.getFileMetaData.getCreatedBy.contains("impala") ^
  oneBlockColumnMeta.getStatistics.hasNonNullValue)
Matching palantir/spark's master here.
var tableStats = getTableStats(tblName)
assert(tableStats.sizeInBytes == 601)
assert(tableStats.sizeInBytes == 650)
Again, matching palantir/spark's master.
Co-authored-by: Yifei Huang <yifeih@palantir.com> Co-authored-by: Josh Casale <jcasale@palantir.com> Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
Palantir's mechanism to mount local files into pods using secret-populated volumes. Used internally to provide pods with config files. Co-authored-by: mccheah <mcheah@palantir.com> Co-authored-by: Josh Casale <jcasale@palantir.com> Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
Co-authored-by: Onur Satici <onursatici@gmail.com> Co-authored-by: Josh Casale <jcasale@palantir.com> Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
When serializing a Pandas dataframe using Arrow under Python 2, Arrow can't tell if string columns should be serialized as string type or as binary (due to how Python 2 stores strings). The result is that Arrow serializes string columns in Pandas dataframes to binary ones. We can remove this when we discontinue support for Python 2. See original PR [1] and follow-up for 'mixed' type columns [2]. [1] #679 [2] #702
Co-authored-by: Dan Sanduleac <dsanduleac@palantir.com> Co-authored-by: Josh Casale <jcasale@palantir.com> Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
This PR upgrade Py4J from 0.10.9 to 0.10.9.1 that contains some bug fixes and improvements. It contains one bug fix (py4j/py4j@4152353). To leverage fixes from the upstream in Py4J. No. Jenkins build and GitHub Actions will test it out. Closes apache#31009 from HyukjinKwon/SPARK-33984. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…d metrics in sources Co-authored-by: Robert Kruszewski <robertk@palantir.com> Co-authored-by: Josh Casale <jcasale@palantir.com> Co-authored-by: Will Raschkowski <wraschkowski@palantir.com>
The upstream default is INT96 but INT96 is considered deprecated by Parquet [1] and we rely internally on the default being INT64 (TIMESTAMP_MICROS). INT64 reduces the size of Parquet files and avoids unnecessary conversions of microseconds to nanoseconds, see [2]. Apache went down the same route in [2] but then reverted to remain compatible with Hive and Presto in [3]. [1] https://issues.apache.org/jira/browse/PARQUET-323 [2] apache#24425 [3] apache#28450
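If a consumer ever needs the upstream INT96 behaviour back (e.g. for Hive or Presto interop), a per-session override via the standard spark.sql.parquet.outputTimestampType conf should do it. A hedged sketch, not part of this diff, with a hypothetical output path:

```scala
import org.apache.spark.sql.SparkSession

object TimestampTypeOverrideSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("int96-override-sketch")
      .getOrCreate()
    // Revert to upstream's (deprecated) INT96 representation for this session only;
    // the fork default described above is TIMESTAMP_MICROS (INT64).
    spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
    spark.range(1).selectExpr("current_timestamp() AS ts")
      .write.mode("overwrite").parquet("/tmp/ts-int96-sketch") // hypothetical output path
    spark.stop()
  }
}
```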
Checked that everything we're expecting to be log-safe still is (using this list for reference).
…ver contains sensitive attributes should be redacted ### What changes were proposed in this pull request? To make sure the sensitive attributes to be redacted in the history server log. This is the backport of original PR apache#30446. ### Why are the changes needed? We found the secure attributes like password in SparkListenerJobStart and SparkListenerStageSubmitted events would not been redated, resulting in sensitive attributes can be viewd directly. The screenshot can be viewed in the attachment of JIRA Spark-33504 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes apache#31631 from viirya/SPARK-33504-3.0. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…in log ### What changes were proposed in this pull request? Redact event SparkListenerEnvironmentUpdate in log when its processing time exceeded logSlowEventThreshold. This is the backport of apache#31335 to branch-3.0. ### Why are the changes needed? Credentials could be exposed when its processing time exceeded logSlowEventThreshold. As this is related to security issue, it is better to backport it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested in original PR. Closes apache#31634 from viirya/SPARK-34232-3.0. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
… blocks ### What changes were proposed in this pull request? Fix a problems which can lead to data correctness after part blocks retry in `OneForOneBlockFetcher` when use `FetchShuffleBlocks` . ### Why are the changes needed? This is a data correctness bug, It's is no problems when use old protocol to send `OpenBlocks` before fetch chunks in `OneForOneBlockFetcher`; In latest branch, `OpenBlocks` has been replaced to `FetchShuffleBlocks`. Howerver, `FetchShuffleBlocks` read shuffle blocks order is not the same as `blockIds` in `OneForOneBlockFetcher`; the `blockIds` is used to match blockId with shuffle data with index, now it is out of order; It will lead to read wrong block chunk when some blocks fetch failed in `OneForOneBlockFetcher`, it will retry the rest of the blocks in `blockIds` based on the `blockIds`'s order. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Closes apache#31643 from seayoun/yuhaiyang_fix_use_FetchShuffleBlocks_order. Lead-authored-by: yuhaiyang <yuhaiyang@yuhaiyangs-MacBook-Pro.local> Co-authored-by: yuhaiyang <yuhaiyang@172.19.25.126> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4e43819) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Cherry-picked some scary looking fixes from branch-3.0.
Nevermind, the above introduced 50 participants and grew the commits from 20 to 1700+. I undid that. We can discuss what happens to master after this later.
LGTM, the changes I had in mind are all there.
Thank you, @robert3005
We created a new branch. Going to close this.
This PR will rebase palantir/spark on top of upstream 3.0.2. It has the Palantir essentials, a few changes not merged upstream, and a few cherry-picks that weren't released yet. Details on that below. The general suggestion is to look at each commit in isolation, as it should represent an atomic change that can be reviewed independently of the other changes. This is a big diff (sorry), but we don't really have a better way to manage this change.
This code has already been tested internally on f-s and downstream products.
What happens to our master
The diff you're looking at is comparing against a branch tracking upstream's v3.0.2 tag. If you approve, we'll apply this set of Palantir commits on top of v3.0.2, and then do a 3-way merge with master where we throw away master's changes but keep its commit history (like here).
Changes
Obviously we have Spark 3 now instead of being mostly Spark 2. While fixing merge conflicts we tried to change as little as we could. No tests got dropped and we're not expecting any behavioural changes.
However, we dropped some "logical items" listed in our FORK.md. That's either because they were taken upstream, or because we no longer need the change internally. Below is the annotated list with what we did with the respective item. Items not listed there were probably dropped.
Items listed in FORK.md
Differences with upstream
- Allow FileFormat to specify partition pruning strategy via splits
- Improve ParseError stop location when offending symbol is a token
- Make AbstractLauncher#self() protected
- Better pushdown for IN expressions in parquet via UserDefinedPredicate

Additions
All these are kept:
Reverts
These were breaking changes we had reverted. Most of these were later undone upstream, so we no longer need to track reverts.
- Removal of monotonically_increasing_id, toDegree, toRadians, approxCountDistinct, unionAll
- Removal of unboundedPreceding, unboundedFollowing, currentRow
- Removal of deprecated setters from tree regression and classification models
- Removal of KMeans computeCost
- Change to UserDefinedFunction type
- Scala UDF null checking
- Bring back Scala 2.11 behaviour of primitive types null behaviour
- Old OneHotEncoder (accepting a break)
- StringIndexer multi column support (accepting a break here from 3.1)
- No document frequency in IDFModel (not a break)