Skip to content

Conversation

@LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Sep 5, 2022

What changes were proposed in this pull request?

This PR suggests using Java 11/17 as runtime of Spark in the index.md.

Why are the changes needed?

The JIT optimizer in Java 11/17 is better than Java 8, running Spark using Java 11/17 will bring some performance benefits.

Spark uses Whole Stage Code Gen to improve performance, but the generated code may not be friendly to the JIT optimizer. For example, the case mentioned in SPARK-40303 by @wangyum :

  runBenchmark("Benchmark count distinct") {
    withTempPath { dir =>
      val N = 2000000
      val columns = Range(0, 100).map(i => s"id % $i AS id$i")

      spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir.getCanonicalPath)

      Seq(1, 2, 5, 10, 15, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100).foreach { cnt =>
        val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => s"count(distinct $c)")

        val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 1)
        benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
          withSQLConf(
            "spark.sql.codegen.wholeStage" -> "true",
            "spark.sql.codegen.factoryMode" -> "FALLBACK") {
            spark.read.parquet(dir.getCanonicalPath).selectExpr(selectExps: _*)
              .write.format("noop").mode("Overwrite").save()
          }
        }

        benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
          withSQLConf(
            "spark.sql.codegen.wholeStage" -> "false",
            "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
            spark.read.parquet(dir.getCanonicalPath).selectExpr(selectExps: _*)
              .write.format("noop").mode("Overwrite").save()
          }
        }
        benchmark.run()
      }
    }
  }

When use Java 8 to run the above case in local[2] mode, there will be obvious negative effects when cnt is 35, 40, 50, 60 or 70 with codegen enabled. (the performance after cnt > 70 is meet expectations due to fallback with InternalCompilerException: Code grows beyond 64 KB):

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
35 count distinct with codegen                   418321         418321           0          0.0      209160.3       1.0X
35 count distinct without codegen                 69975          69975           0          0.0       34987.7       6.0X

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
40 count distinct with codegen                   626627         626627           0          0.0      313313.7       1.0X
40 count distinct without codegen                 90564          90564           0          0.0       45281.9       6.9X

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
50 count distinct with codegen                   906749         906749           0          0.0      453374.7       1.0X
50 count distinct without codegen                140278         140278           0          0.0       70138.8       6.5X

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
60 count distinct with codegen                   995616         995616           0          0.0      497808.2       1.0X
60 count distinct without codegen                215088         215088           0          0.0      107544.2       4.6X

OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
70 count distinct with codegen                  1060273        1060273           0          0.0      530136.4       1.0X
70 count distinct without codegen                290576         290576           0          0.0      145287.9       3.6X

But run the above cases using Java 11 or Java 17, there will be no obvious negative effects:

Java 11

OpenJDK 64-Bit Server VM 11.0.16+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
35 count distinct with codegen                    54733          54733           0          0.0       27366.6       1.0X
35 count distinct without codegen                 97613          97613           0          0.0       48806.6       0.6X

OpenJDK 64-Bit Server VM 11.0.16+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
40 count distinct with codegen                    70454          70454           0          0.0       35226.8       1.0X
40 count distinct without codegen                127975         127975           0          0.0       63987.3       0.6X


OpenJDK 64-Bit Server VM 11.0.16+8-LTS on Linux 5.15.0-1017-azure
ntel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
50 count distinct with codegen                   104559         104559           0          0.0       52279.7       1.0X
50 count distinct without codegen                182331         182331           0          0.0       91165.5       0.6X

OpenJDK 64-Bit Server VM 11.0.16+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
60 count distinct with codegen                   146774         146774           0          0.0       73386.8       1.0X
60 count distinct without codegen                257184         257184           0          0.0      128592.1       0.6X

OpenJDK 64-Bit Server VM 11.0.16+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
70 count distinct with codegen                   190883         190883           0          0.0       95441.4       1.0X
70 count distinct without codegen                346295         346295           0          0.0      173147.4       0.6X

Java 17

OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
35 count distinct with codegen                    45439          45439           0          0.0       22719.5       1.0X
35 count distinct without codegen                 77241          77241           0          0.0       38620.5       0.6X

OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
40 count distinct with codegen                    58347          58347           0          0.0       29173.3       1.0X
40 count distinct without codegen                 98928          98928           0          0.0       49464.1       0.6X


OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
50 count distinct with codegen                    85589          85589           0          0.0       42794.7       1.0X
50 count distinct without codegen                151642         151642           0          0.0       75820.9       0.6X

OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
60 count distinct with codegen                   118545         118545           0          0.0       59272.5       1.0X
60 count distinct without codegen                234024         234024           0          0.0      117012.2       0.5X

OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Linux 5.15.0-1017-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Benchmark count distinct:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
70 count distinct with codegen                   157151         157151           0          0.0       78575.7       1.0X
70 count distinct without codegen                319582         319582           0          0.0      159791.2       0.5X

After turning on the -XX:+PrintCompilation option, we can found the logs related to the operatorName_doConsume method compilation failure of the C2 compiler:

COMPILE SKIPPED: unsupported incoming calling sequence

or 

COMPILE SKIPPED: unsupported calling sequence

When using Java 8, they are identified as not retryable, this indicates that the compiler deemed this method should not be attempted to compile again on any tier of compilation, and because this is an OSR compilation (i.e. loop compilation), this will mark the method as "never try to perform OSR compilation again on all tiers". But when using Java 11/17 they are identified as retry at different tier, it'll try tier 1, this also makes the performance of the above cases seem acceptable.

So this pr started to suggest using Java 11/17 as runtime environment of Spark in the document.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Just change the document.

@github-actions github-actions bot added the DOCS label Sep 5, 2022
@LuciferYang LuciferYang changed the title [SPARK-40331][DOCS] Recommend use Java 11+ as the runtime of Spark 3.4.0 [SPARK-40331][DOCS] Recommend use Java 11+ as the runtime environment of Spark Sep 5, 2022
@LuciferYang
Copy link
Contributor Author

In addition, using Java 11(with numa bind of cpu and memory ) on the arm architecture will also have better performance than Java 8(with numa bind of cpu and memory ), with a general performance improvement of about 5% ~ 10%

@LuciferYang LuciferYang changed the title [SPARK-40331][DOCS] Recommend use Java 11+ as the runtime environment of Spark [SPARK-40331][DOCS] Recommend use Java 11/17 as the runtime environment of Spark Sep 5, 2022
@wangyum
Copy link
Member

wangyum commented Sep 5, 2022

This is a real use case with 50 count distinct:
Java 8 with whole stage codegen:
image

Java 8 with whole stage codegen and disable TieredCompilation(-XX:-TieredCompilation):
image

Java 8 with whole stage codegen and validParamLength = 48(set spark.sql.CodeGenerator.validParamLength=48):
image

Java 11:
image

Java 17:
image

@wangyum
Copy link
Member

wangyum commented Sep 5, 2022

@HyukjinKwon
Copy link
Member

cc @rednaxelafx FYI

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it should run on any platform that runs a supported version of Java. This should include JVMs on x86_64 and ARM64. It's easy to run locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation.

Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+.
Java 11/17 is the recommended version to run Spark on.
Copy link
Member

@HyukjinKwon HyukjinKwon Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should explicitly mention this though. Some benchmark results weren't good actually IIRC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that the performance you mentioned looks limited to whole stage codegen and JIT complier only. I remember we saw some slower performance (e.g., at https://github.com/apache/spark/tree/master/core/benchmarks)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For compatibility, is the current code more inclined to the best practice of Java 8? Just as the current use of Scala 2.13 has not reached the best practice. I think it should be improved through some refactoring work, . @rednaxelafx , WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we should move forward. But we should deprecate Java 8 first which I think would mean that encourage users to use JDK 11 and 17.

What I am unsure is that whether it's better to recommend something (JDK 11/17) slower when the old stuff (JDK 8) is not even deprecated. To end users, JDK 8 is still a faster option in general if I am not mistaken.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @HyukjinKwon 's comment. Yes, it does. Our benchmark shows JDK 8 is faster in general.

Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.7+ and R 3.5+.
Java 11/17 is the recommended version to run Spark on.
Python 3.7 support is deprecated as of Spark 3.4.0.
Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, why do we remove this deprecation BTW?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok to keep this line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should not remove this, @LuciferYang . Please revert this from this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@zhengruifeng
Copy link
Contributor

also cc @Yikun

@Yikun
Copy link
Member

Yikun commented Sep 6, 2022

Some info FYI:

  1. OpenJDK 8 has more longer time before EOL (2026) than OpenJDK 11 (2024), it shows the upstream community attitude of Java 8 in some level.
  2. For some initial aarch64 port: https://openjdk.org/jeps/237 were introduced in Java 9, and Java 8 backport this.
  3. Java 11 have more performance enhancement in aarch64, For some aarch64 improvement like: https://openjdk.org/jeps/315 only be supported in java 11 but java 8 not.
  4. IIRC, I had some discussion the JDK experts on our internal team, OpenJDK 11 contains many experimental feature, and these features are stable on OpenJDK 17, Java 11 might be a transition version.
  5. MacOS java support introduced in Java17. https://openjdk.org/jeps/391

Above all, personally think, unless we see that Java 11 has advantages in most scenarios in Apache Spark, otherwise users should choose the appropriate JDK among 8, 11, 17 according to their own situation (at current time).

@LuciferYang
Copy link
Contributor Author

In addition, using Java 11(with numa bind of cpu and memory ) on the arm architecture will also have better performance than Java 8(with numa bind of cpu and memory ), with a general performance improvement of about 5% ~ 10%

To clarify, this is NUMA bound through the capabilities of the container, not the capabilities of Java 11

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apache Spark community still supports Hadoop 2.7.4 distribution officially. What do you think about that, @LuciferYang ?

@LuciferYang
Copy link
Contributor Author

Apache Spark community still supports Hadoop 2.7.4 distribution officially. What do you think about that, @LuciferYang ?

Is this question related to the current PR? Or is it a separate question? Is there any bad case with Hadoop 2.7.3 + Java 11? Maybe it's something I don't known.

@LuciferYang
Copy link
Contributor Author

@dongjoon-hyun 20% of SQL type jobs in our production environment use "Spark 3.1 (or Spark 3.2) + Hadoop 2.7.3 + Java 11" on the Arm architecture, and no fatal problems have been found in the last 8 months, but our production environment is not strongly dependent on Hadoop, so I may miss some fatal cases
.

@dongjoon-hyun
Copy link
Member

Yes it is related because this PR is about a general recommendation for Spark, not a specific distribution.

Is this question related to the current PR?

I want to confirm that we didn't miss any thing there. So, what about Java 17 combination because this PR recommends both 11/17?

@LuciferYang
Copy link
Contributor Author

@dongjoon-hyun Java 17 has not been used in our production environment, but for the scenarios mentioned in the PR description, the performance using 11 and 17 is obviously better than 8. May be @wangyum have some performance data of Java 17 in the production environment that can be shared?

@cloud-fan
Copy link
Contributor

Just for curiosity: are we already using java 11 to run tests in Github Actions?

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Sep 6, 2022

Just for curiosity: are we already using java 11 to run tests in Github Actions?

Currently, Java 11/17 only have daily test. For each pr, only build, no test, am I right? @HyukjinKwon

@srowen
Copy link
Member

srowen commented Sep 6, 2022

If this is at all controversial, let's just not make this change

@HyukjinKwon
Copy link
Member

Currently, Java 11/17 only have daily test. For each pr, only build, no test, am I right?

yup.

@srowen srowen closed this Sep 11, 2022
@wangyum
Copy link
Member

wangyum commented Sep 21, 2022

The performance Issue fixed by JDK-8159720.

@LuciferYang
Copy link
Contributor Author

The performance Issue fixed by JDK-8159720.

Seems to be fixed only in Java 9+?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants