Promote cudf as dist direct dependency, mark aggregator provided #4043

Merged: 11 commits into NVIDIA:branch-21.12 on Nov 12, 2021

Conversation

gerashegalov
Collaborator

@gerashegalov gerashegalov commented Nov 5, 2021

Closes #3935

  • Manually emulate the shade plugin's dependency promotion for the cudf dependency
  • Mark the aggregator dependency as provided
  • Stop overriding the default-jar execution; bind it to the none phase and introduce a dedicated execution that jars the parallel-world directories
  • Add dependency-reduced-pom* files to the clean phase
  • Undo the buggy default-install execution that was not using the dependency-reduced pom
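
The jar execution changes above might look roughly like this in the dist pom (a sketch, not the actual diff; the execution id and the parallel-world directory path are assumptions):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <!-- bind the default jar execution to no phase instead of overriding it -->
    <execution>
      <id>default-jar</id>
      <phase>none</phase>
    </execution>
    <!-- dedicated execution that jars the parallel-world directories -->
    <execution>
      <id>create-parallel-worlds-jar</id>
      <phase>package</phase>
      <goals>
        <goal>jar</goal>
      </goals>
      <configuration>
        <classesDirectory>${project.build.directory}/parallel-world</classesDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```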

Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov gerashegalov added the build Related to CI / CD or cleanly building label Nov 5, 2021
@gerashegalov gerashegalov added this to the Nov 1 - Nov 12 milestone Nov 5, 2021
@gerashegalov gerashegalov self-assigned this Nov 5, 2021
@gerashegalov
Collaborator Author

build

@tgravescs
Collaborator

So is this actually the right thing to do? I thought our old dependency-reduced pom didn't have the Spark dependencies as provided in it, but it would be good to double-check that. So this changes the scope to compile? It would be nice to have more description, because we aren't removing the dependency from the pom, we are changing the dependency scope.

@tgravescs
Collaborator

Actually, I think I was wrong; I must have been looking at the wrong pom, or at the dependency-reduced pom from the aggregator.

https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/21.08.0/rapids-4-spark_2.12-21.08.0.pom

Shows them as provided.

@jlowe
Member

jlowe commented Nov 5, 2021

I don't think this is what we want. If you look at the generated dependency-reduced-pom.xml, the dependencies are still effectively gone. This jar clearly cannot stand on its own and needs other jars at runtime to work, and I think the dependencies in the pom should reflect that.

The dist pom is pulling in the aggregator pom with the understanding that we'll get all of the aggregator dependencies, transitively. That's exactly what we want during the compile/build phase, but when it comes to advertising the dependencies in the deployed artifact, we need to remove the aggregator jar from the dependency tree underneath us, but also move all of the aggregator jar dependencies up to become dist jar dependencies. The dist jar contains the aggregator jar contents but not its transitive dependencies, so the dist jar dependencies should look similar to the aggregator jar dependencies.

In the end I would expect the dependency reduced pom to list the Apache Spark jars along with the cudf jar as provided dependencies.
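
A hypothetical fragment of such a dependency-reduced-pom.xml (artifact ids shown for illustration; the versions are placeholders):

```xml
<!-- sketch of the expected dependency list, not the actual generated pom -->
<dependencies>
  <dependency>
    <groupId>ai.rapids</groupId>
    <artifactId>cudf</artifactId>
    <version>${cudf.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <!-- ...remaining Spark jars, also provided... -->
</dependencies>
```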

@gerashegalov
Collaborator Author

gerashegalov commented Nov 5, 2021

In the end I would expect the dependency reduced pom to list the Apache Spark jars along with the cudf jar as provided dependencies.

@jlowe please update #3935 to reflect that you mean cudf and the aggregator's other immediate child dependencies need to be promoted to the dist jar as provided. The issue currently talks only about spark-hive and spark-sql.

Looks like we want to re-enable dependency-reduced-pom generation in the aggregator module, just pick any aggregator classifier (say the first one), and transplant its dependencies as provided.

@jlowe
Member

jlowe commented Nov 5, 2021

Look like we want to just pick any aggregator (say the first one) and transplant its dependencies

Argh, I don't think we can just grab the aggregator dependencies, since some of those dependencies have been pulled into the dist jar as part of the custom shading (sql-plugin, shuffle-plugin, etc.). We'd have to filter out the aggregator dependencies we know have been pulled into the dist jar.

As for listing whatever Spark jars are specified by the first aggregator, is that what we want? It would be nice to show the matrix of support -- we could put the <profile> options in there showing how the dependencies switch by profile, but not sure that really helps for dependency analysis or resolution. Listing the oldest Spark version for the jars may be the best option if we list the Spark dependency at all. In that case it will get "interesting" if future Spark versions change their jar composition such that we need more jars or differently named jars. Anyone else have ideas on how to best present this?

@gerashegalov
Collaborator Author

gerashegalov commented Nov 5, 2021

I also thought of using profiles; propagating the release3XY profiles with release301 as the default looks reasonable to me.
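
A hedged sketch of what such profiles might look like in the published pom (profile ids follow the release3XY naming mentioned above; the property name and version values are illustrative):

```xml
<profiles>
  <!-- default Spark version advertised by the pom -->
  <profile>
    <id>release301</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <spark.version>3.0.1</spark.version>
    </properties>
  </profile>
  <!-- additional profiles switch the advertised Spark dependency version -->
  <profile>
    <id>release312</id>
    <properties>
      <spark.version>3.1.2</spark.version>
    </properties>
  </profile>
</profiles>
```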

@gerashegalov gerashegalov marked this pull request as draft November 5, 2021 23:05
@gerashegalov gerashegalov changed the title Remove only aggregator dependencies from the dist pom [WIP] Remove only aggregator dependencies from the dist pom Nov 5, 2021
@gerashegalov gerashegalov changed the title [WIP] Remove only aggregator dependencies from the dist pom [WIP] replace the aggregator dependency in reduced dist pom with aggregator's dependency list from the aggregator reduced pom Nov 5, 2021
@gerashegalov gerashegalov changed the title [WIP] replace the aggregator dependency in reduced dist pom with aggregator's dependency list from the aggregator reduced pom Remove aggregator from dist, promote cudf and spark as direct dependencies Nov 10, 2021
@gerashegalov gerashegalov marked this pull request as ready for review November 10, 2021 00:51
@gerashegalov
Collaborator Author

gerashegalov commented Nov 10, 2021

I investigated the promoteTransitiveDependencies option of the shade plugin.

The only way I was able to make it work was by switching cudf (note it is not advertised in our 21.08- poms at all) and Spark to compile-scope dependencies so they don't get filtered out by the shade plugin. Then, to make sure those artifacts don't get included in the uber jar, we can add ai.rapids:* and org.apache.spark:* as excludes in the shade plugin config. In that case dependency-reduced-pom.xml has them in the dependency list, but with the compile dependency scope. In summary, these are too many changes for a result we can easily hardcode for the moment.
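
The approach just described (compile-scope cudf and Spark plus shade excludes) could be sketched as follows. promoteTransitiveDependencies and artifactSet excludes are real maven-shade-plugin options, but the surrounding layout is illustrative, not the actual dist pom:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <!-- list the aggregator's transitive deps directly in the reduced pom -->
    <promoteTransitiveDependencies>true</promoteTransitiveDependencies>
    <artifactSet>
      <excludes>
        <!-- keep the compile-scope cudf and Spark jars out of the uber jar -->
        <exclude>ai.rapids:*</exclude>
        <exclude>org.apache.spark:*</exclude>
      </excludes>
    </artifactSet>
  </configuration>
</plugin>
```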

As for advertising the right version of the Spark dependencies, I think we can skip adding profiles to this pom. We publish the parent pom, so -Dbuildver should work. I confirmed this by creating an empty Maven project:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>hello-world</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>Hello World</name>
  <url>http://nvidia.com</url>
  <dependencies>
    <dependency>
      <groupId>com.nvidia</groupId>
      <artifactId>rapids-4-spark_2.12</artifactId>
      <version>21.12.0-SNAPSHOT</version>
    </dependency>
  </dependencies>
</project>

But if we want to see cudf and Spark as part of the dependency tree, they have to be compile-scope. Transitive provided dependencies are dropped, and thus will not appear in the tree rooted at a project depending on rapids-4-spark.
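
With the empty project above, the advertised dependency set for a given Spark build could be inspected like this (command sketch; 301 is assumed to be one of the buildver values the plugin's build accepts):

```shell
# resolve the sample project's dependency tree for the Spark 3.0.1 build
mvn dependency:tree -Dbuildver=301
```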

Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov
Collaborator Author

build

jlowe previously approved these changes Nov 10, 2021
Member

@jlowe jlowe left a comment

Looks OK to me, but it would be good to check with @NvTimLiu and @pxLi that the CI scripts will be OK with the plugin jar attempting to pull cudf and Spark dependencies when it did not declare them before.

@gerashegalov
Collaborator Author

gerashegalov commented Nov 10, 2021

I reproduced the CI failure locally:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-install-plugin:3.0.0-M1:install-file (install-with-dependency-reduced-pom) on project rapids-4-spark_2.12: Execution install-with-dependency-reduced-pom of goal org.apache.maven.plugins:maven-install-plugin:3.0.0-M1:install-file failed: Cannot add two different pieces of metadata for: project com.nvidia:rapids-4-spark-udf_2.12 -> [Help 1]

working on the fix.

I debugged the install plugin, and it turns out that it's not safe to use the install-file goal as part of a multi-module project: it was getting the wrong project from the Maven session. Thus, using the install goal with a static pom file is our only workable option for the moment. I am changing the aggregator scope in dist to provided so it gets chopped away by the Maven dependency scope matrix for transitive dependencies: https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#dependency-scope

Another benefit of the static file is that it will automatically be consistent with the pom embedded in META-INF/maven.

Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov
Collaborator Author

build

Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov
Collaborator Author

build

Collaborator

@pxLi pxLi left a comment

LGTM.

We will keep an eye on this to see if any CI setup needs to be updated.

@gerashegalov gerashegalov changed the title Remove aggregator from dist, promote cudf and spark as direct dependencies Promote cudf as dist direct dependency, mark aggregator provided Nov 12, 2021
@gerashegalov gerashegalov merged commit 65c1389 into NVIDIA:branch-21.12 Nov 12, 2021
@gerashegalov gerashegalov deleted the gerashegalov/issue3935 branch November 12, 2021 06:15
Successfully merging this pull request may close these issues.

Revisit dependencies of distribution jar