Promote cudf as dist direct dependency, mark aggregator provided (#4043)
Closes #3935 

- Manually emulate the shade plugin's promotion of the provided cudf dependency to a direct `compile` dependency of the dist artifact (see the verification sketch below)
- Mark the aggregator dependency as `provided`
- Stop overriding the `default-jar` execution: bind it to the `none` phase and introduce a dedicated execution that jars the parallel-world directory
- Add `dependency-reduced-pom*.xml` cleanup to the `clean` phase
- Undo the buggy `default-install` execution override that was not using the dependency-reduced POM
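
A quick way to sanity-check the promotion from the reactor (a sketch; assumes a prior build so the aggregator is resolvable, and uses the stock maven-dependency-plugin):

```sh
# Sketch: cudf should now appear at compile scope in the dist module's
# resolved dependency tree, while the provided aggregator is dropped for
# downstream consumers.
mvn -pl dist dependency:tree -Dincludes=ai.rapids:cudf
```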

Signed-off-by: Gera Shegalov <gera@apache.org>
gerashegalov authored Nov 12, 2021
1 parent 7d3629f commit 65c1389
Showing 4 changed files with 70 additions and 26 deletions.
31 changes: 23 additions & 8 deletions README.md
@@ -1,7 +1,7 @@
# RAPIDS Accelerator For Apache Spark
NOTE: For the latest stable [README.md](https://github.com/nvidia/spark-rapids/blob/main/README.md) ensure you are on the main branch. The RAPIDS Accelerator for Apache Spark provides a set of plugins for Apache Spark that leverage GPUs to accelerate processing via the RAPIDS libraries and UCX. Documentation on the current release can be found [here](https://nvidia.github.io/spark-rapids/).

The RAPIDS Accelerator for Apache Spark provides a set of plugins for
[Apache Spark](https://spark.apache.org) that leverage GPUs to accelerate processing
via the [RAPIDS](https://rapids.ai) libraries and [UCX](https://www.openucx.org/).

@@ -19,7 +19,7 @@ To get started tuning your job and get the most performance out of it please sta

## Configuration

The plugin has a set of Spark configs that control its behavior and are documented
[here](docs/configs.md).

## Issues
@@ -30,13 +30,13 @@ may file one [here](https://github.com/NVIDIA/spark-rapids/issues/new/choose).
## Download

The jar files for the most recent release can be retrieved from the [download](docs/download.md)
page.

## Building From Source

See the [build instructions in the contributing guide](CONTRIBUTING.md#building-from-source).

## Testing

Tests are described [here](tests/README.md).

@@ -45,7 +45,7 @@ The RAPIDS Accelerator For Apache Spark does provide some APIs for doing zero co
transfer into other GPU enabled applications. It is described
[here](docs/ml-integration.md).

Currently, we are working with XGBoost to try to provide this integration out of the box.

You may need to disable RMM caching when exporting data to an ML library as that library
will likely want to use all of the GPU's memory and if it is not aware of RMM it will not have
@@ -60,6 +60,21 @@ The profiling tool generates information which can be used for debugging and pro
Information such as Spark version, executor information, properties and so on. This runs on either CPU or
GPU generated event logs.

Please refer to [spark qualification tool documentation](docs/spark-qualification-tool.md)
and [spark profiling tool documentation](docs/spark-profiling-tool.md)
for more details on how to use the tools.
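
For a rough idea of how the qualification tool is typically launched (a sketch; the jar name, version, and event-log path are placeholders, and the linked docs above are authoritative for the exact invocation):

```sh
# Sketch: run the qualification tool over a directory of Spark event logs.
# The tools jar and event-log path below are placeholders.
java -cp rapids-4-spark-tools_2.12-21.12.0.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  /path/to/eventlog-directory
```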

## Dependency for External Projects

If you need to develop some functionality on top of the RAPIDS Accelerator For Apache Spark (we currently
limit support to GPU-accelerated UDFs), we recommend declaring our distribution artifact
as a `provided` dependency.

```xml
<dependency>
    <groupId>com.nvidia</groupId>
    <artifactId>rapids-4-spark_2.12</artifactId>
    <version>21.12.0-SNAPSHOT</version>
    <scope>provided</scope>
</dependency>
```
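
Note that `provided` only affects build-time resolution; at run time the plugin jar must still be on the Spark classpath, for example (a sketch; the jar path and application jar are placeholders):

```sh
# Sketch: supply the dist jar when submitting, since the provided scope keeps
# it out of the application's packaged dependencies.
spark-submit \
  --jars /path/to/rapids-4-spark_2.12-21.12.0-SNAPSHOT.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  your-application.jar
```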
25 changes: 25 additions & 0 deletions aggregator/pom.xml
@@ -210,6 +210,31 @@
        <groupId>org.apache.rat</groupId>
        <artifactId>apache-rat-plugin</artifactId>
      </plugin>
      <plugin>
        <!-- keep for the case dependency-reduced pom is enabled -->
        <artifactId>maven-clean-plugin</artifactId>
        <version>3.1.0</version>
        <executions>
          <execution>
            <id>clean-reduced-dependency-poms</id>
            <phase>clean</phase>
            <goals>
              <goal>clean</goal>
            </goals>
            <configuration>
              <skip>${skipDrpClean}</skip>
              <filesets>
                <fileset>
                  <directory>${project.basedir}</directory>
                  <includes>
                    <include>dependency-reduced-pom*.xml</include>
                  </includes>
                </fileset>
              </filesets>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
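
The `${skipDrpClean}` property gates this execution; assuming it is unset or false by default, the stale reduced POMs are removed on every `clean`, and the deletion can presumably be skipped from the command line (a sketch):

```sh
# Sketch: skip deleting dependency-reduced-pom*.xml during clean by setting
# the skipDrpClean property (assumes the property is consumed only by the
# execution above).
mvn -pl aggregator clean -DskipDrpClean=true
```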

36 changes: 22 additions & 14 deletions dist/pom.xml
@@ -34,8 +34,23 @@
      <artifactId>rapids-4-spark-aggregator_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <classifier>${spark.version.classifier}</classifier>
      <!--
        provided such that the 3rd party project depending on this will drop it
        https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope
      -->
      <scope>provided</scope>
    </dependency>

    <!--
      manually promoting provided cudf as a direct dependency
    -->
    <dependency>
      <groupId>ai.rapids</groupId>
      <artifactId>cudf</artifactId>
      <version>${cudf.version}</version>
      <classifier>${cuda.version}</classifier>
      <scope>compile</scope>
    </dependency>
  </dependencies>

<properties>
@@ -223,7 +238,14 @@
      <executions>
        <execution>
          <id>default-jar</id>
          <phase>none</phase>
        </execution>
        <execution>
          <id>create-parallel-worlds-jar</id>
          <phase>package</phase>
          <goals>
            <goal>jar</goal>
          </goals>
          <configuration>
            <classesDirectory>${project.build.directory}/parallel-world</classesDirectory>
          </configuration>
@@ -336,20 +358,6 @@
          </excludes>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-install-plugin</artifactId>
        <version>3.0.0-M1</version>
        <executions>
          <execution>
            <id>default-install</id>
            <phase>install</phase>
            <configuration>
              <pomFile>${project.build.directory}/dependency-reduced-pom.xml</pomFile>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
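
Binding the inherited `default-jar` execution to the `none` phase is the standard Maven idiom for disabling it; the dedicated `create-parallel-worlds-jar` execution then packages `target/parallel-world` instead of the default classes directory. One way to eyeball the result (a sketch; the jar name assumes the usual `artifactId-version` naming):

```sh
# Sketch: build the dist module with its reactor dependencies, then list the
# first entries of the jar, which should come from the parallel-world tree.
mvn -pl dist -am package
jar tf dist/target/rapids-4-spark_2.12-21.12.0-SNAPSHOT.jar | head
```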
4 changes: 0 additions & 4 deletions dist/scripts/binary-dedupe.sh
@@ -220,9 +220,5 @@ time (
echo "$((++STEP))/ deleting all class files listed in $DELETE_DUPLICATES_TXT"
time (< "$DELETE_DUPLICATES_TXT" sort -u | xargs rm) 2>&1

echo "Generating dependency-reduced-pom.xml"
# which just deletes the dependencies list altogether
sed -e '/<dependencies>/,/<\/dependencies>/d' ../pom.xml > dependency-reduced-pom.xml

end_time=$(date +%s)
echo "binary-dedupe completed in $((end_time - start_time)) seconds"

