Adds instructions for running the Multi-language Java quickstart from…

… released Beam (#23721) * Adds instructions for running the Multi-language Java quickstart from released Beam * Fix dependencies * Addressing reviewer comments
apache · Oct 21, 2022 · 69fe1cc · 69fe1cc
1 parent 2e49c7e
commit 69fe1cc
Show file tree

Hide file tree

Showing 4 changed files with 71 additions and 15 deletions.
diff --git a/...ltilanguage/PythonDataframeWordCount.java → ...ltilanguage/PythonDataframeWordCount.java b/...ltilanguage/PythonDataframeWordCount.java → ...ltilanguage/PythonDataframeWordCount.java
diff --git a/examples/multi-language/README.md b/examples/multi-language/README.md
@@ -166,3 +166,9 @@ of the digit. The second item is the predicted label of the digit.
 ```
 gsutil cat gs://$GCP_BUCKET/multi-language-beam/output*
 ```
+
+### Python Dataframe Wordcount
+
+This example is covered in the [Java multi-language pipelines quickstart](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/).
+The pipeline source code is available at
+[PythonDataframeWordCount.java](https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/multilanguage/PythonDataframeWordCount.java).
diff --git a/examples/multi-language/build.gradle b/examples/multi-language/build.gradle
@@ -40,7 +40,6 @@ dependencies {
     runtimeOnly project(path: ":runners:portability:java")
     implementation library.java.vendored_guava_26_0_jre
     implementation project(":sdks:java:expansion-service")
-    implementation project(":sdks:java:extensions:python")
     permitUnusedDeclared project(":sdks:java:expansion-service") // BEAM-11761
 }
 

diff --git a/website/www/site/content/en/documentation/sdks/java-multi-language-pipelines.md b/website/www/site/content/en/documentation/sdks/java-multi-language-pipelines.md
@@ -138,26 +138,27 @@ default Beam SDK, you might need to run your own expansion service. In such
 cases, [start the expansion service](#advanced-start-an-expansion-service)
 before running your pipeline.
 
-Here we've provided commands for running the example pipeline using
-Gradle on a [Beam HEAD Git clone](https://github.com/apache/beam).
-If you need a more stable environment, please
-[setup a Java project](/get-started/quickstart-java/) that uses the latest
-released Beam version and include the necessary dependencies.
+### Run with Dataflow runner at HEAD (Beam 2.41.0 and later)
 
-### Run with Dataflow runner
+> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
+> Beam 2.42.0 requires manually starting up an expansion service (see
+> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
+> and using the additional pipeline option `--expansionService=localhost:<PORT>`
+> when executing the pipeline.
 
 The following script runs the example multi-language pipeline on Dataflow, using
 example text from a Cloud Storage bucket. You’ll need to adapt the script to
 your environment.
 
 ```
+export GCP_PROJECT=<project>
 export OUTPUT_BUCKET=<bucket>
 export GCP_REGION=<region>
 export TEMP_LOCATION=gs://$OUTPUT_BUCKET/tmp
-export PYTHON_VERSION=<version>
 
 ./gradlew :examples:multi-language:pythonDataframeWordCount --args=" \
 --runner=DataflowRunner \
+--project=$GCP_PROJECT \
 --output=gs://${OUTPUT_BUCKET}/count \
 --region=${GCP_REGION}"
 ```
@@ -192,10 +193,15 @@ python -m apache_beam.runners.portability.local_job_service_main -p $JOB_SERVER_
 
 5. Run the pipeline.
 
+> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
+> Beam 2.42.0 requires manually starting up an expansion service (see
+> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
+> and using the additional pipeline option `--expansionService=localhost:<PORT>`
+> when executing the pipeline.
+
 ```
 export JOB_SERVER_PORT=<port>  # Same port as before
 export OUTPUT_FILE=<local relative path>
-export PYTHON_VERSION=<version>
 
 ./gradlew :examples:multi-language:pythonDataframeWordCount --args=" \
 --runner=PortableRunner \
@@ -226,19 +232,64 @@ For example, to start the standard expansion service for a Python transform,
 [ExpansionServiceServicer](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service.py),
 follow these steps:
 
-1. Activate a Python virtual environment and install Apache Beam, as described
-   in the [Python quick start](/get-started/quickstart-py/).
-2. In the **beam/sdks/python** directory of the Beam source code, run the
-   following command:
+1. Activate a new virtual environment following
+[these instructions](https://beam.apache.org/get-started/quickstart-py/#create-and-activate-a-virtual-environment).
+
+2. Install Apache Beam with `gcp` and `dataframe` packages.
+
+```
+pip install apache-beam[gcp,dataframe]
+```
+
+4. Run the following command
 
    ```
-   python apache_beam/runners/portability/expansion_service_main.py -p 18089 --fully_qualified_name_glob "*"
+   python -m apache_beam.runners.portability.expansion_service_main -p <PORT> --fully_qualified_name_glob "*"
    ```
 
 The command runs
 [expansion_service_main.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_main.py), which starts the standard expansion service. When you use
 Gradle to run your Java pipeline, you can specify the expansion service with the
-`expansionService` option. For example: `--expansionService=localhost:18089`.
+`expansionService` option. For example: `--expansionService=localhost:<PORT>`.
+
+### Run with Dataflow runner using a Beam release (Beam 2.43.0 and later)
+
+> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
+> Beam 2.42.0 requires manually starting up an expansion service (see
+> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
+> and using the additional pipeline option `--expansionService=localhost:<PORT>`
+> when executing the pipeline.
+
+* Check out the Beam examples Maven archetype for the relevant Beam version.
+
+```
+export BEAM_VERSION=<Beam version>
+
+mvn archetype:generate \
+    -DarchetypeGroupId=org.apache.beam \
+    -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
+    -DarchetypeVersion=$BEAM_VERSION \
+    -DgroupId=org.example \
+    -DartifactId=multi-language-beam \
+    -Dversion="0.1" \
+    -Dpackage=org.apache.beam.examples \
+    -DinteractiveMode=false
+```
+
+* Run the pipeline.
+
+```
+export GCP_PROJECT=<GCP project>
+export GCP_BUCKET=<GCP bucket>
+export GCP_REGION=<GCP region>
+
+mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.multilanguage.PythonDataframeWordCount \
+    -Dexec.args="--runner=DataflowRunner --project=$GCP_PROJECT \
+                 --region=us-central1 \
+                 --gcpTempLocation=gs://$GCP_BUCKET/multi-language-beam/tmp \
+                 --output=gs://$GCP_BUCKET/multi-language-beam/output" \
+    -Pdataflow-runner
+```
 
 ## Next steps