Skip to content

Commit

Permalink
Adds instructions for running the Multi-language Java quickstart from…
Browse files Browse the repository at this point in the history
… released Beam (#23721)

* Adds instructions for running the Multi-language Java quickstart from released Beam

* Fix dependencies

* Addressing reviewer comments
  • Loading branch information
chamikaramj authored Oct 21, 2022
1 parent 2e49c7e commit 69fe1cc
Show file tree
Hide file tree
Showing 4 changed files with 71 additions and 15 deletions.
6 changes: 6 additions & 0 deletions examples/multi-language/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -166,3 +166,9 @@ of the digit. The second item is the predicted label of the digit.
```
gsutil cat gs://$GCP_BUCKET/multi-language-beam/output*
```

### Python Dataframe Wordcount

This example is covered in the [Java multi-language pipelines quickstart](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/).
The pipeline source code is available at
[PythonDataframeWordCount.java](https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/multilanguage/PythonDataframeWordCount.java).
1 change: 0 additions & 1 deletion examples/multi-language/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,6 @@ dependencies {
runtimeOnly project(path: ":runners:portability:java")
implementation library.java.vendored_guava_26_0_jre
implementation project(":sdks:java:expansion-service")
implementation project(":sdks:java:extensions:python")
permitUnusedDeclared project(":sdks:java:expansion-service") // BEAM-11761
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -138,26 +138,27 @@ default Beam SDK, you might need to run your own expansion service. In such
cases, [start the expansion service](#advanced-start-an-expansion-service)
before running your pipeline.

Here we've provided commands for running the example pipeline using
Gradle on a [Beam HEAD Git clone](https://github.com/apache/beam).
If you need a more stable environment, please
[setup a Java project](/get-started/quickstart-java/) that uses the latest
released Beam version and include the necessary dependencies.
### Run with Dataflow runner at HEAD (Beam 2.41.0 and later)

### Run with Dataflow runner
> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
> Beam 2.42.0 requires manually starting up an expansion service (see
> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
> and using the additional pipeline option `--expansionService=localhost:<PORT>`
> when executing the pipeline.
The following script runs the example multi-language pipeline on Dataflow, using
example text from a Cloud Storage bucket. You’ll need to adapt the script to
your environment.

```
export GCP_PROJECT=<project>
export OUTPUT_BUCKET=<bucket>
export GCP_REGION=<region>
export TEMP_LOCATION=gs://$OUTPUT_BUCKET/tmp
export PYTHON_VERSION=<version>
./gradlew :examples:multi-language:pythonDataframeWordCount --args=" \
--runner=DataflowRunner \
--project=$GCP_PROJECT \
--output=gs://${OUTPUT_BUCKET}/count \
--region=${GCP_REGION}"
```
Expand Down Expand Up @@ -192,10 +193,15 @@ python -m apache_beam.runners.portability.local_job_service_main -p $JOB_SERVER_

5. Run the pipeline.

> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
> Beam 2.42.0 requires manually starting up an expansion service (see
> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
> and using the additional pipeline option `--expansionService=localhost:<PORT>`
> when executing the pipeline.
```
export JOB_SERVER_PORT=<port> # Same port as before
export OUTPUT_FILE=<local relative path>
export PYTHON_VERSION=<version>
./gradlew :examples:multi-language:pythonDataframeWordCount --args=" \
--runner=PortableRunner \
Expand Down Expand Up @@ -226,19 +232,64 @@ For example, to start the standard expansion service for a Python transform,
[ExpansionServiceServicer](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service.py),
follow these steps:

1. Activate a Python virtual environment and install Apache Beam, as described
in the [Python quick start](/get-started/quickstart-py/).
2. In the **beam/sdks/python** directory of the Beam source code, run the
following command:
1. Activate a new virtual environment following
[these instructions](https://beam.apache.org/get-started/quickstart-py/#create-and-activate-a-virtual-environment).

2. Install Apache Beam with `gcp` and `dataframe` packages.

```
pip install apache-beam[gcp,dataframe]
```

4. Run the following command

```
python apache_beam/runners/portability/expansion_service_main.py -p 18089 --fully_qualified_name_glob "*"
python -m apache_beam.runners.portability.expansion_service_main -p <PORT> --fully_qualified_name_glob "*"
```

The command runs
[expansion_service_main.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_main.py), which starts the standard expansion service. When you use
Gradle to run your Java pipeline, you can specify the expansion service with the
`expansionService` option. For example: `--expansionService=localhost:18089`.
`expansionService` option. For example: `--expansionService=localhost:<PORT>`.

### Run with Dataflow runner using a Beam release (Beam 2.43.0 and later)

> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
> Beam 2.42.0 requires manually starting up an expansion service (see
> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
> and using the additional pipeline option `--expansionService=localhost:<PORT>`
> when executing the pipeline.
* Check out the Beam examples Maven archetype for the relevant Beam version.

```
export BEAM_VERSION=<Beam version>
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=$BEAM_VERSION \
-DgroupId=org.example \
-DartifactId=multi-language-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false
```

* Run the pipeline.

```
export GCP_PROJECT=<GCP project>
export GCP_BUCKET=<GCP bucket>
export GCP_REGION=<GCP region>
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.multilanguage.PythonDataframeWordCount \
-Dexec.args="--runner=DataflowRunner --project=$GCP_PROJECT \
--region=us-central1 \
--gcpTempLocation=gs://$GCP_BUCKET/multi-language-beam/tmp \
--output=gs://$GCP_BUCKET/multi-language-beam/output" \
-Pdataflow-runner
```

## Next steps

Expand Down

0 comments on commit 69fe1cc

Please sign in to comment.