feat(pubsublite): Spark SQL Streaming to Pub/Sub Lite using Pub/Sub Lite Spark Connector #6682
Merged
Changes from all commits (30 commits):
77e262f add initial files (anguillanneuf)
c789aaa add key, event time, data but not attributes (anguillanneuf)
10df79a pytest passes (anguillanneuf)
6668dd7 nox tests pass and add readme (anguillanneuf)
7c1a63d add license header (anguillanneuf)
61564a0 Merge branch 'master' into pubsublite (anguillanneuf)
a7750b2 update region tag (anguillanneuf)
da4e9ee Merge branch 'master' into pubsublite (anguillanneuf)
857d916 Merge branch 'master' into pubsublite (anguillanneuf)
fce1d3c Update pubsublite/spark-connector/README.md (anguillanneuf)
b620106 address reviewer comments (anguillanneuf)
f35d250 Merge branch 'pubsublite' of github.com:GoogleCloudPlatform/python-do… (anguillanneuf)
c72510f Merge branch 'master' into pubsublite (anguillanneuf)
15100aa Merge branch 'master' into pubsublite (anguillanneuf)
418db54 use different uber jar uri (anguillanneuf)
a3abebe Merge branch 'pubsublite' of github.com:GoogleCloudPlatform/python-do… (anguillanneuf)
5e0f794 address reviewer comment (anguillanneuf)
f7a2636 Merge branch 'master' into pubsublite (anguillanneuf)
55ae470 chore(deps): update dependency google-cloud-bigquery to v2.26.0 (#6654) (renovate-bot)
329d376 address reviewer comments (anguillanneuf)
fb49887 lint (anguillanneuf)
9f0f9c2 Merge branch 'master' into pubsublite (anguillanneuf)
4d88501 read from the starting offset (anguillanneuf)
f1f4867 run test in py-3.8 and py-3.9 (anguillanneuf)
3209bba only test in py-3.8 due to concurrent logging limitation (anguillanneuf)
86f7b8a address jiangmichaelll's comments (anguillanneuf)
507d9e6 merge conflicts (anguillanneuf)
a7cc8a0 address jiangmichaelll's comments (anguillanneuf)
7ddde9c Merge branch 'master' into pubsublite (anguillanneuf)
5c010f1 Merge branch 'master' into pubsublite (anguillanneuf)
pubsublite/spark-connector/README.md (new file, +189 lines)
# Using Spark SQL Streaming with Pub/Sub Lite

The samples in this directory show how to read messages from and write messages to Pub/Sub Lite from an [Apache Spark] cluster created with [Cloud Dataproc] using the [Pub/Sub Lite Spark Connector].

Get the connector's uber jar from this [public Cloud Storage location], or download it from this [Maven link]. The uber jar has a "with-dependencies" suffix. You will need to include it on the driver and executor classpaths when submitting a Spark job, typically via the `--jars` flag.

## Before you begin

1. Install the [Cloud SDK].
   > *Note:* This is not required in [Cloud Shell]
   > because Cloud Shell has the Cloud SDK pre-installed.

1. Create a new Google Cloud project via the
   [*New Project* page] or via the `gcloud` command line tool.

   ```sh
   export PROJECT_ID=your-google-cloud-project-id
   gcloud projects create $PROJECT_ID
   ```

   Or use an existing Google Cloud project.

   ```sh
   export PROJECT_ID=$(gcloud config get-value project)
   ```

1. [Enable billing].

1. Set up the Cloud SDK for your Google Cloud project.

   ```sh
   gcloud init
   ```

1. [Enable the APIs]: Pub/Sub Lite, Dataproc, Cloud Storage.

1. Create a Pub/Sub Lite [topic] and [subscription] in a supported [location]. (A Python alternative to these provisioning commands is sketched right after this list.)

   ```bash
   export TOPIC_ID=your-topic-id
   export SUBSCRIPTION_ID=your-subscription-id
   export PUBSUBLITE_LOCATION=your-location

   gcloud pubsub lite-topics create $TOPIC_ID \
     --location=$PUBSUBLITE_LOCATION \
     --partitions=2 \
     --per-partition-bytes=30GiB

   gcloud pubsub lite-subscriptions create $SUBSCRIPTION_ID \
     --location=$PUBSUBLITE_LOCATION \
     --topic=$TOPIC_ID
   ```

1. Create a Cloud Storage bucket.

   ```bash
   export BUCKET_ID=your-gcs-bucket-id

   gsutil mb gs://$BUCKET_ID
   ```
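If you prefer to provision these resources from Python instead of `gcloud`, the `google-cloud-pubsublite` and `google-cloud-storage` client libraries (both pinned in this sample's requirements) can do the same work. The following is an illustrative sketch, not part of this PR's samples: the resource names, region, and zone are placeholders, the partition count and 30 GiB retention mirror the `gcloud` command above, and the throughput capacity values are assumed minimums.

```python
from google.cloud import storage
from google.cloud.pubsublite import AdminClient, Subscription, Topic
from google.cloud.pubsublite.types import (
    CloudRegion,
    CloudZone,
    SubscriptionPath,
    TopicPath,
)

project_number = 11223344556677  # Numeric project number, not the project ID.
zone = CloudZone(CloudRegion("us-central1"), "a")
topic_path = TopicPath(project_number, zone, "your-topic-id")
subscription_path = SubscriptionPath(project_number, zone, "your-subscription-id")

client = AdminClient(CloudRegion("us-central1"))

# Topic with 2 partitions and 30 GiB of storage per partition.
client.create_topic(
    Topic(
        name=str(topic_path),
        partition_config=Topic.PartitionConfig(
            count=2,
            # Assumed minimum per-partition throughput capacity.
            capacity=Topic.PartitionConfig.Capacity(
                publish_mib_per_sec=4, subscribe_mib_per_sec=8
            ),
        ),
        retention_config=Topic.RetentionConfig(
            per_partition_bytes=30 * 1024 * 1024 * 1024
        ),
    )
)

# Subscription that delivers messages as soon as they arrive.
client.create_subscription(
    Subscription(
        name=str(subscription_path),
        topic=str(topic_path),
        delivery_config=Subscription.DeliveryConfig(
            delivery_requirement=Subscription.DeliveryConfig.DeliveryRequirement.DELIVER_IMMEDIATELY
        ),
    )
)

# Cloud Storage bucket used later by Dataproc.
storage.Client().create_bucket("your-gcs-bucket-id")
```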
## Python setup

1. [Install Python and virtualenv].

1. Clone the `python-docs-samples` repository.

   ```bash
   git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
   ```

1. Navigate to the sample code directory.

   ```bash
   cd python-docs-samples/pubsublite/spark-connector
   ```

1. Create a virtual environment and activate it.

   ```bash
   python -m venv env
   source env/bin/activate
   ```

   > Once you are finished with the tutorial, you can deactivate
   > the virtualenv and go back to your global Python environment
   > by running `deactivate`.

1. Install the required packages.

   ```bash
   python -m pip install -U -r requirements.txt
   ```

## Creating a Spark cluster on Dataproc

1. Go to [Cloud Console for Dataproc].

1. Go to Clusters, then [Create Cluster].

   > **Note:** When setting up the cluster, you must choose
   > [Dataproc Image Version 1.5] under ___Versioning___ because
   > the connector currently only supports Spark 2.4.8.
   > Additionally, under ___Manage security (optional)___, you
   > must enable the cloud-platform scope for your cluster by
   > checking "Allow API access to all Google Cloud services in
   > the same project" under ___Project access___.

Here is an equivalent example using a `gcloud` command, with an additional optional flag that enables the component gateway:

```sh
export CLUSTER_ID=your-cluster-id
export DATAPROC_REGION=your-dataproc-region

gcloud dataproc clusters create $CLUSTER_ID \
  --region $DATAPROC_REGION \
  --image-version 1.5-debian10 \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --enable-component-gateway
```
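The cluster can also be created programmatically with the `google-cloud-dataproc` client library pinned in this sample's requirements. This is a hedged sketch, not part of the PR: the project ID, region, and cluster name are placeholders, while the image version and scope mirror the `gcloud` command above.

```python
from google.cloud import dataproc_v1

project_id = "your-google-cloud-project-id"
region = "your-dataproc-region"

# The client must point at the regional Dataproc endpoint.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "your-cluster-id",
    "config": {
        # Spark 2.4.8 requires a Dataproc 1.5 image.
        "software_config": {"image_version": "1.5-debian10"},
        "gce_cluster_config": {
            "service_account_scopes": [
                "https://www.googleapis.com/auth/cloud-platform"
            ],
        },
        # Equivalent of --enable-component-gateway.
        "endpoint_config": {"enable_http_port_access": True},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # Block until the cluster is ready.
```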
## Writing to Pub/Sub Lite

[spark_streaming_to_pubsublite_example.py](spark_streaming_to_pubsublite_example.py) creates a streaming source of consecutive numbers with timestamps for 60 seconds and writes them to a Pub/Sub Lite topic.
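The file itself is part of this PR but not shown in this excerpt. As a hedged sketch (not the sample's exact code), such a pipeline can be built from a Spark rate source, reshaped to the connector's binary `key`/`data` write columns, and streamed out with the `pubsublite` sink; the IDs and checkpoint path below are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import BinaryType, StringType

# Placeholders; the real sample takes these from command-line flags.
project_number = 11223344556677
location = "us-central1-a"
topic_id = "your-topic-id"

spark = SparkSession.builder.appName("write-app").master("yarn").getOrCreate()

# A rate source emits rows of (timestamp, value) with consecutive values.
sdf = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# The connector expects binary "key" and "data" columns; carry the rate
# source's timestamp through as the event timestamp.
sdf = (
    sdf.withColumn("key", sdf.value.cast(StringType()).cast(BinaryType()))
    .withColumn("data", sdf.value.cast(StringType()).cast(BinaryType()))
    .withColumnRenamed("timestamp", "event_timestamp")
    .drop("value")
)

query = (
    sdf.writeStream.format("pubsublite")
    .option(
        "pubsublite.topic",
        f"projects/{project_number}/locations/{location}/topics/{topic_id}",
    )
    # Streaming writes require a checkpoint location; any writable URI works.
    .option("checkpointLocation", "/tmp/pubsublite-spark-checkpoint")
    .outputMode("append")
    .trigger(processingTime="1 second")
    .start()
)

# Let the query run for 60 seconds, then stop it.
query.awaitTermination(60)
query.stop()
```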
To submit a write job:

```sh
export PROJECT_NUMBER=$(gcloud projects list --filter="projectId:$PROJECT_ID" --format="value(PROJECT_NUMBER)")

gcloud dataproc jobs submit pyspark spark_streaming_to_pubsublite_example.py \
  --region=$DATAPROC_REGION \
  --cluster=$CLUSTER_ID \
  --jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar \
  --driver-log-levels=root=INFO \
  --properties=spark.master=yarn \
  -- --project_number=$PROJECT_NUMBER --location=$PUBSUBLITE_LOCATION --topic_id=$TOPIC_ID
```

Visit the job URL in the command output or the jobs panel in [Cloud Console for Dataproc] to monitor the job's progress.

You should see INFO logging like the following in the output:

```none
INFO com.google.cloud.pubsublite.spark.PslStreamWriter: Committed 1 messages for epochId ..
```

## Reading from Pub/Sub Lite

[spark_streaming_from_pubsublite_example.py](spark_streaming_from_pubsublite_example.py) reads messages formatted as dataframe rows from a Pub/Sub Lite subscription and prints them to the console.

To submit a read job:

```sh
gcloud dataproc jobs submit pyspark spark_streaming_from_pubsublite_example.py \
  --region=$DATAPROC_REGION \
  --cluster=$CLUSTER_ID \
  --jars=gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar \
  --driver-log-levels=root=INFO \
  --properties=spark.master=yarn \
  -- --project_number=$PROJECT_NUMBER --location=$PUBSUBLITE_LOCATION --subscription_id=$SUBSCRIPTION_ID
```
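Either submit command can also be issued from Python with the `google-cloud-dataproc` client library pinned in the requirements above. A sketch for the read job, with the understanding that the staged script location, IDs, and argument values are placeholder assumptions:

```python
from google.cloud import dataproc_v1

project_id = "your-google-cloud-project-id"
region = "your-dataproc-region"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "your-cluster-id"},
    "pyspark_job": {
        # The script must be readable by Dataproc, e.g. staged in your bucket.
        "main_python_file_uri": "gs://your-gcs-bucket-id/spark_streaming_from_pubsublite_example.py",
        "jar_file_uris": [
            "gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar"
        ],
        "properties": {"spark.master": "yarn"},
        "args": [
            "--project_number=11223344556677",
            "--location=us-central1-a",
            "--subscription_id=your-subscription-id",
        ],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()  # Wait for the job to terminate.
print(response.driver_output_resource_uri)
```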
Here is an example output: <!--TODO: update attributes field output with the next release of the connector-->

```none
+--------------------+---------+------+---+----+--------------------+--------------------+----------+
|        subscription|partition|offset|key|data|   publish_timestamp|     event_timestamp|attributes|
+--------------------+---------+------+---+----+--------------------+--------------------+----------+
|projects/50200928...|        0| 89523|  0|   .|2021-09-03 23:01:...|2021-09-03 22:56:...|        []|
|projects/50200928...|        0| 89524|  1|   .|2021-09-03 23:01:...|2021-09-03 22:56:...|        []|
|projects/50200928...|        0| 89525|  2|   .|2021-09-03 23:01:...|2021-09-03 22:56:...|        []|
+--------------------+---------+------+---+----+--------------------+--------------------+----------+
```
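Each row above corresponds to one Pub/Sub Lite message. To inspect the exact column types the connector produces, you can print the streaming dataframe's schema; a small, optional addition to the read sample, with the commented output reconstructed from the connector's documented data schema (treat exact nullability as approximate):

```python
# Add right after .load() in spark_streaming_from_pubsublite_example.py:
sdf.printSchema()

# Expected shape, per the Pub/Sub Lite Spark Connector's data schema:
# root
#  |-- subscription: string
#  |-- partition: long
#  |-- offset: long
#  |-- key: binary
#  |-- data: binary
#  |-- publish_timestamp: timestamp
#  |-- event_timestamp: timestamp
#  |-- attributes: map<string, array<binary>>
```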
[Apache Spark]: https://spark.apache.org/
[Pub/Sub Lite Spark Connector]: https://github.com/googleapis/java-pubsublite-spark
[Cloud Dataproc]: https://cloud.google.com/dataproc/docs/
[public Cloud Storage location]: gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
[Maven link]: https://search.maven.org/search?q=g:com.google.cloud%20a:pubsublite-spark-sql-streaming

[Cloud SDK]: https://cloud.google.com/sdk/docs/
[Cloud Shell]: https://console.cloud.google.com/cloudshell/editor/
[*New Project* page]: https://console.cloud.google.com/projectcreate
[Enable billing]: https://cloud.google.com/billing/docs/how-to/modify-project/
[Enable the APIs]: https://console.cloud.google.com/flows/enableapi?apiid=pubsublite.googleapis.com,dataproc,storage_component
[topic]: https://cloud.google.com/pubsub/lite/docs/topics
[subscription]: https://cloud.google.com/pubsub/lite/docs/subscriptions
[location]: https://cloud.google.com/pubsub/lite/docs/locations

[Install Python and virtualenv]: https://cloud.google.com/python/setup/
[Cloud Console for Dataproc]: https://console.cloud.google.com/dataproc/

[Create Cluster]: https://console.cloud.google.com/dataproc/clustersAdd
[Dataproc Image Version 1.5]: https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.5
pubsublite/spark-connector/noxfile_config.py (new file, +42 lines)
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Default TEST_CONFIG_OVERRIDE for python repos.

# You can copy this file into your directory, then it will be imported from
# the noxfile.py.

# The source of truth:
# https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/noxfile_config.py

TEST_CONFIG_OVERRIDE = {
    # You can opt out from the test for specific Python versions.
    # NOTE: We currently only run the test in Python 3.7 and 3.8.
    "ignored_versions": ["2.7", "3.6", "3.9"],
    # Old samples are opted out of enforcing Python type hints.
    # All new samples should feature them.
    "enforce_type_hints": True,
    # An envvar key for determining the project id to use. Change it
    # to 'BUILD_SPECIFIC_GCLOUD_PROJECT' if you want to opt in using a
    # build specific Cloud project. You can also use your own string
    # to use your own Cloud project.
    "gcloud_project_env": "GOOGLE_CLOUD_PROJECT",
    # 'gcloud_project_env': 'BUILD_SPECIFIC_GCLOUD_PROJECT',
    # A dictionary you want to inject into your test. Don't put any
    # secrets here. These values will override predefined values.
    "envs": {
        "PUBSUBLITE_BUCKET_ID": "pubsublite-spark",
        "PUBSUBLITE_CLUSTER_ID": "pubsublite-spark",
    },
}
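For context on how this override is consumed: the repository's shared `noxfile.py` imports `TEST_CONFIG_OVERRIDE` from a sample-local `noxfile_config.py` when one is present. Conceptually, the lookup works something like the following sketch; the actual logic lives in the linked source of truth, and the helper name here is hypothetical.

```python
import importlib.util
import pathlib


def load_test_config_override(sample_dir: str) -> dict:
    """Return TEST_CONFIG_OVERRIDE from a sample's noxfile_config.py, if any."""
    config_path = pathlib.Path(sample_dir) / "noxfile_config.py"
    if not config_path.exists():
        return {}
    # Import the config module directly from its file path.
    spec = importlib.util.spec_from_file_location("noxfile_config", config_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, "TEST_CONFIG_OVERRIDE", {})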
A new requirements file (+4 lines) pins the client libraries and test runner used by the sample and its tests:
google-cloud-dataproc==2.5.0
google-cloud-pubsublite==1.1.0
google-cloud-storage==1.42.1
pytest==6.2.5
A second new requirements file (+1 line) pins PySpark:
pyspark[sql]==3.1.2
pubsublite/spark-connector/spark_streaming_from_pubsublite_example.py (new file, +68 lines)
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse


def spark_streaming_from_pubsublite(
    project_number: int, location: str, subscription_id: str
) -> None:
    # [START pubsublite_spark_streaming_from_pubsublite]
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    # TODO(developer):
    # project_number = 11223344556677
    # location = "us-central1-a"
    # subscription_id = "your-subscription-id"

    spark = SparkSession.builder.appName("read-app").master("yarn").getOrCreate()

    # Read from the Pub/Sub Lite subscription as a streaming source.
    sdf = (
        spark.readStream.format("pubsublite")
        .option(
            "pubsublite.subscription",
            f"projects/{project_number}/locations/{location}/subscriptions/{subscription_id}",
        )
        .load()
    )

    # Cast the binary "data" column to a string so it prints legibly.
    sdf = sdf.withColumn("data", sdf.data.cast(StringType()))

    # Print the received rows to the console once per second.
    query = (
        sdf.writeStream.format("console")
        .outputMode("append")
        .trigger(processingTime="1 second")
        .start()
    )

    # Wait up to 120 seconds (must be at least 60 seconds)
    # for messages to arrive, then stop the query.
    query.awaitTermination(120)
    query.stop()
    # [END pubsublite_spark_streaming_from_pubsublite]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--project_number", type=int, help="Google Cloud Project Number"
    )
    parser.add_argument("--location", help="Your Cloud location, e.g. us-central1-a")
    parser.add_argument("--subscription_id", help="Your Pub/Sub Lite subscription ID")

    args = parser.parse_args()

    spark_streaming_from_pubsublite(
        args.project_number, args.location, args.subscription_id
    )