Conversation

@anguillanneuf anguillanneuf commented Sep 10, 2021

This PR contains code that will turn into a quickstart page on cloud.google.com/pubsub/docs.

Notes to the reviewer:

  1. Creating a dataframe column named "attributes" is currently blocked on googleapis/java-pubsublite-spark#261 ("Issue using PySpark writeStream to write to Pub/Sub Lite with an attributes field"). cc: @jiangmichaellll
     Once unblocked, I can add these lines:

         sdf = (
             sdf.withColumn("key", (sdf.value % 5).cast(StringType()).cast(BinaryType()))
             .withColumn("event_timestamp", sdf.timestamp)
             .withColumn("data", sdf.value.cast(StringType()).cast(BinaryType()))
             # .withColumn(
             #     "attributes",
             #     create_map(lit("prop1"), array(divisible_by_two_udf("value").cast(BinaryType())))
             #     .cast(MapType(StringType(), ArrayType(BinaryType()), True)),
             # )
             .drop("value", "timestamp")
         )

     Also added a TODO.
  2. Region tags are tracked in b/199554829 and b/199554739.
  3. Internal cl/396882350.

snippet-bot bot commented Sep 10, 2021

Here is the summary of changes.

You are about to add 2 region tags.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.

@product-auto-label product-auto-label bot added the samples Issues that are directly related to samples. label Sep 10, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Sep 10, 2021
@anguillanneuf anguillanneuf marked this pull request as ready for review September 10, 2021 22:09
@anguillanneuf anguillanneuf requested a review from a team as a code owner September 10, 2021 22:09
@anguillanneuf anguillanneuf changed the title Spark SQL Streaming to Pub/Sub Lite using Pub/Sub Lite Spark Connector feat(pubsublite): Spark SQL Streaming to Pub/Sub Lite using Pub/Sub Lite Spark Connector Sep 10, 2021
@dandhlee dandhlee added the api: pubsublite Issues related to the Pub/Sub Lite API. label Sep 10, 2021
Collaborator

@dandhlee dandhlee left a comment


Thanks for submitting this! A few comments below.

As well, is there another Pub/Sub expert who could take a look at this sample as a product owner?

@anguillanneuf
Member Author

anguillanneuf commented Sep 14, 2021

As well, is there another Pub/Sub expert who could take a look at this sample as a product owner?

@jiangmichaellll (developer for the connector, he is a Pub/Sub Lite dev) should be the person to review this.


@jiangmichaellll jiangmichaellll left a comment


Mostly looking good; let me know when you need the next round of review.

Collaborator

@dandhlee dandhlee left a comment


Almost there!

TEST_CONFIG_OVERRIDE = {
    # You can opt out from the test for specific Python versions.
    # NOTE: Apache Beam does not currently support Python 3.9.
    "ignored_versions": ["2.7", "3.6", "3.7", "3.9"],
Collaborator


I think I missed this before, but is it a strict requirement to only use 3.8?

Member Author


I can open this up as a last step. Right now I just want the test to run fast.

Member Author

@anguillanneuf anguillanneuf Sep 15, 2021


@jiangmichaellll Looks like multiple streaming jobs in the same cluster will produce "Concurrent update to the log. Multiple streaming jobs detected" and cause one job to fail. See job log.

@dandhlee Since we are just using one Spark cluster, I vote we just test against one Python version.


Hmm, did you manually run the jobs concurrently? I've never seen this before. I usually either change the app name or only run one at a time.

Member Author


When two Python versions are selected, two ITs run in parallel (two different topics created at roughly the same time, and two PySpark jobs submitted at roughly the same time as well).
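A minimal sketch of how the collision could be avoided by making resource names unique per test run (the helper name and suffix scheme are my own illustration, not code from this PR):

```python
import uuid

# One short random suffix per test run: two integration-test runs submitted
# at roughly the same time then create differently named topics and
# subscriptions instead of colliding on shared names.
RUN_ID = uuid.uuid4().hex[:8]


def unique_resource_id(prefix: str) -> str:
    """Return a per-run resource name such as 'my-topic-1a2b3c4d'."""
    return f"{prefix}-{RUN_ID}"
```

Each test would then create its topic via `unique_resource_id("my-topic")`, so parallel runs never share a name while all resources from one run stay easy to find and clean up.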

Member Author

@anguillanneuf anguillanneuf Sep 16, 2021


Used a uuid in appName when starting up a SparkSession. But one write job would still fail (log) and the other succeed (log).

Member Author

@anguillanneuf anguillanneuf Sep 16, 2021


Using a uuid in checkpointLocation solved the "Concurrent update to the log" error.
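For reference, the fix amounts to something like the following sketch (the bucket name and helper function are illustrative assumptions, not code from the sample):

```python
import uuid


def unique_checkpoint_location(bucket: str) -> str:
    """Build a per-query Spark streaming checkpoint path.

    Spark keeps its streaming metadata log under checkpointLocation, so two
    concurrent queries sharing one path race on that log and one fails with
    "Concurrent update to the log". A uuid suffix gives each query its own
    directory.
    """
    return f"gs://{bucket}/pubsublite-spark-checkpoint-{uuid.uuid4().hex}"
```

The result would then be passed on the write side, e.g. `.option("checkpointLocation", unique_checkpoint_location("my-bucket"))` on the streaming query.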

Member Author

@anguillanneuf anguillanneuf Sep 16, 2021


But occasionally you may see something like this, where no messages are published to Pub/Sub Lite due to insufficient worker resources (which may happen if multiple presubmit tests are running, each requesting multiple Python versions):

21/09/16 19:19:53 INFO com.google.cloud.pubsublite.spark.PslSparkUtils: Input schema to write to Pub/Sub Lite doesn't contain attributes column, this field for all rows will be set to empty. [CONTEXT ratelimit_period="5 MINUTES" ]
21/09/16 19:19:54 INFO com.google.cloud.pubsublite.spark.PslStreamWriter: Committed 0 messages for epochId:0.
21/09/16 19:19:54 WARN org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 1000 milliseconds, but spent 1334 milliseconds
21/09/16 19:20:09 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/09/16 19:20:24 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/09/16 19:20:39 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/09/16 19:20:52 ERROR org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@6d4c678c is aborting.
21/09/16 19:20:52 WARN com.google.cloud.pubsublite.spark.PslStreamWriter: Epoch id: 1 is aborted, 0 messages might have been published.


@jiangmichaellll jiangmichaellll left a comment


LGTM, assuming you add a uuid suffix to the appName to address the concurrency issue.

@anguillanneuf anguillanneuf force-pushed the pubsublite branch 2 times, most recently from 83c36f3 to bd21883 Compare September 16, 2021 19:13
Collaborator

@dandhlee dandhlee left a comment


LGTM!

@dandhlee
Collaborator

Admin merging since we have a product expert review and a samples reviewer review!

@dandhlee dandhlee merged commit 267a36a into master Sep 16, 2021
@dandhlee dandhlee deleted the pubsublite branch September 16, 2021 21:54