feat(pubsublite): Spark SQL Streaming to Pub/Sub Lite using Pub/Sub Lite Spark Connector #6682
Conversation
Here is the summary of changes. You are about to add 2 region tags.
This comment is generated by snippet-bot.
dandhlee left a comment:
Thanks for submitting this! A few comments below.
Also, is there another Pub/Sub expert who could take a look at this sample as a product owner?
pubsublite/spark-connector/spark_streaming_to_pubsublite_example.py (outdated; resolved)
Co-authored-by: Dan Lee <71398022+dandhlee@users.noreply.github.com>
…cs-samples into pubsublite
@jiangmichaellll (the developer for the connector; he is a Pub/Sub Lite dev) should be the person to review this.
jiangmichaellll left a comment:
Mostly looking good; let me know when you need the next round of review.
dandhlee left a comment:
Almost there!
TEST_CONFIG_OVERRIDE = {
    # You can opt out from the test for specific Python versions.
    # NOTE: Apache Beam does not currently support Python 3.9.
    "ignored_versions": ["2.7", "3.6", "3.7", "3.9"],
I think I missed this before, but is it a strict requirement to use only 3.8?
I can open this up as a last step. Right now I just want the test to run fast.
@jiangmichaellll Looks like multiple streaming jobs in the same cluster will produce "Concurrent update to the log. Multiple streaming jobs detected" and cause one job to fail. See the job log.
@dandhlee Since we are just using one Spark cluster, I vote we just test against one Python version.
Hmm, did you manually run jobs concurrently? I've never seen this before. I usually either change the app name or only run one at a time.
When two Python versions are selected, two ITs run in parallel (two different topics are created at roughly the same time, and two PySpark jobs are submitted at roughly the same time as well).
Using a uuid in checkpointLocation solved "Concurrent update to the log".
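The fix described above can be sketched as follows. This is a minimal sketch: the helper name `unique_checkpoint_dir` and the base path are our own inventions; only the `checkpointLocation` option itself is the standard Spark Structured Streaming setting the comment refers to.

```python
import uuid


def unique_checkpoint_dir(base: str = "/tmp/spark-checkpoints") -> str:
    """Build a per-run checkpoint path so concurrent streaming jobs
    never share the same Structured Streaming commit log."""
    return f"{base}/run-{uuid.uuid4().hex}"


# When starting the stream, pass the unique path as the standard
# checkpoint option (connector format name per the PR's samples), e.g.:
#   sdf.writeStream.format("pubsublite") \
#       .option("checkpointLocation", unique_checkpoint_dir()) \
#       .start()
```

With a fresh checkpoint directory per test run, two ITs started in parallel no longer see each other's write-ahead log, which is what triggered the "multiple streaming jobs detected" failure.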
But occasionally you may see something like this, where no messages are published to Pub/Sub Lite due to insufficient worker resources (this may happen if multiple presubmit tests are running, each requesting multiple Python versions):
21/09/16 19:19:53 INFO com.google.cloud.pubsublite.spark.PslSparkUtils: Input schema to write to Pub/Sub Lite doesn't contain attributes column, this field for all rows will be set to empty. [CONTEXT ratelimit_period="5 MINUTES" ]
21/09/16 19:19:54 INFO com.google.cloud.pubsublite.spark.PslStreamWriter: Committed 0 messages for epochId:0.
21/09/16 19:19:54 WARN org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 1000 milliseconds, but spent 1334 milliseconds
21/09/16 19:20:09 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/09/16 19:20:24 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/09/16 19:20:39 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/09/16 19:20:52 ERROR org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec: Data source writer org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@6d4c678c is aborting.
21/09/16 19:20:52 WARN com.google.cloud.pubsublite.spark.PslStreamWriter: Epoch id: 1 is aborted, 0 messages might have been published.
pubsublite/spark-connector/spark_streaming_from_pubsublite_example.py (outdated; resolved)
Force-pushed b0fc20f to 329d376
jiangmichaellll left a comment:
LGTM, assuming you add a uuid suffix to the appName to address the concurrency issue.
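The suggested appName suffix might look like the sketch below. The helper name `unique_app_name` and the prefix are hypothetical; the `SparkSession.builder.appName(...)` call in the comment is the standard PySpark API.

```python
import uuid


def unique_app_name(prefix: str = "pubsublite-spark-test") -> str:
    """Suffix the Spark appName with a short uuid so two test runs on
    the same cluster never submit identically named streaming jobs."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


# e.g. SparkSession.builder.appName(unique_app_name()).getOrCreate()
```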
Force-pushed fe86f34 to a64d28e
Force-pushed 83c36f3 to bd21883
Force-pushed bd21883 to a7cc8a0
dandhlee left a comment:
LGTM!
Admin merging, since we have a product expert review and a samples reviewer review!
This PR contains code that will turn into a quickstart page on cloud.google.com/pubsub/docs.
Notes to the reviewer:
Once unblocked, I can add these lines:
python-docs-samples/pubsublite/spark-connector/spark_streaming_to_pubsublite_example.py
Lines 39 to 50 in 61564a0
Also added a TODO.