Refactor BQ to expose all beam's configurations #5456
base: main
import org.apache.beam.sdk.values.{PCollection, PCollectionTuple, TupleTag}

/**
 * A sink for error records.
bit more explanation on error records could be helpful, maybe:
 * A sink for error records.
 *
 * An error record is produced by certain PTransforms that catch processing exceptions and transform the resulting (element, exception) pair into a [[BadRecord]] instance.
 * When an ErrorSink is configured (via ScioContext#errorSink), these BadRecords can be accessed as an SCollection by invoking the ErrorSink#sink method.
 * An ErrorSink is useful if you'd like to set up special handling of exceptions (incrementing Counters, logging the exceptions in a database, etc.).
 * Once the [[sink]] is materialized, the [[handler]] must not be used anymore.
 */
sealed trait ErrorSink {
  def handler: ErrorHandler[BadRecord, _]
could `def handler` be `private[scio]`? not sure when a user would need to access this
This is the API exposed by Beam. As mentioned in the description, we do not pass the ErrorSink directly.
sc.bigQueryStorageFormat[MyType](
  table,
  format,
  errorHandler = errorSink.handler
)
I was thinking of adding a Beam Java-like API to the ScioContext too:
def registerBadRecordErrorHandler[OutputT <: POutput](sinkTransform: PTransform[PCollection[BadRecord], OutputT]): BadRecordErrorHandler[OutputT]
f9c02ac to 2d66440
Here are the main changes:
- The BQ `Table` source has a single normalized definition, with multiple constructors (from a string spec or a `TableReference`). It now includes an optional `Table.Filter` that can be used in the storage read API to project and filter.
- Read API changes: the methods with the `Format` API take a `BigqueryIO.Format` object that converts either from `GenericRecord` (this should be preferred) or from `TableRow`.
- The `Storage` API allows passing an `ErrorHandler`. In order to preserve a flat structure, `ScioContext.errorSink(): ErrorSink` has been added. The `handler` can be passed to multiple IOs before `sink` is materialized. The sink will flatten the errors from the IOs.
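The handler/sink lifecycle described in the last point (share the `handler` with several IOs, then materialize `sink` once) can be modeled in plain Scala. This is a toy sketch only, not the Scio or Beam API: `ToyErrorSink`, `addErrors` and `readInts` are hypothetical names made up for illustration, and `BadRecord` here is a simplified stand-in for Beam's class.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.Try

object ErrorSinkSketch {
  // Simplified stand-in for Beam's BadRecord: the failed element and a reason.
  final case class BadRecord(element: String, error: String)

  // Hypothetical model of an error sink: many producers, one flattened output.
  final class ToyErrorSink {
    private val outputs = ListBuffer.empty[Seq[BadRecord]]
    private var materialized = false

    // Handed to each IO; every call registers one more batch of errors.
    object handler {
      def addErrors(errors: Seq[BadRecord]): Unit = {
        require(!materialized, "handler used after sink was materialized")
        outputs += errors
      }
    }

    // Materializes the sink: flattens all registered errors and closes the handler.
    def sink: Seq[BadRecord] = {
      materialized = true
      outputs.flatten.toList
    }
  }

  // Hypothetical IO: parses integers and reports unparsable input to the handler.
  def readInts(input: Seq[String], onError: Seq[BadRecord] => Unit): Seq[Int] = {
    val parsed = input.map(s => Try(s.toInt).toOption.toRight(BadRecord(s, "not an integer")))
    onError(parsed.collect { case Left(bad) => bad })
    parsed.collect { case Right(i) => i }
  }

  def main(args: Array[String]): Unit = {
    val errorSink = new ToyErrorSink
    // The same handler is passed to two IOs before the sink is materialized.
    val a = readInts(Seq("1", "x", "3"), errorSink.handler.addErrors)
    val b = readInts(Seq("y", "5"), errorSink.handler.addErrors)
    val errors = errorSink.sink
    println(s"$a $b ${errors.map(_.element)}")
  }
}
```

In this model, calling `handler.addErrors` after `sink` has been read fails fast, which mirrors the documented rule that the handler must not be used once the sink is materialized.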