Fixing analytics violations found in release pipeline (Azure#29496)
FabianMeiswinkel authored and khmic5 committed Jun 19, 2022
1 parent e06beb7 commit c2eccd6
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions sdk/cosmos/azure-cosmos-spark_3_2-12/docs/migration.md
@@ -55,12 +55,12 @@ When reading data in Spark via a DataSource, the number of partitions in the ret
For most use cases, the default partitioning strategy `Default` should be sufficient. For some use cases (especially when doing very targeted filtering to just one logical partition key, etc.) it might be preferable to use the previous partitioning model (just one Spark partition per physical Cosmos partition) - this can be achieved by using the `Restrictive` partitioning strategy, as sketched below.
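
A minimal sketch of opting into the `Restrictive` strategy, assuming `spark` is an active `SparkSession` and using placeholder account, database, and container values; `spark.cosmos.read.partitioning.strategy` is the configuration entry the strategy is set through:

```scala
// Placeholder connection settings - replace endpoint/key/database/container with real values.
val cosmosConfig = Map(
  "spark.cosmos.accountEndpoint" -> "https://<cosmos-account>.documents.azure.com:443/",
  "spark.cosmos.accountKey" -> "<cosmos-account-key>",
  "spark.cosmos.database" -> "sampleDB",
  "spark.cosmos.container" -> "sampleContainer",
  // Opt back into the Spark 2.4-style model: one Spark partition per physical Cosmos partition.
  "spark.cosmos.read.partitioning.strategy" -> "Restrictive"
)

val df = spark.read
  .format("cosmos.oltp")
  .options(cosmosConfig)
  .load()
```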

### Automatically populating "id" column
- Unlike the Cosmos DB connector for Spark 2.*, the new connector for Spark 3.* and above (`cosmos.oltp`) requires you to pre-populate an `id` column in the data frame containing the data you want to write into Cosmos DB. In the previous version it was possible to let the Spark connector auto-generate a value for the `id` column. We changed this because auto-generating `id` values in the connector has an intrinsic problem whenever Spark retries a task: a different `id` value would be generated for each retry, which means you might end up with duplicate records. A sample that explains how you can safely populate `id` values in a data frame is located here: [Ingestion best practices](./scenarios/Ingestion.md#populating-id-column)
+ Unlike the Cosmos DB connector for Spark 2.*, the new connector for Spark 3.* and above (`cosmos.oltp`) requires you to pre-populate an `id` column in the data frame containing the data you want to write into Cosmos DB. In the previous version it was possible to let the Spark connector auto-generate a value for the `id` column. We changed this because auto-generating `id` values in the connector has an intrinsic problem whenever Spark retries a task: a different `id` value would be generated for each retry, which means you might end up with duplicate records. A sample that explains how you can safely populate `id` values in a data frame is located here: [Ingestion best practices](https://aka.ms/azure-cosmos-spark-3-scenarios-ingestion#populating-id-column)
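
One safe pattern (a sketch, not the linked sample; `sourceDf`, `orderId`, and `lineNumber` are hypothetical names) is to derive `id` deterministically from columns that uniquely identify a record, so a retried task produces the same value instead of a fresh random one:

```scala
import org.apache.spark.sql.functions.{col, concat_ws, md5}

// Deterministic id: the same logical record always maps to the same id, even across task retries.
val dfWithId = sourceDf.withColumn("id", md5(concat_ws("|", col("orderId"), col("lineNumber"))))

// Write using the cosmosConfig map from the sketch above.
dfWithId.write
  .format("cosmos.oltp")
  .options(cosmosConfig)
  .mode("Append")
  .save()
```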

### "One DataSource rules them all" vs. separate DataSource for ChangeFeed
In the Cosmos DB Connector for Spark 2.4, all operations (writing or reading documents/items from the container as well as processing the change feed) were surfaced by one DataSource. With the new Cosmos DB Connector for Spark 3 we use two different DataSources - `cosmos.oltp` and `cosmos.oltp.changeFeed` - to be able to express the supported capabilities correctly in the DataSource V2 API; for example, to expose to the Spark runtime that the change feed DataSource does not support writes but supports read stream operations, while the DataSource for items supports writing (both streaming and batch) but no read stream operations. This means that when migrating existing notebooks/programs it is necessary to change not only the configuration but also the identifier in the `format(xxx)` clause - `cosmos.oltp` for operations on items and `cosmos.oltp.changeFeed` when processing change feed events.
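
A side-by-side sketch of the two identifiers, reusing the `cosmosConfig` map from above (`spark.cosmos.changeFeed.startFrom` is assumed to be one of the connector's change feed settings):

```scala
// Items: read (and write) through the cosmos.oltp DataSource.
val itemsDf = spark.read
  .format("cosmos.oltp")
  .options(cosmosConfig)
  .load()

// Change feed: read-only source exposed through cosmos.oltp.changeFeed.
val changeFeedDf = spark.read
  .format("cosmos.oltp.changeFeed")
  .options(cosmosConfig + ("spark.cosmos.changeFeed.startFrom" -> "Beginning"))
  .load()
```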

### Structured streaming
- The API for Structured Streaming changed between DataSource V1 and V2. The new Cosmos DB connector uses the DataSource V2 API. As a result, some proprietary approaches (like persisting the progress/bookmarks/offset in a Cosmos DB container) are replaced by default Spark mechanics where possible. The new connector, for example, simply uses Spark offsets to expose the progress. Spark stores these offsets in the provided checkpoint location, and when you want to restart/recover a query the mechanism is the same as with any other DataSource supporting Structured Streaming - Kafka, for example. [The Structured Streaming Programming Guide](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) is a good starting point to learn how to work with Structured Streaming.
+ The API for Structured Streaming changed between DataSource V1 and V2. The new Cosmos DB connector uses the DataSource V2 API. As a result, some proprietary approaches (like persisting the progress/bookmarks/offset in a Cosmos DB container) are replaced by default Spark mechanics where possible. The new connector, for example, simply uses Spark offsets to expose the progress. Spark stores these offsets in the provided checkpoint location, and when you want to restart/recover a query the mechanism is the same as with any other DataSource supporting Structured Streaming - Kafka, for example. [The Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) is a good starting point to learn how to work with Structured Streaming.
Currently, the new Cosmos DB Connector allows you to use the `cosmos.oltp` DataSource as a sink and `cosmos.oltp.changeFeed` as a source for MicroBatch queries, as sketched below. We will add support for Continuous Processing in addition to MicroBatching soon after GA.
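
A micro-batch sketch under those assumptions - `targetCosmosConfig` is a hypothetical configuration map for a destination container, and the checkpoint path is a placeholder; Spark persists its offsets under `checkpointLocation`, so restarting the query resumes where it left off:

```scala
// Source: the change feed of the container configured in cosmosConfig.
val changeFeedStream = spark.readStream
  .format("cosmos.oltp.changeFeed")
  .options(cosmosConfig)
  .load()

// Sink: write each micro-batch into the (hypothetical) target container.
val query = changeFeedStream.writeStream
  .format("cosmos.oltp")
  .options(targetCosmosConfig)
  .option("checkpointLocation", "/tmp/cosmos-migration-checkpoint")
  .outputMode("append")
  .start()

query.awaitTermination()
```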
