diff --git a/hudi-utils/src/main/java/org/apache/hudi/utils/HoodieSparkConfigs.java b/hudi-utils/src/main/java/org/apache/hudi/utils/HoodieSparkConfigs.java
index d4e73d2f3bc15..de57bbe87eab7 100644
--- a/hudi-utils/src/main/java/org/apache/hudi/utils/HoodieSparkConfigs.java
+++ b/hudi-utils/src/main/java/org/apache/hudi/utils/HoodieSparkConfigs.java
@@ -61,7 +61,7 @@ public static String description(Object sparkConfigObject) {
         ".options(clientOpts) // any of the Hudi client opts can be passed in as well\n" +
         ".option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), \"_row_key\")\n" +
         ".option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), \"partition\")\n" +
-        ".option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), \"timestamp\")\n" +
+        ".option(HoodieTableConfig.ORDERING_FIELDS(), \"timestamp\")\n" +
         ".option(HoodieWriteConfig.TABLE_NAME, tableName)\n" +
         ".mode(SaveMode.Append)\n" +
         ".save(basePath);\n" +
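
For reviewers: the hunk above only edits a string constant, but the same change applies to real Spark jobs. Below is a minimal sketch of the updated write path, assuming a Hudi 1.x Spark bundle on the classpath; it uses the raw config keys documented later in this diff instead of the `DataSourceWriteOptions` constants, and the paths, table name, and field names are illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public final class OrderingFieldsWriteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("ordering-fields-example")
        .master("local[*]") // illustrative; drop when submitting to a cluster
        .getOrCreate();

    // Illustrative input; any Dataset<Row> with _row_key, partition, and timestamp columns works.
    Dataset<Row> inputDF = spark.read().json("/tmp/input.json");

    inputDF.write()
        .format("hudi")
        .option("hoodie.datasource.write.recordkey.field", "_row_key")
        .option("hoodie.datasource.write.partitionpath.field", "partition")
        // Ordering fields replace the deprecated precombine option in merge comparisons.
        .option("hoodie.table.ordering.fields", "timestamp")
        .option("hoodie.table.name", "my_table")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/my_table");
  }
}
```

Per the `hoodie.table.ordering.fields` description below, the key accepts a comma-separated list, so compound tie-breakers such as `timestamp,seq` also work.
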
diff --git a/website/docs/basic_configurations.md b/website/docs/basic_configurations.md
index 7d7f35f9014c2..16fb55d1ac710 100644
--- a/website/docs/basic_configurations.md
+++ b/website/docs/basic_configurations.md
@@ -33,48 +33,47 @@ Configurations of the Hudi Table like type of ingestion, storage formats, hive t
 
 [**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs)
 
-| Config Name | Default | Description |
-| ------------------------------------------------------------------ | ------------------------------- | ------------------------------------------------------------------------------------ |
-| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br/>`Config Param: BOOTSTRAP_BASE_PATH` |
-| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | (N/A) | Payload class to use for performing merges, compactions, i.e merge delta logs with current base file and then produce a new base file.<br/>`Config Param: PAYLOAD_CLASS_NAME` |
-| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br/>`Config Param: DATABASE_NAME` |
-| [hoodie.record.merge.mode](#hoodierecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.<br/>`Config Param: RECORD_MERGE_MODE`<br/>`Since Version: 1.0.0` |
-| [hoodie.record.merge.strategy.id](#hoodierecordmergestrategyid) | (N/A) | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` which has the same merger strategy id<br/>`Config Param: RECORD_MERGE_STRATEGY_ID`<br/>`Since Version: 0.13.0` |
-| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br/>`Config Param: TABLE_CHECKSUM`<br/>`Since Version: 0.11.0` |
-| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table<br/>`Config Param: CREATE_SCHEMA` |
-| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Relative path to table base path where the index definitions are stored<br/>`Config Param: RELATIVE_INDEX_DEFINITION_PATH`<br/>`Since Version: 1.0.0` |
-| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br/>`Config Param: KEY_GENERATOR_CLASS_NAME` |
-| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key Generator type to determine key generator class<br/>`Config Param: KEY_GENERATOR_TYPE`<br/>`Since Version: 1.0.0` |
-| [hoodie.table.legacy.payload.class](#hoodietablelegacypayloadclass) | (N/A) | Payload class to indicate the payload class that is used to create the table and is not used anymore.<br/>`Config Param: LEGACY_PAYLOAD_CLASS_NAME`<br/>`Since Version: 1.1.0` |
-| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br/>`Config Param: TABLE_METADATA_PARTITIONS`<br/>`Since Version: 0.11.0` |
-| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br/>`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br/>`Since Version: 0.11.0` |
-| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br/>`Config Param: NAME` |
-| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma separated fields used in records merging comparison. By default, when two records have the same key value, the largest value for the ordering field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.<br/>`Config Param: ORDERING_FIELDS` |
-| [hoodie.table.partial.update.mode](#hoodietablepartialupdatemode) | (N/A) | This property when set, will define how two versions of the record will be merged together when records are partially formed<br/>`Config Param: PARTIAL_UPDATE_MODE`<br/>`Since Version: 1.1.0` |
-| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Comma separated field names used to partition the table. These field names also include the partition type which is used by custom key generators<br/>`Config Param: PARTITION_FIELDS` |
-| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Comma separated fields used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.<br/>`Config Param: PRECOMBINE_FIELD` |
-| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br/>`Config Param: RECORDKEY_FIELDS` |
-| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br/>`Config Param: SECONDARY_INDEXES_METADATA`<br/>`Since Version: 0.13.0` |
-| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br/>`Config Param: TIMELINE_LAYOUT_VERSION` |
-| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br/>`Config Param: ARCHIVELOG_FOLDER` |
-| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br/>`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` |
-| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br/>`Config Param: BOOTSTRAP_INDEX_ENABLE` |
-| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type determines which implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br/>`Config Param: BOOTSTRAP_INDEX_TYPE`<br/>`Since Version: 1.0.0` |
-| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br/>`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` |
-| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br/>`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` |
-| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br/>`Config Param: POPULATE_META_FIELDS` |
-| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br/>`Config Param: BASE_FILE_FORMAT` |
-| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br/>`Config Param: CDC_ENABLED`<br/>`Since Version: 0.13.0` |
+| Config Name | Default | Description |
+|----------------------------------------------------------------------| ------------------------------- |--------------------------------------------------------------------------------------|
+| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br/>`Config Param: BOOTSTRAP_BASE_PATH` |
+| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | (N/A) | Payload class to use for performing merges and compactions, i.e., merging delta logs with the current base file to produce a new base file.<br/>`Config Param: PAYLOAD_CLASS_NAME` |
+| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name. If different databases have the same table name during an incremental query, this can be set to limit the table name to a specific database.<br/>`Config Param: DATABASE_NAME` |
+| [hoodie.record.merge.mode](#hoodierecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.<br/>`Config Param: RECORD_MERGE_MODE`<br/>`Since Version: 1.0.0` |
+| [hoodie.record.merge.strategy.id](#hoodierecordmergestrategyid) | (N/A) | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` which have the same merger strategy id.<br/>`Config Param: RECORD_MERGE_STRATEGY_ID`<br/>`Since Version: 0.13.0` |
+| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading the table config.<br/>`Config Param: TABLE_CHECKSUM`<br/>`Since Version: 0.11.0` |
+| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table<br/>`Config Param: CREATE_SCHEMA` |
+| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Path, relative to the table base path, where the index definitions are stored<br/>`Config Param: RELATIVE_INDEX_DEFINITION_PATH`<br/>`Since Version: 1.0.0` |
+| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br/>`Config Param: KEY_GENERATOR_CLASS_NAME` |
+| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key Generator type to determine key generator class<br/>`Config Param: KEY_GENERATOR_TYPE`<br/>`Since Version: 1.0.0` |
+| [hoodie.table.legacy.payload.class](#hoodietablelegacypayloadclass) | (N/A) | Indicates the payload class that was used to create the table; it is no longer used.<br/>`Config Param: LEGACY_PAYLOAD_CLASS_NAME`<br/>`Since Version: 1.1.0` |
+| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in sync with the data table. These partitions are ready for use by the readers.<br/>`Config Param: TABLE_METADATA_PARTITIONS`<br/>`Since Version: 0.11.0` |
+| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br/>`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br/>`Since Version: 0.11.0` |
+| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be the same across runs.<br/>`Config Param: NAME` |
+| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used in record merging comparison. By default, when two records have the same key value, the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison moves to the second field, and so on.<br/>`Config Param: ORDERING_FIELDS` |
+| [hoodie.table.partial.update.mode](#hoodietablepartialupdatemode) | (N/A) | This property, when set, defines how two versions of a record are merged together when records are partially formed.<br/>`Config Param: PARTIAL_UPDATE_MODE`<br/>`Since Version: 1.1.0` |
+| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Comma-separated field names used to partition the table. These field names also include the partition type, which is used by custom key generators.<br/>`Config Param: PARTITION_FIELDS` |
+| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br/>`Config Param: RECORDKEY_FIELDS` |
+| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br/>`Config Param: SECONDARY_INDEXES_METADATA`<br/>`Since Version: 0.13.0` |
+| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of the timeline used by the table.<br/>`Config Param: TIMELINE_LAYOUT_VERSION` |
+| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | Path under the meta folder to store archived timeline instants.<br/>`Config Param: ARCHIVELOG_FOLDER` |
+| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use for mapping base files to bootstrap base files that contain the actual data.<br/>`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` |
+| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not this is a bootstrapped table, with bootstrap base data and a mapping index defined; default true.<br/>`Config Param: BOOTSTRAP_INDEX_ENABLE` |
+| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type determining which implementation to use for mapping base files to bootstrap base files that contain the actual data.<br/>`Config Param: BOOTSTRAP_INDEX_TYPE`<br/>`Since Version: 1.0.0` |
+| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values).<br/>`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` |
+| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br/>`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` |
+| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append-only/immutable data in batch processing.<br/>`Config Param: POPULATE_META_FIELDS` |
+| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br/>`Config Param: BASE_FILE_FORMAT` |
+| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enabled, persists the change data if necessary, so the table can be queried in CDC query mode.<br/>`Config Param: CDC_ENABLED`<br/>`Since Version: 0.13.0` |
 | [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and after image. DATA_BEFORE: Keeping the before images in the supplemental logs, so the reader needs to figure out the update after images. DATA_BEFORE_AFTER(default): Keeping the before and after images in the supplemental logs, so the reader can generate the details directly from the logs.<br/>`Config Param: CDC_SUPPLEMENTAL_LOGGING_MODE`<br/>`Since Version: 0.13.0` |
-| [hoodie.table.format](#hoodietableformat) | native | Table format name used when writing to the table.<br/>`Config Param: TABLE_FORMAT` |
-| [hoodie.table.initial.version](#hoodietableinitialversion) | NINE | Initial Version of table when the table was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially setup.<br/>`Config Param: INITIAL_VERSION`<br/>`Since Version: 1.0.0` |
-| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br/>`Config Param: LOG_FILE_FORMAT` |
-| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.<br/>`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`<br/>`Since Version: 1.0.0` |
-| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default<br/>`Config Param: TIMELINE_TIMEZONE` |
-| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data.<br/>`Config Param: TYPE` |
-| [hoodie.table.version](#hoodietableversion) | NINE | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br/>`Config Param: VERSION` |
-| [hoodie.timeline.history.path](#hoodietimelinehistorypath) | history | path under the meta folder, to store timeline history at.<br/>`Config Param: TIMELINE_HISTORY_PATH` |
-| [hoodie.timeline.path](#hoodietimelinepath) | timeline | path under the meta folder, to store timeline instants at.<br/>`Config Param: TIMELINE_PATH` |
+| [hoodie.table.format](#hoodietableformat) | native | Table format name used when writing to the table.<br/>`Config Param: TABLE_FORMAT` |
+| [hoodie.table.initial.version](#hoodietableinitialversion) | NINE | Initial version of the table when it was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially set up.<br/>`Config Param: INITIAL_VERSION`<br/>`Since Version: 1.0.0` |
+| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.<br/>`Config Param: LOG_FILE_FORMAT` |
+| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.<br/>`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`<br/>`Since Version: 1.0.0` |
+| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set the hoodie commit timeline timezone, such as utc or local; local is the default.<br/>`Config Param: TIMELINE_TIMEZONE` |
+| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data.<br/>`Config Param: TYPE` |
+| [hoodie.table.version](#hoodietableversion) | NINE | Version of the table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards-compatible changes.<br/>`Config Param: VERSION` |
+| [hoodie.timeline.history.path](#hoodietimelinehistorypath) | history | Path under the meta folder to store timeline history.<br/>`Config Param: TIMELINE_HISTORY_PATH` |
+| [hoodie.timeline.path](#hoodietimelinepath) | timeline | Path under the meta folder to store timeline instants.<br/>`Config Param: TIMELINE_PATH` |
 
 ---
 
 ## Spark Datasource Configs {#SPARK_DATASOURCE}
@@ -97,7 +96,6 @@ Options useful for reading tables via `read.format.option(...)`
 | [hoodie.datasource.read.end.instanttime](#hoodiedatasourcereadendinstanttime) | (N/A) | Used when `hoodie.datasource.query.type` is set to `incremental`. Represents the completion time to limit incrementally fetched data to. When not specified latest commit completion time from timeline is assumed by default. When specified, new data written with completion_time <= END_COMMIT are fetched out. Point in time type queries make more sense with begin and end completion times specified.<br/>`Config Param: END_COMMIT` |
 | [hoodie.datasource.read.incr.table.version](#hoodiedatasourcereadincrtableversion) | (N/A) | The table version assumed for incremental read<br/>`Config Param: INCREMENTAL_READ_TABLE_VERSION` |
 | [hoodie.datasource.read.streaming.table.version](#hoodiedatasourcereadstreamingtableversion) | (N/A) | The table version assumed for streaming read<br/>`Config Param: STREAMING_READ_TABLE_VERSION` |
-| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode<br/>`Config Param: READ_PRE_COMBINE_FIELD` |
 | [hoodie.datasource.query.type](#hoodiedatasourcequerytype) | snapshot | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files)<br/>`Config Param: QUERY_TYPE` |
 
 ---
@@ -111,7 +109,7 @@ inputDF.write()
 .options(clientOpts) // any of the Hudi client opts can be passed in as well
 .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
 .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
-.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+.option(HoodieTableConfig.ORDERING_FIELDS(), "timestamp")
 .option(HoodieWriteConfig.TABLE_NAME, tableName)
 .mode(SaveMode.Append)
 .save(basePath);
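
With the read-side precombine option removed above, event-time semantics are now driven at write time by the merge mode plus the ordering fields. A hedged sketch of that combination, using only key strings documented on this page; the class, column names ("uuid", "ts"), and table name are illustrative:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public final class EventTimeOrderingExample {
  /** Upsert a batch so that, per record key, the row with the largest "ts" wins. */
  public static void upsertWithEventTimeOrdering(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.operation", "upsert")
        // Merge on event time rather than commit time...
        .option("hoodie.write.record.merge.mode", "EVENT_TIME_ORDERING")
        // ...and name the event-time column the comparison uses.
        .option("hoodie.table.ordering.fields", "ts")
        .option("hoodie.table.name", "events")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```
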
@@ -126,24 +124,22 @@ Options useful for writing tables via `write.format.option(...)`
 
 [**Basic Configs**](#Write-Options-basic-configs)
 
-| Config Name | Default | Description |
-| ------------------------------------------------------------------ | ----------------------------- | -------------------------------------------------------------------------------------- |
-| [hoodie.datasource.hive_sync.mode](#hoodiedatasourcehive_syncmode) | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.<br/>`Config Param: HIVE_SYNC_MODE` |
-| [hoodie.datasource.write.partitionpath.field](#hoodiedatasourcewritepartitionpathfield) | (N/A) | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()<br/>`Config Param: PARTITIONPATH_FIELD` |
-| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode<br/>`Config Param: ORDERING_FIELDS` |
-| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode<br/>`Config Param: PRECOMBINE_FIELD` |
-| [hoodie.datasource.write.recordkey.field](#hoodiedatasourcewriterecordkeyfield) | (N/A) | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`<br/>`Config Param: RECORDKEY_FIELD` |
-| [hoodie.datasource.write.secondarykey.column](#hoodiedatasourcewritesecondarykeycolumn) | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`<br/>`Config Param: SECONDARYKEY_COLUMN_NAME` |
-| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.<br/>`Config Param: RECORD_MERGE_MODE`<br/>`Since Version: 1.0.0` |
-| [hoodie.clustering.async.enabled](#hoodieclusteringasyncenabled) | false | Enable running of clustering service, asynchronously as inserts happen on the table.<br/>`Config Param: ASYNC_CLUSTERING_ENABLE`<br/>`Since Version: 0.7.0` |
-| [hoodie.clustering.inline](#hoodieclusteringinline) | false | Turn on inline clustering - clustering will be run after each write operation is complete<br/>`Config Param: INLINE_CLUSTERING_ENABLE`<br/>`Since Version: 0.7.0` |
-| [hoodie.datasource.hive_sync.enable](#hoodiedatasourcehive_syncenable) | false | When set to true, register/sync the table to Apache Hive metastore.<br/>`Config Param: HIVE_SYNC_ENABLED` |
-| [hoodie.datasource.hive_sync.jdbcurl](#hoodiedatasourcehive_syncjdbcurl) | jdbc:hive2://localhost:10000 | Hive metastore url<br/>`Config Param: HIVE_URL` |
-| [hoodie.datasource.hive_sync.metastore.uris](#hoodiedatasourcehive_syncmetastoreuris) | thrift://localhost:9083 | Hive metastore url<br/>`Config Param: METASTORE_URIS` |
-| [hoodie.datasource.meta.sync.enable](#hoodiedatasourcemetasyncenable) | false | Enable Syncing the Hudi Table with an external meta store or data catalog.<br/>`Config Param: META_SYNC_ENABLED` |
-| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br/>`Config Param: HIVE_STYLE_PARTITIONING` |
-| [hoodie.datasource.write.operation](#hoodiedatasourcewriteoperation) | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it.<br/>`Config Param: OPERATION` |
-| [hoodie.datasource.write.table.type](#hoodiedatasourcewritetabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br/>`Config Param: TABLE_TYPE` |
+| Config Name | Default | Description |
+| ------------------------------------------------------------------ | ----------------------------- |------------------------------------------------------------------------------------------|
+| [hoodie.datasource.hive_sync.mode](#hoodiedatasourcehive_syncmode) | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.<br/>`Config Param: HIVE_SYNC_MODE` |
+| [hoodie.datasource.write.partitionpath.field](#hoodiedatasourcewritepartitionpathfield) | (N/A) | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()<br/>`Config Param: PARTITIONPATH_FIELD` |
+| [hoodie.datasource.write.recordkey.field](#hoodiedatasourcewriterecordkeyfield) | (N/A) | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`<br/>`Config Param: RECORDKEY_FIELD` |
+| [hoodie.datasource.write.secondarykey.column](#hoodiedatasourcewritesecondarykeycolumn) | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`<br/>`Config Param: SECONDARYKEY_COLUMN_NAME` |
+| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.<br/>`Config Param: RECORD_MERGE_MODE`<br/>`Since Version: 1.0.0` |
+| [hoodie.clustering.async.enabled](#hoodieclusteringasyncenabled) | false | Enable running of the clustering service asynchronously as inserts happen on the table.<br/>`Config Param: ASYNC_CLUSTERING_ENABLE`<br/>`Since Version: 0.7.0` |
+| [hoodie.clustering.inline](#hoodieclusteringinline) | false | Turn on inline clustering: clustering is run after each write operation completes.<br/>`Config Param: INLINE_CLUSTERING_ENABLE`<br/>`Since Version: 0.7.0` |
+| [hoodie.datasource.hive_sync.enable](#hoodiedatasourcehive_syncenable) | false | When set to true, register/sync the table to the Apache Hive metastore.<br/>`Config Param: HIVE_SYNC_ENABLED` |
+| [hoodie.datasource.hive_sync.jdbcurl](#hoodiedatasourcehive_syncjdbcurl) | jdbc:hive2://localhost:10000 | Hive metastore url<br/>`Config Param: HIVE_URL` |
+| [hoodie.datasource.hive_sync.metastore.uris](#hoodiedatasourcehive_syncmetastoreuris) | thrift://localhost:9083 | Hive metastore url<br/>`Config Param: METASTORE_URIS` |
+| [hoodie.datasource.meta.sync.enable](#hoodiedatasourcemetasyncenable) | false | Enable syncing the Hudi table with an external metastore or data catalog.<br/>`Config Param: META_SYNC_ENABLED` |
+| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values).<br/>`Config Param: HIVE_STYLE_PARTITIONING` |
+| [hoodie.datasource.write.operation](#hoodiedatasourcewriteoperation) | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and thereafter use upsert/insert. Bulk insert uses a disk-based write path that scales to large inputs without needing to cache them.<br/>`Config Param: OPERATION` |
+| [hoodie.datasource.write.table.type](#hoodiedatasourcewritetabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.<br/>`Config Param: TABLE_TYPE` |
`Config Param: BASE_PATH` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: PRECOMBINE_FIELD_NAME` | -| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be same across runs.
`Config Param: TBL_NAME` | -| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.fail.job.on.duplicate.data.file.detection](#hoodiefailjobonduplicatedatafiledetection) | false | If config is enabled, entire job is failed on invalid file detection
`Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION` | -| [hoodie.write.auto.upgrade](#hoodiewriteautoupgrade) | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.
`Config Param: AUTO_UPGRADE_VERSION`
`Since Version: 1.0.0` | +| Config Name | Default | Description | +|------------------------------------------------------------------------------------------------| -------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.base.path](#hoodiebasepath) | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
+| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used in record merging comparison. By default, when two records have the same key value, the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison moves to the second field, and so on.<br/>`Config Param: ORDERING_FIELDS` |
+| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be the same across runs.<br/>`Config Param: TBL_NAME` |
+| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.<br/>`Config Param: RECORD_MERGE_MODE`<br/>`Since Version: 1.0.0` |
+| [hoodie.fail.job.on.duplicate.data.file.detection](#hoodiefailjobonduplicatedatafiledetection) | false | If enabled, the entire job fails on invalid file detection.<br/>`Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION` |
+| [hoodie.write.auto.upgrade](#hoodiewriteautoupgrade) | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.<br/>`Config Param: AUTO_UPGRADE_VERSION`<br/>`Since Version: 1.0.0` |
 | [hoodie.write.concurrency.mode](#hoodiewriteconcurrencymode) | SINGLE_WRITER | org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER(default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group. NON_BLOCKING_CONCURRENCY_CONTROL: Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor.<br/>`Config Param: WRITE_CONCURRENCY_MODE` |
-| [hoodie.write.table.version](#hoodiewritetableversion) | 9 | The table version this writer is storing the table in. This should match the current table version.<br/>`Config Param: WRITE_TABLE_VERSION`<br/>`Since Version: 1.0.0` |
+| [hoodie.write.table.version](#hoodiewritetableversion) | 9 | The table version this writer is storing the table in. This should match the current table version.<br/>`Config Param: WRITE_TABLE_VERSION`<br/>`Since Version: 1.0.0` |
 
 ---
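
Several of the write options above are commonly combined. A hedged sketch, again using only key strings documented in this diff; the table type, column names ("uuid", "region"), and table name are illustrative, not prescribed by this change:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public final class MorWriteOptionsExample {
  /** Write a MERGE_ON_READ table with Hive-style partition folders and inline clustering. */
  public static void writeMorTable(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.partitionpath.field", "region")
        // Partition folders become region=<value> instead of just <value>.
        .option("hoodie.datasource.write.hive_style_partitioning", "true")
        // Run clustering inline after each write completes.
        .option("hoodie.clustering.inline", "true")
        .option("hoodie.table.name", "orders")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```

Note that per the table above, the table type can't change between writes, so it should be chosen when the table is first created.
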
diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index 9e62c2a90ddfd..6b34689f6d604 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -54,48 +54,47 @@ Configurations of the Hudi Table like type of ingestion, storage formats, hive t
 
 [**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs)
 
-| Config Name | Default | Description |
-| ------------------------------------------------------------------ | ------------------------------- | ------------------------------------------------------------------------------------ |
-| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br/>`Config Param: BOOTSTRAP_BASE_PATH` |
-| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | (N/A) | Payload class to use for performing merges, compactions, i.e merge delta logs with current base file and then produce a new base file.<br/>`Config Param: PAYLOAD_CLASS_NAME` |
-| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database<br/>`Config Param: DATABASE_NAME` |
-| [hoodie.record.merge.mode](#hoodierecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.<br/>`Config Param: RECORD_MERGE_MODE`<br/>`Since Version: 1.0.0` |
-| [hoodie.record.merge.strategy.id](#hoodierecordmergestrategyid) | (N/A) | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` which has the same merger strategy id<br/>`Config Param: RECORD_MERGE_STRATEGY_ID`<br/>`Since Version: 0.13.0` |
-| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.<br/>`Config Param: TABLE_CHECKSUM`<br/>`Since Version: 0.11.0` |
-| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table<br/>`Config Param: CREATE_SCHEMA` |
-| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Relative path to table base path where the index definitions are stored<br/>`Config Param: RELATIVE_INDEX_DEFINITION_PATH`<br/>`Since Version: 1.0.0` |
-| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table<br/>`Config Param: KEY_GENERATOR_CLASS_NAME` |
-| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key Generator type to determine key generator class<br/>`Config Param: KEY_GENERATOR_TYPE`<br/>`Since Version: 1.0.0` |
-| [hoodie.table.legacy.payload.class](#hoodietablelegacypayloadclass) | (N/A) | Payload class to indicate the payload class that is used to create the table and is not used anymore.<br/>`Config Param: LEGACY_PAYLOAD_CLASS_NAME`<br/>`Since Version: 1.1.0` |
-| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers<br/>`Config Param: TABLE_METADATA_PARTITIONS`<br/>`Since Version: 0.11.0` |
-| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.<br/>`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`<br/>`Since Version: 0.11.0` |
-| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.<br/>`Config Param: NAME` |
-| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma separated fields used in records merging comparison. By default, when two records have the same key value, the largest value for the ordering field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.<br/>`Config Param: ORDERING_FIELDS` |
-| [hoodie.table.partial.update.mode](#hoodietablepartialupdatemode) | (N/A) | This property when set, will define how two versions of the record will be merged together when records are partially formed<br/>`Config Param: PARTIAL_UPDATE_MODE`<br/>`Since Version: 1.1.0` |
-| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Comma separated field names used to partition the table. These field names also include the partition type which is used by custom key generators<br/>`Config Param: PARTITION_FIELDS` |
-| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Comma separated fields used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.<br/>`Config Param: PRECOMBINE_FIELD` |
-| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br/>`Config Param: RECORDKEY_FIELDS` |
-| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes<br/>`Config Param: SECONDARY_INDEXES_METADATA`<br/>`Since Version: 0.13.0` |
-| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.<br/>`Config Param: TIMELINE_LAYOUT_VERSION` |
-| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.<br/>`Config Param: ARCHIVELOG_FOLDER` |
-| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br/>`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` |
-| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.<br/>`Config Param: BOOTSTRAP_INDEX_ENABLE` |
-| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type determines which implementation to use, for mapping base files to bootstrap base file, that contain actual data.<br/>`Config Param: BOOTSTRAP_INDEX_TYPE`<br/>`Since Version: 1.0.0` |
-| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)<br/>`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` |
-| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.<br/>`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` |
-| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing<br/>`Config Param: POPULATE_META_FIELDS` |
-| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.<br/>`Config Param: BASE_FILE_FORMAT` |
-| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.<br/>`Config Param: CDC_ENABLED`<br/>`Since Version: 0.13.0` |
+| Config Name | Default | Description |
+| ------------------------------------------------------------------ | ------------------------------- |--------------------------------------------------------------------------------------|
+| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table<br/>`Config Param: BOOTSTRAP_BASE_PATH` |
`Config Param: BOOTSTRAP_BASE_PATH` | +| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | (N/A) | Payload class to use for performing merges, compactions, i.e merge delta logs with current base file and then produce a new base file.
`Config Param: PAYLOAD_CLASS_NAME` | +| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database
`Config Param: DATABASE_NAME` | +| [hoodie.record.merge.mode](#hoodierecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| [hoodie.record.merge.strategy.id](#hoodierecordmergestrategyid) | (N/A) | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` which has the same merger strategy id
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0` | +| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.
`Config Param: TABLE_CHECKSUM`
`Since Version: 0.11.0` | +| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table
`Config Param: CREATE_SCHEMA` | +| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Relative path to table base path where the index definitions are stored
`Config Param: RELATIVE_INDEX_DEFINITION_PATH`
`Since Version: 1.0.0` | +| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table
`Config Param: KEY_GENERATOR_CLASS_NAME` | +| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key Generator type to determine key generator class
`Config Param: KEY_GENERATOR_TYPE`
`Since Version: 1.0.0` |
+| [hoodie.table.legacy.payload.class](#hoodietablelegacypayloadclass) | (N/A) | Records the payload class that was used to create the table and is no longer in use.<br>
`Config Param: LEGACY_PAYLOAD_CLASS_NAME`
`Since Version: 1.1.0` |
+| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and are in sync with the data table. These partitions are ready for use by readers.<br>
`Config Param: TABLE_METADATA_PARTITIONS`
`Since Version: 0.11.0` | +| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.
`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`
`Since Version: 0.11.0` |
+| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be the same across runs.<br>
`Config Param: NAME` |
+| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used in record merging comparison. By default, when two records have the same key value, the one with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br>
`Config Param: ORDERING_FIELDS` |
+| [hoodie.table.partial.update.mode](#hoodietablepartialupdatemode) | (N/A) | When set, this property defines how two versions of the record are merged together when records are partially formed<br>
`Config Param: PARTIAL_UPDATE_MODE`
`Since Version: 1.1.0` |
+| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Comma-separated field names used to partition the table. These field names also include the partition type, which is used by custom key generators<br>
`Config Param: PARTITION_FIELDS` |
+| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify records in the table. Concatenated values of these fields are used as the record key component of HoodieKey.<br>
`Config Param: RECORDKEY_FIELDS` | +| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes
`Config Param: SECONDARY_INDEXES_METADATA`
`Since Version: 0.13.0` |
+| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of the timeline used by the table.<br>
`Config Param: TIMELINE_LAYOUT_VERSION` |
+| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | Path under the meta folder to store archived timeline instants.<br>
`Config Param: ARCHIVELOG_FOLDER` |
+| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use for mapping base files to the bootstrap base files that contain the actual data.<br>
`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` |
+| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not this is a bootstrapped table, with bootstrap base data and a mapping index defined; default true.<br>
`Config Param: BOOTSTRAP_INDEX_ENABLE` |
+| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type determines which implementation to use for mapping base files to the bootstrap base files that contain the actual data.<br>
`Config Param: BOOTSTRAP_INDEX_TYPE`
`Since Version: 1.0.0` |
+| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values).<br>
`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` |
+| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base files for this dataset (e.g., Parquet / ORC). If false (default), partition metafiles are saved as properties files.<br>
`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` |
+| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append-only/immutable data in batch processing.<br>
`Config Param: POPULATE_META_FIELDS` | +| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.
`Config Param: BASE_FILE_FORMAT` |
+| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enabled, persists the change data if necessary, so the table can be queried in CDC query mode.<br>
`Config Param: CDC_ENABLED`
`Since Version: 0.13.0` | | [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and after image. DATA_BEFORE: Keeping the before images in the supplemental logs, so the reader needs to figure out the update after images. DATA_BEFORE_AFTER(default): Keeping the before and after images in the supplemental logs, so the reader can generate the details directly from the logs.
`Config Param: CDC_SUPPLEMENTAL_LOGGING_MODE`
`Since Version: 0.13.0` | -| [hoodie.table.format](#hoodietableformat) | native | Table format name used when writing to the table.
`Config Param: TABLE_FORMAT` | -| [hoodie.table.initial.version](#hoodietableinitialversion) | NINE | Initial Version of table when the table was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially setup.
`Config Param: INITIAL_VERSION`
`Since Version: 1.0.0` | -| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.
`Config Param: LOG_FILE_FORMAT` | -| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.
`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`
`Since Version: 1.0.0` | -| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default
`Config Param: TIMELINE_TIMEZONE` | -| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data.
`Config Param: TYPE` | -| [hoodie.table.version](#hoodietableversion) | NINE | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.
`Config Param: VERSION` | -| [hoodie.timeline.history.path](#hoodietimelinehistorypath) | history | path under the meta folder, to store timeline history at.
`Config Param: TIMELINE_HISTORY_PATH` | -| [hoodie.timeline.path](#hoodietimelinepath) | timeline | path under the meta folder, to store timeline instants at.
`Config Param: TIMELINE_PATH` | +| [hoodie.table.format](#hoodietableformat) | native | Table format name used when writing to the table.
`Config Param: TABLE_FORMAT` |
+| [hoodie.table.initial.version](#hoodietableinitialversion) | NINE | Initial version of the table when it was created. Used during upgrade/downgrade to identify which upgrade/downgrade paths happened on the table. This is only configured when the table is initially set up.<br>
`Config Param: INITIAL_VERSION`
`Since Version: 1.0.0` | +| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.
`Config Param: LOG_FILE_FORMAT` | +| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.
`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`
`Since Version: 1.0.0` |
+| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set the hoodie commit timeline timezone, such as UTC, LOCAL and so on; LOCAL is the default.<br>
`Config Param: TIMELINE_TIMEZONE` | +| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data.
`Config Param: TYPE` |
+| [hoodie.table.version](#hoodietableversion) | NINE | Version of the table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards-compatible changes.<br>
`Config Param: VERSION` |
+| [hoodie.timeline.history.path](#hoodietimelinehistorypath) | history | Path under the meta folder to store timeline history.<br>
`Config Param: TIMELINE_HISTORY_PATH` |
+| [hoodie.timeline.path](#hoodietimelinepath) | timeline | Path under the meta folder to store timeline instants.<br>
`Config Param: TIMELINE_PATH` |

[**Advanced Configs**](#Hudi-Table-Basic-Configs-advanced-configs)
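To make the multi-field ordering semantics above concrete, here is a minimal write sketch; the table name, `seq_no` column, and paths are hypothetical, not taken from this PR. With `hoodie.table.ordering.fields` set to two fields, records sharing a key are compared on `ts` first, and `seq_no` only breaks ties.

```scala
// Hypothetical illustration of multi-field ordering; assumes a DataFrame `df`
// with columns uuid, ts, seq_no. Writing persists the ordering fields into the
// table's hoodie.properties as hoodie.table.ordering.fields.
df.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.table.ordering.fields", "ts,seq_no"). // compare ts first, then seq_no
  mode("append").
  save(basePath)
```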
@@ -125,7 +124,6 @@ Options useful for reading tables via `read.format.option(...)`

| [hoodie.datasource.read.end.instanttime](#hoodiedatasourcereadendinstanttime) | (N/A) | Used when `hoodie.datasource.query.type` is set to `incremental`. Represents the completion time to limit incrementally fetched data to. When not specified latest commit completion time from timeline is assumed by default. When specified, new data written with completion_time <= END_COMMIT are fetched out. Point in time type queries make more sense with begin and end completion times specified.<br>
`Config Param: END_COMMIT` | | [hoodie.datasource.read.incr.table.version](#hoodiedatasourcereadincrtableversion) | (N/A) | The table version assumed for incremental read
`Config Param: INCREMENTAL_READ_TABLE_VERSION` | | [hoodie.datasource.read.streaming.table.version](#hoodiedatasourcereadstreamingtableversion) | (N/A) | The table version assumed for streaming read
`Config Param: STREAMING_READ_TABLE_VERSION` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: READ_PRE_COMBINE_FIELD` |
| [hoodie.datasource.query.type](#hoodiedatasourcequerytype) | snapshot | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files)<br>
`Config Param: QUERY_TYPE` |

[**Advanced Configs**](#Read-Options-advanced-configs)
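As a quick illustration of these read options, the sketch below switches from the default `snapshot` mode to an `incremental` read bounded by begin and end completion times; the instant times and `basePath` are placeholders.

```scala
// Hypothetical incremental read; only rows written within the given
// completion-time window are returned.
val incrDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20250101000000000").
  option("hoodie.datasource.read.end.instanttime", "20250102000000000").
  load(basePath)
```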
@@ -169,7 +167,7 @@ inputDF.write()
.options(clientOpts) // any of the Hudi client opts can be passed in as well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
-.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+.option(HoodieTableConfig.ORDERING_FIELDS(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
@@ -183,24 +181,23 @@ Options useful for writing tables via `write.format.option(...)`

[**Basic Configs**](#Write-Options-basic-configs)

-| Config Name | Default | Description |
-| ------------------------------------------------------------------------------------------------ | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| [hoodie.datasource.hive_sync.mode](#hoodiedatasourcehive_syncmode) | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.<br>
`Config Param: HIVE_SYNC_MODE` | -| [hoodie.datasource.write.partitionpath.field](#hoodiedatasourcewritepartitionpathfield) | (N/A) | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
`Config Param: PARTITIONPATH_FIELD` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: ORDERING_FIELDS` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: PRECOMBINE_FIELD` | -| [hoodie.datasource.write.recordkey.field](#hoodiedatasourcewriterecordkeyfield) | (N/A) | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`
`Config Param: RECORDKEY_FIELD` | -| [hoodie.datasource.write.secondarykey.column](#hoodiedatasourcewritesecondarykeycolumn) | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`
`Config Param: SECONDARYKEY_COLUMN_NAME` | -| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.clustering.async.enabled](#hoodieclusteringasyncenabled) | false | Enable running of clustering service, asynchronously as inserts happen on the table.
`Config Param: ASYNC_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | -| [hoodie.clustering.inline](#hoodieclusteringinline) | false | Turn on inline clustering - clustering will be run after each write operation is complete
`Config Param: INLINE_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | -| [hoodie.datasource.hive_sync.enable](#hoodiedatasourcehive_syncenable) | false | When set to true, register/sync the table to Apache Hive metastore.
`Config Param: HIVE_SYNC_ENABLED` | -| [hoodie.datasource.hive_sync.jdbcurl](#hoodiedatasourcehive_syncjdbcurl) | jdbc:hive2://localhost:10000 | Hive metastore url
`Config Param: HIVE_URL` | -| [hoodie.datasource.hive_sync.metastore.uris](#hoodiedatasourcehive_syncmetastoreuris) | thrift://localhost:9083 | Hive metastore url
`Config Param: METASTORE_URIS` | -| [hoodie.datasource.meta.sync.enable](#hoodiedatasourcemetasyncenable) | false | Enable Syncing the Hudi Table with an external meta store or data catalog.
`Config Param: META_SYNC_ENABLED` | -| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
`Config Param: HIVE_STYLE_PARTITIONING` | -| [hoodie.datasource.write.operation](#hoodiedatasourcewriteoperation) | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it.
`Config Param: OPERATION` | -| [hoodie.datasource.write.table.type](#hoodiedatasourcewritetabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.
`Config Param: TABLE_TYPE` | +| Config Name | Default | Description | +| ------------------------------------------------------------------------------------------------ | ----------------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.datasource.hive_sync.mode](#hoodiedatasourcehive_syncmode) | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
`Config Param: HIVE_SYNC_MODE` |
+| [hoodie.datasource.write.partitionpath.field](#hoodiedatasourcewritepartitionpathfield) | (N/A) | Partition path field. Value to be used as the partitionPath component of HoodieKey. Actual value obtained by invoking .toString().<br>
`Config Param: PARTITIONPATH_FIELD` |
+| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used in record merging comparison. By default, when two records have the same key value, the one with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br>
`Config Param: ORDERING_FIELDS` |
+| [hoodie.datasource.write.recordkey.field](#hoodiedatasourcewriterecordkeyfield) | (N/A) | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g., `a.b.c`<br>
`Config Param: RECORDKEY_FIELD` |
+| [hoodie.datasource.write.secondarykey.column](#hoodiedatasourcewritesecondarykeycolumn) | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g., `a.b.c`<br>
`Config Param: SECONDARYKEY_COLUMN_NAME` |
+| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates. COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from the later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.<br>
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` |
+| [hoodie.clustering.async.enabled](#hoodieclusteringasyncenabled) | false | Enable running of the clustering service asynchronously as inserts happen on the table.<br>
`Config Param: ASYNC_CLUSTERING_ENABLE`
`Since Version: 0.7.0` |
+| [hoodie.clustering.inline](#hoodieclusteringinline) | false | Turn on inline clustering: clustering will be run after each write operation completes<br>
`Config Param: INLINE_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | +| [hoodie.datasource.hive_sync.enable](#hoodiedatasourcehive_syncenable) | false | When set to true, register/sync the table to Apache Hive metastore.
`Config Param: HIVE_SYNC_ENABLED` |
+| [hoodie.datasource.hive_sync.jdbcurl](#hoodiedatasourcehive_syncjdbcurl) | jdbc:hive2://localhost:10000 | Hive JDBC URL<br>
`Config Param: HIVE_URL` |
+| [hoodie.datasource.hive_sync.metastore.uris](#hoodiedatasourcehive_syncmetastoreuris) | thrift://localhost:9083 | Hive metastore URIs<br>
`Config Param: METASTORE_URIS` |
+| [hoodie.datasource.meta.sync.enable](#hoodiedatasourcemetasyncenable) | false | Enable syncing the Hudi table with an external metastore or data catalog.<br>
`Config Param: META_SYNC_ENABLED` |
+| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values).<br>
`Config Param: HIVE_STYLE_PARTITIONING` |
+| [hoodie.datasource.write.operation](#hoodiedatasourcewriteoperation) | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and thereafter use upsert/insert. Bulk insert uses a disk-based write path to scale to large inputs without needing to cache them.<br>
`Config Param: OPERATION` |
+| [hoodie.datasource.write.table.type](#hoodiedatasourcewritetabletype) | COPY_ON_WRITE | The table type for the underlying data for this write. This can’t change between writes.<br>
`Config Param: TABLE_TYPE` |

[**Advanced Configs**](#Write-Options-advanced-configs)
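Putting the basic write options together, a minimal upsert might look like the sketch below; the table, field names, and `basePath` are placeholders, not a prescription from this PR.

```scala
// Hypothetical upsert using the basic write options documented above.
inputDF.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "city").
  option("hoodie.table.ordering.fields", "ts"). // replaces the old precombine option
  option("hoodie.datasource.write.operation", "upsert").
  mode("append").
  save(basePath)
```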
@@ -958,16 +955,16 @@ Configurations that control write behavior on Hudi tables. These can be directly

[**Basic Configs**](#Write-Configurations-basic-configs)

-| Config Name | Default | Description |
-| ---------------------------------------------------------------------------------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| [hoodie.base.path](#hoodiebasepath) | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.<br>
`Config Param: BASE_PATH` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: PRECOMBINE_FIELD_NAME` | -| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be same across runs.
`Config Param: TBL_NAME` | -| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.fail.job.on.duplicate.data.file.detection](#hoodiefailjobonduplicatedatafiledetection) | false | If config is enabled, entire job is failed on invalid file detection
`Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION` | -| [hoodie.write.auto.upgrade](#hoodiewriteautoupgrade) | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.
`Config Param: AUTO_UPGRADE_VERSION`
`Since Version: 1.0.0` | +| Config Name | Default | Description | +| ---------------------------------------------------------------------------------------------- | -------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.base.path](#hoodiebasepath) | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
`Config Param: BASE_PATH` |
+| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used in record merging comparison. By default, when two records have the same key value, the one with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br>
`Config Param: ORDERING_FIELDS` |
+| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be the same across runs.<br>
`Config Param: TBL_NAME` |
+| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates. COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from the later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.<br>
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` |
+| [hoodie.fail.job.on.duplicate.data.file.detection](#hoodiefailjobonduplicatedatafiledetection) | false | If enabled, the entire job fails when an invalid file is detected.<br>
`Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION` | +| [hoodie.write.auto.upgrade](#hoodiewriteautoupgrade) | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.
`Config Param: AUTO_UPGRADE_VERSION`
`Since Version: 1.0.0` | | [hoodie.write.concurrency.mode](#hoodiewriteconcurrencymode) | SINGLE_WRITER | org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER(default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group. NON_BLOCKING_CONCURRENCY_CONTROL: Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor.
`Config Param: WRITE_CONCURRENCY_MODE` | -| [hoodie.write.table.version](#hoodiewritetableversion) | 9 | The table version this writer is storing the table in. This should match the current table version.
`Config Param: WRITE_TABLE_VERSION`
`Since Version: 1.0.0` | +| [hoodie.write.table.version](#hoodiewritetableversion) | 9 | The table version this writer is storing the table in. This should match the current table version.
`Config Param: WRITE_TABLE_VERSION`
diff --git a/website/docs/quick-start-guide.md b/website/docs/quick-start-guide.md
index 8c05a9af0452a..360b4c172b13d 100644
--- a/website/docs/quick-start-guide.md
+++ b/website/docs/quick-start-guide.md
@@ -1240,7 +1240,7 @@ CREATE TABLE hudi_table (
 driver STRING,
 fare DOUBLE,
 city STRING
-) USING HUDI TBLPROPERTIES (preCombineField = 'ts')
+) USING HUDI TBLPROPERTIES (orderingFields = 'ts')
PARTITIONED BY (city);
```
COMMIT_TIME_ORDERING (when ordering field is not set) | Determines the logic of merging different records with the same record key. Valid values: (1) `COMMIT_TIME_ORDERING`: use commit time to merge records, i.e., the record from later commit overwrites the earlier record with the same key. (2) `EVENT_TIME_ORDERING`: use event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of commit time. The event time or preCombine field needs to be specified by the user. This is the default when an ordering field is configured. (3) `CUSTOM`: use custom merging logic specified by the user.<br>
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| hoodie.write.record.merge.strategy.id | N/A (Optional) | ID of record merge strategy. Hudi will pick `HoodieRecordMerger` implementations from `hoodie.write.record.merge.custom.implementation.classes` that have the same merge strategy ID. When using custom merge logic, you need to specify both this config and `hoodie.write.record.merge.custom.implementation.classes`.
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0`
`Alternative: hoodie.datasource.write.record.merger.strategy` (deprecated) | -| hoodie.write.record.merge.custom.implementation.classes | N/A (Optional) | List of `HoodieRecordMerger` implementations constituting Hudi's merging strategy based on the engine used. Hudi selects the first implementation from this list that matches the following criteria: (1) has the same merge strategy ID as specified in `hoodie.write.record.merge.strategy.id` (if provided), (2) is compatible with the execution engine (e.g., SPARK merger for Spark, FLINK merger for Flink, AVRO for Java/Hive). The order in the list matters - place your preferred implementation first. Engine-specific implementations (SPARK, FLINK) are more efficient as they avoid Avro serialization/deserialization overhead.
`Config Param: RECORD_MERGE_IMPL_CLASSES`
`Since Version: 0.13.0`
`Alternative: hoodie.datasource.write.record.merger.impls` (deprecated) | +| Config Name | Default | Description | +|---------------------------------------------------------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| hoodie.write.record.merge.mode | EVENT_TIME_ORDERING (when ordering field is set)
COMMIT_TIME_ORDERING (when ordering field is not set) | Determines the logic of merging different records with the same record key. Valid values: (1) `COMMIT_TIME_ORDERING`: use commit time to merge records, i.e., the record from the later commit overwrites the earlier record with the same key. (2) `EVENT_TIME_ORDERING`: use event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of commit time. The event time or ordering fields need to be specified by the user. This is the default when an ordering field is configured. (3) `CUSTOM`: use custom merging logic specified by the user.<br>
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| hoodie.write.record.merge.strategy.id | N/A (Optional) | ID of record merge strategy. Hudi will pick `HoodieRecordMerger` implementations from `hoodie.write.record.merge.custom.implementation.classes` that have the same merge strategy ID. When using custom merge logic, you need to specify both this config and `hoodie.write.record.merge.custom.implementation.classes`.
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0`
`Alternative: hoodie.datasource.write.record.merger.strategy` (deprecated) | +| hoodie.write.record.merge.custom.implementation.classes | N/A (Optional) | List of `HoodieRecordMerger` implementations constituting Hudi's merging strategy based on the engine used. Hudi selects the first implementation from this list that matches the following criteria: (1) has the same merge strategy ID as specified in `hoodie.write.record.merge.strategy.id` (if provided), (2) is compatible with the execution engine (e.g., SPARK merger for Spark, FLINK merger for Flink, AVRO for Java/Hive). The order in the list matters - place your preferred implementation first. Engine-specific implementations (SPARK, FLINK) are more efficient as they avoid Avro serialization/deserialization overhead.
`Config Param: RECORD_MERGE_IMPL_CLASSES`
`Since Version: 0.13.0`
`Alternative: hoodie.datasource.write.record.merger.impls` (deprecated) | ## Record Payloads (deprecated) diff --git a/website/docs/sql_ddl.md b/website/docs/sql_ddl.md index 1b06225c57e02..d1c5ba865bdb6 100644 --- a/website/docs/sql_ddl.md +++ b/website/docs/sql_ddl.md @@ -77,7 +77,7 @@ should be specified as `PARTITIONED BY (dt, hh)`. As discussed [here](quick-start-guide.md#keys), tables track each record in the table using a record key. Hudi auto-generated a highly compressed key for each new record in the examples so far. If you want to use an existing field as the key, you can set the `primaryKey` option. -Typically, this is also accompanied by configuring ordering fields (via `preCombineField` option) to deal with out-of-order data and potential +Typically, this is also accompanied by configuring ordering fields (via `orderingFields` option) to deal with out-of-order data and potential duplicate records with the same key in the incoming writes. :::note @@ -86,7 +86,7 @@ this materializes a composite key of the two fields, which can be useful for exp ::: Here is an example of creating a table using both options. Typically, a field that denotes the time of the event or -fact, e.g., order creation time, event generation time etc., is used as the ordering field (via `preCombineField`). Hudi resolves multiple versions +fact, e.g., order creation time, event generation time etc., is used as the ordering field (via `orderingFields`). Hudi resolves multiple versions of the same record by ordering based on this field when queries are run on the table. ```sql @@ -99,7 +99,7 @@ CREATE TABLE IF NOT EXISTS hudi_table_keyed ( TBLPROPERTIES ( type = 'cow', primaryKey = 'id', - preCombineField = 'ts' + orderingFields = 'ts' ); ``` @@ -118,13 +118,13 @@ CREATE TABLE IF NOT EXISTS hudi_table_merge_mode ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - precombineField = 'ts', + orderingFields = 'ts', recordMergeMode = 'EVENT_TIME_ORDERING' ) LOCATION 'file:///tmp/hudi_table_merge_mode/'; ``` -With `EVENT_TIME_ORDERING`, the record with the larger event time (specified via `precombineField` ordering field) overwrites the record with the +With `EVENT_TIME_ORDERING`, the record with the larger event time (specified via `orderingFields`) overwrites the record with the smaller event time on the same key, regardless of transaction's commit time. Users can set `CUSTOM` mode to provide their own merge logic. With `CUSTOM` merge mode, you can provide a custom class that implements the merge logic. The interfaces to implement is explained in detail [here](record_merger.md#custom). @@ -139,7 +139,7 @@ CREATE TABLE IF NOT EXISTS hudi_table_merge_mode_custom ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - precombineField = 'ts', + orderingFields = 'ts', recordMergeMode = 'CUSTOM', 'hoodie.record.merge.strategy.id' = '' ) @@ -177,7 +177,7 @@ CREATE TABLE hudi_table_ctas USING hudi TBLPROPERTIES ( type = 'cow', - preCombineField = 'ts' + orderingFields = 'ts' ) PARTITIONED BY (dt) AS SELECT * FROM parquet_table; @@ -196,7 +196,7 @@ CREATE TABLE hudi_table_ctas USING hudi TBLPROPERTIES ( type = 'cow', - preCombineField = 'ts' + orderingFields = 'ts' ) AS SELECT * FROM parquet_table; ``` @@ -579,10 +579,10 @@ Users can set table properties while creating a table. 
The important table prope |------------------|--------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | cow | The table type to create. `type = 'cow'` creates a COPY-ON-WRITE table, while `type = 'mor'` creates a MERGE-ON-READ table. Same as `hoodie.datasource.write.table.type`. More details can be found [here](table_types.md) | | primaryKey | uuid | The primary key field names of the table separated by commas. Same as `hoodie.datasource.write.recordkey.field`. If this config is ignored, hudi will auto-generate primary keys. If explicitly set, primary key generation will honor user configuration. | -| preCombineField | | The ordering field(s) of the table. It is used for resolving the final version of the record among multiple versions. Generally, `event time` or another similar column will be used for ordering purposes. Hudi will be able to handle out-of-order data using the ordering field value. | +| orderingFields | | The ordering field(s) of the table. It is used for resolving the final version of the record among multiple versions. Generally, `event time` or another similar column will be used for ordering purposes. Hudi will be able to handle out-of-order data using the ordering field value. | :::note -`primaryKey`, `preCombineField`, and `type` and other properties are case-sensitive. +`primaryKey`, `orderingFields`, and `type` and other properties are case-sensitive. ::: #### Passing Lock Providers for Concurrent Writers @@ -833,7 +833,7 @@ WITH ( 'connector' = 'hudi', 'path' = 'file:///tmp/hudi_table', 'table.type' = 'MERGE_ON_READ', -'precombine.field' = 'ts' +'ordering.fields' = 'ts' ); ``` diff --git a/website/docs/sql_dml.md b/website/docs/sql_dml.md index 81ad924289f07..64002c02c60b1 100644 --- a/website/docs/sql_dml.md +++ b/website/docs/sql_dml.md @@ -51,7 +51,7 @@ INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name, 1000 :::note Mapping to write operations Hudi offers flexibility in choosing the underlying [write operation](write_operations.md) of a `INSERT INTO` statement using the `hoodie.spark.sql.insert.into.operation` configuration. Possible options include *"bulk_insert"* (large inserts), *"insert"* (with small file management), -and *"upsert"* (with deduplication/merging). If ordering fields are not set, *"insert"* is chosen as the default. For a table with ordering fields set (via `preCombineField`), +and *"upsert"* (with deduplication/merging). If ordering fields are not set, *"insert"* is chosen as the default. For a table with ordering fields set (via `orderingFields`), *"upsert"* is chosen as the default operation. ::: @@ -101,7 +101,7 @@ update hudi_cow_pt_tbl set ts = 1001 where name = 'a1'; ``` :::info -The `UPDATE` operation requires the specification of ordering fields (via `preCombineField`). +The `UPDATE` operation requires the specification of ordering fields (via `orderingFields`). ::: ### Merge Into @@ -138,7 +138,7 @@ For a Hudi table with user configured primary keys, the join condition and the ` For a table where Hudi auto generates primary keys, the join condition in `MERGE INTO` can be on any arbitrary data columns. 
-if the `hoodie.record.merge.mode` is set to `EVENT_TIME_ORDERING`, ordering fields (via `preCombineField`) are required to be set with value in the `UPDATE`/`INSERT` clause. +if the `hoodie.record.merge.mode` is set to `EVENT_TIME_ORDERING`, ordering fields (via `orderingFields`) are required to be set with value in the `UPDATE`/`INSERT` clause. It is enforced that if the target table has primary key and partition key column, the source table counterparts must enforce the same data type accordingly. Plus, if the target table is configured with `hoodie.record.merge.mode` = `EVENT_TIME_ORDERING` where target table is expected to have valid ordering fields configuration, the source table counterpart must also have the same data type. ::: @@ -148,7 +148,7 @@ Examples below ```sql -- source table using hudi for testing merging into non-partitioned table create table merge_source (id int, name string, price double, ts bigint) using hudi -tblproperties (primaryKey = 'id', preCombineField = 'ts'); +tblproperties (primaryKey = 'id', orderingFields = 'ts'); insert into merge_source values (1, "old_a1", 22.22, 900), (2, "new_a2", 33.33, 2000), (3, "new_a3", 44.44, 2000); merge into hudi_mor_tbl as target @@ -199,7 +199,7 @@ CREATE TABLE tableName ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - preCombineField = '_ts' + orderingFields = '_ts' ) LOCATION '/location/to/basePath'; diff --git a/website/docs/sql_queries.md b/website/docs/sql_queries.md index d7eeb9cdb1c40..9310c72fbc624 100644 --- a/website/docs/sql_queries.md +++ b/website/docs/sql_queries.md @@ -210,7 +210,7 @@ CREATE TABLE IF NOT EXISTS hudi_table_merge_mode ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - precombineField = 'ts', + orderingFields = 'ts', recordMergeMode = 'EVENT_TIME_ORDERING' ) LOCATION 'file:///tmp/hudi_table_merge_mode/'; @@ -225,7 +225,7 @@ INSERT INTO hudi_table_merge_mode VALUES (1, 'a1', 900, 20.0); SELECT id, name, ts, price FROM hudi_table_merge_mode; ``` -With `EVENT_TIME_ORDERING`, the record with the larger event time (specified via `precombineField` ordering field) overwrites the record with the +With `EVENT_TIME_ORDERING`, the record with the larger event time (specified via `orderingFields`) overwrites the record with the smaller event time on the same key, regardless of transaction time. 
### Snapshot Query with Custom Merge Mode @@ -244,7 +244,7 @@ CREATE TABLE IF NOT EXISTS hudi_table_merge_mode_custom ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - precombineField = 'ts', + orderingFields = 'ts', recordMergeMode = 'CUSTOM', 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload' ) diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md index 6c414b54f6e3c..6c8eb699ce1dd 100644 --- a/website/docs/write_operations.md +++ b/website/docs/write_operations.md @@ -96,7 +96,7 @@ Here are the basic configs relevant to the write operations types mentioned abov | Config Name | Default | Description | |------------------------------------------------|----------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | hoodie.datasource.write.operation | upsert (Optional) | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it.

`Config Param: OPERATION` | -| hoodie.datasource.write.precombine.field | (no default) (Optional) | Field used for ordering records before actual write. When two records have the same key value, we will pick the one with the largest value for the ordering field, determined by Object.compareTo(..). Note: This config is deprecated, use `hoodie.table.ordering.fields` instead.

`Config Param: PRECOMBINE_FIELD` |
+| hoodie.table.ordering.fields | (N/A) (Optional) | Comma-separated fields used in record merging comparison. By default, when two records have the same key value, the one with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br>
`Config Param: ORDERING_FIELDS` | | hoodie.combine.before.insert | false (Optional) | When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage.

`Config Param: COMBINE_BEFORE_INSERT` | | hoodie.datasource.write.insert.drop.duplicates | false (Optional) | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.

`Config Param: INSERT_DROP_DUPS` |
| hoodie.bulkinsert.sort.mode | NONE (Optional) | org.apache.hudi.execution.bulkinsert.BulkInsertSortMode: Modes for sorting records during bulk insert.<br>
`Config Param: BULK_INSERT_SORT_MODE` |
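Following the operation guidance above, a common pattern is an initial `bulk_insert` load followed by `upsert` writes. A hypothetical sketch, with the DataFrames, table name, and path as placeholders:

```scala
// 1) One-time load: bulk_insert scales to large inputs via a disk-based write path.
seedDF.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.operation", "bulk_insert").
  mode("overwrite").
  save(basePath)

// 2) Ongoing writes: upsert de-duplicates on the key, using ordering fields to pick winners.
deltaDF.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.table.ordering.fields", "ts").
  mode("append").
  save(basePath)
```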
diff --git a/website/learn/tech-specs-1point0.md b/website/learn/tech-specs-1point0.md
index fccdbf9a6a206..634224bb90344 100644
--- a/website/learn/tech-specs-1point0.md
+++ b/website/learn/tech-specs-1point0.md
@@ -320,13 +320,13 @@ Below is the list of properties that are stored in this file.
| hoodie.table.version | Table format version |
| hoodie.table.recordkey.fields | Comma-separated list of fields used for record keys. This property is optional. |
| hoodie.table.partition.fields | Comma-separated list of fields used for partitioning the table. This property is optional. |
-| hoodie.table.precombine.field | Field used to break ties when two records have same value for the record key. This property is optional. |
+| hoodie.table.ordering.fields | Fields used to break ties when two records have same value for the record key. This property is optional. |
| hoodie.timeline.layout.version | Version of timeline used by the table. |
| hoodie.table.checksum | Table checksum used to guard against partial writes on HDFS. The value is auto-generated. |
| hoodie.table.metadata.partitions | Comma-separated list of metadata partitions that can be used by reader, e.g. _files_, _column\_stats_ |
| hoodie.table.index.defs.path | Absolute path where the index definitions are stored for various indexes created by the users. This property is optional. |

-The record key, precombine and partition fields are optional but play an important role in modeling data stored in Hudi
+The record key, ordering and partition fields are optional but play an important role in modeling data stored in Hudi
table.

| Field | Description |
diff --git a/website/src/pages/faq/storage.md b/website/src/pages/faq/storage.md
index 66b7f8ea23d6b..abb1bc3d68136 100644
--- a/website/src/pages/faq/storage.md
+++ b/website/src/pages/faq/storage.md
@@ -139,7 +139,7 @@ hudi_options = {
 'hoodie.table.name': "test_recon1",
 'hoodie.datasource.write.recordkey.field': 'uuid',
 'hoodie.datasource.write.table.name': "test_recon1",
- 'hoodie.datasource.write.precombine.field': 'ts',
+ 'hoodie.table.ordering.fields': 'ts',
 'hoodie.upsert.shuffle.parallelism': 2,
 'hoodie.insert.shuffle.parallelism': 2,
 "hoodie.datasource.write.hive_style_partitioning":"true",
diff --git a/website/versioned_docs/version-1.1.1/basic_configurations.md b/website/versioned_docs/version-1.1.1/basic_configurations.md
index 7d7f35f9014c2..1b888eb9d6edc 100644
--- a/website/versioned_docs/version-1.1.1/basic_configurations.md
+++ b/website/versioned_docs/version-1.1.1/basic_configurations.md
@@ -33,48 +33,47 @@ Configurations of the Hudi Table like type of ingestion, storage formats, hive t

[**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs)

-| Config Name | Default | Description |
-| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | 
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table
`Config Param: BOOTSTRAP_BASE_PATH` | -| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | (N/A) | Payload class to use for performing merges, compactions, i.e merge delta logs with current base file and then produce a new base file.
`Config Param: PAYLOAD_CLASS_NAME` | -| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database
`Config Param: DATABASE_NAME` | -| [hoodie.record.merge.mode](#hoodierecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.record.merge.strategy.id](#hoodierecordmergestrategyid) | (N/A) | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` which has the same merger strategy id
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0` | -| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.
`Config Param: TABLE_CHECKSUM`
`Since Version: 0.11.0` | -| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table
`Config Param: CREATE_SCHEMA` | -| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Relative path to table base path where the index definitions are stored
`Config Param: RELATIVE_INDEX_DEFINITION_PATH`
`Since Version: 1.0.0` | -| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table
`Config Param: KEY_GENERATOR_CLASS_NAME` | -| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key Generator type to determine key generator class
`Config Param: KEY_GENERATOR_TYPE`
`Since Version: 1.0.0` | -| [hoodie.table.legacy.payload.class](#hoodietablelegacypayloadclass) | (N/A) | Payload class to indicate the payload class that is used to create the table and is not used anymore.
`Config Param: LEGACY_PAYLOAD_CLASS_NAME`
`Since Version: 1.1.0` | -| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers
`Config Param: TABLE_METADATA_PARTITIONS`
`Since Version: 0.11.0` | -| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.
`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`
`Since Version: 0.11.0` | -| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.
`Config Param: NAME` | -| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma separated fields used in records merging comparison. By default, when two records have the same key value, the largest value for the ordering field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.
`Config Param: ORDERING_FIELDS` | -| [hoodie.table.partial.update.mode](#hoodietablepartialupdatemode) | (N/A) | This property when set, will define how two versions of the record will be merged together when records are partially formed
`Config Param: PARTIAL_UPDATE_MODE`
`Since Version: 1.1.0` | -| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Comma separated field names used to partition the table. These field names also include the partition type which is used by custom key generators
`Config Param: PARTITION_FIELDS` | -| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Comma separated fields used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.
`Config Param: PRECOMBINE_FIELD` | -| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.
`Config Param: RECORDKEY_FIELDS` | -| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes
`Config Param: SECONDARY_INDEXES_METADATA`
`Since Version: 0.13.0` | -| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.
`Config Param: TIMELINE_LAYOUT_VERSION` | -| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.
`Config Param: ARCHIVELOG_FOLDER` | -| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.
`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` | -| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.
`Config Param: BOOTSTRAP_INDEX_ENABLE` | -| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type determines which implementation to use, for mapping base files to bootstrap base file, that contain actual data.
`Config Param: BOOTSTRAP_INDEX_TYPE`
`Since Version: 1.0.0` | -| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` | -| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.
`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` | -| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing
`Config Param: POPULATE_META_FIELDS` | -| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.
`Config Param: BASE_FILE_FORMAT` | -| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.
`Config Param: CDC_ENABLED`
`Since Version: 0.13.0` | +| Config Name | Default | Description | +| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table
`Config Param: BOOTSTRAP_BASE_PATH` | +| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | (N/A) | Payload class to use for performing merges and compactions, i.e., merging delta logs with the current base file to produce a new base file.<br></br>
`Config Param: PAYLOAD_CLASS_NAME` | +| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database
`Config Param: DATABASE_NAME` | +| [hoodie.record.merge.mode](#hoodierecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| [hoodie.record.merge.strategy.id](#hoodierecordmergestrategyid) | (N/A) | Id of the merger strategy. Hudi will pick the HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` that have the same merger strategy id.<br></br>
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0` | +| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.
`Config Param: TABLE_CHECKSUM`
`Since Version: 0.11.0` | +| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table
`Config Param: CREATE_SCHEMA` | +| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Relative path to table base path where the index definitions are stored
`Config Param: RELATIVE_INDEX_DEFINITION_PATH`
`Since Version: 1.0.0` | +| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table
`Config Param: KEY_GENERATOR_CLASS_NAME` | +| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key Generator type to determine key generator class
`Config Param: KEY_GENERATOR_TYPE`
`Since Version: 1.0.0` | +| [hoodie.table.legacy.payload.class](#hoodietablelegacypayloadclass) | (N/A) | Payload class that was used to create the table and is not used anymore.<br></br>
`Config Param: LEGACY_PAYLOAD_CLASS_NAME`<br></br>
`Since Version: 1.1.0` | +| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and are in sync with the data table. These partitions are ready for use by the readers.<br></br>
`Config Param: TABLE_METADATA_PARTITIONS`
`Since Version: 0.11.0` | +| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.
`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`
`Since Version: 0.11.0` | +| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be the same across runs.<br></br>
`Config Param: NAME` | +| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used for record merging comparison. By default, when two records have the same key value, the record with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on (see the write sketch after this table).<br></br>
`Config Param: ORDERING_FIELDS` | +| [hoodie.table.partial.update.mode](#hoodietablepartialupdatemode) | (N/A) | When set, this property defines how two versions of the record will be merged together when records are partially formed.<br></br>
`Config Param: PARTIAL_UPDATE_MODE`
`Since Version: 1.1.0` | +| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Comma separated field names used to partition the table. These field names also include the partition type which is used by custom key generators
`Config Param: PARTITION_FIELDS` | +| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.
`Config Param: RECORDKEY_FIELDS` | +| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes
`Config Param: SECONDARY_INDEXES_METADATA`
`Since Version: 0.13.0` | +| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of the timeline used by the table.<br></br>
`Config Param: TIMELINE_LAYOUT_VERSION` | +| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | Path under the meta folder where archived timeline instants are stored.<br></br>
`Config Param: ARCHIVELOG_FOLDER` | +| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use for mapping base files to the bootstrap base files that contain the actual data.<br></br>
`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` | +| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether this is a bootstrapped table, with bootstrap base data and a mapping index defined. Default: true.<br></br>
`Config Param: BOOTSTRAP_INDEX_ENABLE` | +| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type, determining which implementation to use for mapping base files to the bootstrap base files that contain the actual data.<br></br>
`Config Param: BOOTSTRAP_INDEX_TYPE`
`Since Version: 1.0.0` | +| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` | +| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.
`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` | +| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing
`Config Param: POPULATE_META_FIELDS` | +| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.
`Config Param: BASE_FILE_FORMAT` | +| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enabled, persists the change data if necessary, so that the table can be queried in CDC query mode (see the CDC sketch after this table).<br></br>
`Config Param: CDC_ENABLED`
`Since Version: 0.13.0` | | [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and after image. DATA_BEFORE: Keeping the before images in the supplemental logs, so the reader needs to figure out the update after images. DATA_BEFORE_AFTER(default): Keeping the before and after images in the supplemental logs, so the reader can generate the details directly from the logs.
`Config Param: CDC_SUPPLEMENTAL_LOGGING_MODE`
`Since Version: 0.13.0` | -| [hoodie.table.format](#hoodietableformat) | native | Table format name used when writing to the table.
`Config Param: TABLE_FORMAT` | -| [hoodie.table.initial.version](#hoodietableinitialversion) | NINE | Initial Version of table when the table was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially setup.
`Config Param: INITIAL_VERSION`
`Since Version: 1.0.0` | -| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.
`Config Param: LOG_FILE_FORMAT` | -| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.
`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`
`Since Version: 1.0.0` | -| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default
`Config Param: TIMELINE_TIMEZONE` | -| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data.
`Config Param: TYPE` | -| [hoodie.table.version](#hoodietableversion) | NINE | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.
`Config Param: VERSION` | -| [hoodie.timeline.history.path](#hoodietimelinehistorypath) | history | path under the meta folder, to store timeline history at.
`Config Param: TIMELINE_HISTORY_PATH` | -| [hoodie.timeline.path](#hoodietimelinepath) | timeline | path under the meta folder, to store timeline instants at.
`Config Param: TIMELINE_PATH` | +| [hoodie.table.format](#hoodietableformat) | native | Table format name used when writing to the table.
`Config Param: TABLE_FORMAT` | +| [hoodie.table.initial.version](#hoodietableinitialversion) | NINE | Initial version of the table when it was created. Used during upgrade/downgrade to identify which upgrade/downgrade paths happened on the table. This is only configured when the table is initially set up.<br></br>
`Config Param: INITIAL_VERSION`
`Since Version: 1.0.0` | +| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.
`Config Param: LOG_FILE_FORMAT` | +| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.
`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`
`Since Version: 1.0.0` | +| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set the hoodie commit timeline timezone, such as UTC or LOCAL; LOCAL is the default.<br></br>
`Config Param: TIMELINE_TIMEZONE` | +| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data.<br></br>
`Config Param: TYPE` | +| [hoodie.table.version](#hoodietableversion) | NINE | Version of the table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards-compatible changes.<br></br>
`Config Param: VERSION` | +| [hoodie.timeline.history.path](#hoodietimelinehistorypath) | history | Path under the meta folder where timeline history is stored.<br></br>
`Config Param: TIMELINE_HISTORY_PATH` | +| [hoodie.timeline.path](#hoodietimelinepath) | timeline | Path under the meta folder where timeline instants are stored.<br></br>
`Config Param: TIMELINE_PATH` | --- ## Spark Datasource Configs {#SPARK_DATASOURCE} @@ -97,7 +96,6 @@ Options useful for reading tables via `read.format.option(...)` | [hoodie.datasource.read.end.instanttime](#hoodiedatasourcereadendinstanttime) | (N/A) | Used when `hoodie.datasource.query.type` is set to `incremental`. Represents the completion time to limit incrementally fetched data to. When not specified latest commit completion time from timeline is assumed by default. When specified, new data written with completion_time <= END_COMMIT are fetched out. Point in time type queries make more sense with begin and end completion times specified.
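To make the record merging configs above concrete, here is a minimal Spark datasource write sketch, not an authoritative recipe: it combines `hoodie.write.record.merge.mode` with `hoodie.table.ordering.fields` using the literal option keys from the tables above; the column names (`uuid`, `ts`, `partition`), the table name, and `basePath` are illustrative assumptions.

```java
// Minimal sketch, assuming a DataFrame `inputDF` with columns uuid, ts, partition.
// Option keys come from the config tables above; values are illustrative.
inputDF.write()
  .format("hudi")
  .option("hoodie.table.name", "trips_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "partition")
  // EVENT_TIME_ORDERING: on a key collision, the record with the larger
  // ordering-field value wins, regardless of commit (transaction) time.
  .option("hoodie.write.record.merge.mode", "EVENT_TIME_ORDERING")
  .option("hoodie.table.ordering.fields", "ts")
  .mode(SaveMode.Append)
  .save(basePath);
```

With `COMMIT_TIME_ORDERING` instead, the record from the later transaction would win and no ordering fields would be required.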
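Likewise, a hedged sketch of enabling change data capture at table creation time, assuming that the table-level `hoodie.table.cdc.*` configs can be passed as writer options on the first write; the table name and save mode are illustrative assumptions.

```java
// Minimal sketch: enable CDC when the table is first created.
// DATA_BEFORE_AFTER is the documented default supplemental logging mode.
inputDF.write()
  .format("hudi")
  .option("hoodie.table.name", "trips_table")
  .option("hoodie.table.cdc.enabled", "true")
  .option("hoodie.table.cdc.supplemental.logging.mode", "DATA_BEFORE_AFTER")
  .mode(SaveMode.Overwrite) // initial write that creates the table
  .save(basePath);
```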
`Config Param: END_COMMIT` | | [hoodie.datasource.read.incr.table.version](#hoodiedatasourcereadincrtableversion) | (N/A) | The table version assumed for incremental read
`Config Param: INCREMENTAL_READ_TABLE_VERSION` | | [hoodie.datasource.read.streaming.table.version](#hoodiedatasourcereadstreamingtableversion) | (N/A) | The table version assumed for streaming read
`Config Param: STREAMING_READ_TABLE_VERSION` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: READ_PRE_COMBINE_FIELD` | | [hoodie.datasource.query.type](#hoodiedatasourcequerytype) | snapshot | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files)
`Config Param: QUERY_TYPE` | --- @@ -111,7 +109,7 @@ inputDF.write() .options(clientOpts) // any of the Hudi client opts can be passed in as well .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition") -.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp") +.option(HoodieTableConfig.ORDERING_FIELDS(), "timestamp") .option(HoodieWriteConfig.TABLE_NAME, tableName) .mode(SaveMode.Append) .save(basePath); @@ -126,24 +124,23 @@ Options useful for writing tables via `write.format.option(...)` [**Basic Configs**](#Write-Options-basic-configs) -| Config Name | Default | Description | -| ------------------------------------------------------------------------------------------------ | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| [hoodie.datasource.hive_sync.mode](#hoodiedatasourcehive_syncmode) | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
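As a counterpart to the write snippet above, a minimal sketch of an incremental read using the read options documented above; the begin-instant key (`hoodie.datasource.read.begin.instanttime`) and the timestamp literals are assumptions for illustration, not values prescribed by this page.

```java
// Minimal sketch: fetch records written between two completion times.
// The query-type and end-instant keys are from the read options table above;
// the begin-instant key and the timestamps are illustrative assumptions.
Dataset<Row> incrDF = spark.read()
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
  .option("hoodie.datasource.read.end.instanttime", "20240102000000")
  .load(basePath);
```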
`Config Param: HIVE_SYNC_MODE` | -| [hoodie.datasource.write.partitionpath.field](#hoodiedatasourcewritepartitionpathfield) | (N/A) | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
`Config Param: PARTITIONPATH_FIELD` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: ORDERING_FIELDS` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: PRECOMBINE_FIELD` | -| [hoodie.datasource.write.recordkey.field](#hoodiedatasourcewriterecordkeyfield) | (N/A) | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`
`Config Param: RECORDKEY_FIELD` | -| [hoodie.datasource.write.secondarykey.column](#hoodiedatasourcewritesecondarykeycolumn) | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`
`Config Param: SECONDARYKEY_COLUMN_NAME` | -| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.clustering.async.enabled](#hoodieclusteringasyncenabled) | false | Enable running of clustering service, asynchronously as inserts happen on the table.
`Config Param: ASYNC_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | -| [hoodie.clustering.inline](#hoodieclusteringinline) | false | Turn on inline clustering - clustering will be run after each write operation is complete
`Config Param: INLINE_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | -| [hoodie.datasource.hive_sync.enable](#hoodiedatasourcehive_syncenable) | false | When set to true, register/sync the table to Apache Hive metastore.
`Config Param: HIVE_SYNC_ENABLED` | -| [hoodie.datasource.hive_sync.jdbcurl](#hoodiedatasourcehive_syncjdbcurl) | jdbc:hive2://localhost:10000 | Hive metastore url
`Config Param: HIVE_URL` | -| [hoodie.datasource.hive_sync.metastore.uris](#hoodiedatasourcehive_syncmetastoreuris) | thrift://localhost:9083 | Hive metastore url
`Config Param: METASTORE_URIS` | -| [hoodie.datasource.meta.sync.enable](#hoodiedatasourcemetasyncenable) | false | Enable Syncing the Hudi Table with an external meta store or data catalog.
`Config Param: META_SYNC_ENABLED` | -| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
`Config Param: HIVE_STYLE_PARTITIONING` | -| [hoodie.datasource.write.operation](#hoodiedatasourcewriteoperation) | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it.
`Config Param: OPERATION` | -| [hoodie.datasource.write.table.type](#hoodiedatasourcewritetabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.
`Config Param: TABLE_TYPE` | +| Config Name | Default | Description | +|--------------------------------------------------------------------------------------------------|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.datasource.hive_sync.mode](#hoodiedatasourcehive_syncmode) | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
`Config Param: HIVE_SYNC_MODE` | +| [hoodie.datasource.write.partitionpath.field](#hoodiedatasourcewritepartitionpathfield) | (N/A) | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
`Config Param: PARTITIONPATH_FIELD` | +| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used for record merging comparison. By default, when two records have the same key value, the record with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br></br>
`Config Param: ORDERING_FIELDS` | +| [hoodie.datasource.write.recordkey.field](#hoodiedatasourcewriterecordkeyfield) | (N/A) | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`
`Config Param: RECORDKEY_FIELD` | +| [hoodie.datasource.write.secondarykey.column](#hoodiedatasourcewritesecondarykeycolumn) | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`
`Config Param: SECONDARYKEY_COLUMN_NAME` | +| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| [hoodie.clustering.async.enabled](#hoodieclusteringasyncenabled) | false | Enable running of clustering service, asynchronously as inserts happen on the table.
`Config Param: ASYNC_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | +| [hoodie.clustering.inline](#hoodieclusteringinline) | false | Turn on inline clustering - clustering will be run after each write operation is complete
`Config Param: INLINE_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | +| [hoodie.datasource.hive_sync.enable](#hoodiedatasourcehive_syncenable) | false | When set to true, register/sync the table to Apache Hive metastore.
`Config Param: HIVE_SYNC_ENABLED` | +| [hoodie.datasource.hive_sync.jdbcurl](#hoodiedatasourcehive_syncjdbcurl) | jdbc:hive2://localhost:10000 | JDBC url of the Hive server, used when the sync mode is jdbc<br></br>
`Config Param: HIVE_URL` | +| [hoodie.datasource.hive_sync.metastore.uris](#hoodiedatasourcehive_syncmetastoreuris) | thrift://localhost:9083 | Hive metastore url
`Config Param: METASTORE_URIS` | +| [hoodie.datasource.meta.sync.enable](#hoodiedatasourcemetasyncenable) | false | Enable Syncing the Hudi Table with an external meta store or data catalog.
`Config Param: META_SYNC_ENABLED` | +| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
`Config Param: HIVE_STYLE_PARTITIONING` | +| [hoodie.datasource.write.operation](#hoodiedatasourcewriteoperation) | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and from there on use upsert/insert. Bulk insert uses a disk-based write path that scales to large inputs without the need to cache them.<br></br>
`Config Param: OPERATION` | +| [hoodie.datasource.write.table.type](#hoodiedatasourcewritetabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.
`Config Param: TABLE_TYPE` | --- ## Flink Sql Configs {#FLINK_SQL} @@ -375,16 +372,16 @@ Configurations that control write behavior on Hudi tables. These can be directly [**Basic Configs**](#Write-Configurations-basic-configs) -| Config Name | Default | Description | -| ---------------------------------------------------------------------------------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [hoodie.base.path](#hoodiebasepath) | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
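To tie the sync-related write options above together, a minimal sketch of an upsert that also registers the table with the Hive metastore; the metastore URI, table name, and paths are illustrative assumptions.

```java
// Minimal sketch: upsert with Hive metastore sync over HMS.
// All option keys are taken from the write options table above.
inputDF.write()
  .format("hudi")
  .option("hoodie.table.name", "trips_table")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.hive_style_partitioning", "true") // partition=<value> folders
  .option("hoodie.datasource.meta.sync.enable", "true")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.mode", "hms")
  .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://localhost:9083")
  .mode(SaveMode.Append)
  .save(basePath);
```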
`Config Param: BASE_PATH` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: PRECOMBINE_FIELD_NAME` | -| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be same across runs.
`Config Param: TBL_NAME` | -| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.fail.job.on.duplicate.data.file.detection](#hoodiefailjobonduplicatedatafiledetection) | false | If config is enabled, entire job is failed on invalid file detection
`Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION` | -| [hoodie.write.auto.upgrade](#hoodiewriteautoupgrade) | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.
`Config Param: AUTO_UPGRADE_VERSION`
`Since Version: 1.0.0` | -| [hoodie.write.concurrency.mode](#hoodiewriteconcurrencymode) | SINGLE_WRITER | org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER(default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group. NON_BLOCKING_CONCURRENCY_CONTROL: Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor.
`Config Param: WRITE_CONCURRENCY_MODE` | -| [hoodie.write.table.version](#hoodiewritetableversion) | 9 | The table version this writer is storing the table in. This should match the current table version.
`Config Param: WRITE_TABLE_VERSION`
`Since Version: 1.0.0` | +| Config Name | Default | Description | +|------------------------------------------------------------------------------------------------|---------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.base.path](#hoodiebasepath) | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
`Config Param: BASE_PATH` | +| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used for record merging comparison. By default, when two records have the same key value, the record with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br></br>
`Config Param: ORDERING_FIELDS` | +| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be the same across runs.<br></br>
`Config Param: TBL_NAME` | +| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| [hoodie.fail.job.on.duplicate.data.file.detection](#hoodiefailjobonduplicatedatafiledetection) | false | If enabled, the entire job is failed when an invalid file is detected.<br></br>
`Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION` | +| [hoodie.write.auto.upgrade](#hoodiewriteautoupgrade) | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.
`Config Param: AUTO_UPGRADE_VERSION`
`Since Version: 1.0.0` | +| [hoodie.write.concurrency.mode](#hoodiewriteconcurrencymode) | SINGLE_WRITER | org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER(default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group. NON_BLOCKING_CONCURRENCY_CONTROL: Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor.
`Config Param: WRITE_CONCURRENCY_MODE` | +| [hoodie.write.table.version](#hoodiewritetableversion) | 9 | The table version this writer is storing the table in. This should match the current table version.
`Config Param: WRITE_TABLE_VERSION`
`Since Version: 1.0.0` | --- diff --git a/website/versioned_docs/version-1.1.1/configurations.md b/website/versioned_docs/version-1.1.1/configurations.md index 9e62c2a90ddfd..702fb5b61efec 100644 --- a/website/versioned_docs/version-1.1.1/configurations.md +++ b/website/versioned_docs/version-1.1.1/configurations.md @@ -54,48 +54,47 @@ Configurations of the Hudi Table like type of ingestion, storage formats, hive t [**Basic Configs**](#Hudi-Table-Basic-Configs-basic-configs) -| Config Name | Default | Description | -| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table
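For the multi-writer configs in the write configurations table above, a hedged sketch of opting into optimistic concurrency control; note that a lock provider must also be configured in practice, and its keys are outside the basic configs shown here.

```java
// Minimal sketch: multi-writer upsert with optimistic concurrency control.
// A lock provider configuration is also required in practice (omitted here).
inputDF.write()
  .format("hudi")
  .option("hoodie.table.name", "trips_table")
  .option("hoodie.write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL")
  .mode(SaveMode.Append)
  .save(basePath);
```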
`Config Param: BOOTSTRAP_BASE_PATH` | -| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | (N/A) | Payload class to use for performing merges, compactions, i.e merge delta logs with current base file and then produce a new base file.
`Config Param: PAYLOAD_CLASS_NAME` | -| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database
`Config Param: DATABASE_NAME` | -| [hoodie.record.merge.mode](#hoodierecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.record.merge.strategy.id](#hoodierecordmergestrategyid) | (N/A) | Id of merger strategy. Hudi will pick HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` which has the same merger strategy id
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0` | -| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.
`Config Param: TABLE_CHECKSUM`
`Since Version: 0.11.0` | -| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table
`Config Param: CREATE_SCHEMA` | -| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Relative path to table base path where the index definitions are stored
`Config Param: RELATIVE_INDEX_DEFINITION_PATH`
`Since Version: 1.0.0` | -| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table
`Config Param: KEY_GENERATOR_CLASS_NAME` | -| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key Generator type to determine key generator class
`Config Param: KEY_GENERATOR_TYPE`
`Since Version: 1.0.0` | -| [hoodie.table.legacy.payload.class](#hoodietablelegacypayloadclass) | (N/A) | Payload class to indicate the payload class that is used to create the table and is not used anymore.
`Config Param: LEGACY_PAYLOAD_CLASS_NAME`
`Since Version: 1.1.0` | -| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers
`Config Param: TABLE_METADATA_PARTITIONS`
`Since Version: 0.11.0` | -| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.
`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`
`Since Version: 0.11.0` | -| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be same across runs.
`Config Param: NAME` | -| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma separated fields used in records merging comparison. By default, when two records have the same key value, the largest value for the ordering field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.
`Config Param: ORDERING_FIELDS` | -| [hoodie.table.partial.update.mode](#hoodietablepartialupdatemode) | (N/A) | This property when set, will define how two versions of the record will be merged together when records are partially formed
`Config Param: PARTIAL_UPDATE_MODE`
`Since Version: 1.1.0` | -| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Comma separated field names used to partition the table. These field names also include the partition type which is used by custom key generators
`Config Param: PARTITION_FIELDS` | -| [hoodie.table.precombine.field](#hoodietableprecombinefield) | (N/A) | Comma separated fields used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked. If there are multiple fields configured, comparison is made on the first field. If the first field values are same, comparison is made on the second field and so on.
`Config Param: PRECOMBINE_FIELD` | -| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.
`Config Param: RECORDKEY_FIELDS` | -| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes
`Config Param: SECONDARY_INDEXES_METADATA`
`Since Version: 0.13.0` | -| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of timeline used, by the table.
`Config Param: TIMELINE_LAYOUT_VERSION` | -| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | path under the meta folder, to store archived timeline instants at.
`Config Param: ARCHIVELOG_FOLDER` | -| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use, for mapping base files to bootstrap base file, that contain actual data.
`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` | -| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether or not, this is a bootstrapped table, with bootstrap base data and an mapping index defined, default true.
`Config Param: BOOTSTRAP_INDEX_ENABLE` | -| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type determines which implementation to use, for mapping base files to bootstrap base file, that contain actual data.
`Config Param: BOOTSTRAP_INDEX_TYPE`
`Since Version: 1.0.0` | -| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` | -| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.
`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` | -| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing
`Config Param: POPULATE_META_FIELDS` | -| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.
`Config Param: BASE_FILE_FORMAT` | -| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enable, persist the change data if necessary, and can be queried as a CDC query mode.
`Config Param: CDC_ENABLED`
`Since Version: 0.13.0` | +| Config Name | Default | Description | +| ------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.bootstrap.base.path](#hoodiebootstrapbasepath) | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table
`Config Param: BOOTSTRAP_BASE_PATH` | +| [hoodie.compaction.payload.class](#hoodiecompactionpayloadclass) | (N/A) | Payload class to use for performing merges and compactions, i.e., merging delta logs with the current base file to produce a new base file.<br></br>
`Config Param: PAYLOAD_CLASS_NAME` | +| [hoodie.database.name](#hoodiedatabasename) | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database
`Config Param: DATABASE_NAME` | +| [hoodie.record.merge.mode](#hoodierecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| [hoodie.record.merge.strategy.id](#hoodierecordmergestrategyid) | (N/A) | Id of the merger strategy. Hudi will pick the HoodieRecordMerger implementations in `hoodie.write.record.merge.custom.implementation.classes` that have the same merger strategy id.<br></br>
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0` | +| [hoodie.table.checksum](#hoodietablechecksum) | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.
`Config Param: TABLE_CHECKSUM`
`Since Version: 0.11.0` | +| [hoodie.table.create.schema](#hoodietablecreateschema) | (N/A) | Schema used when creating the table
`Config Param: CREATE_SCHEMA` | +| [hoodie.table.index.defs.path](#hoodietableindexdefspath) | (N/A) | Relative path to table base path where the index definitions are stored
`Config Param: RELATIVE_INDEX_DEFINITION_PATH`
`Since Version: 1.0.0` | +| [hoodie.table.keygenerator.class](#hoodietablekeygeneratorclass) | (N/A) | Key Generator class property for the hoodie table
`Config Param: KEY_GENERATOR_CLASS_NAME` | +| [hoodie.table.keygenerator.type](#hoodietablekeygeneratortype) | (N/A) | Key Generator type to determine key generator class
`Config Param: KEY_GENERATOR_TYPE`
`Since Version: 1.0.0` | +| [hoodie.table.legacy.payload.class](#hoodietablelegacypayloadclass) | (N/A) | Payload class that was used to create the table and is not used anymore.<br></br>
`Config Param: LEGACY_PAYLOAD_CLASS_NAME`<br></br>
`Since Version: 1.1.0` | +| [hoodie.table.metadata.partitions](#hoodietablemetadatapartitions) | (N/A) | Comma-separated list of metadata partitions that have been completely built and are in sync with the data table. These partitions are ready for use by the readers.<br></br>
`Config Param: TABLE_METADATA_PARTITIONS`
`Since Version: 0.11.0` | +| [hoodie.table.metadata.partitions.inflight](#hoodietablemetadatapartitionsinflight) | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.
`Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT`
`Since Version: 0.11.0` | +| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with Hive. Needs to be the same across runs.<br></br>
`Config Param: NAME` | +| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma-separated fields used for record merging comparison. By default, when two records have the same key value, the record with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br></br>
`Config Param: ORDERING_FIELDS` | +| [hoodie.table.partial.update.mode](#hoodietablepartialupdatemode) | (N/A) | When set, this property defines how two versions of the record will be merged together when records are partially formed.<br></br>
`Config Param: PARTIAL_UPDATE_MODE`
`Since Version: 1.1.0` | +| [hoodie.table.partition.fields](#hoodietablepartitionfields) | (N/A) | Comma separated field names used to partition the table. These field names also include the partition type which is used by custom key generators
`Config Param: PARTITION_FIELDS` | +| [hoodie.table.recordkey.fields](#hoodietablerecordkeyfields) | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.
`Config Param: RECORDKEY_FIELDS` | +| [hoodie.table.secondary.indexes.metadata](#hoodietablesecondaryindexesmetadata) | (N/A) | The metadata of secondary indexes
`Config Param: SECONDARY_INDEXES_METADATA`
`Since Version: 0.13.0` | +| [hoodie.timeline.layout.version](#hoodietimelinelayoutversion) | (N/A) | Version of the timeline used by the table.<br></br>
`Config Param: TIMELINE_LAYOUT_VERSION` | +| [hoodie.archivelog.folder](#hoodiearchivelogfolder) | archived | Path under the meta folder where archived timeline instants are stored.<br></br>
`Config Param: ARCHIVELOG_FOLDER` | +| [hoodie.bootstrap.index.class](#hoodiebootstrapindexclass) | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use for mapping base files to the bootstrap base files that contain the actual data.<br></br>
`Config Param: BOOTSTRAP_INDEX_CLASS_NAME` | +| [hoodie.bootstrap.index.enable](#hoodiebootstrapindexenable) | true | Whether this is a bootstrapped table, with bootstrap base data and a mapping index defined. Default: true.<br></br>
`Config Param: BOOTSTRAP_INDEX_ENABLE` | +| [hoodie.bootstrap.index.type](#hoodiebootstrapindextype) | HFILE | Bootstrap index type, determining which implementation to use for mapping base files to the bootstrap base files that contain the actual data.<br></br>
`Config Param: BOOTSTRAP_INDEX_TYPE`
`Since Version: 1.0.0` | +| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
`Config Param: HIVE_STYLE_PARTITIONING_ENABLE` | +| [hoodie.partition.metafile.use.base.format](#hoodiepartitionmetafileusebaseformat) | false | If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.
`Config Param: PARTITION_METAFILE_USE_BASE_FORMAT` | +| [hoodie.populate.meta.fields](#hoodiepopulatemetafields) | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing
`Config Param: POPULATE_META_FIELDS` | +| [hoodie.table.base.file.format](#hoodietablebasefileformat) | PARQUET | Base file format to store all the base file data.
`Config Param: BASE_FILE_FORMAT` | +| [hoodie.table.cdc.enabled](#hoodietablecdcenabled) | false | When enabled, persists the change data if necessary, so that the table can be queried in CDC query mode.<br></br>
`Config Param: CDC_ENABLED`
`Since Version: 0.13.0` | | [hoodie.table.cdc.supplemental.logging.mode](#hoodietablecdcsupplementalloggingmode) | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change log capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: Only keeping record keys in the supplemental logs, so the reader needs to figure out the update before image and after image. DATA_BEFORE: Keeping the before images in the supplemental logs, so the reader needs to figure out the update after images. DATA_BEFORE_AFTER(default): Keeping the before and after images in the supplemental logs, so the reader can generate the details directly from the logs.
`Config Param: CDC_SUPPLEMENTAL_LOGGING_MODE`
`Since Version: 0.13.0` | -| [hoodie.table.format](#hoodietableformat) | native | Table format name used when writing to the table.
`Config Param: TABLE_FORMAT` | -| [hoodie.table.initial.version](#hoodietableinitialversion) | NINE | Initial Version of table when the table was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially setup.
`Config Param: INITIAL_VERSION`
`Since Version: 1.0.0` | -| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.
`Config Param: LOG_FILE_FORMAT` | -| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.
`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`
`Since Version: 1.0.0` | -| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | User can set hoodie commit timeline timezone, such as utc, local and so on. local is default
`Config Param: TIMELINE_TIMEZONE` | -| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data.
`Config Param: TYPE` | -| [hoodie.table.version](#hoodietableversion) | NINE | Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.
`Config Param: VERSION` | -| [hoodie.timeline.history.path](#hoodietimelinehistorypath) | history | path under the meta folder, to store timeline history at.
`Config Param: TIMELINE_HISTORY_PATH` | -| [hoodie.timeline.path](#hoodietimelinepath) | timeline | path under the meta folder, to store timeline instants at.
`Config Param: TIMELINE_PATH` | +| [hoodie.table.format](#hoodietableformat) | native | Table format name used when writing to the table.
`Config Param: TABLE_FORMAT` | +| [hoodie.table.initial.version](#hoodietableinitialversion) | NINE | Initial version of the table when it was created. Used for upgrade/downgrade to identify what upgrade/downgrade paths happened on the table. This is only configured when the table is initially set up.<br></br>
`Config Param: INITIAL_VERSION`
`Since Version: 1.0.0` | +| [hoodie.table.log.file.format](#hoodietablelogfileformat) | HOODIE_LOG | Log format used for the delta logs.
`Config Param: LOG_FILE_FORMAT` | +| [hoodie.table.multiple.base.file.formats.enable](#hoodietablemultiplebasefileformatsenable) | false | When set to true, the table can support reading and writing multiple base file formats.
`Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE`
`Since Version: 1.0.0` | +| [hoodie.table.timeline.timezone](#hoodietabletimelinetimezone) | LOCAL | Timezone to use for the hoodie commit timeline, such as UTC or LOCAL. Defaults to LOCAL.<br></br>
`Config Param: TIMELINE_TIMEZONE` | +| [hoodie.table.type](#hoodietabletype) | COPY_ON_WRITE | The table type for the underlying data.
`Config Param: TYPE` | +| [hoodie.table.version](#hoodietableversion) | NINE | Version of the table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.<br></br>
`Config Param: VERSION` | +| [hoodie.timeline.history.path](#hoodietimelinehistorypath) | history | Path under the meta folder where timeline history is stored.<br></br>
`Config Param: TIMELINE_HISTORY_PATH` | +| [hoodie.timeline.path](#hoodietimelinepath) | timeline | Path under the meta folder where timeline instants are stored.<br></br>
`Config Param: TIMELINE_PATH` | [**Advanced Configs**](#Hudi-Table-Basic-Configs-advanced-configs) @@ -125,7 +124,6 @@ Options useful for reading tables via `read.format.option(...)` | [hoodie.datasource.read.end.instanttime](#hoodiedatasourcereadendinstanttime) | (N/A) | Used when `hoodie.datasource.query.type` is set to `incremental`. Represents the completion time to limit incrementally fetched data to. When not specified latest commit completion time from timeline is assumed by default. When specified, new data written with completion_time <= END_COMMIT are fetched out. Point in time type queries make more sense with begin and end completion times specified.
`Config Param: END_COMMIT` | | [hoodie.datasource.read.incr.table.version](#hoodiedatasourcereadincrtableversion) | (N/A) | The table version assumed for incremental read
`Config Param: INCREMENTAL_READ_TABLE_VERSION` | | [hoodie.datasource.read.streaming.table.version](#hoodiedatasourcereadstreamingtableversion) | (N/A) | The table version assumed for streaming read
`Config Param: STREAMING_READ_TABLE_VERSION` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: READ_PRE_COMBINE_FIELD` | | [hoodie.datasource.query.type](#hoodiedatasourcequerytype) | snapshot | Whether data needs to be read, in `incremental` mode (new data since an instantTime) (or) `read_optimized` mode (obtain latest view, based on base files) (or) `snapshot` mode (obtain latest view, by merging base and (if any) log files)
`Config Param: QUERY_TYPE` | [**Advanced Configs**](#Read-Options-advanced-configs) @@ -169,7 +167,7 @@ inputDF.write() .options(clientOpts) // any of the Hudi client opts can be passed in as well .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition") -.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp") +.option(HoodieTableConfig.ORDERING_FIELDS(), "timestamp") .option(HoodieWriteConfig.TABLE_NAME, tableName) .mode(SaveMode.Append) .save(basePath); @@ -183,24 +181,23 @@ Options useful for writing tables via `write.format.option(...)` [**Basic Configs**](#Write-Options-basic-configs) -| Config Name | Default | Description | -| ------------------------------------------------------------------------------------------------ | ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| [hoodie.datasource.hive_sync.mode](#hoodiedatasourcehive_syncmode) | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
`Config Param: HIVE_SYNC_MODE` | -| [hoodie.datasource.write.partitionpath.field](#hoodiedatasourcewritepartitionpathfield) | (N/A) | Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
`Config Param: PARTITIONPATH_FIELD` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: ORDERING_FIELDS` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: PRECOMBINE_FIELD` | -| [hoodie.datasource.write.recordkey.field](#hoodiedatasourcewriterecordkeyfield) | (N/A) | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`
`Config Param: RECORDKEY_FIELD` | -| [hoodie.datasource.write.secondarykey.column](#hoodiedatasourcewritesecondarykeycolumn) | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation eg: `a.b.c`
`Config Param: SECONDARYKEY_COLUMN_NAME` | -| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.clustering.async.enabled](#hoodieclusteringasyncenabled) | false | Enable running of clustering service, asynchronously as inserts happen on the table.
`Config Param: ASYNC_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | -| [hoodie.clustering.inline](#hoodieclusteringinline) | false | Turn on inline clustering - clustering will be run after each write operation is complete
`Config Param: INLINE_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | -| [hoodie.datasource.hive_sync.enable](#hoodiedatasourcehive_syncenable) | false | When set to true, register/sync the table to Apache Hive metastore.
`Config Param: HIVE_SYNC_ENABLED` | -| [hoodie.datasource.hive_sync.jdbcurl](#hoodiedatasourcehive_syncjdbcurl) | jdbc:hive2://localhost:10000 | Hive metastore url
`Config Param: HIVE_URL` | -| [hoodie.datasource.hive_sync.metastore.uris](#hoodiedatasourcehive_syncmetastoreuris) | thrift://localhost:9083 | Hive metastore url
`Config Param: METASTORE_URIS` | -| [hoodie.datasource.meta.sync.enable](#hoodiedatasourcemetasyncenable) | false | Enable Syncing the Hudi Table with an external meta store or data catalog.
`Config Param: META_SYNC_ENABLED` | -| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
`Config Param: HIVE_STYLE_PARTITIONING` | -| [hoodie.datasource.write.operation](#hoodiedatasourcewriteoperation) | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it.
`Config Param: OPERATION` | -| [hoodie.datasource.write.table.type](#hoodiedatasourcewritetabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.
`Config Param: TABLE_TYPE` | +| Config Name | Default | Description | +| ------------------------------------------------------------------------------------------------ | ----------------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.datasource.hive_sync.mode](#hoodiedatasourcehive_syncmode) | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
`Config Param: HIVE_SYNC_MODE` | +| [hoodie.datasource.write.partitionpath.field](#hoodiedatasourcewritepartitionpathfield) | (N/A) | Partition path field. Value to be used as the partitionPath component of HoodieKey. Actual value obtained by invoking .toString().<br></br>
`Config Param: PARTITIONPATH_FIELD` | +| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma separated fields used in record merging comparison. By default, when two records have the same key value, the one with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br></br>
`Config Param: ORDERING_FIELDS` | +| [hoodie.datasource.write.recordkey.field](#hoodiedatasourcewriterecordkeyfield) | (N/A) | Record key field. Value to be used as the `recordKey` component of `HoodieKey`. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g. `a.b.c`<br></br>
`Config Param: RECORDKEY_FIELD` | +| [hoodie.datasource.write.secondarykey.column](#hoodiedatasourcewritesecondarykeycolumn) | (N/A) | Columns that constitute the secondary key component. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g. `a.b.c`<br></br>
`Config Param: SECONDARYKEY_COLUMN_NAME` | +| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
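For example (hypothetical values), under `EVENT_TIME_ORDERING` with ordering field `ts`, a record with `ts=5` survives over a record with `ts=3` for the same key, even if the `ts=3` record was committed later.<br></br>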
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| [hoodie.clustering.async.enabled](#hoodieclusteringasyncenabled) | false | Enable running of the clustering service asynchronously as inserts happen on the table.<br></br>
`Config Param: ASYNC_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | +| [hoodie.clustering.inline](#hoodieclusteringinline) | false | Turn on inline clustering: clustering runs after each write operation completes.<br></br>
`Config Param: INLINE_CLUSTERING_ENABLE`
`Since Version: 0.7.0` | +| [hoodie.datasource.hive_sync.enable](#hoodiedatasourcehive_syncenable) | false | When set to true, register/sync the table to Apache Hive metastore.
`Config Param: HIVE_SYNC_ENABLED` | +| [hoodie.datasource.hive_sync.jdbcurl](#hoodiedatasourcehive_syncjdbcurl) | jdbc:hive2://localhost:10000 | Hive JDBC connection url used for sync.<br></br>
`Config Param: HIVE_URL` | +| [hoodie.datasource.hive_sync.metastore.uris](#hoodiedatasourcehive_syncmetastoreuris) | thrift://localhost:9083 | Hive metastore URIs.<br></br>
`Config Param: METASTORE_URIS` | +| [hoodie.datasource.meta.sync.enable](#hoodiedatasourcemetasyncenable) | false | Enable syncing the Hudi table with an external metastore or data catalog.<br></br>
`Config Param: META_SYNC_ENABLED` | +| [hoodie.datasource.write.hive_style_partitioning](#hoodiedatasourcewritehive_style_partitioning) | false | Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values).<br></br>
`Config Param: HIVE_STYLE_PARTITIONING` | +| [hoodie.datasource.write.operation](#hoodiedatasourcewriteoperation) | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and from then on use upsert/insert. Bulk insert uses a disk based write path to scale to large inputs without the need to cache them.<br></br>
`Config Param: OPERATION` | +| [hoodie.datasource.write.table.type](#hoodiedatasourcewritetabletype) | COPY_ON_WRITE | The table type for the underlying data, for this write. This can’t change between writes.
`Config Param: TABLE_TYPE` | [**Advanced Configs**](#Write-Options-advanced-configs) @@ -958,16 +955,16 @@ Configurations that control write behavior on Hudi tables. These can be directly [**Basic Configs**](#Write-Configurations-basic-configs) -| Config Name | Default | Description | -| ---------------------------------------------------------------------------------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [hoodie.base.path](#hoodiebasepath) | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
`Config Param: BASE_PATH` | -| [hoodie.datasource.write.precombine.field](#hoodiedatasourcewriteprecombinefield) | (N/A) | Comma separated list of fields used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). For multiple fields if first key comparison is same, second key comparison is made and so on. This config is used for combining records within the same batch and also for merging using event time merge mode
`Config Param: PRECOMBINE_FIELD_NAME` | -| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be same across runs.
`Config Param: TBL_NAME` | -| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or preCombine field needs to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| [hoodie.fail.job.on.duplicate.data.file.detection](#hoodiefailjobonduplicatedatafiledetection) | false | If config is enabled, entire job is failed on invalid file detection
`Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION` | -| [hoodie.write.auto.upgrade](#hoodiewriteautoupgrade) | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.
`Config Param: AUTO_UPGRADE_VERSION`
`Since Version: 1.0.0` | +| Config Name | Default | Description | +| ---------------------------------------------------------------------------------------------- | -------------- |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [hoodie.base.path](#hoodiebasepath) | (N/A) | Base path on lake storage, under which all the table data is stored. Always prefix it explicitly with the storage scheme (e.g hdfs://, s3:// etc). Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs etc in .hoodie directory under this base path directory.
`Config Param: BASE_PATH` | +| [hoodie.table.ordering.fields](#hoodietableorderingfields) | (N/A) | Comma separated fields used in record merging comparison. By default, when two records have the same key value, the one with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br></br>
`Config Param: ORDERING_FIELDS` | +| [hoodie.table.name](#hoodietablename) | (N/A) | Table name that will be used for registering with metastores like HMS. Needs to be the same across runs.<br></br>
`Config Param: TBL_NAME` | +| [hoodie.write.record.merge.mode](#hoodiewriterecordmergemode) | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates COMMIT_TIME_ORDERING: Using transaction time to merge records, i.e., the record from later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: Using event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time. The event time or ordering fields need to be specified by the user. CUSTOM: Using custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| [hoodie.fail.job.on.duplicate.data.file.detection](#hoodiefailjobonduplicatedatafiledetection) | false | If enabled, the entire job fails when an invalid file is detected.<br></br>
`Config Param: FAIL_JOB_ON_DUPLICATE_DATA_FILE_DETECTION` | +| [hoodie.write.auto.upgrade](#hoodiewriteautoupgrade) | true | If enabled, writers automatically migrate the table to the specified write table version if the current table version is lower.
`Config Param: AUTO_UPGRADE_VERSION`
`Since Version: 1.0.0` | | [hoodie.write.concurrency.mode](#hoodiewriteconcurrencymode) | SINGLE_WRITER | org.apache.hudi.common.model.WriteConcurrencyMode: Concurrency modes for write operations. SINGLE_WRITER(default): Only one active writer to the table. Maximizes throughput. OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table with lazy conflict resolution using locks. This means that only one writer succeeds if multiple writers write to the same file group. NON_BLOCKING_CONCURRENCY_CONTROL: Multiple writers can operate on the table with non-blocking conflict resolution. The writers can write into the same file group with the conflicts resolved automatically by the query reader and the compactor.
`Config Param: WRITE_CONCURRENCY_MODE` | -| [hoodie.write.table.version](#hoodiewritetableversion) | 9 | The table version this writer is storing the table in. This should match the current table version.
`Config Param: WRITE_TABLE_VERSION`
`Since Version: 1.0.0` | +| [hoodie.write.table.version](#hoodiewritetableversion) | 9 | The table version this writer is storing the table in. This should match the current table version.
`Config Param: WRITE_TABLE_VERSION`
`Since Version: 1.0.0` | [**Advanced Configs**](#Write-Configurations-advanced-configs) diff --git a/website/versioned_docs/version-1.1.1/quick-start-guide.md b/website/versioned_docs/version-1.1.1/quick-start-guide.md index 8c05a9af0452a..360b4c172b13d 100644 --- a/website/versioned_docs/version-1.1.1/quick-start-guide.md +++ b/website/versioned_docs/version-1.1.1/quick-start-guide.md @@ -1240,7 +1240,7 @@ CREATE TABLE hudi_table ( driver STRING, fare DOUBLE, city STRING -) USING HUDI TBLPROPERTIES (preCombineField = 'ts') +) USING HUDI TBLPROPERTIES (orderingFields = 'ts') PARTITIONED BY (city); ``` COMMIT_TIME_ORDERING (when ordering field is not set) | Determines the logic of merging different records with the same record key. Valid values: (1) `COMMIT_TIME_ORDERING`: use commit time to merge records, i.e., the record from later commit overwrites the earlier record with the same key. (2) `EVENT_TIME_ORDERING`: use event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of commit time. The event time or preCombine field needs to be specified by the user. This is the default when an ordering field is configured. (3) `CUSTOM`: use custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | -| hoodie.write.record.merge.strategy.id | N/A (Optional) | ID of record merge strategy. Hudi will pick `HoodieRecordMerger` implementations from `hoodie.write.record.merge.custom.implementation.classes` that have the same merge strategy ID. When using custom merge logic, you need to specify both this config and `hoodie.write.record.merge.custom.implementation.classes`.
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0`
`Alternative: hoodie.datasource.write.record.merger.strategy` (deprecated) | -| hoodie.write.record.merge.custom.implementation.classes | N/A (Optional) | List of `HoodieRecordMerger` implementations constituting Hudi's merging strategy based on the engine used. Hudi selects the first implementation from this list that matches the following criteria: (1) has the same merge strategy ID as specified in `hoodie.write.record.merge.strategy.id` (if provided), (2) is compatible with the execution engine (e.g., SPARK merger for Spark, FLINK merger for Flink, AVRO for Java/Hive). The order in the list matters - place your preferred implementation first. Engine-specific implementations (SPARK, FLINK) are more efficient as they avoid Avro serialization/deserialization overhead.
`Config Param: RECORD_MERGE_IMPL_CLASSES`
`Since Version: 0.13.0`
`Alternative: hoodie.datasource.write.record.merger.impls` (deprecated) | +| Config Name | Default | Description | +|---------------------------------------------------------|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| hoodie.write.record.merge.mode | EVENT_TIME_ORDERING (when ordering field is set)
COMMIT_TIME_ORDERING (when ordering field is not set) | Determines the logic of merging different records with the same record key. Valid values: (1) `COMMIT_TIME_ORDERING`: use commit time to merge records, i.e., the record from later commit overwrites the earlier record with the same key. (2) `EVENT_TIME_ORDERING`: use event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of commit time. The event time or ordering fields need to be specified by the user. This is the default when an ordering field is configured. (3) `CUSTOM`: use custom merging logic specified by the user.
`Config Param: RECORD_MERGE_MODE`
`Since Version: 1.0.0` | +| hoodie.write.record.merge.strategy.id | N/A (Optional) | ID of record merge strategy. Hudi will pick `HoodieRecordMerger` implementations from `hoodie.write.record.merge.custom.implementation.classes` that have the same merge strategy ID. When using custom merge logic, you need to specify both this config and `hoodie.write.record.merge.custom.implementation.classes`.
`Config Param: RECORD_MERGE_STRATEGY_ID`
`Since Version: 0.13.0`
`Alternative: hoodie.datasource.write.record.merger.strategy` (deprecated) | +| hoodie.write.record.merge.custom.implementation.classes | N/A (Optional) | List of `HoodieRecordMerger` implementations constituting Hudi's merging strategy based on the engine used. Hudi selects the first implementation from this list that matches the following criteria: (1) has the same merge strategy ID as specified in `hoodie.write.record.merge.strategy.id` (if provided), (2) is compatible with the execution engine (e.g., SPARK merger for Spark, FLINK merger for Flink, AVRO for Java/Hive). The order in the list matters - place your preferred implementation first. Engine-specific implementations (SPARK, FLINK) are more efficient as they avoid Avro serialization/deserialization overhead.
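For example (hypothetical class name), set `hoodie.write.record.merge.custom.implementation.classes=com.example.CustomRecordMerger` together with the matching `hoodie.write.record.merge.strategy.id`.<br></br>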
`Config Param: RECORD_MERGE_IMPL_CLASSES`
`Since Version: 0.13.0`
`Alternative: hoodie.datasource.write.record.merger.impls` (deprecated) | ## Record Payloads (deprecated) diff --git a/website/versioned_docs/version-1.1.1/sql_ddl.md b/website/versioned_docs/version-1.1.1/sql_ddl.md index 1b06225c57e02..d1c5ba865bdb6 100644 --- a/website/versioned_docs/version-1.1.1/sql_ddl.md +++ b/website/versioned_docs/version-1.1.1/sql_ddl.md @@ -77,7 +77,7 @@ should be specified as `PARTITIONED BY (dt, hh)`. As discussed [here](quick-start-guide.md#keys), tables track each record in the table using a record key. Hudi auto-generated a highly compressed key for each new record in the examples so far. If you want to use an existing field as the key, you can set the `primaryKey` option. -Typically, this is also accompanied by configuring ordering fields (via `preCombineField` option) to deal with out-of-order data and potential +Typically, this is also accompanied by configuring ordering fields (via `orderingFields` option) to deal with out-of-order data and potential duplicate records with the same key in the incoming writes. :::note @@ -86,7 +86,7 @@ this materializes a composite key of the two fields, which can be useful for exp ::: Here is an example of creating a table using both options. Typically, a field that denotes the time of the event or -fact, e.g., order creation time, event generation time etc., is used as the ordering field (via `preCombineField`). Hudi resolves multiple versions +fact, e.g., order creation time, event generation time etc., is used as the ordering field (via `orderingFields`). Hudi resolves multiple versions of the same record by ordering based on this field when queries are run on the table. ```sql @@ -99,7 +99,7 @@ CREATE TABLE IF NOT EXISTS hudi_table_keyed ( TBLPROPERTIES ( type = 'cow', primaryKey = 'id', - preCombineField = 'ts' + orderingFields = 'ts' ); ``` @@ -118,13 +118,13 @@ CREATE TABLE IF NOT EXISTS hudi_table_merge_mode ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - precombineField = 'ts', + orderingFields = 'ts', recordMergeMode = 'EVENT_TIME_ORDERING' ) LOCATION 'file:///tmp/hudi_table_merge_mode/'; ``` -With `EVENT_TIME_ORDERING`, the record with the larger event time (specified via `precombineField` ordering field) overwrites the record with the +With `EVENT_TIME_ORDERING`, the record with the larger event time (specified via `orderingFields`) overwrites the record with the smaller event time on the same key, regardless of transaction's commit time. Users can set `CUSTOM` mode to provide their own merge logic. With `CUSTOM` merge mode, you can provide a custom class that implements the merge logic. The interfaces to implement is explained in detail [here](record_merger.md#custom). @@ -139,7 +139,7 @@ CREATE TABLE IF NOT EXISTS hudi_table_merge_mode_custom ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - precombineField = 'ts', + orderingFields = 'ts', recordMergeMode = 'CUSTOM', 'hoodie.record.merge.strategy.id' = '' ) @@ -177,7 +177,7 @@ CREATE TABLE hudi_table_ctas USING hudi TBLPROPERTIES ( type = 'cow', - preCombineField = 'ts' + orderingFields = 'ts' ) PARTITIONED BY (dt) AS SELECT * FROM parquet_table; @@ -196,7 +196,7 @@ CREATE TABLE hudi_table_ctas USING hudi TBLPROPERTIES ( type = 'cow', - preCombineField = 'ts' + orderingFields = 'ts' ) AS SELECT * FROM parquet_table; ``` @@ -579,10 +579,10 @@ Users can set table properties while creating a table. 
The important table prope |------------------|--------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | type | cow | The table type to create. `type = 'cow'` creates a COPY-ON-WRITE table, while `type = 'mor'` creates a MERGE-ON-READ table. Same as `hoodie.datasource.write.table.type`. More details can be found [here](table_types.md) | | primaryKey | uuid | The primary key field names of the table separated by commas. Same as `hoodie.datasource.write.recordkey.field`. If this config is ignored, hudi will auto-generate primary keys. If explicitly set, primary key generation will honor user configuration. | -| preCombineField | | The ordering field(s) of the table. It is used for resolving the final version of the record among multiple versions. Generally, `event time` or another similar column will be used for ordering purposes. Hudi will be able to handle out-of-order data using the ordering field value. | +| orderingFields | | The ordering field(s) of the table. It is used for resolving the final version of the record among multiple versions. Generally, `event time` or another similar column will be used for ordering purposes. Hudi will be able to handle out-of-order data using the ordering field value. | :::note -`primaryKey`, `preCombineField`, and `type` and other properties are case-sensitive. +`primaryKey`, `orderingFields`, and `type` and other properties are case-sensitive. ::: #### Passing Lock Providers for Concurrent Writers @@ -833,7 +833,7 @@ WITH ( 'connector' = 'hudi', 'path' = 'file:///tmp/hudi_table', 'table.type' = 'MERGE_ON_READ', -'precombine.field' = 'ts' +'ordering.fields' = 'ts' ); ``` diff --git a/website/versioned_docs/version-1.1.1/sql_dml.md b/website/versioned_docs/version-1.1.1/sql_dml.md index 81ad924289f07..64002c02c60b1 100644 --- a/website/versioned_docs/version-1.1.1/sql_dml.md +++ b/website/versioned_docs/version-1.1.1/sql_dml.md @@ -51,7 +51,7 @@ INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name, 1000 :::note Mapping to write operations Hudi offers flexibility in choosing the underlying [write operation](write_operations.md) of a `INSERT INTO` statement using the `hoodie.spark.sql.insert.into.operation` configuration. Possible options include *"bulk_insert"* (large inserts), *"insert"* (with small file management), -and *"upsert"* (with deduplication/merging). If ordering fields are not set, *"insert"* is chosen as the default. For a table with ordering fields set (via `preCombineField`), +and *"upsert"* (with deduplication/merging). If ordering fields are not set, *"insert"* is chosen as the default. For a table with ordering fields set (via `orderingFields`), *"upsert"* is chosen as the default operation. ::: @@ -101,7 +101,7 @@ update hudi_cow_pt_tbl set ts = 1001 where name = 'a1'; ``` :::info -The `UPDATE` operation requires the specification of ordering fields (via `preCombineField`). +The `UPDATE` operation requires the specification of ordering fields (via `orderingFields`). ::: ### Merge Into @@ -138,7 +138,7 @@ For a Hudi table with user configured primary keys, the join condition and the ` For a table where Hudi auto generates primary keys, the join condition in `MERGE INTO` can be on any arbitrary data columns. 
-if the `hoodie.record.merge.mode` is set to `EVENT_TIME_ORDERING`, ordering fields (via `preCombineField`) are required to be set with value in the `UPDATE`/`INSERT` clause. +if the `hoodie.record.merge.mode` is set to `EVENT_TIME_ORDERING`, ordering fields (via `orderingFields`) are required to be set with value in the `UPDATE`/`INSERT` clause. It is enforced that if the target table has primary key and partition key column, the source table counterparts must enforce the same data type accordingly. Plus, if the target table is configured with `hoodie.record.merge.mode` = `EVENT_TIME_ORDERING` where target table is expected to have valid ordering fields configuration, the source table counterpart must also have the same data type. ::: @@ -148,7 +148,7 @@ Examples below ```sql -- source table using hudi for testing merging into non-partitioned table create table merge_source (id int, name string, price double, ts bigint) using hudi -tblproperties (primaryKey = 'id', preCombineField = 'ts'); +tblproperties (primaryKey = 'id', orderingFields = 'ts'); insert into merge_source values (1, "old_a1", 22.22, 900), (2, "new_a2", 33.33, 2000), (3, "new_a3", 44.44, 2000); merge into hudi_mor_tbl as target @@ -199,7 +199,7 @@ CREATE TABLE tableName ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - preCombineField = '_ts' + orderingFields = '_ts' ) LOCATION '/location/to/basePath'; diff --git a/website/versioned_docs/version-1.1.1/sql_queries.md b/website/versioned_docs/version-1.1.1/sql_queries.md index d7eeb9cdb1c40..9310c72fbc624 100644 --- a/website/versioned_docs/version-1.1.1/sql_queries.md +++ b/website/versioned_docs/version-1.1.1/sql_queries.md @@ -210,7 +210,7 @@ CREATE TABLE IF NOT EXISTS hudi_table_merge_mode ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - precombineField = 'ts', + orderingFields = 'ts', recordMergeMode = 'EVENT_TIME_ORDERING' ) LOCATION 'file:///tmp/hudi_table_merge_mode/'; @@ -225,7 +225,7 @@ INSERT INTO hudi_table_merge_mode VALUES (1, 'a1', 900, 20.0); SELECT id, name, ts, price FROM hudi_table_merge_mode; ``` -With `EVENT_TIME_ORDERING`, the record with the larger event time (specified via `precombineField` ordering field) overwrites the record with the +With `EVENT_TIME_ORDERING`, the record with the larger event time (specified via `orderingFields`) overwrites the record with the smaller event time on the same key, regardless of transaction time. 
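+
+As an illustration (a sketch reusing the `hudi_table_merge_mode` table above; the values are made up), a later write carrying a smaller event time does not win the merge:
+
+```sql
+-- ts 800 < 900, so this later-committed version loses the event-time comparison
+INSERT INTO hudi_table_merge_mode VALUES (1, 'a1_late', 800, 30.0);
+
+-- still returns the version with ts = 900
+SELECT id, name, ts, price FROM hudi_table_merge_mode;
+```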
### Snapshot Query with Custom Merge Mode @@ -244,7 +244,7 @@ CREATE TABLE IF NOT EXISTS hudi_table_merge_mode_custom ( TBLPROPERTIES ( type = 'mor', primaryKey = 'id', - precombineField = 'ts', + orderingFields = 'ts', recordMergeMode = 'CUSTOM', 'hoodie.datasource.write.payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload' ) diff --git a/website/versioned_docs/version-1.1.1/write_operations.md b/website/versioned_docs/version-1.1.1/write_operations.md index 6c414b54f6e3c..6c8eb699ce1dd 100644 --- a/website/versioned_docs/version-1.1.1/write_operations.md +++ b/website/versioned_docs/version-1.1.1/write_operations.md @@ -96,7 +96,7 @@ Here are the basic configs relevant to the write operations types mentioned abov | Config Name | Default | Description | |------------------------------------------------|----------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | hoodie.datasource.write.operation | upsert (Optional) | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and there on use upsert/insert. bulk insert uses a disk based write path to scale to load large inputs without need to cache it.

`Config Param: OPERATION` | -| hoodie.datasource.write.precombine.field | (no default) (Optional) | Field used for ordering records before actual write. When two records have the same key value, we will pick the one with the largest value for the ordering field, determined by Object.compareTo(..). Note: This config is deprecated, use `hoodie.table.ordering.fields` instead.

`Config Param: PRECOMBINE_FIELD` | +| hoodie.table.ordering.fields | (N/A) (Optional) | Comma separated fields used in record merging comparison. By default, when two records have the same key value, the one with the largest value for the ordering field, determined by Object.compareTo(..), is picked. If multiple fields are configured, comparison is made on the first field; if the first field values are the same, comparison is made on the second field, and so on.<br></br>
`Config Param: ORDERING_FIELDS` | | hoodie.combine.before.insert | false (Optional) | When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before writing to storage.

`Config Param: COMBINE_BEFORE_INSERT` | | hoodie.datasource.write.insert.drop.duplicates | false (Optional) | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.

`Config Param: INSERT_DROP_DUPS` | | hoodie.bulkinsert.sort.mode | NONE (Optional) | org.apache.hudi.execution.bulkinsert.BulkInsertSortMode: Modes for sorting records during bulk insert.
`Config Param: BULK_INSERT_SORT_MODE` |
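+
+As a quick sketch of how these options interact (table and column names here are illustrative, not from the docs above): with ordering fields set, `INSERT INTO` defaults to the upsert operation, so the version with the larger ordering value wins for a given key.
+
+```sql
+-- Ordering fields are set, so INSERT INTO defaults to the "upsert" operation
+CREATE TABLE IF NOT EXISTS ops_demo (id INT, name STRING, ts BIGINT) USING hudi
+TBLPROPERTIES (primaryKey = 'id', orderingFields = 'ts');
+
+INSERT INTO ops_demo VALUES (1, 'first', 100);
+-- Same key, larger ordering value: the upsert keeps this version
+INSERT INTO ops_demo VALUES (1, 'second', 200);
+```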