The existing S3 connector offers a number of options for configuring connections for CSV file types. To ensure backwards compatibility, we'll want to update the `AbstractFileBasedSpec` and config adapter to handle them.

In #28131, we're creating a new S3 `FileBasedConfig` object. This ticket involves extending that object to handle CSV-specific options, and will also require creating a custom parser that handles the old options.
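As a rough illustration of the adapter described above, the sketch below converts a legacy S3 spec into a new-style file-based config. The legacy field names (`dataset`, `path_pattern`, `provider.bucket`, `format`) reflect the existing S3 spec, but the shape of the output (`streams`, `globs`, nested `format`) is an assumption about the new `FileBasedConfig`, not its confirmed schema:

```python
# Hypothetical sketch of the config adapter: translate a legacy S3 spec
# into the new file-based layout. The output keys ("streams", "globs",
# "format") are assumptions about FileBasedConfig, to be confirmed.
def convert_legacy_config(legacy: dict) -> dict:
    fmt = dict(legacy.get("format", {}))
    fmt.pop("filetype", None)  # re-added explicitly below
    return {
        "bucket": legacy.get("provider", {}).get("bucket"),
        "streams": [
            {
                "name": legacy.get("dataset"),
                "globs": [legacy.get("path_pattern", "**")],
                "format": {"filetype": "csv", **fmt},
            }
        ],
    }
```

The CSV-specific keys inside `format` would still need the per-option mapping discussed below; this only shows the overall wiring.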
Verify that all options that we still support are appropriately mapped to the corresponding name in the file-based CDK:

- `delimiter`
- `quote_char`
- `escape_char`
- `encoding`
- `double_quote`
- `newlines_in_values`
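The name mapping for these options could be as simple as a lookup table in the adapter. This is a minimal sketch assuming the file-based CDK keeps the same field names for all six options (an assumption the verification step above should confirm); unknown keys are dropped rather than forwarded:

```python
# Assumed 1:1 mapping from legacy S3 CSV options to file-based CDK
# CsvFormat field names; adjust any entry the verification step finds
# to be renamed in the new CDK.
LEGACY_TO_CDK = {
    "delimiter": "delimiter",
    "quote_char": "quote_char",
    "escape_char": "escape_char",
    "encoding": "encoding",
    "double_quote": "double_quote",
    "newlines_in_values": "newlines_in_values",
}

def map_legacy_csv_options(legacy_format: dict) -> dict:
    """Translate the keys of a legacy CSV format block, dropping unknowns."""
    return {
        LEGACY_TO_CDK[key]: value
        for key, value in legacy_format.items()
        if key in LEGACY_TO_CDK
    }
```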
Handle options that we don't offer in the new file-based CDK:

- `infer_datatypes`: configures whether a schema for the source should be inferred from the current data. If this is set to True, we'll want to infer & cast types even if the user has not provided a schema. Unfortunately, this is set to True for the vast majority of connectors, so we should handle it rather than let this become a breaking change.
- `additional_reader_options`: options provided to the CSV reader. Only a handful of connectors have these set. The file-based CDK will be updated to support the following:
  - `strings_can_be_null`: this should always be True.
  - `null_values`: this should be offered as a CSV-specific config option, so the spec should be updated accordingly.
  - We should confirm that these options are not necessary: `autogenerate_column_names`, `compression`, `include_missing_columns`, and `check_utf8`.
  - One connector is using `{"column_types": {"Zipcode": "string"}}`. Because this is a single connector, we should consider deprecating this option.
  - Double-check that we have a plan to either support or deprecate all `additional_reader_options` that are in use by connectors in cloud.
- `advanced_options`: options provided to PyArrow, used by a handful of connectors. Instead of blindly passing these options to PyArrow, we should deliberately surface those that we want to support and deprecate the rest, as follows:
  - `column_types`: this allows us to support headerless CSVs. We should surface it as an option in the CSV-specific section of the spec.
  - `skip_rows` & `skip_rows_after_names`: select one of these and offer it as a CSV-specific config option. (For existing connectors, we should be able to support both by calculating the value for `skip_rows` based on `skip_rows_after_names`, or vice versa.)
  - Verify that `encoding` is already handled, and that compression will be handled by the stream reader without requiring additional config options.
  - Double-check that we have a plan to either support or deprecate all `advanced_options` that are in use by connectors in cloud.
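The custom parser for these two legacy catch-all fields could be sketched as follows. The set of supported keys and the idea of collapsing the two skip options into one count are assumptions to validate against the new CDK, not confirmed behaviour:

```python
import json

# Assumed set of additional_reader_options keys the new CDK will support.
SUPPORTED_READER_OPTIONS = {"strings_can_be_null", "null_values"}

def split_reader_options(raw: str) -> tuple[dict, dict]:
    """Split a legacy additional_reader_options JSON string into the keys
    the new CDK supports and the leftovers that need an explicit
    support-or-deprecate decision."""
    options = json.loads(raw or "{}")
    supported = {k: v for k, v in options.items() if k in SUPPORTED_READER_OPTIONS}
    unsupported = {k: v for k, v in options.items() if k not in SUPPORTED_READER_OPTIONS}
    return supported, unsupported

def combined_skip_rows(advanced_options: dict) -> int:
    """Collapse skip_rows and skip_rows_after_names into a single count,
    assuming the new spec exposes only one skip option. Whether summing is
    semantically valid (it ignores where the header row falls) needs to be
    verified against PyArrow's behaviour."""
    return int(advanced_options.get("skip_rows", 0)) + int(
        advanced_options.get("skip_rows_after_names", 0)
    )
```

The `unsupported` dict from `split_reader_options` is also a convenient place to collect the per-connector impact report called for in the acceptance criteria.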
Acceptance Criteria

- The existing CSV config options are mapped and handled appropriately by the S3 connector.
- Any options that we cannot support are identified, along with the connectors that will be impacted.