The existing S3 connector offers a number of options for configuring connections for CSV file types. To ensure backwards compatibility, we'll want to update the `AbstractFileBasedSpec` and config adapter to handle them.

In #28131, we're creating a new S3 `FileBasedConfig` object. This ticket involves extending that object to handle CSV-specific options, and will also require creating a custom parser that handles the old options.
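As a rough illustration of the adapter described above, the sketch below converts a legacy S3 spec into a new-style file-based config. The legacy field names (`dataset`, `path_pattern`, `provider.bucket`, `format`) reflect the existing S3 spec, but the shape of the output (`streams`, `globs`, nested `format`) is an assumption about the new `FileBasedConfig`, not its confirmed schema:

```python
# Hypothetical sketch of the config adapter: translate a legacy S3 spec
# into the new file-based layout. The output keys ("streams", "globs",
# "format") are assumptions about FileBasedConfig, to be confirmed.
def convert_legacy_config(legacy: dict) -> dict:
    fmt = dict(legacy.get("format", {}))
    fmt.pop("filetype", None)  # re-added explicitly below
    return {
        "bucket": legacy.get("provider", {}).get("bucket"),
        "streams": [
            {
                "name": legacy.get("dataset"),
                "globs": [legacy.get("path_pattern", "**")],
                "format": {"filetype": "csv", **fmt},
            }
        ],
    }
```

The CSV-specific keys inside `format` would still need the per-option mapping discussed below; this only shows the overall wiring.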
Verify that all options that we still support are appropriately mapped to the corresponding name in the file-based CDK:

- `delimiter`
- `quote_char`
- `escape_char`
- `encoding`
- `double_quote`
- `newlines_in_values`
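The name mapping for these options could be as simple as a lookup table in the adapter. This is a minimal sketch assuming the file-based CDK keeps the same field names for all six options (an assumption the verification step above should confirm); unknown keys are dropped rather than forwarded:

```python
# Assumed 1:1 mapping from legacy S3 CSV options to file-based CDK
# CsvFormat field names; adjust any entry the verification step finds
# to be renamed in the new CDK.
LEGACY_TO_CDK = {
    "delimiter": "delimiter",
    "quote_char": "quote_char",
    "escape_char": "escape_char",
    "encoding": "encoding",
    "double_quote": "double_quote",
    "newlines_in_values": "newlines_in_values",
}

def map_legacy_csv_options(legacy_format: dict) -> dict:
    """Translate the keys of a legacy CSV format block, dropping unknowns."""
    return {
        LEGACY_TO_CDK[key]: value
        for key, value in legacy_format.items()
        if key in LEGACY_TO_CDK
    }
```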
Handle options that we don't offer in the new file-based CDK:

- `infer_datatypes`: configures whether a schema for the source should be inferred from the current data. If this is set to True, we'll want to infer & cast types even if the user has not provided a schema. Unfortunately, this is set to True for the vast majority of connectors, so we should handle it rather than let this become a breaking change.
- `additional_reader_options`: options provided to the CSV reader. Only a handful of connectors have these set. The file-based CDK will be updated to support the following:
  - `strings_can_be_null`: this should always be True.
  - `null_values`: this should be offered as a CSV-specific config option, so the spec should be updated accordingly.
  - We should confirm that these options are not necessary: `autogenerate_column_names`, `compression`, `include_missing_columns`, and `check_utf8`.
  - One connector is using `{"column_types": {"Zipcode": "string"}}`. Because this is a single connector, we should consider deprecating this option.
  - Double-check that we have a plan to either support or deprecate all `additional_reader_options` that are in use by connectors in cloud.
- `advanced_options`: options provided to PyArrow, used by a handful of connectors. Instead of blindly passing these options to PyArrow, we should deliberately surface those that we want to support and deprecate the rest, as follows:
  - `column_types`: this allows us to support headerless CSVs. We should surface it as an option in the CSV-specific section of the spec.
  - `skip_rows` & `skip_rows_after_names`: select one of these and offer it as a CSV-specific config option. (For existing connectors, we should be able to support both by calculating the value for `skip_rows` based on `skip_rows_after_names`, or vice versa.)
  - Verify that `encoding` is already handled, and that compression will be handled by the stream reader without requiring additional config options.
  - Double-check that we have a plan to either support or deprecate all `advanced_options` that are in use by connectors in cloud.
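The custom parser for these two legacy catch-all fields could be sketched as follows. The set of supported keys and the idea of collapsing the two skip options into one count are assumptions to validate against the new CDK, not confirmed behaviour:

```python
import json

# Assumed set of additional_reader_options keys the new CDK will support.
SUPPORTED_READER_OPTIONS = {"strings_can_be_null", "null_values"}

def split_reader_options(raw: str) -> tuple[dict, dict]:
    """Split a legacy additional_reader_options JSON string into the keys
    the new CDK supports and the leftovers that need an explicit
    support-or-deprecate decision."""
    options = json.loads(raw or "{}")
    supported = {k: v for k, v in options.items() if k in SUPPORTED_READER_OPTIONS}
    unsupported = {k: v for k, v in options.items() if k not in SUPPORTED_READER_OPTIONS}
    return supported, unsupported

def combined_skip_rows(advanced_options: dict) -> int:
    """Collapse skip_rows and skip_rows_after_names into a single count,
    assuming the new spec exposes only one skip option. Whether summing is
    semantically valid (it ignores where the header row falls) needs to be
    verified against PyArrow's behaviour."""
    return int(advanced_options.get("skip_rows", 0)) + int(
        advanced_options.get("skip_rows_after_names", 0)
    )
```

The `unsupported` dict from `split_reader_options` is also a convenient place to collect the per-connector impact report called for in the acceptance criteria.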
Acceptance Criteria

- The existing CSV config options are mapped and handled appropriately by the S3 connector.
- Any options that we cannot support are identified, along with the connectors that will be impacted.