Handle uppercase letters in imported data column names #15231
Comments
I've checked and the loss of the capital letters occurs neither in the cartodbfication nor in odbc_fdw. So it must be in the import API, perhaps related to the DB connectors. |
The problem seems to be here: Lines 1347 to 1348 in 5b1a8a5
This is executed when a new Table is created for an existing table; which is performed through TableRegistrar by the data imports. I think it would be nice to replace the
We're already using I think the greatest risk of breaking things with this change would be with existing sync tables, because column names would change for existing tables. We could add a rollbar trace for syncs that have Another option would be to only change the sanitization for new import types (like the BQ connector) to avoid affecting existing syncs, by sanitizing column names before table registration, but I don't like having such difference between import methods. |
Thanks for the research @jgoizueta That sounds like a plan 👍 |
There are some changes of mine that already modified sanitization of col names: https://github.com/CartoDB/cartodb/pull/14937/files I couldn't reproduce the issue with a regular import, so obviously there's still a mismatch between imports and syncs, which is unfortunate :( |
I could reproduce the issue with the Postgres DB Connector. I'm 99% sure this problem is specific to DB Connectors. On the bright side, any potential fix will be safer as it won't generically affect other kinds of imports/syncs. |
By instrumenting the code and reproducing the case I figured out that it does something like (in pseudo code) So I don't really get why it affects DB Connectors but not regular imports. I suspect The patch:
diff --git a/app/models/table.rb b/app/models/table.rb
index 16e3f30f9b..b1099a38c4 100644
--- a/app/models/table.rb
+++ b/app/models/table.rb
@@ -1329,6 +1329,9 @@ class Table
valid_column_name = get_valid_column_name(table_name, column_name, options)
if valid_column_name != column_name
+ puts "<RTORRE>"
+ puts Thread.current.backtrace.join("\n")
+ puts "</RTORRE>"
connection.run(%Q{ALTER TABLE "#{database_schema}"."#{table_name}" RENAME COLUMN "#{column_name}" TO "#{valid_column_name}";})
end
The trace I got (cleaned up of framework hooks and beautified):
|
Column names are usually converted to lowercase when they go through
test=# CREATE TABLE capital_column_names (id int, OtherColumn varchar);
CREATE TABLE
test=# \d capital_column_names
         Table "public.capital_column_names"
   Column    |       Type        | Collation | Nullable | Default
-------------+-------------------+-----------+----------+---------
 id          | integer           |           |          |
 othercolumn | character varying |           |          |
vs
test=# CREATE TABLE capital_column_names_for_real (id int, "OtherColumn" varchar);
CREATE TABLE
test=# \d capital_column_names_for_real
     Table "public.capital_column_names_for_real"
   Column    |       Type        | Collation | Nullable | Default
-------------+-------------------+-----------+----------+---------
 id          | integer           |           |          |
 OtherColumn | character varying |           |          |
(mind that column names like that one have to be quoted to operate with them) |
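The difference between the two psql sessions above comes from PostgreSQL folding unquoted identifiers to lowercase while keeping quoted ones as written. A minimal Ruby sketch of that rule follows; `fold_identifier` is a hypothetical helper written for illustration, not part of CartoDB or of any PostgreSQL client library, and it only handles plain ASCII names.

```ruby
# Hypothetical helper: mimics how PostgreSQL resolves a column identifier
# as it is written in SQL text.
def fold_identifier(token)
  if token.length >= 2 && token.start_with?('"') && token.end_with?('"')
    # Quoted identifier: case is preserved; a doubled quote stands for one quote.
    token[1..-2].gsub('""', '"')
  else
    # Unquoted identifier: folded to lowercase.
    token.downcase
  end
end

puts fold_identifier('OtherColumn')    # unquoted form
puts fold_identifier('"OtherColumn"')  # quoted form
```

The first call yields the lowercase name that appears in the first table description, the second keeps the capitals as in the second one.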
https://gdal.org/drivers/vector/pg.html
(we're using This shouldn't affect this issue much, it just makes it harder to figure out, reproduce and test because of the many moving parts with different behaviors. |
Being pragmatic, a possible fix (not the best but good enough): #15252 |
This PR just adds a warning but does not perform any actual change: #15253 |
There are many such instances of |
I see a few problems with this approach (in order of difficulty to solve):
I'm reverting the PR to add the traces, as it easily produces 6K log entries/day (one per column-sync). |
Many of the cases are due not to capitals but to multiple underscores vs a single underscore, but there's a good number of lost capitals too. Interesting data. The problem is worse than I had anticipated: many syncs would be affected by the change. We could change the behaviour only for new connector providers or only for syncs created since some date, but I find that cumbersome. |
I was wrong about
The problem
Imported and synced tables have For file imports this has no effect because For tables created manually (SQL API) and cartodbfied this is not executed when the table is registered, so they can have uppercase letters, etc. in the column names. The detailed process:
Proposed Versioning Implementation
Let's assume we already have some method that applies normalization for any version that will replace
module ColumnNormalization
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1

  def self.normalize_columns(table_name, version = CURRENT_NORMALIZATION)
    # ...
  end
end
In all cases except for manually created tables (i.e. all cases that sanitize columns) the UserTable record has a non-NULL data_import_id. We can use the DataImport record to store the kind of normalization performed (using the
class DataImport
  attr_accessor :normalization

  def before_create
    # ... (keep existing before create stuff here)
    self.normalization ||= ColumnNormalization::CURRENT_NORMALIZATION
    self.extra_options.merge! normalization: normalization
  end

  # Imports created before versioning have no stored value,
  # so they fall back to the initial normalization.
  def applied_normalization
    extra_options[:normalization] || ColumnNormalization::INITIAL_NORMALIZATION
  end
end
Normalization can be applied through a Table instance method:
class Table
  # Use without arguments to normalize the registered table,
  # use with a different table name to apply the same normalization
  # to other (unregistered) table
  def normalize_columns(table_name = name)
    if data_import
      version = data_import.applied_normalization
      ColumnNormalization.normalize_columns(table_name, version)
    end
  end
end
Now, Synchronizations ( We're doing an additional query to get the Table here, which is done also in other places like Then we should make Internally the normalization implementation used in The idea is having a single source for (low level) sanitization (e.g. Note normalization should also handle reserved words/column names which is done inconsistently now.
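To make the versioning idea concrete, here is a small self-contained sketch of how a stored version could select the rule applied to a column name. The module name `ColumnNormalizationSketch`, the two rules in `RULES`, and the `applied_normalization_for` helper are all assumptions made for illustration; they are not the actual CartoDB sanitization code, and the real rules are more involved.

```ruby
# Sketch only: the rules below are illustrative stand-ins.
module ColumnNormalizationSketch
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1

  # One hypothetical rule per version.
  RULES = {
    # Legacy-style rule: drop every character outside [a-z0-9_].
    INITIAL_NORMALIZATION => ->(name) { name.gsub(/[^a-z0-9_]/, '') },
    # Newer-style rule: lowercase first, then replace remaining invalid characters.
    CURRENT_NORMALIZATION => ->(name) { name.downcase.gsub(/[^a-z0-9_]/, '_') }
  }.freeze

  def self.normalize_name(name, version = CURRENT_NORMALIZATION)
    RULES.fetch(version).call(name)
  end
end

# Stand-in for DataImport#applied_normalization: options without a stored
# version fall back to the initial behaviour.
def applied_normalization_for(extra_options)
  extra_options[:normalization] || ColumnNormalizationSketch::INITIAL_NORMALIZATION
end

old_options = {} # import created before versioning existed
new_options = { normalization: ColumnNormalizationSketch::CURRENT_NORMALIZATION }

puts ColumnNormalizationSketch.normalize_name('OtherColumn', applied_normalization_for(old_options))
puts ColumnNormalizationSketch.normalize_name('OtherColumn', applied_normalization_for(new_options))
```

Because the version travels with the import record, an existing sync keeps producing the names it produced before, while new imports pick up the current rule.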
Details
The method
Now
Imports call it through the Table constructor ( Note that we have foreign keys We have Note that syncs can't handle multiple files: cartodb/app/models/synchronization/adapter.rb Lines 29 to 30 in 9391e76
Sanitization Bonanza
There's a plethora of other sanitization methods in addition to Table.sanitize_columns
# Already done by Table#sanitize_columns in app/models/table.rb
#sanitize_columns!
Table#add_column! and #modify_column! use String#sanitize, and use Table::RESERVED_COLUMN_NAMES to rename reserved names to _xxx.
Also Table#create_table_in_database! calls String#sanitize, as do some rake tasks, etc.
DB::Sanitize#sanitize_identifier
It is used for table names only, not columns.
(Called by ApiKey#create_db_config, FDW#server_name, ValidTableNameProposer#propose_valid_table_name, ConnectorRunner#result_table_name, ...)
Uses DB::Sanitize::RESERVED_WORDS
StringSanitizer#sanitize
Is used by Column#sanitized_name (uses Column::RESERVED_WORDS).
Used by Column#sanitize which seems unused.
Other Redundancies
Two RESERVED_WORDS:
- In DB::Sanitize used by String#sanitize_column_name
- In Column used by DataImport#from_table and Column#sanitized_name
Two RESERVED_COLUMN_NAMES:
- In CartoDB used by String#sanitize_column_name
- In Table used by Table#add_column!, Table#modify_column! (the latter through rename_column)
The trick there is that once you applied a certain version, you have to apply the same version from that moment on. That is, it needs to store the version used (or the mapping of source cols to dest cols) somewhere, along with the sync.
Is that
I'd fix it first, avoid extra risk of refactor, then consider it
That can be solved by quoting column names. Actually many issues can be solved that way. https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS (refer to quoted identifiers) E.g.:
postgres=# CREATE TABLE test_table("INSERT" int);
CREATE TABLE
About the plan, I'd start with the versioning and then move on from there.
My 2 cents |
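The quoting suggestion above can be sketched in Ruby as follows. Both `quote_identifier` and `rename_column_sql` are hypothetical helpers written for this example; they follow the PostgreSQL rule that a quoted identifier is wrapped in double quotes with any embedded double quote doubled. In real code the database library's own identifier quoting is probably the better choice.

```ruby
# Hypothetical helper: wrap an identifier in double quotes, doubling any
# embedded double quote so the result is a valid quoted identifier.
def quote_identifier(name)
  '"' + name.gsub('"', '""') + '"'
end

# Hypothetical helper: build a rename statement in which every identifier is
# quoted, so reserved words and capital letters survive unchanged.
def rename_column_sql(schema, table, from, to)
  "ALTER TABLE #{quote_identifier(schema)}.#{quote_identifier(table)} " \
    "RENAME COLUMN #{quote_identifier(from)} TO #{quote_identifier(to)};"
end

puts quote_identifier('INSERT')
puts rename_column_sql('public', 'test_table', 'INSERT', 'OtherColumn')
```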
While preparing some test cases to check the PR in staging I've noticed this: due to the double sanitization (before/after cartodbfication) described above, and the way sanitization was applied, we had cases of columns that changed their name after the first synchronization. 🤦♂ For example, column like So, if we want to preserve the exact behaviour of existing syncs we should keep the double sanitization; but for new syncs we shouldn't (unless we make the new sanitization idempotent). |
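Idempotency here means that sanitizing an already sanitized name changes nothing, so applying the step twice is harmless. A quick way to check that property is sketched below; both sanitizers are hypothetical and were chosen only to show the difference, they are not the rules CartoDB applies.

```ruby
# Hypothetical sanitizer that collapses only the first pair of underscores per
# pass, so a second pass can still change the name.
NON_IDEMPOTENT = ->(name) { name.downcase.sub('__', '_') }

# Hypothetical sanitizer that collapses every run of underscores at once, so a
# second pass is a no-op.
IDEMPOTENT = ->(name) { name.downcase.gsub(/_+/, '_') }

# True when applying the sanitizer twice gives the same result as applying it once.
def idempotent_for?(sanitizer, name)
  once = sanitizer.call(name)
  sanitizer.call(once) == once
end

puts idempotent_for?(NON_IDEMPOTENT, 'a____b')
puts idempotent_for?(IDEMPOTENT, 'a____b')
```

A check like this, run over a sample of real column names, could serve as a regression test before dropping the second sanitization pass for new syncs.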
For the current state of this issue, please look at this comment in the PR: #15326 (comment) |
Close via #15326 |
Column names in imported data are normalized as lower-case identifiers by removing upper case letters and other characters, then adding numerical suffixes if necessary to avoid name collisions. For example, when importing a table with columns ID_X, ID_Y, IC_gral they get renamed as _, _1, _gral.
It would be nicer to preserve uppercase letters by converting them to lowercase, but changing this behaviour for existing import sources may break use cases that are already in use.
For new sources, e.g. the BQ connector, we could change it without risk of breaking anything, but it might not be easy (is the name change part of cartodbfication?).
In any case there are some questions I think we should consider for discussion:
cc @alejandrohall @alonsogarciapablo @ilbambino
note that a table with all columns in uppercase would be imported as
_, _1, _2, _3, ...
which is very unfortunate.
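The renaming described in this issue can be reproduced with a short sketch. Everything here is a hypothetical reconstruction made to match the ID_X, ID_Y, IC_gral example: `strip_rule`, `downcase_rule` and `dedupe` are invented names, and the suffix format (a bare counter appended to the name) is an assumption, not the actual CartoDB implementation.

```ruby
# Hypothetical legacy-style rule: characters outside [a-z0-9_] are removed.
def strip_rule(name)
  name.gsub(/[^a-z0-9_]/, '')
end

# Hypothetical alternative: lowercase first, then replace what is still invalid.
def downcase_rule(name)
  name.downcase.gsub(/[^a-z0-9_]/, '_')
end

# Append a numeric suffix until the name is unique within the table.
# The suffix format is an assumption chosen to match the example in the issue.
def dedupe(names)
  taken = []
  names.map do |name|
    candidate = name
    suffix = 0
    while taken.include?(candidate)
      suffix += 1
      candidate = "#{name}#{suffix}"
    end
    taken << candidate
    candidate
  end
end

columns = %w[ID_X ID_Y IC_gral]
p dedupe(columns.map { |c| strip_rule(c) })    # names after stripping capitals
p dedupe(columns.map { |c| downcase_rule(c) }) # names after lowercasing instead
```

With the stripping rule the three columns collapse to near-identical names that need suffixes, while lowercasing keeps them distinct and readable.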