
Handle uppercase letters in imported data column names #15231

Closed
jgoizueta opened this issue Nov 13, 2019 · 20 comments
@jgoizueta
Contributor

jgoizueta commented Nov 13, 2019

Column names in imported data are normalized as lower-case identifiers by removing upper case letters and other characters, then adding numerical suffixes if necessary to avoid name collisions. For example, when importing a table with columns ID_X, ID_Y, IC_gral they get renamed as _, _1, _gral.
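A minimal Ruby sketch of the behaviour described (the `normalize` helper and the suffixing logic are hypothetical reconstructions for illustration; the real logic lives elsewhere in the codebase):

```ruby
# Hypothetical sketch of the current normalization: strip anything
# outside [a-z0-9], squash runs of underscores, then add a numeric
# suffix on collision.
def normalize(name, taken = [])
  candidate = name.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_')
  if taken.include?(candidate)
    suffix = 1
    suffix += 1 while taken.include?("#{candidate}#{suffix}")
    candidate = "#{candidate}#{suffix}"
  end
  candidate
end

taken = []
%w[ID_X ID_Y IC_gral].map { |n| taken << normalize(n, taken); taken.last }
# => ["_", "_1", "_gral"]
```

This reproduces the renaming from the example above: the uppercase letters are stripped rather than downcased, so almost all the name information is lost.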

It would be nicer to preserve uppercase letters by converting them to lowercase, but changing this behaviour for existing import sources may break existing use cases.

For new sources, e.g. the BQ connector, we could change it without risk of breaking anything, but it might not be easy (is the name change part of cartodbfication?).

In any case there are some questions I think we should consider for discussion:

  • Should we change this behaviour globally (and hope we don't break too many cases)?
  • Are uppercase letters common in BQ column names? (same for other data sources)

cc @alejandrohall @alonsogarciapablo @ilbambino

Note that a table with all columns in uppercase would be imported as _, _1, _2, _3, ..., which is very unfortunate.

@alrocar
Contributor

alrocar commented Nov 13, 2019

AFAIK the column name normalization is part of the Import API, but it's also part of the import process with COPY in CARTOframes.

I wasn't aware we were dropping uppercase letters... See this and this, so maybe we have a bug somewhere.

How can we reproduce the issue?

@jgoizueta
Contributor Author

I've checked, and the loss of the capital letters occurs neither in the cartodbfication nor in odbc_fdw. So it must be in the Import API, perhaps related to the DB connectors.

@jgoizueta
Contributor Author

The problem seems to be here:

cartodb/app/models/table.rb

Lines 1347 to 1348 in 5b1a8a5

candidate_column_name = candidate_column_name.gsub(/[^a-z0-9]/,'_').gsub(/_{2,}/, '_')

This is executed when a new Table is created for an existing table; which is performed through TableRegistrar by the data imports.

I think it would be nice to replace the gsub(/[^a-z0-9]/,'_') with StringSanitizer#sanitize, which downcases and also transliterates non-ASCII letters to a-z:

CartoDB::Importer2::StringSanitizer.new.sanitize('aAÁäÂ') # => "aaaaa"

We're already using StringSanitizer in Column to sanitize geometry column names (in the importer Georeferencer).

I think the greatest risk of breaking things with this change would be with existing sync tables, because column names would change for existing tables.

We could add a rollbar trace for syncs that have [^a-z0-9] characters in the column names (before registration) and run it for some days before deciding about doing the change.

Another option would be to only change the sanitization for new import types (like the BQ connector) to avoid affecting existing syncs, by sanitizing column names before table registration, but I don't like having such difference between import methods.
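For illustration, here's a sketch of what downcasing before stripping would yield, versus the current regex (plain String methods shown; StringSanitizer additionally transliterates non-ASCII letters, as in the example above):

```ruby
# "current" mirrors the regex quoted from table.rb; "proposed" is a
# hypothetical variant that downcases first, preserving the letters.
current  = ->(name) { name.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_') }
proposed = ->(name) { name.downcase.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_') }

current.call('ID_X')   # => "_"
proposed.call('ID_X')  # => "id_x"
```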

@alrocar
Contributor

alrocar commented Nov 15, 2019

Thanks for the research @jgoizueta

That sounds like a plan 👍

@rafatower rafatower self-assigned this Nov 19, 2019
@rafatower
Contributor

There are some changes of mine that already modified sanitization of col names:

https://github.com/CartoDB/cartodb/pull/14937/files

I couldn't reproduce the issue with a regular import, so obviously there's yet a mismatch between imports and syncs, which is unfortunate :(

@rafatower
Contributor

I could reproduce the issue with Postgres DB Connector. I'm 99% sure this problem is specific to DB Connectors.

On the bright side, any potential fix will be safer, as it won't generically affect other kinds of imports/syncs.

@rafatower
Contributor

By instrumenting the code and reproducing the case, I figured out that it does something like (in pseudocode) importer->run->register->table_registrar->register->table->before_create->sanitize_columns->ensure_column_has_valid_name.

So I don't really get why it affects DB Connectors but not regular imports. I suspect ogr2ogr may have something to do with it.

The patch:

diff --git a/app/models/table.rb b/app/models/table.rb
index 16e3f30f9b..b1099a38c4 100644
--- a/app/models/table.rb
+++ b/app/models/table.rb
@@ -1329,6 +1329,9 @@ class Table
 
     valid_column_name = get_valid_column_name(table_name, column_name, options)
     if valid_column_name != column_name
+      puts "<RTORRE>"
+      puts Thread.current.backtrace.join("\n")
+      puts "</RTORRE>"
       connection.run(%Q{ALTER TABLE "#{database_schema}"."#{table_name}" RENAME COLUMN "#{column_name}" TO "#{valid_column_name}";})
     end
 

The trace I got (cleaned up of framework hooks and beautified):

cartodb/app/models/table.rb:1333:in `ensure_column_has_valid_name'
cartodb/app/models/table.rb:1313:in `block in sanitize_columns'
cartodb/app/models/table.rb:1310:in `each'
cartodb/app/models/table.rb:1310:in `sanitize_columns'
cartodb/app/models/table.rb:230:in `import_to_cartodb'
cartodb/app/models/table.rb:336:in `before_create'
cartodb/app/models/carto/user_table.rb:281:in `service_before_create'
cartodb/app/models/table.rb:79:in `save'
cartodb/app/models/table_registrar.rb:16:in `register'
cartodb/app/connectors/importer.rb:266:in `persist_metadata'
cartodb/app/connectors/importer.rb:118:in `register'
cartodb/app/connectors/importer.rb:90:in `block (2 levels) in register_results'
cartodb/app/connectors/importer.rb:89:in `each'
cartodb/app/connectors/importer.rb:89:in `block in register_results'
cartodb/lib/carto/ghost_tables_manager.rb:54:in `block in run_synchronized'
cartodb/lib/carto/bolt.rb:27:in `run_locked'
cartodb/lib/carto/ghost_tables_manager.rb:53:in `run_synchronized'
cartodb/app/connectors/importer.rb:83:in `register_results'
cartodb/app/connectors/importer.rb:73:in `run'
cartodb/app/models/data_import.rb:851:in `execute_importer'
cartodb/app/models/data_import.rb:440:in `dispatch'
cartodb/app/models/data_import.rb:188:in `run_import!'
cartodb/lib/resque/importer_jobs.rb:8:in `block in perform'
cartodb/lib/resque/base_job.rb:19:in `run_action'
cartodb/lib/resque/importer_jobs.rb:8:in `perform'

@rafatower
Contributor

Column names are usually converted to lowercase when they go through ogr2ogr, but it is because of the way PG works. E.g:

test=# CREATE TABLE capital_column_names (id int, OtherColumn varchar);
CREATE TABLE
test=# \d capital_column_names 
               Table "public.capital_column_names"
   Column    |       Type        | Collation | Nullable | Default 
-------------+-------------------+-----------+----------+---------
 id          | integer           |           |          | 
 othercolumn | character varying |           |          | 

vs

test=# CREATE TABLE capital_column_names_for_real (id int, "OtherColumn" varchar);
CREATE TABLE
test=# \d capital_column_names_for_real 
           Table "public.capital_column_names_for_real"
   Column    |       Type        | Collation | Nullable | Default 
-------------+-------------------+-----------+----------+---------
 id          | integer           |           |          | 
 OtherColumn | character varying |           |          | 

(mind that column names like that one have to be quoted to operate with them)

@rafatower
Contributor

ogr2ogr still has something to do:

https://gdal.org/drivers/vector/pg.html

LAUNDER: This may be “YES” to force new fields created on this layer to have their field names “laundered” into a form more compatible with PostgreSQL. This converts to lower case and converts some special characters like “-” and “#” to “_”. If “NO” exact names are preserved. The default value is “YES”. If enabled the table (layer) name will also be laundered.

(we're using LAUNDER defaulting to YES, and it seems like a bad idea to me to start setting it to NO at this point)

This shouldn't affect this issue much; it just makes it harder to figure out, reproduce and test because of the many moving parts with different behaviours.

@rafatower
Contributor

Being pragmatic, a possible fix (not the best but good enough): #15252

@rafatower
Contributor

This PR just adds a warning but does not perform any actual change: #15253

@rafatower
Contributor

There are many instances of differences in column-name sanitization/normalization. See Rollbar, #40477

@rafatower
Contributor

I see a few problems with this approach (in order of difficulty to solve):

  • Inspecting the backtraces from the traces shows there are affected syncs. They will be broken unless we add extra logic.
  • That sanitization produces invalid identifiers that start with digits, e.g. 10__name.
  • It does not account for collisions in column names.
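The leading-digit problem can be sketched directly (the `sanitize` lambda and the underscore-prefix guard are illustrative, not the actual code; note this variant keeps underscores, which reproduces the 10__name example):

```ruby
# A sanitized name starting with a digit is not a valid unquoted
# PostgreSQL identifier; a common fix is prefixing an underscore.
sanitize = ->(name) { name.downcase.gsub(/[^a-z0-9_]/, '_') }

bad   = sanitize.call('10  Name')          # => "10__name" (invalid unquoted identifier)
fixed = bad =~ /\A[0-9]/ ? "_#{bad}" : bad # => "_10__name"
```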

I'm reverting the PR to add the traces, as it easily produces 6K log entries/day (one per column-sync).

@jgoizueta
Contributor Author

Many of the cases are due not to capitals but to multiple underscores vs. a single underscore, though a good number of capitals are lost too. Interesting data. The problem is worse than I had anticipated: many syncs would be affected by the change.

We could change the behaviour only for new connector providers, or only for syncs created since some date, but I find that cumbersome.

@rafatower rafatower removed their assignment Dec 2, 2019
@jgoizueta
Contributor Author

jgoizueta commented Dec 10, 2019

I was wrong about Column using StringSanitizer (there's a #sanitize method, but it is not used; Column is used by Georeferencer, but not its sanitize method), and the whole sanitization picture in our current database is more complicated, but I've taken a deep look into it and survived to tell the tale.

The problem

Imported and synced tables have Table.sanitize_columns called on the new tables, which strips uppercase letters and special characters.

For file imports this has no effect, because ogr2ogr has already converted column names to lowercase.

For tables created manually (SQL API) and cartodbfied this is not executed when the table is registered, so they can have uppercase letters, etc. in the column names.

The detailed process:

  • Imported tables: DataImport#run (dispatch/new_importer/new_importer_with_connector, etc.)
    uses a TableRegistrar object to create a Table (setting migrate_existing_table;
    its presence in Table#before_create makes it call Table.sanitize_columns).
  • Sync tables: no new UserTable record is created, so before_create is not called,
    but Synchronization::Adapter#run explicitly calls Table.sanitize_columns twice (before cartodbfication through #import_cleanup and after through #setup_table).
  • Manually cartodbfied tables are registered by GTM; the table is created without migrate_existing_table, so no sanitization occurs.
  • Special imports (table_copy, from_query) call Table.sanitize_columns explicitly (from DataImport#from_table).

Proposed Versioning Implementation

Let's assume we already have some method that applies normalization for a given version, which will replace Table.sanitize_columns (I'm using "normalization" here, but we may end up calling it sanitization or some other nonsense):

module ColumnNormalization
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1
  def self.normalize_columns(table_name, version=CURRENT_NORMALIZATION)
    # ...
  end
end

In all cases except manually created tables (i.e. all cases that sanitize columns), the UserTable record has a non-NULL data_import_id. We can use the DataImport record to store the kind of normalization performed (using the extra_options JSON field), so that if the table is synchronized,
the same normalization can be applied to newly imported data.

class DataImport
  attr_accessor :normalization

  def before_create
    # ... (keep existing before_create stuff here)
    self.normalization ||= ColumnNormalization::CURRENT_NORMALIZATION
    self.extra_options.merge! normalization: normalization
  end

  def applied_normalization
    extra_options[:normalization] || ColumnNormalization::INITIAL_NORMALIZATION
  end
end

Normalization can be applied through a Table instance method:

class Table
  # Use without arguments to normalize the registered table,
  # use with a different table name to apply the same normalization
  # to other (unregistered) table
  def normalize_columns(table_name=name)
    if data_import
      version = data_import.applied_normalization
      ColumnNormalization.normalize_columns(table_name, version)
    end
  end
end

Now, Table#import_to_cartodb (called on Table creation for imported tables) should
call the new instance method (self.normalize_columns) without arguments instead of Table.sanitize_columns.

Synchronizations (Adapter) must call table.normalize_columns(result.table_name)
with table = ::Table.new(name: @table_name, user_id: @user.id) or @user.tables.where(name: @table_name) [the latter produces a UserTable, not a Table].

We're doing an additional query to get the Table here, which is done also in other places like TableSetup. In the Synchronization::Member class this operation is available and memoized, but not in Adapter. We could refactor to avoid multiple queries for this. (note that putting the data import or version reference in the synchronizations table would not save doing the query at least once).

Then we should make Table#add_column! and #modify_column! consistent with the new normalization, changing them to call #normalize_columns on self instead of the String methods used now.

Internally, the normalization implementation in ColumnNormalization could use the String methods or, better, StringSanitizer, which should probably replace all use of the String methods and perhaps be integrated into DB::Sanitize.

The idea is to have a single source for (low-level) sanitization (e.g. DB::Sanitize), with versioned
sanitization provided through Table.

Note that normalization should also handle reserved words/column names, which is done inconsistently now.
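As a sketch of the kind of reserved-word handling meant here (the word list and helper name are illustrative; the real lists are Column::RESERVED_WORDS, Table::RESERVED_COLUMN_NAMES, etc., and the "_xxx" renaming is described in the Details below):

```ruby
# Illustrative only: a tiny reserved-word list (PostgreSQL system
# column names) and the "_xxx" renaming convention.
RESERVED_COLUMN_NAMES = %w[oid tableoid xmin cmin xmax cmax ctid].freeze

def avoid_reserved(name)
  RESERVED_COLUMN_NAMES.include?(name) ? "_#{name}" : name
end

avoid_reserved('xmin')  # => "_xmin"
avoid_reserved('city')  # => "city"
```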

  • Unified sanitization: remove Column & String sanitization methods,
    move into DB::Sanitize? Remove reserved-words redundancy/inconsistency
  • Implement versioned normalization module
  • Implement per data-import versioned normalization
  • Add tests
  • Add documentation

Details

The method Table.sanitize_columns is used in:

  • DataImport#from_table in turn from DataImport#dispatch (for table copy & query imports)
    and passing Column::RESERVED_WORDS (!). This doesn't happen in syncs nor in user (API) imports, only
    in internal imports. In any case, we have the DataImport to obtain the correct version.
  • Table#import_to_cartodb from Table#before_create (i.e. for all registered tables)
    and also from Synchronization::Adapter#setup_table.
    When called from setup_table we should get the version from the DataImport corresponding to the Sync,
    but it should match the table's DataImport... so...
    When called from Table creation, we should have previously assigned a version to the table's data import, if it exists...
    [check how DataImport creates Table, to see if DI is available at construction]
  • Synchronization::Adapter#import_cleanup

Now Synchronization::Adapter#run uses Table.sanitize_columns twice:

  • Before cartodbification it calls Synchronization::Adapter#import_cleanup
  • After it, it calls Synchronization::Adapter#setup_table

Imports call it through the Table constructor (before_create with the migrate_existing_table option set by TableRegistrar). If the import uses ogr2ogr, i.e. is file-based, the table's columns will already be lowercase.

Note that we have foreign keys user_tables.data_import_id and data_imports.table_id (so Table <-> DataImport navigation is possible), but we only have data_imports.synchronization_id, which is not indexed, so navigating Synchronization::Member -> DataImport cannot be done efficiently.

We have Synchronization::Member#table, which finds the table by user and name, but there's no such thing in the Adapter (the importer).
The Adapter is instantiated with a table_name (the name of the synchronization, which is the table name); it then uses result.table_name (the temp table) to normalize columns, cartodbfy, etc.
So we can't rely on Table doing the normalization with previously registered info, since we must normalize the imported (unregistered) table with info from the sync-name table (registered).

Note that syncs can't handle multiple files:

result = runner.results.select(&:success?).first

Sanitation Bonanza

There's a plethora of other sanitization methods in addition to Table.sanitize_columns:

String#sanitize_column_name

It uses DB::Sanitize::RESERVED_WORDS and CartoDB::RESERVED_COLUMN_NAMES.
It calls String#sanitize -> String#normalize.
It used to be called by Migrator, but currently that is disabled:

# Already done by Table#sanitize_columns in app/models/table.rb
#sanitize_columns!

Table#add_column! and #modify_column! use String#sanitize,
and use Table::RESERVED_COLUMN_NAMES to rename reserved names to _xxx.
Table#create_table_in_database! also calls String#sanitize, as do some rake tasks, etc.

DB::Sanitize#sanitize_identifier

It is used for table names only, not columns.
Called by ApiKey#create_db_config, FDW#server_name, ValidTableNameProposer#propose_valid_table_name, ConnectorRunner#result_table_name, ...
Uses DB::Sanitize::RESERVED_WORDS.

StringSanitizer#sanitize

It is used by Column#sanitized_name (with Column::RESERVED_WORDS)
and by Column#sanitize, which seems unused.

Other Redundancies

Two RESERVED_WORDS:

  • In DB::Sanitize used by String#sanitize_column_name
  • In Column used by DataImport#from_table and Column#sanitized_name

Two RESERVED_COLUMN_NAMES:

  • In CartoDB used by String#sanitize_column_name
  • In Table used by Table#add_column!, Table#modify_column! (the latter through rename_column)

@rafatower
Contributor

Proposed Versioning Implementation

The trick there is that once you've applied a certain version, you have to apply the same version from that moment on. That is, we need to store the version used (or the mapping of source cols to dest cols) somewhere, along with the sync.

We can use the DataImport record to store the kind of normalization performed (using the extra_options JSON field)

Is that extra_options field persisted? Is it the import_extra_options field in the data_imports table?

We could refactor to avoid multiple queries for this

I'd fix it first, avoiding the extra risk of a refactor, then consider it

normalization should also handle reserved words/column names which is done inconsistently now.

That can be solved by quoting column names. Actually many issues can be solved that way.

https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS (refer to quoted identifiers)

E.g:

postgres=# CREATE TABLE test_table("INSERT" int);
CREATE TABLE

About the plan, I'd start with the versioning and then move from there

My 2 cents

@jgoizueta jgoizueta self-assigned this Dec 16, 2019
jgoizueta added a commit that referenced this issue Dec 17, 2019
jgoizueta added a commit that referenced this issue Dec 17, 2019
@jgoizueta
Contributor Author

jgoizueta commented Dec 19, 2019

While preparing some tests cases to check the PR in staging I've noticed this:

Due to the double sanitization (before/after cartodbfication) described above, and the way sanitization was applied, we had cases of columns that changed their name after the first synchronization. 🤦‍♂

For example, a column like XX was sanitized as __1 (through DB connector imports). In the initial import, since there's no double sanitization in imports, it appeared as __1. But then the first synchronization (which could happen a month after the import!) performed the double sanitization that converts XX to __1 and then to _1. Ouch.

So, if we want to preserve the exact behaviour of existing syncs, we should keep the double sanitization; but for new syncs we shouldn't (unless we make the new sanitization idempotent).
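The non-idempotence can be seen directly with the regex quoted earlier in this thread; a quick sketch:

```ruby
# The /_{2,}/ squashing step means a name that already carries a
# collision suffix (like "__1") is changed again on the next pass.
sanitize = ->(name) { name.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_') }

sanitize.call('__1')                 # => "_1"  (changed: not idempotent on this input)
sanitize.call(sanitize.call('__1'))  # => "_1"  (stable only from the second pass on)
```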

@jgoizueta
Contributor Author

For the current state of this issue, please look at this comment in the PR: #15326 (comment)

@jgoizueta
Contributor Author

Closed via #15326
