
Handle uppercase letters in imported data column names #15231

Closed
jgoizueta opened this issue Nov 13, 2019 · 20 comments
@jgoizueta
Contributor

jgoizueta commented Nov 13, 2019

Column names in imported data are normalized as lower-case identifiers by removing upper case letters and other characters, then adding numerical suffixes if necessary to avoid name collisions. For example, when importing a table with columns ID_X, ID_Y, IC_gral they get renamed as _, _1, _gral.
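A minimal Ruby sketch of the behaviour described (the `normalize` helper and the suffixing logic are hypothetical reconstructions for illustration; the real logic lives elsewhere in the codebase):

```ruby
# Hypothetical sketch of the current normalization: strip anything
# outside [a-z0-9], squash runs of underscores, then add a numeric
# suffix on collision.
def normalize(name, taken = [])
  candidate = name.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_')
  if taken.include?(candidate)
    suffix = 1
    suffix += 1 while taken.include?("#{candidate}#{suffix}")
    candidate = "#{candidate}#{suffix}"
  end
  candidate
end

taken = []
%w[ID_X ID_Y IC_gral].map { |n| taken << normalize(n, taken); taken.last }
# => ["_", "_1", "_gral"]
```

This reproduces the renaming from the example above: the uppercase letters are stripped rather than downcased, so almost all the name information is lost.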

It would be nicer to preserve uppercase letters by converting them to lowercase, but changing this behaviour for existing import sources may break existing use cases.

For new sources, e.g. the BQ connector, we could change it without risk of breaking anything, but it might not be easy (is the name change part of cartodbfication?).

In any case there are some questions I think we should consider for discussion:

  • Should we change this behaviour globally (and hope we don't break too many cases)?
  • Are uppercase letters common in BQ column names? (same for other data sources)

cc @alejandrohall @alonsogarciapablo @ilbambino

Note that a table with all columns in uppercase would be imported as _, _1, _2, _3, ..., which is very unfortunate.

@alrocar
Contributor

alrocar commented Nov 13, 2019

AFAIK the column name normalization is part of the Import API, but it's also part of the import process with COPY in CARTOframes.

I wasn't aware we were dropping uppercase letters... See this and this, so maybe we have a bug somewhere.

How can we reproduce the issue?

@jgoizueta
Contributor Author

I've checked, and the loss of the capital letters occurs neither in the cartodbfication nor in odbc_fdw. So it must be in the Import API, perhaps related to the DB connectors.

@jgoizueta
Contributor Author

The problem seems to be here:

cartodb/app/models/table.rb

Lines 1347 to 1348 in 5b1a8a5

candidate_column_name = candidate_column_name.gsub(/[^a-z0-9]/,'_').gsub(/_{2,}/, '_')

This is executed when a new Table is created for an existing table; which is performed through TableRegistrar by the data imports.

I think it would be nice to replace the gsub(/[^a-z0-9]/,'_') with StringSanitizer#sanitize, which downcases and also transliterates non-ASCII letters to a-z:

CartoDB::Importer2::StringSanitizer.new.sanitize('aAÁäÂ') # => "aaaaa"

We're already using StringSanitizer in Column to sanitize geometry column names (in the importer Georeferencer).

I think the greatest risk of breaking things with this change would be with existing sync tables, because column names would change for existing tables.

We could add a rollbar trace for syncs that have [^a-z0-9] characters in the column names (before registration) and run it for some days before deciding about doing the change.

Another option would be to only change the sanitization for new import types (like the BQ connector) to avoid affecting existing syncs, by sanitizing column names before table registration, but I don't like having such difference between import methods.
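For illustration, here's a sketch of what downcasing before stripping would yield, versus the current regex (plain String methods shown; StringSanitizer additionally transliterates non-ASCII letters, as in the example above):

```ruby
# "current" mirrors the regex quoted from table.rb; "proposed" is a
# hypothetical variant that downcases first, preserving the letters.
current  = ->(name) { name.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_') }
proposed = ->(name) { name.downcase.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_') }

current.call('ID_X')   # => "_"
proposed.call('ID_X')  # => "id_x"
```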

@alrocar
Contributor

alrocar commented Nov 15, 2019

Thanks for the research @jgoizueta

That sounds like a plan 👍

@rafatower rafatower self-assigned this Nov 19, 2019
@rafatower
Contributor

There are some changes of mine that already modified sanitization of col names:

https://github.com/CartoDB/cartodb/pull/14937/files

I couldn't reproduce the issue with a regular import, so obviously there's yet a mismatch between imports and syncs, which is unfortunate :(

@rafatower
Contributor

I could reproduce the issue with Postgres DB Connector. I'm 99% sure this problem is specific to DB Connectors.

On the bright side, any potential fix will be safer, as it won't generically affect other kinds of imports/syncs.

@rafatower
Contributor

By instrumenting the code and reproducing the case, I figured out that it does something like (in pseudocode) importer->run->register->table_registrar->register->table->before_create->sanitize_columns->ensure_column_has_valid_name.

So I don't really get why it affects DB Connectors but not regular imports. I suspect ogr2ogr may have something to do with it.

The patch:

diff --git a/app/models/table.rb b/app/models/table.rb
index 16e3f30f9b..b1099a38c4 100644
--- a/app/models/table.rb
+++ b/app/models/table.rb
@@ -1329,6 +1329,9 @@ class Table
 
     valid_column_name = get_valid_column_name(table_name, column_name, options)
     if valid_column_name != column_name
+      puts "<RTORRE>"
+      puts Thread.current.backtrace.join("\n")
+      puts "</RTORRE>"
       connection.run(%Q{ALTER TABLE "#{database_schema}"."#{table_name}" RENAME COLUMN "#{column_name}" TO "#{valid_column_name}";})
     end
 

The trace I got (cleaned up of framework hooks and beautified):

cartodb/app/models/table.rb:1333:in `ensure_column_has_valid_name'
cartodb/app/models/table.rb:1313:in `block in sanitize_columns'
cartodb/app/models/table.rb:1310:in `each'
cartodb/app/models/table.rb:1310:in `sanitize_columns'
cartodb/app/models/table.rb:230:in `import_to_cartodb'
cartodb/app/models/table.rb:336:in `before_create'
cartodb/app/models/carto/user_table.rb:281:in `service_before_create'
cartodb/app/models/table.rb:79:in `save'
cartodb/app/models/table_registrar.rb:16:in `register'
cartodb/app/connectors/importer.rb:266:in `persist_metadata'
cartodb/app/connectors/importer.rb:118:in `register'
cartodb/app/connectors/importer.rb:90:in `block (2 levels) in register_results'
cartodb/app/connectors/importer.rb:89:in `each'
cartodb/app/connectors/importer.rb:89:in `block in register_results'
cartodb/lib/carto/ghost_tables_manager.rb:54:in `block in run_synchronized'
cartodb/lib/carto/bolt.rb:27:in `run_locked'
cartodb/lib/carto/ghost_tables_manager.rb:53:in `run_synchronized'
cartodb/app/connectors/importer.rb:83:in `register_results'
cartodb/app/connectors/importer.rb:73:in `run'
cartodb/app/models/data_import.rb:851:in `execute_importer'
cartodb/app/models/data_import.rb:440:in `dispatch'
cartodb/app/models/data_import.rb:188:in `run_import!'
cartodb/lib/resque/importer_jobs.rb:8:in `block in perform'
cartodb/lib/resque/base_job.rb:19:in `run_action'
cartodb/lib/resque/importer_jobs.rb:8:in `perform'

@rafatower
Contributor

Column names are usually converted to lowercase when they go through ogr2ogr, but it is because of the way PG works. E.g:

test=# CREATE TABLE capital_column_names (id int, OtherColumn varchar);
CREATE TABLE
test=# \d capital_column_names 
               Table "public.capital_column_names"
   Column    |       Type        | Collation | Nullable | Default 
-------------+-------------------+-----------+----------+---------
 id          | integer           |           |          | 
 othercolumn | character varying |           |          | 

vs

test=# CREATE TABLE capital_column_names_for_real (id int, "OtherColumn" varchar);
CREATE TABLE
test=# \d capital_column_names_for_real 
           Table "public.capital_column_names_for_real"
   Column    |       Type        | Collation | Nullable | Default 
-------------+-------------------+-----------+----------+---------
 id          | integer           |           |          | 
 OtherColumn | character varying |           |          | 

(mind that column names like that one have to be quoted to operate with them)

@rafatower
Contributor

ogr2ogr still has something to do:

https://gdal.org/drivers/vector/pg.html

LAUNDER: This may be “YES” to force new fields created on this layer to have their field names “laundered” into a form more compatible with PostgreSQL. This converts to lower case and converts some special characters like “-” and “#” to “_”. If “NO” exact names are preserved. The default value is “YES”. If enabled the table (layer) name will also be laundered.

(we're using LAUNDER defaulting to YES, and it seems like a bad idea to me to start setting it to NO at this point)

This shouldn't affect this issue much; it just makes it harder to figure out, reproduce and test because of the many moving parts with different behaviours.

@rafatower
Contributor

Being pragmatic, a possible fix (not the best but good enough): #15252

@rafatower
Contributor

This PR just adds a warning but does not perform any actual change: #15253

@rafatower
Contributor

There are many instances of differences in column-name sanitization/normalization. See Rollbar, #40477

@rafatower
Contributor

I see a few problems with this approach (in order of difficulty to solve):

  • Inspecting the backtraces from the traces shows there are affected syncs. They will be broken unless we add extra logic.
  • That sanitization produces invalid identifiers that start with digits, e.g. 10__name.
  • It does not account for collisions in column names.
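The leading-digit problem can be sketched directly (the `sanitize` lambda and the underscore-prefix guard are illustrative, not the actual code; note this variant keeps underscores, which reproduces the 10__name example):

```ruby
# A sanitized name starting with a digit is not a valid unquoted
# PostgreSQL identifier; a common fix is prefixing an underscore.
sanitize = ->(name) { name.downcase.gsub(/[^a-z0-9_]/, '_') }

bad   = sanitize.call('10  Name')          # => "10__name" (invalid unquoted identifier)
fixed = bad =~ /\A[0-9]/ ? "_#{bad}" : bad # => "_10__name"
```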

I'm reverting the PR to add the traces, as it easily produces 6K log entries/day (one per column-sync).

@jgoizueta
Contributor Author

Many of the cases are due not to capitals but to multiple underscores vs. a single underscore, though a good number of capitals are lost too. Interesting data. The problem is worse than I had anticipated: many syncs would be affected by the change.

We could change the behaviour only for new connector providers, or only for syncs created since some date, but I find that cumbersome.

@rafatower rafatower removed their assignment Dec 2, 2019
@jgoizueta
Contributor Author

jgoizueta commented Dec 10, 2019

I was wrong about Column using StringSanitizer (there's a #sanitize method, but it is not used; Column is used by Georeferencer, but not its sanitize method), and the whole sanitization picture in our current database is more complicated, but I've taken a deep look into it and survived to tell the tale.

The problem

Imported and synced tables have Table.sanitize_columns called on the new tables, which strips uppercase letters and special characters.

For file imports this has no effect, because ogr2ogr has already converted column names to lowercase.

For tables created manually (SQL API) and cartodbfied this is not executed when the table is registered, so they can have uppercase letters, etc. in the column names.

The detailed process:

  • Imported tables: DataImport#run (dispatch/new_importer/new_importer_with_connector, etc.)
    uses a TableRegistrar object to create a Table (setting migrate_existing_table;
    its presence in Table#before_create makes it call Table.sanitize_columns).
  • Sync tables: no new UserTable record is created, so before_create is not called,
    but Synchronization::Adapter#run explicitly calls Table.sanitize_columns twice (before cartodbfication through #import_cleanup and after through #setup_table).
  • Manually cartodbfied tables are registered by GTM; the table is created without migrate_existing_table, so no sanitization occurs.
  • Special imports (table_copy, from_query) call Table.sanitize_columns explicitly (from DataImport#from_table).

Proposed Versioning Implementation

Let's assume we already have some method that applies normalization for a given version, which will replace Table.sanitize_columns (I'm using "normalization" here, but we may end up calling it sanitization or some other nonsense):

module ColumnNormalization
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1
  def self.normalize_columns(table_name, version=CURRENT_NORMALIZATION)
    # ...
  end
end

In all cases except manually created tables (i.e. all cases that sanitize columns), the UserTable record has a non-NULL data_import_id. We can use the DataImport record to store the kind of normalization performed (using the extra_options JSON field), so that if the table is synchronized,
the same normalization can be applied to newly imported data.

class DataImport
  attr_accessor :normalization

  def before_create
    # ... (keep existing before_create stuff here)
    self.normalization ||= ColumnNormalization::CURRENT_NORMALIZATION
    self.extra_options.merge! normalization: normalization
  end

  def applied_normalization
    extra_options[:normalization] || ColumnNormalization::INITIAL_NORMALIZATION
  end
end

Normalization can be applied through a Table instance method:

class Table
  # Use without arguments to normalize the registered table,
  # use with a different table name to apply the same normalization
  # to other (unregistered) table
  def normalize_columns(table_name=name)
    if data_import
      version = data_import.applied_normalization
      ColumnNormalization.normalize_columns(table_name, version)
    end
  end
end

Now, Table#import_to_cartodb (called on Table creation for imported tables) should
call the new instance method (self.normalize_columns) without arguments instead of Table.sanitize_columns.

Synchronizations (Adapter) must call table.normalize_columns(result.table_name)
with table = ::Table.new(name: @table_name, user_id: @user.id) or @user.tables.where(name: @table_name) [the latter produces a UserTable, not a Table].

We're doing an additional query to get the Table here, which is done also in other places like TableSetup. In the Synchronization::Member class this operation is available and memoized, but not in Adapter. We could refactor to avoid multiple queries for this. (note that putting the data import or version reference in the synchronizations table would not save doing the query at least once).

Then we should make Table#add_column! and #modify_column! consistent with the new normalization, changing them to call #normalize_columns on self instead of the String methods used now.

Internally, the normalization implementation in ColumnNormalization could use the String methods or, better, StringSanitizer, which should probably replace all use of the String methods and perhaps be integrated into DB::Sanitize.

The idea is to have a single source for (low-level) sanitization (e.g. DB::Sanitize), with versioned
sanitization provided through Table.

Note that normalization should also handle reserved words/column names, which is done inconsistently now.
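As a sketch of the kind of reserved-word handling meant here (the word list and helper name are illustrative; the real lists are Column::RESERVED_WORDS, Table::RESERVED_COLUMN_NAMES, etc., and the "_xxx" renaming is described in the Details below):

```ruby
# Illustrative only: a tiny reserved-word list (PostgreSQL system
# column names) and the "_xxx" renaming convention.
RESERVED_COLUMN_NAMES = %w[oid tableoid xmin cmin xmax cmax ctid].freeze

def avoid_reserved(name)
  RESERVED_COLUMN_NAMES.include?(name) ? "_#{name}" : name
end

avoid_reserved('xmin')  # => "_xmin"
avoid_reserved('city')  # => "city"
```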

  • Unified sanitization: remove Column & String sanitization methods,
    move into DB::Sanitize? Remove reserved-words redundancy/inconsistency
  • Implement versioned normalization module
  • Implement per data-import versioned normalization
  • Add tests
  • Add documentation

Details

The method Table.sanitize_columns is used in:

  • DataImport#from_table in turn from DataImport#dispatch (for table copy & query imports)
    and passing Column::RESERVED_WORDS (!). This doesn't happen in syncs nor in user (API) imports, only
    in internal imports. In any case, we have the DataImport to obtain the correct version.
  • Table#import_to_cartodb from Table#before_create (i.e. for all registered tables)
    and also from Synchronization::Adapter#setup_table.
    When called from setup_table we should get the version from the DataImport corresponding to the Sync,
    but it should match the table's DataImport... so...
    When called from Table creation, we should have previously assigned a version to the table's data import, if it exists...
    [check how DataImport creates Table, to see if DI is available at construction]
  • Synchronization::Adapter#import_cleanup

Now Synchronization::Adapter#run uses Table.sanitize_columns twice:

  • Before cartodbification it calls Synchronization::Adapter#import_cleanup
  • After it, it calls Synchronization::Adapter#setup_table

Imports call it through the Table constructor (before_create with the migrate_existing_table option set by TableRegistrar). If the import uses ogr2ogr, i.e. is file-based, the table's columns will already be lowercase.

Note that we have foreign keys user_tables.data_import_id and data_imports.table_id (so Table <-> DataImport navigation is possible), but we only have data_imports.synchronization_id, which is not indexed, so navigating Synchronization::Member -> DataImport cannot be done efficiently.

We have Synchronization::Member#table, which finds the table by user and name, but there's no such thing in the Adapter (the importer).
The Adapter is instantiated with a table_name (the name of the synchronization, which is the table name); it then uses result.table_name (the temp table) to normalize columns, cartodbfy, etc.
So we can't rely on Table doing the normalization with previously registered info, since we must normalize the imported (unregistered) table with info from the sync-name table (registered).

Note that syncs can't handle multiple files:

result = runner.results.select(&:success?).first

Sanitation Bonanza

There's a plethora of other sanitization methods in addition to Table.sanitize_columns:

String#sanitize_column_name

It uses DB::Sanitize::RESERVED_WORDS and CartoDB::RESERVED_COLUMN_NAMES.
It calls String#sanitize -> String#normalize.
It used to be called by Migrator, but currently that is disabled:

# Already done by Table#sanitize_columns in app/models/table.rb
#sanitize_columns!

Table#add_column! and #modify_column! use String#sanitize,
and use Table::RESERVED_COLUMN_NAMES to rename reserved names to _xxx.
Table#create_table_in_database! also calls String#sanitize, as do some rake tasks, etc.

DB::Sanitize#sanitize_identifier

It is used for table names only, not columns.
Called by ApiKey#create_db_config, FDW#server_name, ValidTableNameProposer#propose_valid_table_name, ConnectorRunner#result_table_name, ...
Uses DB::Sanitize::RESERVED_WORDS.

StringSanitizer#sanitize

It is used by Column#sanitized_name (with Column::RESERVED_WORDS)
and by Column#sanitize, which seems unused.

Other Redundancies

Two RESERVED_WORDS:

  • In DB::Sanitize used by String#sanitize_column_name
  • In Column used by DataImport#from_table and Column#sanitized_name

Two RESERVED_COLUMN_NAMES:

  • In CartoDB used by String#sanitize_column_name
  • In Table used by Table#add_column!, Table#modify_column! (the latter through rename_column)

@rafatower
Contributor

Proposed Versioning Implementation

The trick there is that once you've applied a certain version, you have to apply the same version from that moment on. That is, we need to store the version used (or the mapping of source cols to dest cols) somewhere, along with the sync.

We can use the DataImport record to store the kind of normalization performed (using the extra_options JSON field)

Is that extra_options field persisted? Is it the import_extra_options field in the data_imports table?

We could refactor to avoid multiple queries for this

I'd fix it first, avoiding the extra risk of a refactor, then consider it

normalization should also handle reserved words/column names which is done inconsistently now.

That can be solved by quoting column names. Actually many issues can be solved that way.

https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS (refer to quoted identifiers)

E.g:

postgres=# CREATE TABLE test_table("INSERT" int);
CREATE TABLE

About the plan, I'd start with the versioning and then move from there

My 2 cents

@jgoizueta jgoizueta self-assigned this Dec 16, 2019
jgoizueta added a commit that referenced this issue Dec 17, 2019
jgoizueta added a commit that referenced this issue Dec 17, 2019
@jgoizueta
Contributor Author

jgoizueta commented Dec 19, 2019

While preparing some tests cases to check the PR in staging I've noticed this:

Due to the double sanitization (before/after cartodbfication) described above, and the way sanitization was applied, we had cases of columns that changed their name after the first synchronization. 🤦‍♂

For example, a column like XX was sanitized as __1 (through DB connector imports). In the initial import, since there's no double sanitization in imports, it appeared as __1. But then the first synchronization (which could happen a month after the import!) performed the double sanitization that converts XX to __1 and then to _1. Ouch.

So, if we want to preserve the exact behaviour of existing syncs, we should keep the double sanitization; but for new syncs we shouldn't (unless we make the new sanitization idempotent).
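The non-idempotence can be seen directly with the regex quoted earlier in this thread; a quick sketch:

```ruby
# The /_{2,}/ squashing step means a name that already carries a
# collision suffix (like "__1") is changed again on the next pass.
sanitize = ->(name) { name.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_') }

sanitize.call('__1')                 # => "_1"  (changed: not idempotent on this input)
sanitize.call(sanitize.call('__1'))  # => "_1"  (stable only from the second pass on)
```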

@jgoizueta
Contributor Author

For the current state of this issue, please look at this comment in the PR: #15326 (comment)

@jgoizueta
Contributor Author

Closed via #15326
