Versioned column sanitization #15326

jgoizueta · 2019-12-17T18:53:25Z

See #15231

Note that the extra_options accessor canot be used to modify the underlying column

jgoizueta · 2019-12-19T11:31:32Z

retest this, please

Note that ogc_fid shouldn't be altered since it is handled by Table#import_cleanup

jgoizueta · 2019-12-19T23:08:34Z

retest this, please

jgoizueta · 2019-12-20T08:28:36Z

pretty pretty please 🙏

jgoizueta · 2019-12-20T08:29:24Z

retest this, please

jgoizueta · 2019-12-20T08:29:52Z

🙄 the things you have to do to make that CI work

Synchronization::Member#table expects user_id to be assignable in the Table constructor (and now also Synchronization::Adapter#import_cleanup expects it)

Sanitization of colum names was being accidentally ommited in Synchronization::Adapter#import_cleanup This wasn't actually too bad because the sanitization was peformed after cartodbfication by Synchronization::Adapter#setup_table. The cause of this was that ::Table relies on its @data_import for the sanitization version, and since it hadn't been assigned in the Table instance used, it used the version 0 which means "no sanitization". This also had another side effect: we have been using double sanitization in syncs (legacy version 1) possibly altering column names (double underscores caused by anti-collision suffixes). So this commit restores that behaviour for v1 syncs (although a less accidental way of achieving the effect would be desiderable, such as preventing double sanitization and altering v1 sanitization to remove double underscores after collision prevention).

jgoizueta · 2019-12-20T17:40:56Z

This is ready for review, but there are some pending actions discussed below, and
this should be deployed carefully (support awareness, keep and eye on rollbar, ready to take action, etc.)

@rafatower @gonzaloriestra I've made no formal review request as I don't want to bother you too much these days, but feel free to review it if you see this

The long following notes are mostly for myself when I'm back to work on this.

Description of the changes

A Table object has an associated column_sanitization_version value that determines how it's columns should be sanitized.

If the table doesn't have an associated DataImport (which happens if the user manually cartodbfied a table) then the version is 0, meaning "no sanitization should take place". This preserves the current behaviour.

Otherwise it will use the column_sanitization_version from the DataImport, which is stored in the JSON-serialized import_extra_options column.

If the DataImport has no column_sanitization_version (i.e. all existing data import records) it will
be mapped to the version value 1, meaning "the original sanitization we have been applying so far".

Newly created DataImport records will have the value determined by the CURRENT_COLUMN_SANITIZATION_VERSION constant, with value 2 at the moment, meaning "the new sanitization introduced here".

The legacy sanitization (version 1) dropped uppercase letters and other non-ASCII characters from the column names (except that for file imports ogr2ogr previously converts names to downcase, so uppercase is not lost).

The new one (version 2) applies a transliteration to downcase ASCII letters that we already had implemented in two places: String and StringSanitizer. I've removed redundancy leaving it just in the latter and also made some corrections. (we could include the module StringSanitizer in class String to restore access through
strings)

The new sanitization also differs in prepending an underscore to reserved keywords and system columns, but this is one of the questions to be discussed below as I'm not quite sure this is a good idea. (by changing these names we make it more convenient to handle
them in SQL since they don't require quoting)

Sanitization flavours

While testing this I noticed there are cases in which sanitization perform by imports differ from that of syncs, so some users could be affected by this having their column names change between the initial import and the first synchronization.

This example shows the columns I've been using for testing, all belong to the same import/sync table, so this shows how the sanitization and name collision-prevention transforms the names:

input column	file import	db import	db sync	v2 file	v2 db
abc	abc	abc	abc	abc	abc
Abc	abc2	_bc	_bc	abc2	abc_1
2abc	column_2abc	column_2ab	column_2ab	_2abc	_2abc
Регион	_	_	_	region	region
description/名稱	description	description	description	description	description
XX	xx	__1	_1	xx	xx
YY	yy	__2	_2	yy	yy
> min	_min	_min	_min	min	min
any	any	any	any	_any	_any
a+++b	a_b	a_b	a_b	a_b	a_b

old file import/sync:

collision-avoiding suffixes style: x, x2, x3, ... (as set by ogr2ogr)
invalid identifier prefix: column_
cyrillic: _
other scripts/symbols: ignored
uppercase: to lowercase
reserved keywords: kept (so they require quoting in SQL)

old db-connector import

collision-avoiding suffixes style: x, x_1, x_2, ...
invalid identifier prefix: column_
cyrillic: _
other scripts/symbols: ignored
uppercase: _
uppercase: to lowercase

old db-connector synchronization

Like import, except:
- suffix double underscore converted to single (_1 instead of __1)

new (version 2)

Applied only to new imports/syncs; existing syncs will not be altered

collision avoiding style still different for files and db: x2 vs x_1
invalid prefix: _
reserved word prefix: _ (no need for quoting)
cyrillic transliterated
uppercase: to lowercase

Fear, uncertainty, and doubt

Import/Sync, and specially sanitization code has become very messy over the years, so I've tried to leave it in a slightly tidier state while implementing this (just slightly enough that I could keep my sanity while doing these changes). That means there could be some deviation in the legacy sanitization or some other side effects affecting imports, editing columns, etc. I think only those that affect existing syncs should be relevant, and in the tests I've performed (see below) things work as before, but there could be corner cases.

Staging tests

For testing this in staging I've used both db-connector imports and CSV file imports.

For the db-connector I've been rather hacky, by using a Postgres database as the source, created in the same database host as the user database. The name of the database is testconn

In it I created this table:

CREATE TABLE weirder ("abc" int, "Abc" int, "2abc" int ,"Регион" int ,"description/名稱" int ,"XX" int ,"YY" int ,"> min" int, "any" int, "a++b" int);
INSERT INTO weirder VALUES (11,22,33,44,55,66,77,88,99,10), (110,220,330,440,550,660,770,880,990,100);

Imports and sycs can be created like this:

curl -k -H "Content-Type: application/json" -d '{"connector":{ "provider":"postgres", "connection":{ "sslmode": "disable", "server":"localhost", "username":"postgres", "database":"testconn"}, "table":"weirder","import_as":"weirder0"}}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

curl -k -H "Content-Type: application/json" -d '{"interval":900, "connector":{ "provider":"postgres", "connection":{ "sslmode": "disable", "server":"localhost", "username":"postgres", "database":"testconn"}, "table":"weirder"}}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

For the file imports a created a similar CSV file:

weirder.csv:

"abc","Abc","2abc","Регион","description/名稱","XX","YY","> min","any","a+++b"
11,22,33,44,55,66,77,88,99,10
110,220,330,440,550,660,770,880,990,100

Then stored it in a bucket accessible via https://storage.googleapis.com/test_carto/weirder.csv:

gsutil cp weirder.csv gs://test_carto/ && gsutil acl ch -u AllUsers:R gs://test_carto/weirder.csv

Imports and export can be performed now as:

curl -k -H "Content-Type: application/json" -d '{"url":"https://storage.googleapis.com/test_carto/weirder.csv"}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

curl -k -H "Content-Type: application/json" -d '{"interval":900,"url":"https://storage.googleapis.com/test_carto/weirder.csv"}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

Open questions

Double sanitization (import-sync differences)

The double sanitization performed in synchronizations is still in place.
It makes legacy (v1) syncs keep their behaviour (produce _1 instead of __1) and doesn't affect new syncs (v2) because new sanitization is idemponent, so there's no difference between the import (single sanitization) and the sync (double sanitization).

But I don't like depending on that oddity, so I would consider removing the double sanitization
and changing the legacy (v1) sanitization to act as if applied twice, so that old syncs are not affected. (we don't need to reproduce the effect of old imports, so no problem here)

Changing the legacy sanitization is a matter of avoiding double underscores after adding the collision suffix.

Differences between file and db-connector imports

As can be seen in the table above, ogr2ogr has it's own logic to avoid column name collisions
(since it changes names to downcase) which is different from ours.

We could change the logic for our new (version 2) sanitization to match ogr2ogr.

Reserved names

Our new sanitization also has changed what we do with PostgreSQL reserved keywords (e.g. AS)
and system columns (e.g. xmin). If we use such names in a table then they must be quoted
every time they appear in SQL. The new sanitization is currently altering these names by
prepending an underscore. (this is done because use the sanitization functions in places
where we were expecting this behaviour) For imported data, this makes column names more convenient for users in the sense that they don't need to use quotes in SQL, but maybe it's better
to avoid unnecessary changes.

кириллица

Since we had an implementation of cyrillic transliteration I've applied it in the new sanitization: some of our users are importing tables with cyrillic column names and without this their columns end up as _, _1, _2, so I think it's nicer for them. We're still discriminating users of many other alphabets, of course. Arguably, we could preserve cyrillic and other alphabets (by using quotes they should be usable in SQL), but I don't want to bring in that question in this PR. So the question is simply: perform the cyrillic transliteration?

Review collision logic

For collision handling, the sanitization method Column::get_valid_column_name accepts a list of existing column names that should be avoided.

The name to be sanitized is removed from this list inside this function.

I think we should review the uses of this function (direct or indirect) because there could be cases in which we shouldn't remove the name from the list.

I can think of these cases (and I'm not sure if and how are we passing existing column names in each case):

Add a new column to existing table. The new name may already exist, we shouldn't remove it from the list.
Alter the name of an existing column: We should remove the old name, not the new one.
Sanitize one existing column (usually used to sanitize all one by one): in this case the original column must be removed.
Prepare column names for table creation: here we should incrementally fill columns with previously sanitized names as in the column_spec unit tests, and not remove the current one.

Missing tests

We should add tests for the versioning mechanism:

No sanitization applied for tables registered by ghost tables (i.e. version 0 applied)
For data imports without versioning info, version 1 is applied
For new data imports, version 2 is applied

It would also be nice to add integration or end-to-end tests similar to the ones performed in staging.

gonzaloriestra

Awesome work!! 👏 👏

I left some comments, but it looks good to me for now. Are you planning to add the missing tests you mentioned?

Regarding the open questions, my opinion is that we should change as few things as possible now and go step by step, because this feels a bit dangerous. But all of them make sense to me 👍

app/models/table.rb

services/importer/lib/importer/column.rb

gonzaloriestra · 2019-12-23T08:22:48Z

services/importer/lib/importer/column.rb

+          candidate_column_name = candidate_column_name.to_s.squish
+
+          # Subsequent characters can be letters, underscores or digits
+          candidate_column_name = candidate_column_name.gsub(/[^a-z0-9]/,'_').gsub(/_{2,}/, '_')


Take into account that original names with a double underscore, which should be valid, will be changed. Although it's not a common case, I guess.

services/importer/spec/unit/column_spec.rb

rafatower · 2019-12-31T09:40:06Z

This topic is complex enough to have it buried in issues. Could you please add an entry to doc-internal as part of it?

# Conflicts: # NEWS.md

NEWS.md

gonzaloriestra

👌

jgoizueta added 11 commits December 17, 2019 19:52

Version column sanitization (WIP)

33c9a8b

See #15231

Fix extra_options modiication

ae5622b

Note that the extra_options accessor canot be used to modify the underlying column

Remove reserved_words options

2f1a1e9

Fix sanitization version

4d27846

Use existing column sanitization

40585a9

Fix fetching Table

5cc3e11

Make method public

e282e04

Add NEWS entry

69ce107

Refactor sanitization functions

77228e5

Fix syntax

198a30e

Fix reserved name detection

ab4db5a

jgoizueta added 17 commits December 19, 2019 13:30

Tests for legacy sanitization

86a9ed2

Remove unnecessary parameter

4b7be30

More sanitization tests

c24ea38

Implement v2 sanitization

b0a65b4

Rename cyrillic parameter

9e2dc67

Add rejected columns to preserve behaviour when adding columns

777f755

Make StringSanitizer methods module functions

03bea34

Fix bug

0527f31

Fix column modification

b2e30b5

Preserve sanitization behaviour for multiple underscores

cc788b3

Adapt tests to new multiple underscore behaviour

c553e43

Neew sanitization tests

bcf74c9

Bugfixing

952b0bf

Remove test case

4183bbf

Test using ActiveRecord transliteration

6b481f8

Adjust tests to new sanitization

e1b2e15

Fix reserved column names

1ff4270

Note that ogc_fid shouldn't be altered since it is handled by Table#import_cleanup

jgoizueta added 2 commits December 20, 2019 16:16

Allow asigning user_id in Table constructor

d537a80

Synchronization::Member#table expects user_id to be assignable in the Table constructor (and now also Synchronization::Adapter#import_cleanup expects it)

jgoizueta mentioned this pull request Dec 20, 2019

Handle uppercase letters in imported data column names #15231

Closed

gonzaloriestra reviewed Dec 23, 2019

View reviewed changes

jgoizueta added 3 commits January 13, 2020 15:46

Code simplification

1726064

Separate methods per sanitization version

c0d773e

Merge branch 'master' into column_sanitization

635f924

# Conflicts: # NEWS.md

jgoizueta requested a review from gonzaloriestra January 17, 2020 08:11

gonzaloriestra suggested changes Jan 17, 2020

View reviewed changes

NEWS.md Outdated Show resolved Hide resolved

Fix NEWS

8f7c420

jgoizueta requested a review from gonzaloriestra January 17, 2020 08:58

gonzaloriestra approved these changes Jan 17, 2020

View reviewed changes

jgoizueta merged commit e57ab5d into master Jan 17, 2020

jgoizueta deleted the column_sanitization branch January 17, 2020 09:11

rafatower mentioned this pull request Sep 29, 2020

15231 Use the StringSanitizer to normalize column names #15252

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Versioned column sanitization #15326

Versioned column sanitization #15326

jgoizueta commented Dec 17, 2019

jgoizueta commented Dec 19, 2019

jgoizueta commented Dec 19, 2019

jgoizueta commented Dec 20, 2019

jgoizueta commented Dec 20, 2019

jgoizueta commented Dec 20, 2019

jgoizueta commented Dec 20, 2019 •

edited

Loading

gonzaloriestra left a comment

gonzaloriestra Dec 23, 2019

rafatower commented Dec 31, 2019

gonzaloriestra left a comment

Versioned column sanitization #15326

Versioned column sanitization #15326

Conversation

jgoizueta commented Dec 17, 2019

jgoizueta commented Dec 19, 2019

jgoizueta commented Dec 19, 2019

jgoizueta commented Dec 20, 2019

jgoizueta commented Dec 20, 2019

jgoizueta commented Dec 20, 2019

jgoizueta commented Dec 20, 2019 • edited Loading

Description of the changes

Sanitization flavours

old file import/sync:

old db-connector import

old db-connector synchronization

new (version 2)

Fear, uncertainty, and doubt

Staging tests

Open questions

Double sanitization (import-sync differences)

Differences between file and db-connector imports

Reserved names

кириллица

Review collision logic

Missing tests

gonzaloriestra left a comment

Choose a reason for hiding this comment

gonzaloriestra Dec 23, 2019

Choose a reason for hiding this comment

rafatower commented Dec 31, 2019

gonzaloriestra left a comment

Choose a reason for hiding this comment

jgoizueta commented Dec 20, 2019 •

edited

Loading