Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Versioned column sanitization #15326

Merged
merged 34 commits into from
Jan 17, 2020
Merged

Versioned column sanitization #15326

merged 34 commits into from
Jan 17, 2020

Conversation

jgoizueta
Copy link
Contributor

See #15231

@jgoizueta
Copy link
Contributor Author

retest this, please

@jgoizueta
Copy link
Contributor Author

retest this, please

@jgoizueta
Copy link
Contributor Author

pretty pretty please 🙏

@jgoizueta
Copy link
Contributor Author

retest this, please

@jgoizueta
Copy link
Contributor Author

🙄 the things you have to do to make that CI work

Synchronization::Member#table expects user_id to be assignable in the Table constructor
(and now also Synchronization::Adapter#import_cleanup expects it)
Sanitization of colum names was being accidentally ommited in Synchronization::Adapter#import_cleanup
This wasn't actually too bad because the sanitization was peformed after cartodbfication by Synchronization::Adapter#setup_table.
The cause of this was that ::Table relies on its @data_import for the sanitization version,
and since it hadn't been assigned in the Table instance used, it used the version 0 which means
"no sanitization".
This also had another side effect: we have been using double sanitization in syncs
(legacy version 1) possibly altering column names (double underscores caused by anti-collision suffixes).
So this commit restores that behaviour for v1 syncs (although a less accidental way of achieving
the effect would be desiderable, such as preventing double sanitization and altering v1 sanitization
to remove double underscores after collision prevention).
@jgoizueta
Copy link
Contributor Author

jgoizueta commented Dec 20, 2019

This is ready for review, but there are some pending actions discussed below, and
this should be deployed carefully (support awareness, keep and eye on rollbar, ready to take action, etc.)

@rafatower @gonzaloriestra I've made no formal review request as I don't want to bother you too much these days, but feel free to review it if you see this

The long following notes are mostly for myself when I'm back to work on this.

Description of the changes

A Table object has an associated column_sanitization_version value that determines how it's columns should be sanitized.

If the table doesn't have an associated DataImport (which happens if the user manually cartodbfied a table) then the version is 0, meaning "no sanitization should take place". This preserves the current behaviour.

Otherwise it will use the column_sanitization_version from the DataImport, which is stored in the JSON-serialized import_extra_options column.

If the DataImport has no column_sanitization_version (i.e. all existing data import records) it will
be mapped to the version value 1, meaning "the original sanitization we have been applying so far".

Newly created DataImport records will have the value determined by the CURRENT_COLUMN_SANITIZATION_VERSION constant, with value 2 at the moment, meaning "the new sanitization introduced here".

The legacy sanitization (version 1) dropped uppercase letters and other non-ASCII characters from the column names (except that for file imports ogr2ogr previously converts names to downcase, so uppercase is not lost).

The new one (version 2) applies a transliteration to downcase ASCII letters that we already had implemented in two places: String and StringSanitizer. I've removed redundancy leaving it just in the latter and also made some corrections. (we could include the module StringSanitizer in class String to restore access through
strings)

The new sanitization also differs in prepending an underscore to reserved keywords and system columns, but this is one of the questions to be discussed below as I'm not quite sure this is a good idea. (by changing these names we make it more convenient to handle
them in SQL since they don't require quoting)

Sanitization flavours

While testing this I noticed there are cases in which sanitization perform by imports differ from that of syncs, so some users could be affected by this having their column names change between the initial import and the first synchronization.

This example shows the columns I've been using for testing, all belong to the same import/sync table, so this shows how the sanitization and name collision-prevention transforms the names:

input column file import db import db sync v2 file v2 db
abc abc abc abc abc abc
Abc abc2 _bc _bc abc2 abc_1
2abc column_2abc column_2ab column_2ab _2abc _2abc
Регион _ _ _ region region
description/名稱 description description description description description
XX xx __1 _1 xx xx
YY yy __2 _2 yy yy
> min _min _min _min min min
any any any any _any _any
a+++b a_b a_b a_b a_b a_b

old file import/sync:

  • collision-avoiding suffixes style: x, x2, x3, ... (as set by ogr2ogr)
  • invalid identifier prefix: column_
  • cyrillic: _
  • other scripts/symbols: ignored
  • uppercase: to lowercase
  • reserved keywords: kept (so they require quoting in SQL)

old db-connector import

  • collision-avoiding suffixes style: x, x_1, x_2, ...
  • invalid identifier prefix: column_
  • cyrillic: _
  • other scripts/symbols: ignored
  • uppercase: _
  • uppercase: to lowercase

old db-connector synchronization

  • Like import, except:
    • suffix double underscore converted to single (_1 instead of __1)

new (version 2)

Applied only to new imports/syncs; existing syncs will not be altered

  • collision avoiding style still different for files and db: x2 vs x_1
  • invalid prefix: _
  • reserved word prefix: _ (no need for quoting)
  • cyrillic transliterated
  • uppercase: to lowercase

Fear, uncertainty, and doubt

Import/Sync, and specially sanitization code has become very messy over the years, so I've tried to leave it in a slightly tidier state while implementing this (just slightly enough that I could keep my sanity while doing these changes). That means there could be some deviation in the legacy sanitization or some other side effects affecting imports, editing columns, etc. I think only those that affect existing syncs should be relevant, and in the tests I've performed (see below) things work as before, but there could be corner cases.

Staging tests

For testing this in staging I've used both db-connector imports and CSV file imports.

For the db-connector I've been rather hacky, by using a Postgres database as the source, created in the same database host as the user database. The name of the database is testconn

In it I created this table:

CREATE TABLE weirder ("abc" int, "Abc" int, "2abc" int ,"Регион" int ,"description/名稱" int ,"XX" int ,"YY" int ,"> min" int, "any" int, "a++b" int);
INSERT INTO weirder VALUES (11,22,33,44,55,66,77,88,99,10), (110,220,330,440,550,660,770,880,990,100);

Imports and sycs can be created like this:

curl -k -H "Content-Type: application/json" -d '{"connector":{ "provider":"postgres", "connection":{ "sslmode": "disable", "server":"localhost", "username":"postgres", "database":"testconn"}, "table":"weirder","import_as":"weirder0"}}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"
curl -k -H "Content-Type: application/json" -d '{"interval":900, "connector":{ "provider":"postgres", "connection":{ "sslmode": "disable", "server":"localhost", "username":"postgres", "database":"testconn"}, "table":"weirder"}}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

For the file imports a created a similar CSV file:

weirder.csv:

"abc","Abc","2abc","Регион","description/名稱","XX","YY","> min","any","a+++b"
11,22,33,44,55,66,77,88,99,10
110,220,330,440,550,660,770,880,990,100

Then stored it in a bucket accessible via https://storage.googleapis.com/test_carto/weirder.csv:

gsutil cp weirder.csv gs://test_carto/ && gsutil acl ch -u AllUsers:R gs://test_carto/weirder.csv

Imports and export can be performed now as:

curl -k -H "Content-Type: application/json" -d '{"url":"https://storage.googleapis.com/test_carto/weirder.csv"}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"
curl -k -H "Content-Type: application/json" -d '{"interval":900,"url":"https://storage.googleapis.com/test_carto/weirder.csv"}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

Open questions

Double sanitization (import-sync differences)

The double sanitization performed in synchronizations is still in place.
It makes legacy (v1) syncs keep their behaviour (produce _1 instead of __1) and doesn't affect new syncs (v2) because new sanitization is idemponent, so there's no difference between the import (single sanitization) and the sync (double sanitization).

But I don't like depending on that oddity, so I would consider removing the double sanitization
and changing the legacy (v1) sanitization to act as if applied twice, so that old syncs are not affected. (we don't need to reproduce the effect of old imports, so no problem here)

Changing the legacy sanitization is a matter of avoiding double underscores after adding the collision suffix.

Differences between file and db-connector imports

As can be seen in the table above, ogr2ogr has it's own logic to avoid column name collisions
(since it changes names to downcase) which is different from ours.

We could change the logic for our new (version 2) sanitization to match ogr2ogr.

Reserved names

Our new sanitization also has changed what we do with PostgreSQL reserved keywords (e.g. AS)
and system columns (e.g. xmin). If we use such names in a table then they must be quoted
every time they appear in SQL. The new sanitization is currently altering these names by
prepending an underscore. (this is done because use the sanitization functions in places
where we were expecting this behaviour) For imported data, this makes column names more convenient for users in the sense that they don't need to use quotes in SQL, but maybe it's better
to avoid unnecessary changes.

кириллица

Since we had an implementation of cyrillic transliteration I've applied it in the new sanitization: some of our users are importing tables with cyrillic column names and without this their columns end up as _, _1, _2, so I think it's nicer for them. We're still discriminating users of many other alphabets, of course. Arguably, we could preserve cyrillic and other alphabets (by using quotes they should be usable in SQL), but I don't want to bring in that question in this PR. So the question is simply: perform the cyrillic transliteration?

Review collision logic

For collision handling, the sanitization method Column::get_valid_column_name accepts a list of existing column names that should be avoided.

The name to be sanitized is removed from this list inside this function.

I think we should review the uses of this function (direct or indirect) because there could be cases in which we shouldn't remove the name from the list.

I can think of these cases (and I'm not sure if and how are we passing existing column names in each case):

  • Add a new column to existing table. The new name may already exist, we shouldn't remove it from the list.
  • Alter the name of an existing column: We should remove the old name, not the new one.
  • Sanitize one existing column (usually used to sanitize all one by one): in this case the original column must be removed.
  • Prepare column names for table creation: here we should incrementally fill columns with previously sanitized names as in the column_spec unit tests, and not remove the current one.

Missing tests

We should add tests for the versioning mechanism:

  • No sanitization applied for tables registered by ghost tables (i.e. version 0 applied)
  • For data imports without versioning info, version 1 is applied
  • For new data imports, version 2 is applied

It would also be nice to add integration or end-to-end tests similar to the ones performed in staging.

Copy link
Contributor

@gonzaloriestra gonzaloriestra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!! 👏 👏

I left some comments, but it looks good to me for now. Are you planning to add the missing tests you mentioned?

Regarding the open questions, my opinion is that we should change as few things as possible now and go step by step, because this feels a bit dangerous. But all of them make sense to me 👍

app/models/table.rb Outdated Show resolved Hide resolved
services/importer/lib/importer/column.rb Show resolved Hide resolved
candidate_column_name = candidate_column_name.to_s.squish

# Subsequent characters can be letters, underscores or digits
candidate_column_name = candidate_column_name.gsub(/[^a-z0-9]/,'_').gsub(/_{2,}/, '_')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take into account that original names with a double underscore, which should be valid, will be changed. Although it's not a common case, I guess.

services/importer/spec/unit/column_spec.rb Show resolved Hide resolved
@rafatower
Copy link
Contributor

This topic is complex enough to have it buried in issues. Could you please add an entry to doc-internal as part of it?

NEWS.md Outdated Show resolved Hide resolved
Copy link
Contributor

@gonzaloriestra gonzaloriestra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

@jgoizueta jgoizueta merged commit e57ab5d into master Jan 17, 2020
@jgoizueta jgoizueta deleted the column_sanitization branch January 17, 2020 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants