Versioned column sanitization #15326
Conversation
Note that the extra_options accessor cannot be used to modify the underlying column
retest this, please
Note that ogc_fid shouldn't be altered since it is handled by Table#import_cleanup
retest this, please
pretty pretty please 🙏
retest this, please
🙄 the things you have to do to make that CI work
Synchronization::Member#table expects user_id to be assignable in the Table constructor (and now also Synchronization::Adapter#import_cleanup expects it)
Sanitization of column names was being accidentally omitted in Synchronization::Adapter#import_cleanup. This wasn't actually too bad, because the sanitization was performed after cartodbfication by Synchronization::Adapter#setup_table. The cause of this was that ::Table relies on its @data_import for the sanitization version, and since it hadn't been assigned in the Table instance used, it used version 0, which means "no sanitization". This also had another side effect: we have been using double sanitization in syncs (legacy version 1), possibly altering column names (double underscores caused by anti-collision suffixes). So this commit restores that behaviour for v1 syncs (although a less accidental way of achieving the effect would be desirable, such as preventing double sanitization and altering v1 sanitization to remove double underscores after collision prevention).
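A minimal sketch of the failure mode described above, using assumed names (DataImportStub, sanitization_version and NO_SANITIZATION_VERSION are illustrative, not the actual ::Table code): when no DataImport has been assigned, the version lookup falls back to 0 and sanitization is skipped.

```ruby
# Illustrative only: assumed names, not the actual CartoDB implementation.
NO_SANITIZATION_VERSION = 0

DataImportStub = Struct.new(:column_sanitization_version)

# Mirrors the described behaviour: without an assigned DataImport the
# sanitization version defaults to 0, i.e. "no sanitization".
def sanitization_version(data_import)
  data_import ? data_import.column_sanitization_version : NO_SANITIZATION_VERSION
end

sanitization_version(DataImportStub.new(2)) # => 2
sanitization_version(nil)                   # => 0, so import_cleanup skipped sanitization
```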
This is ready for review, but there are some pending actions discussed below. @rafatower @gonzaloriestra I've made no formal review request as I don't want to bother you too much these days, but feel free to review it if you see this. The long notes that follow are mostly for myself for when I'm back to work on this.

Description of the changes

A …
If the table doesn't have an associated …
Otherwise it will use the …
If the DataImport has no …
Newly created DataImport records will have the value determined by the …

The legacy sanitization (version 1) dropped uppercase letters and other non-ASCII characters from the column names (except that for file imports ogr2ogr previously converts names to lowercase, so uppercase is not lost). The new one (version 2) applies a transliteration to downcased ASCII letters that we already had implemented in two places: …

The new sanitization also differs in prepending an underscore to reserved keywords and system columns, but this is one of the questions to be discussed below, as I'm not quite sure it is a good idea (by changing these names we make it more convenient to handle …).

Sanitization flavours

While testing this I noticed there are cases in which the sanitization performed by imports differs from that of syncs, so some users could be affected by this, having their column names change between the initial import and the first synchronization. This example shows the columns I've been using for testing; all belong to the same import/sync table, so this shows how the sanitization and name collision-prevention transform the names:
- old file import/sync
- old db-connector import
- old db-connector synchronization
- new (version 2): applied only to new imports/syncs; existing syncs will not be altered
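As a rough illustration of the difference between these flavours (assumed behaviour only, not the project's actual sanitizers): without transliteration, non-ASCII letters collapse into underscores; with it, they survive as ASCII.

```ruby
# Rough approximation for illustration; the real sanitizers differ in details.
def sanitize_without_transliteration(name)
  name.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_')
end

def sanitize_with_transliteration(name)
  # Toy transliteration table just for this example; the actual code reuses an
  # existing transliteration helper.
  translit = { 'Р' => 'R', 'е' => 'e', 'г' => 'g', 'и' => 'i', 'о' => 'o', 'н' => 'n' }
  ascii = name.chars.map { |c| translit.fetch(c, c) }.join.downcase
  ascii.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_')
end

sanitize_without_transliteration('Регион')          # => "_"
sanitize_with_transliteration('Регион')             # => "region"
sanitize_without_transliteration('description/名稱') # => "description_"
sanitize_with_transliteration('description/名稱')    # => "description_"
```

Collision handling would then turn several all-underscore names into _, _1, _2, which matches what users with cyrillic column names currently see (discussed under the open questions below).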
Fear, uncertainty, and doubt

Import/Sync code, and especially the sanitization code, has become very messy over the years, so I've tried to leave it in a slightly tidier state while implementing this (just slightly enough that I could keep my sanity while doing these changes). That means there could be some deviation in the legacy sanitization or some other side effects affecting imports, editing columns, etc. I think only those that affect existing syncs should be relevant, and in the tests I've performed (see below) things work as before, but there could be corner cases.

Staging tests

For testing this in staging I've used both db-connector imports and CSV file imports. For the db-connector I've been rather hacky, using a Postgres database as the source, created in the same database host as the user database. The name of the database is testconn. In it I created this table:

CREATE TABLE weirder ("abc" int, "Abc" int, "2abc" int, "Регион" int, "description/名稱" int, "XX" int, "YY" int, "> min" int, "any" int, "a++b" int);
INSERT INTO weirder VALUES (11,22,33,44,55,66,77,88,99,10), (110,220,330,440,550,660,770,880,990,100);

Imports and syncs can be created like this:

curl -k -H "Content-Type: application/json" -d '{"connector":{ "provider":"postgres", "connection":{ "sslmode": "disable", "server":"localhost", "username":"postgres", "database":"testconn"}, "table":"weirder","import_as":"weirder0"}}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

curl -k -H "Content-Type: application/json" -d '{"interval":900, "connector":{ "provider":"postgres", "connection":{ "sslmode": "disable", "server":"localhost", "username":"postgres", "database":"testconn"}, "table":"weirder"}}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

For the file imports I created a similar CSV file:
Then stored it in a bucket accessible via:

gsutil cp weirder.csv gs://test_carto/ && gsutil acl ch -u AllUsers:R gs://test_carto/weirder.csv

Imports and syncs can now be performed as:

curl -k -H "Content-Type: application/json" -d '{"url":"https://storage.googleapis.com/test_carto/weirder.csv"}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

curl -k -H "Content-Type: application/json" -d '{"interval":900,"url":"https://storage.googleapis.com/test_carto/weirder.csv"}' "https://$USER_NAME.carto-staging.com/api/v1/imports/?api_key=$API_KEY"

Open questions

Double sanitization (import-sync differences)

The double sanitization performed in synchronizations is still in place. But I don't like depending on that oddity, so I would consider removing the double sanitization. Changing the legacy sanitization is a matter of avoiding double underscores after adding the collision suffix (illustrated in the sketch below).

Differences between file and db-connector imports

As can be seen in the table above, … We could change the logic for our new (version 2) sanitization to match ogr2ogr.

Reserved names

Our new sanitization has also changed what we do with PostgreSQL reserved keywords (e.g. …).

кириллица

Since we had an implementation of cyrillic transliteration, I've applied it in the new sanitization: some of our users are importing tables with cyrillic column names, and without this their columns end up as _, _1, _2, so I think it's nicer for them. We're still discriminating against users of many other alphabets, of course. Arguably, we could preserve cyrillic and other alphabets (by using quotes they should be usable in SQL), but I don't want to bring that question into this PR. So the question is simply: perform the cyrillic transliteration?

Review collision logic

For collision handling, the sanitization method … The name to be sanitized is removed from this list inside this function. I think we should review the uses of this function (direct or indirect) because there could be cases in which we shouldn't remove the name from the list. I can think of these cases (and I'm not sure if and how we are passing existing column names in each case):
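A rough sketch, with assumed names (this is not the project's actual helper), of the collision handling discussed in these open questions: a candidate name is checked against the list of existing column names and gets a numeric suffix on collision, and appending that suffix to a name that already ends in an underscore is what produces the double underscores mentioned under the double-sanitization question.

```ruby
# Assumed suffix scheme for illustration only; the real helper may differ.
def add_collision_suffix(candidate, existing_names)
  return candidate unless existing_names.include?(candidate)
  n = 1
  n += 1 while existing_names.include?("#{candidate}_#{n}")
  "#{candidate}_#{n}"
end

add_collision_suffix('description_', ['description_'])  # => "description__1"
# A later sanitization pass that collapses "__" would turn this into
# "description_1", silently renaming the column, which is why removing the
# double sanitization (or avoiding the double underscore here) matters.
```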
Missing tests

We should add tests for the versioning mechanism:
It would also be nice to add integration or end-to-end tests similar to the ones performed in staging.
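A possible skeleton for those versioning tests, with assumed factory, attribute and constant names (they are placeholders, not necessarily the real API of the codebase):

```ruby
# Hypothetical RSpec sketch; factory, attribute and constant names are assumed.
describe 'column sanitization versioning' do
  it 'records the current sanitization version on newly created imports' do
    data_import = create(:data_import)                # assumed FactoryBot factory
    expect(data_import.column_sanitization_version)   # assumed attribute
      .to eq(CURRENT_COLUMN_SANITIZATION_VERSION)     # assumed constant
  end

  it 'keeps legacy (v1) column names for tables without an associated DataImport' do
    # pending: build a table with no DataImport and assert its columns are
    # sanitized exactly as before this change
  end
end
```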
Awesome work!! 👏 👏
I left some comments, but it looks good to me for now. Are you planning to add the missing tests you mentioned?
Regarding the open questions, my opinion is that we should change as few things as possible now and go step by step, because this feels a bit dangerous. But all of them make sense to me 👍
candidate_column_name = candidate_column_name.to_s.squish
# Subsequent characters can be letters, underscores or digits
candidate_column_name = candidate_column_name.gsub(/[^a-z0-9]/,'_').gsub(/_{2,}/, '_')
Take into account that original names with a double underscore, which should be valid, will be changed. Although it's not a common case, I guess.
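For example, running the two substitutions from the diff above on a name that legitimately contains a double underscore:

```ruby
'a__b'.gsub(/[^a-z0-9]/, '_').gsub(/_{2,}/, '_')  # => "a_b": the valid double underscore is collapsed
```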
This topic is complex enough that it shouldn't stay buried in issues. Could you please add an entry to doc-internal as part of it?
👌
See #15231