Is it expected for column names to be alpha-sorted? #222

aaronsteers · 2021-10-27T04:19:29Z

It looks like columns may be alpha-sorted and I'm not sure if this is to-be-expected / intentional.

The sorting appears to happen here in flatten_schema:

https://github.com/transferwise/pipelinewise-target-snowflake/blob/master/target_snowflake/flattening.py#L68:L73

    sorted_items = sorted(items, key=key_func)
    for k, g in itertools.groupby(sorted_items, key=key_func):
        if len(list(g)) > 1:
            raise ValueError(f'Duplicate column name produced in schema: {k}')

    return dict(sorted_items)

Seems like this might also work for the dupe check, while still returning the (original) pre-sort column ordering:

    sorted_items = sorted(items, key=key_func)
    for k, g in itertools.groupby(sorted_items, key=key_func):
        if len(list(g)) > 1:
            raise ValueError(f'Duplicate column name produced in schema: {k}')

    return dict(items). # <<<

Curious as to the thoughts around this - if sorting is necessary for other reasons or if I'm perhaps reading this incorrectly.

Thank in advance!

s7clarke10 · 2021-11-02T21:31:59Z

Adding my 2 cents to the question. It would be great if there was a choice / option on whether the columns are alpha sorted or not if I understand this comment correctly. Even if the default behaviour was to alpha sort columns, it would be nice to be able to override this via an environment variable.
For me personally, when I push content from a tap, I would prefer to have the columns sorted in the same order as the original database.

Samira-El · 2021-11-05T10:34:18Z

Hey Aaron, I don't honestly recall why columns are being sorted here, one thing I reckon this sorting might be helping with is the order of columns in CSV and matching that to the order in COPY/MERGE commands, I could be wrong though, this needs to be tested.

have you tried removing the sorting and testing if loading still works?

@koszti do you perhaps recall why we're sorting here?

aaronsteers · 2021-11-05T19:39:57Z

Hi, @Samira-El - Thanks for your comments. I've not tested myself, but I can understand if there could be some other factors at play.

koszti · 2021-11-06T07:55:51Z

Interesting. We never wanted the columns to be alpha-sorted and I'm not sure why the extra sort here. 🤔 Even if we do the sort, the flatten_schema returns a dict which doesn't guarantee the order of the items when we're using it to generate the create_table_query. I think the columns in the target table is sometimes alpha-sorted but sometimes not and it's safe to remove this extra sort. What do you think?

Originally we wanted the columns to be sorted in the same order as the original database, but the singer spec doesn't support this option. In the singer messages the columns are dict key, and the order is more like random.

s7clarke10 · 2022-06-16T21:22:00Z

Hi,

A colleague of mine has re-raised the issue of ordering records written to Snowflake and the preference to retain the tap ordering if possible.

I'm wondering if the sorting was to do with older versions of Python v 3.5 and below which didn't guarantee the order of items. Certain versions of Python 3.6 would retain order, and from Python 3.7+ it is guaranteed to retain the insertion order for a Dictionary. Given we are seeing a standard to deprecate older versions of Python i.e. 3.6 and below, is it worth looking at the ability to retain the original tap column order?

Perhaps it could be implemented via a Environment Variable setting which by default sorts alphabetically for backwards compatibility, but can be overwritten to not sort the dictionary so if you are running Python 3.7+ you could obtain the original tap column ordering?

I haven't experimented with this but I'm interested in your thoughts on this and whether you think this is a feature you would support?

Samira-El · 2022-06-17T08:35:55Z

Hi Steve, I don't have any strong opinions about keeping the alphabetical sorting.

I don't fancy having this a feature flag - it'd be yet another path in the code-, I reckon we can get rid of the sort and honor the order of fields as present in the schema emitted by the tap.

Would be happy to review a PR with this change.

s7clarke10 · 2022-06-17T10:45:30Z

Hi Samira, Thank you very much for your feedback on that. I will look to build and test this and will create a PR when I am happy with the change. As a FYI, I have pushed through two PR’s for tap-s3-csv with a new feature and bug fix. I note that integration tests work on my local when I set the required environment variables but seem to fail in the github ci/cd tests. Thanks Steve From: Samira El Aabidi ***@***.***> Sent: Friday, 17 June 2022 8:36 PM To: transferwise/pipelinewise-target-snowflake ***@***.***> Cc: Steve Clarke ***@***.***>; Comment ***@***.***> Subject: Re: [transferwise/pipelinewise-target-snowflake] Is it expected for column names to be alpha-sorted? (Issue #222) Hi Steve, I don't have any strong opinions about keeping the alphabetical sorting. I don't fancy having this a feature flag - it'd be yet another path in the code-, I reckon we can get rid of the sort and honor the order of fields as present in the schema emitted by the tap. Would be happy to review a PR with this change. — Reply to this email directly, view it on GitHub <#222 (comment)> , or unsubscribe <https://github.com/notifications/unsubscribe-auth/AUDU42QKXWJVRKDB7KMNJYLVPQ2HRANCNFSM5GZKZ6UA> . You are receiving this because you commented. <https://github.com/notifications/beacon/AUDU42XN5MEJH7MYAYQDGNTVPQ2HRA5CNFSM5GZKZ6UKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOIUHWOGI.gif> Message ID: ***@***.*** ***@***.***> >

mjsqu · 2022-10-17T02:33:12Z

We've been running with this commit from my fork for quite a while:

4354b8f

Which I notice is exactly the same as suggested at the top of this thread.

The incoming SCHEMA messages dictate the ordering of columns, so we have also been configuring our taps to provide an ordered list of columns (based on their column ordinal ID rather than name). Once a SCHEMA message has been processed, it doesn't matter what ordering the RECORD messages take.

mjsqu · 2022-10-17T02:36:10Z

Now I recall the reason for sorting is simply to verify that there are no duplicated column names, so I agree with @aaronsteers code change. The sorted_items object is created for the purposes of checking, but the unsorted items object is what should be used to generate the target table/object in Snowflake.

aaronsteers added the help wanted Extra attention is needed label Oct 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is it expected for column names to be alpha-sorted? #222

Is it expected for column names to be alpha-sorted? #222

aaronsteers commented Oct 27, 2021

s7clarke10 commented Nov 2, 2021

Samira-El commented Nov 5, 2021

aaronsteers commented Nov 5, 2021

koszti commented Nov 6, 2021 •

edited

Loading

s7clarke10 commented Jun 16, 2022

Samira-El commented Jun 17, 2022

s7clarke10 commented Jun 17, 2022 via email

mjsqu commented Oct 17, 2022 •

edited

Loading

mjsqu commented Oct 17, 2022

Is it expected for column names to be alpha-sorted? #222

Is it expected for column names to be alpha-sorted? #222

Comments

aaronsteers commented Oct 27, 2021

s7clarke10 commented Nov 2, 2021

Samira-El commented Nov 5, 2021

aaronsteers commented Nov 5, 2021

koszti commented Nov 6, 2021 • edited Loading

s7clarke10 commented Jun 16, 2022

Samira-El commented Jun 17, 2022

s7clarke10 commented Jun 17, 2022 via email

mjsqu commented Oct 17, 2022 • edited Loading

mjsqu commented Oct 17, 2022

koszti commented Nov 6, 2021 •

edited

Loading

mjsqu commented Oct 17, 2022 •

edited

Loading