Skip to content
This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Is it expected for column names to be alpha-sorted? #222

Open
aaronsteers opened this issue Oct 27, 2021 · 9 comments
Open

Is it expected for column names to be alpha-sorted? #222

aaronsteers opened this issue Oct 27, 2021 · 9 comments
Labels
help wanted Extra attention is needed

Comments

@aaronsteers
Copy link

It looks like columns may be alpha-sorted and I'm not sure if this is to-be-expected / intentional.

The sorting appears to happen here in flatten_schema:

https://github.com/transferwise/pipelinewise-target-snowflake/blob/master/target_snowflake/flattening.py#L68:L73

    sorted_items = sorted(items, key=key_func)
    for k, g in itertools.groupby(sorted_items, key=key_func):
        if len(list(g)) > 1:
            raise ValueError(f'Duplicate column name produced in schema: {k}')

    return dict(sorted_items)

Seems like this might also work for the dupe check, while still returning the (original) pre-sort column ordering:

    sorted_items = sorted(items, key=key_func)
    for k, g in itertools.groupby(sorted_items, key=key_func):
        if len(list(g)) > 1:
            raise ValueError(f'Duplicate column name produced in schema: {k}')

    return dict(items). # <<<

Curious as to the thoughts around this - if sorting is necessary for other reasons or if I'm perhaps reading this incorrectly.

Thank in advance!

@aaronsteers aaronsteers added the help wanted Extra attention is needed label Oct 27, 2021
@s7clarke10
Copy link
Contributor

Adding my 2 cents to the question. It would be great if there was a choice / option on whether the columns are alpha sorted or not if I understand this comment correctly. Even if the default behaviour was to alpha sort columns, it would be nice to be able to override this via an environment variable.
For me personally, when I push content from a tap, I would prefer to have the columns sorted in the same order as the original database.

@Samira-El
Copy link
Contributor

Hey Aaron, I don't honestly recall why columns are being sorted here, one thing I reckon this sorting might be helping with is the order of columns in CSV and matching that to the order in COPY/MERGE commands, I could be wrong though, this needs to be tested.

have you tried removing the sorting and testing if loading still works?

@koszti do you perhaps recall why we're sorting here?

@aaronsteers
Copy link
Author

Hi, @Samira-El - Thanks for your comments. I've not tested myself, but I can understand if there could be some other factors at play.

@koszti
Copy link
Contributor

koszti commented Nov 6, 2021

Interesting. We never wanted the columns to be alpha-sorted and I'm not sure why the extra sort here. 🤔 Even if we do the sort, the flatten_schema returns a dict which doesn't guarantee the order of the items when we're using it to generate the create_table_query. I think the columns in the target table is sometimes alpha-sorted but sometimes not and it's safe to remove this extra sort. What do you think?

Originally we wanted the columns to be sorted in the same order as the original database, but the singer spec doesn't support this option. In the singer messages the columns are dict key, and the order is more like random.

@s7clarke10
Copy link
Contributor

Hi,

A colleague of mine has re-raised the issue of ordering records written to Snowflake and the preference to retain the tap ordering if possible.

I'm wondering if the sorting was to do with older versions of Python v 3.5 and below which didn't guarantee the order of items. Certain versions of Python 3.6 would retain order, and from Python 3.7+ it is guaranteed to retain the insertion order for a Dictionary. Given we are seeing a standard to deprecate older versions of Python i.e. 3.6 and below, is it worth looking at the ability to retain the original tap column order?

Perhaps it could be implemented via a Environment Variable setting which by default sorts alphabetically for backwards compatibility, but can be overwritten to not sort the dictionary so if you are running Python 3.7+ you could obtain the original tap column ordering?

I haven't experimented with this but I'm interested in your thoughts on this and whether you think this is a feature you would support?

@Samira-El
Copy link
Contributor

Hi Steve, I don't have any strong opinions about keeping the alphabetical sorting.

I don't fancy having this a feature flag - it'd be yet another path in the code-, I reckon we can get rid of the sort and honor the order of fields as present in the schema emitted by the tap.

Would be happy to review a PR with this change.

@s7clarke10
Copy link
Contributor

s7clarke10 commented Jun 17, 2022 via email

@mjsqu
Copy link

mjsqu commented Oct 17, 2022

We've been running with this commit from my fork for quite a while:

4354b8f

Which I notice is exactly the same as suggested at the top of this thread.

The incoming SCHEMA messages dictate the ordering of columns, so we have also been configuring our taps to provide an ordered list of columns (based on their column ordinal ID rather than name). Once a SCHEMA message has been processed, it doesn't matter what ordering the RECORD messages take.

@mjsqu
Copy link

mjsqu commented Oct 17, 2022

Now I recall the reason for sorting is simply to verify that there are no duplicated column names, so I agree with @aaronsteers code change. The sorted_items object is created for the purposes of checking, but the unsorted items object is what should be used to generate the target table/object in Snowflake.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

5 participants