Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sys__id breaks merge operations #114

Closed
volkfox opened this issue Jul 21, 2024 · 9 comments · Fixed by #120
Closed

sys__id breaks merge operations #114

volkfox opened this issue Jul 21, 2024 · 9 comments · Fixed by #120
Assignees
Labels
bug Something isn't working p0-critical

Comments

@volkfox
Copy link
Contributor

volkfox commented Jul 21, 2024

Description

This is the same problem as was reported before with id field.

Now the id is hidden as __sys_id but nonetheless is not ignored in the merge operations.
As a result, merging two datasets creates a column right_sys__id, and merging three datasets still fails:

from datachain.lib.dc import Column, DataChain 

image_uri="gs://datachain-demo/coco2017/images/val/"
coco_json="gs://datachain-demo/coco2017/annotations_captions/" 

images = DataChain.from_storage(image_uri)
meta = DataChain.from_json(coco_json, jmespath = "images")
annotations = DataChain.from_json(coco_json, jmespath = "annotations")  
              
images_meta = images.merge(meta, on="file.name", right_on="images.file_name")
annotated_images = images_meta.merge(annotations, on="images.id", right_on="annotations.image_id")                


print(annotated_images.limit(1).results())
>>> print(annotated_images.limit(1).to_pandas())
Processed: 5000 rows [00:00, 15030.28 rows/s]
Processed: 1 rows [00:00, 1131.46 rows/s]
Download: 3.69MB [00:00, 17.4MB/s]]
Processed: 1 rows [00:00,  1.65 rows/s]
Generated: 5000 rows [00:00, 28006.95 rows/s]
Processed: 1 rows [00:00, 1232.89 rows/s]
Download: 3.69MB [00:00, 6.08MB/s]]
Processed: 1 rows [00:00,  1.09 rows/s]
Generated: 25014 rows [00:00, 42276.87 rows/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 988, in to_pandas
    transposed_result = list(map(list, zip(*self.results())))
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 746, in results
    return list(self.iterate_flatten(row_factory=row_factory))
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 732, in iterate_flatten
    with super().select(*db_signals).as_iterable() as rows:
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 1251, in as_iterable
    query = self.apply_steps().select()
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 1191, in apply_steps
    result = step.apply(
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 780, in apply
    query = query_generator.select()
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 131, in select
    return self.func(*self.columns)
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 1002, in q
    return sqlalchemy.select(*subquery.c).select_from(subquery)
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 1141, in __get__
    obj.__dict__[self.__name__] = result = self.fget(obj)
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/selectable.py", line 860, in c
    self._populate_column_collection()
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/selectable.py", line 1633, in _populate_column_collection
    self.element._generate_fromclause_column_proxies(self)
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/selectable.py", line 6316, in _generate_fromclause_column_proxies
    prox = [
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/selectable.py", line 6317, in <listcomp>
    c._make_proxy(
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 4810, in _make_proxy
    raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Label name right_sys__id is being renamed to an anonymous label due to disambiguation which is not supported right now.  Please use unique names for explicit labels.

Version Info

0.2.6.dev4+gef2347f

Python 3.9.4
@volkfox volkfox added bug Something isn't working priority-p1 labels Jul 21, 2024
@volkfox volkfox changed the title _sys__id breaks merge operations sys__id breaks merge operations Jul 21, 2024
@dmpetrov
Copy link
Member

@skshetry @rlamy any idea how to fix it?

The sys was not requested in the code. So, user should not see tis issue.

PS: it makes me think - merge() has to recreate sys columns (this only way to guaranty uniqueness of IDs). WDYT?

@dmpetrov
Copy link
Member

made it P0 - it's should be in getting started for the release.

@skshetry skshetry self-assigned this Jul 22, 2024
@skshetry
Copy link
Member

The problem is not with sys__id AFAIU. (We do need to remove it from schema but that's a separate issue).

sqlalchemy.exc.InvalidRequestError: Label name right_sys__id is being renamed to an anonymous label due to disambiguation which is not supported right now. Please use unique names for explicit labels.

^ The problem is this.

@skshetry
Copy link
Member

skshetry commented Jul 22, 2024

Okay, so this happens due to double merge, where we use the right_ prefix for columns for the merge of the right side.

Since the right_ prefix is already used on first merge, second merge also uses that prefix and the column names conflict.

The fix is to pass a custom rname for the second merge:

images_meta = images.merge(meta, on="file.name", right_on="images.file_name")
annotated_images = images_meta.merge(annotations, on="images.id", right_on="annotations.image_id", rname="right2_")

@dmpetrov
Copy link
Member

The fix is to pass a custom rname for the second merge:

This is a great workaround.
@volkfox could you please use this for now.

@skshetry the bigger issue is - user again got into the ID issue where it was not requested. We need to improve it.

@dmpetrov
Copy link
Member

The issue was not fixed

@dmpetrov dmpetrov reopened this Jul 22, 2024
@skshetry
Copy link
Member

@dmpetrov, the issue was not related to id. It just happened that id is the first column in the table and raised an issue. You'd get same issue with duplication with other columns eg: file.name`, etc.

@skshetry
Copy link
Member

I know there is a separate issue with join, but that is separate issue (and which is what I am trying to fix right now).

@dmpetrov
Copy link
Member

that's right - it's just a first column. However, the file.name issue would be in a user's shoulders since user created this while sys_id is the issue we created.

User should never run intro sys id when it was not requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working p0-critical
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants