sys__id breaks merge operations #114

volkfox · 2024-07-21T21:33:36Z

Description

This is the same problem that was reported before with the id field.

Now the id is hidden as sys__id, but it is still not ignored in merge operations.
As a result, merging two datasets creates a column right_sys__id, and merging three datasets still fails:

from datachain.lib.dc import Column, DataChain

image_uri = "gs://datachain-demo/coco2017/images/val/"
coco_json = "gs://datachain-demo/coco2017/annotations_captions/"

images = DataChain.from_storage(image_uri)
meta = DataChain.from_json(coco_json, jmespath="images")
annotations = DataChain.from_json(coco_json, jmespath="annotations")

images_meta = images.merge(meta, on="file.name", right_on="images.file_name")
annotated_images = images_meta.merge(annotations, on="images.id", right_on="annotations.image_id")

print(annotated_images.limit(1).results())

>>> print(annotated_images.limit(1).to_pandas())
Processed: 5000 rows [00:00, 15030.28 rows/s]
Processed: 1 rows [00:00, 1131.46 rows/s]
Download: 3.69MB [00:00, 17.4MB/s]]
Processed: 1 rows [00:00,  1.65 rows/s]
Generated: 5000 rows [00:00, 28006.95 rows/s]
Processed: 1 rows [00:00, 1232.89 rows/s]
Download: 3.69MB [00:00, 6.08MB/s]]
Processed: 1 rows [00:00,  1.09 rows/s]
Generated: 25014 rows [00:00, 42276.87 rows/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 988, in to_pandas
    transposed_result = list(map(list, zip(*self.results())))
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 746, in results
    return list(self.iterate_flatten(row_factory=row_factory))
  File "/Users/dkh/datachain/src/datachain/lib/dc.py", line 732, in iterate_flatten
    with super().select(*db_signals).as_iterable() as rows:
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 1251, in as_iterable
    query = self.apply_steps().select()
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 1191, in apply_steps
    result = step.apply(
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 780, in apply
    query = query_generator.select()
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 131, in select
    return self.func(*self.columns)
  File "/Users/dkh/datachain/src/datachain/query/dataset.py", line 1002, in q
    return sqlalchemy.select(*subquery.c).select_from(subquery)
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 1141, in __get__
    obj.__dict__[self.__name__] = result = self.fget(obj)
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/selectable.py", line 860, in c
    self._populate_column_collection()
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/selectable.py", line 1633, in _populate_column_collection
    self.element._generate_fromclause_column_proxies(self)
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/selectable.py", line 6316, in _generate_fromclause_column_proxies
    prox = [
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/selectable.py", line 6317, in <listcomp>
    c._make_proxy(
  File "/Users/dkh/datachain/venv/datachain/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 4810, in _make_proxy
    raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Label name right_sys__id is being renamed to an anonymous label due to disambiguation which is not supported right now.  Please use unique names for explicit labels.

Version Info

0.2.6.dev4+gef2347f

Python 3.9.4

dmpetrov · 2024-07-21T22:10:19Z

@skshetry @rlamy any idea how to fix it?

The sys column was not requested in the code, so the user should not see this issue.

PS: it makes me think - merge() has to recreate sys columns (this is the only way to guarantee uniqueness of IDs). WDYT?

dmpetrov · 2024-07-21T22:10:43Z

Made it P0 - it should be in getting started for the release.

skshetry · 2024-07-22T04:28:28Z

The problem is not with sys__id AFAIU. (We do need to remove it from schema but that's a separate issue).

sqlalchemy.exc.InvalidRequestError: Label name right_sys__id is being renamed to an anonymous label due to disambiguation which is not supported right now. Please use unique names for explicit labels.

^ The problem is this.

skshetry · 2024-07-22T04:45:35Z

Okay, so this happens due to the double merge, where we use the right_ prefix for columns coming from the right side of the merge.

Since the right_ prefix is already used in the first merge, the second merge uses the same prefix and the column names conflict.

The fix is to pass a custom rname for the second merge:

images_meta = images.merge(meta, on="file.name", right_on="images.file_name")
annotated_images = images_meta.merge(annotations, on="images.id", right_on="annotations.image_id", rname="right2_")
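To make the collision concrete, here is a minimal pure-Python sketch of the naming logic only. It does not use datachain or SQLAlchemy; `merged_column_names` and `duplicates` are hypothetical helpers, and the rule "prefix a right-side column only when its name clashes with a left-side one" is an assumption made for illustration, not the library's confirmed behavior. The column lists are simplified stand-ins for the three datasets above.

```python
def merged_column_names(left, right, rname="right_"):
    """Hypothetical helper: combine column names, prefixing right-side clashes."""
    result = list(left)
    for name in right:
        # Assumption: a right-side column that clashes with a left-side
        # column is renamed with the rname prefix; others are kept as-is.
        result.append(rname + name if name in left else name)
    return result


def duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    seen, dupes = set(), []
    for name in names:
        if name in seen and name not in dupes:
            dupes.append(name)
        seen.add(name)
    return dupes


# Simplified stand-ins for the columns of the three datasets.
images = ["sys__id", "file.name"]
meta = ["sys__id", "images.file_name", "images.id"]
annotations = ["sys__id", "annotations.image_id"]

# First merge: the right-side sys__id becomes right_sys__id.
first = merged_column_names(images, meta)

# Second merge with the default prefix produces right_sys__id again.
second_default = merged_column_names(first, annotations)

# Second merge with a custom prefix avoids the clash.
second_custom = merged_column_names(first, annotations, rname="right2_")

print(duplicates(second_default))  # ['right_sys__id']
print(duplicates(second_custom))   # []
```

Under these assumptions, the duplicate name after the second default merge corresponds to the label that SQLAlchemy refuses to disambiguate in the traceback above, and a distinct `rname` removes it.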

dmpetrov · 2024-07-22T04:55:25Z

> The fix is to pass a custom rname for the second merge:

This is a great workaround.
@volkfox could you please use this for now?

@skshetry the bigger issue is that the user again ran into the ID issue when the ID was not requested. We need to improve it.

dmpetrov · 2024-07-22T04:56:56Z

The issue was not fixed

skshetry · 2024-07-22T04:58:30Z

@dmpetrov, the issue was not related to id. It just happened that id is the first column in the table, so it was the one that raised the error. You'd get the same duplication issue with other columns, e.g. file.name, etc.

skshetry · 2024-07-22T04:59:07Z

I know there is a separate issue with join, but that is a separate issue (and it is what I am trying to fix right now).

dmpetrov · 2024-07-22T05:02:23Z

That's right - it's just the first column. However, the file.name issue would be on the user's shoulders, since the user created that column, while sys__id is an issue we created.

The user should never run into sys id when it was not requested.
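One way to picture the behavior being asked for is a merge that never carries internal columns over from the right side. The sketch below is pure Python and only models column names; `merge_without_right_sys` and the `sys__` prefix check are hypothetical and are not taken from the datachain code base or from any specific fix.

```python
SYS_PREFIX = "sys__"  # assumption: internal columns share this prefix


def merge_without_right_sys(left, right, rname="right_"):
    """Hypothetical helper: merge column names, dropping right-side sys columns."""
    result = list(left)
    for name in right:
        if name.startswith(SYS_PREFIX):
            # Internal column: never carried over from the right side.
            continue
        # User-visible clashes are still renamed with the rname prefix.
        result.append(rname + name if name in result else name)
    return result


# Simplified stand-ins for the columns of the three datasets.
images = ["sys__id", "file.name"]
meta = ["sys__id", "images.file_name", "images.id"]
annotations = ["sys__id", "annotations.image_id"]

merged = merge_without_right_sys(
    merge_without_right_sys(images, meta), annotations
)
print(merged)
```

In this model, chaining any number of merges leaves a single sys__id column, so the user only meets a prefix clash for columns they created themselves (such as file.name).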

volkfox added bug and priority-p1 labels Jul 21, 2024

volkfox changed the title ~~_sys__id breaks merge operations~~ sys__id breaks merge operations Jul 21, 2024

skshetry added p0-critical and removed priority-p1 labels Jul 22, 2024

skshetry self-assigned this Jul 22, 2024

skshetry closed this as completed Jul 22, 2024

dmpetrov reopened this Jul 22, 2024

skshetry mentioned this issue Jul 22, 2024

merge/join: exclude sys signals #120

Merged

skshetry closed this as completed in #120 Jul 22, 2024
