-
Notifications
You must be signed in to change notification settings - Fork 116
Adding DataChain.export_files(...)
#30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
facabdb
21e95d5
a47f219
f382739
c8625cc
8ba52a7
4f0b2df
9b0fca1
3217021
90c3ac1
86355dd
f66cac3
55b2253
bbba52d
2285554
bb6fb50
505b1e7
0a3fa78
e82a48a
e04d0dd
6a9c43b
406feb6
4c91fdb
e630077
175d40d
4775cb1
caaaa5f
a8ceb22
35b47b5
1b1f15a
9459eb6
cdd6a87
8fffba0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -865,6 +865,12 @@ def apply_sql_clause(self, query): | |
return sqlalchemy.select(f.count(1)).select_from(query.subquery()) | ||
|
||
|
||
@frozen | ||
class SQLDistinct(SQLClause): | ||
ilongin marked this conversation as resolved.
Show resolved
Hide resolved
|
||
def apply_sql_clause(self, query): | ||
return query.distinct() | ||
|
||
|
||
@frozen | ||
class SQLUnion(Step): | ||
query1: "DatasetQuery" | ||
|
@@ -1407,6 +1413,12 @@ def offset(self, offset: int) -> "Self": | |
query.steps.append(SQLOffset(offset)) | ||
return query | ||
|
||
@detach | ||
def distinct(self) -> "Self": | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This signature is not comprehensive The main use case for distinct() in the datasets is removal of duplicate entries - for that, the function should take signal (or signal list) as an argument There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. right! @ilongin could you please implement this as a follow up issue? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Created #89 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I will create a followup issue. It seems like we need sometning like PostgreSQL specific DISTINCT ON which is not available in SQLite though (it has just "normal" distinct which returns unique column(s)) where we will prob need to implement it with group by or something else under the hood There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if self.select(f"{signal}.name").distinct().count() != self.count():
raise ValueError("Files with the same name found") This statement might not ideal for two resons:
This seems like group by with a count is the right way to solve this, not distinct. |
||
query = self.clone() | ||
query.steps.append(SQLDistinct()) | ||
return query | ||
|
||
def as_scalar(self) -> Any: | ||
with self.as_iterable() as rows: | ||
row = next(iter(rows)) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -434,6 +434,40 @@ def test_select_except(cloud_test_catalog): | |
] | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"cloud_type,version_aware", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we check distinct without File? It should not touch other parts if there is no need There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd try to distinct on a list of integers... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As we discussed, I would leave this for separated issue as there are multiple tests that could be refactored in this way in this file |
||
[("s3", True)], | ||
indirect=True, | ||
) | ||
def test_distinct(cloud_test_catalog): | ||
catalog = cloud_test_catalog.catalog | ||
path = cloud_test_catalog.src_uri | ||
ds = DatasetQuery(path=path, catalog=catalog) | ||
|
||
q = ds.select(C.parent).order_by(C.parent).distinct() | ||
result = q.results() | ||
|
||
assert result == [ | ||
("",), | ||
("cats",), | ||
("dogs",), | ||
("dogs/others",), | ||
] | ||
|
||
|
||
@pytest.mark.parametrize( | ||
"cloud_type,version_aware", | ||
[("s3", True)], | ||
indirect=True, | ||
) | ||
def test_distinct_count(cloud_test_catalog): | ||
catalog = cloud_test_catalog.catalog | ||
path = cloud_test_catalog.src_uri | ||
ds = DatasetQuery(path=path, catalog=catalog) | ||
|
||
assert ds.select(C.parent).order_by(C.parent).distinct().count() == 4 | ||
|
||
|
||
@pytest.mark.parametrize("save", [True, False]) | ||
@pytest.mark.parametrize( | ||
"cloud_type,version_aware", | ||
|
Uh oh!
There was an error while loading. Please reload this page.