Clarify ordering semantics #477
Comments
Good point. We need to keep information about the order and use it in the following statements when it makes sense. |
It does not seem possible to generate a sequential list of IDs in BigQuery (more details here). I am currently stuck on how we can provide deterministic results from all chains in an efficient way. With that said, is there a clear use case for ordering an intermediate step in a chain? I can find superficial examples in tests such as this:

```python
def name_len(path):
    return (len(posixpath.basename(path)),)

DataChain.from_storage(path, session=session).order_by("file.path").map(
    name_len, params=["file.path"], output={"name_len": int}
).save(ds_name)
```

But do we support cumulative values being calculated across rows? Wouldn't that be done via an aggregation?
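As a point of reference (not from the thread), a minimal sketch of how a cumulative value could be computed by ordering the chain and accumulating on the client side; the dataset and column names are illustrative, and it assumes `collect()` yields single values when one column is requested, in the order set by the preceding `order_by()`:

```python
from datachain import DataChain

# Illustrative: running total over an ordered column, accumulated client-side.
# Assumes collect() streams rows in the order set by the preceding order_by().
running_total = 0
cumulative = []
for value in DataChain.from_dataset("my_dataset").order_by("id").collect("value"):
    running_total += value
    cumulative.append(running_total)
```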
We could patch `select` so that it re-applies the most recent ordering:

```python
@detach
def select(self, *args, **kwargs) -> "Self":
    named_args = [v.label(k) for k, v in kwargs.items()]
    query = self.clone()
    # Remember the last step so we can detect a preceding ORDER BY.
    last_step = query.steps[-1] if query.steps else None
    query.steps.append(SQLSelect((*args, *named_args)))
    # If the previous step was an ORDER BY, re-apply it after the SELECT
    # so the ordering is not lost in the generated subquery.
    if isinstance(last_step, SQLOrderBy):
        query.steps.append(SQLOrderBy(last_step.args))
    return query
```

but I think it would be better to explicitly state that `select()` does not preserve ordering. |
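If that route were taken, a minimal sketch of what callers would do instead (dataset and column names are illustrative, using the public `DataChain` API):

```python
from datachain import DataChain

# Illustrative sketch: instead of relying on an earlier order_by() surviving
# the projection, the caller re-applies the ordering as the final step.
(
    DataChain.from_dataset("my_dataset")
    .select("a")
    .order_by("a")
    .save("my_dataset_ordered")
)
```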
I'd also like to understand why we are trying to stamp out the standard SQL behaviour described in the issue: the order of the results from a SELECT query is undefined unless there is an ORDER BY clause.
Why must the order of query results be deterministic in DataChain even when an order has not been specified? Edit: There are remnants of this ordering behaviour in the system, but it looks like it is completely untested (see #489). |
It looks like there is some misunderstanding about the ordering. First, let's clarify the basic assumptions:
Please LMK if you disagree with some of these or you'd like to add something. |
Thanks for clarifying those basic assumptions. I think I have identified some discrepancies between the assumptions and the current implementation.
I don't think this is strictly true on the ClickHouse side because we have this definition: `Column("sys__id", UInt64, primary_key=True, server_default=func.rand64())`, which means that if it is not explicitly set we end up with a random value. We do use
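As a toy analogy (plain Python, not ClickHouse) of why a `rand64()` default makes `sys__id` unusable for recovering insertion order:

```python
import random

# Each inserted row gets a random 64-bit id, mimicking server_default=func.rand64().
rows = [{"sys__id": random.getrandbits(64), "value": v} for v in ["a", "b", "c", "d"]]

# Sorting by sys__id is deterministic for a given set of ids, but it has no
# relationship to the order in which the rows were inserted.
by_id = sorted(rows, key=lambda r: r["sys__id"])
print([r["value"] for r in by_id])  # usually not ["a", "b", "c", "d"]
```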
The preview will always appear stable now because we persist the dataset's first 20 rows in the metastore against the dataset version and use that instead of loading the dataset each time. The following test passes for both SQLite and ClickHouse:

```python
def test_dataset_preview_order(test_session):
    ids = list(range(10000))
    order = ids[::-1]
    catalog = test_session.catalog
    dataset_name = "test"

    DataChain.from_values(id=ids, order=order, session=test_session).order_by(
        "order"
    ).save(dataset_name)

    preview_values = []
    for r in catalog.get_dataset(dataset_name).get_version(1).preview:
        id = ids.pop()
        o = order.pop()
        entry = (id, o)
        preview_values.append((id, o))
        assert (r["id"], r["order"]) == entry

    DataChain.from_dataset(dataset_name, session=test_session).save(dataset_name)
    for r in catalog.get_dataset(dataset_name).get_version(2).preview:
        assert (r["id"], r["order"]) == preview_values.pop(0)

    DataChain.from_dataset(dataset_name, 2, session=test_session).order_by("id").save(
        dataset_name
    )
    for r in catalog.get_dataset(dataset_name).get_version(3).preview:
        assert r["id"] == ids.pop(0)
        assert r["order"] == order.pop(0)
```

BUT it seems that it only passes by chance, because to create the preview we do this:

```python
if not dataset_version.preview:
    values["preview"] = (
        DatasetQuery(name=dataset.name, version=version, catalog=self)
        .limit(20)
        .to_db_records()
    )
```

which roughly translates to a plain `LIMIT 20` query with no explicit `ORDER BY`. I have tested this and it seems to work, but only by chance as well (no ordering provided in the SQL). As you can see from #489, we have some logic in the code base where we attempt to preserve this ordering behaviour, but it looks to be untested.
IMO, this is worth investigating, but as stated elsewhere this seems like a difficult problem to solve, especially across databases and at scale (see https://github.com/iterative/studio/issues/10635#issuecomment-2403630921 for some context). Collecting all the obvious use cases, like "streaming a dataset to ML training with a given order", would be a great start, and agreeing that we do not guarantee the order of "unordered chains" would be very helpful. |
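For illustration only, a hedged sketch of one way the preview query could be made repeatable, assuming the same internal `DatasetQuery` API as the snippet above and that its `order_by()` accepts this column reference; note that, as discussed above, `sys__id` is random on ClickHouse, so this would make the preview deterministic but not meaningfully ordered:

```python
# Hypothetical sketch, not current behaviour: pin the preview to an explicit
# ordering so the persisted 20 rows do not depend on incidental row order.
# Assumes DatasetQuery.order_by() accepts a column name like this.
if not dataset_version.preview:
    values["preview"] = (
        DatasetQuery(name=dataset.name, version=version, catalog=self)
        .order_by("sys__id")
        .limit(20)
        .to_db_records()
    )
```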
From a long discussion in the planning meeting: for now, the order of a dataset will only be guaranteed when an `order_by()` is applied as the last step of the chain. |
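To illustrate the agreed rule with a hedged sketch (dataset and column names are illustrative, using the public `DataChain` API):

```python
from datachain import DataChain

chain = DataChain.from_dataset("my_dataset")

# Guaranteed: order_by() is the last step before the results are consumed.
for row in chain.select("a").order_by("a").collect():
    ...

# Not guaranteed: select() comes after order_by(), so the ordering may be
# discarded by the generated subquery.
for row in chain.order_by("a").select("a").collect():
    ...
```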
Perfect, thanks @mattseddon. It should simplify things a lot. Thanks for pushing this forward. |
Can we close this / repurpose this btw? |
We need to specify clearly how `.order_by()` interacts with other methods. For instance, in SQL, the order of the results from a SELECT query is undefined unless there is an ORDER BY clause. Currently, `chain.order_by("a").select("a")` generates SQL that looks like `SELECT a FROM (SELECT a FROM some_table ORDER BY a)`, which is therefore technically unordered, even though it'll usually return sorted results. While it might be fine to say that `.select()` doesn't preserve ordering, it seems obvious that `chain.order_by("a").collect("a")` should return the "a" column in sorted order, and `chain.collect("a")` is currently implemented as `chain.select("a").collect()`...
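For context, a small self-contained sketch (plain `sqlite3`, not DataChain) of the SQL shape being discussed: the nested query usually comes back sorted, but only the version with an outer ORDER BY is actually guaranteed to be ordered:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (a INTEGER)")
conn.executemany("INSERT INTO some_table (a) VALUES (?)", [(3,), (1,), (2,)])

# Shape generated by chain.order_by("a").select("a"): the ORDER BY sits in a
# subquery, so the outer SELECT's order is technically unspecified.
unguaranteed = conn.execute(
    "SELECT a FROM (SELECT a FROM some_table ORDER BY a)"
).fetchall()

# Ordering is only guaranteed when the outermost query has the ORDER BY.
guaranteed = conn.execute(
    "SELECT a FROM (SELECT a FROM some_table) ORDER BY a"
).fetchall()

print(unguaranteed, guaranteed)
```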