Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support eager mode #2331

Merged
merged 4 commits into from
Jul 25, 2024
Merged

Conversation

jieguangzhou
Copy link
Collaborator

Description

Related Issues

Checklist

  • Is this code covered by new or existing unit tests or integration tests?
  • Did you run make unit_testing and make integration-testing successfully?
  • Do new classes, functions, methods and parameters all have docstrings?
  • Were existing docstrings updated, if necessary?
  • Was external documentation updated, if necessary?

Additional Notes or Comments

@jieguangzhou jieguangzhou force-pushed the feat/eager-mode branch 3 times, most recently from dbb1caf to b229454 Compare July 19, 2024 15:32
@jieguangzhou
Copy link
Collaborator Author

TODO: Need to add the document

@jieguangzhou jieguangzhou force-pushed the feat/eager-mode branch 4 times, most recently from a5ca324 to 00d77d3 Compare July 22, 2024 15:29
Comment on lines +727 to +735
def replace_id(part):
find_params = part[1]
if not find_params:
find_params = ({}, {})

assert isinstance(find_params[1], dict)
find_params = list(find_params)
find_params[1]['_id'] = 1
return (part[0], tuple(find_params), part[2])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only return _id

Comment on lines +744 to +760
def filter(self, filter):
"""Return a query that filters the documents.

:param filter: The filter to apply.
"""

def replace_function(part):
find_params = part[1]
if not find_params:
find_params = ({},)

assert isinstance(find_params[0], dict)

find_params[0].update(filter)
return (part[0], tuple(find_params), part[2])

return self._replace_part('find', replace_function)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the key of the filter to the list of returned fields.​⬤

Copy link
Collaborator

@blythed blythed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you are using the attribute/ method .datas() to fetch data eagerly.

Why not just use .execute() directly?

CFG.eager = True
r = table.select().execute()[0]

@jieguangzhou
Copy link
Collaborator Author

jieguangzhou commented Jul 24, 2024

I see that you are using the attribute/ method .datas() to fetch data eagerly.

Why not just use .execute() directly?

CFG.eager = True
r = table.select().execute()[0]

I think the Eager mode can be always enabled, so the returned data is not the raw data.

This approach allows users to purposefully use the Eager mode.

If we directly combine it with .execute(), we need to return the original data type (or write related records of SuperDuperData as hidden attributes into the original data without modifying its data type), otherwise, it will cause hidden bugs in downstream applications.

WDYT?


assert self.db is not None, 'No datalayer (db) provided'
query = self
if not len(query.parts):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means query is Simply a Table, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

def _eager_call__(self, *args, **kwargs):
from superduper.misc.eager import SuperDuperData, SuperDuperDataType, TrackData

have_sdd, graph = SuperDuperData.detect_and_get_graph(*args, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already been called at line 910?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the graph parameter is retrieved again because I don’t want the graph parameter to be passed from upstream. This function has very low overhead.

@jieguangzhou
Copy link
Collaborator Author

jieguangzhou commented Jul 24, 2024

I see that you are using the attribute/ method .datas() to fetch data eagerly.

Why not just use .execute() directly?

CFG.eager = True
r = table.select().execute()[0]

After discuss with @kartik4949 , I change to datas = list(db["documents"].select().execute(eager_mode=True))

When eager_mode=False, the original data is returned without any additional information. Otherwise, it enters eager mode, returning SuperDuperData and begins tracking operations performed on the data.

@jieguangzhou jieguangzhou requested a review from kartik4949 July 24, 2024 14:15
When using `select.execute(eager_mode=True)`, all returned data will enter eager mode, which can be used for interactive model pipeline construction.

```python
datas = list(db["documents"].select().execute(eager_mode=True))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting language thing: plural of "data" is "data", singular is "datum".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, Got it, I changed all of them

project.update({k: 1 for k in filter_mapping_base.keys()})

predict_ids_in_filter = [
key.replace("_outputs__", "") for key in filter_mapping_outptus.keys()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this _outputs__?

Copy link
Collaborator Author

@jieguangzhou jieguangzhou Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optimized the logic by using predict_id.

Additionally, since MongoMock does not support ‘.’ in the ‘as’ field for lookups, the key was converted to ‘outputs_’.

After the join, the complete outputs data can be queried as _outputs__{predict_id}._outputs.{predict_id} : result.

Explanations have already been added in the code.

superduper/misc/eager.py Outdated Show resolved Hide resolved
@jieguangzhou jieguangzhou force-pushed the feat/eager-mode branch 3 times, most recently from 1981447 to eabcea8 Compare July 25, 2024 03:53
- Support eager mode condition filter

- Add eager mode tutorial

- Use select.execute instead of select.datas to initiate eager mode.
@jieguangzhou jieguangzhou merged commit 78f4ede into superduper-io:main Jul 25, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[USER-EXP-LIST] [FEAT] Optimizing the building experience of data pipelines
3 participants