Add a way to persist dataset even if exception is thrown #504

Merged (11 commits) on Oct 15, 2024

@amritghimire commented Oct 9, 2024

Building on the changes in #494 and the review comments there, this PR introduces a way to persist a dataset even if an exception is thrown. With this change, cleanup is done in the active context, so session contexts can now be used as checkpoints. An example:

import logging

from datachain import Session
from datachain.lib.dc import DataChain

logging.basicConfig(level=logging.INFO)

# Dataset created with an explicitly passed session. Will persist.
session = Session("asVariable")
dv = DataChain.from_values(key=["a", "b", "c"], session=session)
dv.save("passed_as_argument")

# A datachain created in the global context.
# This will be reverted because of the exception raised in the global context.
DataChain.from_values(key=["a", "b", "c"]).save("global_test_datachain_v1")

with Session("local"):
    # A datachain created in a local context.
    # This will persist since the error occurs in the global context.
    DataChain.from_values(key=["a", "b", "c"]).save("local_test_datachain")

try:
    with Session("local_failure"):
        # A datachain created in a local context.
        # This will not persist since the error occurs in this local context.
        DataChain.from_values(key=["a", "b", "c"]).save("local_test_datachain_v2")
        raise ValueError("Local failure class")
except ValueError:
    pass

# We return to the global context, so this will also be reverted.
DataChain.from_values(key=["a", "b", "c"]).save("global_error_class_v2")
raise Exception("This is a test exception")

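In this file, local_test_datachain is the only dataset created through the global context or a context manager that should persist, since it is the only one whose active context does not see an exception. The dataset passed_as_argument, created with an explicitly passed session, is also expected to persist, as noted in its comment.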
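The per-context cleanup described above can be illustrated with a small, self-contained sketch. This is not datachain's implementation: `ToySession`, its `save` method, and the plain dict used as a dataset store are all hypothetical stand-ins, and the global context is not modeled. It only shows the general idea of a context manager that reverts its own datasets when an exception leaves its block.

```python
class ToySession:
    """Hypothetical stand-in for a session; NOT datachain's Session."""

    def __init__(self, name, store):
        self.name = name
        self.store = store   # plain dict acting as the dataset catalog (assumption)
        self.created = []    # names of datasets saved under this session

    def save(self, dataset_name, rows):
        # Record the dataset and remember that this session created it.
        self.store[dataset_name] = rows
        self.created.append(dataset_name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # An exception left this context: revert only this session's datasets.
            for dataset_name in self.created:
                self.store.pop(dataset_name, None)
        self.created.clear()
        return False  # never swallow the exception


store = {}

with ToySession("local", store) as ok:
    ok.save("local_test_datachain", ["a", "b", "c"])

try:
    with ToySession("local_failure", store) as bad:
        bad.save("local_test_datachain_v2", ["a", "b", "c"])
        raise ValueError("Local failure class")
except ValueError:
    pass

print(sorted(store))  # ['local_test_datachain']
```

Because each toy session tracks only what it created, a failure in one block leaves datasets from other, already completed blocks untouched, which is the checkpoint-like behavior the description refers to.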

Closes #500

@amritghimire requested review from dreadatour, shcheklein and a team on October 9, 2024, 17:07
@amritghimire self-assigned this on Oct 9, 2024

cloudflare-workers-and-pages bot commented Oct 9, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 501ad01
Status: ✅  Deploy successful!
Preview URL: https://87fed5ff.datachain-documentation.pages.dev
Branch Preview URL: https://amrit-persist-error.datachain-documentation.pages.dev


codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.

Project coverage is 87.11%. Comparing base (e0c654e) to head (501ad01).
Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
src/datachain/query/session.py | 95.45% | 0 Missing and 1 partial ⚠️

Additional details and impacted files:
@@           Coverage Diff           @@
##             main     #504   +/-   ##
=======================================
  Coverage   87.10%   87.11%           
=======================================
  Files          92       92           
  Lines        9852     9832   -20     
  Branches     1350     1349    -1     
=======================================
- Hits         8582     8565   -17     
  Misses        917      917           
+ Partials      353      350    -3     
Flag | Coverage Δ
datachain | 87.08% <95.83%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@@ -1218,6 +1218,10 @@ def register_dataset(
preview=dataset_version.preview,
job_id=dataset_version.job_id,
)

session = Session.get_current_context()
Member

A session can also be passed to DataChain explicitly (see from_storage, for example). It is probably expected that we pass it directly into all the DataChain factories even when we use the context manager. I'm not sure get_current_context is the right approach; at the least, it might change that assumption. Can we check the docs and existing tests, please?

Contributor Author

Updated the logic along with example in the file where the variable session is used. PTAL.
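
The question raised in this thread, whether an explicitly passed session should take precedence over the one from the active context, can be sketched as follows. This is only an illustration of the lookup order under discussion, not the logic that was merged: `session_scope`, `resolve_session`, and the `"global"` fallback name are hypothetical, and sessions are represented by plain strings.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Iterator, Optional

# Name of the session set by the innermost active context manager, if any.
_active_session: ContextVar[Optional[str]] = ContextVar("active_session", default=None)


@contextmanager
def session_scope(name: str) -> Iterator[str]:
    """Hypothetical context manager that marks `name` as the active session."""
    token = _active_session.set(name)
    try:
        yield name
    finally:
        # Restore whatever session was active before entering this scope.
        _active_session.reset(token)


def resolve_session(explicit: Optional[str] = None) -> str:
    """Prefer an explicitly passed session, then the active context, then a global default."""
    if explicit is not None:
        return explicit
    return _active_session.get() or "global"


with session_scope("local"):
    inside_default = resolve_session()               # falls back to the active context
    inside_explicit = resolve_session("asVariable")  # explicit argument wins

outside_default = resolve_session()                  # no active context: global default

print(inside_default, inside_explicit, outside_default)  # local asVariable global
```

With this ordering, code that already passes a session to every factory keeps its behavior, while code that relies only on the context manager still gets the active session.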

@dreadatour (Contributor) left a comment

Looks good to me 👍

@shcheklein (Member) left a comment

Approving to move forward, but this design Session -> Catalog -> Sessions looks a bit weird to me.

@amritghimire (Contributor Author) replied, quoting the review above:

"Approving to move forward, but this design Session -> Catalog -> Sessions looks a bit weird to me."

Thanks. Modified the chain logic.

@amritghimire merged commit af2ebf8 into main on Oct 15, 2024
38 checks passed
@amritghimire deleted the amrit/persist-error branch on October 15, 2024, 04:09