-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto-download files from the staging directory to output #500
base: main
Are you sure you want to change the base?
Conversation
c75a717
to
eb8e18f
Compare
Can we support the following use case:
2 separate notebook jobs could depend on the same python file: With the current approach the customer has to copy the |
a28c881
to
383a37e
Compare
cfc4886
to
7bbed76
Compare
bfb68dd
to
92b9088
Compare
Upon first view, the term "Package input folder" doesn't mean anything to me. I don't know why, as a user, I would want to do that, or whether I would need to. This change also doesn't modify the documentation at all. I would recommend:
|
e868446
to
ee0fdb8
Compare
Kicking CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Andrii! I left a few comments regarding the auto-download capability being introduced in this branch, but didn't complete a full code review.
My main concern with this PR is that 1) the new default auto-download capability being proposed here is a potential breaking change, and 2) the changes to the download logic are not required to implement the 'include parent dir' capability, which was the original intent of this PR.
Let's explore the possibility opening another PR with just the 'include parent dir' capability such that it can be reviewed, merged, and released separately.
session.commit() | ||
|
||
|
||
class DownloadManager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we merge this class with DownloadRecordManager
above? I don't see the benefit of splitting the logic here into two separate classes if they are only used together anyways.
# Forces new processes to not be forked on Linux. | ||
# This is necessary because `asyncio.get_event_loop()` is bugged in | ||
# forked processes in Python versions below 3.12. This method is | ||
# called by `jupyter_core` by `nbconvert` in the default executor. | ||
|
||
# See: https://github.com/python/cpython/issues/66285 | ||
# See also: https://github.com/jupyter/jupyter_core/pull/362 | ||
multiprocessing.set_start_method("spawn", force=True) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with this is that this line affects the multiprocessing
behavior globally for everything running on this main thread, i.e. the server and all server extensions running on a JupyterLab instance. We should really avoid doing this just to pass our GitHub workflows. Consider what happens if:
- Another extension is also calling
multiprocessing.set_start_method(..., force=True)
, or - Some part of the server / server extension breaks when the start method is changed during its lifetime by our extension's
initialize_settings()
method.
I don't have a solution for how this bug can be fixed. However, the error message is pretty specific about why an exception is being raised, so my intuition is that this bug can be fixed. I'm leaving some references here for us to review in the future.
- My write-up on the first encounter of this bug: Fix "event loop is already running" bug on Linux #450
- The exception being raised in the E2E test workflow: https://github.com/jupyter-server/jupyter-scheduler/actions/runs/8730209153/job/23953600827#step:9:161
async def process_download_queue(self): | ||
while not self.download_manager.queue.empty(): | ||
download = self.download_manager.queue.get() | ||
download_record = self.download_manager.record_manager.get(download.download_id) | ||
if not download_record: | ||
continue | ||
await self.job_files_manager.copy_from_staging(download.job_id, download.redownload) | ||
self.download_manager.delete_download(download.download_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid using multiprocessing.Queue
if we are already writing pending downloads to a DB. Can we read directly from self.download_manager.record_manager
instead?
I believe that this may fix the process bug previously raised on the E2E tests in this branch. This is the corresponding error message:
RuntimeError: A SemLock created in a fork context is being shared with a process in a spawn context. This is not supported. Please use the same context to create multiprocessing objects and Process.
If we remove the need for multiprocessing objects, we may be able to fix this bug without relying on multiprocessing.set_start_method()
. Can you give this a try?
@andrii-i can you confirm if this usecase can be supported? Can we support changing the input folder path on the UI? Rest of this lgtm |
@sravyasdh thank you for looking into this PR. In the initial version only packaging the input file's folder via checkbox control will be supported so use case in question will not be supported. Reason is that it's not clear if users would prefer more complex UI/UX that would allow packaging of the folder one level above or packaging of the arbitrary folder or ability to input folder path on the UI. So it makes sense to start with simpler implementation, let users try it and get feedback from them, and update the implementation based on feedback if needed. |
For package input folder, superseded by #510. |
…r and ExecutionManager
166ed05
to
9d41103
Compare
for more information, see https://pre-commit.ci
Adds auto-download functionality.
For packaging input folder, this PR was superseded by #510.