Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement autosync_interval trait #25

Merged
merged 2 commits into from
Oct 25, 2022
Merged

Conversation

dlqqq
Copy link
Collaborator

@dlqqq dlqqq commented Oct 24, 2022

See discussion in #20.

@kevin-bates
Copy link
Member

Hi @dlqqq - thanks for the ping.

We should avoid exposing implementation details, like a boolean autosync parameter, on get_path(). Instead, could we defer synchronization until we find that the path that corresponds to id does not exist? I think this will trigger far fewer synchronizations (and zero in some cases like cloud/controlled deployments).

get_path(id):

    query for id, if not found return None

    id found, 
    stat(path), if found return path

    path not found,
    if initial attempt - sync-all and repeat

    else (second attempt), drop row (or whatever should be done when files are deleted)
    return None

Regarding auto-sync functionality, I figured that would be done in a background process, but, regardless, you'd still run the risk of an out-of-band change between intervals. By deferring synchronization to only when necessary, yes, you penalize the poor client that found an anomaly, but that synchronization likely resolves the next client's anomaly (kinda like crowdsourcing 😄).

@dlqqq
Copy link
Collaborator Author

dlqqq commented Oct 25, 2022

@kevin-bates

We should avoid exposing implementation details, like a boolean autosync parameter, on get_path().

Yeah, I had initially added this because I figured we should give users some way of disabling autosync, but honestly I don't see much use-case for it given that autosync is already being smart about not calling sync_all() too many times. I'll remove it.

Instead, could we defer synchronization until we find that the path that corresponds to id does not exist? I think this will trigger far fewer synchronizations (and zero in some cases like cloud/controlled deployments).

Yeah, this would be ideal. This would be a huge performance gain for calling get_path() on a file that wasn't moved. I had planned on implementing this as soon as I got back into file ID, but unfortunately more pressing matters emerged. Let me track this in a separate issue: #27

Regarding auto-sync functionality, I figured that would be done in a background process, but, regardless, you'd still run the risk of an out-of-band change between intervals. By deferring synchronization to only when necessary, yes, you penalize the poor client that found an anomaly, but that synchronization likely resolves the next client's anomaly (kinda like crowdsourcing 😄).

Yup, running sync_all() in a separate process would be ideal because that would mean that get_path() will never block the current thread for too long by calling sync_all(). Right now, the behavior is that if your last invocation of get_path() was longer than 5 seconds ago, you're just going to have to endure the 0-400ms delay.

However, because sync_all() involves writing to the database, this won't work right now because SQLite does not work with concurrent writes. We'll likely have all database writes get routed to a separate process polling requests from ZMQ. We will need to think about this issue more deeply in the future, but for now the current implementation will not need to handle multiple processes. I think we can be track further discussion about this here: #13

As usual, thank you for asking the hard questions.

@dlqqq dlqqq added the enhancement New feature or request label Oct 25, 2022
@dlqqq dlqqq merged commit 867be79 into jupyter-server:main Oct 25, 2022
@dlqqq dlqqq deleted the autosync branch October 25, 2022 00:25
@kevin-bates
Copy link
Member

It's awesome to see #27 - yes, an optimistic approach is a simple way to express it (thanks).

I suspect the autosync_interval trait could go away if you find optimistic get_path() sufficient and I wonder if that approach should be checked out first.

@kevin-bates
Copy link
Member

oh - never mind I guess. I see this was just merged.

@dlqqq
Copy link
Collaborator Author

dlqqq commented Oct 25, 2022

I don't think any of the performance enhancements we had discussed are mutually exclusive with one another. They all have different use cases:

  • optimistic get_path(): applies to all use cases
  • autosync interval: used when having a slight delay is OK (e.g. path is only used for rendering the name of the file in the file browser, nothing important)
  • execution in a separate process: applies to cases when autosync is used and when multiple processes need to call file ID manager

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants