Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: delete lots of data from indexers when leaving project #469

Merged
merged 14 commits into from
May 20, 2024

Conversation

EvanHahn
Copy link
Contributor

@EvanHahn EvanHahn commented Feb 6, 2024

When you leave a project, we now delete all indexed data outside of the auth namespace. Fixes #427.

Co-Authored-By: Andrew Chou andrewchou@fastmail.com

src/capabilities.js Outdated Show resolved Hide resolved
src/index-writer/index.js Show resolved Hide resolved
src/utils.js Outdated Show resolved Hide resolved
@EvanHahn EvanHahn marked this pull request as ready for review March 27, 2024 22:09
Copy link
Member

@gmaclennan gmaclennan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading through this and related code I have some thoughts:

  • We need to close cores before we unlink/purge them - this should cancel replication requests and closes file descriptors
  • We should probably unreplicate cores that we are deleting before deleting them
  • We should write the updated role for leaving a project before deleting anything
  • We should probably wait for (pre)sync to complete before deleting a project - ensuring connected peers get all the data from the departing device, including the role record that they have left.
  • I'm not entirely clear about the state of a project instance after leaving - not everything is closed, but I'm not sure if the instance will work any more, or just have undefined behaviour?
  • I'm not sure about deleting schemas from the indexer separately from deleting the multi-core-indexer state - what happens if we don't delete a schema in a namespace but we do delete the multi-core-indexer?
  • I'm worried we're introducing complicated states where the indexes could be out-of-sync with the underlying cores. Would it be safer to just delete all indexes and index reader states and then re-index the cores we have left?
  • I think we need to maintain some kind of left project state, so that when a new project instance is created for a project that has been left, it doesn't try to re-sync everything and re-create the writer cores.

Sorry for inconclusive review - this PR raises a lot of questions / highlights some potential issues.

src/datastore/index.js Show resolved Hide resolved
src/mapeo-project.js Outdated Show resolved Hide resolved
src/mapeo-project.js Outdated Show resolved Hide resolved
src/mapeo-project.js Outdated Show resolved Hide resolved
@EvanHahn
Copy link
Contributor Author

EvanHahn commented Apr 5, 2024

We need to close cores before we unlink/purge them - this should cancel replication requests and closes file descriptors

I think we're doing that here:

https://github.com/digidem/mapeo-core-next/blob/cdebfafa772958729f5ade5e64c04010b160b710/src/mapeo-project.js#L624-L629

We should probably unreplicate cores that we are deleting before deleting them

I'm not super familiar with its internals but it looks like Hypercore might do this for us?

We should write the updated role for leaving a project before deleting anything

Done in ab4d9c1.

I'm not sure about deleting schemas from the indexer separately from deleting the multi-core-indexer state - what happens if we don't delete a schema in a namespace but we do delete the multi-core-indexer?

At worst, we might've left a project but not cleaned up one or both of the indexers due to a crash or other failure.

We could add something to check this at startup. Pseudocode:

onStartup(() => {
  for (const leftProject in getLeftProjects()) {
    leftProject.makeSureCoresAndIndexesAreDestroyed()
  }
})

I'm worried we're introducing complicated states where the indexes could be out-of-sync with the underlying cores. Would it be safer to just delete all indexes and index reader states and then re-index the cores we have left?

I defer to your expertise here but I think we're just clearing the indexes and closing/deleting the cores. I suppose cores could have data in them that hasn't made it to the indexes and never will, but I think that's fine?

This feels like one of a few questions we haven't answered about leaving projects. Others include:

  • We should probably wait for (pre)sync to complete before deleting a project - ensuring connected peers get all the data from the departing device, including the role record that they have left.
  • I'm not entirely clear about the state of a project instance after leaving - not everything is closed, but I'm not sure if the instance will work any more, or just have undefined behaviour?
  • I think we need to maintain some kind of left project state, so that when a new project instance is created for a project that has been left, it doesn't try to re-sync everything and re-create the writer cores.

I think these are separate tasks, but I agree.

@EvanHahn
Copy link
Contributor Author

I'm going to file the following issues if that seems reasonable to you:

  • At startup, clean up left projects that failed to delete

    If we start to leave a project and the app crashes, we may fail to remove data from indexers. At startup, we should find left projects and remove the "orphaned" data.

  • Test leaving a project while offline, then connecting and syncing

    We don't test what happens when someone leaves a project while offline, then syncs that to others. This test may uncover bugs.

  • Test project instance after leaving

    We don't test what happens when someone leaves a project, then tries to perform certain operations on it. This test may uncover bugs or missing features.

Do those three tasks seem reasonable?

@EvanHahn
Copy link
Contributor Author

Just filed #583,#584, and #585 based on the above.

Copy link
Member

@gmaclennan gmaclennan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems ok to me, although I admit I can't quite get my head around possible edge-cases, but what is here looks like it should work. It could be worth spending some time writing out some more detailed tests of how we expect devices that have left a project to behave when connecting to others. I think the other aims of project leaving are:

  • Remove others data to free up disk space - can we test for this?
  • Keep own data so that it can be recovered later - can we test for this?

@EvanHahn
Copy link
Contributor Author

EvanHahn commented May 9, 2024

  • Remove others data to free up disk space - can we test for this?
  • Keep own data so that it can be recovered later - can we test for this?

Good idea. I'll plan to do this in #585.

@EvanHahn EvanHahn self-assigned this May 13, 2024
@EvanHahn EvanHahn merged commit 22aef9f into main May 20, 2024
7 checks passed
@EvanHahn EvanHahn deleted the remove-indexes-when-leaving branch May 20, 2024 18:40
EvanHahn added a commit that referenced this pull request May 20, 2024
When you leave a project, we now delete all indexed data outside of the
`auth` namespace. Fixes [#427].

[#427]: #427

Co-Authored-By: Andrew Chou <andrewchou@fastmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delete indexed data when leaving a project
2 participants