Docs for pruning and some internal renaming #4505

Merged
merged 7 commits into from
Mar 29, 2023

@lutter (Collaborator) commented Mar 28, 2023

This PR provides a user-level explanation of how pruning works and renames the copy strategy for pruning to 'rebuild'. Since we already have a 'copy' operation in graph-node, calling the strategy 'rebuild' reduces the risk of confusion between the two very different operations.

@azf20 (Contributor) left a comment

Looks good to me, just some ideas in the comments!

@@ -0,0 +1,82 @@
## Pruning deployments

Pruning is an operation that deletes data from a deployment that is only
Contributor

I might start with the higher-level context, maybe something like:
"By default, subgraphs store a full version history for entities, allowing consumers to query the subgraph as of any historical block. Pruning is an operation that deletes entity versions from a deployment older than a certain block, so it is no longer possible to query that deployment as of prior blocks. In GraphQL..."

Collaborator Author

Nice, incorporated.

accumulated more history than that. Whenever the deployment does contain
more history than that, the deployment is automatically repruned. If
ongoing pruning is not desired, pass the `--once` flag to `graphman
prune`. Ongoing pruning can be turned off by setting `history_blocks` to a
Contributor

To check my understanding: the pointer about turning off pruning is saying that if you pruned once with, say, 10,000 blocks (setting `history_blocks` to 10,000) and later want to turn off pruning, you might call `graphman prune --history 1000000000` (1B blocks), which is effectively no pruning?

Collaborator Author

Yes, that's exactly what I meant here.
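
The ongoing-pruning behaviour discussed in this thread can be sketched in a few lines of Python. This is only an illustration of the rule as the diff text states it (the deployment is repruned whenever it holds more history than `history_blocks`). The function and parameter names are hypothetical and not taken from graph-node, and the real check may differ in detail.

```python
def needs_reprune(earliest_block: int, latest_block: int, history_blocks: int) -> bool:
    """Return True if the deployment holds more history than history_blocks.

    Hypothetical helper for illustration only; not graph-node's implementation.
    """
    retained = latest_block - earliest_block
    return retained > history_blocks


# Pruned once with 10,000 blocks of history; the chain has since advanced by 12,000 blocks.
print(needs_reprune(earliest_block=5_000_000, latest_block=5_012_000, history_blocks=10_000))  # True

# A very large history_blocks (1B) is never exceeded, so pruning is effectively off.
print(needs_reprune(earliest_block=5_000_000, latest_block=5_012_000, history_blocks=1_000_000_000))  # False
```

Under this reading, raising `history_blocks` far above the chain length does not disable the check itself; it just makes the condition impossible to satisfy in practice.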

existing tables into new tables and then replaces the existing tables with
these much smaller tables. Which strategy to use is determined for each
table individually, and governed by the settings for
`GRAPH_STORE_HISTORY_REBUILD_THRESHOLD` and
Contributor

Are these thresholds 0-1 (i.e. 0.5 is 50%)? Or 0-100?

Collaborator Author

Yes, it's between 0 and 1; added that to the text.

`GRAPH_STORE_HISTORY_DELETE_THRESHOLD`: if we estimate that we will remove
more than `REBUILD_THRESHOLD` of the table, the table will be rebuilt. If
we estimate that we will remove a fraction between `REBUILD_THRESHOLD` and
`DELETE_THRESHOLD` of the table, unneeded entity versions will be
Contributor

Are there checks that `REBUILD_THRESHOLD` is greater than `DELETE_THRESHOLD`? (Does it matter?)

Collaborator Author

Just checked the code: there are actually no checks. You could use that to make `REBUILD_THRESHOLD` lower than `DELETE_THRESHOLD`, which would disable rebuilding.
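
For readers following this thread, the per-table rule from the diff text can be sketched as below. This is a minimal Python illustration and not graph-node's code: the function name and return values are hypothetical, the thresholds are fractions between 0 and 1 as confirmed above, and the example values 0.5 and 0.05 are chosen only for illustration. Two assumptions are baked in: the quoted sentence is cut off in this view, so the sketch assumes the middle case deletes unneeded entity versions in place, and it assumes a table below both thresholds is left alone. It only models the usual case where the rebuild threshold is at least the delete threshold; how inverted thresholds behave depends on the real implementation.

```python
from typing import Optional


def choose_strategy(removal_fraction: float,
                    rebuild_threshold: float,
                    delete_threshold: float) -> Optional[str]:
    """Pick a pruning strategy for one table from the estimated removal fraction.

    Hypothetical sketch of the documented rule; all arguments are fractions in [0, 1].
    """
    if not 0.0 <= removal_fraction <= 1.0:
        raise ValueError("removal_fraction must be between 0 and 1")
    if removal_fraction > rebuild_threshold:
        # Most of the table would go away: copy the rows to keep into a new table.
        return "rebuild"
    if removal_fraction > delete_threshold:
        # A moderate share would go away: delete unneeded versions in place (assumed).
        return "delete"
    # Assumption: too little to remove, so the table is not pruned at all.
    return None


for fraction in (0.8, 0.2, 0.01):
    print(fraction, choose_strategy(fraction, rebuild_threshold=0.5, delete_threshold=0.05))
```

Because the decision is made per table, one prune run can rebuild some tables, delete from others, and skip the rest.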

Pruning is a user-visible operation and does affect some of the things that
can be done with a deployment:

* because it removes history, it restricts how far back time-travel queries
Contributor

Maybe worth linking to the time-travel docs page?

Collaborator Author

Just looked at the time-travel doc, and it's super low-level about how rows in the db are manipulated. It seems we're missing a more user-level explanation of it.

with pruning.

Pruning is started by running `graphman prune`. That command will perform
an initial prune of the deployment and set the subgraph's `history_blocks`
Contributor

Is that initial prune now async (i.e. it doesn't block indexing)?

Collaborator Author

Good point, added a paragraph for that. It blocks indexing with the rebuild strategy while it copies nonfinal entities. I also added another paragraph explaining what log output to look for.

@lutter merged commit 1280949 into master on Mar 29, 2023
@lutter deleted the lutter/prune branch on March 29, 2023 19:47