Archive old posts to reduce disk usage #5016

Open
5 tasks done
Nutomic opened this issue Sep 12, 2024 · 14 comments
Labels
area: database enhancement New feature or request

Comments

@Nutomic
Member

Nutomic commented Sep 12, 2024

Requirements

  • Is this a feature request? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a feature request? Do not put multiple feature requests in one issue.
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.
  • Do you agree to follow the rules in our Code of Conduct?

Is your proposal related to a problem?

Votes make up the largest part of Lemmy's database size. In case of lemmy.ml, the database is 40.4 GB, with 15 GB of those being votes. Of those votes, 63% are more than 6 months old, and this proportion will only go up with time.

Describe the solution you'd like.

Really there is no reason to keep all these old votes around, because old posts are ignored by ranking algorithms. So we could save about 9.5 GB, or 24% of disk space, on lemmy.ml by deleting votes for posts older than 6 months. Votes displayed to users will still be correct, as they are stored separately in the post_aggregates table. We only need to ensure that ranking algorithms never recalculate scores for archived posts. Additionally, it makes sense to lock commenting and other actions on posts after the same interval.
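A minimal sketch of the idea, using an in-memory SQLite database rather than Lemmy's actual PostgreSQL schema. The table names post_like and post_aggregates come from this thread; every column name, the 182-day cutoff, and the helper `archive_old_votes` are assumptions made for illustration only.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def archive_old_votes(conn, cutoff_iso):
    """Delete per-user votes on posts published before the cutoff.

    post_aggregates is deliberately left untouched, so the displayed
    counts stay as they were. Returns the number of vote rows removed.
    """
    cur = conn.execute(
        "DELETE FROM post_like WHERE post_id IN "
        "(SELECT id FROM post WHERE published < ?)",
        (cutoff_iso,),
    )
    conn.commit()
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical, heavily simplified schema.
    conn.executescript("""
        CREATE TABLE post (id INTEGER PRIMARY KEY, published TEXT NOT NULL);
        CREATE TABLE post_like (post_id INTEGER, person_id INTEGER, score INTEGER);
        CREATE TABLE post_aggregates (
            post_id INTEGER PRIMARY KEY, upvotes INTEGER, downvotes INTEGER
        );
    """)
    now = datetime(2024, 9, 12, tzinfo=timezone.utc)
    old = (now - timedelta(days=365)).isoformat()
    recent = (now - timedelta(days=7)).isoformat()
    conn.executemany("INSERT INTO post VALUES (?, ?)", [(1, old), (2, recent)])
    conn.executemany(
        "INSERT INTO post_like VALUES (?, ?, ?)",
        [(1, 10, 1), (1, 11, 1), (1, 12, -1), (2, 10, 1)],
    )
    conn.executemany(
        "INSERT INTO post_aggregates VALUES (?, ?, ?)", [(1, 2, 1), (2, 1, 0)]
    )
    # Roughly six months; all timestamps share the same UTC offset, so the
    # ISO strings compare correctly as text.
    cutoff = (now - timedelta(days=182)).isoformat()
    deleted = archive_old_votes(conn, cutoff)
    remaining = conn.execute("SELECT COUNT(*) FROM post_like").fetchone()[0]
    frozen = conn.execute(
        "SELECT upvotes, downvotes FROM post_aggregates WHERE post_id = 1"
    ).fetchone()
    return deleted, remaining, frozen
```

In this sketch the old post loses its three vote rows while its aggregate row still reports 2 upvotes and 1 downvote. Anything that rebuilds aggregates from post_like would have to skip such posts.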

Describe alternatives you've considered.

Keep the current behaviour, but it will lead to very large database sizes in a few years.

Additional context

No response

@poVoq

poVoq commented Sep 12, 2024

I am actually more concerned about the amount of writes votes do to SSD storage. It really chews through NVMe drives, and consumer-grade SSDs with a low TDW are gone in about a year of Lemmy usage.

@dessalines
Member

If we do something like this, I'd like to add an archived or archive_votes boolean column to the comment and post tables, so that any updates we make to the aggregate tables can easily ignore those ones, and make sure not to update them.

We could update that archived column as part of a periodic / startup job.
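One way the archived flag and the periodic job could fit together, sketched against an in-memory SQLite database. The `archived` column follows the suggestion above; the `score` and `published` columns, the cutoff date, and the helpers `mark_archived` and `apply_vote` are hypothetical and not taken from Lemmy's code.

```python
import sqlite3


def mark_archived(conn, cutoff_iso):
    """Periodic/startup job: flag posts older than the cutoff.

    Returns how many posts were newly flagged.
    """
    cur = conn.execute(
        "UPDATE post SET archived = 1 WHERE archived = 0 AND published < ?",
        (cutoff_iso,),
    )
    conn.commit()
    return cur.rowcount


def apply_vote(conn, post_id, delta):
    """Adjust the aggregate score, skipping archived posts.

    Returns True if the aggregate row was updated.
    """
    cur = conn.execute(
        "UPDATE post_aggregates SET score = score + ? "
        "WHERE post_id = ? "
        "AND post_id IN (SELECT id FROM post WHERE archived = 0)",
        (delta, post_id),
    )
    conn.commit()
    return cur.rowcount == 1


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema and sample rows.
    conn.executescript("""
        CREATE TABLE post (
            id INTEGER PRIMARY KEY,
            published TEXT NOT NULL,
            archived INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE post_aggregates (
            post_id INTEGER PRIMARY KEY, score INTEGER NOT NULL
        );
        INSERT INTO post (id, published) VALUES (1, '2023-01-01'), (2, '2024-09-01');
        INSERT INTO post_aggregates VALUES (1, 40), (2, 5);
    """)
    newly_archived = mark_archived(conn, "2024-03-12")
    applied_old = apply_vote(conn, 1, +1)  # archived post: ignored
    applied_new = apply_vote(conn, 2, +1)  # recent post: applied
    scores = conn.execute(
        "SELECT post_id, score FROM post_aggregates ORDER BY post_id"
    ).fetchall()
    return newly_archived, applied_old, applied_new, scores
```

Putting the check inside the UPDATE itself means every code path that touches the aggregate table gets the same guard.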

@MrKaplan-lw

Speaking from Lemmy.World perspective, we do not want to drop old votes.
If this is implemented, it should be configurable. Perhaps by having an option to specify the minimum age, which could be set to never.
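A small sketch of what such an option might look like, assuming a hypothetical setting named `min_age_days` where an unset value means "never archive". Neither the name nor the type exists in Lemmy's configuration; they are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class VoteArchiveConfig:
    # Hypothetical option: minimum post age in days before votes are
    # archived. None means "never", which is also the default here.
    min_age_days: Optional[int] = None


def archive_cutoff(cfg: VoteArchiveConfig, now: datetime) -> Optional[datetime]:
    """Return the timestamp before which posts qualify, or None if disabled."""
    if cfg.min_age_days is None:
        return None
    return now - timedelta(days=cfg.min_age_days)


now = datetime(2024, 9, 12, tzinfo=timezone.utc)
disabled = archive_cutoff(VoteArchiveConfig(), now)
six_months = archive_cutoff(VoteArchiveConfig(min_age_days=180), now)
```

With the default config the job has no cutoff and would do nothing, so an instance that never touches the setting keeps all its votes.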

@Die4Ever

Die4Ever commented Sep 18, 2024

Wouldn't this affect people who sort by New Comments or Active? Locking voting would be a shame, but acceptable given the disk space cost of storing every individual vote.

Disabling comments would be bad though I think. It doesn't really save disk space, does it? I always thought being able to comment on and continue with old posts was a big strength of Lemmy over Reddit. Because Lemmy has the New Comments and Active sort methods it means you can actually have long term discussions like old forums could do. And it's really helpful with stickied/pinned posts too.

@dessalines
Member

dessalines commented Sep 18, 2024

It's possible to clear out old votes in a way that doesn't lock the old content. We'd just need to make sure that it never recalculates the scores from scratch for those items.

@asudox

asudox commented Oct 9, 2024

With storage prices being very cheap, I don't see a reason to archive votes. When Lemmy starts getting indexed more, I believe Lemmy will rise. If you archive votes, you will essentially confuse people that don't use Lemmy as to why a post has x comments but no votes. Furthermore, just like in YouTube, upvotes and downvotes are an indicator as to how a post is perceived by the community. If a post has bad advice/content/etc., downvotes will (hopefully) dominate and let other users know about it.

If at all, an optional auto deletion of old posts with little to no activity can be used instead (e.g. posts with 0-2 comments and/or 0-5 votes that are older than 2+ years)

@Nothing4You
Contributor

Nothing4You commented Oct 9, 2024

While storage prices are cheap, votes are still one of the parts using the most storage in the db, as they are two of the tables with the most rows.
The proposal in this issue was to stop counting individual votes for old content, lock the vote count in place, and remove the individual votes from the DB. This way you only need to store two numbers per post/comment, holding the last values before archival, and can remove potentially hundreds of rows from the table.

@DraconicNEO

I'm against this. It severely limits people's ability to engage with older content: it confuses people and prevents them from adding to the discussion later when new information comes up. I always hated the idea of "archiving" posts so people could no longer interact. I've lost count of how many times I've gotten useful information from replying to very old posts, or given people useful information because they replied to one of my older posts. If we can find a way to reduce the vote data without denying the ability for future votes to be added, that would be good, but I'm against locking old posts and calling them archived, the Reddit way.

@DraconicNEO

DraconicNEO commented Oct 10, 2024

> Speaking from Lemmy.World perspective, we do not want to drop old votes. If this is implemented, it should be configurable. Perhaps by having an option to specify the minimum age, which could be set to never.

By default it should be off. The majority of instance admins aren't going to touch the config, and that would cause headaches for other people, and possibly for them when this starts happening. If you're going to have it at all, make it off by default. I'm still opposed to the Reddit way of locking all further engagement on a post, though; that feels wrong because, as I said earlier, people do benefit from engagement with older posts.

@dessalines
Member

dessalines commented Oct 15, 2024

It still kind of boggles me that 25% of our DB is just votes. People sometimes post pages of markdown.

Other possibilities for saving space, that wouldn't be archiving:

  • The score column is smallint, to allow easy summing (-1, 1), which Postgres uses 2 bytes for. But we could also switch to boolean, which would take that down to 1 byte.
  • Remove the published column on comment_like and post_like. It's timestamp with time zone, which is 8 bytes. I don't really like this, but it's true that we rarely use the accounting columns.
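A rough back-of-the-envelope estimate for the two options above, using the column sizes quoted in the list (2 bytes for smallint, 1 for boolean, 8 for timestamp with time zone). The row count is a made-up example, and the function ignores per-row headers, alignment padding, and indexes, so real savings may well be smaller.

```python
SMALLINT_BYTES = 2
BOOLEAN_BYTES = 1
TIMESTAMPTZ_BYTES = 8


def estimated_savings_bytes(rows, score_to_boolean=True, drop_published=True):
    """Naive upper-bound estimate of bytes saved across `rows` vote rows.

    Only raw column widths are counted; padding and tuple overhead are not.
    """
    per_row = 0
    if score_to_boolean:
        per_row += SMALLINT_BYTES - BOOLEAN_BYTES
    if drop_published:
        per_row += TIMESTAMPTZ_BYTES
    return rows * per_row


# Hypothetical table with 100 million vote rows.
both = estimated_savings_bytes(100_000_000)
score_only = estimated_savings_bytes(100_000_000, drop_published=False)
```

For the hypothetical 100 million rows this gives 900 MB for both changes and 100 MB for the boolean switch alone, which suggests the published column is where most of the per-row saving would come from.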

@Nothing4You
Contributor

published is needed to deal with activities received out of order, at least for some time.

it might make a difference if it's nullable and could be purged for older content, but there's also an argument to keep the dates, as people looking at liked content will usually prefer to have that sorted by when they liked it, not by when it was published, similar to #4446. while this isn't exposed in lemmy 0.19.5, i think #5034 makes this available in the next release?

@Die4Ever

> published is needed to deal with activities received out of order, at least for some time.
>
> it might make a difference if it's nullable and could be purged for older content, but there's also an argument to keep the dates, as people looking at liked content will usually prefer to have that sorted by when they liked it, not by when it was published, similar to #4446. while this isn't exposed in lemmy 0.19.5, i think #5034 makes this available in the next release?

Local votes could keep their published timestamp but remote votes don't need it? Although it's kind of wasteful to have another table or something, but it would probably help small instances a bit

@Nothing4You
Contributor

remote votes still do to properly deal with activities received out of order.

when someone downvotes and then upvotes something and this gets transmitted out of order, letting the receiving instance see the upvote first and then the downvote, there wouldn't be a way for the receiving instance to know to ignore the stale downvote otherwise.
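a minimal sketch of that ordering check, assuming the receiver keeps one stored vote per (voter, post) pair together with its published timestamp. the `StoredVote` type, the in-memory dict, and `receive_vote` are illustrative only and are not how lemmy implements federation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StoredVote:
    score: int
    published: datetime


def receive_vote(store, key, score, published):
    """Keep whichever vote carries the newest timestamp.

    Returns True if the incoming vote replaced (or created) the stored
    one, False if it was older than or equal to what is already stored.
    """
    current = store.get(key)
    if current is not None and current.published >= published:
        return False
    store[key] = StoredVote(score, published)
    return True


# Hypothetical voter and post id.
key = ("alice@example.org", 42)
t_down = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
t_up = datetime(2024, 10, 15, 12, 5, tzinfo=timezone.utc)

store = {}
first = receive_vote(store, key, +1, t_up)      # upvote arrives first
second = receive_vote(store, key, -1, t_down)   # earlier downvote arrives late
final_score = store[key].score
```

without the stored timestamp the second call would have nothing to compare against, and the late activity would overwrite the newer vote.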

@Nutomic
Member Author

Nutomic commented Oct 16, 2024

It should be possible to send vote timestamps via federation, but not store them in the db.

8 participants