This repository has been archived by the owner on Oct 27, 2019. It is now read-only.

Don't store duplicate images #21

Open
eksopl opened this issue Mar 4, 2012 · 10 comments

Comments

@eksopl
Owner

eksopl commented Mar 4, 2012

A quick check for /a/ shows that 66% of all images are reposts. It seems that one can considerably lower the disk space needed for thumbnail and full image archival by storing each unique image only once.

Requires dumper changes and possibly a rather heavy migration script, but it's feasible.
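A minimal sketch of the idea, not asagi's actual implementation: name each stored file after a hash of its bytes, so a repost maps to a file that already exists. The helper names (`media_hash`, `store_once`, `repost_ratio`) are hypothetical, and the hash format (unpadded URL-safe base64 of the MD5 digest) is an assumption.

```python
import base64
import hashlib
import os
from typing import List, Tuple


def media_hash(data: bytes) -> str:
    """Unpadded URL-safe base64 of the MD5 digest (assumed hash format)."""
    digest = hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def store_once(root: str, data: bytes) -> Tuple[str, bool]:
    """Write data under a name derived from its hash.

    Returns (path, was_new); a repost returns the existing path
    and writes nothing.
    """
    path = os.path.join(root, media_hash(data))
    if os.path.exists(path):
        return path, False
    with open(path, "wb") as fh:
        fh.write(data)
    return path, True


def repost_ratio(images: List[bytes]) -> float:
    """Fraction of images whose content was already seen earlier in the list."""
    seen = set()
    reposts = 0
    for data in images:
        h = media_hash(data)
        if h in seen:
            reposts += 1
        else:
            seen.add(h)
    return reposts / len(images) if images else 0.0
```

With this layout, disk usage scales with the number of unique images rather than the number of posts, which is where the savings would come from.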

@GXTX

GXTX commented Mar 4, 2012

This will be nice. Sometimes I don't like the fact that I have to back up 13GB of thumbnails.

@eksopl eksopl mentioned this issue Mar 13, 2012
@anounyym1
Contributor

Bump! (Of course, that doesn't work here on GitHub.)

This would also fix the problem where, when a picture is fully archived later, older posts of the same image don't have a correct link. For example, http://archive.rebeccablacktech.com/g/image/HZZG8IG0r6r7M4STnJBs-w >>22525926 goes to a 404.

@eksopl
Owner Author

eksopl commented Apr 22, 2012

This feature is in progress in asagi for testing. The Foolz guys (@woxxy, @oohnoitz) have taken it upon themselves to write the migration scripts to support that image storage scheme, so I'm currently waiting on their results; no point in repeating work.

@nstepien

@woxxy, @oohnoitz

This is related so I'll post it here: upscaled thumbnails look like crap, you may want to save both small/reply and large/OP thumbnails.
See http://archive.foolz.us/a/search/image/Lv3xCerHLMp7MftjsqwHrA/

@eksopl
Owner Author

eksopl commented May 19, 2012

They are. See woxxy's explanation on /foolz/.

Sent from my Nokia phone

@nstepien

They are.

What are you replying to?

See woxxy's explanation on /foolz/.

Okay.

@eksopl
Owner Author

eksopl commented May 19, 2012

I mean they do store both thumbnails, but since the old images aren't in place yet, ffuuka falls back to the reply thumb if it was grabbed in the meantime.

Sent from my Nokia phone

@nstepien

I see I see. My high standards expected ENTREPRISE QUALITY migrations with 0 impact on the end-user. My bad.

You should stop with these

Sent from my Nokia phone

It's really awkward.

@eksopl
Owner Author

eksopl commented May 19, 2012

Yeah, everything about nokia is awkward at this point.

Anyway, I'll most likely come up with easier methods to speed things up for image migration, once things return to me.

@woxxy

woxxy commented May 19, 2012

Matching media_id and media_hash is slow. /a/ has been processing since Monday, while other boards took at least half a day. I hope /a/ is done by Monday so I can start copying its images too.

This and its variations. We ran it in batches so it doesn't roll back if it dies (and I believe it would also lock all the rows if we didn't):

UPDATE `jp` AS board, `jp_images` AS img
SET board.media_id = img.media_id
WHERE board.media_hash = img.media_hash
AND board.media_id = 0;

/a/ is gigantic, so we might be a special case; luckily we're "soon" done anyway. Other archive hosts might have less powerful servers, so this could be a problem even for smaller boards.
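For other hosts facing the same migration, here is a rough sketch of the batching approach described above. It is not the script that was actually run: it uses Python's built-in `sqlite3` in place of MySQL so it can be tried standalone, the function name `backfill_media_ids` and the `batch_size` parameter are made up for illustration, and the column layout (`media_hash` and `media_id` on both tables) is assumed from the query above.

```python
import sqlite3


def backfill_media_ids(conn, board="jp", batch_size=1000):
    """Fill in media_id on posts in small committed batches.

    Returns the total number of rows updated. Each batch is committed
    on its own, so a crash only loses the batch in flight.
    """
    # Table names cannot be bound as parameters, so `board` must be trusted.
    posts, images = f'"{board}"', f'"{board}_images"'
    total = 0
    while True:
        cur = conn.execute(
            f"""
            UPDATE {posts}
            SET media_id = (SELECT img.media_id FROM {images} AS img
                            WHERE img.media_hash = {posts}.media_hash)
            WHERE rowid IN (
                SELECT p.rowid FROM {posts} AS p
                JOIN {images} AS img ON img.media_hash = p.media_hash
                WHERE p.media_id = 0
                  AND img.media_id <> 0  -- guard: never re-select an updated row
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:  # nothing left that has a matching image row
            return total
        total += cur.rowcount
```

Posts whose hash has no row in the images table keep `media_id = 0` and are skipped, so the loop terminates. An index on `media_hash` in both tables would likely help the matching speed, though the thread doesn't say whether one was in place.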
