As of yesterday, Pixiv began removing metadata from images. #807

Closed
charzho opened this issue Oct 6, 2020 · 9 comments

@charzho

charzho commented Oct 6, 2020

This isn't so much of an issue that anyone can fix (I think), but anyone using the database functionality of PixivUtil2 to check for duplicates and image edits should be aware.
It is very likely that new files and old files will look the same and you may re-download the same image, but with metadata removed.

I don't know if anything can be done in PixivUtil2 without slowing down the checking process drastically. Determining the hash of an image with metadata removed would basically require doing the exact same removal as Pixiv.
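One way to compare old and new copies without reproducing Pixiv's exact process is to hash only the parts of a JPEG that are not metadata. Below is a minimal Python sketch, not part of PixivUtil2. The names `jpeg_content_hash` and `METADATA_MARKERS` are hypothetical, and the choice of segments to skip (APP1, APP13, COM) is an assumption, since nothing in this thread establishes which segments Pixiv removed. The hashes will only match if the removal left every other byte untouched.

```python
import hashlib
import struct

# Segment markers treated as metadata. This set is an assumption for
# illustration, not Pixiv's actual list:
# 0xE1 = APP1 (EXIF/XMP), 0xED = APP13 (Photoshop/IPTC), 0xFE = COM (comment).
METADATA_MARKERS = frozenset({0xE1, 0xED, 0xFE})


def jpeg_content_hash(data: bytes, skip=METADATA_MARKERS) -> str:
    """Return a SHA-256 hex digest of a JPEG with the `skip` segments left out."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    digest = hashlib.sha256()
    digest.update(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:
            # Fill byte before a marker: step over it.
            pos += 1
            continue
        if marker == 0xDA:
            # Start of scan: hash the rest of the file (compressed image data).
            digest.update(data[pos:])
            return digest.hexdigest()
        # Every other segment expected here carries a 2-byte big-endian
        # length that includes the length field itself.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if length < 2 or end > len(data):
            raise ValueError("truncated or malformed segment")
        if marker not in skip:
            digest.update(data[pos:end])
        pos = end
    raise ValueError("no start-of-scan marker found")
```

It can be called as `jpeg_content_hash(open(path, "rb").read())`. If two copies produce the same digest, they differ only in the skipped segments. A mismatch does not prove the pixels differ, only that more than those segments changed.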

I don't know if the original, unedited images can be retrieved either.

@Nandaka
Owner

Nandaka commented Oct 7, 2020

> metadata removed

This will affect the file size, right? If you set alwaysCheckFileSize = True, the file will be treated as a different file.

> original, unedited images can be retrieved either.

Unlikely.

@charzho
Author

charzho commented Oct 7, 2020

Yeah, that's the problem. The file size and file hash will be changed, so visually identical files will be downloaded with this setting, possibly every single file!

The fault lies mostly with Pixiv: they had not modified files for years, then suddenly decided to do this.
However, people should be aware if they have this setting enabled, because it could duplicate almost every file. From what I've seen, files are being modified retroactively as well.

@xion2

xion2 commented Oct 8, 2020

IMPORTANT: Do not delete the old copies of your files. I'm currently doing additional testing to determine whether Pixiv used a truly lossless method of metadata removal by trying to recreate the process they used.

Thanks for the heads up, it's appreciated. After looking into the situation a bit more here is what I discovered:

Why this probably happened:
The metadata (EXIF) was most likely removed over privacy concerns, given the sensitive data EXIF can hold. As a result, this decision is unlikely to be reversed, and the "original" files probably no longer exist. Even if they could reverse it, I doubt they will. No clue if something bad happened that prompted this sudden change after 13 years.

What images does this affect?
JPGs that held EXIF data. These files can no longer be verified against your old copies and will need to be re-downloaded if you want to use the functionality that checks for updated images, since some artists do update old images.

PNGs are not affected, since they generally don't carry EXIF data. Furthermore, not all JPGs are affected, since some artists took extra steps to remove EXIF data before uploading to Pixiv. These images were still identical when I ran checksum verification on them. I need to test ugoira files more.

Why are some files now larger or smaller?
EXIF removal is a strange beast. You could end up with a smaller file, one around the same size, or a larger file. I found examples of both smaller and larger files, which suggests the method they used to remove EXIF is most likely lossless in nature.

Has image quality been affected?
No. I can't guarantee this but in my tests inspecting images at 1000% zoom and also using image comparison software there were no signs of any re-compression or image quality degradation. This holds true for both the smaller and larger files.

What should you do?
If you don't want to lose your old files with metadata, make sure "backupoldfile" is enabled. Otherwise they will be overwritten if you have "alwayscheckfilesize" enabled.
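To check an existing config.ini for this combination, a small script along the following lines could help. It is only a sketch: `find_option` and `overwrite_risk` are hypothetical helper names, and because this thread does not say which section either option lives in, the code searches every section. Option names are matched case-insensitively, which is `configparser`'s default behaviour.

```python
import configparser


def find_option(parser, name):
    """Return (section, bool value) for the first section defining `name`, else None."""
    for section in parser.sections():
        if parser.has_option(section, name):
            return section, parser.getboolean(section, name)
    return None


def overwrite_risk(config_text: str) -> bool:
    """True if file-size checking is on while old-file backup is off or missing."""
    # interpolation=None so that '%' characters elsewhere in the file are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    check = find_option(parser, "alwaysCheckFileSize")
    backup = find_option(parser, "backupOldFile")
    checking = bool(check and check[1])
    backing_up = bool(backup and backup[1])
    return checking and not backing_up
```

For example, `overwrite_risk(open("config.ini", encoding="utf-8").read())` returning `True` means old copies could be replaced without a backup under the behaviour described above.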

Nandaka pinned this issue Oct 8, 2020
Nandaka added the PSA label Oct 8, 2020
@pxssy

pxssy commented Oct 10, 2020

On which exact date did this change happen? I was thinking that if we only updated our stashes from that date forward and nothing before, it shouldn't matter too much, yes? As long as you take care not to mix the new and old files.

Additionally, the writeimageinfo txt file has a date line which, at a glance, matches the modified date of the image. Can anyone confirm whether old but EXIF-cleaned images have a new modified date? If so, it would be possible to match dates first and, only if they differ, run a checksum or something.

Also, since this came up: if I had alwaysCheckFileSize = false but kept backupoldfile = true, would it only download the file once, the first time, and skip it afterwards without checksumming or overwriting?

I'm just wondering what the danboorus will do; they basically rely on md5 to delete duplicates.

Also curious how Pixiv intends to do this EXIF erasing retroactively. There are ~44 million posts on the site, and who knows how many actual images.

> Has image quality been affected?
> No. I can't guarantee this but in my tests inspecting images at 1000% zoom and also using image comparison software there were no signs of any re-compression or image quality degradation. This holds true for both the smaller and larger files.

Not gonna lie, I thought I was crazy for a while during June-August, but images from that period seemed much more JPEG-compressed than I'd expect from certain artists. I couldn't prove anything, since the result was the same however I saved them, manually or via PixivUtil2. Was that their trial run?

@Nandaka
Owner

Nandaka commented Oct 11, 2020

Changing the EXIF/metadata should change the checksum/md5. I think Danbooru keeps the old images, since artists sometimes update an old post with a revision.

@xion2

xion2 commented Oct 13, 2020

> Not gonna lie, I thought I was crazy for a while during June-August, but images from that period seemed much more JPEG-compressed than I'd expect from certain artists. I couldn't prove anything, since the result was the same however I saved them, manually or via PixivUtil2. Was that their trial run?

Over the years, many artists have started out uploading images at much higher quality and then, for whatever personal reasons, started heavily compressing images, lowering resolution, or changing the file format they upload. So that's most likely a coincidence, but you could always ask the artist.

@signiramla

I found that some image links have recently started throwing the error 'Error 500: failed to thumbnailing' and return no image data. (These are all corrupted images, and they were downloadable before.)
This error started to appear a few days ago and recurs fairly often. Is this related to the changes in EXIF metadata?

Example: https://www.pixiv.net/en/artworks/47412864
Default image (corrupted, but downloads successfully): https://i.pximg.net/img-master/img/2014/12/05/20/26/57/47412864_p1_master1200.jpg
Original image (500 error): https://i.pximg.net/img-original/img/2014/12/05/20/26/57/47412864_p1.jpg

Is there any other way to get the original images?

@Phenrei

Phenrei commented Oct 14, 2020

Note that if you end up with images that are duplicates in content but differ in file size because of the metadata, there is a deduplication tool for Windows called AllDup that can specifically search by file content, excluding any JPEG metadata.

Also, for a more accurate comparison of images that may have been reprocessed, I recommend IrfanView, which has a feature under Image Properties that lists the exact number of unique colors in the image. If the content of a JPEG was changed, it will most likely show a different number.

@Nandaka
Owner

Nandaka commented Oct 19, 2020

Looks like they keep the work's date (from the "createDate" node).

If you set setlastmodified = True in config.ini, maybe I can add an additional check for whether the local file's last-modified date equals createDate?
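A rough illustration of what such a check could look like is below. This is a sketch under stated assumptions, not the code that was later committed: `mtime_matches_create_date` is a hypothetical name, "createDate" is assumed to be an ISO 8601 string with a UTC offset (for example "2014-12-05T20:26:57+09:00"), and a one-second tolerance is assumed to absorb filesystem timestamp rounding.

```python
import os
from datetime import datetime, timezone


def mtime_matches_create_date(path, create_date_iso, tolerance_seconds=1.0):
    """Return True if the file's last-modified time equals the work's createDate."""
    created = datetime.fromisoformat(create_date_iso)
    if created.tzinfo is None:
        # Assumption: a value without an offset is treated as UTC.
        created = created.replace(tzinfo=timezone.utc)
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return abs((modified - created).total_seconds()) <= tolerance_seconds
```

If the two times match, the local copy was presumably saved from the same revision of the work and could be skipped. A mismatch would suggest the work was updated after the local copy was downloaded. Note that `datetime.fromisoformat` only accepts a trailing "Z" from Python 3.11 onward.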

Nandaka added a commit that referenced this issue Oct 19, 2020
add new config `checkLastModified` in `[DownloadControl]` section to compare last-modified time with works date, require setlastmodified = True in config.ini to work properly.
35122 pushed a commit to 35122/PixivUtil2 that referenced this issue Oct 30, 2020
add new config `checkLastModified` in `[DownloadControl]` section to compare last-modified time with works date, require setlastmodified = True in config.ini to work properly.
Nandaka closed this as completed Feb 21, 2021
Nandaka unpinned this issue Jul 7, 2021