Kemono.party Patreon posts always contain duplicate images #1667

Closed
ghost opened this issue Jul 2, 2021 · 10 comments

Comments

ghost commented Jul 2, 2021

I have noticed an inconsistency in kemono.party. In short, I cannot seem to find a way to configure kemono.party to download non-duplicate pictures from Patreon posts even though my configuration works with other data sources like SubscribeStar. I believe this is due to the way kemono.party displays images from these two sites.

Example post (NSFW but no nudity): https://kemono.party/patreon/user/2909939/post/48126953
My config file:

"kemonoparty":
{
    <cookie data>,
    "filename": "{id}-{num}.{extension}"
},

Attempting to download the example post with this config gets me two files with the same image and file size, but different names: 48126953-1.png and 48126953-2.png. For a SubscribeStar post, I would only get 48126953-1.png, which is fine for my organization needs.

I tried looking at the keywords for filenames to find something that would help, but there does not seem to be anything there that could help.

I also tried configuring a postprocessor option to compare images once they've been downloaded, but that has two problems:

  • I don't know how to get the compare.shallow postprocessor option to work properly. I had thought that I could use that option to compare the filesizes as a pseudo-checksum, but either I configured it wrong or it didn't work.
  • I'd prefer not to download any duplicates in the first place. I know that the first picture in any Patreon post will be a duplicate and can be ignored. I just can't seem to configure gallery-dl to distinguish between kemono:patreon and kemono:subscribestar.
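
For the post-download cleanup route, a minimal sketch in plain Python (not a gallery-dl feature) could group files by size first and then confirm with a SHA-256 hash, since equal sizes alone do not prove equal content. The helper name `find_duplicates` is hypothetical, and it assumes all files of interest sit in one directory.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(directory):
    """Return lists of paths in `directory` whose contents are identical.

    Hypothetical helper for illustration; not part of gallery-dl.
    """
    # Cheap first pass: only files with the same size can be duplicates.
    by_size = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        # Second pass: confirm with a content hash.
        by_hash = defaultdict(list)
        for path in paths:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Each returned group lists files with identical content, so everything after the first entry in a group is a candidate for deletion.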

Skyofflad commented Jul 3, 2021

You can use "image-filter": "type != 'file'" to skip the duplicate header file and download only the attachments.
But beware - due to a site bug(?) some posts only have the header.
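
As a rough illustration of how a filter like this selects files, the sketch below evaluates the expression as a Python expression against each file's metadata. It mirrors the idea rather than gallery-dl's actual code; the `'attachment'` type value and the sample file names are assumptions.

```python
FILTER = "type != 'file'"
_code = compile(FILTER, "<image-filter>", "eval")


def passes(metadata):
    """Return True if the file described by `metadata` would be downloaded."""
    # Metadata fields act as variable names inside the expression.
    return bool(eval(_code, {}, metadata))


def selected(files):
    """Keep only the files that satisfy the filter."""
    return [f for f in files if passes(f)]


# Assumed sample data: a header file plus its duplicate attachment.
normal_post = [
    {"type": "file", "name": "splat 1.png"},
    {"type": "attachment", "name": "splat_1.png"},
]
# The caveat: a post that only has the header yields nothing at all.
header_only_post = [{"type": "file", "name": "cover.png"}]
```

With this data, `selected(normal_post)` keeps only the attachment, while `selected(header_only_post)` is empty, which is the case to watch out for.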

Contributor

Hrxn commented Jul 3, 2021

Yeah, I mean this definitely seems like an issue with the site.
The best would be to bring this up there, so it can get fixed.

Owner

mikf commented Jul 3, 2021

The main issue is that the main file and the first attachment of any Patreon post refer to the same file. Before v1.18.0 this was "solved" by effectively using {filename}.{extension}, without {num}, as the default filename format, so that those two identical files had the same filename and the second one got skipped. That didn't work for other services like Fanbox, where it would skip files even though they weren't identical, so the default filenames were changed: downloading duplicates is still better than outright missing files.
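
The trade-off can be illustrated with a small simulation. It is not gallery-dl's implementation; it only assumes that a file is skipped when its target filename has already been produced, and the sample names and the `{num}`-based format are made up for the example.

```python
def plan_downloads(files, name_format):
    """Simulate skip-by-filename: keep the first file for each target name."""
    seen = set()
    kept, skipped = [], []
    for num, meta in enumerate(files, 1):
        target = name_format.format(num=num, **meta)
        if target in seen:
            skipped.append(target)
        else:
            seen.add(target)
            kept.append(target)
    return kept, skipped


# Patreon-like post: main file and first attachment are the same image.
patreon = [
    {"filename": "splat_1", "extension": "png"},
    {"filename": "splat_1", "extension": "png"},
]
# Hypothetical post where two different images happen to share a name.
other = [
    {"filename": "image", "extension": "png"},
    {"filename": "image", "extension": "png"},
]

old_format = "{filename}.{extension}"        # no {num}: same names collide
new_format = "{num}_{filename}.{extension}"  # illustrative format with {num}
```

With `old_format`, the Patreon duplicate is skipped as intended, but the second file in `other` is skipped as well even though it is a different image. With `new_format`, nothing is lost, at the cost of keeping duplicates.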

It should be possible to distinguish between Patreon and everything else with conditional filenames to use the {filename} field there instead of {num}:

    "filename": {
        "service == 'patreon'": "{id}-{filename}.{extension}",
        ""                    : "{id}-{num}.{extension}"
    }
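
Conceptually, a conditional filename picks the first format whose condition matches the file's metadata, with the empty key as the fallback. The sketch below only mimics that idea and is not gallery-dl's code; `choose_format` is a hypothetical helper, and it assumes conditions are Python expressions evaluated against the metadata.

```python
def choose_format(formats, metadata):
    """Return the first format whose condition holds; '' marks the default."""
    default = None
    for condition, fmt in formats.items():
        if condition == "":
            default = fmt
        elif eval(condition, {}, metadata):
            return fmt
    return default


formats = {
    "service == 'patreon'": "{id}-{filename}.{extension}",
    "": "{id}-{num}.{extension}",
}

# Sample metadata; the field values are made up for the example.
patreon = {"service": "patreon", "id": "48126953", "num": 1,
           "filename": "splat_1", "extension": "png"}
other = dict(patreon, service="subscribestar")

patreon_name = choose_format(formats, patreon).format(**patreon)
other_name = choose_format(formats, other).format(**other)
```

Here the Patreon file gets a `{filename}`-based name, and every other service falls through to the `{num}`-based default.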

Or with image-filter like Skyofflad suggests, although I'd only apply it for patreon:
"image-filter": "service != 'patreon' or type != 'file'"

> Yeah, I mean this definitely seems like an issue with the site.
> The best would be to bring this up there, so it can get fixed.

The devs are already working on a solution (https://desuarchive.org/g/thread/82346276/#q82366219), although just having file hashes or some way to detect duplicates other than unreliable filenames in API responses would be very handy (@kemono-bugs)

Author

ghost commented Jul 3, 2021

> The main issue is that the main file and the first attachment of any Patreon post refers to the same file.

But they're not quite the same file, because one uses spaces and the other uses underlines. I tried downloading the example post with the filename config block you provided, and it has the same problem: two identical images, one named 48126953-splat 1 and the other 48126953-splat_1. The same was true of the image-filter block.

> The devs are already working on a solution (https://desuarchive.org/g/thread/82346276/#q82366219), although just having file hashes or some way to detect duplicates other than unreliable filenames in API responses would be very handy (@kemono-bugs)

Well that's good to hear, and it saves me the trouble of contacting them directly. Thanks for your help. I suppose I'll just wait for a fix and get used to cleaning out duplicates when I'm downloading from Patreon.

Owner

mikf commented Jul 3, 2021

> because one uses spaces and the other uses underlines

You could use {filename:R /_/} to replace those spaces with underlines to make them match. Or the path-restrict option.
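
In plain Python terms, the replacement makes both names identical, which is what allows the second file to be treated as already downloaded. The sample names come from the report above; `normalize` is a hypothetical helper that only approximates the format specifier.

```python
def normalize(filename):
    """Rough equivalent of {filename:R /_/}: spaces become underscores."""
    return filename.replace(" ", "_")


names = ["splat 1", "splat_1"]
# Both names collapse to a single target path after normalization.
targets = {"48126953-{}.png".format(normalize(n)) for n in names}
```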

Contributor

rautamiekka commented Jul 3, 2021

I think it's more reasonable to change the Patreon extractor itself to convert spaces to underscores.

^ Better yet if the extractor can tell those files apart before starting to download the file, so that the downloading doesn't have to be aborted before moving to the next one.

Author

ghost commented Jul 3, 2021

> You could use {filename:R /_/} to replace those spaces with underlines to make them match. Or the path-restrict option.

That works when it's the only formatting in the filename value, but I get an error when I try to use different behaviour depending on the service:
[kemonoparty][error] FilenameFormatError: Applying filename format string failed (TypeError: expected str, got dict).

I think the issue is that the filename format string is being processed as a dictionary object in Python. I'm not sure what caused the issue, since I'm just copying the formatting from earlier in this thread.

"filename": {
    "service == 'patreon'": "{id}-{filename:R /_/}.{extension}",
    ""                    : "{id}-{num}.{extension}"
}


Doofy420 commented Jul 4, 2021

I also came across posts (lots of them) where the header file and a completely different attachment shared the same filename, so I guess that's also an issue, especially for people who prefer the {id}+{filename} format.
Sample 1 (nsfw): https://kemono.party/patreon/user/10215607/post/47941313
Sample 2 (nsfw): https://kemono.party/patreon/user/10215607/post/49961587

Owner

mikf commented Jul 7, 2021

@jlazarskiparkin9815 filename being a dict is only supported since version 1.18.0 and raises a FilenameFormatError/TypeError in all prior versions.
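
To check this up front, a version string (for example the output of `gallery-dl --version`) can be compared against 1.18.0. The helper below is a hypothetical sketch; it ignores suffixes such as `-dev` and only looks at the first three numeric components.

```python
import re


def supports_conditional_filenames(version):
    """Return True if `version` (e.g. '1.17.5') is at least 1.18.0."""
    # Keep the leading numeric components and pad to three parts.
    numbers = [int(p) for p in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers) >= (1, 18, 0)
```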

Author

ghost commented Jul 8, 2021

> @jlazarskiparkin9815 filename being a dict is only supported since version 1.18.0 and raises a FilenameFormatError/TypeError in all prior versions.

Ah, that did it. I upgraded to 1.18.x and now the problem is solved when I use this filename format:

"filename": {
    "service == 'patreon'": "{id}-{filename:R /_/}.{extension}",
    ""                    : "{id}-{num}.{extension}"
}

Testing the example from the OP gives me one file with the format id-num, which is exactly what I want. It continues to behave the same way it always has for other services like SubscribeStar. Thanks.

@ghost ghost closed this as completed Jul 8, 2021
nixxquality added a commit to nixxquality/gallery-dl that referenced this issue Jul 13, 2021
Fixes mikf#1667 

Judging by the discussion in that thread, the first file is always duplicated in the attachments list, except for some posts which only have the image and no attachments.

This change makes it so if attachments are present it only downloads those.

It might require testing from various posts though. It worked in the one I tried for what it's worth, but I'm not too familiar with the service.
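
The rule described in this commit message could look roughly like the sketch below. It is not the code from the referenced commit; the `file` and `attachments` keys and the sample values are assumptions about the post data.

```python
def files_to_download(post):
    """Prefer attachments; fall back to the main file if there are none."""
    attachments = post.get("attachments") or []
    if attachments:
        # The main file is assumed to be duplicated in this list, so skip it.
        return list(attachments)
    main = post.get("file")
    return [main] if main else []


# Assumed sample posts for illustration.
with_attachments = {"file": "splat 1.png",
                    "attachments": ["splat_1.png", "extra.png"]}
header_only = {"file": "cover.png", "attachments": []}
```

This keeps header-only posts downloadable, which a blanket filter on the main file would miss.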
This issue was closed.