New tag lists: EnglishDictionary.csv and Derpibooru.csv #280

Closed
Nenotriple opened this issue Apr 13, 2024 · 3 comments

Comments

@Nenotriple

I've put together two additional tag lists that should be formatted correctly for TAC to use. I made these for my tagging app and I figured you may be interested in them. I hope it's useful, but if you're not interested in incorporating them, no sweat!

I've had the English Dictionary tag list for a while now, but recently created the Derpibooru list because of the popularity of Pony Diffusion V6 XL.

I also wanted to apologize because I've used the danbooru.csv and e621.csv tag lists in my own app without giving you credit; I wasn't thinking when I got the files from here. Sorry about that, I'll make sure to update the repo to give you credit.


EnglishDictionary.csv is in this format: <name>,<type>,<postCount>,"<aliases>"

The aliases for EnglishDictionary.csv are common typos and corrections. I wanted to classify words as nouns, verbs, adjectives, etc., but I couldn't find a good way to do this, so the <type> column is always "0".

Because of the huge effort required, I prioritized <aliases> for just the most common words. I was also only able to retrieve useful <postCount> data for about half the total words; the rest are simply sorted alphabetically. This is less than ideal, but I don't know how to improve it.
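
For anyone reading the file outside of TAC, here is a minimal sketch of parsing the `<name>,<type>,<postCount>,"<aliases>"` layout with Python's standard `csv` module. The sample rows are made up for illustration, and treating an empty `<postCount>` as 0 is an assumption, not something the file is known to need.

```python
import csv
import io

# Hypothetical sample rows in the <name>,<type>,<postCount>,"<aliases>" layout.
# The words, counts, and aliases are invented for illustration only.
SAMPLE = '''because,0,12345,"becuase,beacuse"
zebra,0,0,""
'''

def parse_tag_rows(text):
    """Yield one dict per CSV row, splitting the quoted alias field on commas."""
    for name, tag_type, post_count, aliases in csv.reader(io.StringIO(text)):
        yield {
            "name": name,
            "type": int(tag_type),
            # Assumption: a missing count is treated as 0.
            "post_count": int(post_count) if post_count else 0,
            "aliases": [a for a in aliases.split(",") if a],
        }

rows = list(parse_tag_rows(SAMPLE))
```

The quoting matters here: because the alias list contains commas itself, the whole field has to be wrapped in double quotes so it stays a single column.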


Derpibooru.csv is in this format: <name>,<type>,<postCount>, (no aliases)

The <type> key for Derpibooru is:

1 = content-official
2 = BLANK (general tag)
3 = species
4 = oc
5 = rating
6 = body-type
7 = character
8 = origin
9 = error
10 = spoiler
11 = content-fanmade

For many tags the "data-tag-category" attribute was blank when scraping; these appear to be general tags.
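
As a rough sketch, the key above can be written as a lookup table. The names `DERPIBOORU_TYPES` and `category_name` are hypothetical helpers, and the sample row is invented; only the number-to-category mapping comes from the list above.

```python
import csv
import io

# Type key for Derpibooru.csv, copied from the list above.
DERPIBOORU_TYPES = {
    1: "content-official",
    2: "general",  # blank "data-tag-category" in the scrape
    3: "species",
    4: "oc",
    5: "rating",
    6: "body-type",
    7: "character",
    8: "origin",
    9: "error",
    10: "spoiler",
    11: "content-fanmade",
}

def category_name(type_id):
    """Return the category label for a type number, or "unknown" if unlisted."""
    return DERPIBOORU_TYPES.get(type_id, "unknown")

# Hypothetical row; the trailing comma leaves an empty fourth (alias) field.
row = next(csv.reader(io.StringIO("example_tag,3,1200,\n")))
name, type_id, post_count = row[0], int(row[1]), int(row[2])
label = category_name(type_id)
```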

I could probably create another scraper that could grab tag aliases, but I didn't have the time while setting this up originally.

I've tried to do my best to clean up the tag list without removing or altering useful tags, but there's always a chance of ruining something. Because of this I'm also including the file Derpibooru_original.csv. This is the direct result of scraping the tag list before any additional cleanup or changes. You may have a better idea of how to clean it up than I do.


CSV Files:

EnglishDictionary.csv

Derpibooru.csv

Derpibooru_original.csv

@DominikDoom
Owner

Much appreciated! Just a quick question: the Derpibooru list still seems to contain quite a lot of tags with very low post counts, from what I've seen at a glance. For danbooru I just cut it off at 100k (even then there are a lot of not-so-useful tags still included), but here it might be better to go directly by post count and, e.g., drop all lines with fewer than 20 posts or some higher cutoff number. What do you think? For the English dictionary this of course isn't an issue.

@Nenotriple
Author

I had the same thought, and I was going to cut off the tags at low post counts, but there are many tags that seem useful regardless of the low count. I really don't have much experience using these tags, so I don't know how useful they actually are with Pony V6.

I'm totally in favor of pruning the Derpibooru tags. The list is fairly chaotic because there's so much OC, and the users seem to invent new tags all the time. I even set up my scraper to ignore "oc:" and "artist:" tags with a post count less than 35 because there were so many results.

A cutoff of less than 20 posts would be fine, but that's slightly more than half the tags gone. There's certainly a lot of junk tags, but I wouldn't exactly call a tag junk because it has a low count. It may be well represented in Stable Diffusion, just not on the imageboard.

I found some issues with sorting and some other small stuff, and I removed all tags with a post count less than 5. I still think it's a fine idea to cut off more, but this is probably how I'll use it, just for more completeness.
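
The pruning itself is simple enough to redo with any cutoff. Below is a minimal sketch, assuming the post count is always a valid integer in the third column; `prune_by_post_count` and the sample tags are hypothetical, not the script actually used for the edit.

```python
import csv
import io

def prune_by_post_count(text, min_posts):
    """Return CSV text containing only the rows with at least min_posts posts."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # Column 2 is <postCount> in the <name>,<type>,<postCount>, layout.
        if int(row[2]) >= min_posts:
            writer.writerow(row)
    return out.getvalue()

# Invented sample rows; names and counts are for illustration only.
SAMPLE = "tag_a,0,120,\ntag_b,7,4,\ntag_c,4,5,\n"

kept_at_5 = prune_by_post_count(SAMPLE, 5)
kept_at_20 = prune_by_post_count(SAMPLE, 20)
```

Keeping the threshold as a parameter makes it easy to compare a cutoff of 5 against 20 or higher before settling on one.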

Derpibooru_EDIT.csv

@DominikDoom
Owner

Yes, it's always a balance problem, with the normal booru tags too. There are a lot of tags in the lower post counts that are perfectly understandable concepts for SD, just almost never used for tagging posts. Especially when it comes to landscape descriptions, lighting, styles etc. where the concept is common in natural language (and thus known by most SD models) but might get condensed to one tag like "outdoors" on the boorus, even if a more fine-grained distinction technically exists.

In the end it often comes down to how well the site enforces its tagging rules, and from what I've seen there at a glance, Derpibooru tags in the lower counts are often more of a shitpost than in any way related to the images. Like, nobody can convince me "muffled_rap_music_playing_in_distance" or "less_than_five_seconds_in_mspaint" is a useful tag under any circumstances (and these two examples are at least understandable English).

Anyways, thanks for the edit.

There is one issue remaining that I encountered, which is the tag categories. Derpibooru uses pretty different categories and colors from normal danbooru. On the technical side, all is well, since TAC supports custom colors per file already. But the issue is that this is an option, so while I can update the default, existing users will not get a fitting color scheme unless they manually add it or delete that option from the webui's config file to reset it to default. That will result in a lot of tags being interpreted as -1 (unknown), which is marked crimson red by default, since their number doesn't exist in the danbooru fallback color scheme. This affects every number above 5 and, most noticeably, category 2, which danbooru skips.
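
To illustrate the fallback problem, here is a small sketch of which Derpibooru type numbers have no counterpart among danbooru's category numbers (0, 1, 3, 4, 5). This is not TAC's actual lookup code; `resolve_category` is a hypothetical stand-in for "use the number if the scheme knows it, otherwise -1".

```python
# danbooru's category numbers: 0 general, 1 artist, 3 copyright,
# 4 character, 5 meta. Number 2 is not used.
DANBOORU_CATEGORIES = {0, 1, 3, 4, 5}

# Type numbers used in Derpibooru.csv, per the key in the first comment.
DERPIBOORU_TYPE_IDS = range(1, 12)

def resolve_category(type_id, known):
    """Hypothetical fallback: keep known numbers, map everything else to -1."""
    return type_id if type_id in known else -1

unknown = [
    t for t in DERPIBOORU_TYPE_IDS
    if resolve_category(t, DANBOORU_CATEGORIES) == -1
]
# unknown == [2, 6, 7, 8, 9, 10, 11]
```

Note that 1, 3, 4 and 5 do resolve, but they would presumably pick up the colors meant for danbooru's artist, copyright, character and meta categories rather than ones chosen for Derpibooru.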

Since you mentioned the cat 2 was initially just blank, I have just replaced it all with 0 to match danbooru's general tag, which at least fixes a good amount. But that still leaves 6-11 unspecified.
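
The replacement is roughly equivalent to the following sketch, which rewrites type 2 to 0 and leaves every other row untouched. The function name `remap_general` and the sample rows are made up for illustration.

```python
import csv
import io

def remap_general(text, old="2", new="0"):
    """Rewrite the <type> column from `old` to `new`, keeping other rows as-is."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if row[1] == old:
            row[1] = new
        writer.writerow(row)
    return out.getvalue()

# Invented sample rows in the <name>,<type>,<postCount>, layout.
SAMPLE = "tag_a,2,10,\ntag_b,7,3,\n"
remapped = remap_general(SAMPLE)
```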

2 participants