Challenge: Download user favorites (and create recommendations) #1

Open
DonaldTsang opened this issue Jun 19, 2019 · 9 comments

@DonaldTsang

DonaldTsang commented Jun 19, 2019

Same as kent-lee/deviantart-scraper#2 but with a few key differences:

  • favorites are at https://www.pixiv.net/bookmark.php?id=<userID>&rest=show
  • collections are at https://www.pixiv.net/bookmark.php?id=<userID>&tag=<collection>&rest=show
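A minimal sketch of building the two URLs listed above. The helper name `bookmark_url` is hypothetical; only the URL patterns come from this issue.

```python
from urllib.parse import urlencode

BOOKMARK_URL = "https://www.pixiv.net/bookmark.php"

def bookmark_url(user_id, tag=None):
    """Return the favorites URL for a user, or a collection URL if tag is given."""
    params = {"id": user_id}
    if tag is not None:
        params["tag"] = tag          # collection name
    params["rest"] = "show"          # public bookmarks, as in the URLs above
    return f"{BOOKMARK_URL}?{urlencode(params)}"

print(bookmark_url(123))
print(bookmark_url(123, tag="landscape"))
```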
@kent-lee
Owner

@DonaldTsang I have completed and uploaded implementations of (1), (2), and (3), please have a look at the newest commits and readme for details.

For (1), you can download all bookmarked artworks with the command python main.py bookmarks. You can also call api.user_bookmarks() in main.py to get a list of JSON objects of the artworks, which contains information like artist_id, tag, etc.

For (2), the original implementation already stores some basic metadata for all downloaded artwork, such as the artwork_id, title, and filename. You can view them by printing the results of save_users(), save_artwork(), save_artworks(), and save_bookmarks(). I don't write them to files for two reasons: (1) there is no need, and (2) I want to avoid I/O-bound tasks as much as possible because they greatly impact performance.

For (3), you can get recommended artists by calling recommend(). This function uses percentage to sort artists, as suggested, though I am having difficulties determining the threshold of the cutoff point. You mentioned that to recommend artist A from user U's bookmarks, U's bookmarks should contain more than X% of artworks from A. So, what should X be?

  • Problem: suppose user U has 100 bookmarks and the maximum number of artworks from the same artist A is 15, then the percentage of A is 15%. Suppose another user C has 1000 bookmarks and the maximum number of artworks from the same artist A is also 15, then in this case the percentage of A is 1.5%. This means that the threshold should be changed based on the total number of bookmarks, but I have yet to figure out a suitable equation for that.
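The numbers in this problem can be reproduced with a short sketch. It assumes each bookmark is reduced to its artist ID; the function name `artist_shares` and the sample data are hypothetical and are not taken from the repository.

```python
from collections import Counter

def artist_shares(bookmark_artist_ids):
    """Map each artist to the fraction of the bookmarks that are theirs."""
    counts = Counter(bookmark_artist_ids)
    total = len(bookmark_artist_ids)
    return {artist: count / total for artist, count in counts.items()}

# User U: 100 bookmarks, 15 of them by artist A.
user_u = ["A"] * 15 + [f"other{i}" for i in range(85)]
# User C: 1000 bookmarks, also 15 of them by artist A.
user_c = ["A"] * 15 + [f"other{i}" for i in range(985)]

print(artist_shares(user_u)["A"])  # share of A for user U (15 of 100)
print(artist_shares(user_c)["A"])  # share of A for user C (15 of 1000)
```

The same absolute count gives shares that differ by a factor of ten, which is why a single fixed X does not work for both users.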

As for your questions, I think your suggested recommendation feature is quite useful in cases where the user's bookmarks are to your liking. I have thought of two approaches to art discovery before: (1) based on rankings, and (2) based on related work. (1) has the problem that popular != what I like. (2) has the problem that you can only find artists with a similar art style. So, I think your suggestion is better in terms of consistency and accuracy. However, this only holds if the user's bookmarks are good; if they are not, it may perform worse than the above methods.

And yes, I do have a Discord account; my DiscordTag is Bruce Lee#5354. Feel free to add me :)

@DonaldTsang
Author

@kent-lee thanks!

@DonaldTsang
Author

DonaldTsang commented Jun 23, 2019

@kent-lee about (3) in the former list, you might want to read https://en.wikipedia.org/wiki/Penrose_square_root_law
It states that for any given population X within a larger population A+B+C+...+W+X, its voting power, or worth, as a percentage is sqrt(X)/(sqrt(A)+sqrt(B)+sqrt(C)+...+sqrt(W)+sqrt(X)).
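A minimal sketch of the square-root formula above, applied to per-artist bookmark counts. Treating each artist's bookmark count as a "population" is an assumption about how the suggestion would be used, and the name `sqrt_shares` is hypothetical.

```python
from math import sqrt

def sqrt_shares(counts):
    """Return sqrt(count) / sum of sqrt(counts) for every key in counts."""
    roots = {key: sqrt(value) for key, value in counts.items()}
    total = sum(roots.values())
    if total == 0:
        return {}
    return {key: root / total for key, root in roots.items()}

# Hypothetical counts: artist A has 16 bookmarked works, B and C have 4 each.
counts = {"A": 16, "B": 4, "C": 4}
print(sqrt_shares(counts))  # {'A': 0.5, 'B': 0.25, 'C': 0.25}
```

With raw percentages A would hold 16/24 of the pool; the square root compresses that to one half, so large counts dominate less.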

Regarding (1) and (2) in the latter list: (1) is mainly experimental, while (2) is much more accurate in most cases, assuming most good artists are also good collectors. Thus we need to balance discovery and similarity.

@kent-lee
Owner

@DonaldTsang sorry about the late reply; I have been quite busy recently.

For (3), I guess my wording wasn't clear. I was just unsure of the value of the threshold. I don't know the correct method to determine the value of the cutoff point such that the recommendations are accurate and of good quality.

  • If I have a set of user accounts, should the threshold be 15% because the total number of artworks is low and the average number of common artworks from the same artists is high, or should it be 1.5% because the total number of artworks is high and the average number of common artworks from the same artists is low? What is the equation that determines the threshold given the total number of bookmarks and the average number of common artworks?

@DonaldTsang
Author

@kent-lee the best thing to do is to separate the two views. For me personally, "the total number of artworks is low and the average number of common artworks from the same artists is high" makes much more sense. The amount of commonly shared artwork (either as a percentage or as an absolute count) should be the base metric. Of course, we could do something more complex, like looking at two or more users sharing the same bookmarks, but for now we can assume all bookmarks from the list of users go into ONE pool.
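The single-pool idea could be sketched as follows. The input shape (a dict mapping each user ID to the artist IDs of that user's bookmarks) and both function names are assumptions for illustration, not the repository's API.

```python
from collections import Counter

def pooled_artist_counts(bookmarks_by_user):
    """Merge every user's bookmarks into one pool and count works per artist."""
    pool = Counter()
    for artist_ids in bookmarks_by_user.values():
        pool.update(artist_ids)
    return pool

def top_artists(bookmarks_by_user, n=3):
    """Return (artist, absolute count, share of the pool) for the top n artists."""
    pool = pooled_artist_counts(bookmarks_by_user)
    total = sum(pool.values())
    return [(artist, count, count / total) for artist, count in pool.most_common(n)]

# Hypothetical sample data.
bookmarks = {"u1": ["A", "A", "B"], "u2": ["A", "C"]}
for artist, count, share in top_artists(bookmarks, n=1):
    print(artist, count, share)
```

Returning both the absolute count and the share keeps either one available as the base metric.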

@DonaldTsang
Author

DonaldTsang commented Jul 6, 2019

Okay, so I discovered PageRank and HITS (and also SALSA). Maybe you can try using these algorithms to find relevant items?
The network would basically have these three components:

  1. Artist => Favorites page of the artist
  2. Favorites => List of favorited art
  3. Art => Artist that made the art

And if you would allow follows/followers:

  1. Artist => Follows page of the artist
  2. Follows => List of followed artist

Of course, if we want follows to go along with favorites, there will need to be a weighting system between the two types of links when using PageRank or HITS.
See: https://networkx.github.io/documentation/stable/reference/algorithms/link_analysis.html
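As a rough illustration of the network described above, here is a weighted PageRank written as a plain power iteration. It deliberately avoids the networkx API so that it has no dependencies. The node names, the sample edges, and the 0.5 weight on follow links are all made-up assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank. edges is a list of (source, target, weight) tuples."""
    edges = [(s, t, w) for s, t, w in edges if w > 0]  # ignore non-positive weights
    nodes = sorted({node for s, t, _ in edges for node in (s, t)})
    out_weight = {node: 0.0 for node in nodes}
    for source, _, weight in edges:
        out_weight[source] += weight
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for source, target, weight in edges:
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank

FAVORITE, FOLLOW = 1.0, 0.5  # arbitrary example weights for the two link types
edges = [
    ("artist:alice", "favs:alice", FAVORITE),      # artist => favorites page
    ("favs:alice", "art:1", FAVORITE),             # favorites => favorited art
    ("favs:alice", "art:2", FAVORITE),
    ("art:1", "artist:bob", FAVORITE),             # art => artist that made it
    ("art:2", "artist:bob", FAVORITE),
    ("artist:alice", "follows:alice", FOLLOW),     # artist => follows page
    ("follows:alice", "artist:carol", FOLLOW),     # follows => followed artist
]
ranks = pagerank(edges)
for node in sorted(ranks, key=ranks.get, reverse=True):
    if node.startswith("artist:"):
        print(node, round(ranks[node], 4))
```

Changing the ratio between FAVORITE and FOLLOW shifts how much rank flows through each kind of link, which is one way to realise the weighting between the two link types.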

@kent-lee
Owner

kent-lee commented Jul 8, 2019

@DonaldTsang thank you for yet another great suggestion! I just implemented ranking functionality to get the top N ranking artworks given certain parameter values. I was planning to use this to find good recommendations and relevant items, but you just gave me another method that I can try. So, thank you again for the information; I will certainly get to it shortly.

@DonaldTsang
Author

@kent-lee no need to thank me, we are all in the same boat.

@DonaldTsang
Author

DonaldTsang commented Dec 13, 2019

For Pixiv, to get the list of who an artist follows, just go to https://www.pixiv.net/bookmark.php?id=<some_id>&type=user. All followings on that page are in <a> elements inside <li> items within the page's <ul> list, and everything is paginated, so you can move to the next page with p=2 or some other number.
From that we can pull some tricks from "Twitter Following Graphs" (who follows whom on Twitter) and rank people based on what they liked (link prediction and community detection).
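A sketch of the extraction step described above, using only the standard library. It assumes the page has already been downloaded and follows the <ul>/<li>/<a> layout mentioned in this comment. The sample HTML and its href format are made up, and the real page may need a more specific selector.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

def following_url(user_id, page=1):
    """Build the followings page URL for a user; p selects the page."""
    query = urlencode({"id": user_id, "type": "user", "p": page})
    return f"https://www.pixiv.net/bookmark.php?{query}"

class FollowingLinkParser(HTMLParser):
    """Collect the href of every <a> that sits inside an <li> within a <ul>."""

    def __init__(self):
        super().__init__()
        self._ul_depth = 0
        self._li_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self._ul_depth += 1
        elif tag == "li" and self._ul_depth:
            self._li_depth += 1
        elif tag == "a" and self._li_depth:
            href = dict(attrs).get("href")
            if href and href not in self.links:  # skip repeated links
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "ul" and self._ul_depth:
            self._ul_depth -= 1
        elif tag == "li" and self._li_depth:
            self._li_depth -= 1

def extract_following_links(html):
    parser = FollowingLinkParser()
    parser.feed(html)
    return parser.links

# Made-up sample markup, not copied from Pixiv.
sample = (
    '<a href="/top">home</a>'
    '<ul><li><a href="member.php?id=11">a</a></li>'
    '<li><a href="member.php?id=22">b</a></li></ul>'
)
print(following_url(123, page=2))
print(extract_following_links(sample))
```

The resulting (follower, followed) pairs could then be fed into a graph for the link prediction and community detection ideas mentioned above.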
