Reduce runtime by splitting work based on Domain #811

Open
stranger-danger-zamu opened this issue Feb 24, 2022 · 2 comments

@stranger-danger-zamu

stranger-danger-zamu commented Feb 24, 2022

Edit: A quick, rough PoC implementation can be found at this commit.

Current Behavior

Currently, FanFicFare processes all the provided links sequentially. This is reasonable, since it prevents hammering a site while retrieving content.

However, this is slow when retrieving works from a number of different sites.

Expected Behavior

FanFicFare should minimize the wall-clock time it takes to process all the links. Since it only needs to work sequentially per domain to avoid hammering any one server, FFF should group the links by domain and process the domains concurrently.

While working concurrently across domains, FFF would still work through each domain's links sequentially, so it would not be hammering the sites, and end users would see a noticeable improvement when grabbing links from more than one site.

Steps for reproduction

To demonstrate the difference this makes, compare the wall time of the two commands below:

# Using the FFF CLI, grab two works from 
# fanfiction.net and two from archiveofourown.org.
fanficfare <ffn link1> <ffn link2> <ao3 link1> <ao3 link2>
# Launch an instance of FFF per domain 
# while getting the same links.
fanficfare <ffn link1> <ffn link2> & \
fanficfare <ao3 link1> <ao3 link2>

Environment

Observed on:

  • OS: OSX (12.2.1), Python: 3.10.2, FFF: 4.10.0
  • OS: Linux (5.16.9), Python: 3.10.2, FFF: 4.10.0

Anything else:

This is magnified because some sites are much slower than others for a variety of reasons. By splitting the work out by domain, the total runtime drops from the sum of the time to process all the links to the time to process the links from the slowest domain alone (for example, if one site's links take 10 minutes and another's take 4, the total drops from roughly 14 minutes to roughly 10).

I also understand that this might be constrained by the way the Calibre integration works. But it seems it should be possible to spin up a multiprocessing pool with one worker per domain, send each worker its domain's list of links, then join on the pool for the results and continue on as normal. Even the usual worries about process start-up time are very much mitigated, as FFF can take more than 10 minutes for longer fics.
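
For illustration, here is a minimal sketch of that grouping-plus-pool idea using only the standard library; process_links() is a hypothetical stand-in for FFF's existing sequential per-link loop, not a real FFF function:

# Sketch only: group URLs by domain, then process each domain in its own
# worker process while keeping the per-domain work sequential.
from collections import defaultdict
from multiprocessing import Pool
from urllib.parse import urlparse

def process_links(links):
    # Hypothetical placeholder for FFF's existing sequential
    # fetch/convert loop over a list of story URLs.
    return [("done", url) for url in links]

def group_by_domain(urls):
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return groups

def process_all(urls):
    groups = group_by_domain(urls)
    # One worker per domain; each domain's links stay in their original order,
    # so no single site ever sees concurrent requests.
    with Pool(processes=len(groups) or 1) as pool:
        results = pool.map(process_links, list(groups.values()))
    # Flatten the per-domain results and continue on as normal.
    return [item for domain_results in results for item in domain_results]

if __name__ == "__main__":
    print(process_all([
        "https://www.fanfiction.net/s/1111", "https://www.fanfiction.net/s/2222",
        "https://archiveofourown.org/works/3333", "https://archiveofourown.org/works/4444",
    ]))

The wall time then becomes roughly that of the slowest domain's worker rather than the sum over all links.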

I know this is something I can do on my end with a fancy one-liner or a simple wrapper script around FFF, but I feel it is something that would benefit everyone by being part of FFF.

It might also be something that isn't incorporated into the core of FFF, but just a change in the CLI that groups the links and spawns a subprocess per domain.
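
As a rough sketch of that CLI-only route (a standalone wrapper, not a proposed patch; it assumes the fanficfare executable is on PATH):

#!/usr/bin/env python3
# Group URLs by domain and launch one `fanficfare` subprocess per domain.
# Each subprocess works through its domain's links sequentially, so no
# single site sees concurrent requests from this wrapper.
import subprocess
import sys
from collections import defaultdict
from urllib.parse import urlparse

def main(urls):
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    procs = [subprocess.Popen(["fanficfare"] + links) for links in groups.values()]
    # Wait for every per-domain process and report the worst exit code.
    return max((p.wait() for p in procs), default=0)

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))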

@stranger-danger-zamu
Author

stranger-danger-zamu commented Feb 24, 2022

A quick, rough PoC implementation can be found at this commit.

It makes debug logging kind of useless, but then again, if an issue occurs the user can simply be told to run it with one URL, which would produce a readable debug log again.

I haven't gotten around to testing it on Python 2.7, but both collections.defaultdict and multiprocessing.Pool were in the standard library by then, and I used the 2.7 docs to check which functions were available. I don't particularly know how to build a dev build and test it in Calibre, though.

Additionally, while I don't think there are multiple adapters for the same domain, there might be, and that case would need to be handled lest we hammer some servers (and maybe get people banned/blocked).

@JimmXinu
Owner

You are welcome to submit a PR with such an implementation, as long as it is optional and defaults to off (at least to start with), and you are willing to help support it if there are troubles with that part.

FYI, the Calibre plugin version already runs one process per site using Calibre's server pool implementation (limited by the number of CPUs Calibre reports).

I believe the only case of a domain with more than one adapter went away.

There are a few adapters for different domains that are apparently the same 'site' insofar as they share logins; scifistories.com, storiesonline.net, and finestories.com are the set that comes to mind. There may be others.
