Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: does gatsby-source-filesystem not cache downloaded remote files between builds? #4464

Closed
RobertBolender opened this issue Mar 9, 2018 · 14 comments

Comments

@RobertBolender
Copy link
Contributor

RobertBolender commented Mar 9, 2018

I'm still digging into why my site with gatsby-source-wordpress is taking so long to build. (#4293)

For context: I have a site with about 1,000 posts and about 2,000 (large) images.
My builds are currently taking 20-30 minutes to finish. If I skip downloading images, it takes 2.

Looking at create-remote-file-node it looks like the cache is only used to remember headers that were received during the current build, I guess to prevent downloading the same image twice during a single build in case the image is referenced multiple times.

From what I can tell, Gatsby will still download every remote file on every build.

Would it not be possible to store the whole image to prevent redownloading and reprocessing with gatsby-transformer-sharp?

@RobertBolender RobertBolender changed the title Question: does gatsby-source-filesystem not cache downloaded remote files? Question: does gatsby-source-filesystem not cache downloaded remote files between builds? Mar 9, 2018
@RobertBolender
Copy link
Contributor Author

Also, it appears that gatsby-source-wordpress is downloading the entire WordPress media library on each build, which for me is 3.5GB and quite a lot of time. Is this necessary?

@pieh
Copy link
Contributor

pieh commented Mar 12, 2018

Hey @awesomebob, would you like making PR to change that in gatsby-source-filesystem? gatsby-transformer-sharp will use images it produced in previous build if they are available, so it's mostly about downloading images

@RobertBolender
Copy link
Contributor Author

@pieh I would be willing to do that.

It looks like I just need to check if filename already exists with existsSync from fs-extra, which is what gatsby-transformer-sharp is already using.

@RobertBolender
Copy link
Contributor Author

Upon further reflection, I'm still going to run into problems even once #4493 is merged in.
Gatsby clears the .cache whenever a plugin is updated, and then gatsby-source-wordpress will try to download the entire media library again, sequentially, and that is taking longer than the 1 hour I have to work with in GitLab CI.

Should I try to parallelize that download, or what else can I do? I currently have 3.5GB in the WordPress media library, and that's going to keep growing.

@KyleAMathews
Copy link
Contributor

Yeah — files aren't redownloaded — it's the clearing of the cache that's the problem.

Perhaps we should make the automatic cache clearing less aggressive? Just remove the redux-state.json data?

@RobertBolender
Copy link
Contributor Author

Ah you're right, it's not redownloading files on a 304.

Clearing the cache less often might help, but at some point I'm going to have to redownload the entire WordPress Media Library again. It seems like I might have shot myself in the foot for adding 3.5GB of media there.

I might need to look into pre-processing the images first instead of during the Gatsby build if it's going to take more than an hour for my large site.

@pieh
Copy link
Contributor

pieh commented Mar 13, 2018

That's why I was so confused as I thought this was already working but reading code didn't make it obvious.

I'd be up for exploring less aggressive cache invalidation as this might be quick win for cases like this.
Should we also delete: (?)

  • cache/db.json persistent key-value store (for cache object passed to plugins)
  • json directory (results of queries)
  • layouts directory (layout components wrapped with data from the query)
  • api-runner-browser.js and api-runner-ssr.js (api runners with requires to plugins that use those apis)
  • async-requires.js and sync-requires.js (used to let webpack handle bundling/splitting of components and results of queries)
  • pages.json (mapping to chunks with page/layout components and json files with results of queries)

I think that's all the things gatsby dynamically generates on itself (part from redux-state.json). My bet would be to delete it all when gatsby invlidates cache as it will get regenerated anyway.

@KyleAMathews
Copy link
Contributor

Yeah, cache/db.json too. The others are written over on every bootstrap so not caches.

Yeah, this would be an easy quick win. You want to take this @pieh?

@pieh
Copy link
Contributor

pieh commented Mar 13, 2018

@KyleAMathews Yeah, I will try to prepare PR for tomorrow.

@awesomebob Downloading is already parallelized ( https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-wordpress/src/normalize.js#L382-L413 ). Maybe running so many concurrent requests is actually hurting more than it helps? Maybe it would be worth exploring some kind of async queue for that with sane limit of concurrent operations?

@pieh
Copy link
Contributor

pieh commented Mar 14, 2018

On the second thought - less aggressive cache invalidating with just deleting cache/db.json and redux-state.json will still cause redownloads anyway as content for If-None-Match header is stored in cache/db.json.

This would have to be more robust and probably would need separate cache instances for all plugins and deleting just caches for plugins that has changed. Thoughts?

@KyleAMathews
Copy link
Contributor

Oh hmmm yeah. Perhaps the remote file downloader can just use its own cache for now.

@pieh
Copy link
Contributor

pieh commented Mar 14, 2018

I'm not sure if it's worth time/effort to do this just for file downloader - it depends how do you see permanent cache implemented (one you mentioned here - #3977 (comment) ).

I could do initial part of just splitting global key-value cache into separate caches dedicated for each plugin ( this one where headers are stored - https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/utils/cache.js ) - mainly this would be needed to not redownload files after cache invalidation (if we don't change gatsby-source-filesystem)

In next iterations we can implement helper function for plugins to get their dedicated file cache path and store nodes and plugin status in dedicated partial state jsons.

Ultimately I would see it something like this:

.cache
├── plugins
│   └── gatsby-source-somename
│       └── files ( we can provide helper function to get path to that )
│           └── cached-file.jpg
│       ├── db.json ( dedicated key-value cache - this is what I would start on )
│       └── partial-redux-state.json ( nodes and plugin status - both version and what plugin will set using `setPluginStatus` function )
└── global-redux-state.json ( componentDataDependencies )

@KyleAMathews
Copy link
Contributor

Invalidating the cache by plugin version would be a great start to make the cache permanent/immutable. Instead of separate caches though, how about we just prepend cache keys with the plugin version which would accomplish the same thing & avoid having dozens of caches.

@KyleAMathews
Copy link
Contributor

Due to the high volume of issues, we're closing out older ones without recent activity. Please open a new issue if you need help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants