Improve expensive tasks in gatsby (a.k.a. Jobs API) #19831

Closed
3 of 5 tasks
wardpeet opened this issue Nov 27, 2019 · 1 comment

wardpeet commented Nov 27, 2019

Jobs API

Why/what?

Gatsby runs some CPU- and I/O-intensive tasks while compiling your website to make it blazing fast. Examples include image processing, HTML generation, and query running. Currently, it's up to the plugin author to cache and coordinate these tasks, which can be burdensome.

We want to make this simpler and more robust. Heavy tasks are expensive, so each one should be handled only once. Jobs should also be deterministic so we can save them to disk and re-run them when the Gatsby process gets interrupted. We're converting the old createJob API to a new one that handles most of the issues described above.

Implementation details

What would this API look like?

actions.createJob = (eventName: string, {
  inputPaths: string[],
  outputDir: string,
  args: Record<string, unknown>
}): Promise<unknown>

Some notes about these properties:
All arguments need to be serializable, which means no functions, classes, and so on. inputPaths and outputDir need to live inside the Gatsby root. We'll have some validation checks for this.
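The two constraints above could be checked roughly as follows. This is only a sketch under assumptions: isInsideRoot, isSerializable, and validateJob are hypothetical helper names, not part of Gatsby, and "serializable" is interpreted here as plain JSON-style data.

```javascript
const path = require("path")

// Hypothetical helper: true when `target` resolves to `root` or a location below it.
function isInsideRoot(root, target) {
  const relative = path.relative(root, path.resolve(root, target))
  return (
    relative !== ".." &&
    !relative.startsWith(".." + path.sep) &&
    !path.isAbsolute(relative) // e.g. a different drive on Windows
  )
}

// Hypothetical helper: accepts only JSON-style data (no functions, class
// instances, undefined, symbols). It does not guard against cyclic objects.
function isSerializable(value) {
  if (value === null) return true
  const type = typeof value
  if (type === "string" || type === "boolean") return true
  if (type === "number") return Number.isFinite(value)
  if (Array.isArray(value)) return value.every(item => isSerializable(item))
  if (type === "object") {
    const proto = Object.getPrototypeOf(value)
    if (proto !== Object.prototype && proto !== null) return false
    return Object.values(value).every(item => isSerializable(item))
  }
  return false
}

// Returns a list of human-readable problems; an empty list means the job is valid.
function validateJob(root, { inputPaths, outputDir, args }) {
  const errors = []
  if (!Array.isArray(inputPaths)) {
    errors.push("inputPaths must be an array of paths")
  } else {
    for (const inputPath of inputPaths) {
      if (typeof inputPath !== "string" || !isInsideRoot(root, inputPath)) {
        errors.push(`inputPath is outside the root: ${String(inputPath)}`)
      }
    }
  }
  if (typeof outputDir !== "string" || !isInsideRoot(root, outputDir)) {
    errors.push(`outputDir is outside the root: ${String(outputDir)}`)
  }
  if (!isSerializable(args)) {
    errors.push("args must be serializable")
  }
  return errors
}
```

Returning a list of problems rather than throwing on the first one is a design choice for this sketch; it lets a plugin author see every issue with a job at once.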

Now that we have a job, how does it know what action to execute? A plugin needs a worker.js that exports a function named after the event. The function receives inputPaths, outputDir, and args as a single argument object. When the worker's promise resolves, we mark the job as complete.
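The lookup described above might be sketched like this. runJob is a hypothetical dispatcher, the `worker` parameter stands in for whatever a plugin's worker.js exports, and the returned status object is an assumption made for illustration.

```javascript
// Hypothetical dispatcher: finds the exported worker function whose name
// matches the job's event name, calls it, and reports completion.
async function runJob(worker, job) {
  const hasHandler =
    Object.prototype.hasOwnProperty.call(worker, job.name) &&
    typeof worker[job.name] === "function"

  if (!hasHandler) {
    throw new Error(`No worker function exported for job "${job.name}"`)
  }

  // The worker receives a single object with inputPaths, outputDir and args.
  await worker[job.name]({
    inputPaths: job.inputPaths,
    outputDir: job.outputDir,
    args: job.args,
  })

  // Assumption: the job counts as complete once the worker's promise resolves.
  return { name: job.name, status: "complete" }
}
```

In real use, `worker` would come from something like `require()` on the plugin's worker.js; how Gatsby actually resolves and loads that file is not specified here.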

What benefits does this Jobs API provide?
Well, we'll be able to create a deterministic hash based on the arguments, outputDir, and inputPaths (content hash). Having a deterministic hash per job makes it easy to cache results and avoid duplicate work. Coupling a job with its process lets us control the full flow of a job: applying backpressure, making sure the job is done, and more. Saving jobs to disk is also a big win, as we can re-run jobs when the process was interrupted.
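As a rough illustration of the hashing and de-duplication idea, here is a minimal sketch. The function names, the choice of SHA-256, and the in-memory Map are all assumptions for the example; the real implementation may differ.

```javascript
const crypto = require("crypto")
const fs = require("fs")

// Serialize with sorted object keys so that key order does not change the hash.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(item => stableStringify(item)).join(",")}]`
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map(key => `${JSON.stringify(key)}:${stableStringify(value[key])}`)
    return `{${entries.join(",")}}`
  }
  return JSON.stringify(value)
}

// Content hash of one input file.
function hashFile(filePath) {
  return crypto.createHash("sha256").update(fs.readFileSync(filePath)).digest("hex")
}

// Hypothetical digest: combines the job name, each input's path and content
// hash, outputDir and args. `hashInput` is injectable so it can be faked.
function createJobDigest({ name, inputPaths, outputDir, args }, hashInput = hashFile) {
  const payload = stableStringify({
    name,
    inputs: inputPaths.map(p => ({ path: p, contentHash: hashInput(p) })),
    outputDir,
    args,
  })
  return crypto.createHash("sha256").update(payload).digest("hex")
}

// Jobs with the same digest share one promise, so the work happens only once.
const jobsInProgress = new Map()

function enqueueJob(job, run, hashInput) {
  const digest = createJobDigest(job, hashInput)
  if (!jobsInProgress.has(digest)) {
    jobsInProgress.set(digest, Promise.resolve().then(() => run(job)))
  }
  return jobsInProgress.get(digest)
}
```

Because the digest depends only on the job's data, the same key could also be used to look up results persisted to disk from an earlier, interrupted run.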

Simple flowchart: [image: gatsby-jobs-api]

Enough jibber-jabber, here's an example.

gatsby-plugin-sharp will have a worker that looks like this:

exports.IMAGE_PROCESSING = ({ inputPaths, outputDir, args }) => {
  return processFile(
    inputPaths[0],
    outputDir,
    args.contentDigest,
    args.operations,
    args.pluginOptions
  )
}

Inside our plugin we can do:

actions.createJob('IMAGE_PROCESSING', {
  inputPaths: [file.absolutePath], // must be an array; the worker reads inputPaths[0]
  outputDir: path.join(
    `public`,
    `static`,
    file.internal.contentDigest
  ),
  args: {
    operations: [{
      width: 300,
      height: 250,
    }],
    pluginOptions: {},
    contentDigest: '1234',
  }
})

TODO:

  • Implement worker for gatsby-plugin-sharp
  • Implement basic Jobs API without persisting to disk
  • Implement disk persisting
  • Add Jobs API to gatsby-transformer-sqip
  • Add Jobs API to HTML generation
@wardpeet wardpeet self-assigned this Nov 27, 2019
wardpeet added a commit that referenced this issue Jan 21, 2020
Added a new action called createJobV2 to support the new API. createJob is still available to keep backward compatibility. I'll add a deprecation message when createJobV2 is fully operational.

createJobV2 needs a job object that contains 4 properties: name, inputPaths, outputDir, and args. These properties are used to create a unique content digest to make sure a job is deterministic.

inputPaths are converted into relative paths before being sent to the worker, as they need to be filesystem agnostic.
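One way to picture that conversion is the sketch below. toRelativePosixPath is a hypothetical helper, and normalizing separators to forward slashes is an assumption about what "filesystem agnostic" means here.

```javascript
const path = require("path")

// Hypothetical helper: expresses an absolute path relative to the site root,
// using forward slashes regardless of the platform's separator.
function toRelativePosixPath(rootDir, absolutePath) {
  return path.relative(rootDir, absolutePath).split(path.sep).join("/")
}
```

A worker on another machine could then resolve the relative path against its own copy of the site root.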

More info on why can be found in #19831
@freiksenet

Closing because it has been done.


4 participants