
GraphQL::Dataloader, built-in batching system #2483

Closed · wants to merge 42 commits
Conversation

@rmosolgo (Owner) commented Sep 18, 2019

Projects like https://github.com/Shopify/graphql-batch and https://github.com/exAspArk/batch-loader have proven the value of batch loading in GraphQL. In fact, you really can't run a production GraphQL system without batching.

For this reason, I want to include a batching system in GraphQL-Ruby (without breaking compatibility with existing systems, of course!). Here are some goals for this system:

  • Feature parity with GraphQL-Batch
  • First-class support for pushing IO to a background thread
  • Traceable -- include graphql context with loads, so that a developer can see what GraphQL fields used which loaders.
  • Good built-in defaults (eg, ActiveRecord, Redis, HTTP resources ... others?)
  • Well-documented, low-friction custom loaders

If anyone has other suggestions for a built-in dataloader, please share them!
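For anyone new to the pattern, here is a minimal, dependency-free sketch of the core idea behind every batch loader: individual `load` calls are queued, then resolved together in one `perform`-style call. All names here are illustrative, not this PR's API.

```ruby
# Minimal batch-loading sketch: `load` queues a key and returns a thunk;
# forcing any thunk resolves the whole pending batch in one call.
class TinyLoader
  def initialize(&perform)
    @perform = perform # block that fetches many keys at once
    @pending = []      # keys waiting to be fetched
    @results = {}      # key => fetched value
  end

  # Queue a key; returns a thunk that forces the batch when called.
  def load(key)
    @pending << key
    -> { resolve; @results[key] }
  end

  private

  def resolve
    return if @pending.empty?
    @results.merge!(@perform.call(@pending.uniq))
    @pending.clear
  end
end

db = { 1 => "Product 1", 2 => "Product 2" }
loader = TinyLoader.new { |ids| ids.to_h { |id| [id, db[id]] } }
a = loader.load(1)
b = loader.load(2)
# Both keys are fetched by a single perform block when first forced:
a.call # => "Product 1"
b.call # => "Product 2"
```

Real implementations add promise chaining, per-query caching, and error handling on top of this queue-then-flush core.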


TODO

  • Test caching in queries vs. no cache in mutation root
  • Add a good API
  • Test Concurrent::Future-powered background loaders
  • Make sure @panthomakos's considerations regarding thread safety are addressed
  • Make sure traceability matches graphql-metrics
  • Add some built-in loaders & test them
  • Sort out Execution::Lazy vs Dataloader::PendingLoad vs ::Promise
  • Make sure graphql-batch's tests pass with GraphQL::Dataloader (see "Allow running the test suite with GraphQL::Dataloader", graphql-batch#1)
  • Performance audit (GitHub uses a branch of Promise.rb for lower memory use)
  • Any reason not to install Dataloader by default? (dependency?)
  • Update docs for concurrent-ruby dependency in Dataloader main (not only background thread)
  • Support Lazy.sync as a public api? Or support instrumentation (see graphql-batch tests)
  • Docs:
    • Concepts
    • Installation
    • Built-in loaders
    • Custom loaders
    • Standardize language of batch keys / fetch parameters: what does graphql-batch do here?

@rmosolgo rmosolgo added this to the 1.10.0 milestone Sep 18, 2019
@rmosolgo rmosolgo self-assigned this Sep 18, 2019
@rmosolgo (Owner, Author)

@panthomakos, I'd love to get your feedback on the goals discussed here. This is where I want to take inspiration from your work in #1981 😊, so please let me know if I've overlooked any of your goals and accomplishments from that branch.

@eapache (Contributor) commented Sep 18, 2019

One of our requirements at Shopify is that we make very heavy use of https://github.com/Shopify/graphql-metrics/blob/master/lib/graphql_metrics/timed_batch_executor.rb, so we'll need some sort of similar hook where we can collect performance data.

@rmosolgo (Owner, Author)

👌 Thanks for the reference there, @eapache 👀 I'll keep it in mind!

@chrisbutcher (Contributor) commented Sep 19, 2019

Traceable -- include graphql context with loads, so that a developer can see what GraphQL fields used which loaders.

This would be amazing. With the existing graphql-ruby + graphql-batch, I couldn't find an obvious or clean way to, for example, attribute time spent batch loading a given field to a given ast_node.

With the existing executor / lazy loading implementation, I suppose this is true of any field resolvers that return promises, but I wonder if batch loader perform start/end hook methods that could read/write context would help?

@rmosolgo (Owner, Author)

read/write context

Yeah, I think that's part of the ticket: right now, loaders (ours, anyway) throw out graphql context (including current field, path, etc) and operate independently of it. I'd love to find a way that keeps the low-overhead api of loaders but adds context-awareness. Maybe we could tack on the context info after the application's resolver (eg, if it returns a promise, tack context onto the promise) to make it easy.
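The "tack context onto the promise" idea could look roughly like this. Everything below is a hypothetical sketch under made-up names (`PendingLoad`, `annotate_field_result`), not this PR's API: after the application's resolver returns, the executor checks for a known pending-load type and annotates it with the current GraphQL context so tracing can attribute the load to a field.

```ruby
# A promise-like placeholder standing in for a pending batch load.
class PendingLoad
  attr_reader :context

  def initialize(&thunk)
    @thunk = thunk
  end

  # Called by the executor after the field resolver returns:
  def with_context(ctx)
    @context = ctx
    self
  end

  def value
    @thunk.call
  end
end

# The executor-side hook: annotate pending loads, pass plain values through.
def annotate_field_result(result, graphql_context)
  result.is_a?(PendingLoad) ? result.with_context(graphql_context) : result
end

pending = annotate_field_result(PendingLoad.new { 42 }, { path: ["product", "images"] })
pending.context # => { path: ["product", "images"] }
pending.value   # => 42
annotate_field_result("plain", {}) # => "plain"
```

The point of doing this in the executor rather than in the loader is that the application's resolver code stays untouched while every load still carries field/path information for tracing.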

@sfcgeorge

I found the current ones a bit verbose so implemented a nice little DSL that looks like this:

def self.authorized?(timeslot, context)
  return false unless super
  return false unless context.current_instructor

  chain_load(timeslot).offer.outlet.instructors.then do |_offer, _outlet, instructors|
    instructors.include?(context.current_instructor)
  end
end

So you can load a chain of one-to-one relationships and optionally a one-to-many on the end, and access any of the models along the way in the block.

https://gist.github.com/sfcgeorge/e067822f174d42175fec0f2264fe399e

@rmosolgo (Owner, Author)

Nice, thanks for sharing! We have some similar shortcuts in github/github. I like the approach of adding methods like chain_load, and that might be an option for sneaking in context without making the user-facing API too burdensome.

@panthomakos left a comment

This is exciting work. Sorry it has taken me so long to respond. I have a few questions about the concurrent implementation.

end

def load(value)
@promises[value] ||= GraphQL::Execution::Lazy.new do


Is it possible that you will end up with two lazy executions for the same value based on how threads are scheduled? You might consider using a https://ruby-concurrency.github.io/concurrent-ruby/1.1.4/Concurrent/Map.html.
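The race being described is that `@promises[value] ||= ...` is a check-then-set, so two threads can each build a lazy for the same value. A dependency-free illustration of the fix: guard the hash with a Mutex (concurrent-ruby's `Concurrent::Map#compute_if_absent` gives the same per-key guarantee without a global lock).

```ruby
# Thread-safe memoizing cache: at most one value is ever built per key,
# no matter how threads interleave.
class ThreadSafeCache
  def initialize
    @mutex = Mutex.new
    @store = {}
  end

  def fetch_or_create(key)
    # ||= is atomic here because it runs inside the lock.
    @mutex.synchronize { @store[key] ||= yield }
  end
end

cache = ThreadSafeCache.new
threads = 10.times.map do
  Thread.new { cache.fetch_or_create(:a) { Object.new } }
end
distinct = threads.map(&:value).uniq.size
distinct # => 1 (every thread got the same object)
```

With a plain `Hash` and no lock, two threads can both see a missing key and both run the build block, which is exactly the duplicate-lazy scenario above.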

lib/graphql/dataloader/loader.rb (outdated review thread, resolved)
lib/graphql/dataloader/background_loader.rb (outdated review thread, resolved)
def load(value)
@promises[value] ||= GraphQL::Execution::Lazy.new do
if !@loaded_values.key?(value)
sync


In the concurrent example, sync will run in a separate thread. Correct?

If so, then how do you guarantee that @loaded_values[value] will be present on line 26 below?

@daemonsy (Contributor) left a comment

Hi @rmosolgo, I was looking into batch loading for my company, hoping to contribute something, and thankfully saw this ❤️. So instead, I'm hoping to contribute by adopting Dataloader for a nascent GraphQL API at my company that doesn't do batch loading yet.

Specifically,

  • Using GraphQL::Dataloader::Loader as a consumer
  • Contributing to the usage guide on this PR as we do more testing


def self.load(context, key, value)
dl = context[:dataloader]
loader = dl.loaders[self][key]
@daemonsy (Contributor)

Took me a while to get it; the way dl.loaders[self][key] instantiates the loader is really clever 👍.

So far in initial testing, I've already forgotten to use context as the first argument twice 😺. That's actually the main mental impedance so far: I'm not thinking about the context object while trying to write a batch loading statement.

@rmosolgo (Owner, Author)

Yes, I think there's got to be some better API for this. But I'd really like to keep dataloading context-aware so that we can trace it as part of the GraphQL request.


def initialize(context, key)
@context = context
@key = key
@daemonsy (Contributor)

For the typical use case, where @key is the model, #perform looks a little weird:

class MyLoader < GraphQL::Dataloader::Loader
  def perform(ids)
    @key.where(id: ids)
  end
end

Also, what are the thoughts around supporting additional arguments?

In graphql-batch, it was common to have:

RecordLoader.for(Product, :other_id).load(object.other_id)

which gets passed into the initializer of the loader. We used it for setting simple where conditions or for using a different key, as in the loader above.

@rmosolgo (Owner, Author)

Yes, I suppose it could be more readable like:

class MyLoader < GraphQL::Dataloader::Loader
  def initialize(context, model)
    @model = model 
    super 
  end 

  def perform(ids)
    @model.where(id: ids)
  end
end

This was required for graphql-batch (IIRC) because it didn't store state otherwise. It also didn't require the super call. I wonder how I can remove that boilerplate in this implementation 🤔
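One possible way to drop that boilerplate is to have the base class's initialize capture any extra batch arguments itself, so subclasses only ever override `perform`. This is a sketch under assumed names (`BaseLoader`, `batch_args`), not this PR's implementation:

```ruby
# Base class owns all state; subclasses never define initialize or call super.
class BaseLoader
  attr_reader :context, :batch_args

  def initialize(context, *batch_args)
    @context = context
    @batch_args = batch_args
  end
end

class ModelLoader < BaseLoader
  def perform(ids)
    model = batch_args.first
    # A real loader would run `model.where(id: ids)`; stubbed here so the
    # sketch runs without ActiveRecord:
    ids.map { |id| "#{model}##{id}" }
  end
end

loader = ModelLoader.new({ current_user: nil }, "Product")
loader.perform([1, 2]) # => ["Product#1", "Product#2"]
```

The trade-off is that batch arguments become positional and anonymous (`batch_args.first`), which is less readable than a named `@model`, so there's still a design choice between zero boilerplate and self-documenting subclasses.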

@rmosolgo rmosolgo mentioned this pull request Sep 27, 2019
14 tasks
@rmosolgo rmosolgo mentioned this pull request Oct 23, 2019
@rmosolgo rmosolgo removed this from the 1.10.0 milestone Jan 7, 2020
@rmosolgo (Owner, Author) commented Jan 7, 2020

I'm going to release 1.10 without this in the interest of time. That branch already has other big changes on it, there's still a lot of work to do here, and I haven't gotten to it. It isn't essential, either: graphql-batch works great, and you can build backgrounded IO on top of it, AFAIK.

@rmosolgo rmosolgo mentioned this pull request Aug 1, 2020
33 tasks
def self.resolve(results)
# First, kick off any loaders that will resolve in background threads
Dataloader.current && Dataloader.current.process_async_loader_queue
@rmosolgo (Owner, Author)

This isn't exactly subtle, but after a lot of attempts, I couldn't find a better way to work a "kick off" step into the existing execution flow. This is very similar to the original suggestion, but at the loader level instead of the promise level.

Comment on lines 113 to 119
def current
Thread.current[:graphql_dataloader]
end

def current=(dataloader)
Thread.current[:graphql_dataloader] = dataloader
end
@rmosolgo (Owner, Author)

I think this adds the requirement that GraphQL queries be executed within a single thread.

@rmosolgo (Owner, Author)

(The alternative would be to use context[:dataloader], which earlier iterations used. But then you're stuck with the question of how to get that dataloader into each loader, so that the loader can register itself with the dataloader's cache.)
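A sketch of the Thread.current approach above, with one addition worth considering: restore the previous dataloader in an `ensure` block so nested executions on the same thread stay correct. The `with_dataloader` wrapper is a hypothetical name, not this PR's API.

```ruby
class Dataloader
  def self.current
    Thread.current[:graphql_dataloader]
  end

  # Set the current dataloader for the duration of a block, then always
  # restore the previous one (handles nesting and exceptions).
  def self.with_dataloader(dataloader)
    previous = Thread.current[:graphql_dataloader]
    Thread.current[:graphql_dataloader] = dataloader
    yield
  ensure
    Thread.current[:graphql_dataloader] = previous
  end
end

outer = Object.new
Dataloader.with_dataloader(outer) do
  Dataloader.current.equal?(outer) # => true
end
Dataloader.current # => nil (restored after the block)
```

Because `Thread.current[...]` is per-thread state, this does pin a given query's execution to a single thread, which is exactly the requirement noted above.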

@swalkinshaw (Collaborator)

One small issue we've had with batch loaders is dealing with keys (or fetch parameters in your terminology?) that don't have data to fulfill. I think right now it leads to opaque Promise::BrokenError exceptions. Here's an example:

class ProductImageLoader < GraphQL::Batch::Loader
  def initialize(shop_id)
    @shop_id = shop_id
  end

  def perform(product_ids)
    Images
      .where(shop_id: @shop_id, product_id: product_ids)
      .group_by(&:product_id)
      .each { |product_id, images| fulfill(product_id, images) }
  end
end

If no images exist for a product id, it won't be fulfilled leading to that error. We have two common solutions:

  1. manually fulfill all unfulfilled keys
class ProductImageLoader < GraphQL::Batch::Loader
  def initialize(shop_id)
    @shop_id = shop_id
  end

  def perform(product_ids)
    Images
      .where(shop_id: @shop_id, product_id: product_ids)
      .group_by(&:product_id)
      .each { |product_id, images| fulfill(product_id, images) }

    product_ids.each { |id| fulfill(id, nil) unless fulfilled?(id) } # nil here, but any "default" value works
  end
end
  2. iterate over the keys/fetch parameters instead of the data
class ProductImageLoader < GraphQL::Batch::Loader
  def initialize(shop_id)
    @shop_id = shop_id
  end

  def perform(product_ids)
    images = Images
      .where(shop_id: @shop_id, product_id: product_ids)
      .group_by(&:product_id)

    product_ids.each do |id|
      fulfill(id, images[id])
    end
  end
end

I've had the idea before that we could have a better interface to enforce/prevent this situation. A dataloader could declare a default value explicitly. If set, the loader could automatically fulfill all missing keys with it?

But thinking more about this: perform has a fairly strict requirement that fulfill gets called for each of its fetch parameters, yet we have to do that work manually, which is error prone (as seen above). I wonder if there's a better interface we could give people instead 🤔 I'll give it more thought.

@rmosolgo (Owner, Author)

in your terminology?

I started updating those docs and realized I didn't have a good word for those different kinds of keys. Now I see that Batch::Loader uses group_args and keys, which seems good too. I just want to pick something that makes their usage clear.

Also, I'm torn between doubling down on the terms from graphql-batch and batch-loader, or picking new words to make it more googleable. Oh, and avoiding "-er" classes (http://wiki.c2.com/?DontCreateVerbClasses).

declare a default value explicitly

Yeah, I could see that, something like

unfulfilled_default nil 

Then the library could basically do

if self.class.set_unfulfilled_default?
  keys_to_load.each { |key| fulfill(key, self.class.unfulfilled_default) unless fulfilled?(key) }
end 

But probably only if unfulfilled_default was explicitly set, otherwise we'd raise an error of some kind. (Because I don't think we want it to silently ignore unfulfilled keys, that's important feedback to the application behavior.)
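A runnable sketch of that behavior under hypothetical names: a declared default fills in unfulfilled keys, and when no default is declared, an unfulfilled key raises instead of silently disappearing. Here the subclass's `perform` returns a Hash rather than calling `fulfill`, which is one way to address the "perform must fulfill every key" concern above.

```ruby
class DefaultingLoader
  class << self
    attr_reader :default_value

    # The hypothetical `unfulfilled_default` DSL from the discussion above.
    def unfulfilled_default(value)
      @default_set = true
      @default_value = value
    end

    def default_set?
      defined?(@default_set) ? @default_set : false
    end
  end

  def resolve_batch(keys)
    fulfilled = perform(keys) # subclass returns { key => value }
    keys.each do |key|
      next if fulfilled.key?(key)
      # No silent gaps: missing keys need an explicit default, or we raise.
      raise "Unfulfilled key: #{key.inspect}" unless self.class.default_set?
      fulfilled[key] = self.class.default_value
    end
    fulfilled
  end
end

class ImagesLoader < DefaultingLoader
  unfulfilled_default nil

  def perform(ids)
    # Stand-in for a DB query that only finds images for odd ids:
    ids.select(&:odd?).to_h { |id| [id, ["img-#{id}.png"]] }
  end
end

ImagesLoader.new.resolve_batch([1, 2])
# => { 1 => ["img-1.png"], 2 => nil }
```

The raise-by-default behavior preserves the "important feedback to the application" mentioned above, while the declared default captures the common "no rows means nil/empty" case without boilerplate in every perform.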

@gaffney commented Sep 27, 2020

Hey @rmosolgo thanks for the awesome library and hard work; we are currently using this in production without any issues.

If anyone has other suggestions for a built-in dataloader, please share them!

Our team decided to go with exAspArk/batch-loader due to its generic nature / the fact that it is not tied to GraphQL. We have plenty of external REST API calls we wanted to batch in addition to GraphQL fields.

batch-loader works great for us, but it is a little painful integrating with graphql-ruby, as illustrated by the graphql-ruby example in the README:

To avoid this problem, all we have to do is to change the resolver to return BatchLoader::GraphQL (#32 explains why not just BatchLoader):

I found that there were several issues around this and very long threads in 2018, but unfortunately this is where we landed:

I suggested a few other potentially more flexible solutions for graphql-ruby to detect lazy objects such as duck typing or using explicit arguments. But it looks like it won't be implemented. To fix the issue BatchLoader started wrapping BatchLoader objects with PORO (plain old ruby objects) by using graphql-ruby instrumentation.

Since you are hard at work at a major refactor I was wondering if you could consider revisiting some of the solutions proposed by exAspArk to make for a more seamless integration... or any alternative to avoid BatchLoader::GraphQL.for. The current batching workaround makes testing painful and is inherently inextensible.

@rmosolgo rmosolgo added this to the 1.12.0 milestone Sep 28, 2020
@rmosolgo (Owner, Author)

Thanks for sharing that discussion, @gaffney. Unfortunately, this refactor doesn't touch the underlying behavior where GraphQL-Ruby detects lazy values by calling value.class. Last I checked, Batch-Loader objects implement #class by delegating to the batch-loaded object (instead of returning BatchLoader), so it just doesn't work.

Interestingly, the original suggestion in those issues was to add a lazy: true configuration to field(...). That would be possible now, something like:

class BaseField < GraphQL::Schema::Field 
  # When `lazy: true` is given, add a field extension to wrap the returned value of this field
  def initialize(*args, lazy: false, **kwargs, &block)
    if lazy 
      extensions = kwargs[:extensions] ||= [] 
      extensions << BatchLoaderExtension 
    end 
    super
  end 
end 

# When this extension is added, the field was configured with `lazy: true`, so apply a wrapper so 
# GraphQL-Ruby can identify the lazy object. 
class BatchLoaderExtension < GraphQL::Schema::FieldExtension
  def resolve(object:, arguments:, **_rest)
    # call normal field execution 
    return_value = yield(object, arguments)
    # apply a wrapper and return it, TODO is `.wrap` the correct method here? Not exactly sure. 
    BatchLoader::GraphQL.wrap(return_value)
  end 
end 

Anyways, just a thought after reviewing that code for the first time in a while. Interestingly, a recent Ruby version added Object#then, which I could imagine using as the basis for a duck-typing approach to batching and lazy evaluation. But that's a separate matter from what's in the works here 🍻.
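To make the `#class`-based detection above concrete, here is a dependency-free illustration of the lookup (a simplification of what `lazy_resolve` registers, not GraphQL-Ruby's actual internals): lazy values are mapped to a sync method by their class, which is exactly why a proxy that delegates `#class` to the wrapped object never matches the registry.

```ruby
LAZY_METHODS = {} # stand-in for the schema's lazy_resolve registry

class MyPromise
  def initialize(&block)
    @block = block
  end

  def sync
    @block.call
  end
end
LAZY_METHODS[MyPromise] = :sync

# The executor asks "is this value's class registered as lazy?" and, if so,
# calls the registered method to force it; otherwise the value passes through.
def resolve_lazily(value)
  method_name = LAZY_METHODS[value.class]
  method_name ? value.public_send(method_name) : value
end

resolve_lazily(MyPromise.new { 42 }) # => 42
resolve_lazily("plain value")        # => "plain value"
```

If `MyPromise#class` instead returned the class of its eventual value (as Batch-Loader's proxy does), the `LAZY_METHODS[value.class]` lookup would return nil and the value would be treated as already resolved.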

@rmosolgo rmosolgo changed the base branch from master to 1.12-dev December 22, 2020 21:26
@rmosolgo rmosolgo mentioned this pull request Dec 27, 2020
18 tasks
@rmosolgo (Owner, Author) commented Jan 6, 2021

I ended up going a very different direction on this: #3264

@rmosolgo rmosolgo closed this Jan 6, 2021
@rmosolgo rmosolgo deleted the dataloader branch January 6, 2021 22:15
9 participants