Lazy FileFetcher (#1402) #1411

Closed
wants to merge 2 commits

Conversation

@laurb9 (Contributor) commented Jul 21, 2020

Replaces the eager file fetcher for stored requests and categories with a lazy one.
This allows the cache layer to work and control refresh intervals for stored requests, but not for categories yet.

With this change, the configuration below caches files according to the given settings. Before the change, it would cache data that was already loaded in memory.

```yaml
stored_requests:
  filesystem: true
  directorypath: "./stored_requests/data/by_id"
  in_memory_cache:
    type: lru
    ttl_seconds: 10
    request_cache_size_bytes: 1000
    imp_cache_size_bytes: 5000
```

stale bot commented Jul 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 29, 2020
@SyntaxNode (Contributor) left a comment


Nice approach. It's a lot cleaner. It does change the default behavior, though, and not all hosts may want that. Can you please include an option, either at this level or via a caching layer, to retain the default "load all at once and cache" behavior while adding an option for on-demand loading?

Review thread on stored_requests/backends/file_fetcher/fetcher.go (outdated, resolved)
@stale stale bot removed stale labels Jul 29, 2020
@laurb9 (Contributor, Author) commented Jul 30, 2020

> It does change the default behavior and not all hosts may want that. Can you please include an option either at this level or via a caching layer to retain the default "load all at once and cache" behavior with adding an option for on-demand loading?

I have to give it some thought. The plan was to let the in_memory_cache do the storing, to avoid data duplication in memory. Getting the in_memory_cache layer primed by using something similar to the above filepath.Walk code can work but moves some of this directory handling to a module that should not care about it. Scanning and storing in this module makes this less of an improvement...

@laurb9 (Contributor, Author) commented Jul 30, 2020

I should also mention that prebid-server currently uses 2 to 4 times the memory needed for these files because of this prefetching behavior, which this change indirectly solves as well.

In https://github.com/prebid/prebid-server/blob/master/stored_requests/config/config.go#L119-L123
the amp and auction fetchers are instantiated from the same configuration, and each preloads its own copy of the same data directory. If the categories happen to be stored under the same root, then all three fetchers load all the stored requests, categories, etc.

I've verified this manually by adding a 500MB JSON stored imp and watching the RSS of the process grow by 3x that :)

@SyntaxNode (Contributor) commented Jul 30, 2020

> I have to give it some thought. The plan was to let the in_memory_cache do the storing, to avoid data duplication in memory. Getting the in_memory_cache layer primed by using something similar to the above filepath.Walk code can work but moves some of this directory handling to a module that should not care about it. Scanning and storing in this module makes this less of an improvement...

Sure. The end goal is to keep the same behavior we have today and then let hosts opt in to new / better behavior. What do you think about an option for the file fetcher to fetch everything vs. only the requested ids, and then rely on the caching layer to call fetch just once and store everything in memory? Not sure how happy I am with that suggestion, but I wanted to throw some ideas out.

I'm really happy with the direction this is going in.

> I should also mention prebid-server currently uses 2 to 4 times the memory for these files due to this prefetching behavior, which is indirectly solved by this as well.

I didn't realize it did that. Nice to have that addressed.

Maybe you can add a new cache type that just stores things forever, if we don't already have something like that today. Make that the default for the file fetcher and we should be good to go.

stale bot commented Aug 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@laurb9 (Contributor, Author) commented Aug 12, 2020

I implemented the previous behavior by adding a files event producer to populate
the cache instead of just loading everything locally inside the fetcher.

This makes it consistent with the other modules and able to use the cache. But it
resulted in more code and complexity than I'd like, so I am less convinced of its
worth.

1. The flag lazy_load=false (default) replicates the existing behavior, but it defines
   an unbounded cache if no cache was configured, so hosts are not forced to change their configs.

2. With lazy_load=false and an explicit cache config, the loaded data expires as
   prescribed and is reloaded from the filesystem on demand.

3. With lazy_load=true, the stored data is loaded only when requested. Any given cache
   will work (but none is required, to keep in line with how the other configs work).

CC @bsardo if you think this should wait until your refactor.

Review threads on:
stored_requests/events/files/files.go (outdated, resolved)
stored_requests/events/files/files.go (resolved)
stored_requests/backends/file_fetcher/fetcher.go (outdated, resolved)
stored_requests/events/files/files.go (resolved)
stored_requests/fetcher.go (resolved)
stored_requests/config/config.go (outdated, resolved)
@laurb9 laurb9 force-pushed the lazy-file-fetcher branch 3 times, most recently from 5a90948 to 124a96d, on August 21, 2020 23:13
@laurb9 laurb9 requested a review from bsardo August 21, 2020 23:18
@laurb9 (Contributor, Author) commented Aug 21, 2020

Integrated the requested changes and rebased the branch; please review.
The Travis CI failure appears to be due to a build timeout on one of the builds; the other one passed.

@laurb9 laurb9 force-pushed the lazy-file-fetcher branch from 124a96d to 6121e25 on August 22, 2020 00:33
Review threads on:
config/stored_requests.go (outdated, resolved)
stored_requests/backends/file_fetcher/fetcher.go (outdated, resolved)
stored_requests/config/config.go (outdated, resolved)
stored_requests/events/files/files.go (outdated, resolved)
@mansinahar (Contributor) left a comment


LGTM in general. I've left some minor comments.

Review threads on:
endpoints/openrtb2/amp_auction_test.go (outdated, resolved)
stored_requests/events/files/files.go (outdated, resolved)
@laurb9 laurb9 force-pushed the lazy-file-fetcher branch from ee1c30e to 0c48b0b on August 26, 2020 19:14
@laurb9 (Contributor, Author) commented Aug 26, 2020

Integrated feedback and rebased on latest master.

…s event producer to populate the cache, to be consistent with the other sources. Default behavior enables a static unbounded cache so all the objects are preloaded and do not expire. If an lru cache is explicitly defined, it will be used instead. Setting lazy_load=true will disable preload and operate in lazy-load-backed-by-cache mode. Addressing feedback. Address https://github.com/prebid/prebid-server/pull/1411/files#r476907655. Address https://github.com/prebid/prebid-server/pull/1411/files#r476926516. Fixed bug uncovered by new unit test.
@laurb9 laurb9 force-pushed the lazy-file-fetcher branch from 0c48b0b to 3a851af on August 26, 2020 19:23
@SyntaxNode (Contributor)

I tested the changes with the AMP endpoint.

The default behavior is still a bit different from what we have today with lazy loading disabled. The files do indeed seem to be cached at startup, but if a file isn't cached, the file system will still be checked dynamically. The expected behavior is that, with lazy loading disabled, the file system is never checked. Perhaps a new cache type, in which a cache miss immediately returns and does not hit the store, can solve the problem.

With lazy loading enabled, I verified that each request went out to the file system as expected. We might want to advise hosts of this behavior and encourage some kind of in-memory cache, even if short lived, to avoid hammering their disks.

@laurb9 (Contributor, Author) commented Aug 28, 2020

> if a file isn't cached the file system will still be dynamically checked. The expected behavior is with the lazy loading disabled, the file system is never checked.

Also, with a manually configured cache other than in-memory, the items loaded would expire and have to be refreshed, even if we bulk loaded them at start.

There are ways around it, but the bottom line is that with the prefetch, this change ballooned from using less code than before to using twice as much, and my confidence that it's worth including in the code base is somewhat shaken.

> With lazy loading enabled, I verified that each request went out to the file system as expected. We might want to advise hosts of this behavior and encourage some kind of in-memory cache, even if short lived, to avoid hammering their disks.

Indeed, one is expected to configure a cache as shown in #1411 (comment).
This PR makes the FileFetcher behavior consistent with the HTTP and DB fetchers, in that they can all be fronted by that cache.

In fact, the current configuration allows setting a cache TTL that never takes effect for files, which I'd argue is a little worse.

@SyntaxNode (Contributor)

> Also, with a manually configured cache other than in-memory, the items loaded would expire and have to be refreshed, even if we bulk loaded them at start.

That's a good point.

> There are ways around it, but the bottom line is that with the prefetch, this change ballooned from using less code than before to using twice as much, and my confidence that it's worth including in the code base is somewhat shaken.

Do you need this change made? I have no problem with proceeding. I'd rather we fix it right, and I think the approach this PR is heading in is very good.

@laurb9 (Contributor, Author) commented Aug 31, 2020

Thank you. Appreciate all the review work that has gone into this. I would like to focus on #1426 (account configuration) instead, and put this PR on the back-burner or close it.

Planning to keep the simple lazy fetcher internally for the time being and to start looking into setting up an HTTP API.

@laurb9 laurb9 closed this Sep 1, 2020