
storage disk: Fix badger.ErrTxnTooBig when too many partition/path a… #5937

Conversation

floriangasc
Contributor

storage/disk/txn.go

Why the changes in this PR are needed?

This change avoids badger.ErrTxnTooBig when loading too many partitions.

What are the changes in this PR?

I handle badger.ErrTxnTooBig as described in the official doc: https://dgraph.io/docs/badger/get-started/#transactions
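The pattern from the badger documentation is, roughly: when a write returns badger.ErrTxnTooBig, commit the current transaction, open a fresh one, and retry the write. A minimal sketch of that pattern (writeAll and the updates map are placeholders, not code from this PR; it assumes import badger "github.com/dgraph-io/badger/v3", though the major version OPA pins may differ):

	// Commit-and-renew on ErrTxnTooBig, per the badger "get started" doc.
	func writeAll(db *badger.DB, updates map[string][]byte) error {
		txn := db.NewTransaction(true)
		defer func() { txn.Discard() }() // no-op if already committed
		for k, v := range updates {
			err := txn.Set([]byte(k), v)
			if err == badger.ErrTxnTooBig {
				// Transaction is full: flush it and retry in a new one.
				if err := txn.Commit(); err != nil {
					return err
				}
				txn = db.NewTransaction(true)
				err = txn.Set([]byte(k), v)
			}
			if err != nil {
				return err
			}
		}
		return txn.Commit()
	}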

Notes to assist PR review:

I am not sure if this is the best solution, because of #5721 and because there is also the Truncate method/concept. It is the most naive/simple solution according to the badger doc. If you confirm the solution (implementation) is OK, I will add tests. (I plan to write tests like in the previous issue #3879: generate enough temp data at runtime to prove it works.)

For my use case it happens around every ~30,000 writes. I have a bundle with 160,000 JSON files, one per user. Each file is small (~1 KB).

@floriangasc floriangasc force-pushed the fix/storage-disk-too-big-transaction branch 2 times, most recently from 09ab206 to 2032325 on May 22, 2023 13:05
@srenatus
Contributor

> I am not sure if this is the best solution, because of #5721 and because there is also the Truncate method/concept. It is the most naive/simple solution according to the badger doc. If you confirm the solution (implementation) is OK, I will add tests. (I plan to write tests like in the previous issue #3879: generate enough temp data at runtime to prove it works.)

The core of the Truncate work was that it only happens in one specific, mutex-protected code path, and it side-steps the existing database triggers and all that. It's the only place where we can commit and renew a transaction, because there are no triggers set up for the commit on the new store. After the data has been written to the new store, it replaces the old one.

So I don't think that the change proposed here is the way to go. It's similar to 528836a, which was consciously removed in 516dd47#diff-c91ae2bef36ff9b230ca0dd967684cea2c38dc920b32fe67faf49395579d96d7L170 because of the trigger callback issue.

@floriangasc floriangasc force-pushed the fix/storage-disk-too-big-transaction branch from 2032325 to cee5ef3 on May 23, 2023 15:19
@floriangasc
Contributor Author

floriangasc commented May 23, 2023

Hi @srenatus

Thanks for quick reply.

From what I understand of your comment and the source code:

  • this kind of work (Truncate) can only be done during bundle.Activate().
  • I have tried to update Write() in storage/disk/{disk,txn}.go, but it seems to have large side effects (per your last comment).

The simplest solution I have found:
Add a WriteTruncate function that does similar work to Truncate, plus a small change to txn.go, because WriteTruncate should/must handle the loop over partition updates, committing and recreating the transaction when ErrTxnTooBig occurs (like Truncate does).
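As a rough illustration only (writeTruncate, the kv type, and the slice of updates are hypothetical names, not the actual storage/disk code), the loop described above would look something like this, handing the possibly-renewed transaction back to the caller (same badger import assumption as in the sketch above):

	type kv struct{ key, value []byte } // stand-in for one partition update

	// Hypothetical sketch: write all updates, renewing the badger transaction
	// whenever it grows too big, and return the live transaction so the
	// caller can commit it together with any remaining work.
	func writeTruncate(db *badger.DB, updates []kv) (*badger.Txn, error) {
		txn := db.NewTransaction(true)
		for _, u := range updates {
			err := txn.Set(u.key, u.value)
			if err == badger.ErrTxnTooBig {
				// Flush what we have and keep going in a new transaction.
				if err := txn.Commit(); err != nil {
					return nil, err
				}
				txn = db.NewTransaction(true)
				err = txn.Set(u.key, u.value)
			}
			if err != nil {
				txn.Discard()
				return nil, err
			}
		}
		return txn, nil
	}

Whether the function should hand back the renewed transaction is exactly the open question in point 3 below; and as noted above, the commits done inside such a loop would side-step the triggers registered on the original transaction.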

If that solution is better than the previous one and gets validated:

  1. I have broken some tests, which I don't understand yet; I will try to figure out what I broke.
  2. The code is still rough, so I will clean it up and add tests for the new WriteTruncate.
  3. I have some doubts about bundle.writeDataAndModules because the txn is lost there; should WriteTruncate return the new transaction?

I have tested this solution with my bundle (160,000 partitions) and it works.

@srenatus
Contributor

I'm still not sure that this is the right approach...I think you might be fighting against a fundamental limitation of how Badger works in OPA. Have you tried other things, notably fiddling with the partitions to reduce the txn size?

@floriangasc
Contributor Author

floriangasc commented May 24, 2023

Hi @srenatus

I see what you mean. Thank you for your response, it helped me a lot. I have given up on that error and continued my investigation. I have 3 questions.

I discovered that when lazy mode is enabled, the error no longer occurs. Lazy mode is enabled (only?) when the bundle is downloaded! That is very good news (because we use long polling in production!).
I have 2 questions: (1) Is there a reason lazy mode is available only for downloads? (2) Why is it not configurable (via the CLI or the config file)?

I also discovered that when lazy mode is enabled, the RAM and CPU usage with the disk store becomes acceptable for sidecar usage, except during bundle activation.
I need to call runtime.GC() to avoid a spike of memory allocation inside the loop in the Truncate method. Running the GC every 100,000 iterations is OK for my use case; it limits RAM to 768 MB (at the very most) instead of 1,512 MB (or more). The most naive solution I have tried to limit memory is:

	for {
		// Force a GC cycle every 100_000 iterations to cap the transient
		// heap growth while iterating over all partition updates.
		if (i % 100_000) == 0 {
			runtime.GC()
		}
		i++

		var update *storage.Update

		update, err = it.Next()
		// ... the rest of the Truncate loop handles `update` as before ...

(3) I don't have very good knowledge of the Go GC, but is there a possibility to fix that in a more elegant way? Otherwise the container gets killed when the bundle is activated (at bootstrap or on a bundle update).
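For reference, the periodic-GC idea above in a self-contained form (the allocation in the loop body is only a stand-in for one iteration of the activation work, and runtime.ReadMemStats is there purely to observe the effect):

	package main

	import (
		"fmt"
		"runtime"
	)

	func main() {
		var m runtime.MemStats
		for i := 0; i < 1_000_000; i++ {
			// Stand-in for one unit of work (e.g. decoding one update
			// and writing it to the store).
			_ = make([]byte, 1024)

			// Force a collection every 100_000 iterations to cap the
			// transient heap growth, as described above.
			if i%100_000 == 0 {
				runtime.GC()
				runtime.ReadMemStats(&m)
				fmt.Printf("i=%d heap=%d MiB\n", i, m.HeapAlloc/(1<<20))
			}
		}
	}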

@ashutosh-narkar
Member

@floriangasc let me try to answer some of your queries

> I have 2 questions: (1) Is there a reason lazy mode is available only for downloads? (2) Why is it not configurable (via the CLI or the config file)?

The purpose of lazy mode is to avoid deserializing data, since that can be memory-consuming, and especially when disk storage is enabled you don't want memory to become a bottleneck. We started applying this first when OPA downloads bundles as it seemed like a good start, and we can use it in more places in the future where it makes sense to do so. Exposing this via config could also be considered if there is a valid use case for it.

On (3), it's possible that this was a temporary spike before GC happened, but it depends on your environment, data/policy, etc. So you would have to experiment a little and allocate resources accordingly. If you suspect there is a leak, feel free to open an issue and provide steps to repro.

I'm closing this PR as your use-case seems to be addressed and as Stephan has previously explained these changes don't seem necessary. Also if you have more questions, please use OPA Slack or start a GitHub Discussion. Thanks!

@floriangasc
Contributor Author

@ashutosh-narkar

Thanks a lot for your work and for taking the time to answer my questions. I will continue to investigate.
