
storage disk: Fix badger.ErrTxnTooBig when too many partition/path a… #5937

Conversation

floriangasc
Contributor

storage/disk/txn.go

Why the changes in this PR are needed?

This change avoids badger.ErrTxnTooBig when loading too many partitions.

What are the changes in this PR?

I handle badger.ErrTxnTooBig as described in the official doc: https://dgraph.io/docs/badger/get-started/#transactions
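The pattern from the badger documentation is, roughly: when a write returns badger.ErrTxnTooBig, commit the current transaction, open a fresh one, and retry the write. A minimal sketch of that pattern (writeAll and the updates map are placeholders, not code from this PR; it assumes import badger "github.com/dgraph-io/badger/v3", though the major version OPA pins may differ):

	// Commit-and-renew on ErrTxnTooBig, per the badger "get started" doc.
	func writeAll(db *badger.DB, updates map[string][]byte) error {
		txn := db.NewTransaction(true)
		defer func() { txn.Discard() }() // no-op if already committed
		for k, v := range updates {
			err := txn.Set([]byte(k), v)
			if err == badger.ErrTxnTooBig {
				// Transaction is full: flush it and retry in a new one.
				if err := txn.Commit(); err != nil {
					return err
				}
				txn = db.NewTransaction(true)
				err = txn.Set([]byte(k), v)
			}
			if err != nil {
				return err
			}
		}
		return txn.Commit()
	}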

Notes to assist PR review:

I am not sure if this is the best solution, because of #5721 and because there is also the Truncate method/concept. It is the most naive/simple solution according to the badger doc. If you confirm the solution (implementation) is OK, I will add tests. (I plan to write tests like in the previous issue #3879: generate enough temp data at runtime to prove it works.)

For my use case it happens around every ~30,000 writes. I have a bundle with 160,000 JSON files, one per user. Each file is small (~1 KB).

@floriangasc floriangasc force-pushed the fix/storage-disk-too-big-transaction branch 2 times, most recently from 09ab206 to 2032325 on May 22, 2023 13:05
@srenatus
Contributor

> I am not sure if this is the best solution, because of #5721 and because there is also the Truncate method/concept. It is the most naive/simple solution according to the badger doc. If you confirm the solution (implementation) is OK, I will add tests. (I plan to write tests like in the previous issue #3879: generate enough temp data at runtime to prove it works.)

The core of the Truncate work was that it only happens in one specific, mutex-protected code path, and it side-steps the existing database triggers and all that. It's the only place where we can commit and renew a transaction, because there are no triggers set up for the commit on the new store. After the data has been written to the new store, it replaces the old one.

So I don't think that the change proposed here is the way to go. It's similar to 528836a, which was consciously removed in 516dd47#diff-c91ae2bef36ff9b230ca0dd967684cea2c38dc920b32fe67faf49395579d96d7L170 because of the trigger callback issue.

@floriangasc floriangasc force-pushed the fix/storage-disk-too-big-transaction branch from 2032325 to cee5ef3 on May 23, 2023 15:19
@floriangasc
Contributor Author

floriangasc commented May 23, 2023

Hi @srenatus

Thanks for quick reply.

From what I understand of your comment and the source code:

  • this kind of work (Truncate) can only be done during bundle.Activate().
  • I have tried to update Write() in storage/disk/{disk,txn}.go, but it seems to have large side effects (per your last comment).

The simplest solution I have found:
Add a WriteTruncate function that does similar work to Truncate, plus a small change to txn.go, because WriteTruncate should/must handle the loop over partition updates, committing and recreating the transaction when ErrTxnTooBig occurs (like Truncate does).
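As a rough illustration only (writeTruncate, the kv type, and the slice of updates are hypothetical names, not the actual storage/disk code), the loop described above would look something like this, handing the possibly-renewed transaction back to the caller (same badger import assumption as in the sketch above):

	type kv struct{ key, value []byte } // stand-in for one partition update

	// Hypothetical sketch: write all updates, renewing the badger transaction
	// whenever it grows too big, and return the live transaction so the
	// caller can commit it together with any remaining work.
	func writeTruncate(db *badger.DB, updates []kv) (*badger.Txn, error) {
		txn := db.NewTransaction(true)
		for _, u := range updates {
			err := txn.Set(u.key, u.value)
			if err == badger.ErrTxnTooBig {
				// Flush what we have and keep going in a new transaction.
				if err := txn.Commit(); err != nil {
					return nil, err
				}
				txn = db.NewTransaction(true)
				err = txn.Set(u.key, u.value)
			}
			if err != nil {
				txn.Discard()
				return nil, err
			}
		}
		return txn, nil
	}

Whether the function should hand back the renewed transaction is exactly the open question in point 3 below; and as noted above, the commits done inside such a loop would side-step the triggers registered on the original transaction.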

If that solution is better than the previous one and gets validated:

  1. I have broken some tests, which I don't understand yet; I will try to figure out what I broke.
  2. The code is still rough, so I will clean it up and add tests for the new WriteTruncate.
  3. I have some doubts about bundle.writeDataAndModules because the txn is lost there; should WriteTruncate return the new transaction?

I have tested this solution with my bundle (160,000 partitions) and it works.

@srenatus
Contributor

I'm still not sure that this is the right approach...I think you might be fighting against a fundamental limitation of how Badger works in OPA. Have you tried other things, notably fiddling with the partitions to reduce the txn size?

@floriangasc
Contributor Author

floriangasc commented May 24, 2023

Hi @srenatus

I see what you mean. Thank you for your response, it helped me a lot. I have given up on that error and continued my investigation. I have 3 questions.

I discovered that when lazy mode is enabled, the error no longer occurs. Lazy mode is enabled (only?) when the bundle is downloaded! That is very good news (because we use long polling in production!).
I have 2 questions: (1) Is there a reason lazy mode is available only for downloads? (2) Why is it not configurable (via the CLI or the config file)?

I also discovered that when lazy mode is enabled, the RAM and CPU usage with the disk store becomes acceptable for sidecar usage, except during bundle activation.
I need to call runtime.GC() to avoid a spike of memory allocation inside the loop in the Truncate method. Running the GC every 100,000 iterations is OK for my use case; it limits RAM to 768 MB (at the very most) instead of 1,512 MB (or more). The most naive solution I have tried to limit memory is:

	for {
		// Force a GC cycle every 100_000 iterations to cap the transient
		// heap growth while iterating over all partition updates.
		if (i % 100_000) == 0 {
			runtime.GC()
		}
		i++

		var update *storage.Update

		update, err = it.Next()
		// ... the rest of the Truncate loop handles `update` as before ...

(3) I don't have very good knowledge of the Go GC, but is there a possibility to fix that in a more elegant way? Otherwise the container gets killed when the bundle is activated (at bootstrap or on a bundle update).
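For reference, the periodic-GC idea above in a self-contained form (the allocation in the loop body is only a stand-in for one iteration of the activation work, and runtime.ReadMemStats is there purely to observe the effect):

	package main

	import (
		"fmt"
		"runtime"
	)

	func main() {
		var m runtime.MemStats
		for i := 0; i < 1_000_000; i++ {
			// Stand-in for one unit of work (e.g. decoding one update
			// and writing it to the store).
			_ = make([]byte, 1024)

			// Force a collection every 100_000 iterations to cap the
			// transient heap growth, as described above.
			if i%100_000 == 0 {
				runtime.GC()
				runtime.ReadMemStats(&m)
				fmt.Printf("i=%d heap=%d MiB\n", i, m.HeapAlloc/(1<<20))
			}
		}
	}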

@ashutosh-narkar
Member

@floriangasc let me try to answer some of your queries

> I have 2 questions: (1) Is there a reason lazy mode is available only for downloads? (2) Why is it not configurable (via the CLI or the config file)?

The purpose of lazy mode is to avoid deserializing data, since that can be memory-consuming, and especially when disk storage is enabled you don't want memory to become a bottleneck. We started applying this first when OPA downloads bundles as it seemed like a good start, and we can use it in more places in the future where it makes sense to do so. Exposing this via config could also be considered if there is a valid use case for it.

On (3), it's possible that this was a temporary spike before GC happened, but it depends on your environment, data/policy, etc. So you would have to experiment a little and allocate resources accordingly. If you suspect there is a leak, feel free to open an issue and provide steps to repro.

I'm closing this PR as your use-case seems to be addressed and as Stephan has previously explained these changes don't seem necessary. Also if you have more questions, please use OPA Slack or start a GitHub Discussion. Thanks!

@floriangasc
Contributor Author

@ashutosh-narkar

Thanks a lot for your work and for taking the time to answer my questions. I will continue to investigate.
