[extension/storage/filestorage] Default fsync to true #31201

Closed
wants to merge 1 commit

Conversation

bouk
Contributor

@bouk bouk commented Feb 13, 2024

Description:

See #31200, setting fsync to false by default is dangerous and will lead to data loss.

Link to tracking Issue: #31200
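For context, the setting under discussion is the `fsync` option of the `file_storage` extension. A minimal collector configuration that opts into the behavior this PR proposes as the default (the directory path here is illustrative):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage  # illustrative path
    fsync: true  # flush and sync the database to disk after each write transaction
```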

Member

@djaglowski djaglowski left a comment


Although the component is in beta, I am in support of this change because it makes the default configuration much safer. However, I would like to hear other opinions as well. cc: @open-telemetry/collector-contrib-approvers

If we do make the change, we'll need to roll it out with a feature gate. (Example)

@andrzej-stencel
Member

andrzej-stencel commented Feb 13, 2024

I'm in support of this change.

If we do make the change, we'll need to roll it out with a feature gate.

I fully agree this is a breaking change, but I'm not convinced that flipping a feature gate is easier/more feasible than changing the configuration value?

@djaglowski
Member

I fully agree this is a breaking change, but I'm not convinced that flipping a feature gate is easier/more feasible than changing the configuration value?

I think the benefit is to give users warning that the default will change, but maybe it's just more hassle than it's worth.

@bouk
Contributor Author

bouk commented Feb 13, 2024

I think we should clearly list it as a breaking change in the changelog, but I don't think we need to feature flag it. It won't break anything, just potentially slow things down slightly.

@swiatekm
Contributor

I'd like to see some benchmarks before we merge this. The storage extension is used by the persistent queue in core, which writes each request to the queue. Enabling or disabling fsync - which in this case only protects against hardware failure and similar events - should be a decision the user makes consciously, as it has potentially major performance implications depending on the storage medium in question.

@bouk
Contributor Author

bouk commented Feb 16, 2024

@swiatekm-sumo you can see some benchmarks on the original file storage PR: #3087

@djaglowski
Member

@swiatekm-sumo you can see some benchmarks on the original file storage PR: #3087

Based on this comment, we observed a 3x difference in performance.

We had a longer discussion about the risks of this setting on that thread as well. Based on rereading, it's not clear to me that this change should happen. Perhaps we should consider a third option which is to add more detailed documentation about the parameter and remove the default value, forcing users to make a decision about it.

I will say that in the absence of a clear consensus I think we need to default to no breaking change.

Member

@djaglowski djaglowski left a comment


I'm blocking this until we establish consensus. I'll call attention to this discussion at next week's SIG meeting.

cc: @open-telemetry/collector-contrib-approvers in case there are additional opinions.

@swiatekm
Contributor

swiatekm commented Feb 16, 2024

Part of the problem is that the value of this setting should likely depend on what the extension is used for. The persistent queue use case is performance-sensitive, but use cases where we just periodically store some relatively small state, like in filelog receiver, would probably be better off with fsync on.

There's also some effort to make the persistent queue less taxing on the storage, so once that lands we could revisit whether enabling fsync remains painful there.

@bouk
Contributor Author

bouk commented Feb 16, 2024

It's a shame that bbolt panics on corruption instead of allowing a graceful way to truncate the database.

@swiatekm
Contributor

We could also consider other storage engines. Our API is very simple and maps readily to any embedded transactional key-value store. I once ran some benchmarks across a variety of them and didn't need to do much work to replace bbolt.

bbolt also has other properties that are undesirable from our perspective, like not dealing with a full disk very gracefully.

@djaglowski
Member

@swiatekm-sumo, I'd be interested to see your findings on other storage engines. I seem to recall writing up a comparison vs other options when first implementing the extension but can't seem to find it now.

@swiatekm
Contributor

@djaglowski I'll see if I can dig it up. Even if I can't, I don't think it'll be difficult to reproduce.

@djaglowski
Member

It was pointed out in SIG today that the performance impact of enabling fsync was originally stated as "3 orders of magnitude", rather than the 3x I stated above.

In any case, we agreed that given the tradeoffs here, we should seriously consider switching away from bbolt. I suggest we should wait until @swiatekm-sumo has a chance to look for his previous analysis. Otherwise, if anyone has potential alternatives in mind, please feel free to call them out.

Contributor

github-actions bot commented Mar 7, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Mar 7, 2024
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Mar 21, 2024