This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

forky: Fixed Chunk Data Size Store #2017

Open · wants to merge 25 commits into master
Conversation

@janos (Member) commented Dec 5, 2019

This PR changes the chunk data disk persistence structure. It is based on a previous experiment, https://github.com/janos/forky, where multiple approaches were tried out. The most performant one was chosen and added to swarm as the storage/fcds package.

This PR includes additional changes required for the new storage to be used and, optionally, manually migrated.

FCDS is integrated into storage/localstore without breaking changes to the package API. That involves:

  • removing usage of retrievalDataIndex, except to provide export functionality for the older db schema
  • replacing retrievalAccessIndex with a metaIndex that contains the storage timestamp, access timestamp and bin id (sketched below)
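
For illustration, the kind of per-chunk record such a metaIndex could hold might look like the following Go sketch; the field names are assumptions for readability, not the PR's actual struct or encoding.

// Sketch only: illustrative field names, not the PR's actual metaIndex encoding.
type metaItem struct {
	StoreTimestamp  int64  // when the chunk was stored locally
	AccessTimestamp int64  // when the chunk was last accessed (used for garbage collection ordering)
	BinID           uint64 // per-bin sequence number used by syncing
}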

At the roundtable discussion, it was decided to have an optional manual data migration. To achieve this,
instructions with steps are printed when the new swarm version is started with an older data format. LocalStore
migrations are adjusted to support this functionality. Some issues in the getMigrations function were discovered;
they are fixed in this PR and covered by additional tests. Schema name variables are now unexported,
and names from the legacy ldbstore are removed from migrations as they no longer need to be there.

In order for the migration to be complete, the localstore import and export functions needed to handle pinning
information. This functionality is added.

Measurements

Measurements were performed locally on a 4-core MacBook Pro (mid 2014).
Every run starts from a clean swarm data directory and uploads files with random data.

Local test1 - 1GB size - 5x speedup

time ./swarm.master up test1
c4e105675ab10acc907ac4e966aa0359eb0accc5fe5bd03dc15d8161e5fa1dda
./swarm.master up test1  0.98s user 1.85s system 1% cpu 3:52.39 total

time ./swarm.fcds up test1
c4e105675ab10acc907ac4e966aa0359eb0accc5fe5bd03dc15d8161e5fa1dda
./swarm.forky up test1  1.01s user 1.94s system 6% cpu 46.629 total

Local test4 - 4GB size - 6.5x speedup

time ./swarm.master up test4
b2c1bae070933e5c46d1f839340e2ea33c77469e1c8210691cbe0ed79b211506
./swarm.master up test4  3.91s user 7.37s system 0% cpu 26:27.17 total

time ./swarm.fcds up test4
b2c1bae070933e5c46d1f839340e2ea33c77469e1c8210691cbe0ed79b211506
./swarm.forky up test4  4.30s user 8.50s system 5% cpu 4:06.79 total

Smoke tests on cluster

Smoke tests were run multiple times for validation and to measure performance. However, the performance
gain depends on the number of CPUs the ec2 node has and on how many swarm nodes are running on the same ec2 node.

These are the results from running one swarm node on a 2-core c5.large: https://snapshot.raintank.io/dashboard/snapshot/dD0JQruCpHpOjvKqwR6YN3ay3GmONLyr.
If there are two swarm processes on the same node, upload speed is about half. It is noticeable that
garbage collection influences performance, and this is an area where further adjustments can be made.

@janos janos self-assigned this Dec 5, 2019
@@ -168,10 +167,6 @@ func dbImport(ctx *cli.Context) {
}

func openLDBStore(path string, basekey []byte) (*localstore.DB, error) {
if _, err := os.Stat(filepath.Join(path, "CURRENT")); err != nil {
@janos (Member Author):

This check is removed in order to allow dbImport to create the localstore directory if it does not exist, to simplify migration steps (removing the need to start a swarm node between export and import, as was required in the v0.5.0 release).

storage/fcds/fcds.go — 4 resolved review threads (outdated)
shards map[uint8]*os.File // relations with shard id and a shard file
shardsMu map[uint8]*sync.Mutex // mutex for every shard file
meta MetaStore // stores chunk offsets
free map[uint8]struct{} // which shards have free offsets
Member:

i would also just have a shardCount length bool array here

storage/fcds/meta.go — resolved review thread

// offsetCache is a simple cache of offset integers
// by shard files.
type offsetCache struct {
Member:

i am a bit lost here. what do we need this for?

@janos (Member Author):

This is the caching of free offsets in Store. It reduces the number of calls to MetaStore.FreeOffset in Store.Put to avoid disk i/o.
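
For readers following the thread, a minimal sketch of such a cache, assuming a mutex-guarded map of free offsets per shard (the PR's actual offsetCache may differ):

// offsetCache keeps freed file offsets per shard in memory so that
// Store.Put can reuse space without calling MetaStore.FreeOffset
// (and touching disk) on every put. Sketch only; assumes "sync" is imported.
type offsetCache struct {
	mu     sync.Mutex
	shards map[uint8]map[int64]struct{} // shard id -> set of free offsets
}

func newOffsetCache(shardCount uint8) *offsetCache {
	return &offsetCache{
		shards: make(map[uint8]map[int64]struct{}, int(shardCount)),
	}
}

// get returns a cached free offset for the shard, or -1 if none is cached.
func (c *offsetCache) get(shard uint8) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	for o := range c.shards[shard] {
		delete(c.shards[shard], o)
		return o
	}
	return -1
}

// set records a freed offset so a later put can reclaim it.
func (c *offsetCache) set(shard uint8, offset int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.shards[shard] == nil {
		c.shards[shard] = make(map[int64]struct{})
	}
	c.shards[shard][offset] = struct{}{}
}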

Resolved review threads: storage/localstore/gc.go (outdated), storage/localstore/localstore_test.go (2)
func (s *Store) getOffset(shard uint8) (offset int64, reclaimed bool, err error) {
if !s.shardHasFreeOffsets(shard) {
// shard does not have free offset
return -1, false, err
Member:

err -> nil preferred

@janos (Member Author):

Thanks, yes of course, my mistake.

@janos (Member Author) commented Dec 13, 2019

I am confirming that smoke tests pass with the current state of this PR: https://snapshot.raintank.io/dashboard/snapshot/BDyvSKICIk5a0AbAjXxBeJfgvEvvpiqP. These results are with a deployment of 3 swarm pods per c5.large ec2 node. The first two failures are because the smoke test job started before all swarm nodes in the cluster were up.

@janos janos requested a review from nolash December 16, 2019 15:19
@pradovic (Contributor) left a comment:

LGTM, very nice 👏 I left a few minor comments, mostly cosmetic ones.

// If offset is less than 0, no free offsets are available.
func (s *Store) getOffset(shard uint8) (offset int64, reclaimed bool, err error) {
if !s.shardHasFreeOffsets(shard) {
// shard does not have free offset
Contributor:

I would maybe reduce comments in this function, as the function comment itself explains the behavior and the code with private function names is 100% self-explanatory. So I think having only the code makes it even more readable, imo.

}()
select {
case <-done:
case <-time.After(15 * time.Second):
Contributor:

Maybe a debug log here?

default:
}
s.wg.Add(1)
return s.wg.Done, nil
Contributor:

Why not have a private Store.done() function instead of returning the done function?

@janos (Member Author):

The main reason is to make it more obvious that done must be called in order to finish the protection, and that it exists only in relation to the protect function. But for this small codebase, I think that it is not so significant. Is it ok to call the new method unprotect instead of done?

Contributor:

Both make sense to me; I agree with your point as well, so go for what feels better for you. If you decide to make it a private function, unprotect sounds good to me.

@janos (Member Author):

The usage is nicer with a private method, with one less line. Thanks.
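
For context, the settled-on pattern could look roughly like the sketch below; the s.quit and s.wg fields and the error value are assumptions, not the PR's exact code.

// protect refuses new work once the store is closing and otherwise
// registers the caller on the wait group that Close waits on. Sketch only.
func (s *Store) protect() error {
	select {
	case <-s.quit: // assumed channel closed by Close
		return errors.New("store closed") // assumed sentinel error
	default:
	}
	s.wg.Add(1) // assumed sync.WaitGroup field
	return nil
}

// unprotect marks the operation started by protect as done,
// replacing the previously returned done callback.
func (s *Store) unprotect() {
	s.wg.Done()
}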

func (s *MetaStore) Get(addr chunk.Address) (m *fcds.Meta, err error) {
s.mu.RLock()
m = s.meta[string(addr)]
s.mu.RUnlock()
@pradovic (Contributor), Dec 18, 2019:

defer? Not needed functionally, just a subjective cosmetic suggestion :)

@janos (Member Author):

Defer comes with a cost, and I wanted to have as good performance as possible for measurements. In general I would agree, but here defer contributes a not insignificant relative cost.

Contributor:

Makes sense 👍
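
For readers skimming the thread, the trade-off is between the explicit unlock used above and a defer-based variant; a sketch with hypothetical method names (getExplicit, getDeferred), not code from the PR:

// Explicit unlock, as in the snippet above: the critical section stays minimal
// and the small per-call overhead of defer is avoided on this hot read path.
func (s *MetaStore) getExplicit(addr chunk.Address) *fcds.Meta {
	s.mu.RLock()
	m := s.meta[string(addr)]
	s.mu.RUnlock()
	return m
}

// Defer-based variant: slightly more robust against early returns,
// at a small per-call cost that can matter in benchmarked hot paths.
func (s *MetaStore) getDeferred(addr chunk.Address) *fcds.Meta {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.meta[string(addr)]
}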

// Reclaimed flag denotes that the chunk is at the place of
// already deleted chunk, not appended to the end of the file.
func (s *MetaStore) Set(addr chunk.Address, shard uint8, reclaimed bool, m *fcds.Meta) (err error) {
s.mu.Lock()
Contributor:

defer unlock?

@janos (Member Author):

Defer comes with a cost, and I wanted to have as good performance as possible for measurements.


// offsetCache is a simple cache of offset integers
// by shard files.
type offsetCache struct {
Contributor:

Would expiry be helpful to guard against memory "leak"? Is there a situation where the cache items are not removed for a long time?

@janos (Member Author):

Good point. In the context of offsetCache, this is certainly a possibility, and would be a good improvement. In the wider context, removal depends on MetaStore.FreeOffset function, which is injected and cannot be trusted for any implementation. I will add basic TTL.

@janos (Member Author):

I have added a basic TTL and cleanup, but I am not sure that it is actually needed, while it adds a bit more complexity and locking.

There is a case where offsetCache may hold offsets for longer periods when there is no activity (uploads, retrievals or syncing). In that case expiration reduces the optimisation that offsetCache is here for, so I would not add it.

Maybe @zelig would like to share an opinion on this subject as well?

@pradovic (Contributor), Dec 18, 2019:

I see, makes sense. I do not think that a small TTL is needed; maybe a 24h one or something similar, just to prevent the cache growing forever, but I am not sure it is needed at all. Do we expect to have this same cache for the whole time the app is running? Sorry if it's a stupid question 🙈 If yes, and if the app is supposed to be running for long periods of time (like multiple days), then it might be good to clean it up at some point.

@janos (Member Author):

I have reverted the TTL feature after a brief discussion. We have concluded that if there were a cache leak, a much bigger problem would have caused it, so the additional complexity is most likely not needed.

@santicomp2014 (Contributor) left a comment:

This is a very good improvement.
I want to test it in a test cluster that we set up.
LGTM

@janos (Member Author) commented Dec 19, 2019

Thanks, @santicomp2014. I have updated this branch with the current master and pushed a docker image to janos/swarm:fcds for testing.

@acud (Member) left a comment:

great work on this PR @janos!

a few comments from my side:

  • different chunk sizes - will we need different shards for different chunk sizes? how much effort are we talking in migrating this implementation to support different chunk size? i think that adding the infrastructure to this right now would save us some trouble down the road
  • shard count may change for various reasons and i think we should not tie getShard function output to this number
  • max chunk size is not enforced and gracefully truncates data
  • if we want to accommodate for a hypothetical future feature that reconstructs localstore from the actual data on the shard files (something which is a low hanging fruit with this kind of a design, and allows us to reconstruct the db from an inconsistent state) but also accommodate variable chunk size - we cannot since the shards do not encode chunk size data. this is just a sidemark, i'm sure you are well aware of this
  • traditionally i am not so happy about manual migrations but i'm gonna put a sock in it

in any case i would prefer to have this in the next release just to give us some leeway to play around with this some more

<3

// Interface specifies methods required for FCDS implementation.
// It can be used where alternative implementations are needed to
// switch at runtime.
type Interface interface {
Member:

do we really want to call an interface Interface? maybe Storer?

@janos (Member Author):

I took the inspiration from sort.Interface, but Storer is a good name. Will change, thanks.
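
For orientation, a Storer interface along these lines might look like the sketch below; the method set and signatures are assumptions drawn from this conversation, not copied from storage/fcds.

// Sketch only; the actual interface in storage/fcds may differ.
type Storer interface {
	Get(addr chunk.Address) (chunk.Chunk, error)
	Has(addr chunk.Address) (bool, error)
	Put(ch chunk.Chunk) error
	Delete(addr chunk.Address) error
	Count() (int, error)
	Iterate(fn func(ch chunk.Chunk) (stop bool, err error)) error
	Close() error
}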

storage/fcds/fcds.go — 4 resolved review threads (2 outdated)
case <-ctx.Done():
}
}
continue
Member:

this means we continue although there were errors, right?

@janos (Member Author):

Actually, this behaviour should be fixed in the whole goroutine where the error is not nil. If there is an error, the goroutine should return. I will fix that.

Resolved review threads: storage/localstore/migration.go (2, outdated), storage/localstore/gc.go, storage/fcds/fcds.go
@janos (Member Author) left a comment:

Thank you @acud for the review, it is very helpful to see where the implementation is weak.

  • different chunk sizes - will we need different shards for different chunk sizes? how much effort are we talking in migrating this implementation to support different chunk size? i think that adding the infrastructure to this right now would save us some trouble down the road

I am not sure that there is a decision to have variable chunk sizes. This would highly influence this and probably other components that deal with chunks.

  • shard count may change for various reasons and i think we should not tie getShard function output to this number

I am open to suggestions.
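
To make the concern concrete, a typical mapping ties shard selection directly to the configured shard count, roughly as in the hedged sketch below (not the PR's actual getShard); with such a scheme, changing ShardCount remaps addresses to different shards, so existing chunks would no longer be found where expected.

// Hedged sketch of an address-to-shard mapping; the PR's actual code may differ.
const ShardCount = 32 // assumed configuration value

func getShard(addr chunk.Address) uint8 {
	// Taking a byte of the address modulo the shard count spreads chunks
	// evenly, but couples on-disk placement to the ShardCount constant.
	return uint8(addr[len(addr)-1]) % ShardCount
}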

  • max chunk size is not enforced and gracefully truncates data

This is a great find. Thanks.

  • if we want to accommodate for a hypothetical future feature that reconstructs localstore from the actual data on the shard files (something which is a low hanging fruit with this kind of a design, and allows us to reconstruct the db from an inconsistent state) but also accommodate variable chunk size - we cannot since the shards do not encode chunk size data. this is just a sidemark, i'm sure you are well aware of this

Yes, and the name of the store reflects the assumption that there is a fixed maximal chunk data size, constant in time.

  • traditionally i am not so happy about manual migrations but i'm gonna put a sock in it

This is just as we did a year ago when we introduced the new localstore (I think that you implemented the migration), but without an external link to the migration steps.

in any case i would prefer to have this in the next release just to give us some leeway to play around with this some more

I agree.

@janos janos requested a review from acud January 17, 2020 12:01
@acud (Member) left a comment:

LGTM @janos. Thanks for addressing all of my comments

@acud (Member) commented Jan 20, 2020

also please do not merge until the next release is out

@janos (Member Author) commented Jan 20, 2020

@acud could you reject this PR until the next release is out, to block merging that way? :)

}

for _, sh := range s.shards {
if err := sh.f.Close(); err != nil {
@jmozah (Collaborator), Mar 8, 2020:

Keeping the fd open for a long duration without mmap is a disaster in the making. I am not sure if Go's file flush() does an OS-level fsync(). If it does, then we should flush() at regular intervals to save the contents against a crash.

@janos (Member Author):

Nice observation. Go's os.File.Sync() flushes the content to the disk; I am not sure which flush() you are referring to.

Flushing on (regular) intervals is what the operating system is already doing. I am not sure how this would help against a crash unless we fsync on every write, which is quite costly. I have already tested fsync on every write and it makes fcds much slower, even compared to go-leveldb, as go-leveldb does not fsync at all.

Mmap brings its own complexity, especially on different operating systems.

@jmozah (Collaborator):

Thanks for the reply.

Go's os.File.Sync() flushes the content to the disk; I am not sure which flush() you are referring to.

Apologies if I confused you. There are two things:

  1. flush(), which flushes the application write buffers to the OS.
  2. fsync(), the OS-level sync which absolutely makes sure that the buffers have gone to disk.

The first one is done by golang itself, as pointed out by you. The second one is the one I am concerned about.

Flushing on (regular) intervals is what the operating system is already doing.

It is usually done by OS disk drivers whenever they feel it is okay. fsync() is expensive, as you pointed out. To avoid fsyncing on every commit, DBs usually implement WALs which fsync at very small regular intervals in the background, so that even if some data is lost, it is lost only for a very small duration. If you fsync in the foreground (query path) every time, that will be very expensive.

even compared to go-leveldb, as go-leveldb does not fsync at all.

leveldb takes another strategy by doing a larger mmap file than required and writing directly using memcopy; the mmap driver then takes care of writing the dirty pages to the disk.

All I am saying is, one way or another, if we want to survive a crash and not end up with corrupted files on bootup, we have to implement this; otherwise I am sure we can expect some gibberish files when you switch off power abruptly.

Mmap brings its own complexity, especially on different operating systems.

Yes. We should not reinvent the wheel.
As a remedy, we have two options:

  1. Use an already existing DB which takes care of this (we can talk about badger if you like).
  2. In Forky, do a File.Sync() every X seconds (where we can tolerate X seconds of data loss) in a background thread for all files (sketched below).

I found this interesting read on the topic:
https://www.joeshaw.org/dont-defer-close-on-writable-files/
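
For illustration, option 2 could look roughly like the following sketch; the file slice, quit channel and interval are assumptions, not code from this PR.

// syncEvery fsyncs all shard files at a fixed interval until quit is closed,
// bounding potential data loss after a crash to roughly one interval.
// Sketch only; assumes "os" and "time" are imported.
func syncEvery(files []*os.File, interval time.Duration, quit chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, f := range files {
				if err := f.Sync(); err != nil {
					// error handling/logging omitted in this sketch
					_ = err
				}
			}
		case <-quit:
			return
		}
	}
}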

@janos (Member Author):

Thanks for the explanations.

Use an already existing DB which takes care of this (we can talk about badger if you like).

I have already tried badger and it is slower than go-leveldb. Maybe you can find a way to do it more efficiently. And still, it does not scale with more CPU cores.

In Forky, do a File.Sync() every X seconds (where we can tolerate X seconds of data loss) in a background thread for all files.

I am not sure this would actually help us, as the OS is already doing that at some frequency. Also, as you pointed out, the solution would be to implement a WAL, which in some basic form I already did, with, of course, some performance penalties.

As far as I can see, go-leveldb is not using mmap.

I see your point as very valid, but I think that your suggestions should be tested, even the ones that I already tried, to revalidate. We can talk about possibilities, but it would be good to actually test resilience for the weak points that you described.

As this is a very important part of swarm, I am more in favour of not merging this PR until we are clear that it is reliable. Currently, I see that it creates more problems than it brings benefits.

@jmozah (Collaborator), Mar 9, 2020:

I did a few changes in the badger configuration and ran the TestLevelDBForky and TestBadgerSuite test cases. These are the results.
[Screenshot of benchmark results, 2020-03-10, omitted]

The iteration in Badger is somewhat slow, so I commented out the iteration and ran these test cases, with the assumption that write/read/delete is more important and not iteration.

There are more write improvements I can make in badger, like batching the writes and so on, but for now I will leave it like this.

@janos (Member Author):

Thanks, but I cannot say anything from the screenshots except to compare timings and conclude that TestBadgerSuite writes are slower. It would be very helpful to actually see the changes and what TestBadgerSuite is doing.

@jmozah (Collaborator), Mar 10, 2020:

Look at the last 2 commits in my fork: https://github.com/jmozah/forky/commits/master
The only one that matters is SyncWrites = false, which makes Forky and badger apples to apples in my opinion. The other configs are more experimental.

@janos (Member Author) commented Mar 10, 2020

Based on recent discussions and the decision to go for a more reliable and in general better approach of using Badger for chunk data storage, I am closing this PR. Thank you all for investing time to improve, test and review this PR, and I am sorry for a long attempt that was identified so late as not the best approach to improve storage performance. At least the high-level design, localstore integration and migration parts could be reused.

@janos janos closed this Mar 10, 2020
@jmozah (Collaborator) commented Mar 11, 2020

Please don't delete the branch. This should be used to test performance with badger.

@janos (Member Author) commented Mar 11, 2020

@jmozah Of course. I deliberately did not delete the branch.

@tfalencar commented:

Based on recent discussions and the decision to go for a more reliable and in general better approach of using Badger for chunk data storage, I am closing this PR. Thank you all for investing time to improve, test and review this PR, and I am sorry for a long attempt that was identified so late as not the best approach to improve storage performance. At least the high-level design, localstore integration and migration parts could be reused.

Hi, could you please just summarize why the approach was abandoned? Not everyone is able to join these discussions; still, it would be nice to have it documented here.

@acud acud reopened this Mar 24, 2020
@acud acud changed the title Fixed Chunk Data Size Store forky: Fixed Chunk Data Size Store Mar 24, 2020
7 participants