
roadmap: Blob storage #243

Closed
justinsb opened this issue Dec 21, 2014 · 20 comments
Labels
A-tools-hibernate Issues that pertain to Hibernate integration. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-wishlist A wishlist feature. O-community Originated from the community

Comments

@justinsb

justinsb commented Dec 21, 2014

First off, I think this is awesome. I did a series of blog posts about a year ago (http://blog.justinsb.com/blog/categories/cloudata/) that posited the existence of a reliable key-value store and a blob store, and built all sorts of cool things on top of it. For example, a git store would store only the refs in consistent storage, but would store the data itself in the blob store. Similarly a filesystem: metadata in the key-value store; file data in blob storage. It might be fun for me to update that blog series using cockroachdb.

By blob storage, I mean immutable storage, where blobs are stored by their SHA-256 hash or similar. Blobs can be created and are replicated for reliable storage, but then the only "modification" allowed is deletion.
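For illustration, here is a minimal in-memory Go sketch of that content-addressed model: blobs keyed by their SHA-256 digest, idempotent writes, and deletion as the only mutation. Everything here (names, the in-memory map) is hypothetical and not tied to any CockroachDB API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// Store is an in-memory, content-addressed blob store: the key is the
// SHA-256 hash of the value, so writes are idempotent and blobs are
// immutable once stored.
type Store struct {
	mu    sync.RWMutex
	blobs map[string][]byte
}

func NewStore() *Store {
	return &Store{blobs: make(map[string][]byte)}
}

// Put stores the blob and returns its content address.
func (s *Store) Put(data []byte) string {
	sum := sha256.Sum256(data)
	key := hex.EncodeToString(sum[:])
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blobs[key] = append([]byte(nil), data...)
	return key
}

// Get returns the blob for a content address, if present.
func (s *Store) Get(key string) ([]byte, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	b, ok := s.blobs[key]
	if !ok {
		return nil, errors.New("blob not found")
	}
	return b, nil
}

// Delete is the only "modification" permitted on a stored blob.
func (s *Store) Delete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.blobs, key)
}

func main() {
	s := NewStore()
	key := s.Put([]byte("hello, blob"))
	fmt.Println("stored at", key)
	if b, err := s.Get(key); err == nil {
		fmt.Printf("read back %q\n", b)
	}
}
```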

Do you have any plans/pointers for integrating blob storage? I could always just rely on S3 or similar, but this seems like something you might well be planning on supporting directly. (I presume it's not a good idea to just store these (multi-MB) blobs as values in the k-v store.)

Jira issue: CRDB-6201

@jamesgraves
Contributor

If you are willing to live with the restrictions for immutable blobs, that opens up a lot of design possibilities. I can imagine a couple different kinds of blob storage that might be useful in various situations. For example, some applications might work well in a peer-to-peer fashion, instead of using centralized servers.

> I presume it's not a good idea to just store these (multi-MB) blobs as values in the k-v store.

Performance can potentially be much better with external blob storage. For a particular key-value, you are only talking to one node (the currently elected leader), which may not be located nearby; it could be in another datacenter. The leader may also be busy handling transactions from other clients.

Presumably, if you are replicating blobs, the client can choose the closest and/or least busy server to download from. It is also easier to scale up the serving of static content.

@spencerkimball
Member

While at Google, I worked on Colossus, which was the successor to GFS. It used bigtable to scale out file metadata and a server at each disk to store chunks of data. Aside from increased scale, it used Reed Solomon encoding for more efficient storage. I'd of course like to do something similar on top of Cockroach at some point.

You definitely wouldn't want to store multi-MB blobs as values in the KV store. I believe that leveldb internally will store anything larger than 64K as a separate file.

@xiang90
Contributor

xiang90 commented Dec 24, 2014

@spencerkimball

> It used bigtable to scale out file metadata and a server at each disk to store chunks of data.

So the metadata server is also distributed, and each of them only puts part of the metadata into memory?

> Aside from increased scale, it used Reed Solomon encoding for more efficient storage.

Reed Solomon encoding/decoding is high-cost, as far as I know. Is this a client-driven thing? Or is it done on the server side?

> I'd of course like to do something similar on top of Cockroach at some point.

I would love to see this happen.

@ghost

ghost commented Dec 31, 2014

Hi,

I've finally just released https://github.com/photosrv/photosrv, which is aimed at handling immutable blob storage. It still needs work in some areas and does not do any kind of encoding/decoding, but it has already been proven to scale effectively to hundreds of millions of files.

Maybe there is room for an integration layer between photosrv and CockroachDB?

Cheers,

@spencerkimball
Member

@xiang90

The metadata was distributed, but it didn't hold the data in memory, except through bigtable's normal caching facility.

Reed Solomon encoding/decoding can be made blazingly fast with appropriate optimizations, both in how the Cauchy encoding matrix is chosen and with carefully tuned low-level inner-loop instructions. At Google, this was done exclusively on the client side for encoding, and for decoding on-the-fly in the event that requested data blocks were not available, necessitating reads of parity blocks. It was also done server-side for permanent reconstructions when machines went missing, data corruption was identified, or disks died.

@Frank-Jin

@spencerkimball
Hi, I have one question.
I see that Google's Spanner was built on top of Colossus, but Cockroach doesn't depend on any distributed FS. What's the difference between them?

@spencerkimball
Member

Running on top of Colossus is a very efficient proposition since it uses erasure-coded storage. It means that if you store replicas of part of your key range within five datacenters, you end up with a 5 * 1.67x = 8.3x encoding rate. Back in the days of GFS, the same configuration would mean a 5 * 3x = 15x encoding rate for triplicated storage, or a 5 * 2x = 10x encoding rate for just two replicas per datacenter (GFS r=2 encoding).

Cockroach doesn't use a separate distributed file system as a dependent layer, so with five datacenters you'd either end up with a 5x encoding rate (one replica in each) or you could increase the Raft replication to two replicas per datacenter (or even three, though I don't think that would make as much sense). Using only a single replica per datacenter means you need to use inter-datacenter bandwidth to recover a lost disk or machine. Using two replicas per datacenter means you could very often rely on intra-datacenter bandwidth for recovery. The cost of using two replicas in each datacenter is in encoding efficiency (2x) and in write latency, which increases both from the additional bandwidth required on writes and probably from a change in the latency histogram due to more replicas participating in consensus. You could possibly get really clever with the inter-datacenter bandwidth by sending to only one replica per datacenter and making that replica responsible for forwarding to the alternate, but that would be a really onerous bit of complexity to add.
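To make the multipliers above concrete, here is a small Go sketch. The (6, 4) Reed-Solomon code used to produce the ~1.67x per-copy overhead is an assumption for illustration; the comment itself only gives the 1.67x figure.

```go
package main

import "fmt"

func main() {
	const datacenters = 5.0

	// Erasure-coded copy per datacenter, assuming something like a
	// (k, m) = (6, 4) Reed-Solomon code: overhead = (k+m)/k ≈ 1.67x.
	erasure := (6.0 + 4.0) / 6.0
	fmt.Printf("Colossus-style, erasure coded:    %.1fx\n", datacenters*erasure) // ~8.3x

	fmt.Printf("GFS r=3 per datacenter:           %.0fx\n", datacenters*3) // 15x
	fmt.Printf("GFS r=2 per datacenter:           %.0fx\n", datacenters*2) // 10x
	fmt.Printf("Cockroach, 1 replica/datacenter:  %.0fx\n", datacenters*1) // 5x
	fmt.Printf("Cockroach, 2 replicas/datacenter: %.0fx\n", datacenters*2) // 10x
}
```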

Eventually Cockroach could just as easily run using a distributed file system, but at this stage it would be a mistake to require such a complex external dependency. Most Cockroach clusters will start as single-datacenter deployments which makes the inter-datacenter bandwidth issue a moot point. Further, a very smart and cheap way to mitigate the costs of recovery in multi-datacenter clusters would be to use more reliable storage than Google uses internally. For example, hardware RAID.

I'm quite serious about the next step being CockroachFS, which would introduce erasure encoding and CockroachDB could bootstrap itself and run on top of that, making it closer to the Spanner architecture where storage is concerned.

@Frank-Jin

I see, thank you.

@cloudmode

We just ported our old FoundationFS demo to cockroachdb as an exercise to learn/understand cockroachdb. Overall, it was a good experience, and we intend to keep this project up to date as cockroachdb moves into beta and production. You can check it out at https://github.com/cloudmode/roachclip-fs. It's an attempt to replicate MongoDB's GridFS interface.

@cloudmode

Just to clarify, roachclip-fs is for file storage, it's not a file system.

@xiang90
Contributor

xiang90 commented Apr 10, 2015

@spencerkimball

> Reed Solomon encoding/decoding can be made blazingly fast with appropriate optimizations, both in how the Cauchy encoding matrix is chosen and with carefully tuned low-level inner-loop instructions.

I just found https://github.com/tsuraan/Jerasure and https://github.com/catid/longhair. I am not sure if it's worth a rewrite in Go.

> At Google, this was done exclusively on the client side for encoding, and for decoding on-the-fly in the event that requested data blocks were not available, necessitating reads of parity blocks. It was also done server-side for permanent reconstructions when machines went missing, data corruption was identified, or disks died.

I did some research on this recently, and at the moment I am more interested in permanent reconstruction on the server side.

My understanding is that this has to be done on the server side, since we want to reduce bandwidth on the client side. Reconstructing a missing block might involve up to k transmissions for a (k, m) coding.

But does this also mean that the server side has to know the original encoding matrix? And does the reconstruction need some help from the upper level (or does it need to remember the matrix)?

@spencerkimball
Member

@xiang90 At Google, on Colossus, we did both client-side and server-side reconstructions. Waiting for the server when accessing via the client would result in unacceptable latency. There are also some common situations where an optimized data layout would result in no additional bandwidth to reconstruct (e.g. when scanning a file a "stripe" at a time). The encoding matrix is just a binary string of length k * m. We'd store that with the file metadata so the server could reconstruct at its leisure.
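As a purely hypothetical sketch (the names below are invented for illustration, not Colossus's actual schema) of what "storing the encoding matrix with the file metadata" could look like:

```go
package metadata

// ChunkRef points at one chunk stored on a disk server.
type ChunkRef struct {
	Server string // which disk server holds the chunk
	Handle string // identifier of the chunk on that server
}

// Stripe records the chunk locations for one stripe plus the k*m
// encoding matrix, so a server can reconstruct missing chunks later
// without any help from the original writer.
type Stripe struct {
	DataChunks   []ChunkRef // k data chunks
	CodeChunks   []ChunkRef // m parity chunks
	EncodingBits []byte     // Cauchy matrix, a k*m bit string
}

// FileMeta is the per-file record kept in the scalable metadata layer
// (bigtable, in the Colossus description above).
type FileMeta struct {
	Name    string
	Length  int64
	Stripes []Stripe
}
```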

Here's a description of a possible data layout:

[diagram: data layout for two stripes, with 8 data chunks and 4 code chunks]

The above diagram shows the data layout for two "stripes", which is a convenient concept for describing pieces of a larger file. In this example, there are 8 data chunks and 4 code chunks in the stripe; another common format might be 6.4 (6 data, 4 code). The metadata for a larger file was broken down by stripes. Each stripe would have 12 "chunks". In the diagram above, a chunk is represented by a vertical column containing 8MB of non-contiguous data. A stripe here contains 64MB of data and 32MB of parity/code blocks. A "block" is one of the cells in the diagram.

Why lay out the data this way? There are some compelling reasons. First, you expect many files to be relatively small. If you have files flirting with sizes within an order of magnitude of the stripe size, you would suffer quite expensive overhead in code blocks for a file whose size modulo the stripe size is (say) 16MB. In that case, the final stripe would only contain 16MB of data blocks. If you laid out data contiguously along chunks, you'd need all 32MB of code blocks, and your final stripe of data would have an encoding rate of 3x. Not good. With the layout in the diagram, you spread the 16MB of data blocks across only the first two block rows and use only 8MB of code blocks (just those along the first two "mini" stripes), giving you an encoding rate of 1.5x.

For similar reasons (and critical to performance), on-the-fly recovery of data is greatly enhanced by this data layout. If you're trying to read 2MB of data and one of the chunks is unavailable, you'll likely end up reading 8MB of data to reconstruct the missing 1MB (say, the 1MB from the 2MB which is available plus 7 additional blocks from the remaining 10). This is a 4x read blowup. On the other hand, with the contiguous block layout, missing 2MB of data would mean reading 16MB (2MB from each of 8 of the remaining 12 chunks), for an 8x read blowup. Further, if you read in 8MB increments, a missing chunk with this data layout actually incurs no read blowup cost (say you read from 7 of the 8 data chunks; then you just need 1MB from one of the 4 code chunks). For the contiguous chunk model, you'd need to read 64MB. Ouch--an 8x read blowup.
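As a quick sanity check of the final-stripe arithmetic above, a small Go sketch (all figures simply restate the example: 1MB blocks, 8 data chunks, 4 code chunks, and a 16MB final stripe):

```go
package main

import "fmt"

func main() {
	const (
		dataChunks = 8
		codeChunks = 4
		blockMB    = 1 // one cell in the diagram
		chunkMB    = 8 // 8 blocks per chunk, so a full stripe holds 64MB of data
	)
	finalStripeDataMB := 16 // the small final stripe from the example

	// Contiguous layout: 16MB fills two whole data chunks, but parity
	// still spans full chunks, so all 4 code chunks (32MB) are needed.
	contiguousParityMB := codeChunks * chunkMB
	fmt.Printf("contiguous layout: %.1fx encoding rate\n",
		float64(finalStripeDataMB+contiguousParityMB)/float64(finalStripeDataMB)) // 3.0x

	// Striped layout: 16MB occupies only the first two 1MB block rows,
	// so only those rows need parity: 2 rows * 4 code chunks = 8MB.
	rowsUsed := finalStripeDataMB / (dataChunks * blockMB) // 2
	stripedParityMB := rowsUsed * codeChunks * blockMB
	fmt.Printf("striped layout:    %.1fx encoding rate\n",
		float64(finalStripeDataMB+stripedParityMB)/float64(finalStripeDataMB)) // 1.5x
}
```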

@tamird tamird added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2015
@CooCooCaCha

I recently found this Go library for Reed-Solomon encoding. It claims 1GB/s/cpu core.
https://github.com/klauspost/reedsolomon
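For reference, a minimal usage sketch of that library for an 8 data + 4 parity encoding like the layout discussed earlier. This follows the library's documented Encoder API (New, Split, Encode, Reconstruct, Verify); the payload and error handling are illustrative only.

```go
package main

import (
	"bytes"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 8, 4

	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	payload := bytes.Repeat([]byte("blob data "), 1<<16)

	// Split into 8 data shards (parity shards are allocated empty),
	// then compute the 4 parity shards.
	shards, err := enc.Split(payload)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing two shards, then reconstruct and verify.
	shards[0], shards[9] = nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	if ok, err := enc.Verify(shards); err != nil || !ok {
		log.Fatal("reconstruction failed verification")
	}
	log.Println("reconstructed and verified all 12 shards")
}
```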

@petermattis petermattis modified the milestone: Post 1.0 Feb 14, 2016
@petermattis petermattis changed the title from "Feature request: Blob storage" to "sql: Feature request: Blob storage" Mar 31, 2016
@dianasaur323 dianasaur323 added O-community Originated from the community and removed community-answered labels Apr 23, 2017
@knz knz changed the title from "sql: Feature request: Blob storage" to "Feature request: Blob storage" May 9, 2018
@BramGruneir BramGruneir added the A-tools-hibernate Issues that pertain to Hibernate integration. label Jun 14, 2018
@knz knz changed the title from "Feature request: Blob storage" to "roadmap: Blob storage" Jul 23, 2018
@tim-o
Contributor

tim-o commented Aug 12, 2018

Zendesk ticket #2743 has been linked to this issue.

@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@tbg tbg added the C-wishlist A wishlist feature. label Dec 11, 2018
@Dmole

Dmole commented Apr 6, 2019

OID / BLOB support would bring CockroachDB closer to a drop-in replacement for PostgreSQL.

jordanlewis added a commit to jordanlewis/cockroach that referenced this issue Oct 2, 2019
The spreadsheet we discussed is unwieldy - hard to edit and impossible to keep
up to date. If we write down blacklists in code, then we can use an approach
like this to always have an up to date aggregation.

So far it seems like there's just a lot of unknowns to categorize still.

The output today:

```
=== RUN   TestBlacklists
 648: unknown                                                (unknown)
 493: cockroachdb#5807   (sql: Add support for TEMP tables)
 151: cockroachdb#17511  (sql: support stored procedures)
  86: cockroachdb#26097  (sql: make TIMETZ more pg-compatible)
  56: cockroachdb#10735  (sql: support SQL savepoints)
  55: cockroachdb#32552  (multi-dim arrays)
  55: cockroachdb#26508  (sql: restricted DDL / DML inside transactions)
  52: cockroachdb#32565  (sql: support optional TIME precision)
  39: cockroachdb#243    (roadmap: Blob storage)
  33: cockroachdb#26725  (sql: support postgres' API to handle blob storage (incl lo_creat, lo_from_bytea))
  31: cockroachdb#27793  (sql: support custom/user-defined base scalar (primitive) types)
  24: cockroachdb#12123  (sql: Can't drop and replace a table within a transaction)
  24: cockroachdb#26443  (sql: support user-defined schemas between database and table)
  20: cockroachdb#21286  (sql: Add support for geometric types)
  18: cockroachdb#6583   (sql: explicit lock syntax (SELECT FOR {SHARE,UPDATE} {skip locked,nowait}))
  17: cockroachdb#22329  (Support XA distributed transactions in CockroachDB)
  16: cockroachdb#24062  (sql: 32 bit SERIAL type)
  16: cockroachdb#30352  (roadmap:when CockroachDB  will support cursor?)
  12: cockroachdb#27791  (sql: support RANGE types)
   8: cockroachdb#40195  (pgwire: multiple active result sets (portals) not supported)
   8: cockroachdb#6130   (sql: add support for key watches with notifications of changes)
   5: Expected Failure                                       (unknown)
   5: cockroachdb#23468  (sql: support sql arrays of JSONB)
   5: cockroachdb#40854  (sql: set application_name from connection string)
   4: cockroachdb#35879  (sql: `default_transaction_read_only` should also accept 'on' and 'off')
   4: cockroachdb#32610  (sql: can't insert self reference)
   4: cockroachdb#40205  (sql: add non-trivial implementations of FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, FOR NO KEY SHARE)
   4: cockroachdb#35897  (sql: unknown function: pg_terminate_backend())
   4: cockroachdb#4035   (sql/pgwire: missing support for row count limits in pgwire)
   3: cockroachdb#27796  (sql: support user-defined DOMAIN types)
   3: cockroachdb#3781   (sql: Add Data Type Formatting Functions)
   3: cockroachdb#40476  (sql: support `FOR {UPDATE,SHARE} {SKIP LOCKED,NOWAIT}`)
   3: cockroachdb#35882  (sql: support other character sets)
   2: cockroachdb#10028  (sql: Support view queries with star expansions)
   2: cockroachdb#35807  (sql: INTERVAL output doesn't match PG)
   2: cockroachdb#35902  (sql: large object support)
   2: cockroachdb#40474  (sql: support `SELECT ... FOR UPDATE OF` syntax)
   1: cockroachdb#18846  (sql: Support CIDR column type)
   1: cockroachdb#9682   (sql: implement computed indexes)
   1: cockroachdb#31632  (sql: FK options (deferrable, etc))
   1: cockroachdb#24897  (sql: CREATE OR REPLACE VIEW)
   1: pass?                                                  (unknown)
   1: cockroachdb#36215  (sql: enable setting standard_conforming_strings to off)
   1: cockroachdb#32562  (sql: support SET LOCAL and txn-scoped session variable changes)
   1: cockroachdb#36116  (sql: psychopg: investigate how `'infinity'::timestamp` is presented)
   1: cockroachdb#26732  (sql: support the binary operator: <int> / <float>)
   1: cockroachdb#23299  (sql: support coercing string literals to arrays)
   1: cockroachdb#36115  (sql: psychopg: investigate if datetimetz is being returned instead of datetime)
   1: cockroachdb#26925  (sql: make the CockroachDB integer types more compatible with postgres)
   1: cockroachdb#21085  (sql: WITH RECURSIVE (recursive common table expressions))
   1: cockroachdb#36179  (sql: implicity convert date to timestamp)
   1: cockroachdb#36118  (sql: Cannot parse '24:00' as type time)
   1: cockroachdb#31708  (sql: support current_time)
```

Release justification: non-production change
Release note: None
jordanlewis added a commit to jordanlewis/cockroach that referenced this issue Oct 24, 2019
craig bot pushed a commit that referenced this issue Nov 7, 2019
41252: roachtest: add test that aggregates orm blacklist failures r=jordanlewis a=jordanlewis

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
@Garfiled

> While at Google, I worked on Colossus, which was the successor to GFS. It used bigtable to scale out file metadata and a server at each disk to store chunks of data. Aside from increased scale, it used Reed Solomon encoding for more efficient storage. I'd of course like to do something similar on top of Cockroach at some point.
>
> You definitely wouldn't want to store multi-MB blobs as values in the KV store. I believe that leveldb internally will store anything larger than 64K as a separate file.

What were the Colossus stripe width and stripe cell size? (Are they the same for SSD and HDD D servers?)

@Garfiled

> @xiang90 At Google, on Colossus, we did both client-side and server-side reconstructions. Waiting for the server when accessing via the client would result in unacceptable latency. There are also some common situations where an optimized data layout would result in no additional bandwidth to reconstruct (e.g. when scanning a file a "stripe" at a time). The encoding matrix is just a binary string of length k * m. We'd store that with the file metadata so the server could reconstruct at its leisure.
>
> Here's a description of a possible data layout:
>
> [diagram: data layout for two stripes, with 8 data chunks and 4 code chunks]
>
> The above diagram shows the data layout for two "stripes", which is a convenient concept for describing pieces of a larger file. In this example, there are 8 data chunks and 4 code chunks in the stripe; another common format might be 6.4 (6 data, 4 code). The metadata for a larger file was broken down by stripes. Each stripe would have 12 "chunks". In the diagram above, a chunk is represented by a vertical column containing 8MB of non-contiguous data. A stripe here contains 64MB of data and 32MB of parity/code blocks. A "block" is one of the cells in the diagram.
>
> Why lay out the data this way? There are some compelling reasons. First, you expect many files to be relatively small. If you have files flirting with sizes within an order of magnitude of the stripe size, you would suffer quite expensive overhead in code blocks for a file whose size modulo the stripe size is (say) 16MB. In that case, the final stripe would only contain 16MB of data blocks. If you laid out data contiguously along chunks, you'd need all 32MB of code blocks, and your final stripe of data would have an encoding rate of 3x. Not good. With the layout in the diagram, you spread the 16MB of data blocks across only the first two block rows and use only 8MB of code blocks (just those along the first two "mini" stripes), giving you an encoding rate of 1.5x.
>
> For similar reasons (and critical to performance), on-the-fly recovery of data is greatly enhanced by this data layout. If you're trying to read 2MB of data and one of the chunks is unavailable, you'll likely end up reading 8MB of data to reconstruct the missing 1MB (say, the 1MB from the 2MB which is available plus 7 additional blocks from the remaining 10). This is a 4x read blowup. On the other hand, with the contiguous block layout, missing 2MB of data would mean reading 16MB (2MB from each of 8 of the remaining 12 chunks), for an 8x read blowup. Further, if you read in 8MB increments, a missing chunk with this data layout actually incurs no read blowup cost (say you read from 7 of the 8 data chunks; then you just need 1MB from one of the 4 code chunks). For the contiguous chunk model, you'd need to read 64MB. Ouch--an 8x read blowup.

How did you handle small writes?

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@Bessonov

> a comment will keep it active

@spencerkimball
Member

spencerkimball commented Sep 26, 2023 via email
