roadmap: Blob storage #243
Comments
If you are willing to live with the restrictions of immutable blobs, that opens up a lot of design possibilities. I can imagine a couple of different kinds of blob storage that might be useful in various situations. For example, some applications might work well in a peer-to-peer fashion instead of using centralized servers.
Performance can potentially be much better with external blob storage. For a particular key-value pair, you are only talking to one node (the currently elected leader), which may not be located nearby; it could be in another datacenter. The leader may also be busy handling transactions from other clients. Presumably, if you are replicating blobs, the client can choose the closest and/or least busy server to download from. It is also easier to scale up the serving of static content.
While at Google, I worked on Colossus, which was the successor to GFS. It used Bigtable to scale out file metadata and a server at each disk to store chunks of data. Aside from increased scale, it used Reed-Solomon encoding for more efficient storage. I'd of course like to do something similar on top of Cockroach at some point. You definitely wouldn't want to store multi-MB blobs as values in the KV store. I believe that LevelDB internally will store anything larger than 64K as a separate file.
So the metadata servers are also distributed, and each of them only puts part of the metadata into memory?
Reed-Solomon encoding/decoding is high-cost as far as I know. Is this driven by the client, or is it done on the server side?
I would love to see this happen.
Hi, I've finally just released https://github.com/photosrv/photosrv, which is aimed at handling immutable blob storage. It still needs work in some areas and does not do any kind of encoding/decoding, but it has already been proven to scale effectively to hundreds of millions of files. Maybe there is room for an integration layer between photosrv and CockroachDB? Cheers,
The metadata was distributed, but it didn't hold the data in memory, except through Bigtable's normal caching facility. Reed-Solomon encoding/decoding can be made blazingly fast with appropriate optimizations, both in how the Cauchy encoding matrix is chosen and with carefully tuned low-level inner-loop instructions. At Google, encoding was done exclusively on the client side, as was on-the-fly decoding in the event that requested data blocks were unavailable, necessitating reads of parity blocks. Reconstruction was also done server-side for permanent rebuilds when machines went missing, data corruption was identified, or disks died.
@spencerkimball
Running on top of Colossus is a very efficient proposition since it uses erasure-coded storage. It means that if you store replicas of part of your key range within five datacenters, you end up with a 5 * 1.67x = 8.3x encoding rate. Back in the days of GFS, the same configuration would mean a 5 * 3x = 15x encoding rate for triplicated storage, or 5 * 2x = 10x for just two replicas per datacenter (GFS r=2 encoding).

Cockroach doesn't use a separate distributed file system as a dependent layer, so with five datacenters you'd either end up with a 5x encoding rate (one replica in each), or you can increase the Raft replication to require two replicas per datacenter (or even three, though I don't think that would make as much sense). Using only a single replica per datacenter means you need to use inter-datacenter bandwidth in order to recover a lost disk or machine. Using two replicas per datacenter means you could very often rely on intra-datacenter bandwidth for recovery. The cost of using two replicas in each datacenter is in encoding efficiency (x2) and in write latencies, which increase both from the additional bandwidth required on writes and probably from a change in the latency histogram due to more replicas participating in consensus. You could possibly get really clever with the inter-datacenter bandwidth by sending to only one replica per datacenter and making that replica responsible for forwarding to the alternate, but that would be a really onerous bit of complexity to add.

Eventually Cockroach could just as easily run on a distributed file system, but at this stage it would be a mistake to require such a complex external dependency. Most Cockroach clusters will start as single-datacenter deployments, which makes the inter-datacenter bandwidth issue a moot point. Further, a very smart and cheap way to mitigate the costs of recovery in multi-datacenter clusters would be to use more reliable storage than Google uses internally, for example hardware RAID. I'm quite serious about the next step being CockroachFS, which would introduce erasure encoding; CockroachDB could bootstrap itself and run on top of that, making it closer to the Spanner architecture where storage is concerned.
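To make that arithmetic concrete, here's a small illustrative Go sketch. The figures are just the ones quoted above, assuming an RS(6,4) code for the 1.67x rate; this is not anything CockroachDB computes.

```go
package main

import "fmt"

// replicationOverhead: bytes stored per byte of user data when each of
// `datacenters` keeps `replicasPerDC` full copies.
func replicationOverhead(datacenters, replicasPerDC int) float64 {
	return float64(datacenters * replicasPerDC)
}

// erasureOverhead: bytes stored per byte of user data when each of
// `datacenters` keeps one Reed-Solomon (data, parity) encoded copy.
func erasureOverhead(datacenters, dataChunks, parityChunks int) float64 {
	return float64(datacenters) * float64(dataChunks+parityChunks) / float64(dataChunks)
}

func main() {
	fmt.Printf("5 DCs, RS(6,4):         %.1fx\n", erasureOverhead(5, 6, 4))  // ~8.3x
	fmt.Printf("5 DCs, r=3 replication: %.1fx\n", replicationOverhead(5, 3)) // 15.0x
	fmt.Printf("5 DCs, r=2 replication: %.1fx\n", replicationOverhead(5, 2)) // 10.0x
	fmt.Printf("5 DCs, 1 replica each:  %.1fx\n", replicationOverhead(5, 1)) // 5.0x
}
```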
I see, thank you.
We just ported our old FoundationFS demo to CockroachDB as an exercise to learn and understand CockroachDB. Overall, it was a good experience, and we intend to keep this project up to date as CockroachDB moves into beta and production. You can check it out at https://github.com/cloudmode/roachclip-fs. It's an attempt to replicate MongoDB's GridFS interface.
Just to clarify, roachclip-fs is for file storage; it's not a file system.
I just found https://github.com/tsuraan/Jerasure and https://github.com/catid/longhair. I am not sure if they would be worth a rewrite in Go.
I did some research about this recently, and I am mostly interested in the reconstruction side. My understanding is that it has to be done on the server side, since we want to reduce bandwidth on the client side. Reconstructing a missing block might involve at most k transmissions for a (k, m) coding. But does this also mean that the server side has to know the original encoding matrix? And does reconstruction need some help from the upper level (or does the server need to remember the matrix)?
@xiang90 At Google, on Colossus we did both client-side and server-side reconstructions. Waiting for the server when accessing via the client would result in unacceptable latency. There are also some common situations where an optimized data layout results in no additional bandwidth to reconstruct (e.g. when scanning a file a "stripe" at a time). The encoding matrix is just a binary string of length k * m. We'd store that with the file metadata so the server could reconstruct at its leisure.

Here's a description of a possible data layout, using two "stripes" as an example; a stripe is a convenient concept for describing pieces of a larger file. In this example there are 8 data chunks and 4 code chunks per stripe. Another common format might be 6.4 (six data chunks, four code chunks). The metadata for a larger file was broken down by stripes, and each stripe here has 12 "chunks". Picture a chunk as a vertical column containing 8 MB of non-contiguous data: 8 data chunks and 4 code chunks. A stripe therefore contains 64 MB of data and 32 MB of parity/code blocks. A "block" is one 1 MB cell within a chunk.

Why lay out the data this way? There are some compelling reasons. First, you expect many files to be relatively small. If you have files flirting with sizes within an order of magnitude of the stripe size, you would suffer quite expensive overhead in code blocks. Say a file's size modulo the stripe size is 16 MB, so its final stripe contains only 16 MB of data blocks. If you laid the data out contiguously along chunks, you'd need all 32 MB of code blocks, and that final stripe would have an encoding rate of 3x. Not good. With the striped layout, you spread the 16 MB of data blocks across only the first two block rows and use only 8 MB of code blocks (also just those along the first two "mini" stripes), giving you an encoding rate of 1.5x.

For similar reasons (and critical to performance), on-the-fly recovery of data is greatly enhanced by this data layout. If you're trying to read 2 MB of data and one of the chunks is unavailable, you'll likely end up reading 8 MB of data to reconstruct the missing 1 MB (say, the 1 MB of the 2 MB which is available, plus 7 additional blocks from the remaining 10). This is a 4x read blowup. With the contiguous block layout, on the other hand, missing 2 MB of data would mean reading 16 MB (2 MB from each of 8 of the remaining 12 chunks), for an 8x read blowup. Further, if you read in 8 MB increments, a missing chunk with the striped layout actually incurs no read blowup at all: you read 7 of the 8 data blocks, then you just need 1 MB from one of the 4 code chunks. For the contiguous chunk model, you'd need to read 64 MB. Ouch: an 8x read blowup.
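A quick way to sanity-check those read-blowup numbers is this illustrative Go sketch. It assumes the 8-data/4-code, 1 MB block geometry described above, a read starting at the beginning of a stripe, and exactly one unavailable data chunk; everything else is a simplification, not Colossus code.

```go
package main

import "fmt"

const (
	dataChunks = 8 // data chunks per stripe (there are also 4 code chunks)
	blockMB    = 1 // block ("cell") size in MB
)

// stripedReadMB: blocks are laid out round-robin across the data chunks, so
// each "row" of the read includes the missing chunk. Reconstructing that
// row's missing block needs any 8 of its 11 surviving blocks, which subsumes
// the blocks we wanted anyway: 8 MB read per touched row.
func stripedReadMB(requestMB int) int {
	rowMB := dataChunks * blockMB
	rowsTouched := (requestMB + rowMB - 1) / rowMB
	return rowsTouched * rowMB
}

// contiguousReadMB: with contiguous layout the first 8 MB of the file live
// entirely in the missing chunk, so every requested MB must be rebuilt from
// 8 of the surviving chunks.
func contiguousReadMB(requestMB int) int {
	return requestMB * dataChunks
}

func main() {
	for _, req := range []int{2, 8} {
		s, c := stripedReadMB(req), contiguousReadMB(req)
		fmt.Printf("read %d MB with one dead chunk: striped %d MB (%.0fx), contiguous %d MB (%.0fx)\n",
			req, s, float64(s)/float64(req), c, float64(c)/float64(req))
	}
}
```

Running it reproduces the figures in the comment: 8 MB (4x) vs. 16 MB (8x) for a 2 MB read, and 8 MB (1x) vs. 64 MB (8x) for an 8 MB read.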
I recently found this Go library for Reed-Solomon encoding. It claims 1 GB/s per CPU core.
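For a sense of what using such a library looks like, here is a minimal encode/reconstruct sketch assuming github.com/klauspost/reedsolomon, one widely used Go implementation in that performance class (the library choice and the 8+4 geometry are assumptions for illustration, not the library linked above):

```go
package main

import (
	"bytes"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// RS(8,4): 8 data shards plus 4 parity shards, matching the stripe
	// geometry discussed earlier in the thread.
	enc, err := reedsolomon.New(8, 4)
	if err != nil {
		log.Fatal(err)
	}

	data := bytes.Repeat([]byte("blob payload "), 1<<16) // example payload

	// Split allocates 8 data shards plus 4 (empty) parity shards.
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil { // fill in the parity shards
		log.Fatal(err)
	}

	// Simulate losing one data shard and one parity shard, then rebuild.
	shards[0], shards[9] = nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	if err != nil || !ok {
		log.Fatal("reconstruction failed verification")
	}
	log.Println("reconstructed and verified all shards")
}
```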
Zendesk ticket #2743 has been linked to this issue. |
OID / BLOB support would bring CockroachDB closer to a drop-in replacement for PostgreSQL. |
41252: roachtest: add test that aggregates orm blacklist failures r=jordanlewis a=jordanlewis

The spreadsheet we discussed is unwieldy - hard to edit and impossible to keep up to date. If we write down blacklists in code, then we can use an approach like this to always have an up to date aggregation. So far it seems like there's just a lot of unknowns to categorize still.

The output today:

```
=== RUN TestBlacklists
648: unknown (unknown)
493: #5807 (sql: Add support for TEMP tables)
151: #17511 (sql: support stored procedures)
86: #26097 (sql: make TIMETZ more pg-compatible)
56: #10735 (sql: support SQL savepoints)
55: #32552 (multi-dim arrays)
55: #26508 (sql: restricted DDL / DML inside transactions)
52: #32565 (sql: support optional TIME precision)
39: #243 (roadmap: Blob storage)
33: #26725 (sql: support postgres' API to handle blob storage (incl lo_creat, lo_from_bytea))
31: #27793 (sql: support custom/user-defined base scalar (primitive) types)
24: #12123 (sql: Can't drop and replace a table within a transaction)
24: #26443 (sql: support user-defined schemas between database and table)
20: #21286 (sql: Add support for geometric types)
18: #6583 (sql: explicit lock syntax (SELECT FOR {SHARE,UPDATE} {skip locked,nowait}))
17: #22329 (Support XA distributed transactions in CockroachDB)
16: #24062 (sql: 32 bit SERIAL type)
16: #30352 (roadmap:when CockroachDB will support cursor?)
12: #27791 (sql: support RANGE types)
8: #40195 (pgwire: multiple active result sets (portals) not supported)
8: #6130 (sql: add support for key watches with notifications of changes)
5: Expected Failure (unknown)
5: #23468 (sql: support sql arrays of JSONB)
5: #40854 (sql: set application_name from connection string)
4: #35879 (sql: `default_transaction_read_only` should also accept 'on' and 'off')
4: #32610 (sql: can't insert self reference)
4: #40205 (sql: add non-trivial implementations of FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, FOR NO KEY SHARE)
4: #35897 (sql: unknown function: pg_terminate_backend())
4: #4035 (sql/pgwire: missing support for row count limits in pgwire)
3: #27796 (sql: support user-defined DOMAIN types)
3: #3781 (sql: Add Data Type Formatting Functions)
3: #40476 (sql: support `FOR {UPDATE,SHARE} {SKIP LOCKED,NOWAIT}`)
3: #35882 (sql: support other character sets)
2: #10028 (sql: Support view queries with star expansions)
2: #35807 (sql: INTERVAL output doesn't match PG)
2: #35902 (sql: large object support)
2: #40474 (sql: support `SELECT ... FOR UPDATE OF` syntax)
1: #18846 (sql: Support CIDR column type)
1: #9682 (sql: implement computed indexes)
1: #31632 (sql: FK options (deferrable, etc))
1: #24897 (sql: CREATE OR REPLACE VIEW)
1: pass? (unknown)
1: #36215 (sql: enable setting standard_conforming_strings to off)
1: #32562 (sql: support SET LOCAL and txn-scoped session variable changes)
1: #36116 (sql: psychopg: investigate how `'infinity'::timestamp` is presented)
1: #26732 (sql: support the binary operator: <int> / <float>)
1: #23299 (sql: support coercing string literals to arrays)
1: #36115 (sql: psychopg: investigate if datetimetz is being returned instead of datetime)
1: #26925 (sql: make the CockroachDB integer types more compatible with postgres)
1: #21085 (sql: WITH RECURSIVE (recursive common table expressions))
1: #36179 (sql: implicity convert date to timestamp)
1: #36118 (sql: Cannot parse '24:00' as type time)
1: #31708 (sql: support current_time)
```

Release justification: non-production change
Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
What stripe width or stripe cell size does Colossus use? (Is it the same for SSD and HDD D servers?)
How did you handle small writes?
We have marked this issue as stale because it has been inactive. A comment will keep it active.
This issue should be closed.
First off, I think this is awesome. I did a series of blog posts about a year ago (http://blog.justinsb.com/blog/categories/cloudata/) that posited the existence of a reliable key-value store and a blob store, and built all sorts of cool things on top of them. For example, a git store would keep only the refs in consistent storage and store the data itself in the blob store. Similarly a filesystem: metadata in the key-value store, file data in blob storage. It might be fun for me to update that blog series using CockroachDB.
By blob storage, I mean immutable storage, where blobs are stored by their SHA-256 hash or similar. Blobs can be created and are replicated for reliable storage, but then the only "modification" allowed is deletion.
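To make "stored by their SHA-256 hash" concrete, here is a tiny illustrative Go sketch; the key prefix is made up for the example and isn't a proposal for any particular key space.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blobKey derives an immutable, content-addressed key for a blob: the same
// bytes always hash to the same key, so a write can only create or duplicate
// a blob, never change one. The "/blob/sha256/" prefix is purely illustrative.
func blobKey(contents []byte) string {
	sum := sha256.Sum256(contents)
	return "/blob/sha256/" + hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(blobKey([]byte("hello world")))
	// Storing the same contents again yields the same key, which gives
	// natural deduplication; "modification" reduces to writing a new blob
	// under a new key and deleting the old one.
	fmt.Println(blobKey([]byte("hello world")))
}
```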
Do you have any plans/pointers for integrating blob storage? I could always just rely on S3 or similar, but this seems like something you might well be planning on supporting directly. (I presume it's not a good idea to just store these (multi-MB) blobs as values in the k-v store.)
Jira issue: CRDB-6201