leveldb: Add LevelDB support #824
Conversation
Example: go run . -d ldb d format/ldb/testdata/000005.ldb
Nice start! did some initial reviewing to get going
format/format.go
Outdated
@@ -125,6 +125,7 @@ var (
JPEG = &decode.Group{Name: "jpeg"}
JSON = &decode.Group{Name: "json"}
JSONL = &decode.Group{Name: "jsonl"}
LDB = &decode.Group{Name: "ldb"}
What do you think about naming it "leveldb" for clarity?
Yes. And maybe even go one step further and use the MPEG pattern here of doing leveldb_ldb (and eventually leveldb_log etc.)?
yeap makes sense, see my other comments about it
format/ldb/ldb.go
Outdated
var compressionTypes = scalar.UintMapSymStr{
compressionTypeNone: "none",
compressionTypeSnappy: "Snappy",
compressionTypeZstandard: "Zstandard",
Keep all lowercase?
Possibly "zstd"? not sure what is most common
format/ldb/ldb.go
Outdated
var indexSize int64
var metaIndexOffset int64
var metaIndexSize int64
I wonder if use of d.LimitedFn or d.RangeFn could simplify some field length calculations? mostly thinking of the use of d.Pos/d.Len, is the padding dynamic in size somehow?
Thanks for the pointer to these functions. Rewrote.
yeap they are quite nice sometimes so one doesn't have to keep track of the number of bits left etc. Note that i think d.Pos() inside one of these still returns the absolute position in the "root" buffer but d.Len() will change behaviour. Maybe there should be d.RootPos()/d.RootLen() etc?
format/ldb/ldb.go
Outdated
compressedSize := size
compressed := data
bb := &bytes.Buffer{}
_ = bb
debug leftover i guess :)
format/ldb/ldb.go
Outdated
// index
d.SeekAbs(indexOffset * 8)
Wonder if d.FramedLen or d.RangeLen could be used here instead of seek and passing the size?
You mean d.FramedFn and d.RangeFn? It is not clear to me how this makes the code clearer. We find these offsets in the footer, and then jump to them to read the data.
No worries, only use it if it feels like it makes things clearer. d.RangeFn can sometimes be used instead of a combination of seek and limit.
format/ldb/ldb.go
Outdated
})
})
// TK: how do you make an empty entries-array appear _above_ the trailer?
// Right now, it's omitted if empty.
They both end up in a struct? then they will be sorted by bit range start... and i guess it might be that the entries array ends up with a zero range atm, hmm feels like a bug 🤔
Typically a key-value block has (1) entries and (2) trailer. However, occasionally there's only a trailer and no entries. What to do in such scenarios?
- Option 1: leave out the entries array altogether. (Current solution.)
- Option 2: show an empty entries array. (Proposition. However, it ends up being shown below the trailer, since it's empty. We need to read the trailer first to figure out if there's an entries array or not.)
I guess it would be nice to have an empty entries array for clarity? i will have to look closer at how to solve this. But don't think it's a blocker to merge, can be changed/fixed later.
format/ldb/ldb.go
Outdated
d.FieldRawLen("raw", size*8)
}

func decodeVarInt(d *decode.D) uint64 {
Is this LEB128? thinking maybe can use d.FieldULEB128
Made a remark in tryULEB128 that it's the same as "Base 128 Varint" (a term used in Google contexts).
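For context, LevelDB's varints, unsigned LEB128, and protobuf's "Base 128 Varints" are all the same little-endian base-128 scheme. A minimal standalone sketch of the decoding (decodeUvarint is an illustrative name, not the decoder's actual function):

```go
package main

import (
	"errors"
	"fmt"
)

// decodeUvarint decodes an unsigned little-endian base-128 varint
// ("Base 128 Varint" / unsigned LEB128): each byte carries 7 payload
// bits, and the high bit flags that more bytes follow.
func decodeUvarint(buf []byte) (value uint64, n int, err error) {
	var shift uint
	for i, b := range buf {
		value |= uint64(b&0b01111111) << shift
		if b&0b10000000 == 0 {
			return value, i + 1, nil
		}
		shift += 7
		if shift >= 64 {
			return 0, 0, errors.New("varint overflows 64 bits")
		}
	}
	return 0, 0, errors.New("truncated varint")
}

func main() {
	v, n, _ := decodeUvarint([]byte{0xac, 0x02}) // classic example: 300 in two bytes
	fmt.Println(v, n)                            // prints "300 2"
}
```

Go's stdlib also ships this as encoding/binary's Uvarint, which is another option when raw bytes (rather than a decode tree) are at hand.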
format/ldb/testdata/ldb.fqtest
Outdated
@@ -0,0 +1,98 @@
$ fq -d ldb dv uncompressed.ldb/000005.ldb
snappy test will be added later?
added.
format/ldb/testdata/make_ldb.py
Outdated
@@ -0,0 +1,45 @@
# Make LevelDB data: both uncompressed and compressed.
Nice with code to generate test cases!
format/ldb/ldb.go
Outdated
interp.RegisterFormat(
format.LDB,
&decode.Format{
Description: "LevelDB Table",
You can add a .md file named the same as the format and add various tips, todos and author. see maybe bson.md and bson.go for how it works. It will also be used for help in the CLI and for generating formats.md with make doc
@wader Thank you for the comments. Now also read |
@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed? Background: I added a |
format/format.go
Outdated
@@ -125,6 +125,7 @@ var (
JPEG = &decode.Group{Name: "jpeg"}
JSON = &decode.Group{Name: "json"}
JSONL = &decode.Group{Name: "jsonl"}
LDB = &decode.Group{Name: "leveldb_ldb"}
Change to LevelDB_LDB.
Do you know if there are other leveldb related formats that might be interesting in the future? thinking if this is leveldb_table (https://github.com/google/leveldb/blob/main/doc/table_format.md?) and then maybe there will be leveldb_log etc? but it's not a big deal to rename things later on anyway i think, also the format can be probed
@wader 👌, good idea. Besides the suggested modification, I now also added leveldb_log and leveldb_descriptor; thereby keeping the names as in the LevelDB leveldbutils dump command.
format/leveldb/leveldb_ldb.go
Outdated
footerEncodedLength = (4*10 + 8) * 8
magicNumberLength = 8 * 8
// leading 64 bits of
// echo http://code.google.com/p/leveldb/ | sha1sum
Interesting trivia :)
format/leveldb/leveldb_ldb.go
Outdated
0x1: "value",
}

type BlockHandle struct {
Maybe blockHandle as it's internal?
Maybe same for the members also?
Ah, privacy/exports in Go is solved by title-casing! Interesting language design decision.
yeap, let's say you learn to be ok with it :) not sure how well it agrees with golang's principle of clarity and explicitness 🤔
format/leveldb/leveldb_ldb.go
Outdated
}
d.Copy(bb, bytes.NewReader(decompressed))
default:
d.Fatalf("Unsupported compression type: %x", compressionType)
If it's possible to continue decoding even if there is an error you can use d.Errorf and then using -f will ignore the error
👍 Replaced all d.Fatalf with d.Errorf where it made sense.
I lied btw, it's -o force=true not -f... but maybe there should be a shorthand for it :)
format/leveldb/leveldb_ldb.go
Outdated
default:
d.Fatalf("Unsupported compression type: %x", compressionType)
}
d.FieldStructRootBitBufFn("uncompressed", bitio.NewBitReader(bb.Bytes(), -1), func(d *decode.D) {
I really should come up with a nicer API for the nested decoding stuff, feels a bit like a hack atm
result |= (b & 0x7f) << shift
if b&0x80 == 0 {
result |= (b & 0b01111111) << shift
if b&0b10000000 == 0 {
Nice clarifications here 👍
format/leveldb/leveldb_ldb.go
Outdated
Groups: []*decode.Group{format.Probe},
DecodeFn: ldbDecode,
})
interp.RegisterFS(leveldbFS)
Some formats have a torepr implementation, which is a function to convert the decode tree into a more user-friendly representation, not sure if that is interesting for leveldb? for example bson has https://github.com/wader/fq/blob/master/format/bson/bson.jq also don't forget to register the format-overloaded function https://github.com/wader/fq/blob/master/format/bson/bson.go#L24 (it's part of a hack to make it possible to have kind of "polymorphic" functions)
Thanks for the pointers. It's not clear to me in which scenarios you'd use torepr.
It's probably only useful to have if the format itself is used to encode some structure, if we take bson example again:
# this shows all the nitty gritty details of bson encoding
$ fq -d bson -o line_bytes=8 d format/bson/testdata/test.bson
│00 01 02 03 04 05 06 07│01234567│.{}: format/bson/testdata/test.bson (bson)
0x000│41 01 00 00 │A... │ size: 321
│ │ │ elements[0:17]:
│ │ │ [0]{}: element
0x000│ 01 │ . │ type: "double" (1) (64-bit binary floating point)
0x000│ 64 6f 75│ dou│ name: "dou"
0x008│00 │. │
0x008│ 29 5c 8f c2 f5 b0 58│ )\....X│ value: 98.765
0x010│40 │@ │
│ │ │ [1]{}: element
0x010│ 02 │ . │ type: "string" (2) (UTF-8 string)
0x010│ 73 74 72 00 │ str. │ name: "str"
0x010│ 0a 00│ ..│ length: 10
0x018│00 00 │.. │
0x018│ 6d 79 20 73 74 72│ my str│ value: "my string"
0x020│69 6e 67 00 │ing. │
...
# this shows the value the bson structure "represents"
$ fq -d bson torepr format/bson/testdata/test.bson
{
...
"dou": 98.765,
...
"str": "my string",
...
}
So it will only make sense if whatever is encoded can be translated into a jq value, something an "end user" would expect, kind of
@wader Ah, I think I get it! This was a helpful example for me. Thank you. :)
Yes, this applies here, especially for descriptors. However, this reminds me of another question I have. I will ask below.
### Limitations

- no Meta Blocks (like "filter") are decoded yet.
- Zstandard decompression is not implemented yet.
This more or less just depends on some zstd package? i've looked at https://github.com/klauspost/compress a couple of times, it has zstd and lots of other formats and i also suspect the api:s are a bit more low level than golang stdlib so might fit fq better. It can maybe replace github.com/golang/snappy also? impressively it seems it also has zero dependencies on other packages
I don't know how common zstd is in the wild. I've never come across it so far in the LevelDB samples I've seen.
Ok, then i think we can skip it for now, probably good to not make this PR grow too much also
| | | key_delta{}: 0x611-0x61a (9)
0x610| 73 | s | user_key: "s" 0x611-0x612 (1)
0x610| 01 | . | type: "value" (0x1) 0x612-0x613 (1)
0x610| ff ff ff ff ff ff ff | ....... | sequence_number: 72057594037927935 0x613-0x61a (7)
Format as hex? but maybe decimal make sense if it's a sequence
Typically sequence makes sense in decimal. They are sequential numbers. The only exceptions are these index and metaindex sections; especially the example you highlighted, where the maximum possible unsigned integer 0xffffffffffffff is chosen.
Ok! keep as decimal. btw if 0xffffffffffffff has some special meaning you can add a description. For example the mp4 decoder does something like this for box size:
const (
boxSizeRestOfFile = 0
boxSizeUse64bitSize = 1
)
d.FieldU32("size", scalar.UintMapDescription{
boxSizeRestOfFile: "Rest of file",
boxSizeUse64bitSize: "Use 64 bit size",
})
In your case you could probably just have d.FieldU64("...", scalar.UintMapDescription{0xffffffffffffff: "blabla"})
@wader It's unclear to me if it has a special meaning. I only noticed so far that in the index and metaindex there are sometimes sequence numbers like that; it feels more like a hack though. The sequence numbers are typically used in the data section and there they have a well-defined meaning.
Ok! then sounds like we should leave as it is
I usually lean towards not hiding things as that is kind of what fq is all about. In this case if you did not add "compressed" fields would those bit ranges show as gap fields instead?
The failing test is about |
I'd like you to hear me differently. Say there is a chunk of data that is compressed with Snappy. I uncompress this data and show the uncompressed data structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which has the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (which have all been uncompressed and have a corresponding "uncompressed" key) it can get quite unwieldy to skim the output. Hence the question whether it makes sense to hide these "compressed" sections (or more forcibly truncate them in the preview) if there are already corresponding "uncompressed" sections.
I see. As it works currently, if you didn't add "compressed" fields they would instead show up as "gap" fields that fq automatically inserts; this is so that all bits will always be "reachable" somehow. Maybe what you're looking for is the ability to add a "compressed" field but somehow tell fq that it should be displayed in a more discreet way unless in verbose mode etc? I'm thinking that totally hiding a field might be confusing as it will look as if there is a gap in the data/hex column? or maybe i misunderstand what you're aiming for?
@wader Question about the LevelDB log format and how to decompress it. In essence the data structure is as follows:
I hope that's clear so far. Thus my question is: for the pieces which are preserved in full, it's easy to show the underlying data structure, since it hasn't been split. However, for the "first", "middle", and "last" pieces, I wouldn't know how to visualize them? Is there some precedent here in the other formats?
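For background, the first/middle/last fragmentation being discussed can be sketched outside fq. This is an illustrative reassembly sketch (the fragment struct, record type names, and reassemble function are hypothetical, not fq or LevelDB API), showing how joined payloads fall out of concatenating fragment data in order:

```go
package main

import (
	"errors"
	"fmt"
)

// recordType mirrors LevelDB's log record types in spirit: a record either
// fits in one "full" fragment or is split into first/middle.../last.
type recordType int

const (
	full recordType = iota + 1
	first
	middle
	last
)

type fragment struct {
	typ  recordType
	data []byte
}

// reassemble concatenates fragment payloads back into complete records,
// the way a demuxing decoder can expose both per-fragment structure and
// the joined payload.
func reassemble(frags []fragment) ([][]byte, error) {
	var records [][]byte
	var pending []byte
	inRecord := false
	for _, f := range frags {
		switch f.typ {
		case full:
			if inRecord {
				return nil, errors.New("full fragment inside an open record")
			}
			records = append(records, append([]byte(nil), f.data...))
		case first:
			if inRecord {
				return nil, errors.New("first fragment inside an open record")
			}
			pending = append([]byte(nil), f.data...)
			inRecord = true
		case middle:
			if !inRecord {
				return nil, errors.New("middle fragment without first")
			}
			pending = append(pending, f.data...)
		case last:
			if !inRecord {
				return nil, errors.New("last fragment without first")
			}
			records = append(records, append(pending, f.data...))
			pending, inRecord = nil, false
		}
	}
	if inRecord {
		return nil, errors.New("truncated record")
	}
	return records, nil
}

func main() {
	recs, err := reassemble([]fragment{
		{first, []byte("hello ")},
		{middle, []byte("wor")},
		{last, []byte("ld")},
		{full, []byte("tiny")},
	})
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Printf("%s\n", r) // prints "hello world" then "tiny"
	}
}
```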
format/format.go
Outdated
@@ -125,6 +125,9 @@ var (
JPEG = &decode.Group{Name: "jpeg"}
JSON = &decode.Group{Name: "json"}
JSONL = &decode.Group{Name: "jsonl"}
LevelDB_Descriptor = &decode.Group{Name: "leveldb_descriptor"}
LDB = &decode.Group{Name: "leveldb_table"}
LOG = &decode.Group{Name: "leveldb_log"}
Prefix with LevelDB_
format/leveldb/leveldb_descriptor.go
Outdated
for {
if d.End() {
break
}
Could this be for !d.End() { ?
format/leveldb/leveldb_log.go
Outdated
format.LOG,
&decode.Format{
Description: "LevelDB Log",
Groups: []*decode.Group{format.Probe},
There are some things to be careful about when a format is in the probe group. For example make sure to "fail fast" if some magic etc is not found and to make sure that empty input does not succeed, ex if you have for !d.End() { at the "root" maybe it's good to count the number of things successfully decoded and then after the loop make sure at least N were decoded.
I was not quite aware what that format.Probe was doing (I copy pasted that part); now I am. Since the log files don't have magic strings of any kind in them, I think it's better to exclude them for now + from descriptors, which is based on similar logic.
👍 yeap there is no special probe code in fq, just that some decoders are tried if none is specified
extra trivia: all formats and groups exist as jq functions so you can do this ex: ... | probe
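The "count successful decodes before declaring success" idea from the probe discussion can be sketched generically. This toy decoder (probeRecords and its 1-byte length framing are invented for illustration) rejects empty input instead of vacuously succeeding:

```go
package main

import (
	"errors"
	"fmt"
)

// probeRecords decodes length-prefixed records from buf and, in the spirit
// of a probe-safe decoder, fails unless at least one record decoded cleanly,
// so empty or foreign input is rejected rather than "succeeding" vacuously.
func probeRecords(buf []byte) (int, error) {
	count := 0
	for len(buf) > 0 {
		n := int(buf[0]) // toy framing: a 1-byte length prefix
		buf = buf[1:]
		if n > len(buf) {
			return count, errors.New("truncated record")
		}
		buf = buf[n:]
		count++
	}
	if count == 0 {
		return 0, errors.New("no records decoded; refusing to probe-match")
	}
	return count, nil
}

func main() {
	n, err := probeRecords([]byte{2, 'h', 'i', 1, '!'})
	fmt.Println(n, err) // prints "2 <nil>"
}
```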
@wader Something about those test failures I don't quite understand; it seems some aspect of LevelDB is leaking into the other tests involving |
Think i would try to be "true" to how the format works, so show the parts as some array of structs with a raw data field for the partial data etc. This is how most formats in fq work that have to "demux" etc, ex ogg, gzip and the tcp reassembly. This also makes it quite nice when dealing with broken files that might have some partial data that is good, then you can use fq to concat parts, maybe prefix/append some missing data and decode or output the "repaired" data.
I think it might be the thing i mentioned about probe, that the leveldb_descriptor format succeeds when it shouldn't. So i guess this will get fixed when you remove it from probe.
Yes, I hear you. I think I found a solution... to use |
handleLength := d.LimitedFn(footerEncodedLength, func(d *decode.D) {
// check for magic number and fail fast if it isn't there
d.SeekAbs(d.Len() - magicNumberLength)
d.FieldU64("magic_number", d.UintAssert(tableMagicNumber), scalar.UintHex)
If you want you can probably keep it like it was, as long as it does not decode megabytes of data before failing it should probably be ok.
I like the new way better. It seems to me the code got clearer; also with the d.BitsLeft() at the end.
Aha that explains things :) i wonder if we could have separate options for raw and string truncate limits? by looking up manually you mean using a query to access a string field?
use a query, exactly! In my imaginary world, the entire thing would be an app like Hex Fiend or some debugger, and I could just hover over it to see the full thing as a tooltip.
Yeap that would be very interesting. I can imagine some kind of "IDE" with multiple visual tree:s, data views and REPL:s.
README.md
Outdated
@@ -102,6 +102,7 @@ ipv6_packet,
jpeg,
json,
jsonl,
[leveldb_ldb](doc/formats.md#leveldb_ldb),
Do a make doc to update
Oddly, the make doc command changes a lot of content in the SVGs. I left them out of the commit, as it seems they shouldn't change.
Aha yes skip those, i will fix. I did some changes to https://github.com/wader/ansisvg some days ago
@@ -0,0 +1,57 @@
$ fq -d leveldb_descriptor dv uncompressed.ldb/MANIFEST-000004
|00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f|0123456789abcdef|.{}: uncompressed.ldb/MANIFEST-000004 (leveldb_descriptor) 0x0-0x57 (87)
What is the relation between manifest and descriptor? name for same thing?
It appears so. The term MANIFEST only seems to be used for the filename. In the codebase, they use the terms "descriptor" and "VersionEdit"; I find their inconsistency confusing here.
Hmm strange, but it was quite recently that i made synthetic something more special; before, they were a hack using zero-length bit ranges, so that might explain why "synthetic" is not used much in the code yet. For example the mp4 decoder adds track id and format here https://github.com/wader/fq/blob/master/format/mp4/mp4.go#L280-L291
Stress tested a bit more using:
Seems chrome has a bunch of leveldb files, some fail with variations of |
I appreciate the stress-testing and the command. I'll take a look tomorrow!
👍 no stress! i think the PR is in very good shape now, some more testing and fixing decode issues and we should be ready to merge. anything else you want to add?
In the LevelDB encoding, the internal key can be cut at any byte: including the user_key, type, or sequence_number. The resulting prefix is shared among subsequent keys and not specified explicitly by them. This fixes a previous mistaken belief that cuts can't happen in the last 8 bytes of the type & sequence number. Tests are added.
@wader That one took me a while to track down and fix. :)
Thank you. I searched for "synthetic"; had I searched for "FieldValue", I would have found it faster. I wrongly assumed FieldValue was just a regular reader. :)
format/leveldb/leveldb_descriptor.go
Outdated
err := readInternalKey(nil, int(length), d)
if err != nil {
// TK(2023-12-08): how do I propagate this
// error `err` into the `d` object?
There is d.IOPanic that i think is similar? maybe there should be a similar one for generic errors, or maybe d.Errorf should support %w error arguments somehow? have to read up on how that works. For now i think it can be fine to do d.Errorf("blabla: %s", err) and format out the error as a string maybe?
err.Error() seems to do the trick.
0x1d0| bd 03 | .. | value_length: 445 0x1d6-0x1d8 (2)
| | | key{}: 0x1d8-0x1e5 (13)
0x1d0| 69 70 73 75 6d | ipsum | user_key_suffix: raw bits 0x1d8-0x1dd (5)
| | | user_key: "lorem.ipsum" (inferred)
Nice :) i guess this will make it quite a bit easier to write queries?
@wader This decoding has given me a new insight into everything LevelDB stores; not just the key and the value, but apparently also the history of key-value pairs.
🥳 yeap i've had similar experiences writing decoders, you bump into lots of fascinating tricks and legacy, ex i remember zip has so much legacy to handle floppy disks and appending and also the date format uses msdos-timestamps (2 second precision!)
0x1d0| 0d | . | unshared_bytes: 13 0x1d5-0x1d6 (1)
0x1d0| bd 03 | .. | value_length: 445 0x1d6-0x1d8 (2)
| | | key{}: 0x1d8-0x1e5 (13)
0x1d0| 69 70 73 75 6d | ipsum | user_key_suffix: raw bits 0x1d8-0x1dd (5)
These are not strings?
@wader They are; I couldn't figure out if you had a scalar method I could plug into d.FieldRawLen... since I need the bytes as well:
br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)
Aha ok, the suffix bytes are not always valid utf8? anyways i think the current way is fine! but i do wonder if there should be a Raw variant that returns bytes? (maybe there is?) the reason Raw returns a bitio.BitReader is that they can be used for very large bit ranges, so that the bits will only be read if really needed, to save on IO and memory.
Aha sorry, should have been clearer :) BTW now all *.ldb files in my home directory decode fine and look beautiful!
format/leveldb/leveldb_table.go
Outdated
// case 2: type and sequence_number fit fully in unshared: simulate user_key value.
if unsharedSize >= typeAndSequenceNumberSize {
br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)
d.FieldValueStr("user_key", string(append(sharedBytes, d.ReadAllBits(br)...)), strInferred)
Could it be something like?
suffix := d.FieldUTF8("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize))
d.FieldValueStr("user_key", string(sharedBytes)+suffix, strInferred)
An alternative to using d.FieldValueStr could be to only have one user_key field and let the symbolic value for that one string be the inferred string. But i think i like how it is now, which i guess is more true to how things actually work, that it's a suffix, and then just add a bit extra to make it convenient.
btw append(sharedBytes, d.ReadAllBits(br)...) can sometimes in go be a bit tricky, if i remember correctly the sharedBytes slice might end up being reused if it has capacity, but in this case it should be fine i think as it gets turned into a new string? (which is immutable and does not share the slice bytes)
@wader I like your solution.
[...] the sharedBytes slice might end up being reused [...]
Go's append() I don't quite understand yet. Apparently it is an amortized-O(1) operation, yet it seems sharedBytes doesn't get mutated? (So it's not doing .append(...) like in Python or .push like in JavaScript.)
@wader Although, on second thought, if the user_key is cut at some UTF-8 multi-byte character, this might fail. However, maybe it is good enough.
In LevelDB, both keys and values are bytestrings; so a UTF-8 encoding would typically be used, but must not necessarily.
@wader I made some own functions for now to handle this. See recent commit.
format/leveldb/leveldb_table.go
Outdated
}
err := keyCallbackFn(keyPrefix, int(unshared), d)
if err != nil {
d.Errorf(err.Error())
Maybe these should be d.Errorf("%s", err) to be sure some random %s etc in the error string isn't interpreted as a format directive?
Thank you for the pointer. I just saw earlier that I'd used some d.Errorf("%v", err) code I had copy-pasted from avro_ocf.go without remembering.
format/leveldb/leveldb_descriptor.go
Outdated
d.FieldStruct("data", func(d *decode.D) {
err := readInternalKey(nil, int(length), d)
if err != nil {
d.Errorf(err.Error())
See other comment about d.Errorf(err.Error())
Great work! just some tiny things left i think, after that just let me know when you think you're ready to merge. Hope it was a good review experience too 😄
Thank you, @wader! I very much appreciate your reviewing style. I learned a lot about fq, Go, and LevelDB in the process. I hope it was a good review experience for you as well.
decode unfragmented .log files:
- break leveldb_log.go into leveldb_log_blocks.go and leveldb_log.go; the former is used by both .MANIFEST (descriptor) and .LOG.
- in leveldb_log, introduce readBatch that decodes further

fix UTF8 decoding:
- introduce fieldUTF8ReturnBytes and stringify to handle multi-byte UTF8 encodings correctly.
@@ -441,3 +446,21 @@ func mask(crc uint32) uint32 {
// Rotate right by 15 bits and add a constant.
return ((crc >> 15) | (crc << 17)) + kMaskDelta
}

// Concatenate byte slices and convert into a string.
func stringify(byteSlices ...[]byte) string {
If you want to save on allocations you could use a strings.Builder, or possibly sum all lengths, allocate once with make() and then append. But i have a feeling it won't make much difference in this case?
Huh, i assumed strings.Builder did more smart things by growing smartly, but maybe the only smartness is in the String() method? but i do wonder if append might have some smartness about how to grow so it won't end up reallocating/copying all the time?
@wader I'm trying to track down how append works precisely; this is the closest I've found: https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssagen/ssa.go#L3503
That code has too many concepts I don't understand; I don't have more clarity now. :)
interesting! it seems to generate a call to growslice https://github.com/golang/go/blob/fe1b2f95e6dbfb6e6212bb391706ae62eb0ae5ec/src/runtime/slice.go#L155 which in turn seems to call nextslicecap https://github.com/golang/go/blob/fe1b2f95e6dbfb6e6212bb391706ae62eb0ae5ec/src/runtime/slice.go#L267 which has some heuristics for how to grow
https://godbolt.org/z/rEPerb3b4 maybe interesting. i would have assumed that append would just end up being a call to something, but it seems there's lots of trickery going on, maybe to speed up common cases etc?
@wader Oh, checking the compiler explorer was very insightful!! runtime.growslice(SB)
clearly shows the connection to slice.go now. I don't quite see the big picture yet, but now I know where to start if I wanted to dive deeper. :)
return string(result)
}

func fieldUTF8ReturnBytes(name string, nBytes int, d *decode.D) []byte {
I've experimented and thought a bit about doing a decode api v2 that would use method chaining, then things could probably be nicer, something like:
d.FieldUTF8("bla", nBytes).Bytes()
d.FieldUTF8("bla", nBytes).BitReader()
d.FieldUTF8("bla", nBytes).Map(...).Bytes()
Think that would make it a bit more flexible, and would probably also cut down a bit on the amount of generated code. Also! it can probably help make the api more type safe, as e.g. d.FieldUTF8 would return some type forcing things to use string. Should think more about this :)
br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)
d.FieldValueStr("user_key", string(append(sharedBytes, d.ReadAllBits(br)...)), strInferred)
suffix := fieldUTF8ReturnBytes("user_key_suffix", unsharedSize-typeAndSequenceNumberSize, d)
d.FieldValueStr("user_key", stringify(sharedBytes, suffix), strInferred)
So if i understand correctly: shared and suffix are seen as bytes, so they might not be valid utf8, and reading them as utf8 strings won't preserve the raw bytes if illegal etc? normal golang strings i think can store raw bytes, but fq's d.UTF8 uses unicode.UTF8BOM which i suspect might replace illegal bytes with error or replacement runes? anyways if this seems to work keep it like this!
Ah, interesting point with UTF8BOM (0xEF, 0xBB, 0xBF)!
I hadn't come across this before nor thought about it.
It seems the Python encoder natively skips it (unless specified explicitly), and some of the LevelDB dumps I've observed so far don't use the UTF8BOM.
That said, since LevelDB takes bytestrings as keys and values, that wouldn't stop anyone from using a UTF8BOM if they wanted to.
So maybe the utf-8 encoder will be good enough in most cases?
Safest i think is probably to keep it as is, treat the parts as raw bytes. It's a bit weird that UTF8 with a BOM is even a thing, as UTF8 does not really have a byte order :) but it seems to happen 🤷 the encoding used by d.UTF8 is chosen to be quite liberal with weird stuff :)
BTW if you want to have full control over text encoding then there is d.FieldStr("...", <encoding>) where you can plug in any encoding.Encoding.
Happy to hear that! i can sometimes end up thinking that i might come across as "naggy" when there is a lot of back and forth, i'm not! superhappy someone wants to help out :) also a bit tricky to review things about the format itself as i have very little experience with it, trying to mostly focus on helping it fit well into the rest of fq :)
"lorem.lorem": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
"lorem.ipsum": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
"lorem.dolor": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
"row": "Row, row, row your boat\nGently down the stream.\nMerrily, merrily, merrily, merrily,\nLife is but a dream. 🚣‍♂️"
Now i noticed the 🚣♂️ :)
Played around a bit more with Chrome leveldb:s, seems to work fine! is a bit confusing to look at web storage databases, lots of weird stuff stored in them :) Think i'm kind of ready to merge if you are
@wader I'm ready! :)
🥳
No description provided.