leveldb: Add LevelDB support #824

mikez · 2023-12-04T17:51:29Z

No description provided.

Example: go run . -d ldb d format/ldb/testdata/000005.ldb

wader

Nice start! did some initial reviewing to get going

wader · 2023-12-04T22:07:23Z

format/format.go

@@ -125,6 +125,7 @@ var (
 	JPEG                = &decode.Group{Name: "jpeg"}
 	JSON                = &decode.Group{Name: "json"}
 	JSONL               = &decode.Group{Name: "jsonl"}
+	LDB                 = &decode.Group{Name: "ldb"}


What do you think about naming it "leveldb" for clarity?

Yes. And maybe even go one step further and use the MPEG pattern here of doing leveldb_ldb (and eventually leveldb_log etc.)?

yeap makes sense, see my other comments about it

wader · 2023-12-04T22:08:04Z

format/ldb/ldb.go

+var compressionTypes = scalar.UintMapSymStr{
+	compressionTypeNone:      "none",
+	compressionTypeSnappy:    "Snappy",
+	compressionTypeZstandard: "Zstandard",


Keep all lowercase?

Possibly "zstd"? not sure what is most common

wader · 2023-12-04T22:17:03Z

format/ldb/ldb.go

+	var indexSize int64
+	var metaIndexOffset int64
+	var metaIndexSize int64
+


I wonder if use of d.LimitedFn or d.RangeFn could simplify some field length calculations? mostly thinking of the use of d.Pos/d.Len, is the padding dynamic in size somehow?

Thanks for the pointer to these functions. Rewrote.

Yeap, they are quite nice sometimes, so one doesn't have to keep track of the number of bits left etc. Note that I think d.Pos() inside one of these still returns the absolute position in the "root" buffer, but d.Len() will change behaviour. Maybe there should be d.RootPos()/d.RootLen() etc?

wader · 2023-12-04T22:17:47Z

format/ldb/ldb.go

+			compressedSize := size
+			compressed := data
+			bb := &bytes.Buffer{}
+			_ = bb


debug leftover i guess :)

wader · 2023-12-04T22:26:29Z

format/ldb/ldb.go

+
+	// index
+
+	d.SeekAbs(indexOffset * 8)


Wonder if d.FramedLen or d.RangeLen could be used here instead of seek and passing the size?

You mean d.FramedFn and d.RangeFn? It is not clear to me how this makes the code clearer.
We find these offsets in the footer, and then jump to them to read the data.

No worries, only use if it feels it makes things clearer. d.RangeFn can sometimes be used instead of combination of seek and the limit
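For readers unfamiliar with the "find the offsets in the footer, then jump" pattern discussed above, here is a minimal sketch in plain Go, outside of fq's decode API. It assumes the footer layout described in LevelDB's table format document: a metaindex handle and an index handle (each two varints), padding, and an 8-byte little-endian magic number. The names `blockHandle`, `parseFooter` and `buildSampleTable` are hypothetical, and the magic constant is quoted from memory of the LevelDB source, so treat it as an assumption.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	footerLen = 4*10 + 8 // four varints of at most 10 bytes each, plus the 8-byte magic number

	// Assumed value of the table magic number, stored little-endian at the end of the file.
	tableMagic uint64 = 0xdb4775248b80fb57
)

// blockHandle locates a block inside the table file.
type blockHandle struct {
	offset uint64
	size   uint64
}

// readHandle decodes two consecutive varints (offset, size) and reports how many bytes it consumed.
func readHandle(buf []byte) (blockHandle, int, error) {
	offset, n1 := binary.Uvarint(buf)
	if n1 <= 0 {
		return blockHandle{}, 0, errors.New("bad offset varint")
	}
	size, n2 := binary.Uvarint(buf[n1:])
	if n2 <= 0 {
		return blockHandle{}, 0, errors.New("bad size varint")
	}
	return blockHandle{offset: offset, size: size}, n1 + n2, nil
}

// parseFooter checks the magic number first (fail fast), then reads the
// metaindex and index handles from the start of the footer.
func parseFooter(file []byte) (blockHandle, blockHandle, error) {
	if len(file) < footerLen {
		return blockHandle{}, blockHandle{}, errors.New("file shorter than footer")
	}
	footer := file[len(file)-footerLen:]
	if binary.LittleEndian.Uint64(footer[footerLen-8:]) != tableMagic {
		return blockHandle{}, blockHandle{}, errors.New("magic number mismatch")
	}
	metaIndex, n, err := readHandle(footer)
	if err != nil {
		return blockHandle{}, blockHandle{}, err
	}
	index, _, err := readHandle(footer[n:])
	if err != nil {
		return blockHandle{}, blockHandle{}, err
	}
	return metaIndex, index, nil
}

// buildSampleTable returns 100 zero bytes of "body" followed by a footer.
func buildSampleTable() []byte {
	footer := make([]byte, footerLen)
	n := binary.PutUvarint(footer, 60)     // metaindex offset
	n += binary.PutUvarint(footer[n:], 10) // metaindex size
	n += binary.PutUvarint(footer[n:], 70) // index offset
	binary.PutUvarint(footer[n:], 30)      // index size
	binary.LittleEndian.PutUint64(footer[footerLen-8:], tableMagic)
	return append(make([]byte, 100), footer...)
}

func main() {
	metaIndex, index, err := parseFooter(buildSampleTable())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("metaindex=%+v index=%+v\n", metaIndex, index)
}
```

In fq itself the same idea would be expressed with d.SeekAbs or d.RangeFn rather than slicing; the sketch only illustrates the order of operations.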

wader · 2023-12-04T22:32:06Z

format/ldb/ldb.go

+		})
+	})
+	// TK: how do you make an empty entries-array appear _above_ the trailer?
+	// Right now, its omited if empty.


They both end up in a struct? Then they will be sorted by bit range start... and I guess it might be that the entries array ends up with a zero range atm, hmm, feels like a bug 🤔

Typically a key-value block has (1) entries and (2) trailer. However, occasionally there's only a trailer and no entries. What to do in such scenarios?

Option 1: leave away the entries array all together. (Current solution.)

Option 2: Show an empty entries array. (Proposition. However, it ends up being shown below the trailer, since it's empty. We need to read the trailer first, to figure out if there's an entries array or not.)

I guess it would be nice to have an empty entries array for clarity? i will have to look closer at how to solve this. But don't think it's a blocker to merge, can be changed/fixed later.

wader · 2023-12-04T22:35:27Z

format/ldb/ldb.go

+	d.FieldRawLen("raw", size*8)
+}
+
+func decodeVarInt(d *decode.D) uint64 {


Is this LEB128? thinking maybe can use d.FieldULEB128

Made a remark in tryULEB128 that it's the same as "Base 128 Varint" (a term used in Google contexts).
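As a standalone illustration of why "Base 128 Varint" and unsigned LEB128 are the same encoding, here is a small sketch in plain Go. It is not fq's tryULEB128; the function name `decodeULEB128` and the 10-byte limit are assumptions for the example. Each byte contributes its low 7 bits, least significant group first, and the high bit signals that more bytes follow.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeULEB128 decodes one unsigned LEB128 ("Base 128 Varint") value from
// the start of buf and returns the value and the number of bytes consumed.
func decodeULEB128(buf []byte) (uint64, int, error) {
	var result uint64
	for i, b := range buf {
		if i == 10 {
			// A 64-bit value never needs more than 10 groups of 7 bits.
			return 0, 0, errors.New("varint too long")
		}
		result |= uint64(b&0b01111111) << (7 * uint(i))
		if b&0b10000000 == 0 {
			return result, i + 1, nil
		}
	}
	return 0, 0, errors.New("truncated varint")
}

func main() {
	// The bytes "bd 03" appear in the test data of this PR as value_length: 445.
	v, n, err := decodeULEB128([]byte{0xbd, 0x03})
	fmt.Println(v, n, err) // 445 2 <nil>
}
```

The sketch does not check for overflow in the tenth byte; a production decoder would.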

wader · 2023-12-04T23:03:47Z

format/ldb/testdata/ldb.fqtest

@@ -0,0 +1,98 @@
+$ fq -d ldb dv uncompressed.ldb/000005.ldb


snappy test will be added later?

wader · 2023-12-04T23:08:58Z

format/ldb/testdata/make_ldb.py

@@ -0,0 +1,45 @@
+# Make LevelDB data: both uncompressed and compressed.


Nice with code to generate test cases!

wader · 2023-12-04T23:11:22Z

format/ldb/ldb.go

+	interp.RegisterFormat(
+		format.LDB,
+		&decode.Format{
+			Description: "LevelDB Table",


You can add a .md file named the same as the format and add various tips, todos and author. see maybe bson.md and bson.go how it works. It will also be used for help in the CLI and for generating formats.md with make doc

mikez · 2023-12-05T10:32:53Z

@wader Thank you for the comments. Now also read dev.md and made updates accordingly.

mikez · 2023-12-05T10:44:15Z

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed?

Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, this can be distracting to have that compressed property show up as well. What's your default way to allow to hide the compressed parts if we're already showing the decompressed parts?

wader · 2023-12-05T16:56:42Z

format/format.go

@@ -125,6 +125,7 @@ var (
 	JPEG                = &decode.Group{Name: "jpeg"}
 	JSON                = &decode.Group{Name: "json"}
 	JSONL               = &decode.Group{Name: "jsonl"}
+	LDB                 = &decode.Group{Name: "leveldb_ldb"}


Change to LevelDB_LDB.

Do you know if there are other leveldb related formats that might be interesting in the future? thinking if this is leveldb_table (https://github.com/google/leveldb/blob/main/doc/table_format.md?) and then maybe there will be leveldb_log etc? but it's not a big deal to rename things later on anyway i think, also the format can be probed

@wader 👌, good idea. Besides the suggested modification, I now also added leveldb_log and leveldb_descriptor, thereby keeping the names as in LevelDB's leveldbutil dump command.

wader · 2023-12-05T16:57:48Z

format/leveldb/leveldb_ldb.go

+	footerEncodedLength = (4*10 + 8) * 8
+	magicNumberLength   = 8 * 8
+	// leading 64 bits of
+	//     echo http://code.google.com/p/leveldb/ | sha1sum


Interesting trivia :)
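The shell pipeline in that comment can be mirrored in Go roughly as follows. This is only a sketch: `leading64` is a hypothetical helper, and the example deliberately prints the result rather than claiming a value. Note that `echo` appends a newline, so the hashed input must include it.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// leading64 returns the first 64 bits of the SHA-1 digest of s,
// read as a big-endian integer (the order in which sha1sum prints hex digits).
func leading64(s string) uint64 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

func main() {
	// Equivalent of: echo http://code.google.com/p/leveldb/ | sha1sum
	fmt.Printf("%#x\n", leading64("http://code.google.com/p/leveldb/\n"))
}
```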

wader · 2023-12-05T16:58:45Z

format/leveldb/leveldb_ldb.go

+	0x1: "value",
+}
+
+type BlockHandle struct {


Maybe blockHandle as it's internal?

Maybe same for the members also?

Ah, privacy/exports in Go is solved by title-casing! Interesting language design decision.

yeap, let's say you learn to be ok with it :) not sure how well it agrees with golang's principle of clarity and explicitness 🤔

wader · 2023-12-05T17:01:05Z

format/leveldb/leveldb_ldb.go

+				}
+				d.Copy(bb, bytes.NewReader(decompressed))
+			default:
+				d.Fatalf("Unsupported compression type: %x", compressionType)


If it's possible to continue decoding even if there is an error you can use d.Errorf and then using -f will ignore the error

👍 Replaced all d.Fatalf with d.Errorf where it made sense.

I lied btw, it's -o force=true not -f... but maybe there should be a shorthand for it :)

wader · 2023-12-05T17:01:56Z

format/leveldb/leveldb_ldb.go

+			default:
+				d.Fatalf("Unsupported compression type: %x", compressionType)
+			}
+			d.FieldStructRootBitBufFn("uncompressed", bitio.NewBitReader(bb.Bytes(), -1), func(d *decode.D) {


I really should come up with a nicer API for the nested decoding stuff, feels a bit like a hack atm

wader · 2023-12-05T17:03:20Z

pkg/decode/read.go

-		result |= (b & 0x7f) << shift
-		if b&0x80 == 0 {
+		result |= (b & 0b01111111) << shift
+		if b&0b10000000 == 0 {


Nice clarifications here 👍

wader · 2023-12-05T17:09:04Z

format/leveldb/leveldb_ldb.go

+			Groups:      []*decode.Group{format.Probe},
+			DecodeFn:    ldbDecode,
+		})
+	interp.RegisterFS(leveldbFS)


Some formats have a torepr implementation, which is a function that converts the decode tree into a more user-friendly representation. Not sure if that is interesting for leveldb? For example bson has https://github.com/wader/fq/blob/master/format/bson/bson.jq. Also don't forget to register the format-overloaded function https://github.com/wader/fq/blob/master/format/bson/bson.go#L24 (it's part of a hack to make it possible to have kind of "polymorphic" functions)

Thanks for the pointers. It's not clear to me in which scenarios you'd use torepr.

It's probably only useful to have if the format itself is used to encode some structure, if we take bson example again:

# this shows all the nitty gritty details of bson encoding
$ fq -d bson -o line_bytes=8 d format/bson/testdata/test.bson
     │00 01 02 03 04 05 06 07│01234567│.{}: format/bson/testdata/test.bson (bson)
0x000│41 01 00 00            │A...    │  size: 321
     │                       │        │  elements[0:17]:
     │                       │        │    [0]{}: element
0x000│            01         │    .   │      type: "double" (1) (64-bit binary floating point)
0x000│               64 6f 75│     dou│      name: "dou"
0x008│00                     │.       │
0x008│   29 5c 8f c2 f5 b0 58│ )\....X│      value: 98.765
0x010│40                     │@       │
     │                       │        │    [1]{}: element
0x010│   02                  │ .      │      type: "string" (2) (UTF-8 string)
0x010│      73 74 72 00      │  str.  │      name: "str"
0x010│                  0a 00│      ..│      length: 10
0x018│00 00                  │..      │
0x018│      6d 79 20 73 74 72│  my str│      value: "my string"
0x020│69 6e 67 00            │ing.    │
...
# this shows the value the bson structure "represents"
$ fq -d bson torepr format/bson/testdata/test.bson
{
  ...
  "dou": 98.765,
  ...
  "str": "my string",
  ...
}

So it will only make sense if whatever is encoded can be translated into a jq value, and into something an "end user" would kind of expect

@wader Ah, I think I get it! This was a helpful example for me. Thank you. :)
Yes, this applies here, especially for descriptors. However, this reminds me of another question I have. I will ask below.

wader · 2023-12-05T17:15:15Z

doc/formats.md

+### Limitations
+
+- no Meta Blocks (like "filter") are decoded yet.
+- Zstandard uncompression is not implemented yet.


This more or less just depends on some zstd package? I've looked at https://github.com/klauspost/compress a couple of times; it has zstd and lots of other formats, and I also suspect the APIs are a bit more low level than the golang stdlib, so it might fit fq better. It can maybe replace github.com/golang/snappy also? Impressively, it seems it also has zero dependencies on other packages

I don't know how common zstd is in the wild. I've never come across it so far in the LevelDB samples I've seen.

Ok, then i think we can skip it for now, probably good to not make this PR grow too much also

wader · 2023-12-05T17:17:35Z

format/leveldb/testdata/ldb_uncompressed.fqtest

+     |                                               |                |          key_delta{}: 0x611-0x61a (9)
+0x610|   73                                          | s              |            user_key: "s" 0x611-0x612 (1)
+0x610|      01                                       |  .             |            type: "value" (0x1) 0x612-0x613 (1)
+0x610|         ff ff ff ff ff ff ff                  |   .......      |            sequence_number: 72057594037927935 0x613-0x61a (7)


Format as hex? but maybe decimal make sense if it's a sequence

Typically sequence makes sense in decimal. They are sequential numbers. The only exception are these index and metaindex sections; especially the example you highlighted, where the maximum possible unsigned integer 0xffffffffffffff is chosen.

Ok! keep as decimal. btw if 0xffffffffffffff has some special meaning you can add a description. For example the mp4 decoder does something like this for box size:

const (
	boxSizeRestOfFile   = 0
	boxSizeUse64bitSize = 1
)

d.FieldU32("size", scalar.UintMapDescription{
	boxSizeRestOfFile:   "Rest of file",
	boxSizeUse64bitSize: "Use 64 bit size",
})

In your case you could probably just have d.FieldU64("...", scalar.UintMapDescription{0xffffffffffffff: "blabla"})

@wader It's unclear to me if it has a special meaning. I only noticed so far that in the index and metaindex there are sometimes sequence numbers like that; it feels more like a hack though. The sequence numbers are typically used in the data section and there they have a well-defined meaning.

Ok! then sounds like we should leave as it is
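For context on why the largest value seen is 0xffffffffffffff (56 bits) rather than a full 64-bit maximum: in the test output above, each key ends with a 1-byte type followed by a 7-byte sequence number. A minimal sketch of splitting such a key in plain Go follows. It assumes the 8-byte trailer is a little-endian 64-bit integer with the type in the low byte, as in LevelDB's internal key format; `internalKey` and `splitInternalKey` are hypothetical names, not the PR's code.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// internalKey is the decoded form of user_key + 8-byte trailer.
type internalKey struct {
	userKey  []byte
	keyType  uint8
	sequence uint64 // only the low 56 bits are used
}

// splitInternalKey separates the user key from the trailer and unpacks the
// trailer as (sequence << 8) | type, stored little-endian.
func splitInternalKey(key []byte) (internalKey, error) {
	const trailerLen = 8
	if len(key) < trailerLen {
		return internalKey{}, errors.New("internal key shorter than 8-byte trailer")
	}
	split := len(key) - trailerLen
	trailer := binary.LittleEndian.Uint64(key[split:])
	return internalKey{
		userKey:  key[:split],
		keyType:  uint8(trailer & 0xff),
		sequence: trailer >> 8,
	}, nil
}

func main() {
	// Bytes taken from the fqtest excerpt above: "s", type 0x01, seven 0xff bytes.
	key := []byte{'s', 0x01, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
	ik, err := splitInternalKey(key)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("user_key=%q type=%d sequence_number=%d\n", ik.userKey, ik.keyType, ik.sequence)
	// user_key="s" type=1 sequence_number=72057594037927935
}
```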

wader · 2023-12-05T17:31:24Z

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed?

Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, this can be distracting to have that compressed property show up as well. What's your default way to allow to hide the compressed parts if we're already showing the decompressed parts?

I usually lean towards not hiding things, as that is kind of what fq is all about. In this case, if you did not add "compressed" fields, would those bit ranges show up as gap fields instead?

wader · 2023-12-05T17:34:17Z

The failing test is about TestFormats/all/all.fqtest, probably just a matter of WRITE_ACTUAL=1 go test -run TestFormats/all/all.fqtest ./format. I've thought about redoing those tests... would be nice if adding a format would affect as little as possible outside the formats own directory

mikez · 2023-12-06T18:20:35Z

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed?
Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, this can be distracting to have that compressed property show up as well. What's your default way to allow to hide the compressed parts if we're already showing the decompressed parts?

I usually lean towards not hide things as that is kind of what fq is all about. In this case if you did not add "compressed" fields would those bit ranges show as gaps fields instead?

I'd like you to hear me differently.

Say there is a chunk of data that is compressed with Snappy. I uncompress this data and show the uncompressed data-structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which has the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (which have all been uncompressed and have a corresponding "uncompressed" key) it can get quite unwieldy to skim the output. Therefore the inquiry into if it makes sense to hide these "compressed" sections (or more forcibly truncating them in the preview) if there are already corresponding "uncompressed" sections.

wader · 2023-12-06T18:37:00Z

I see. As it works currently, if you didn't add "compressed" fields they would instead show up as "gap" fields that fq automatically inserts; this is so that all bits will always be "reachable" somehow. Maybe what you're looking for is the ability to add a "compressed" field but somehow tell fq that it should be displayed in a more discreet way unless in verbose mode etc? I'm thinking that totally hiding a field might be confusing, as it will look as if there is a gap in the data/hex column? Or maybe I misunderstand what you're aiming for?

mikez · 2023-12-06T18:39:23Z

@wader Question about the LevelDB log format and how to decompress it.

In essence the data structure is as follows:

We have a sequence of records, each with some data. The user can iterate over these using the LevelDB API.
In the .log-file itself, these records are split into many small pieces and put into blocks of 32KB. Each little piece has a marker if it's an entire record ("full") or only a fragment ("first", "middle", "last").
In the LevelDB app itself, these small little pieces are put together and the data structure is parsed.

I hope that's clear so far.

Thus my question is: for the pieces which are preserved in full, it's easy to show the underlying data structure, since it hasn't been split. However, for the "first", "middle", and "last" pieces, I wouldn't know how to visualize it. Is there some precedent here in the other formats?

wader · 2023-12-06T18:39:57Z

format/format.go

@@ -125,6 +125,9 @@ var (
 	JPEG                = &decode.Group{Name: "jpeg"}
 	JSON                = &decode.Group{Name: "json"}
 	JSONL               = &decode.Group{Name: "jsonl"}
+	LevelDB_Descriptor  = &decode.Group{Name: "leveldb_descriptor"}
+	LDB                 = &decode.Group{Name: "leveldb_table"}
+	LOG                 = &decode.Group{Name: "leveldb_log"}


Prefix with LevelDB_

wader · 2023-12-06T18:41:07Z

format/leveldb/leveldb_descriptor.go

+		for {
+			if d.End() {
+				break
+			}


Could this be for !d.End() {?

wader · 2023-12-06T18:48:43Z

format/leveldb/leveldb_log.go

+		format.LOG,
+		&decode.Format{
+			Description: "LevelDB Log",
+			Groups:      []*decode.Group{format.Probe},


There are some things to be careful about when a format is in the probe group. For example make sure to "fail fast" if some magic etc is not found and to make sure that empty input does not succeed, ex if you have for !d.End() { at the "root" maybe it's good to count number of things successfully decoded and then after the loop make sure at least N was decoded.
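The "count what was decoded and require at least N" idea can be sketched independently of fq like this. The record layout here (a 1-byte length followed by that many bytes) is purely hypothetical and chosen to keep the example short; only the shape of the loop and the check after it is the point.

```go
package main

import "fmt"

// decodeRecords decodes a hypothetical stream of length-prefixed records and
// refuses input that yields fewer than minRecords, so that empty or
// near-empty input does not count as a successful decode.
func decodeRecords(data []byte, minRecords int) ([][]byte, error) {
	var records [][]byte
	pos := 0
	for pos < len(data) {
		n := int(data[pos])
		pos++
		if pos+n > len(data) {
			// Fail fast on the first malformed record.
			return nil, fmt.Errorf("record %d: truncated", len(records))
		}
		records = append(records, data[pos:pos+n])
		pos += n
	}
	if len(records) < minRecords {
		return nil, fmt.Errorf("decoded %d records, need at least %d", len(records), minRecords)
	}
	return records, nil
}

func main() {
	_, err := decodeRecords(nil, 1)
	fmt.Println(err) // decoded 0 records, need at least 1

	recs, err := decodeRecords([]byte{2, 'h', 'i', 1, '!'}, 1)
	fmt.Println(len(recs), err) // 2 <nil>
}
```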

I was not quite aware of what format.Probe was doing (I copy-pasted that part); now I am. Since the log files don't have magic strings of any kind in them, I think it's better to exclude them from probing for now, along with descriptors, which are based on similar logic.

👍 yeap there is no special probe code in fq, just that some decoders are tried if none is specified

extra trivia: all formats and group exist as jq functions so you can do this ex: ... | probe

mikez · 2023-12-06T18:50:19Z

@wader Something about those test failures I don't quite understand; it seems some aspect of LevelDB is leaking into the other tests involving FieldFormatBitBuf.

wader · 2023-12-06T18:58:26Z

@wader Question about the LevelDB log format and how to decompress it.

In essence the data structure is as follows:

We have a sequence of records, each with some data. The user can iterate over these using the LevelDB API.

In the .log-file itself, these records are split into many small pieces and put into blocks of 32KB. Each little piece has a marker if it's an entire record ("full") or only a fragment ("first", "middle", "last").

In the LevelDB app itself, these small little pieces are put together and the data structure is parsed.

I hope that's clear so far.

Thus my question is: for the pieces which are preserved in full, it's easy to show the underlying data structure, since it hasn't been split. However, for the "first", "middle", and "last" pieces, I wouldn't know how to visualize it. Is there some precedent here in the other formats?

I think I would try to be "true" to how the format works, so show the parts as some array of structs with a raw data field for the partial data etc. This is how most formats in fq that have to "demux" work, e.g. ogg, gzip and the tcp reassembly. This also makes it quite nice when dealing with broken files that might have some partial data that is good: then you can use fq to concat parts, maybe prepend/append some missing data, and decode or output the "repaired" data.
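To make the "full"/"first"/"middle"/"last" scheme concrete, here is a rough sketch of reassembling records from an ordered list of fragments in plain Go. It is not how fq or the PR does it; `fragment` and `reassemble` are hypothetical names, and the numeric type values 1 to 4 are assumed from LevelDB's log format document.

```go
package main

import (
	"errors"
	"fmt"
)

type fragmentType int

// Assumed numeric values of the fragment types.
const (
	full   fragmentType = 1
	first  fragmentType = 2
	middle fragmentType = 3
	last   fragmentType = 4
)

// fragment is one piece as stored in a 32KB block.
type fragment struct {
	kind fragmentType
	data []byte
}

// reassemble concatenates first/middle/last fragments into whole records and
// passes full fragments through unchanged.
func reassemble(frags []fragment) ([][]byte, error) {
	var records [][]byte
	var pending []byte
	inRecord := false
	for i, f := range frags {
		switch f.kind {
		case full, first:
			if inRecord {
				return nil, fmt.Errorf("fragment %d: new record starts inside an unfinished one", i)
			}
			// Copy so that the result does not alias the input buffers.
			buf := append([]byte(nil), f.data...)
			if f.kind == full {
				records = append(records, buf)
			} else {
				pending, inRecord = buf, true
			}
		case middle, last:
			if !inRecord {
				return nil, fmt.Errorf("fragment %d: continuation without a first fragment", i)
			}
			pending = append(pending, f.data...)
			if f.kind == last {
				records = append(records, pending)
				pending, inRecord = nil, false
			}
		default:
			return nil, fmt.Errorf("fragment %d: unknown type %d", i, f.kind)
		}
	}
	if inRecord {
		return nil, errors.New("stream ends inside a fragmented record")
	}
	return records, nil
}

func main() {
	recs, err := reassemble([]fragment{
		{full, []byte("a")},
		{first, []byte("he")},
		{middle, []byte("ll")},
		{last, []byte("o")},
	})
	fmt.Printf("%q %v\n", recs, err) // ["a" "hello"] <nil>
}
```

Keeping the fragments visible as an array and doing this concatenation as a separate step is what makes partial recovery from damaged files possible.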

wader · 2023-12-06T19:12:10Z

@wader Something about those test failures I don't quite understand; it seems some aspect of LevelDB is leaking into the other tests involving FieldFormatBitBuf.

I think it might be the thing I mentioned about probe, that the leveldb_descriptor format succeeds when it shouldn't. So I guess this will get fixed when you remove it from probe.

mikez · 2023-12-06T22:26:03Z

I'd like you to hear me differently.
Say there is a chunk of data that is compressed with Snappy. I uncompress this data and show the uncompressed data-structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which has the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (which have all been uncompressed and have a corresponding "uncompressed" key) it can get quite unwieldy to skim the output. Therefore the inquiry into if it makes sense to hide these "compressed" sections (or more forcibly truncating them in the preview) if there are already corresponding "uncompressed" sections.

I see. As it works currently, if you didn't add "compressed" fields they would instead show up as "gap" fields that fq automatically inserts; this is so that all bits will always be "reachable" somehow. Maybe what you're looking for is the ability to add a "compressed" field but somehow tell fq that it should be displayed in a more discreet way unless in verbose mode etc? I'm thinking that totally hiding a field might be confusing, as it will look as if there is a gap in the data/hex column? Or maybe I misunderstand what you're aiming for?

Yes, I hear you. I think I found a solution... to use d instead of dd. That nicely truncates the compressed parts into one line. Some of the regular strings might be truncated as well like that, but I could look them up manually as needed.

wader · 2023-12-06T22:31:37Z

format/leveldb/leveldb_table.go

-		handleLength := d.LimitedFn(footerEncodedLength, func(d *decode.D) {
+		// check for magic number and fail fast if it isn't there
+		d.SeekAbs(d.Len() - magicNumberLength)
+		d.FieldU64("magic_number", d.UintAssert(tableMagicNumber), scalar.UintHex)


If you want you can probably keep it like it was, as long as it does not decode megabytes of data before failing it should probably be ok.

I like the new way better. It seems to me the code got clearer; also with the d.BitsLeft() at the end.

wader · 2023-12-06T22:36:16Z

Yes, I hear you. I think I found a solution... to use d instead of dd. That nicely truncates the compressed parts into one line. Some of the regular strings might be truncated as well like that, but I could look them up manually as needed.

Aha that explains things :) i wonder if we could have separate options for raw and string truncate limit? by look up manually you mean use a query to access a string field?

mikez · 2023-12-06T22:43:14Z

use a query, exactly! In my imaginary world, the entire thing would be an App like Hex Fiend or some debugger, and I could just hover over it to see the full thing as a tooltip.

wader · 2023-12-07T10:31:34Z

Yeap, that would be very interesting. I can imagine some kind of "IDE" with multiple visual trees, data views and REPLs.

wader · 2023-12-07T10:38:40Z

README.md

@@ -102,6 +102,7 @@ ipv6_packet,
 jpeg,
 json,
 jsonl,
+[leveldb_ldb](doc/formats.md#leveldb_ldb),


Do a make doc to update

Oddly, the make doc command changes a lot of content in the SVGs. I left them out of the commit, as it seems they shouldn't change.

Aha yes skip those, i will fix. I did some changes to https://github.com/wader/ansisvg some days ago

wader · 2023-12-07T10:40:56Z

format/leveldb/testdata/leveldb_manifest.fqtest

@@ -0,0 +1,57 @@
+$ fq -d leveldb_descriptor dv uncompressed.ldb/MANIFEST-000004
+    |00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f|0123456789abcdef|.{}: uncompressed.ldb/MANIFEST-000004 (leveldb_descriptor) 0x0-0x57 (87)


What is the relation between manifest and descriptor? name for same thing?

It appears so. The term MANIFEST only seems to be used for the filename. In the codebase, they use the terms "descriptor" and "VersionEdit"; I find their inconsistency confusing here.

wader · 2023-12-07T17:41:08Z

d.FieldValue

I tried to search for "synthetic field" in the code, but couldn't find any examples, except for the flag in scalar.go. How would I implement this?

Hmm strange, but it was quite recently that i made synthetic something more special, before they were a hack using zero length bit ranges so it might explain why "synthetic" is not used much in the code yet.

For example mp4 decoder adds track id and format here https://github.com/wader/fq/blob/master/format/mp4/mp4.go#L280-L291

wader · 2023-12-07T19:36:48Z

Stress tested a bit more using:

find ~/ -name "*.ldb" -print0 | xargs -0 go run . -d leveldb_table '._error | select(.) | input_filename, .error'

Seems Chrome has a bunch of leveldb files; some fail with variations of "UTF8(user_key): failed at position 127 (read size 0 seek pos 0): tryText nBytes must be >= 0 (-1)"

mikez · 2023-12-07T22:24:38Z

I appreciate the stress-testing and the command. I'll take a look tomorrow!

wader · 2023-12-07T22:28:18Z

👍 no stress! i think the PR is in very good shape now, some more testing and fix decode issues and we should be ready to merge. anything else you want to add?

In the LevelDB encoding, the internal key can be cut at any byte: including the user_key, type, or sequence_number. The resulting prefix is shared among subsequent keys and not specified explicitly by them. This fixes a previous mistaken belief that cuts can't happen in the last 8 bytes of the type & sequence number. Tests are added.
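A small sketch of the shared-prefix scheme described in that commit message, in plain Go. The `entry` type and `restoreKeys` are hypothetical and not the PR's code; the example data is made up so that the cut falls inside the trailing 8 bytes (after the type byte), which is the case the fix addresses.

```go
package main

import "fmt"

// entry is one prefix-compressed key: the number of bytes reused from the
// previous full key, plus the bytes actually stored.
type entry struct {
	shared   int
	unshared []byte
}

// restoreKeys rebuilds the full internal keys. The shared prefix may end at
// any byte, including inside the trailing type and sequence number.
func restoreKeys(entries []entry) ([][]byte, error) {
	var keys [][]byte
	var prev []byte
	for i, e := range entries {
		if e.shared < 0 || e.shared > len(prev) {
			return nil, fmt.Errorf("entry %d: shared=%d exceeds previous key length %d", i, e.shared, len(prev))
		}
		// Allocate a fresh slice so that keys never alias each other.
		key := make([]byte, 0, e.shared+len(e.unshared))
		key = append(key, prev[:e.shared]...)
		key = append(key, e.unshared...)
		keys = append(keys, key)
		prev = key
	}
	return keys, nil
}

func main() {
	keys, err := restoreKeys([]entry{
		// "lorem" + type 0x01 + 7-byte sequence number 5
		{0, []byte("lorem\x01\x05\x00\x00\x00\x00\x00\x00")},
		// Shares "lorem" and the type byte; only the sequence number differs.
		{6, []byte("\x06\x00\x00\x00\x00\x00\x00")},
	})
	fmt.Printf("%q %v\n", keys, err)
}
```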

mikez · 2023-12-08T22:38:43Z

@wader That one took me a while to track down and fix. :)

mikez · 2023-12-08T22:40:21Z

d.FieldValue

I tried to search for "synthetic field" in the code, but couldn't find any examples, except for the flag in scalar.go. How would I implement this?

Hmm strange, but it was quite recently that i made synthetic something more special, before they were a hack using zero length bit ranges so it might explain why "synthetic" is not used much in the code yet.

For example mp4 decoder adds track id and format here https://github.com/wader/fq/blob/master/format/mp4/mp4.go#L280-L291

Thank you. I searched for "synthetic"; had I searched for "FieldValue", then I would have found it faster. I wrongly assumed FieldValue was just a regular reader. :)

wader · 2023-12-08T22:53:39Z

format/leveldb/leveldb_descriptor.go

+			err := readInternalKey(nil, int(length), d)
+			if err != nil {
+				// TK(2023-12-08): how do I propagate this
+				// error `err` into the `d` object?


There is d.IOPanic that I think is similar? Maybe there should be a similar one for generic errors, or maybe d.Errorf should support %w error arguments somehow? I have to read up on how that works. For now I think it can be fine to do d.Errorf("blabla: %s", err) and format the error as a string, maybe?

err.Error() seems to do the trick.

wader · 2023-12-08T23:02:49Z

format/leveldb/testdata/leveldb_table_uncompressed.fqtest

+0x1d0|                  bd 03                        |      ..        |            value_length: 445 0x1d6-0x1d8 (2)
+     |                                               |                |            key{}: 0x1d8-0x1e5 (13)
+0x1d0|                        69 70 73 75 6d         |        ipsum   |              user_key_suffix: raw bits 0x1d8-0x1dd (5)
+     |                                               |                |              user_key: "lorem.ipsum" (inferred)


Nice :) i guess this will make it quite a bit easier to write queries?

@wader This decoding has given me a new insight into everything LevelDB stores; not just the key and the value, but apparently also the history of key-value pairs.

🥳 yeap, I've had similar experiences writing decoders: you bump into lots of fascinating tricks and legacy, e.g. I remember zip has so much legacy to handle floppy disks and appending, and also the date format uses msdos-timestamps (2 second precision!)

wader · 2023-12-08T23:03:18Z

format/leveldb/testdata/leveldb_table_uncompressed.fqtest

+0x1d0|               0d                              |     .          |            unshared_bytes: 13 0x1d5-0x1d6 (1)
+0x1d0|                  bd 03                        |      ..        |            value_length: 445 0x1d6-0x1d8 (2)
+     |                                               |                |            key{}: 0x1d8-0x1e5 (13)
+0x1d0|                        69 70 73 75 6d         |        ipsum   |              user_key_suffix: raw bits 0x1d8-0x1dd (5)


These are not strings?

@wader They are; I couldn't figure out if you had a scalar method I could plug into d.FieldRawLen... since I need the bytes as well:

br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)

Aha ok, the suffix bytes are not always valid utf8? Anyway, I think the current way is fine! But I do wonder if there should be a Raw variant that returns bytes? (maybe there is?) The reason Raw returns a bitio.BitReader is that it can be used for very large bit ranges, so the bits will only be read if really needed, to save on IO and memory.

wader · 2023-12-08T23:06:43Z

Thank you. I searched for "synthetic"; had I searched for "FieldValue", then I would have found it faster. I wrongly assumed FieldValue was just a regular reader. :)

Aha sorry should have been clearer :)

BTW now all *.ldb files in my home directory decodes fine and looks beautiful!

wader · 2023-12-09T09:45:45Z

format/leveldb/leveldb_table.go

+	// case 2: type and sequence_number fit fully in unshared: simulate user_key value.
+	if unsharedSize >= typeAndSequenceNumberSize {
+		br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)
+		d.FieldValueStr("user_key", string(append(sharedBytes, d.ReadAllBits(br)...)), strInferred)


Could it be something like?

suffix := d.FieldUTF8("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize))
d.FieldValueStr("user_key", string(sharedBytes)+suffix, strInferred)

An alternative to using d.FieldValueStr could be to only have one user_key field and let the symbolic value for that one string be the inferred string. But I think I like how it is now, which I guess is more true to how things actually work: that it's a suffix, and then we just add a bit extra to make it convenient.

btw append(sharedBytes, d.ReadAllBits(br)...) can sometimes be a bit tricky in Go: if I remember correctly, the backing array of the sharedBytes slice might end up being reused if it has capacity. But in this case it should be fine, I think, as it gets turned into a new string (which is immutable and does not share the slice bytes)

@wader I like your solution.

[...] the sharedBytes slice might end up being reused [...]

Go's append() I don't quite understand yet. Apparently it is an amortized-O(1) operation, yet it seems sharedBytes doesn't get mutated? (So it's not doing .append(....) like in Python or .push like in JavaScript.)

@wader Although, on second thought, if the user_key is cut at some UTF-8 multi-byte character, this might fail. However, maybe it is good enough.

In testing, I see it does indeed fail:

In LevelDB, both keys and values are bytestrings; a UTF-8 encoding would typically be used, but it doesn't have to be.

@wader I made some own functions for now to handle this. See recent commit.

wader · 2023-12-09T10:01:45Z

format/leveldb/leveldb_table.go

+						}
+						err := keyCallbackFn(keyPrefix, int(unshared), d)
+						if err != nil {
+							d.Errorf(err.Error())


Maybe these should be d.Errorf("%s", err), to be sure that some random %s etc. in the error string is not interpreted as a format verb?

Thank you for the pointer. I just saw earlier that I'd used some d.Errorf("%v", err) code I had copy-pasted from avro_ocf.go without remembering.

wader · 2023-12-09T10:02:08Z

format/leveldb/leveldb_descriptor.go

+		d.FieldStruct("data", func(d *decode.D) {
+			err := readInternalKey(nil, int(length), d)
+			if err != nil {
+				d.Errorf(err.Error())


See other comment about d.Errorf(err.Error())

wader · 2023-12-09T10:10:42Z

Great work! Just some tiny things left, I think. After that, just let me know if you think you're ready to merge.

Hope it was a good review experience too 😄

mikez · 2023-12-09T10:38:00Z

Great work! Just some tiny things left, I think. After that, just let me know if you think you're ready to merge.

Hope it was a good review experience too 😄

Thank you, @wader! I very much appreciate your reviewing style. I learned a lot about fq, Go, and LevelDB in the process. I hope it was a good review experience for you as well.

decode unfragmented .log files:
- break leveldb_log.go into leveldb_log_blocks.go and leveldb_log.go; the former is used by both .MANIFEST (descriptor) and .LOG.
- in leveldb_log, introduce readBatch that decodes further

fix UTF8 decoding:
- introduce fieldUTF8ReturnBytes and stringify to handle multi-byte UTF8-encodings correctly.

wader · 2023-12-09T13:56:38Z

format/leveldb/leveldb_table.go

@@ -441,3 +446,21 @@ func mask(crc uint32) uint32 {
 	// Rotate right by 15 bits and add a constant.
 	return ((crc >> 15) | (crc << 17)) + kMaskDelta
 }
+
+// Concatenate byte slices and convert into a string.
+func stringify(byteSlices ...[]byte) string {


If you want to save on allocations you could use a strings.Builder, or possibly sum all the lengths, allocate once with make(), and then append. But I have a feeling it won't make much difference in this case?

strings.Builder seems to do the same thing. I suppose the resizing would be made in an amortized-constant way (similar to hash tables)? I can still preallocate though. :)

Ok, this explains more what's going on. It seems sometimes it's mutated (and then I assume that's amortized constant)... and sometimes it's created from scratch (which is costly). It is not clear to me which one happens when though.

Huh, I assumed strings.Builder did smarter things by growing smartly, but maybe the only smartness is in the String() method? But I do wonder if append might have some smartness about how to grow, so it won't end up reallocating/copying all the time?
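For reference, a sketch of the two preallocation options mentioned in this thread. Both functions are hypothetical variants written for illustration; the PR's actual stringify body is not shown here and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// stringifyPrealloc sums the lengths first so the result buffer is
// allocated once with make(), then appends each part.
func stringifyPrealloc(byteSlices ...[]byte) string {
	total := 0
	for _, b := range byteSlices {
		total += len(b)
	}
	result := make([]byte, 0, total)
	for _, b := range byteSlices {
		result = append(result, b...) // never exceeds cap, so no regrowth
	}
	return string(result)
}

// stringifyBuilder does the same with strings.Builder, using Grow to
// reserve the full size up front.
func stringifyBuilder(byteSlices ...[]byte) string {
	total := 0
	for _, b := range byteSlices {
		total += len(b)
	}
	var sb strings.Builder
	sb.Grow(total)
	for _, b := range byteSlices {
		sb.Write(b)
	}
	return sb.String()
}

func main() {
	fmt.Println(stringifyPrealloc([]byte("lorem."), []byte("ipsum")))
	fmt.Println(stringifyBuilder([]byte("lorem."), []byte("ipsum")))
}
```

The make() variant still copies once more in the final string conversion, whereas strings.Builder is designed to hand out its buffer as a string without that extra copy. For keys of this size the difference is likely negligible.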

@wader I'm trying to track down how append works precisely; this is the closest I've found: https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssagen/ssa.go#L3503
That code has too many concepts I don't understand, so I don't have more clarity yet. :)

interesting! it seems to generate a call to growslice https://github.com/golang/go/blob/fe1b2f95e6dbfb6e6212bb391706ae62eb0ae5ec/src/runtime/slice.go#L155 which in turn seems to call nextslicecap https://github.com/golang/go/blob/fe1b2f95e6dbfb6e6212bb391706ae62eb0ae5ec/src/runtime/slice.go#L267 which has some heuristics how to grow

https://godbolt.org/z/rEPerb3b4 may be interesting. I would have assumed that append would just end up being a call to something, but it seems there is lots of trickery going on, maybe to speed up common cases etc.?

@wader Oh, checking the compiler explorer was very insightful!! runtime.growslice(SB) clearly shows the connection to slice.go now. I don't quite see the big picture yet, but now I know where to start if I wanted to dive deeper. :)
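One way to see the growth heuristics from the outside, without reading the runtime, is to watch cap() while appending. This is a rough sketch with an illustrative function name; the exact counts depend on the Go version and are not guaranteed by the language.

```go
package main

import "fmt"

// countReallocs appends n bytes one at a time and counts how often the
// capacity changes, i.e. how often append had to allocate a new array.
func countReallocs(n int) int {
	var s []byte
	reallocs := 0
	lastCap := cap(s)
	for i := 0; i < n; i++ {
		s = append(s, byte(i))
		if cap(s) != lastCap {
			reallocs++
			lastCap = cap(s)
		}
	}
	return reallocs
}

func main() {
	// Far fewer reallocations than appends: capacity grows geometrically.
	fmt.Println(countReallocs(1_000_000))
}
```

Because the capacity grows by a multiplicative factor rather than by one element at a time, most appends just write into spare room, which is where the amortized-constant cost comes from.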

wader · 2023-12-09T14:07:00Z

format/leveldb/leveldb_table.go

+	return string(result)
+}
+
+func fieldUTF8ReturnBytes(name string, nBytes int, d *decode.D) []byte {


I've experimented and thought a bit about doing a decode API v2 that would use method chaining; then things could probably be nicer, something like:

d.FieldUTF8("bla", nBytes).Bytes()
d.FieldUTF8("bla", nBytes).BitReader()
d.FieldUTF8("bla", nBytes).Map(...).Bytes()

I think that would make it a bit more flexible, and it would probably also cut down a bit on the amount of generated code. Also, it could probably help make the API more type safe, as e.g. d.FieldUTF8 would return some type forcing things to use string. Should think more about this :)

wader · 2023-12-09T14:26:50Z

format/leveldb/leveldb_table.go

-		br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)
-		d.FieldValueStr("user_key", string(append(sharedBytes, d.ReadAllBits(br)...)), strInferred)
+		suffix := fieldUTF8ReturnBytes("user_key_suffix", unsharedSize-typeAndSequenceNumberSize, d)
+		d.FieldValueStr("user_key", stringify(sharedBytes, suffix), strInferred)


So if I understand correctly: shared and suffix are seen as bytes, so they might not be valid UTF-8, and reading them as UTF-8 strings won't preserve the raw bytes if they are illegal etc.? Normal Go strings can, I think, store raw bytes, but fq's d.UTF8 uses unicode.UTF8BOM, which I suspect might replace illegal bytes with an error or replacement runes? Anyway, if this seems to work, keep it like this!

Ah, interesting point with UTF8BOM (0xEF, 0xBB, 0xBF)!

I hadn't come across this before nor thought about it.
It seems the Python encoder natively skips it (unless specified explicitly), also some of the LevelDB dumps I've observed so far don't use the UTF8BOM.
That said, since LevelDB takes bytestrings as keys and values, that wouldn't stop anyone from using UTF8BOM if they wanted to.

So maybe the utf-8 encoder will be good enough in most cases?

Safest, I think, is probably to keep it as is and treat the parts as raw bytes. It's a bit weird that UTF-8 with a BOM is even a thing, as UTF-8 does not really have a byte order :) but it seems to happen 🤷 The encoding used by d.UTF8 was chosen to be quite liberal with weird stuff :)

BTW if you want to have full control over text encoding, then there is d.FieldStr("...", <encoding>), where you can plug in any encoding.Encoding.

wader · 2023-12-09T14:54:13Z

Great work! Just some tiny things left, I think. After that, just let me know if you think you're ready to merge.
Hope it was a good review experience too 😄

Thank you, @wader! I very much appreciate your reviewing style. I learned a lot about fq, Go, and LevelDB in the process. I hope it was a good review experience for you as well.

Happy to hear that! I can sometimes end up thinking that I might come across as "naggy" when there is a lot of back and forth. I'm not! Super happy someone wants to help out :) It's also a bit tricky to review things about the format itself as I have very little experience with it, so I'm mostly trying to focus on helping it fit well into the rest of fq :)

wader · 2023-12-09T14:55:49Z

format/leveldb/testdata/lorem.json

+    "lorem.lorem": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
+    "lorem.ipsum": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
+    "lorem.dolor": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
+    "row": "Row, row, row your boat\nGently down the stream.\nMerrily, merrily, merrily, merrily,\nLife is but a dream. 🚣‍♂️"


Now i noticed the 🚣‍♂️ :)

wader · 2023-12-09T15:23:10Z

Played around a bit more with Chrome's LevelDB databases, seems to work fine! It's a bit confusing to look at web storage databases, lots of weird stuff stored in them :)

Think i'm kind of ready to merge if you are

mikez · 2023-12-09T15:37:22Z

@wader I'm ready! :)

wader · 2023-12-09T15:47:49Z

🥳

mikez added 2 commits December 4, 2023 12:05

ldb: first draft

fb910bd

Example: go run . -d ldb d format/ldb/testdata/000005.ldb

ldb: uncompression support

efc59a8

wader reviewed Dec 4, 2023

leveldb: address PR comments

b05aa99

leveldb: rename functions and add comments

78a3e94

wader reviewed Dec 5, 2023

wader changed the title ~~ldb: Add LevelDB support~~ leveldb: Add LevelDB support Dec 6, 2023

leveldb: add log and descriptor decoders

2df0f0f

wader reviewed Dec 6, 2023

leveldb: updates per PR comments

fe1099b

wader reviewed Dec 6, 2023

leveldb: fix all.fqtest failures

4283091

wader reviewed Dec 7, 2023

wader reviewed Dec 8, 2023

mikez added 2 commits December 9, 2023 08:45

leveldb: propagate error

e735cea

leveldb: rename "suffix" to "sequence_number_suffix"

07ad940

wader reviewed Dec 9, 2023

leveldb: fix Errorf arguments

e826f09

wader reviewed Dec 9, 2023

leveldb: improve stringify by preallocating result

08e3d2d

wader reviewed Dec 9, 2023

wader merged commit b05c7ec into wader:master Dec 9, 2023
5 checks passed

		@@ -0,0 +1,45 @@
		# Make LevelDB data: both uncompressed and compressed.

		@@ -0,0 +1,57 @@
		$ fq -d leveldb_descriptor dv uncompressed.ldb/MANIFEST-000004
		\|00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f\|0123456789abcdef\|.{}: uncompressed.ldb/MANIFEST-000004 (leveldb_descriptor) 0x0-0x57 (87)
