First complete version of RecordIO #1

wangkuiyi · 2017-04-29T05:32:26Z

This PR is based on discussions in wangkuiyi/sstable#1

Fixes PaddlePaddle/Paddle#1947

recordio_test.go shows the usage of the RecordIO API.

gongweibao · 2017-04-29T12:52:00Z

header.go

+	checkSum       uint32
+	compressor     uint32
+	compressedSize uint32
+	len            uint32


Would record_num be a better name for the number of records?
Or maybe add a comment?

len is indeed a bit confusing. At first glance I took it for the number of bytes, but it is actually the number of records.

gongweibao · 2017-04-29T12:55:03Z

reader.go

+type Index struct {
+	chunkOffsets []int64
+	chunkLens    []uint32
+	records      int


In the chunk struct, records is the [][]byte holding the records.
How about total_record_num here?

Done. Index.records => Index.numRecords

gongweibao · 2017-04-29T13:16:00Z

reader.go

+			f.chunkOffsets = append(f.chunkOffsets, offset)
+			f.chunkLens = append(f.chunkLens, hdr.len)
+			f.records += int(hdr.len)
+			offset, e = r.Seek(int64(hdr.compressedSize), io.SeekCurrent)


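The chunk's maxChunkSize can be customized by the user, so the number of seeks here may not be bounded, which is a potentially large performance risk.
The index entry for a single chunk is really just chunkOffset and chunkLen, which takes reading 12 bytes; skipping through with parseHeader takes reading 16 bytes plus one seek per chunk.
If an index is needed, then for performance it would be better to append the index at the end of the file or keep it in a separate file.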

To keep the code testable, fairly small chunks need to be allowed. I will change it so that if the user does not specify a size, defaultMaxChunkSize is used.

gongweibao · 2017-04-29T13:23:10Z

chunk.go

+	}
+
+	var buf bytes.Buffer
+	if _, e = io.CopyN(&buf, r, int64(hdr.compressedSize)); e != nil {


The checkSum is not verified here.

gongweibao

.

helinwang · 2017-04-29T16:14:17Z

chunk.go

+	size    int // sum of record lengths.
+}
+
+func newChunk() *Chunk {


Ok for me either way, just a suggestion: Maybe we don't need this function, since there is no argument to the function. We can just use &Chunk{} where needed.

helinwang · 2017-04-29T16:17:08Z

header.go

+
+	// NoCompression means writing raw chunk data into files.
+	// With other choices, chunks are compressed before written.
+	NoCompression = 0


These few lines can be changed to:

NoCompression = iota
Snappy
Gzip

https://golang.org/ref/spec#Iota
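As a small self-contained illustration of the suggestion: inside a const block, iota starts at 0 and increases by one per line, and a line without an expression repeats the previous one. The names are the ones proposed above; wrapping them in a standalone program is only for demonstration.

```go
package main

import "fmt"

const (
	// NoCompression means writing raw chunk data into files.
	NoCompression = iota // 0
	Snappy               // 1
	Gzip                 // 2
)

func main() {
	fmt.Println(NoCompression, Snappy, Gzip) // prints "0 1 2"
}
```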

helinwang · 2017-04-29T16:20:30Z

header.go

+
+func (c *Header) write(w io.Writer) (int, error) {
+	var buf [20]byte
+	binary.LittleEndian.PutUint32(buf[0:4], magicNumber)


It feels like compressor and magicNumber are only necessary per file, not per chunk. Do we allow the user to specify a different compressor per chunk?

You are right. But a point of RecordIO is that if some records were not correctly written, we can skip over to the next chunk. Therefore, we need to have magic number and per-chunk metadata.

@wangkuiyi from my understanding, it's the checksum that prevents the decoder from returning corrupted data, not the magic number? It's possible for the program to crash after writing the magic number but while writing the data.
I thought the magic number was for identifying the type of the file. E.g., gzip has a magic number to tell that it's a gzip file.

I see, it's for constructing the correct index (since the checksum is not verified during index construction). Maybe it makes sense to put the magic number at the end of the header, to guard against the case where writing the magic number succeeded but writing the other header fields failed.

Yes. I think that the magic number is for segmenting chunks. If a chunk is corrupted while being written, we get a chance to skip to the next valid chunk by searching sequentially for the next magic number.

I see. Good!
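The recovery idea in this thread, searching forward for the next magic number after a damaged chunk, can be sketched as below. This is an illustration over an in-memory byte slice, not the PR's implementation; the magic number value and the names nextChunk and magicBytes are hypothetical. Payload bytes can match the magic number by coincidence, so a real reader would still need the checksum to reject false positives.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const magicNumber uint32 = 0x01020304 // hypothetical value

// magicBytes returns the little-endian encoding of the magic number.
func magicBytes() []byte {
	var m [4]byte
	binary.LittleEndian.PutUint32(m[:], magicNumber)
	return m[:]
}

// nextChunk returns the offset of the first magic number at or after
// from, or -1 if there is none.
func nextChunk(data []byte, from int) int {
	if from < 0 || from > len(data) {
		return -1
	}
	i := bytes.Index(data[from:], magicBytes())
	if i < 0 {
		return -1
	}
	return from + i
}

func main() {
	// A truncated chunk (magic number plus 3 stray bytes) followed by
	// the start of the next chunk.
	var file []byte
	file = append(file, magicBytes()...)
	file = append(file, 0xaa, 0xbb, 0xcc)
	file = append(file, magicBytes()...)

	// Resume the search just past the start of the damaged chunk.
	fmt.Println(nextChunk(file, 1)) // prints "7"
}
```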

helinwang · 2017-04-29T16:33:33Z

header.go

+	checkSum       uint32
+	compressor     uint32
+	compressedSize uint32
+	len            uint32


len is indeed a bit confusing. At first glance I took it for the number of bytes, but it is actually the number of records.

helinwang · 2017-04-29T16:36:58Z

chunk.go

+
+	// Clear the current chunk.
+	ch.records = nil
+	ch.size = 0


Isn't ch.size redundant with len(ch.records)?

No, it isn't redundant. The confusion comes from my variable naming.

Done. ch.size ==> ch.numBytes

helinwang · 2017-04-29T16:48:18Z

chunk.go

+// the next add invocation.
+func (ch *Chunk) dump(w io.Writer, compressorIndex int) error {
+	// Write raw records and their lengths into data buffer.
+	var data bytes.Buffer


Maybe early return if len(ch.records) == 0. Otherwise empty chunk will still have header.

Good point! Done.

helinwang · 2017-04-29T16:59:20Z

chunk.go

+		return nil, e
+	}
+
+	ch := &Chunk{}


This place does not use newChunk(); the inconsistency would go away if newChunk() were removed :p

helinwang · 2017-04-29T17:00:40Z

chunk.go

+	ch := &Chunk{}
+	for i := 0; i < int(hdr.len); i++ {
+		var rs [4]byte
+		if _, e = deflated.Read(rs[:]); e != nil {


rs[:] is the same as rs.

Nope. deflated.Read requires a slice, but rs was defined as an array. rs[:] converts the array into a slice.

helinwang · 2017-04-29T17:01:15Z

chunk.go

+			return nil, fmt.Errorf("Failed to read record length: %v", e)
+		}
+
+		r := make([]byte, binary.LittleEndian.Uint32(rs[:]))


rs[:] is the same as rs.

rs[:] converts array rs into a slice as required by binary.LittleEndian.Uint32.

helinwang · 2017-04-30T15:23:58Z

reader.go

+	for {
+		if hdr, e = parseHeader(r); e != nil {
+			break
+		} else {


Go style prefers an early return over if {} else {}.
I think the following is clearer (it has less indentation):

for {
	if {
		break
	}
	// no else, more code here.
}
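A runnable version of that loop shape, for illustration only: the error case breaks out first, and the normal path continues at the loop's base indentation. The names sumUntilError and fromSlice are hypothetical and stand in for the parseHeader loop in reader.go.

```go
package main

import (
	"fmt"
	"io"
)

// sumUntilError adds up the values produced by next until it fails.
func sumUntilError(next func() (int, error)) int {
	total := 0
	for {
		n, e := next()
		if e != nil {
			break
		}
		// No else branch: the normal path is not nested.
		total += n
	}
	return total
}

// fromSlice returns a producer that yields vals in order, then io.EOF.
func fromSlice(vals []int) func() (int, error) {
	i := 0
	return func() (int, error) {
		if i >= len(vals) {
			return 0, io.EOF
		}
		v := vals[i]
		i++
		return v, nil
	}
}

func main() {
	fmt.Println(sumUntilError(fromSlice([]int{3, 2}))) // prints "5"
}
```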

helinwang · 2017-04-30T15:30:22Z

reader.go

+}
+
+// NewScanner creates a scanner that sequencially reads records in the
+// range [start, start+len).


Perhaps document what will happen when len < 0.

helinwang · 2017-04-30T15:41:27Z

reader.go

+func (s *Scanner) Record() []byte {
+	ci, ri := s.index.Locate(s.cur)
+	if s.chunkIndex != ci {
+		log.Fatalf("Must call Scan before Record")


Please do not panic unless the application is in an unrecoverable state. See bufio.Scanner's Text() (also a scanner) for reference: https://golang.org/src/bufio/scan.go?s=4262:4293#L96, or here: https://golang.org/src/text/scanner/scanner.go?s=16921:16957#L652
I think returning nil here is reasonable.

Btw, from my understanding, logging inside a library does not make much sense (it should return an error if something needs reporting), because it "pollutes" stdout and the developer may not see it during development (what if the case only happens in production, under circumstances that are never triggered during development?). I have never seen it in the official Go libraries, but I'm not 100% sure, so I created this question: http://stackoverflow.com/questions/43708125/should-golang-library-do-logging-instead-of-return-error

Good point for this case! That check is there to make sure the program itself has no bug; it is a development-time check and shouldn't be shipped in production code.

Done by deleting the test.
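The convention the review points to, where the accessor returns nil instead of aborting and errors are reported through a separate method, might look like the sketch below. It is modeled loosely on bufio.Scanner and iterates over in-memory records; the scanner type and its fields are hypothetical and are not the PR's Scanner.

```go
package main

import "fmt"

// scanner is a hypothetical in-memory scanner in the style of bufio.Scanner.
type scanner struct {
	records [][]byte
	cur     int   // index of the current record; -1 before the first Scan
	err     error // first error encountered, if any
}

func newScanner(records [][]byte) *scanner {
	return &scanner{records: records, cur: -1}
}

// Scan advances to the next record and reports whether there is one.
func (s *scanner) Scan() bool {
	if s.cur < len(s.records) {
		s.cur++
	}
	return s.cur < len(s.records)
}

// Record returns the current record, or nil if Scan has not been called
// or has already returned false. It never terminates the program.
func (s *scanner) Record() []byte {
	if s.cur < 0 || s.cur >= len(s.records) {
		return nil
	}
	return s.records[s.cur]
}

// Err returns the first error hit while scanning (always nil in this sketch).
func (s *scanner) Err() error { return s.err }

func main() {
	s := newScanner([][]byte{[]byte("a"), []byte("b")})
	fmt.Println(s.Record() == nil) // prints "true": Scan not called yet
	for s.Scan() {
		fmt.Printf("%s\n", s.Record()) // prints "a", then "b"
	}
	if e := s.Err(); e != nil {
		fmt.Println("scan failed:", e)
	}
}
```

The caller decides what to do about a failure by checking Err() after the loop, rather than the library deciding to exit.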

helinwang

LGTM++ Very nice interface!

first version

ce7d1ec

wangkuiyi mentioned this pull request Apr 29, 2017

Initial import, not complete yet wangkuiyi/sstable#1

Closed

gongweibao reviewed Apr 29, 2017

gongweibao suggested changes Apr 29, 2017

Update document

69a9688

helinwang reviewed Apr 29, 2017

wangkuiyi self-assigned this Apr 30, 2017

Response to comments from Weibao and Helin

27bb8bb

helinwang reviewed Apr 30, 2017

Response to the second batch of comments from Helin

504aa7a

helinwang approved these changes May 1, 2017

wangkuiyi merged commit 19cc0e7 into master May 1, 2017

zou000 deleted the first-draft branch May 28, 2019 08:02
