Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime error: slice bounds out of range in packets.go #587

Closed
gm42 opened this issue May 9, 2017 · 21 comments · Fixed by #592
Closed

runtime error: slice bounds out of range in packets.go #587

gm42 opened this issue May 9, 2017 · 21 comments · Fixed by #592
Labels

Comments

@gm42
Copy link

gm42 commented May 9, 2017

Issue description

When using many concurrent SQL queries (same pool), I get a slice out of bounds error.
The corresponding logic in jmoiron/sqlx seems fine. It's just a call to (*Rows).Next() which is causing this.

Similar to #36 perhaps?

Error log

panic: runtime error: slice bounds out of range

goroutine 4814 [running]:
mypackage/vendor/github.com/go-sql-driver/mysql.(*binaryRows).readRow(0xc4205a2160, 0xc422be3140, 0xc, 0xc, 0xc, 0xc422be3080)
	/home/gm42/projects/mypackage/src/mypackage/vendor/github.com/go-sql-driver/mysql/packets.go:1168 +0x92d
mypackage/vendor/github.com/go-sql-driver/mysql.(*binaryRows).Next(0xc4205a2160, 0xc422be3140, 0xc, 0xc, 0xc, 0xc)
	/home/gm42/projects/mypackage/src/mypackage/vendor/github.com/go-sql-driver/mysql/rows.go:85 +0x5e
database/sql.(*Rows).nextLocked(0xc422b6da80, 0xc4206ac4b0)
	/usr/local/go/src/database/sql/sql.go:2149 +0x6c
database/sql.(*Rows).Next.func1()
	/usr/local/go/src/database/sql/sql.go:2134 +0x3c
database/sql.withLock(0xeb1640, 0xc422b6dab0, 0xc4206ac4f0)
	/usr/local/go/src/database/sql/sql.go:2545 +0x65
database/sql.(*Rows).Next(0xc422b6da80, 0xb82ce0)
	/usr/local/go/src/database/sql/sql.go:2135 +0x83
mypackage/vendor/github.com/jmoiron/sqlx.(*Row).Scan(0xc4223b0870, 0xc422be2fc0, 0xc, 0xc, 0x0, 0x0)

Configuration

Driver version (or git SHA): 1421caf

Go version: go version go1.8.1 linux/amd64

Server version: MySQL 5.6.27

Server OS: Ubuntu 16

@gm42
Copy link
Author

gm42 commented May 9, 2017

I was digging a bit around. These are the three bits of code involved (from highest to lowest level):

  1. https://github.com/jmoiron/sqlx/blob/f980a91bdc37abef88269b8f122d7de6352102f5/sqlx.go#L205 (corresponding to line 199 in the stacktrace I posted)
  2. mysql/packets.go

    Line 1168 in 1421caf

    dest[i], isNull, n, err = readLengthEncodedString(data[pos:])
    so:
		// Length coded Binary Strings
		case fieldTypeDecimal, fieldTypeNewDecimal, fieldTypeVarChar,
			fieldTypeBit, fieldTypeEnum, fieldTypeSet, fieldTypeTinyBLOB,
			fieldTypeMediumBLOB, fieldTypeLongBLOB, fieldTypeBLOB,
			fieldTypeVarString, fieldTypeString, fieldTypeGeometry, fieldTypeJSON:
			var isNull bool
			var n int
			dest[i], isNull, n, err = readLengthEncodedString(data[pos:])
  1. on the other hand dest is initialised in https://golang.org/src/database/sql/sql.go#L2147 as:
rs.lastcols = make([]driver.Value, len(rs.rowsi.Columns()))

Is it possible that either rs.lastcols is nil or with an insufficient number of columns?

UPDATE 1: I checked by modifying packets.go, and dest is fine. The slice of the panic must be data[pos:]
UPDATE 2: I have to say that this looks a lot like a heisenbug, as now that I am checking in packets.go also for the length of data, I cannot seem to be able to reproduce it anymore (it was not possible to deterministically reproduce it before either)

@gm42
Copy link
Author

gm42 commented May 9, 2017

By the way, I am also getting (from few to many) errors like:

sql: Scan error on column index 3: converting driver.Value type int64 ("7234581354877314404") to a uint32: value out of range

You can check with https://play.golang.org/p/N0AeL32n8R that such big numbers (all instances I got in logs and I tried) are really pieces of ASCII text e.g. "de liefd". This specific error (the one I report in this comment) is reproducible, compared to the one in first post.

So my theory is that the two errors are related, and somehow one field gets "eaten"/skipped and causes the problem.

@julienschmidt
Copy link
Member

@gm42 can you make sure that this is a problem somehow caused by concurrency?
Two ways to test this come to my mind:

  1. read the whole database sequentially once

  2. try to find the corresponding database entry (e.g. by searching for %de liefd%) and try if it works if you (repeatedly) select the same entry

@julienschmidt
Copy link
Member

It's hard to imagine might be causing the problems. The driver itself keeps all the connections completely isolated from each other, so concurrency shouldn't be an issue. The pooling logic in the database/sql package has some weird and complex implementations but I also can't imagine any scenario where it might cause problems like that. The last suspect might be the MySQL server itself, but then it should also cause problems with other clients. And considering the size of the MySQL community, I think this problem would have been detected already.

The only remaining reason I can imagine is that we do not reset some state properly somewhere.

@gm42
Copy link
Author

gm42 commented May 10, 2017

@julienschmidt an update:

  1. still could not reproduce again the error in original post, I am running the version I
    mentioned there 1421caf of go-sql-driver (untouched); I believe it will happen again, just it's hard to trigger
  2. I cannot seem to trigger the converting driver.Value type int64 error with MySQL v5.7, while I always can with v5.6; where not explicitly specified like I did here, I am always testing with the version of original post (v5.6)
  3. I can trigger the converting driver.Value type int64 error by reading sequentially all the records, however it does not happen always with same rows; it seems random in "electing" which rows to fail, so in this sense concurrency works just as an "intensifier" of the trigger.

Just to be clear about (3), there are times that a full sequential read does not trigger the conversion error at all, and times that it does, even immediately; this is also unexpected for me, and puzzling, but compatible with some internal state problem because I do not restart the Go process between runs e.g. the server side API stays running (this is the side connecting to MySQL), and from client-side I can trigger (either sequentially or concurrently) full scans via (synchronous) API calls.

Now I will run some more tests with concurrency and try to reduce this to a testcase for MySQL 5.7.

UPDATE: @julienschmidt I have managed to reduce the data/query parts and the Go parts of the testcase: https://gist.github.com/gm42/2fa0693d9a9aea81f1dce1f1b65679af

You can reproduce the issue with MySQL 5.6 by running:

go run go-mysql-587-testcase.go

(You will need the sqlx package and to populate a localhost MySQL instance, or change the connection string accordingly)

Some interesting facts I found out:

  • disabling connection pooling make this bug disappear! So evidently it's related to connection sharing. A separate upstream bug for database/sql should be reported when investigation here is finished IMO.
  • only happening with MySQL 5.6 with a self-join table and when ordering happens through one of the JOINed tables

When trying to reduce the testcase (so that less tables would be involved), the provided query was really the smallest testcase I could find, please let me know if bug can be reproduced and/or if you manage to reduce it further.

@cdesousa
Copy link

I've been able to reproduce this on my laptop with @gm42 submitted test case.

mysql version: mysqld Ver 5.7.18 for osx10.12 on x86_64 (Homebrew)

go version: 1.8.1
go-sql-driver: latest master
sqlx: latest master

result: it seems to fail randomly in a different iteration

2017/05/10 11:53:14 starting iterations
2017/05/10 11:53:14 TestcaseIteration(40): NOT A BUG
2017/05/10 11:53:14 TestcaseIteration(41): 67108864
panic: BUG

2017/05/10 11:53:56 starting iterations
2017/05/10 11:53:56 TestcaseIteration(45): NOT A BUG
2017/05/10 11:53:56 TestcaseIteration(46): NOT A BUG
2017/05/10 11:53:56 TestcaseIteration(48): NOT A BUG
2017/05/10 11:53:56 TestcaseIteration(47): NOT A BUG
2017/05/10 11:53:56 TestcaseIteration(43): NOT A BUG
2017/05/10 11:53:56 TestcaseIteration(42): 67108864
panic: BUG

@julienschmidt
Copy link
Member

The next step would be to reproduce it without sqlx, as the error might also be happening there.

@cdesousa
Copy link

cdesousa commented May 10, 2017

Reproduced (same scenario as above) without sqlx, using regular database/sql package (with @gm42's original test case + his database.sql).

https://gist.github.com/cdesousa/95775bb6dbf67480084d770475addb19

The test does pass when I remove the IFNULL function from the SELECT and just select either plc.parent_id or plc.id.

Still doing some additional debugging as to what the cause is.

2017/05/10 13:52:02 starting iterations
** Data:  50 test 1
2017/05/10 13:52:02 TestcaseIteration(46): NOT A BUG
** Data:  29 test 1
2017/05/10 13:52:02 TestcaseIteration(43): NOT A BUG
** Data:  36 test 1
2017/05/10 13:52:02 TestcaseIteration(44): NOT A BUG
** Data:  73 test 1
2017/05/10 13:52:02 TestcaseIteration(50): NOT A BUG
** Data:  41 test 1
2017/05/10 13:52:02 TestcaseIteration(45): NOT A BUG
** Data:  87 test 1
2017/05/10 13:52:02 TestcaseIteration(52): NOT A BUG
** Data:  22 test 1
2017/05/10 13:52:02 TestcaseIteration(42): NOT A BUG
** Data:  1 test 1
2017/05/10 13:52:02 TestcaseIteration(39): NOT A BUG
** Data:  89 test 1
2017/05/10 13:52:02 TestcaseIteration(53): NOT A BUG
** Data:  64 test 1
2017/05/10 13:52:02 TestcaseIteration(48): NOT A BUG
** Data:  8 test 1
2017/05/10 13:52:02 TestcaseIteration(40): NOT A BUG
** Data:  108 test 1
2017/05/10 13:52:02 TestcaseIteration(55): NOT A BUG
** Data:  109  67108864
2017/05/10 13:52:02 TestcaseIteration(57): 67108864
panic: BUG

@cdesousa
Copy link

cdesousa commented May 10, 2017

Update:

  • moving the db.Prepare statement into the testCase func (ie preparing it within each goroutine) results in "BUG NOT DETECTED".
  • db.SetMaxOpenConns(1) results in "BUG NOT DETECTED"

@julienschmidt
Copy link
Member

This sounds a lot like a bug in the pooling logic of the database/sql package

@julienschmidt
Copy link
Member

But on the other hand, this shouldn't cause a bug like this.
There must be something different in the state of the mysqlRows, mysqlConn or the vars in readRow() between cases where the bug triggers and when not.

@methane
Copy link
Member

methane commented May 10, 2017

See here:

mysql/statement.go

Lines 117 to 125 in e5cc7f3

// Columns
// If not cached, read them and cache them
if len(stmt.columns) == 0 {
rows.rs.columns, err = mc.readColumns(resLen)
stmt.columns = append(stmt.columns, rows.rs.columns)
} else {
rows.rs.columns = stmt.columns[0]
err = mc.readUntilEOF()
}

Is this bug reproducible when removed this line?

			stmt.columns = append(stmt.columns, rows.rs.columns)

@methane
Copy link
Member

methane commented May 10, 2017

(slightly off topic)
I don't like such unsafe caching.
I think we should choose robustness and simplicity over performance in benchmark game.

@cdesousa
Copy link

@methane the bug is not reproducible when I comment out that line.

@methane
Copy link
Member

methane commented May 10, 2017

OK.
There are no guarantee that same statement returns same column definition every time.

So we should check column definition packet are exactly same bytes before reusing
parsed column definition.
Or we can just remove column definition cache.

@methane
Copy link
Member

methane commented May 10, 2017

BTW, Oracle removed protocol document. WTF!

We can see archived page here:
https://web-beta.archive.org/web/20170404230026/https://dev.mysql.com/doc/internals/en/prepared-statements.html

Or we can refer MariaDB's document.

@gm42
Copy link
Author

gm42 commented May 11, 2017

Thanks guys for the progress here. IMO the driver should release a patch that fixes this serious problem first, then caching is studied later with a regression test for this in mind.

Imagine how this could be used as an attack vector to increase the balance or some other number in a column by making many API calls with text-setting fiields like zzzzzzzz... 😊

Shall I attempt a PR or you already have something in progress? I also think #444 might be affected

methane added a commit to methane/mysql that referenced this issue May 11, 2017
methane added a commit to methane/mysql that referenced this issue May 11, 2017
@methane
Copy link
Member

methane commented May 11, 2017

I tried #587 (comment)

I found fieldType of nullable_id is sometimes fieldTypeLong and sometimes fieldTypeLongLong.
Using cached wrong fieldType broke parsing row.

@cdesousa
Copy link

Great find, thanks for the work!

@gm42
Copy link
Author

gm42 commented May 11, 2017

@methane might be explained by the fact that an IFNULL sometimes returns a Long (e.g. for positive case matching NULL), sometimes returns the actual column value (LongLong)

methane added a commit that referenced this issue May 12, 2017
@gm42
Copy link
Author

gm42 commented May 12, 2017

Thanks to everyone involved here for the swift support & patching. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants