
Crash when online write #5752

Closed

JimWen opened this issue Jun 28, 2020 · 18 comments

Labels
kind/bug Something is broken. status/accepted We accept to investigate/work on it.

Comments


JimWen commented Jun 28, 2020

What version of Dgraph are you using?

Dgraph version : v20.03.1
Dgraph SHA-256 : 6a40b1e084205ae9e29336780b3458a3869db45c0b96b916190881c16d705ba8
Commit SHA-1 : c201611
Commit timestamp : 2020-04-24 13:53:41 -0700
Branch : HEAD
Go version : go1.14.1

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

128G mem & 1.8T SSD

Linux version 3.10.0-1062.9.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019

Steps to reproduce the issue (command/config used to run Dgraph).

Data is written online through the Java API or the live loader. It works fine for the first few days, but crashes after some days.

Expected behaviour and actual result.

The crash log is as follows; it seems to be related to Badger.

panic: runtime error: slice bounds out of range [4172166119:1963281920]

goroutine 25348428 [running]:
github.com/dgraph-io/badger/v2/table.(*Builder).addHelper(0xc01fc8ccc0, 0xc154c416c0, 0x1c, 0x20, 0x440, 0x0, 0xc203897e8d, 0x55, 0xe7, 0x0, ...)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/table/builder.go:222 +0x4a1
github.com/dgraph-io/badger/v2/table.(*Builder).Add(0xc01fc8ccc0, 0xc154c416c0, 0x1c, 0x20, 0x440, 0x0, 0xc203897e8d, 0x55, 0xe7, 0x0, ...)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/table/builder.go:339 +0xe0
github.com/dgraph-io/badger/v2.(*levelsController).compactBuildTables(0xc000322070, 0x2, 0x1c394a0, 0xc02be6c820, 0xc0005d6960, 0xc0005d69c0, 0xc1225a9c40, 0x1, 0x1, 0xc1225a9c48, ...)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:607 +0x92a
github.com/dgraph-io/badger/v2.(*levelsController).runCompactDef(0xc000322070, 0x2, 0x1c394a0, 0xc02be6c820, 0xc0005d6960, 0xc0005d69c0, 0xc1225a9c40, 0x1, 0x1, 0xc1225a9c48, ...)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:835 +0xc6
github.com/dgraph-io/badger/v2.(*levelsController).doCompact(0xc000322070, 0x2, 0x3ff03c83ae800000, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:904 +0x4b7
github.com/dgraph-io/badger/v2.(*levelsController).runWorker(0xc000322070, 0xc0040d1800)
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:365 +0x319
created by github.com/dgraph-io/badger/v2.(*levelsController).startCompact
        /root/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200421062606-cddf7c03451c/levels.go:340 +0x88
jarifibrahim (Contributor) commented Jun 30, 2020

@JimWen can you show me the contents of the p directory on which the crash happened? From the information in the crash it looks like the builder was trying to build a table with around 4.0G size (4172166119 == 4.0G). The usual size for tables is 70 MB so this size is definitely odd.

I want to see the size of the other SST files in this p directory.

JimWen (Author) commented Jun 30, 2020

@jarifibrahim I'm afraid I can't show you the contents of the p directory, because I had to clear all the alpha data and reimport it to work around the problem.

jarifibrahim (Contributor) commented

@JimWen Does the crash show up when you bulk load a lot of data like 5 billion nquads? I'll try to insert 5 billion nquads in dgraph and see how it goes.

JimWen (Author) commented Jul 1, 2020

> @JimWen Does the crash show up when you bulk load a lot of data like 5 billion nquads? I'll try to insert 5 billion nquads in dgraph and see how it goes.

No, it only happens during live load or online mutations through the API.

jarifibrahim (Contributor) commented Jul 1, 2020

> No, it only happens during live load or online mutations through the API.

Got it. I'm running dgraph on 2 billion RDFs. Let me see if it goes through.

jarifibrahim (Contributor) commented

@JimWen my 2-billion test didn't crash. Do you have any suggestions, or specifics of what you did, so that I can reproduce the failure? I haven't tried the Java API because that shouldn't have any effect on this crash.

JimWen (Author) commented Jul 4, 2020

> @JimWen my 2-billion test didn't crash. Do you have any suggestions, or specifics of what you did, so that I can reproduce the failure? I haven't tried the Java API because that shouldn't have any effect on this crash.

If the schema is like

type User {
follower: uid
}

but the data actually written is [uid] (that is, 1-N), would this just cause data loss, or could it also cause a crash?

  1. It runs in ludicrous mode; could that cause the crash?
  2. I notice it always crashes after I stop writing for a long time and restart the cluster's alpha node; then it works fine for a while and crashes again.

@jarifibrahim
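For reference, these are the two schema forms being contrasted, in Dgraph schema syntax (the predicate name is taken from the question; this is a sketch, not the reporter's actual schema):

```
# Scalar uid: at most one follower edge per node.
follower: uid .

# List of uids: many follower edges per node (the 1-N case).
follower: [uid] .
```

With the scalar form, writing multiple uids for the same subject would normally be expected to show up as data loss (only one edge kept) rather than a crash, though the maintainers do not confirm that in this thread.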

JimWen (Author) commented Jul 6, 2020

@jarifibrahim
It crashed again after writing online for 7 days using v20.03.3; the total data is about 7 billion edges. It runs in ludicrous mode, and the crash log is as follows:

panic: runtime error: slice bounds out of range [:4172166156] with capacity 1963281920

goroutine 21418 [running]:
github.com/dgraph-io/badger/v2/table.(*Builder).addBlockToIndex(...)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:296
github.com/dgraph-io/badger/v2/table.(*Builder).finishBlock(0xc1b382c180)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:290 +0x64d
github.com/dgraph-io/badger/v2/table.(*Builder).Add(0xc1b382c180, 0xc18ac559c0, 0x1e, 0x20, 0x440, 0x0, 0xc13621a025, 0x154, 0x13db, 0x0, ...)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/table/builder.go:333 +0xf8
github.com/dgraph-io/badger/v2.(*levelsController).compactBuildTables(0xc0001c2070, 0x2, 0x1b81ee0, 0xc06930d040, 0xc00018e8a0, 0xc00018e900, 0xc21da10768, 0x1, 0x1, 0xc21da10770, ...)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:613 +0x90b
github.com/dgraph-io/badger/v2.(*levelsController).runCompactDef(0xc0001c2070, 0x2, 0x1b81ee0, 0xc06930d040, 0xc00018e8a0, 0xc00018e900, 0xc21da10768, 0x1, 0x1, 0xc21da10770, ...)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:862 +0xe1
github.com/dgraph-io/badger/v2.(*levelsController).doCompact(0xc0001c2070, 0x2, 0x3ff0303383000000, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:931 +0x49d
github.com/dgraph-io/badger/v2.(*levelsController).runWorker(0xc0001c2070, 0xc01e3893e0)
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:368 +0x2fd
created by github.com/dgraph-io/badger/v2.(*levelsController).startCompact
/go/pkg/mod/github.com/dgraph-io/badger/v2@v2.0.1-rc1.0.20200528205344-e7b6e76f96e8/levels.go:343 +0x88

I also checked the p directory; there is no table near 4 GB. The largest files are listed below:

-rw------- 1 work work 70M Jul 4 13:51 001918.vlog
-rw------- 1 work work 70M Jul 5 16:51 002277.vlog
-rw------- 1 work work 70M Jul 4 14:41 001940.vlog
-rw------- 1 work work 70M Jul 6 12:41 002565.vlog
-rw------- 1 work work 70M Jul 4 09:00 001796.vlog
-rw------- 1 work work 70M Jul 4 16:51 001994.vlog
-rw------- 1 work work 70M Jul 4 13:40 001914.vlog
-rw------- 1 work work 70M Jul 4 09:11 001800.vlog
-rw------- 1 work work 70M Jul 6 12:51 002569.vlog
-rw------- 1 work work 70M Jul 4 15:50 001970.vlog
-rw------- 1 work work 70M Jul 4 16:20 001982.vlog
-rw------- 1 work work 70M Jul 5 16:01 002260.vlog
-rw------- 1 work work 70M Jul 4 16:30 001986.vlog
-rw------- 1 work work 70M Jul 4 16:10 001978.vlog
-rw------- 1 work work 70M Jul 5 19:01 002314.vlog
-rw------- 1 work work 70M Jul 5 15:01 002239.vlog
-rw------- 1 work work 70M Jul 5 17:01 002280.vlog
-rw------- 1 work work 70M Jul 5 11:41 002178.vlog
-rw------- 1 work work 70M Jul 5 16:31 002270.vlog
-rw------- 1 work work 70M Jul 5 16:21 002267.vlog
-rw------- 1 work work 70M Jul 5 09:41 002133.vlog
-rw------- 1 work work 70M Jul 5 21:41 002352.vlog
-rw------- 1 work work 70M Jul 4 17:01 001998.vlog
-rw------- 1 work work 70M Jul 4 17:21 002005.vlog
-rw------- 1 work work 70M Jul 3 20:00 001714.vlog
-rw------- 1 work work 70M Jul 5 21:01 002342.vlog
-rw------- 1 work work 70M Jul 3 20:51 001726.vlog
-rw------- 1 work work 71M Jul 5 14:01 002218.vlog
-rw------- 1 work work 71M Jul 4 08:50 001793.vlog
-rw------- 1 work work 71M Jul 4 12:01 001885.vlog
-rw------- 1 work work 71M Jul 5 12:41 002195.vlog
-rw------- 1 work work 71M Jul 4 17:40 002012.vlog
-rw------- 1 work work 71M Jul 3 12:21 001499.vlog
-rw------- 1 work work 71M Jul 4 12:51 001899.vlog
-rw------- 1 work work 71M Jul 3 18:51 001696.vlog
-rw------- 1 work work 71M Jul 5 12:21 002190.vlog
-rw------- 1 work work 71M Jul 3 19:20 001704.vlog
-rw------- 1 work work 71M Jul 4 12:11 001888.vlog
-rw------- 1 work work 71M Jul 2 17:31 001216.vlog
-rw------- 1 work work 71M Jul 4 18:20 002024.vlog
-rw------- 1 work work 72M Jul 4 10:50 001851.vlog
-rw------- 1 work work 146M Jul 6 11:50 036406.sst
-rw------- 1 work work 202M Jul 6 12:01 036593.sst
-rw------- 1 work work 249M Jul 6 10:01 034726.sst
-rw------- 1 work work 327M Jul 5 22:42 032620.sst
-rw------- 1 work work 422M Jul 5 11:21 029259.sst
-rw------- 1 work work 459M Jul 5 14:40 030111.sst
-rw------- 1 work work 501M Jul 6 10:01 034725.sst
-rw------- 1 work work 719M Jul 6 10:56 035595.sst
-rw------- 1 work work 1.1G Jul 6 10:12 034913.sst
-rw------- 1 work work 1.3G Jul 6 12:02 036595.sst

And the zero state is as follows:

{"counter":"27638547","groups":{"1":{"members":{"1":{"id":"1","groupId":1,"addr":"172.21.46.35:7080","leader":true,"lastUpdate":"1593611550"}},"tablets":{"chl":{"groupId":1,"predicate":"chl","space":"4695775026"},"dgraph.graphql.schema":{"groupId":1,"predicate":"dgraph.graphql.schema"},"dgraph.graphql.xid":{"groupId":1,"predicate":"dgraph.graphql.xid"},"dgraph.type":{"groupId":1,"predicate":"dgraph.type","space":"41421216460"},"extmsg":{"groupId":1,"predicate":"extmsg"},"from":{"groupId":1,"predicate":"from","space":"223127171705"},"net":{"groupId":1,"predicate":"net","space":"4220898548"},"title":{"groupId":1,"predicate":"title","space":"12325089358"},"to":{"groupId":1,"predicate":"to","space":"13864305792"},"withExpect":{"groupId":1,"predicate":"withExpect","space":"10265741211"}},"snapshotTs":"54649802","checksum":"3759641937265695043"},"2":{"members":{"2":{"id":"2","groupId":2,"addr":"172.21.46.43:7080","leader":true,"lastUpdate":"1593611555"}},"tablets":{"aid":{"groupId":2,"predicate":"aid","space":"12185695440"},"gid_gps":{"groupId":2,"predicate":"gid_gps","space":"2246383293"},"gid_ip":{"groupId":2,"predicate":"gid_ip","space":"4275951255"},"gid_user":{"groupId":2,"predicate":"gid_user","space":"10186607933"},"src":{"groupId":2,"predicate":"src","space":"9066055157"}},"snapshotTs":"54848420","checksum":"16815000284943703647"},"3":{"members":{"3":{"id":"3","groupId":3,"addr":"172.21.46.44:7080","leader":true,"lastUpdate":"1593611557"}},"tablets":{"gid_device":{"groupId":3,"predicate":"gid_device","space":"5483695993"},"gid_did":{"groupId":3,"predicate":"gid_did","space":"304280668"},"gid_idfa":{"groupId":3,"predicate":"gid_idfa","space":"1199109308"},"gid_wifi":{"groupId":3,"predicate":"gid_wifi","space":"1457055681"},"loc":{"groupId":3,"predicate":"loc","space":"16879986572"},"status":{"groupId":3,"predicate":"status","space":"354027403"},"t":{"groupId":3,"predicate":"t","space":"10332890221"}},"snapshotTs":"55094804","checksum":"3247556968549744124"},"4":{"members":{"4":{"id":"4","groupId":4,"addr":"172.21.46.59:7080","leader":true,"lastUpdate":"1593611559"}},"tablets":{"gid_brand":{"groupId":4,"predicate":"gid_brand","moveTs":"39"},"gid_job":{"groupId":4,"predicate":"gid_job","space":"3165219418"},"gid_mac":{"groupId":4,"predicate":"gid_mac","space":"1763035672"},"msg":{"groupId":4,"predicate":"msg","space":"176492074"},"pl":{"groupId":4,"predicate":"pl","space":"9285853378"},"withDID":{"groupId":4,"predicate":"withDID","space":"114209649"},"withDeviceId":{"groupId":4,"predicate":"withDeviceId","space":"14719347936"},"withIdfa":{"groupId":4,"predicate":"withIdfa","moveTs":"24"},"withIp":{"groupId":4,"predicate":"withIp","space":"14052854988"},"withMac":{"groupId":4,"predicate":"withMac","space":"3683865987"}},"snapshotTs":"54637590","checksum":"876440663310665151"},"5":{"members":{"5":{"id":"5","groupId":5,"addr":"172.21.46.65:7080","leader":true,"lastUpdate":"1593611561"}},"tablets":{"bg":{"groupId":5,"predicate":"bg","space":"10036594964"},"gid_company":{"groupId":5,"predicate":"gid_company","moveTs":"40"},"gid_expect":{"groupId":5,"predicate":"gid_expect","space":"5837675755"},"gid_interview":{"groupId":5,"predicate":"gid_interview","space":"174559486"},"withBrand":{"groupId":5,"predicate":"withBrand","moveTs":"28"},"withCompany":{"groupId":5,"predicate":"withCompany","moveTs":"29"},"withGps":{"groupId":5,"predicate":"withGps","space":"7782805485"},"withInterview":{"groupId":5,"predicate":"withInterview","space":"221288458"},"withJob":{"groupId":5,"predicate":"withJob","space":"10772862262"},"withWifi":{"groupId":5,"predicate":"withWifi","space":"4211920663"}},"snapshotTs":"54489876","checksum":"11427293922357654662"}},"zeros":{"1":{"id":"1","addr":"172.21.46.35:5080","leader":true}},"maxLeaseId":"15000000000000020000","maxTxnTs":"55300000","maxRaftId":"5","cid":"dca06408-70ba-4f46-ae60-7df73f943dd9","license":{"maxNodes":"18446744073709551615","expiryTs":"1596203542","enabled":true}}

And here are my startup parameters:

if [ $# -lt 2 ]; then
    echo 'too few arguments, at least 2 required'
    exit 1
fi

type=$1
offset=$2

echo "type:$type, offset:$offset"

path=./log
if [ ! -d $path ]; then
    mkdir $path
fi

replica=1
zero_1=172.21.46.35
dgpath=/data/app/Dgraph/bin/dgraph
uipath=/data/app/Dgraph/bin/dgraph-ratel
cfpath=/data/app/Dgraph/config/config.json

local_host=$(hostname --fqdn)
local_ip=$(/sbin/ifconfig -a | grep inet | grep -v 127.0.0.1 | grep -v inet6 | awk '{print $2}' | tr -d "addr:")

echo "ip:$local_ip"

case $type in
zero1)
    index=$(expr $offset + 1)
    port=$(expr $offset + 5080)
    nohup $dgpath zero --enable_sentry=false --ludicrous_mode --my=$local_ip:$port -o $offset --idx $index --replicas $replica --config $cfpath --cwd zero$index --rebalance_interval 12000h > ./log/zero$index.log 2>&1 &
    ;;
zero)
    index=$(expr $offset + 1)
    port=$(expr $offset + 5080)
    nohup $dgpath zero --enable_sentry=false --ludicrous_mode --my=$local_ip:$port -o $offset --idx $index --replicas $replica --config $cfpath --cwd zero$index --peer $zero_1:5080 > ./log/zero$index.log 2>&1 &
    ;;
alpha)
    index=$(expr $offset + 1)
    port=$(expr $offset + 7080)
    nohup $dgpath alpha --enable_sentry=false --ludicrous_mode --snapshot_after=1000000 --abort_older_than 15m --my=$local_ip:$port -o $offset --zero=$zero_1:5080 --config $cfpath --cwd alpha$index > ./log/alpha$index.log 2>&1 &
    ;;
ratel)
    index=$(expr $offset + 1)
    port=$(expr $offset + 8000)
    nohup $uipath -addr "http://$local_ip:$port" -port $port > ./log/web$index.log 2>&1 &
    ;;
*)
    echo 'wrong type input'
    ;;
esac

jarifibrahim (Contributor) commented

@JimWen can you share alpha logs as well?

JimWen (Author) commented Jul 6, 2020

> @JimWen can you share alpha logs as well?

Sorry, I have cleared all the data in order to rerun without ludicrous mode. I noticed nothing abnormal in that file except the crash log.

jarifibrahim (Contributor) commented

@JimWen Can you save the big SST files and the logs next time it crashes? I'd like to look at the SST files that are over 500 MB; these files should be 65 MB by default. If the size gets anywhere near 4 GB, badger will crash because of a uint32 overflow.

JimWen (Author) commented Jul 7, 2020

@jarifibrahim ok, I will save them next time.

jarifibrahim (Contributor) commented

@JimWen would you be able to share your RDFs? It might be that this crash is caused by some specific input. I don't know how plausible that is, but I'd like to explore that option as well.
You can send them to Ibrahim[at]dgraph.io

@jarifibrahim jarifibrahim added the kind/bug Something is broken. label Jul 8, 2020
JimWen (Author) commented Jul 10, 2020

> @JimWen would you be able to share your RDFs? It might be that this crash is caused by some specific input. I don't know how plausible that is, but I'd like to explore that option as well.
> You can send them to Ibrahim[at]dgraph.io

@jarifibrahim Sorry, I'm afraid I can't share this business data.

lgalatin (Contributor) commented

Hi @JimWen, we would like to set up a call with you to see if we can troubleshoot this problem further. May I contact you by email to arrange a session with @jarifibrahim?

JimWen (Author) commented Jul 15, 2020

@jarifibrahim do we have any tools to help inspect the keys in an SST file? It also crashes when not running in ludicrous mode. And is it normal for an SST file to be bigger than 65 MB?

jarifibrahim (Contributor) commented

@JimWen the dgraph debug tool can show the keys and their metadata. It's not normal to have an SST file larger than 65 MB; SSTs grow beyond that only when there are too many duplicate keys. To get a ~3 GB SST file, you'd need a very large number of duplicate keys, which doesn't happen under normal usage.

Badger uses uint32 in multiple places, and values bigger than 4 GB will cause the uint32 to overflow and crash.
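The overflow described here can be illustrated in a few lines of Go. The numbers are illustrative (chosen to mirror the sizes in the panic messages); this is not Badger's actual code.

```go
// Demonstrates why a table growing past 4 GiB breaks uint32 offsets,
// and the kind of panic that results. Illustrative numbers only.
package main

import "fmt"

func main() {
	// An offset that would arise once a table grows past 4 GiB.
	realOffset := int64(4_500_000_000)

	// Storing it as uint32 silently wraps modulo 2^32.
	wrapped := uint32(realOffset)
	fmt.Println(wrapped)                       // 205032704, not 4500000000
	fmt.Println(realOffset - (int64(1) << 32)) // same value: the wrap amount

	// The same mismatch at small scale: slicing past a buffer's capacity
	// panics just like the reported "slice bounds out of range [:N] with
	// capacity M" log lines.
	defer func() { fmt.Println("recovered:", recover()) }()
	buf := make([]byte, 0, 8)
	n := 16
	_ = buf[:n] // panics: slice bounds out of range [:16] with capacity 8
}
```

Once an offset has wrapped, any slice bound computed from it can point far outside the builder's buffer, which matches the `[:4172166156] with capacity 1963281920` panic in the logs above.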

minhaj-shakeel (Contributor) commented

Github issues have been deprecated.
This issue has been moved to discuss. You can follow the conversation there and also subscribe to updates by changing your notification preferences.
