[Bug]: IndexNode coredump when DeleteIndex and then rebuild it ( cannot be reproduced stably) #25297

Closed
1 task done
yangshuai0711 opened this issue Jul 3, 2023 · 10 comments
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.

@yangshuai0711

yangshuai0711 commented Jul 3, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.11. Other versions may also be affected, but this is unconfirmed.
- Deployment mode: cluster
- MQ type (aws): externally provided S3
- SDK version: Attu 2.2.6
- OS: CentOS 7.6
- CPU/Memory: 
- GPU:

Current Behavior

Create a new collection and insert 20k rows from Attu 2.2.6, or drop a vector index and then rebuild it.
Nothing happens when the index is dropped, but the IndexNode core dumps when the index is rebuilt.

Expected Behavior

The index builds reliably, without crashing the IndexNode.

Steps To Reproduce

Compile Milvus:
./scripts/install_deps.sh
make

vim configs/milvus.yaml
 `
etcd:
  endpoints:
    - 10.21.245.117:2379
...
minio:
    address: mys3.vip.com
    port: 443
    useSSL: true
...
pulsar:
  address: 10.21.169.231 

`

./scripts/start_cluster.sh

In Attu: create a collection, insert 10k rows, insert another 10k rows (20k in total), then create an IVF_FLAT index with nlist 1024.
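
A side note on the nlist value: the log below contains `Row num 10000 match nlist 256`, so the requested nlist of 1024 was not used as-is for a 10,000-row build. The following is a minimal sketch of a rule that reproduces that number. It assumes a minimum of 39 training points per centroid (faiss's default `min_points_per_centroid`); the constant and the function name `match_nlist` are assumptions for illustration and are not quoted from the Milvus or Knowhere source.

```python
# Assumption: 39 is the minimum number of points per centroid
# (faiss's default min_points_per_centroid). It is inferred from the
# log line "Row num 10000 match nlist 256", not taken from Milvus code.
MIN_POINTS_PER_CENTROID = 39


def match_nlist(row_count: int, nlist: int) -> int:
    """Return the nlist that would be used for row_count vectors (hypothetical helper)."""
    if nlist * MIN_POINTS_PER_CENTROID > row_count:
        # Too few rows to train that many clusters: shrink nlist, but keep at least 1.
        return max(1, row_count // MIN_POINTS_PER_CENTROID)
    return nlist


if __name__ == "__main__":
    # The case from this report: 10,000 rows, nlist 1024 requested.
    print(match_nlist(10_000, 1024))
```

Under this assumption, 1024 * 39 = 39936 exceeds 10000, so nlist falls back to 10000 // 39 = 256, which matches the warning. This adjustment is only a warning and is unrelated to the crash itself.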

Milvus Log

2023-07-03 20:12:13,180 INFO [default] [KNOWHERE][SetBlasThreshold][milvus] Set faiss::distance_compute_blas_threshold to 16384
2023-07-03 20:12:13,181 INFO [default] [KNOWHERE][SetEarlyStopThreshold][milvus] Set faiss::early_stop_threshold to 0
2023-07-03 20:12:13,181 INFO [default] [KNOWHERE][SetStatisticsLevel][milvus] Set knowhere::STATISTICS_LEVEL to 0
2023-07-03 20:12:13,181 | DEBUG | default | [SERVER][operator()][milvus] Config easylogging with yaml file: /opt/apps/milvus/configs/easylogging.yaml
2023-07-03 20:12:13,181 | INFO | default | [KNOWHERE][SetSimdType][milvus] FAISS expect simdType::AUTO
2023-07-03 20:12:13,181 | INFO | default | [KNOWHERE][SetSimdType][milvus] FAISS hook AVX2
2023-07-03 20:12:13,181 | DEBUG | default | [SEGCORE][SetIndexSliceSize][milvus] set config index slice size(byte): 16777216
2023-07-03 20:12:13,181 | DEBUG | default | [SEGCORE][SetThreadCoreCoefficient][milvus] set thread pool core coefficient: 10
2023-07-03 20:12:17,279 | INFO | default | [SEGCORE][N6milvus7storage17MinioChunkManagerE::MinioChunkManager][milvus] init MinioChunkManager with parameter[endpoint: 's3plus-bj02.vip.sankuai.com:443', default_bucket_name:'milvus-prod', use_secure:'true']
2023-07-03 20:12:17,279 | WARNING | default | [KNOWHERE][GetGlobalThreadPool][milvus] Global ThreadPool has not been inialized yet, init it now with threads num: 8
2023-07-03 20:12:17,280 | INFO | default | [SEGCORE][N6milvus10ThreadPoolE::ThreadPool][milvus] Thread pool's worker num:80
2023-07-03 20:12:17,647 | WARNING | default | [KNOWHERE][MatchNlist][milvus] Row num 10000 match nlist 256
Fatal error condition occurred in /opt/data/milvus_compile/milvus-2.2.11/cmake_build/3rdparty_download/aws-sdk-subbuild/src/aws_sdk_s3_ep/crt/aws-crt-cpp/crt/aws-c-io/source/event_loop.c:74: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application
################################################################################
Stack trace:
################################################################################
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(aws_backtrace_print+0x46) [0x7f05618a4706]
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(aws_fatal_assert+0x43) [0x7f056189c003]
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(+0x1b3d07) [0x7f05617b0d07]
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(aws_ref_count_release+0x1d) [0x7f05618a51ed]
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(+0x1b1958) [0x7f05617ae958]
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(aws_ref_count_release+0x1d) [0x7f05618a51ed]
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(_ZN3Aws3Crt2Io15ClientBootstrapD2Ev+0x26) [0x7f05617640a6]
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(_ZN3Aws25SetDefaultClientBootstrapERKSt10shared_ptrINS_3Crt2Io15ClientBootstrapEE+0xc2) [0x7f05616bb7d2]
/opt/apps/milvus/lib/libaws-cpp-sdk-core.so(_ZN3Aws10CleanupCrtEv+0x22) [0x7f05616bb932]
/opt/apps/milvus/lib/libmilvus_storage.so(_ZN6milvus7storage17MinioChunkManager14ShutdownSDKAPIEv+0x3e) [0x7f056435261e]
/opt/apps/milvus/lib/libmilvus_storage.so(_ZN6milvus7storage17MinioChunkManagerD1Ev+0x1e) [0x7f056435269e]
/opt/apps/milvus/lib/libmilvus_storage.so(_ZN6milvus7storage17MinioChunkManagerD0Ev+0x9) [0x7f0564352939]
bin/milvus(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x56) [0x33aaa96]
/opt/apps/milvus/lib/libmilvus_storage.so(_ZNSt23_Sp_counted_ptr_inplaceIN6milvus7storage18MemFileManagerImplESaIS2_ELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x86) [0x7f056431a8b6]
/opt/apps/milvus/lib/libmilvus_index.so(_ZN6milvus5index16VectorMemNMIndexD0Ev+0x18a) [0x7f055f9b3eba]
/opt/apps/milvus/lib/libmilvus_indexbuilder.so(_ZN6milvus12indexbuilder15VecIndexCreatorD0Ev+0x2f) [0x7f05638ccdff]
/opt/apps/milvus/lib/libmilvus_indexbuilder.so(DeleteIndex+0x13) [0x7f05638cdc53]
bin/milvus(_cgo_182415b04a2d_Cfunc_DeleteIndex+0x1d) [0x32f582d]
bin/milvus(runtime.asmcgocall.abi0+0x64) [0x1516c04]
SIGABRT: abort
PC=0x7f0561e2c387 m=3 sigcode=18446744073709551610
signal arrived during cgo execution
goroutine 500 [syscall]:
runtime.cgocall(0x32f5810, 0xc000639828)
/usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc000639800 sp=0xc0006397c8 pc=0x14a98fc
github.com/milvus-io/milvus/internal/util/indexcgowrapper._Cfunc_DeleteIndex(0x7f052f306eb0)
_cgo_gotypes.go:344 +0x51 fp=0xc000639828 sp=0xc000639800 pc=0x2c50271
github.com/milvus-io/milvus/internal/util/indexcgowrapper.(*CgoIndex).Delete.func1(0x1d33a97?)
/opt/data/milvus_compile/milvus-2.2.11/internal/util/indexcgowrapper/index.go:320 +0x3a fp=0xc000639860 sp=0xc000639828 pc=0x2c55eba
github.com/milvus-io/milvus/internal/util/indexcgowrapper.(*CgoIndex).Delete(0xc00056a460)
/opt/data/milvus_compile/milvus-2.2.11/internal/util/indexcgowrapper/index.go:320 +0x32 fp=0xc000639898 sp=0xc000639860 pc=0x2c55e32
github.com/milvus-io/milvus/internal/indexnode.(*indexBuildTask).SaveIndexFiles(0xc000f9ad80, {0x46307e8, 0xc001273a40})
/opt/data/milvus_compile/milvus-2.2.11/internal/indexnode/task.go:361 +0x1c8 fp=0xc000639d10 sp=0xc000639898 pc=0x2c628a8
github.com/milvus-io/milvus/internal/indexnode.task.SaveIndexFiles-fm({0x46307e8?, 0xc001273a40?})
:1 +0x3e fp=0xc000639d38 sp=0xc000639d10 pc=0x2c6b0de
github.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask.func1(0xc0020b3e08)
/opt/data/milvus_compile/milvus-2.2.11/internal/indexnode/task_scheduler.go:207 +0x82 fp=0xc000639d68 sp=0xc000639d38 pc=0x2c65e82
github.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).processTask(0xc000e9f000, {0x4645ff8, 0xc000f9ad80}, {0x229fba0?, 0xc001f927e0?})
/opt/data/milvus_compile/milvus-2.2.11/internal/indexnode/task_scheduler.go:220 +0x3c9 fp=0xc000639f60 sp=0xc000639d68 pc=0x2c657e9
github.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func1(0xc001f92840?, {0x4645ff8?, 0xc000f9ad80?})
/opt/data/milvus_compile/milvus-2.2.11/internal/indexnode/task_scheduler.go:253 +0x6c fp=0xc000639fb8 sp=0xc000639f60 pc=0x2c6626c
github.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop.func3()
/opt/data/milvus_compile/milvus-2.2.11/internal/indexnode/task_scheduler.go:254 +0x32 fp=0xc000639fe0 sp=0xc000639fb8 pc=0x2c661d2
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000639fe8 sp=0xc000639fe0 pc=0x1516f41
created by github.com/milvus-io/milvus/internal/indexnode.(*TaskScheduler).indexBuildLoop
/opt/data/milvus_compile/milvus-2.2.11/internal/indexnode/task_scheduler.go:251 +0x186
goroutine 1 [chan receive]:
runtime.gopark(0xc0006db6b0?, 0xc0006db708?, 0xb3?, 0xdd?, 0xc0006db708?)
/usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0013816d8 sp=0xc0013816b8 pc=0x14e2216
runtime.chanrecv(0xc0002a4960, 0xc0006dbc00, 0x1)
/usr/local/go/src/runtime/chan.go:583 +0x49d fp=0xc001381768 sp=0xc0013816d8 pc=0x14ac73d
runtime.chanrecv1(0xc0002a4960?, 0xc0006dbda8?)
/usr/local/go/src/runtime/chan.go:442 +0x18 fp=0xc001381790 sp=0xc001381768 pc=0x14ac238
github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc0007a5e48, 0x0, {0x0, 0x0})
/opt/data/milvus_compile/milvus-2.2.11/cmd/roles/roles.go:346 +0xaf1 fp=0xc001381df8 sp=0xc001381790 pc=0x32e3e71
github.com/milvus-io/milvus/cmd/milvus.(*run).execute(0xc000e3a000, {0xc0000520a0?, 0x5, 0x5}, 0xc000528240)
/opt/data/milvus_compile/milvus-2.2.11/cmd/milvus/run.go:117 +0x68e fp=0xc001381ee0 sp=0xc001381df8 pc=0x32f014e
github.com/milvus-io/milvus/cmd/milvus.RunMilvus({0xc0000520a0?, 0x5, 0x5})
/opt/data/milvus_compile/milvus-2.2.11/cmd/milvus/milvus.go:60 +0x21e fp=0xc001381f58 sp=0xc001381ee0 pc=0x32ef9be
main.main()
/opt/data/milvus_compile/milvus-2.2.11/cmd/main.go:26 +0x2e fp=0xc001381f80 sp=0xc001381f58 pc=0x32f302e
runtime.main()
/usr/local/go/src/runtime/proc.go:250 +0x207 fp=0xc001381fe0 sp=0xc001381f80 pc=0x14e1de7
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc001381fe8 sp=0xc001381fe0 pc=0x1516f41
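
Reading the native frames from the bottom up, `DeleteIndex` destroys the `VecIndexCreator`, which releases `VectorMemNMIndex` and its `MemFileManagerImpl`, whose `MinioChunkManager` destructor calls `ShutdownSDKAPI` and reaches `Aws::CleanupCrt`. The abort is raised there, when `aws_thread_launch` for the event loop cleanup thread does not return `AWS_OP_SUCCESS`. In other words, tearing down one index build reaches process-wide SDK cleanup. The sketch below models that lifetime pattern with a reference-counted guard. `SdkGuard` and its methods are hypothetical names, and whether Milvus counts references this way is an assumption; this is an illustration of the pattern, not the Milvus implementation.

```python
import threading


class SdkGuard:
    """Hypothetical process-wide guard: init on first acquire, shutdown on last release."""

    def __init__(self, init_fn, shutdown_fn):
        self._init_fn = init_fn
        self._shutdown_fn = shutdown_fn
        self._lock = threading.Lock()
        self._users = 0

    def acquire(self):
        with self._lock:
            if self._users == 0:
                self._init_fn()  # first user initializes the global SDK state
            self._users += 1

    def release(self):
        with self._lock:
            if self._users == 0:
                raise RuntimeError("release without matching acquire")
            self._users -= 1
            if self._users == 0:
                self._shutdown_fn()  # last user tears the global SDK state down


def run_builds(overlapping: bool):
    """Simulate two index builds, each owning one chunk manager; return the SDK events."""
    events = []
    guard = SdkGuard(lambda: events.append("init"), lambda: events.append("shutdown"))
    if overlapping:
        guard.acquire()
        guard.acquire()
        guard.release()
        guard.release()
    else:
        guard.acquire()
        guard.release()
        guard.acquire()
        guard.release()
    return events


if __name__ == "__main__":
    print(run_builds(overlapping=True))
    print(run_builds(overlapping=False))
```

With this pattern, builds that do not overlap each pay for a full SDK init and shutdown cycle, so any fragility in the shutdown path is exercised on every build rather than once at process exit.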

Anything else?

No response

@yangshuai0711 yangshuai0711 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 3, 2023
@xiaofan-luan
Collaborator

Fatal error condition occurred in /opt/data/milvus_compile/milvus-2.2.11/cmake_build/3rdparty_download/aws-sdk-subbuild/src/aws_sdk_s3_ep/crt/aws-crt-cpp/crt/aws-c-io/source/event_loop.c:74: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS

How large is the IndexNode? How many cores and how much memory does it have?

seems like a similar issue
Fatal error condition occurred in /opt/data/milvus_compile/milvus-2.2.11/cmake_build/3rdparty_download/aws-sdk-subbuild/src/aws_sdk_s3_ep/crt/aws-crt-cpp/crt/aws-c-io/source/event_loop.c:74: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
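
Since the failing assertion is on `aws_thread_launch`, one thing worth ruling out (an assumption, not a confirmed cause) is that the IndexNode process is close to a thread limit; the log above shows a worker pool of 80 threads on top of the Go runtime threads. A minimal sketch for Linux follows: it reads the `Threads:` field from `/proc/<pid>/status` and the soft `Max processes` limit from `/proc/<pid>/limits`. The helper names are hypothetical.

```python
import os
import sys


def parse_thread_count(status_text: str) -> int:
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def parse_max_processes(limits_text: str):
    """Extract the soft 'Max processes' limit from /proc/<pid>/limits (None if unlimited)."""
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft = line.split()[2]  # columns: "Max", "processes", soft, hard, units
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max processes' line found")


def thread_usage(pid: int):
    """Return (current thread count, soft process/thread limit) for a Linux process."""
    with open(f"/proc/{pid}/status") as f:
        threads = parse_thread_count(f.read())
    with open(f"/proc/{pid}/limits") as f:
        limit = parse_max_processes(f.read())
    return threads, limit


if __name__ == "__main__":
    # Pass the IndexNode pid as the first argument; defaults to this process.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    threads, limit = thread_usage(pid)
    print(f"pid={pid} threads={threads} soft_limit={limit if limit is not None else 'unlimited'}")
```

Note that `Max processes` is a per-user limit on Linux, so threads from other processes owned by the same user count against it as well.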

@xiaofan-luan
Collaborator

According to another issue, this is Windows-only:
huggingface/datasets#3310
@yangshuai0711 could you confirm whether you are running on Windows?

@xiaofan-luan
Collaborator

@xige-16 please help revert the AWS SDK to 1.8.186

aws/aws-sdk-cpp#1809

@xiaofan-luan xiaofan-luan added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 3, 2023
@xiaofan-luan xiaofan-luan added this to the 2.2.12 milestone Jul 3, 2023
@xiaofan-luan xiaofan-luan assigned xige-16 and unassigned yanliang567 Jul 3, 2023
@xiaofan-luan
Collaborator

Related to #25264 (comment)

@yangshuai0711
Author

yangshuai0711 commented Jul 3, 2023

According to another issue, this is Windows-only: huggingface/datasets#3310 @yangshuai0711 could you confirm whether you are running on Windows?

I'm sure it is not Windows; it is CentOS 7.6. The IndexNode runs on 4 servers with 8 cores and 16 GB of memory each.

@yangshuai0711
Author

@xiaofan-luan Would it be a good idea to consolidate the functionality of network calls into Go for uniformity? I have so many stories between AWS and me. 😂

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 4, 2023
@xiaofan-luan
Collaborator

@xiaofan-luan Would it be a good idea to consolidate the functionality of network calls into Go for uniformity? I have so many stories between AWS and me. 😂

We used to use Go for S3 access, but that introduced an actual data copy, so we moved to the C++ SDK.

@xige-16
Contributor

xige-16 commented Jul 6, 2023

The aws-c-io fix PR: awslabs/aws-c-io#515

@xige-16
Contributor

xige-16 commented Aug 9, 2023

Please use milvus 2.2.12 or later to avoid encountering this issue.
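
For deployment scripts that want to guard against the affected release, a minimal sketch of a version check follows. It assumes only what is stated above, namely that 2.2.12 is the first release expected to avoid the issue; `parse_version` and `has_fix` are hypothetical helpers, not part of any Milvus SDK.

```python
# Assumption taken from the comment above: 2.2.12 is the first release with the fix.
FIRST_FIXED = (2, 2, 12)


def parse_version(text: str) -> tuple:
    """Turn 'v2.2.11' or '2.2.11-rc1' into a tuple of ints such as (2, 2, 11)."""
    core = text.strip().lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def has_fix(version: str) -> bool:
    """True if the version is 2.2.12 or later (numeric comparison, not string comparison)."""
    return parse_version(version) >= FIRST_FIXED


if __name__ == "__main__":
    for v in ("2.2.11", "2.2.12", "v2.3.0"):
        print(v, has_fix(v))
```

Comparing integer tuples avoids the string-comparison trap where "2.2.9" would sort after "2.2.12".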

@yanliang567
Contributor

Not reproduced recently; closing for now.
