[Core] Likely regression in GCS memory usage from 2.2 to 2.3 releases #32090

Closed
cadedaniel opened this issue Jan 31, 2023 · 11 comments · Fixed by #32302
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order), release-blocker (P0 Issue that blocks the release)

Comments

@cadedaniel
Member

@cadedaniel cadedaniel added the release-blocker, P0, and core labels Jan 31, 2023
@scv119
Contributor

scv119 commented Feb 1, 2023

hmm cc @rkooo567 @rickyyx likely due to the new task backend?

@scv119 scv119 assigned rickyyx and unassigned fishbone Feb 1, 2023
@rickyyx
Contributor

rickyyx commented Feb 1, 2023

Hmm, we cap at 100k tasks. The amount of data stored is only on the order of MiB; there are auxiliary data structures, but their size should be proportional to the number of tasks stored. (This is on the many_task test.)

image

But let me rerun those tests with the task backend entirely disabled.

@rickyyx
Contributor

rickyyx commented Feb 1, 2023

So I disabled task events (no task events will be reported). The memory usage is still high. Interestingly, it is already high right at startup (look at the very first memory monitor printout):

image

@rkooo567
Contributor

rkooo567 commented Feb 6, 2023

Looks like we need to bisect.
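
For readers unfamiliar with the approach, here is a minimal sketch of what bisecting means here: a binary search over an ordered commit range for the first commit that shows the regression, which is the same search `git bisect` automates. The names `first_bad` and `is_bad`, the commit labels, and the memory numbers are all hypothetical placeholders, not anything taken from this issue or from Ray's tooling.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and the regression persists once introduced.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Placeholder data only: made-up commit labels and made-up peak memory in MB.
commits = [f"c{i}" for i in range(8)]
peak_mb = {c: (300 if i < 5 else 1000) for i, c in enumerate(commits)}

culprit = first_bad(commits, lambda c: peak_mb[c] > 500)
print(culprit)
```

In practice `is_bad` would build the commit, run the release test, and compare peak GCS memory against a threshold, so each probe is expensive and the logarithmic number of probes is the point.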

@rkooo567
Contributor

rkooo567 commented Feb 6, 2023

Maybe some regression in GCS memory usage

@fishbone
Contributor

fishbone commented Feb 6, 2023

Hi @rkooo567 feel free to bisect it if you like

@fishbone
Contributor

fishbone commented Feb 7, 2023

File: gcs_server
Type: inuse_space
Showing nodes accounting for 990.37MB, 99.04% of 999.99MB total
Dropped 1174 nodes (cum <= 5MB)
      flat  flat%   sum%        cum   cum%
  683.41MB 68.34% 68.34%   971.24MB 97.12%  ray::rpc::ServerCallFactoryImpl::CreateCall
  133.72MB 13.37% 81.71%   200.49MB 20.05%  grpc::ServerInterface::RequestAsyncCall
   87.33MB  8.73% 90.45%    87.33MB  8.73%  google::protobuf::internal::AllocateMemory
   62.77MB  6.28% 96.72%    66.77MB  6.68%  grpc_core::Server::RequestRegisteredCall
   23.15MB  2.31% 99.04%    23.15MB  2.31%  gpr_malloc
         0     0% 99.04%    21.08MB  2.11%  (anonymous namespace)::ArenaStorage
         0     0% 99.04%    26.73MB  2.67%  clone
         0     0% 99.04%    27.14MB  2.71%  cq_next
         0     0% 99.04%    27.14MB  2.71%  end_worker
         0     0% 99.04%     5.96MB   0.6%  exec_ctx_run
         0     0% 99.04%    26.51MB  2.65%  execute_native_thread_routine
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::Arena::AllocateAlignedWithHook
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::Arena::AllocateInternal
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::Arena::CreateMaybeMessage
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::Arena::CreateMessage (inline)
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::Arena::CreateMessageInternal
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::Arena::DoCreateMessage
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::internal::ThreadSafeArena::AllocateAligned
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::internal::ThreadSafeArena::AllocateAlignedFallback
         0     0% 99.04%    87.33MB  8.73%  google::protobuf::internal::ThreadSafeArena::GetSerialArenaFallback
         0     0% 99.04%    21.08MB  2.11%  gpr_malloc_aligned
         0     0% 99.04%    27.14MB  2.71%  grpc::CompletionQueue::AsyncNext
         0     0% 99.04%    27.14MB  2.71%  grpc::CompletionQueue::AsyncNextInternal
         0     0% 99.04%    66.77MB  6.68%  grpc::ServerInterface::PayloadAsyncRequest::PayloadAsyncRequest
         0     0% 99.04%    66.77MB  6.68%  grpc::ServerInterface::RegisteredAsyncRequest::IssueRequest
         0     0% 99.04%   200.49MB 20.05%  grpc::Service::RequestAsyncUnary
         0     0% 99.04%    21.09MB  2.11%  grpc_call_create
         0     0% 99.04%    21.09MB  2.11%  grpc_chttp2_parsing_accept_stream
         0     0% 99.04%    21.50MB  2.15%  grpc_chttp2_perform_read
         0     0% 99.04%    21.51MB  2.15%  grpc_combiner_continue_exec_ctx
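
A note on reading the table above, with a small sketch for anyone who wants to process it. In this kind of heap profile listing, `flat` is memory allocated directly in a function and `cum` also counts memory allocated by its callees, so the row with the largest `flat` is usually the first place to look. The parser below is only an illustration: `parse_top` and the regular expression are my own names, it assumes every size is printed in MB (as in this listing), and the sample contains three rows copied from the output above.

```python
import re

# One data row: flat flat% sum% cum cum% name
# Assumption: sizes are either "0" or a number followed by "MB".
ROW = re.compile(
    r"^\s*([\d.]+)(?:MB)?\s+([\d.]+)%\s+([\d.]+)%\s+"
    r"([\d.]+)(?:MB)?\s+([\d.]+)%\s+(.+)$"
)


def parse_top(text):
    """Parse the data rows of a text heap-profile listing into dicts."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # skip header lines such as "File:" or the column titles
        flat, flat_pct, _sum_pct, cum, cum_pct, name = m.groups()
        rows.append({
            "name": name.strip(),
            "flat_mb": float(flat),
            "flat_pct": float(flat_pct),
            "cum_mb": float(cum),
            "cum_pct": float(cum_pct),
        })
    return rows


SAMPLE = """\
      flat  flat%   sum%        cum   cum%
  683.41MB 68.34% 68.34%   971.24MB 97.12%  ray::rpc::ServerCallFactoryImpl::CreateCall
  133.72MB 13.37% 81.71%   200.49MB 20.05%  grpc::ServerInterface::RequestAsyncCall
         0     0% 99.04%   200.49MB 20.05%  grpc::Service::RequestAsyncUnary
"""

rows = parse_top(SAMPLE)
top = max(rows, key=lambda r: r["flat_mb"])
print(top["name"], top["flat_mb"], top["cum_mb"])
```

On the sample, the largest direct allocator is `ray::rpc::ServerCallFactoryImpl::CreateCall`, which matches the first row of the listing.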

@rkooo567 rkooo567 added the bug label Feb 7, 2023
@fishbone fishbone linked a pull request Feb 8, 2023 that will close this issue
7 tasks
@zhe-thoughts
Collaborator

Fixing this makes me so happy :) Thanks @iycheng and folks!

@rickyyx
Contributor

rickyyx commented Feb 8, 2023

Which tool generates this? pprof?

@cadedaniel
Member Author

I am reopening this until #32323 is cherry-picked onto 2.3.0

@cadedaniel cadedaniel reopened this Feb 8, 2023
@scv119 scv119 closed this as completed Feb 8, 2023
@fishbone
Contributor

fishbone commented Feb 8, 2023

Which tool generates this? pprof?

Yes

6 participants