Potential memory leak on reloading model #72

Open
jriegner opened this issue Oct 7, 2022 · 8 comments

@jriegner

jriegner commented Oct 7, 2022

Hey guys,

We use tfgo and notice an increase in memory usage each time our model is reloaded. We have a running service that periodically checks whether the model has been updated and, if so, reloads it. I wouldn't expect memory usage to grow, since the model in memory should simply be replaced by the updated one.

The code to load the model is:

import (
	tf "github.com/galeone/tensorflow/tensorflow/go"
	tg "github.com/galeone/tfgo"
)

// load the SavedModel into memory
model := tg.LoadModel(
	"path/to/our/model",
	[]string{"serve"},
	&tf.SessionOptions{},
)

But our monitoring shows that usage goes up every time the model is reloaded (once per hour). I profiled the service with pprof and could not find any component of our own code with significantly growing memory usage.

Furthermore, I built TensorFlow 2.9.1 with debug symbols and wrote a small Go app that does nothing but reload the model, so that I could check for memory leaks with memleak-bpfcc from https://github.com/iovisor/bcc. This gave me the following stack trace, which, I believe, shows that memory is leaked:

	1770048 bytes in 9219 allocations from stack
		operator new(unsigned long)+0x19 [libstdc++.so.6.0.28]
		google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::New(google::protobuf::Arena*)+0x1c [libtensorflow_framework.so.2]
		google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::NewFromPrototype(tensorflow::NodeDef const*, google::protobuf::Arena*)+0x20 [libtensorflow_framework.so.2]
		google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type* google::protobuf::internal::RepeatedPtrFieldBase::Add<google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler>(google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type*)+0xc2 [libtensorflow_framework.so.2]
		google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::Add()+0x21 [libtensorflow_framework.so.2]
		tensorflow::FunctionDef::add_node_def()+0x20 [libtensorflow_framework.so.2]
		tensorflow::FunctionDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x334 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDef>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::FunctionDefLibrary::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x240 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDefLibrary>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDefLibrary*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::GraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x291 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::GraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::GraphDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::MetaGraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x325 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::MetaGraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::MetaGraphDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::SavedModel::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x25b [libtensorflow_framework.so.2]
		google::protobuf::MessageLite::MergeFromCodedStream(google::protobuf::io::CodedInputStream*)+0x32 [libtensorflow_framework.so.2]
		google::protobuf::MessageLite::ParseFromCodedStream(google::protobuf::io::CodedInputStream*)+0x3e [libtensorflow_framework.so.2]
		tensorflow::ReadBinaryProto(tensorflow::Env*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::MessageLite*)+0x141 [libtensorflow_framework.so.2]
		tensorflow::(anonymous namespace)::ReadSavedModel(absl::lts_20211102::string_view, tensorflow::SavedModel*)+0x136 [libtensorflow_framework.so.2]
		tensorflow::ReadMetaGraphDefFromSavedModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::MetaGraphDef*)+0x5d [libtensorflow_framework.so.2]
		tensorflow::LoadSavedModelInternal(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0x41 [libtensorflow_framework.so.2]
		tensorflow::LoadSavedModel(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0xc0 [libtensorflow_framework.so.2]
		TF_LoadSessionFromSavedModel+0x2a8 [libtensorflow.so]
		_cgo_6ae2e7a71f9a_Cfunc_TF_LoadSessionFromSavedModel+0x6e [testapp]
		runtime.asmcgocall.abi0+0x64 [testapp]
		github.com/galeone/tensorflow/tensorflow/go._Cfunc_TF_LoadSessionFromSavedModel.abi0+0x4d [testapp]
		github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel.func2+0x14f [testapp]
		github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel+0x2b6 [testapp]
		github.com/galeone/tfgo.LoadModel+0x6d [testapp]
		main.reloadModel+0x276 [testapp]
		main.main+0x72 [testapp]
		runtime.main+0x212 [testapp]
		runtime.goexit.abi0+0x1 [testapp]


As you can see, the stack trace shows calls into tfgo and into the underlying TensorFlow library. I'm not sure I'm reading it correctly, but it looks like there is a leak in tfgo or in TensorFlow itself.
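For context, the reload loop in the test app is essentially just this (a minimal sketch; the model path is a placeholder and the sleep interval is arbitrary):

package main

import (
	"runtime"
	"time"

	tf "github.com/galeone/tensorflow/tensorflow/go"
	tg "github.com/galeone/tfgo"
)

// reloadModel loads the SavedModel from disk; every call allocates a new
// graph and session on the C side.
func reloadModel() *tg.Model {
	return tg.LoadModel("path/to/our/model", []string{"serve"}, &tf.SessionOptions{})
}

func main() {
	for {
		model := reloadModel()
		_ = model    // the previously loaded model is now unreachable
		runtime.GC() // encourage collection between iterations
		time.Sleep(time.Second)
	}
}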

Is there a way to explicitly release the memory of a loaded model when we reload it? Or could this be a problem in tfgo?
If you need more information, please let me know.

Thanks in advance :)

@galeone
Owner

galeone commented Oct 7, 2022

Hi @jriegner, thank you for this report!

You're right, there's a leak. And I guess I found where the leak is 😄

LoadModel returns a *tfgo.Model that contains a *tf.SavedModel from the TensorFlow Go bindings. This tf.SavedModel has a *tf.Session field.

When the model is deleted, the session is not automatically closed (it's the old tf.Session from Python, if you are familiar with TensorFlow < 2), and thus I guess the leak comes from the missing invocation of session.Close().

I guess we have 2 options:

  1. Expose the session and let the user invoke the Close() method (I don't like it).
  2. Create a finalizer that automatically invokes session.Close() when the tfgo.Model is garbage collected (see the sketch below).

I kinda like the second approach more.
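A minimal sketch of the finalizer approach, assuming a simplified Model that just wraps the bindings' *tf.SavedModel (the field name and LoadModel body here are illustrative, not tfgo's actual internals):

import (
	"runtime"

	tf "github.com/galeone/tensorflow/tensorflow/go"
)

// Model is a simplified stand-in for tfgo.Model.
type Model struct {
	saved *tf.SavedModel
}

// LoadModel loads the SavedModel and registers a finalizer that closes
// its session when the wrapper is garbage collected.
func LoadModel(path string, tags []string, options *tf.SessionOptions) *Model {
	saved, err := tf.LoadSavedModel(path, tags, options)
	if err != nil {
		panic(err)
	}
	model := &Model{saved: saved}
	runtime.SetFinalizer(model, func(m *Model) {
		// Close releases the native memory held by the C session.
		_ = m.saved.Session.Close()
	})
	return model
}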

Right now I'm travelling, so I don't know when I'll be able to work on it (I'm setting up the development environment right now while I'm on a train, lol). If you want to implement the second option, I'll be more than happy to review and merge it.

@galeone
Owner

galeone commented Oct 7, 2022

Update: I think I fixed it in b536202.

Give it a try and let me know if it works

@jriegner
Author

jriegner commented Oct 10, 2022

Hey, thanks for the quick reply!

I ran my local test app with the updated code, and memleak-bpfcc still complained about leaked memory. I verified that the finalizer ran, and I also tried explicitly closing the session on every reload. Same result.
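For reference, "explicitly closed the session" means going through the underlying bindings, roughly like this (a sketch of one reload iteration; it uses tf.LoadSavedModel directly because that exposes the Session field, and assumes the tf, log, and runtime imports from the snippets above):

// One iteration of the explicit-close reload loop.
bundle, err := tf.LoadSavedModel("path/to/our/model", []string{"serve"}, nil)
if err != nil {
	log.Fatalf("loading model: %v", err)
}
// ... run inference with bundle.Session ...
if err := bundle.Session.Close(); err != nil {
	log.Printf("closing session: %v", err)
}
runtime.GC() // also gives pending finalizers a chance to run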

I will also check our production setup and report back.

Edit:
I monitored our service with the latest version of tfgo for a while and can still see memory increasing on each reload.

@galeone
Owner

galeone commented Oct 16, 2022

If the leak isn't related to the missing session close, I have no other clue 🤔

For sure there was a leak caused by the missing session.Close(), which should now be fixed (I had forgotten to add the same line in tg.ImportModel and fixed that yesterday).

But I have no idea what could cause this issue on the tfgo side; maybe it's a leak in the TensorFlow C library.

@jriegner
Author

Alright, I will update the version on my side and give it a try. Thanks for your support; maybe the last update fixed it 👍🏻

@LoveVsLike

I tried the fix in b536202, but it doesn't work.

@galeone
Owner

galeone commented Dec 20, 2023

Hi @LoveVsLike - I guess the problem is inside the TensorFlow bindings. As you can see, in tfgo we just open the session and use a finalizer to close it when the model is collected. So the problem is in the TensorFlow code, I guess.

Perhaps you can open an issue on https://github.com/tensorflow/tensorflow and link this thread there. Maybe someone from the TensorFlow team can help us.

Anyway, since tfgo is still using TensorFlow 2.9.1, I can try to update my fork to 2.14. Maybe the leak has already been fixed and we don't know it (but I'm not confident, since this leak has persisted version after version for years...).

I'll update the fork and I'll let you know.

@LoveVsLike

Hi, I updated TF and tfgo to the latest versions, but it still doesn't work. Can you open an issue on the TensorFlow repo?
