Potential memory leak on reloading model #72

Open
jriegner opened this issue Oct 7, 2022 · 8 comments

@jriegner

jriegner commented Oct 7, 2022

Hey guys,

We use tfgo and notice an increase in memory usage each time our model is reloaded. We have a running service that periodically checks whether the model has been updated and, if so, reloads it. I wouldn't expect memory usage to grow, since the model in memory should simply be replaced by the updated one.

The code to load the model is:

import (
	tf "github.com/galeone/tensorflow/tensorflow/go"
	tg "github.com/galeone/tfgo"
)

// load the SavedModel into memory
model := tg.LoadModel(
	"path/to/our/model",
	[]string{"serve"},
	&tf.SessionOptions{},
)

But our monitoring shows that usage goes up every time the model is reloaded (once per hour). I profiled the service with pprof and could not find any component of our own code with significantly growing memory usage.

Furthermore, I built TensorFlow 2.9.1 with debug symbols and wrote a small Go app that does nothing but reload the model, so that I could check for memory leaks with memleak-bpfcc from https://github.com/iovisor/bcc. This gave me the following stack trace, which, I believe, shows that memory is leaked:

	1770048 bytes in 9219 allocations from stack
		operator new(unsigned long)+0x19 [libstdc++.so.6.0.28]
		google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::New(google::protobuf::Arena*)+0x1c [libtensorflow_framework.so.2]
		google::protobuf::internal::GenericTypeHandler<tensorflow::NodeDef>::NewFromPrototype(tensorflow::NodeDef const*, google::protobuf::Arena*)+0x20 [libtensorflow_framework.so.2]
		google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type* google::protobuf::internal::RepeatedPtrFieldBase::Add<google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler>(google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::TypeHandler::Type*)+0xc2 [libtensorflow_framework.so.2]
		google::protobuf::RepeatedPtrField<tensorflow::NodeDef>::Add()+0x21 [libtensorflow_framework.so.2]
		tensorflow::FunctionDef::add_node_def()+0x20 [libtensorflow_framework.so.2]
		tensorflow::FunctionDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x334 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDef>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::FunctionDefLibrary::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x240 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::FunctionDefLibrary>(google::protobuf::io::CodedInputStream*, tensorflow::FunctionDefLibrary*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::GraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x291 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::GraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::GraphDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::MetaGraphDef::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x325 [libtensorflow_framework.so.2]
		bool google::protobuf::internal::WireFormatLite::ReadMessage<tensorflow::MetaGraphDef>(google::protobuf::io::CodedInputStream*, tensorflow::MetaGraphDef*)+0x64 [libtensorflow_framework.so.2]
		tensorflow::SavedModel::MergePartialFromCodedStream(google::protobuf::io::CodedInputStream*)+0x25b [libtensorflow_framework.so.2]
		google::protobuf::MessageLite::MergeFromCodedStream(google::protobuf::io::CodedInputStream*)+0x32 [libtensorflow_framework.so.2]
		google::protobuf::MessageLite::ParseFromCodedStream(google::protobuf::io::CodedInputStream*)+0x3e [libtensorflow_framework.so.2]
		tensorflow::ReadBinaryProto(tensorflow::Env*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::MessageLite*)+0x141 [libtensorflow_framework.so.2]
		tensorflow::(anonymous namespace)::ReadSavedModel(absl::lts_20211102::string_view, tensorflow::SavedModel*)+0x136 [libtensorflow_framework.so.2]
		tensorflow::ReadMetaGraphDefFromSavedModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::MetaGraphDef*)+0x5d [libtensorflow_framework.so.2]
		tensorflow::LoadSavedModelInternal(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0x41 [libtensorflow_framework.so.2]
		tensorflow::LoadSavedModel(tensorflow::SessionOptions const&, tensorflow::RunOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, tensorflow::SavedModelBundle*)+0xc0 [libtensorflow_framework.so.2]
		TF_LoadSessionFromSavedModel+0x2a8 [libtensorflow.so]
		_cgo_6ae2e7a71f9a_Cfunc_TF_LoadSessionFromSavedModel+0x6e [testapp]
		runtime.asmcgocall.abi0+0x64 [testapp]
		github.com/galeone/tensorflow/tensorflow/go._Cfunc_TF_LoadSessionFromSavedModel.abi0+0x4d [testapp]
		github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel.func2+0x14f [testapp]
		github.com/galeone/tensorflow/tensorflow/go.LoadSavedModel+0x2b6 [testapp]
		github.com/galeone/tfgo.LoadModel+0x6d [testapp]
		main.reloadModel+0x276 [testapp]
		main.main+0x72 [testapp]
		runtime.main+0x212 [testapp]
		runtime.goexit.abi0+0x1 [testapp]


As you can see, the stack trace shows calls into tfgo and into the underlying TensorFlow library. I'm not sure I'm reading it correctly, but it looks like there is a leak in tfgo or in TensorFlow itself.
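For context, the reload loop in the test app is essentially just this (a minimal sketch; the model path is a placeholder and the sleep interval is arbitrary):

package main

import (
	"runtime"
	"time"

	tf "github.com/galeone/tensorflow/tensorflow/go"
	tg "github.com/galeone/tfgo"
)

// reloadModel loads the SavedModel from disk; every call allocates a new
// graph and session on the C side.
func reloadModel() *tg.Model {
	return tg.LoadModel("path/to/our/model", []string{"serve"}, &tf.SessionOptions{})
}

func main() {
	for {
		model := reloadModel()
		_ = model    // the previously loaded model is now unreachable
		runtime.GC() // encourage collection between iterations
		time.Sleep(time.Second)
	}
}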

Is there a way to explicitly release the memory of a loaded model when we reload it? Or could this be a problem in tfgo?
If you need more information, please let me know.

Thanks in advance :)

@galeone
Owner

galeone commented Oct 7, 2022

Hi @jriegner, thank you for this report!

You're right, there's a leak. And I guess I found where the leak is 😄

LoadModel returns a *tfgo.Model that contains a *tf.SavedModel from the TensorFlow Go bindings. This tf.SavedModel has a *tf.Session field.

When the model is deleted, the session is not automatically closed (it's the old tf.Session from Python, if you are familiar with TensorFlow < 2), and thus I guess the leak comes from the missing invocation of session.Close().

I guess we have 2 options:

  1. Expose the session and let the user invoke the Close() method (I don't like it).
  2. Create a finalizer that automatically invokes session.Close() when the tfgo.Model is garbage collected (see the sketch below).

I kinda like the second approach more.
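A minimal sketch of the finalizer approach, assuming a simplified Model that just wraps the bindings' *tf.SavedModel (the field name and LoadModel body here are illustrative, not tfgo's actual internals):

import (
	"runtime"

	tf "github.com/galeone/tensorflow/tensorflow/go"
)

// Model is a simplified stand-in for tfgo.Model.
type Model struct {
	saved *tf.SavedModel
}

// LoadModel loads the SavedModel and registers a finalizer that closes
// its session when the wrapper is garbage collected.
func LoadModel(path string, tags []string, options *tf.SessionOptions) *Model {
	saved, err := tf.LoadSavedModel(path, tags, options)
	if err != nil {
		panic(err)
	}
	model := &Model{saved: saved}
	runtime.SetFinalizer(model, func(m *Model) {
		// Close releases the native memory held by the C session.
		_ = m.saved.Session.Close()
	})
	return model
}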

Right now I'm travelling, so I don't know when I'll be able to work on it (I'm setting up the development environment right now while I'm on a train, lol). If you want to implement the second option, I'll be more than happy to review and merge it.

@galeone
Owner

galeone commented Oct 7, 2022

Update: I think I fixed it in b536202.

Give it a try and let me know if it works

@jriegner
Author

jriegner commented Oct 10, 2022

Hey, thanks for the quick reply!

I ran my local test app with the updated code, and memleak-bpfcc still complained about leaked memory. I verified that the finalizer ran, and I also tried explicitly closing the session on every reload. Same result.
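For reference, "explicitly closed the session" means going through the underlying bindings, roughly like this (a sketch of one reload iteration; it uses tf.LoadSavedModel directly because that exposes the Session field, and assumes the tf, log, and runtime imports from the snippets above):

// One iteration of the explicit-close reload loop.
bundle, err := tf.LoadSavedModel("path/to/our/model", []string{"serve"}, nil)
if err != nil {
	log.Fatalf("loading model: %v", err)
}
// ... run inference with bundle.Session ...
if err := bundle.Session.Close(); err != nil {
	log.Printf("closing session: %v", err)
}
runtime.GC() // also gives pending finalizers a chance to run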

I will also check our production setup and report back.

Edit:
I monitored our service with the latest version of tfgo for a while and can still see memory increasing on each reload.

@galeone
Owner

galeone commented Oct 16, 2022

If the leak isn't related to the missing session close, I have no other clue 🤔

For sure there was a leak caused by the missing session.Close(), which should now be fixed (I had forgotten to add the same line in tg.ImportModel and fixed that yesterday).

But I have no idea what could cause this issue on the tfgo side; maybe it's a leak in the TensorFlow C library.

@jriegner
Author

Alright, I will update the version on my side and give it a try. Thanks for your support; maybe the last update fixed it 👍🏻

@LoveVsLike

I tried the fix in b536202, but it doesn't work.

@galeone
Owner

galeone commented Dec 20, 2023

Hi @LoveVsLike - I guess the problem is inside the TensorFlow bindings. As you can see, in tfgo we just open the session and use a finalizer to close it when the model is collected. So the problem is in the TensorFlow code, I guess.

Perhaps you can open an issue on https://github.com/tensorflow/tensorflow and link this thread there. Maybe someone from the TensorFlow team can help us.

Anyway, since tfgo is still using TensorFlow 2.9.1, I can try to update my fork to 2.14. Maybe the leak has already been fixed and we don't know it (but I'm not confident, since this leak has persisted version after version for years...).

I'll update the fork and I'll let you know.

@LoveVsLike

Hi, I updated TF and tfgo to the latest versions, but it still doesn't work. Can you open an issue on the TensorFlow repo?
