Poor performance of malloc #185

Open
ryandesign opened this issue Sep 28, 2022 · 3 comments

Comments

@ryandesign
Contributor

In my project I use a third-party C++ library. I create an instance of its main class for each document I open. It in turn creates a bunch of other objects that it manages. (It parses my document into an internal representation.) I noticed that each subsequent document I open takes longer, even if the documents are the same. I believe the problem is that objects are ultimately created using malloc, and libretro implements malloc using NewPtr, which appears to have poor performance.

I created a test program which demonstrates the poor performance of NewPtr. This program defines a class. When an object of that class is constructed it creates 150 new pointers, and they're disposed of when the object is destructed. The main function times how long it takes to allocate each of 10 objects of this class; the results show that each object takes more time to allocate than the previous one. (There are additional tests in the program which I've commented out. For reasons I don't understand, when the program writes to a file, the performance of those extra tests is much worse than when it writes to the console.)
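For readers who want to reproduce the shape of this measurement, here is a minimal sketch in C rather than C++. It is not the test program from this issue: it uses `malloc` and `clock()` as portable stand-ins (on classic Mac OS one would time with `TickCount()` instead), and the names `Batch`, `batch_create`, `batch_destroy`, `time_batches`, and the 64-byte block size are all assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLOCKS_PER_BATCH 150  /* 150 pointers per object, as described above */
#define BLOCK_SIZE 64         /* assumed; the original block size is not stated */

typedef struct {
    void *blocks[BLOCKS_PER_BATCH];
} Batch;

/* Allocate one batch of blocks; returns NULL if any allocation fails. */
Batch *batch_create(void)
{
    Batch *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    for (int i = 0; i < BLOCKS_PER_BATCH; i++) {
        b->blocks[i] = malloc(BLOCK_SIZE);
        if (b->blocks[i] == NULL) {
            while (i-- > 0) free(b->blocks[i]);  /* undo partial work */
            free(b);
            return NULL;
        }
    }
    return b;
}

/* Release every block in the batch, then the batch itself. */
void batch_destroy(Batch *b)
{
    if (b == NULL) return;
    for (int i = 0; i < BLOCKS_PER_BATCH; i++) free(b->blocks[i]);
    free(b);
}

/* Create `count` batches, keeping all of them alive so the heap keeps
   growing, and print how long each one took. Returns the number created. */
int time_batches(int count)
{
    if (count <= 0) return 0;
    Batch **batches = calloc((size_t)count, sizeof *batches);
    if (batches == NULL) return 0;

    int created = 0;
    for (int i = 0; i < count; i++) {
        clock_t start = clock();
        batches[i] = batch_create();
        clock_t elapsed = clock() - start;
        if (batches[i] == NULL) break;
        created++;
        printf("batch %d: %ld clock ticks\n", i + 1, (long)elapsed);
    }

    for (int i = 0; i < created; i++) batch_destroy(batches[i]);
    free(batches);
    return created;
}
```

Calling `time_batches(10)` mirrors the ten objects described above. With an allocator whose cost grows with the number of live blocks, the printed times would be expected to rise from one batch to the next.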

I found a mention in old Mozilla documentation that they have several memory allocator implementations, one of which uses NewPtr which can aid in debugging but is slow.

Is there a possibility to improve the situation by having libretro use a different implementation of malloc, perhaps using jemalloc, mimalloc, or tcmalloc?

@autc04
Owner

autc04 commented Oct 2, 2022

First, it should be possible to link in a different malloc implementation on top of libretro. If you define your own malloc, that should replace the libretro one.
Also note that the newlib library used by Retro68 actually includes dlmalloc, which is a very good allocator for single-threaded situations. It's just not used because libretro overrides malloc rather than providing the underlying system calls for dlmalloc.

What system version did you measure on? IIRC Apple rewrote their Memory Manager implementation once or twice ("Modern Memory Manager").

But yes, NewPtr was never exactly fast.


Directly using a modern allocator for classic Mac OS might not be ideal because the modern world has different tradeoffs. Modern allocators tend to optimize for good multi-threading performance. And if a modern allocator grabs a few megabytes of virtual address space from the OS upon initialization, an old Mac might object to that :-). Parameters will need to be tweaked.

Also, one needs to be careful about using two allocators at the same time (imagine NewPtr/NewHandle/some random Toolbox function failing because all the memory is taken by an empty heap managed by a custom malloc).


I'd start by sticking your own copy of dlmalloc into your program and seeing if you can set the right parameters. dlmalloc is just a single source file with lots of #defines at the beginning which you need to customize to tell it how you want to use it (in particular, you have to teach it to call NewPtr instead of mmap, sbrk or VirtualAlloc to get memory from the OS).
If you have good results with that, we can then either move that to a library that can be linked into any Retro68 program, make libretro use it by default, or figure out how to make libretro use the version included with newlib.
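As a rough illustration of what that customization might look like, the fragment below sets a handful of dlmalloc's compile-time options before the body of dlmalloc.c (they could equally be passed as `-D` flags). The option names are dlmalloc's own, but the particular values are assumptions for a single-threaded classic Mac OS program, not settings confirmed anywhere in this thread, and the mmap path would still have to be redirected to NewPtr separately.

```c
/* Assumed settings for a single-threaded classic Mac OS build of dlmalloc. */
#define USE_DL_PREFIX 1       /* export dlmalloc/dlfree/... rather than malloc/free */
#define HAVE_MORECORE 0       /* there is no sbrk to grow a contiguous heap */
#define HAVE_MMAP 1           /* get memory through the mmap path instead */
#define HAVE_MREMAP 0         /* no mremap equivalent is provided */
#define MMAP_CLEARS 0         /* do not assume freshly obtained memory is zeroed */
#define USE_LOCKS 0           /* single-threaded, so no locking is needed */
#define LACKS_SYS_MMAN_H 1    /* no <sys/mman.h> on this platform */
#define malloc_getpagesize 4096               /* assumed page size */
#define DEFAULT_GRANULARITY (64 * 1024)       /* assumed size of each request to the OS */
```

Keeping `DEFAULT_GRANULARITY` small matters here because, as noted above, memory claimed by a custom heap is no longer available to NewPtr, NewHandle, or the Toolbox.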

@ryandesign
Contributor Author

Thanks! I'll try dlmalloc. My tests were with System 7.1.

@ryandesign
Contributor Author

First, here are some things I tried that didn't work. Below the break, what did work. TL;DR: dlmalloc is hundreds of times faster for me.

I tried adding dlmalloc.c to my tester program but it failed at link time because malloc etc. were already defined in libretro:

.../bin/ld.real: .../lib/libretrocrt.a(malloc.c.obj): in function `malloc':
malloc.c:(.text.malloc+0x0): multiple definition of `malloc'; CMakeFiles/app.dir/dlmalloc.c.obj:.../dlmalloc.c:4543: first defined here
.../bin/ld.real: .../lib/libretrocrt.a(malloc.c.obj): in function `free':
malloc.c:(.text.free+0x0): multiple definition of `free'; CMakeFiles/app.dir/dlmalloc.c.obj:.../dlmalloc.c:4681: first defined here
.../bin/ld.real: .../lib/libretrocrt.a(malloc.c.obj): in function `realloc':
malloc.c:(.text.realloc+0x0): multiple definition of `realloc'; CMakeFiles/app.dir/dlmalloc.c.obj:.../dlmalloc.c:5184: first defined here
.../bin/ld.real: .../lib/libretrocrt.a(malloc.c.obj): in function `calloc':
malloc.c:(.text.calloc+0x0): multiple definition of `calloc'; CMakeFiles/app.dir/dlmalloc.c.obj:.../dlmalloc.c:4790: first defined here
.../bin/ld.real: .../lib/libretrocrt.a(malloc.c.obj): in function `memalign':
malloc.c:(.text.memalign+0x0): multiple definition of `memalign'; CMakeFiles/app.dir/dlmalloc.c.obj:.../dlmalloc.c:5260: first defined here
collect2: error: ld returned 1 exit status

I tried recompiling libretro without its malloc functions by removing malloc.c from its CMakeLists.txt but this failed because other parts of libretro need malloc etc.:

.../bin/ld.real: .../lib/libc.a(lib_a-__atexit.o): in function `__register_exitproc':
.../gcc/newlib/libc/stdlib/__atexit.c:98: undefined reference to `malloc'
.../bin/ld.real: .../lib/libc.a(lib_a-__call_atexit.o): in function `__call_exitprocs':
.../gcc/newlib/libc/stdlib/__call_atexit.c:137: undefined reference to `free'
.../bin/ld.real: .../lib/libstdc++.a(del_op.o): in function `operator delete(void*)':
.../gcc/libstdc++-v3/libsupc++/del_op.cc:49: undefined reference to `free'
.../bin/ld.real: .../lib/libstdc++.a(new_op.o): in function `operator new(unsigned long)':
.../gcc/libstdc++-v3/libsupc++/new_op.cc:47: undefined reference to `malloc'
.../bin/ld.real: .../lib/libstdc++.a(eh_alloc.o): in function `__cxa_allocate_exception':
.../gcc/libstdc++-v3/libsupc++/eh_alloc.cc:284: undefined reference to `malloc'
.../bin/ld.real: .../lib/libstdc++.a(eh_alloc.o): in function `__cxa_free_exception':
.../gcc/libstdc++-v3/libsupc++/eh_alloc.cc:305: undefined reference to `free'
.../bin/ld.real: .../lib/libstdc++.a(eh_alloc.o): in function `_GLOBAL__sub_I__ZN9__gnu_cxx9__freeresEv':
.../gcc/libstdc++-v3/libsupc++/eh_alloc.cc:123: undefined reference to `malloc'
collect2: error: ld returned 1 exit status

I tried replacing malloc.c with dlmalloc.c in libretro's CMakeLists.txt but it failed because other parts of libretro need _malloc_r etc. which dlmalloc.c doesn't define:

.../bin/ld.real: .../lib/libc.a(lib_a-wsetup.o): in function `__swsetup_r':
.../gcc/newlib/libc/stdio/wsetup.c:56: undefined reference to `_free_r'
.../bin/ld.real: .../lib/libc.a(lib_a-fflush.o): in function `__sflush_r':
.../gcc/newlib/libc/stdio/fflush.c:197: undefined reference to `_free_r'
.../bin/ld.real: .../lib/libc.a(lib_a-fvwrite.o): in function `__sfvwrite_r':
.../gcc/newlib/libc/stdio/fvwrite.c:145: undefined reference to `_malloc_r'
.../bin/ld.real: .../gcc/newlib/libc/stdio/fvwrite.c:156: undefined reference to `_realloc_r'
.../bin/ld.real: .../gcc/newlib/libc/stdio/fvwrite.c:162: undefined reference to `_free_r'
.../bin/ld.real: .../lib/libc.a(lib_a-makebuf.o): in function `__smakebuf_r':
.../gcc/newlib/libc/stdio/makebuf.c:53: undefined reference to `_malloc_r'
.../bin/ld.real: .../lib/libc.a(lib_a-mprec.o): in function `_Balloc':
.../gcc/newlib/libc/stdlib/mprec.c:106: undefined reference to `_calloc_r'
.../bin/ld.real: .../gcc/newlib/libc/stdlib/mprec.c:123: undefined reference to `_calloc_r'
.../bin/ld.real: .../lib/libc.a(lib_a-fclose.o): in function `_fclose_r':
.../gcc/newlib/libc/stdio/fclose.c:101: undefined reference to `_free_r'
.../bin/ld.real: .../gcc/newlib/libc/stdio/fclose.c:103: undefined reference to `_free_r'
.../bin/ld.real: .../gcc/newlib/libc/stdio/fclose.c:99: undefined reference to `_free_r'
collect2: error: ld returned 1 exit status

I tried following the guidance in gcc/newlib/libc/include/reent.h regarding whether or not to specify -DREENTRANT_SYSCALLS_PROVIDED, -DMISSING_SYSCALL_NAMES, and syscall_dir (in the m68k-apple-macos and powerpc-*-macos cases in gcc/newlib/configure.host) but I was not able to find a combination that didn't lead to more of this type of error.


Finally, wanting to get something working even if it was messy, I returned to standard Retro68 and used the linker's -wrap flag to override the malloc functions at link time with dlmalloc's versions. I used dlmalloc 2.8.6. (Retro68 contains version 2.8.3.) I added this patch to provide mmap/munmap implementations based on NewPtr/DisposePtr.
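For context on the wrapping approach: when GNU ld is given `--wrap=malloc` (for example via `-Wl,--wrap=malloc` on the compiler command line), undefined references to `malloc` are resolved to `__wrap_malloc`, and references to `__real_malloc` are resolved to the original `malloc`. The sketch below shows what a set of forwarding functions could look like. It is not the code used in this issue: `USE_DLMALLOC` is a hypothetical switch, and it assumes dlmalloc was compiled with `USE_DL_PREFIX` so that its entry points are named `dlmalloc`, `dlfree`, and so on. Without the switch the functions fall back to the standard allocator so the sketch compiles on its own.

```c
#include <stddef.h>
#include <stdlib.h>

#ifdef USE_DLMALLOC  /* hypothetical switch: dlmalloc.c built with USE_DL_PREFIX */
void *dlmalloc(size_t size);
void  dlfree(void *ptr);
void *dlcalloc(size_t count, size_t size);
void *dlrealloc(void *ptr, size_t size);
#define BACKEND(name) dl##name
#else                /* stand-in: only valid when the link does NOT use --wrap */
#define BACKEND(name) name
#endif

/* With --wrap=malloc etc., every call to malloc in the program and its
   libraries lands in these functions instead of the library versions. */
void *__wrap_malloc(size_t size)
{
    return BACKEND(malloc)(size);
}

void __wrap_free(void *ptr)
{
    BACKEND(free)(ptr);
}

void *__wrap_calloc(size_t count, size_t size)
{
    return BACKEND(calloc)(count, size);
}

void *__wrap_realloc(void *ptr, size_t size)
{
    return BACKEND(realloc)(ptr, size);
}
```

Each wrapped symbol needs its own `--wrap=` option at link time, and the set has to be complete: a block obtained from one allocator must never be passed to the other allocator's `free` or `realloc`. `memalign` could be forwarded the same way.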

With dlmalloc, my timing tester program from earlier showed 0 or 1 tick for each test. Trying something a bit more demanding, I added dlmalloc to the app I'm developing using a third-party C++11 library. With dlmalloc, the time taken to have that library parse a test document (in Mini vMac 37.03, Macintosh II emulation, 32⨉ speed (512 MHz), autoslow off, background on, System 7.1) is now 40 ticks (⅔ second) no matter how many documents I open simultaneously. With Retro68's default malloc implementation, the same thing takes 3,575 ticks (1 minute) for the first document, 12,751 ticks (3.54 minutes) for the second document, 22,071 ticks (6.13 minutes) for the third document, and so on.
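The conversions quoted above follow from a tick being roughly 1/60 of a second. A small sketch of that arithmetic, with hypothetical helper names and the tick rate rounded to exactly 60 per second as an assumption:

```c
#define TICKS_PER_SECOND 60.0  /* approximation of the classic Mac OS tick rate */

/* Convert a tick count to seconds. */
double ticks_to_seconds(long ticks)
{
    return (double)ticks / TICKS_PER_SECOND;
}

/* Convert a tick count to minutes. */
double ticks_to_minutes(long ticks)
{
    return ticks_to_seconds(ticks) / 60.0;
}

/* Ratio of a slow measurement to a fast one; returns 0.0 if the fast
   measurement is not positive, to avoid dividing by zero. */
double speedup(long slow_ticks, long fast_ticks)
{
    if (fast_ticks <= 0) return 0.0;
    return (double)slow_ticks / (double)fast_ticks;
}
```

With the figures above, `speedup(3575, 40)` is a little under 90 for the first document and `speedup(22071, 40)` is over 550 for the third, which is where "hundreds of times faster" comes from.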

I told dlmalloc to use a pagesize of 4K which is its default. dlmalloc calls mmap with a size of 64K which seems like a reasonable size. It did occasionally call munmap when I closed documents but not as often as I would like. I'll have to see if I can improve on that. I didn't implement mremap yet because in my programs it never got called. It may be possible to implement it using SetPtrSize.
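To make the mmap/munmap replacement idea concrete, here is a minimal sketch of an mmap-like pair of functions built on a pointer allocator. It is not the patch referred to in this thread. The names `region_map`, `region_unmap`, `REGION_FAILED`, and `MAX_REGIONS`, and the `USE_TOOLBOX` switch, are all hypothetical. When `USE_TOOLBOX` is not defined, `malloc`/`free` stand in for NewPtr/DisposePtr so the code builds anywhere.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifdef USE_TOOLBOX  /* hypothetical switch: define when targeting classic Mac OS */
#include <Memory.h>
#define REGION_ALLOC(n) ((void *)NewPtr((Size)(n)))
#define REGION_FREE(p)  DisposePtr((Ptr)(p))
#else               /* portable stand-in for NewPtr/DisposePtr */
#define REGION_ALLOC(n) malloc(n)
#define REGION_FREE(p)  free(p)
#endif

#define REGION_FAILED ((void *)-1)  /* same value POSIX uses for MAP_FAILED */
#define MAX_REGIONS 64              /* arbitrary limit chosen for this sketch */

/* Table of live regions, so that unmap requests can be validated. */
static struct { void *base; size_t size; } regions[MAX_REGIONS];

/* mmap-like: returns a zero-filled block of `size` bytes or REGION_FAILED. */
void *region_map(size_t size)
{
    if (size == 0) return REGION_FAILED;
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (regions[i].base == NULL) {
            void *p = REGION_ALLOC(size);
            if (p == NULL) return REGION_FAILED;
            memset(p, 0, size);  /* anonymous mmap memory is zeroed, so match that */
            regions[i].base = p;
            regions[i].size = size;
            return p;
        }
    }
    return REGION_FAILED;  /* region table is full */
}

/* munmap-like: returns 0 on success, -1 on failure. Only a whole region
   can be released, because a pointer block cannot be freed in part. */
int region_unmap(void *addr, size_t size)
{
    if (addr == NULL) return -1;
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (regions[i].base == addr) {
            if (regions[i].size != size) return -1;  /* partial unmap refused */
            REGION_FREE(addr);
            regions[i].base = NULL;
            regions[i].size = 0;
            return 0;
        }
    }
    return -1;  /* not the start of a known region */
}
```

Refusing partial unmaps is one possible reason an allocator layered on such a shim would return memory less often than hoped: any request to trim only the tail of a region fails, so the region stays allocated until it can be released whole.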
