full use of jemalloc with NUMA support and rework of thread spawning #2314

chassell · 2017-07-27T22:35:14Z

Lots in here but no memory rework is useful without all the parts: NUMA-binds, hugepages, jemalloc arenas.

--with-jemalloc and --without-jemalloc should both work (tests fine for both)
push -ljemalloc to the end of the LIBS variable so all binaries use it
use hooks for hugepage-size maps made from normal pages, ready for hugepaged to convert
create all nodeset-to-affinity mappings before main(), mapped to later jemalloc arenas
change spawning threads to allocate stack and EThread from node-near memory arenas
#define over all uses of Allocator/ClassAllocator/ThreadAllocator to depend on JM's thread cache
put a barrier on thread spawns until all in a group are done, before binding cpuset
I am sure this kitchen sink I found could be squeezed in given more time ...

This is a big change but mostly has new code for 3, for 4 with some C++-friendly code for jemalloc tuning. Many could probably use its profiling capabilities as well.

[p.s. Yes .. with --without-jemalloc the Allocator code runs as usual]

zwoop · 2017-07-28T02:16:35Z

[approve ci]

chassell · 2017-07-28T04:41:58Z

I used that to call the "constructor" nullptr_t ... in case we used ::mallocx(). Maybe it's too much a stretch. It can just be nullptr.
^
../../lib/ts/ink_memory.h:61:25: note: expanded from macro 'mallocx'
#define mallocx(...) nullptr_t{}
^

chassell · 2017-07-28T05:29:27Z

uploaded a quick (late at night) fix to remove two problems it seems. clang-format tomorrow.

PSUdaemon · 2017-07-28T13:53:54Z

[approve ci]

jpeach · 2017-07-28T15:36:19Z

@chassell Please break this up into small, independent commits. Looking briefly at the diff, I'd expect at least 10 separate commits for this change.

PSUdaemon

Overall this is a very large PR that touches a lot of stuff. It seems like it could be several PRs, a few of which should be several commits. I've left some specific comments, but I think this main issue needs to be addressed.

PSUdaemon · 2017-07-28T14:13:56Z

lib/tsconfig/TsConfigSyntax.c

-#define YY_FLEX_MINOR_VERSION 6
-#define YY_FLEX_SUBMINOR_VERSION 0
+#define YY_FLEX_MINOR_VERSION 5
+#define YY_FLEX_SUBMINOR_VERSION 37


This is committed generated code. You shouldn't be modifying this.

i wondered WTH that's checked in for. In normal practice we don't check in *.o or Makefile.

Accidentally in there.

PSUdaemon · 2017-07-28T14:20:33Z

proxy/http/HttpSessionManager.cc

 {
  m_g_pool = new ServerSessionPool;
-  eventProcessor.schedule_spawn(&initialize_thread_for_http_sessions, ET_NET);
+  eventProcessor.schedule_spawn([](EThread *thread){ thread->server_session_pool = new ServerSessionPool; }, ET_NET);


I'm not opposed to code changes like this, but this PR is already huge and this seems completely unrelated.

Yeh that can get pulled out. I think it was more a "nice thing" that is unlike the others .. and a proof of concept that it worked fine.

It's not critical to any functionality.

PSUdaemon · 2017-07-28T14:21:30Z

lib/ts/test_X509HostnameValidator.cc

+#include <openssl/pem.h>
+#include <openssl/x509.h>
+#include <openssl/ssl.h>
+#include <openssl/bio.h>


Is this really a change you made or do you need to rebase?

didn't make that at all... and I was rebasing every time. Must have crept in during a cherry pick I did back and forth.

I may have fiddled with those because I had another include ... and I'm in the habit of putting the "system-wide" or "more-global" includes after the specialized and local ones.

It always teaches me that I didn't put correct global includes that the includes themselves need (and the include file can accidentally be okay because someone upstream included its deps).

However ... that file in specific needed very little change... so I think I can pull that back too. No big.

PSUdaemon · 2017-07-28T14:23:08Z

lib/ts/ink_thread.h

 }

+static inline void
+ink_get_thread_name(char *name, size_t len)


Maybe it's my browser or the size of this diff, but I don't see where this function is used.

Was used .. now is not. It's a nice idea, but ... ennhh.

PSUdaemon · 2017-07-28T14:54:00Z

lib/ts/ink_align.h

 #define INK_MIN_ALIGN 8
 /* INK_ALIGN() is only to be used to align on a power of 2 boundary */
-#define INK_ALIGN(size, boundary) (((size) + ((boundary)-1)) & ~((boundary)-1))
+#define INK_ALIGN(size, boundary)  aligned_spacing(size,boundary)


Is this different behavior or you are just trying to force type safety with templates?

If so, I think this should be a separate PR.

As for what I was doing... I was making really sure that I didn't cut-and-paste a calc of alignment improperly. It's in so many places in the old code ... cut and pasted cut and pasted cut and pasted.

Nothing typesafe about it, to be sure, but it is like a number without using a define or an enum to clarify it. It's not essential .. but I left that in from earlier changes ... when I made aligned_spacing().

It doesn't have to be in there.. but it is nicer.

Thign is .. a lot of the "let's clean this up" ... seems unlike something that'd be approved from me, eh? I hope to see things cleaned up... ever. If they are dead-code or if they're just a far-flung cut-and-paste.

What's the best way to get cleanups in ... when they're nagging and simple or negligibly dangerous at all. Or is that precisely NOT a good PR?

This doesn't buy type safety. It's an identical result.

Do we want raw numbers or values or expressions cut-and-pasted everywhere?

Okay ... not everywhere. Whatever the case is ... I may put in another PR to lower the number of cut-pastes that have apparently been around a long time for alignments.

PSUdaemon · 2017-07-28T15:13:03Z

lib/ts/Arena.h


-#ifndef __ARENA_H__
-#define __ARENA_H__
+#pragma once


I'm on board with this if all our supported platforms can use it (in the past they couldn't) but I think this needs a separate PR where it's done for every header file for consistency.

That policy isn't what I heard. What I heard is "as you touch it, upgrade to current policy / practice but otherwise leave it alone."

That's fair. We do want to make the code better. I usually think of that as fixing a loop structure because it's poor when I am just changing some behavior inside of it. This seems more like something we should decide to do as a coding standard and make the changes to all header files as they should be consistent.

I saw it discussed recently and the C++11 (for 8.0.0) seems a natural place to change it over.

I don't want to die-on-that-hill btw... but I am wondering what is good cleanup (i.e. a loop structure because the inside needs changing) and what is out-of-bounds busybody cleanup.

I am fine with the #pragma one

It is supported by all compiler I understand we support.
any gcc, icc and clang compiler supports this feature. the versions we care about it ( ie ones that support c++11) support this feature. I already had a PR go in with this same change

-1 on pragma in this PR. If we do pragma, raise it on dev@, get consensus and apply it to the whole codebase at once. This is not the right PR for this change.

PSUdaemon · 2017-07-28T15:14:22Z

lib/ts/Arena.h

 }

-#endif /* __ARENA_H__ */
+struct JemallocArena : public std::allocator<uint64_t>


I was under the impression that ATS Arenas and jemalloc Arenas were two different concepts. Is this not the case?

The "Arena" is a string alloc system ... so this is just a prefix on it. The main reason it's here is to prevent caching/over-allocation from the intermediate system.

I don't follow.

JemallocArena is a bad name, agreed, as Jemalloc + Arena means something very precise.

That's because Arena.cc is kinda weird name for a string-storage system .. but that's all it's used for. In order to gain the benefit of Jemalloc and Thread-cache stuff (and to test it also) ... we need to pull out code that pre-allocates and partitions hoarded memory.

Mostly because we cannot track leaks / failures / profiling at all when memory is grabbed and returned to someone's internal cache system.

PSUdaemon · 2017-07-28T15:18:43Z

iocore/eventsystem/UnixEventProcessor.cc

 /// Global singleton.
 class EventProcessor eventProcessor;

-class ThreadAffinityInitializer : public Continuation


Is removing all this correct? This looks like the stuff that @SolidWallOfCode added recently.

he did. but he was frustrated at how huge it was, as I read his comments. It does nearly nothing. All it does is go through a sequence of added callbacks.

No data. No benefit. 15 times bigger ... and really unrelated to anything that EventProcessor does.

He says he did it to help you out.

[EDIT: okay ... maybe only 8 times bigger... and that's LOC that may change after clang-format]

As it is ... BTW .. the main purpose I wrote what I did is (1) to prevent allocating EThread inside the EventProcessor code and (2) NUMA allocation of EThread needs to be done correctly and (3) race conditions abound early on in thread groups.

Huge and crazy Factory/Object creation (including the antipattern "Cancer-Class") is never something to shove way deep down. RAII says that the creation of an object, just like invoking a C module ... should be done with all the customization done way early on.

I can see how EThread has been way way way overburdened because it's so stuck there. It's far easier to let it get customized (and it already has been elsewhere) to a lighter object type and delegate more out to derived types that are specialized to threads. If a C module can't init something ... it needs all the setup done at once with all the attributes, like pthreads.

That means new-ing it near the most-customized-point. MORE relevantly (for Jemalloc) in the correct Arena and on the correct NUMA nodes.

So I gave a try at shrinking it down using lambdas. Worked great. It's able to be seperated out, though.

It's able to be seperated out, though.

That sounds great.

I don't need an opinion now (with bigger fish to fry) ... but I still don't know what you think.

Chris: <blah blah blah... I put it all this a big convenient garbage bag like you asked!>

Phil: That sounds great ....

:->. I'll leave it till a later discussion then.

PSUdaemon · 2017-07-28T15:25:20Z

lib/ts/StdAllocWrapper.cc

+    if ( advice == MADV_DONTDUMP ) {
+      _arena = jemallctl::proc_arena_nodump;
+    } else if ( advice == MADV_NORMAL ) {
+      _arena = jemallctl::proc_arena;


Why hardwire these?

Because the current code is hardwired too.

The code I see now passes an advice parameter all the way down. There aren't nodump variants of the functions.

We have several properties on malloc'ed normal pages:

NUMA node binding origin (if more than one exists)

being replaceable within a hugepage (by alignment and equal flags)

having been replaced with a hugepage by hugepaged (keeping equal flags)

NODUMP flag on for single pages or for hugepage-aligned group or for single hugepage

mmap() creation with MAP_GROWSDOWN flag for stacks that can increase as one unit

Essentially arenas can flag groups of pages in case 1, case 2+3 together (doesn't matter), case 4 ... and we can leave case 5 as raw mmap of reasonable alignment. It may even make sense for case 5 to be done in smaller pages instead, as many probably don't use their full stack and could stay small automatically.

So arenas exist in advance for NUMA-interleaved memory, for NUMA-isolated memory, for NODUMP pages only [which always stay as NODUMP]. Each defines a set of pages that don't get shared ... though they can be "replaced" with clean pages (and therefore freed from alloc in the kernel ... which of course can use or release physical pages as desired).

The proc_arena is "process arena" and it is the global NUMA-interleaved memory. The proc_arena_nodump is the same as "process arena" except all its pages (hugepages hopefully) get NODUMP flagged in their use.

Each flagging / reflagging is an annoying system call and, again, if we leave caching to Jemalloc then the set of pages are already prepared for use by the IOCache.

Anyway ... to answer the implied question: the arena was created and its jemalloc hooks are different: all its pages are created and then madvise() flagged. Each arena can have different hooks and there are two varieties. With and without NODUMP.

PSUdaemon · 2017-07-28T15:47:56Z

lib/ts/hugepages.h

+bool huge_decommit(void *chunk, size_t size, size_t offset, size_t length, unsigned arena_ind);
+bool huge_purge(void *chunk, size_t size, size_t offset, size_t length, unsigned arena_ind);
+bool huge_split(void *chunk, size_t size, size_t size_a, size_t size_b, bool committed, unsigned arena_ind);
+bool huge_merge(void *chunk_a, size_t size_a, void *chunk_b, size_t size_b, bool committed, unsigned arena_ind);


I'd like to understand better why all the hooks have to be implemented. I've seen examples where just the alloc and dalloc are implemented.

This is over-allocating the pages so that they are (1) aligned on a hugepage boundary and (2) oversized in allocation so that we don't need to commit/decommit individual small-pages.

This is the way we get zero-latency hugepages. It works nicely. :->

I don't follow.

Creating a block of 2Meg (512 pages) ... is instant and as fast as mmap can do it. It never will fail unless memory is gone. There is no wait and there is no muss and no fuss and it is transparent. In fact ... jemalloc first demands an mmap() with no pages actually attached .. and then calls the "commit" calls to attach usable pages to the region laid out in a process.

Why do we want hugepages, then? Not because we demand them (for some hardware). We need memory simplified / copied / rearranged down so that contiguous pages are put directly into a single hugepage ... over time. Best of all would be simply for it to happen in the background.

That's what hugepaged does.

Because of that ... everything in these hooks is aligned and allocated in hugepage-sized groups (512 aligned correctly). This system then lets jemalloc still demand changes to segments and offsets within the page. That means commit/decommit are often no-ops that simply return with no effect: a single hugepage is already present (or may be) and de-committing part of it is somewhat silly.

Currently btw... all these calls are basically stateless. They understand one thing: they get a nullptr (huge_alloc) or they get something that huge_alloc returned to work with. Since I don't allow huge_split() to succeed unless distinct hugepages are created ... they may stay in a big single bloc forever, or allocated one 512-page segment at a time.

De-allocating an aligned set of hugepages ... is performed as needed. However arenas are rarely shut down entirely ... to that degree. A group of 512 pages together will still need some page or another left behind.

And yes ... if we ever demand "this must be a hugepage" ... we are always always going to have to have them assembled for us and will have to wait.

Since we're not in the hardware business, we don't need that kind of latency.

There is a different way to do this with the system, however last I check in windows and Linux this required admin/root access. What is being stated here is that this allows a way for use to get a block of memory ( remember we have 64-bit address space which no hardware can use fully) ideally larger than we need. When used for items like arrays, we can simply grow them by having the system move the virtual size of the memory to commit. So a array of 1024 items can be grown to 2048 as we just say the memory we are using it now 2048 items. given the "virtual address" space is free we can do this. The OS maps the virtual address to random hardware for us on the back end, with no extra cost. However I don't think the cost in jemalloc is so low, as it has to manage chucks etc to be handled out. What i don't get is why we are handling these chunks and not jemalloc.

Check on some of that ... I am checking more closely and there may be better support for hugepages than I'd thought in jemalloc. Not certain if it is latency-inducing or not.

As it is ... it's not in the (new) first commit and I'll leave it out of this PR until jemalloc shows it cannot do it.

chassell · 2017-07-28T18:36:22Z

Eek.. gotta get those tests passing.

My current plan is looking like this .. and I'd like some thoughts on it @PSUdaemon @SolidWallOfCode @jpeach

(1) jemalloc library use and Allocator/ClassAllocator/ThreadAllocator-replace as an experimental commit. A MADV_NODUMP arena for I/O and that's all to add.

No NUMA-node limits using arenas. Same as before ... and kinda broken, but its "experimental". In theory ... the thread cache will handle (and be pre-loaded) with all the same quality of freelist entries as the freelist system did. Probably far more compatible with tools and Jemalloc features.

(2) the Hwloc / NUMA affinity code put in so we have a quick call to "assign by affinity" for memory and then for cpuset ... maybe removing less of UnixEventProcessor's code. It will automatically use Jemalloc Arenas.

No hugepage-replacement by hugepaged .. and EThread is allocated from distant NUMA nodes.

(3) Hugepage-automatic hooks for Jemalloc. Clear and simple for a single commit so all arenas attempt to allocate (if enabled) correct size and alignment, automatically.

Not much downside to this. I have to be conservative in applying madvise() because contradictory flags cannot be put on a single hugepage.

(4) Improved EThread allocation (within correct node and arena) and improve startup of cpuset binding so that no race conditions exist near the start of a thread group. (use a barrier to release them all at once).

One other useful possibility is to improve the cache-MADV_NODUMP arena so that it uses HWLOC_..._REPLICATE and can put read-only data on near NUMA nodes.

(5) Other and wholly unrelated things: [A] a shim to change arenas in and out of plugin code, [B] recording the set of Continuation callbacks as set/called and [C] performing profiling within plugins or within individual callbacks.

All that is a dependent DAG ... but not necessarily monolithic.

alficles · 2017-07-28T19:19:01Z

@jpeach You say ~10 commits, but I'm not quite as certain. Are you picturing each commit being self-contained, operating properly and independent of the other commits? There are definitely some things in here that are cleanup that could be moved entirely to a separate PR, but most of the rest is fairly interdependent.

chassell · 2017-07-28T20:55:09Z

@jpeach @alficles I looked more at the list and talked with @jrushford.

He thinks I could split out maybe an "experimental-and-non-default" jemalloc first of all, get it approved, ... and then add on more independent work for NUMA-init changes etc... above.

only exception may be the thread-spawn code rework that needs both NUMA and jemalloc changes.

dragon512 · 2017-07-28T22:15:15Z

lib/ts/Arena.h

+  void str_free(char *str) { free(str,0); }
+
+  // does it the long hard way
+  static size_t str_length(const char *str) 


I am unclear on why we have all these new functions.

Adding a memory allocator is more about replacing new/malloc and delete/free backends. For example, this function does not seem to me that it should exist. strlen is a compiler intrinsic

alficles

I've gotten through only part of this. It's pretty big and could definitely be smaller. I'd see what we can whittle out of this that isn't necessary. The stuff that comes out can definitely go into separate PRs, where it's easier to digest.

alficles · 2017-07-28T19:45:33Z

lib/ts/jemallctl.h

+   const objpath_t _oid;
+};
+
+template <typename T_VALUE, size_t N_DIFF=0>


What is N_DIFF for? I may have missed it, but I don't see it used.

alficles · 2017-07-28T19:49:28Z

lib/ts/jemallctl.h

+
+using objpath_t = std::vector<size_t>;
+
+struct ObjBase {


Reading the name, I initially thought (and worried) that this was a new base for all objects. Maybe something like ObjFxnBase would be clearer, in context with (G|S)etObjFxn?

alficles · 2017-07-28T20:03:27Z

lib/ts/jemallctl.cc

+auto mallctl_get(const objpath_t &oid) -> T_VALUE;
+
+template <typename T_VALUE>
+auto mallctl_set(const objpath_t &oid, const T_VALUE &v) -> int;


I'm going to comment here, but you've used trailing returns unnecessarily, basically everywhere. Am I missing something? Is there a reason to use trailing returns when the return type does not depend on the types of parameters defined with template parameter types?

alficles · 2017-07-28T20:58:25Z

lib/ts/ink_align.h

+    void *ptr = static_cast<char*>(nullptr) + len; // pointer from zero
+    // next size >= len to get next a new aligned block
+    return static_cast<const char*>( ats_align(block, 0, ptr, block) ) 
+                   - static_cast<const char*>(nullptr);


I'm a fan of type-safe things, but I'm pretty sure we're not gaining any type-safety here, and I think this function could be drastically simplified into a single function call. As it is, it takes far too many brain-cycles to figure out that this is just returning the results of one function.

If you can find it ... I'm really wondering wth the C++17 design wanted as the standard "alignment" function. I had it on my machine but not elsewhere ... so std::align() ... really SUCKS.

It's deprecated and it's difficult and looks like it was designed by committee ... and I replaced it with "our own version" ... and it even now makes things tougher than they should be.

Cut and paste is not the way to make things clearer, though... because one messup becomes invisible. Every bit-system-AND-and-NOT is far worse than doing it well somewhere.

Though I have had it with the std::align() interface.

BTW .. type safety is ... not directly what I'd call it. It is more "obscurity removal" (i.e. enums versus raw numbers and cut-and-paste exprs) and "removing-poorly-handled-#define-cases".

But ... it is not the core issue now. At least I trust this more as it compiles with full safety, but something has to be better.

alficles · 2017-07-28T22:10:41Z

lib/ts/ink_align.h

+// workalike to std::align() ... though it's not a great API
+//
+template <typename T_OBJ>
+inline T_OBJ *ats_align(std::size_t alignment, std::size_t size, T_OBJ *&ptr, std::size_t &space)


Looking at the use-cases, I'm not sure this is a significant improvement over the old INK_ALIGN. It modifies both ptr and space, but nothing intentionally uses these modified values. Separately, I'm not certain it is necessary for jemalloc. Consider sliding this into its own PR, perhaps re-thinking some of the signatures?

And the space argument isn't used for its purpose anywhere either.

alficles · 2017-07-28T22:15:39Z

lib/ts/ink_align.h

-  return (void *)aligned;
+    size_t left = alignment;
+    void *aptr = ats_align(alignment, 0, pointer_, left); 
+    memset( pointer_, '\0', alignment - left ); // zero bytes before new block


I'm pretty sure this doesn't work. pointer_ is changed by ats_align in the return, so it and aptr will always be the same. The previous code seems both clearer and correct in this case.

alficles · 2017-07-28T22:34:27Z

iocore/eventsystem/UnixEventProcessor.cc

-} Thread_Init_Func;
-}
+  auto n = std::find( std::begin(kAffinity_objs), std::end(kAffinity_objs), objtype ) - std::begin(kAffinity_objs);
+  n %= countof(kAffinity_objs); // remap end to index zero


There may be easier ways to accomplish wrapping the end to 0 than re-writing countof.

it's used elsewhere and ... the compiler doesn't care at all in many cases.

alficles · 2017-07-28T22:35:44Z

lib/ts/ink_defs.h

+  static_assert(std::extent<T>::value > 0,  // [0]
+                "zero- or unknown-size array");
+  return std::extent<T>::value;
+}


I don't think I'm convinced that the one new use case justifies re-writing countof. Is there a major payoff I'm missing?

Countof outside of #define is a lot nicer and handles more case correctly, no matter what is put in there. #define can return a "value" in cases where it shouldn't. I just don't trust it.

alficles · 2017-07-28T22:36:37Z

lib/ts/ink_defs.h

+  return std::extent<T>::value;
+}
+
+template<typename T, typename U> constexpr size_t offset_of(U T::*member)


Is this function used?

no... it lived there to be used in another branch for C++-style lists instead of the #define LINK.

This thing has older little stuff to shrink down, to be sure. I will have some minimalization by later Monday.

alficles · 2017-07-28T22:38:22Z

lib/ts/ink_defs.h

+#endif
+#elif __cplusplus
+// some signatures need this
+using hwloc_obj_type_t = int; 


Is this how it's done in the rest of the code? Most of the places I've seen use the typedef form.

Typedef is not deprcated .. but it's much harder to read and it cannot easily work the the "default" situations in a stack of derived classes. It's nice to just "using SubClass::TypeInSubclass_t;" and then ... frankly .. all types should be introduced that way.

dragon512 · 2017-07-28T23:30:33Z

Have you run this with a Vtune profiler. I would be interested on the change in your cache and TLB miss and hit rates.

jpeach · 2017-07-29T22:28:49Z

I can't follow this at all. This PR is some tangled web of cleanups and allocation changes. There's no problem statement of what you are trying to do there, no benchmarks and a single giant commit. Every cleanup here should be a separate commit, possibly a separate PR. Substantial changes should be made one at a time. If you have interdependent changes, restructure them to break the dependency. That is always possible.

IMHO this PR is a total non-starter. No-one should be landing this as it stands.

mingzym · 2017-07-30T11:55:02Z

In this PR, I think I can not get what problem this PR is addressing, or how better the performance may benefit from, I think that is what we want from this PR, right?

SolidWallOfCode · 2017-07-30T23:37:08Z

iocore/eventsystem/I_EventProcessor.h

      @see spawn_event_threads
   */
-  EventType register_event_type(char const *name);
+  EventType register_event_type(char const *name)


Why make this inline? It's not called often, it's cleaner to have it in the .cc file.

SolidWallOfCode · 2017-07-30T23:38:11Z

iocore/eventsystem/I_EventProcessor.h


    @param cont continuation that the spawn thread will call back
      immediately.
    @return event object representing the start of the thread.


The comment doesn't seem consistent with the return type.

I found not a single soul used that return. However the docs then become misleading as they are.

This part will be separated out for other discussion.

SolidWallOfCode · 2017-07-30T23:44:32Z

iocore/eventsystem/I_EventProcessor.h


  */
-  void shutdown() override;
+  virtual void shutdown() override;


No. Methods can have at most one of virtual, override, or final.

Was my way of making a function look the same every time. I can remove it.

SolidWallOfCode · 2017-07-31T01:05:12Z

iocore/eventsystem/UnixEventProcessor.cc

-  ThreadGroupDescriptor *tg = &(thread_group[ev_type]);
+    // perform all inits for EThread 
+    for( auto &&cb : _spawnQueue ) {
+      cb(t); // init calls


How will this work correctly if the callbacks are needed to initialize thread local storage, especially the C++ variety?

Yep. It's quite precisely identical to ....

for( int i = 0 ; i < countof(_spawnQueue) ; ++i )
{
(_spawnQueue[i])(t);
// i.e. use a temporary rvalue-only-ref of the pointer / functor
// and call the void (*f)(EThread *)
}

chassell · 2017-07-31T22:09:53Z

Put in jemalloc in a much more raw form. No use of Arenas excepting to divide up normal and NODUMP memory.

This patch is meant for performance, cleanup, maintainability [fewer caches].

…ee-memory caches

chassell · 2017-07-31T22:42:54Z

New forced push ... fewer include reorderings. :->

alficles · 2017-07-31T22:37:08Z

lib/ts/Allocator.h


 #define RND16(_x) (((_x) + 15) & ~15)

+struct _InkFreeList {


Looks like this was moved from ink_queue.h. Why?

alficles · 2017-07-31T22:40:27Z

lib/ts/Arena.h


 #include <sys/types.h>
 #include <memory.h>
-#include "ts/ink_assert.h"


Do we need to reorder these for a particular reason?

alficles

I've only gotten through about half, so I'll try to get through the rest later. But here are my initial thoughts on round 2. I've re-mentioned stuff still in the queue from the first round, so it stays fresh.

alficles · 2017-07-31T22:50:57Z

lib/ts/Arena.h

+};
+
+#if HAVE_LIBJEMALLOC
+#define Arena Arena_NoCache


This feels odd. You're over-riding the alloc/free functions in a non-virtual class by way of the preprocessor. A person reading the code is unlikely to properly guess which function is actually going to be called. I'd move the #ifdef into each of the alloc and free functions and the body of the class Arena.

As it is, if someone changes the signature of the first alloc (and the places that call it) without changing the jemalloc, not only will it compile, it will mostly work, but with the wrong allocator.

.. nope. C++ blocks all signature overloads from a base class that have the same function name.

Lots and lots and lots of different techniques of specialized memory-hoarding doesn't speed things up, but often makes them vulnerable to unchecked and invisible bad-pointer usage.

There's also no way to profile deletions or allocations ... when they're hidden deep inside a memory-hoarder. The thread-cache in Jemalloc is far more efficient.

Ah, fair enough. It's still odd. I'm not seeing how the claim about memory-hoarding is relevant to this observation thought? I'm claiming that a reader is unlikely to determine correctly which code is actually going to execute by reading the function.

Currently I'm trying to make a non-mainline Jemalloc install (which is the point of the commit) work such that we can estimate it's performance. As it is ... the burden of #ifdef is worse than the burden of two overridden calls.

alficles · 2017-07-31T22:53:15Z

lib/ts/StdAllocWrapper.h

+  {
+    value_type *preCached[chunk_size];
+
+    for (int n = chunk_size; n--;) {


n declared signed with unsigned argument. Probably not a problem in reality, but feels wrong.

should auto it. But signed/unsigned mismatches (and casts) are all over the code. :->

alficles · 2017-07-31T22:54:11Z

lib/ts/StdAllocWrapper.h

+public:
+  using typename std::allocator<T_OBJECT>::value_type;
+
+  ObjAllocator(const char *name, unsigned chunk_size = 128) : _name(name)


Can a more seasoned ATS hacker chime in here? Do we prefer unsigned or unsigned int? I prefer the latter, but if the community is using the former, that's fine.

I think the preference is to be less ambiguous by using unsigned int but it's certainly not consistent throughout the code.

alficles · 2017-07-31T23:06:44Z

lib/ts/StdAllocWrapper.h

+  }
+
+private:
+  const char *_name;


Do I correctly observe that this name is never read? Only written once on initialization? And the caller is expected to maintain the memory allocation, since it's never freed?

[ looking into that ... as it may be old. it may have been pointing to non-malloc'ed static strings? bad as it is not nullptr'ed anyway.]

alficles · 2017-07-31T23:07:38Z

lib/ts/ink_align.h

+aligned_spacing(size_t len, size_t block = INK_MIN_ALIGN)
+{
+  return INK_ALIGN(len, block);
+}


This function is superfluous. In every case added here, INK_ALIGN does exactly the same job without asking the reader to comprehend a new function.

The semantic burden is much worse with INK_ALIGN. It does no aligning of any pointer. The burden is worse also because INK_ALIGN is the odd man out. All the other functions in that header look like the ones below.

My case is that INK_ALIGN does not do what the name is... and even has the "INK_*" prefix that seems rather redundant and old compared with ats_

I'm not sure I understand what you mean by "the semantic burden".

You also say, "It does no aligning of any pointer." aligned_spacing does nothing but call INK_ALIGN, so if INK_ALIGN does no aligning, then aligned_spacing does no aligning either.

INK_ALIGN absolutely rounds a pointer (or whatever value) up to the next multiple of the given power of 2. That sounds very much like something I'd use the verb "align" to describe. I'm not clear what you're claiming here?

It's a delicate question as to "what is more complex to use". Minimalism is just a heuristic that may lead to a simpler interface or may not. In this case .. I find minimalism (deferring to legacy #defines) to be a problem.

Simplicity [which is not always minimal] and consistency is the aesthetic I'm looking for: the name of a function relates clearly to what it does.

As for the verb: alignment is for pointers ... by precise distance from address 0x00000000....

Alignment can be extended to elements from an aligned pointer, by spacing them larger than they require so their pointer matches an alignment similar to the first. That's not alignment of a size ... but of the second pointer.

Nevertheless, I'm not certain we need two functions that do the exact same thing, even if the existing thing's name isn't quite as descriptive as one might prefer.

alficles · 2017-07-31T23:10:11Z

lib/ts/ink_queue.cc

 #include "ts/Diags.h"

+
+// in order to compile correctly


This comment warns against unknown danger. Can you expound upon what "compile correctly" means in this context? Why must we undefine that symbol? Is there no better way?

That's the most minimal number of lines by far. What is more ideal?

I'm not commenting on the number of lines here...

I am confused by this warning comment. It is telling me that we have to undef HAVE_LIBJEMALLOC, but doesn't indicate why, in enough detail for me to know what danger I'm avoiding.

Can you expound upon what "compile correctly" means in this context? Why must we undefine that symbol?

I may be mistaking the line in ink_queue.cc with these. The ink_queue one may be superfluous now.

iocore/eventsystem/ProxyAllocator.cc +29;: :#undef HAVE_LIBJEMALLOC lib/ts/Arena.cc +28;: :#undef HAVE_LIBJEMALLOC

alficles · 2017-07-31T23:10:40Z

lib/ts/ink_queue.h

-void ink_freelists_dump(FILE *f);
-void ink_freelists_dump_baselinerel(FILE *f);
-void ink_freelists_snap_baseline();
-


I commented earlier, but this was relocated. I'm not certain I see the motivation.

The freelist system is a very very specific use that is only used in Allocator.h and it is comprehensive and connected and unrelated to anything in ink_queue. Any changes to allocation and any traces for allocation ... will only relate to it.

If, as I needed to, it needs to be ifdef'ed out or checked for dependencies ... it cannot be done while it's sitting next door to incredibly commonly used data structures.

alficles · 2017-07-31T23:15:48Z

lib/ts/jemallctl.cc

+
+#include <cassert>
+#if defined(linux) && !defined(_XOPEN_SOURCE)
+#define _XOPEN_SOURCE 600


Are there specific features of the more recent POSIX that we're using here?

That was in the code previously ... and was brought in.

Previously to what? Brought in from where? This entire file appears to be new, and I'm not seeing a corresponding removal that would suggest that it was moved.

ink_memory.cc

Ah, you took the top of ink_memory, it looks like. Does this file need all these includes?

alficles · 2017-07-31T23:16:32Z

lib/ts/jemallctl.cc

+
+int mallctl_void(const objpath_t &oid);
+
+template <typename T_VALUE> auto mallctl_get(const objpath_t &oid) -> T_VALUE;


Already commented on trailing returns. Same observations as before.

alficles · 2017-07-31T23:18:51Z

lib/ts/jemallctl.cc

+{
+}
+
+template <typename T_VALUE, size_t N_DIFF>


N_DIFF is always zero, I think. What is it for? It appears to be unused.

It allows multiple variants of the same type of argument. Had to add it earlier but stopped using it when I realized we needed every hook filled.

Connecting template specialization on only the type makes it impossible to use a return / function type for any other purpose ... so adding in an int or something (i.e. an identifier or enum tacked on) allows one to have two versions using the same type.

I'm not sure I understand how that helps here. The parameter is unused by both callers and the class itself. Are you picturing a scenario where we need to add a second parameter for some reason?

lots of other JEmalloc params exist with need to be wrapped more nicely.

What scenario are you picturing where this argument might be useful? What does it mean?

chassell · 2017-08-02T16:03:08Z

For those who want to know: I will close this PR and bring out the smaller version I have now (passing tests) .... but with some profiling and speed comparisons.

Internally we are using some of this Jemalloc with 6.2.* to load test and to do more profiling, but this is the only target for merging.

zwoop requested review from PSUdaemon, SolidWallOfCode and bryancall July 28, 2017 02:16

zwoop assigned chassell Jul 28, 2017

zwoop added Core Performance labels Jul 28, 2017

zwoop added this to the 8.0.0 milestone Jul 28, 2017

PSUdaemon requested review from jpeach and zwoop July 28, 2017 13:52

PSUdaemon added the WIP label Jul 28, 2017

PSUdaemon requested changes Jul 28, 2017

View reviewed changes

dragon512 reviewed Jul 28, 2017

View reviewed changes

alficles suggested changes Jul 28, 2017

View reviewed changes

SolidWallOfCode reviewed Jul 30, 2017

View reviewed changes

SolidWallOfCode reviewed Jul 31, 2017

View reviewed changes

chassell force-pushed the jemalloc-and-thread-init branch 2 times, most recently from 9b68b68 to fa476b4 Compare July 31, 2017 21:59

jemalloc support (upon --with-jemalloc) with override for internal fr…

452d140

…ee-memory caches

chassell force-pushed the jemalloc-and-thread-init branch from fa476b4 to 452d140 Compare July 31, 2017 22:42

alficles suggested changes Jul 31, 2017

View reviewed changes

chassell closed this Aug 2, 2017

chassell deleted the jemalloc-and-thread-init branch August 2, 2017 16:04

chassell restored the jemalloc-and-thread-init branch August 3, 2017 16:42

zwoop removed this from the 8.0.0 milestone May 17, 2018


		int mallctl_void(const objpath_t &oid);

		template <typename T_VALUE> auto mallctl_get(const objpath_t &oid) -> T_VALUE;

full use of jemalloc with NUMA support and rework of thread spawning #2314

full use of jemalloc with NUMA support and rework of thread spawning #2314

Uh oh!

Conversation

chassell commented Jul 27, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zwoop commented Jul 28, 2017

Uh oh!

chassell commented Jul 28, 2017

Uh oh!

chassell commented Jul 28, 2017

Uh oh!

PSUdaemon commented Jul 28, 2017

Uh oh!

jpeach commented Jul 28, 2017

Uh oh!

PSUdaemon left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chassell Jul 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chassell Jul 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

chassell commented Jul 27, 2017 •

edited

Loading

chassell Jul 28, 2017 •

edited

Loading

chassell Jul 28, 2017 •

edited

Loading

chassell Jul 28, 2017 •

edited

Loading

chassell Jul 28, 2017 •

edited

Loading

chassell Jul 28, 2017 •

edited

Loading