Skip to content

Conversation

@chassell
Copy link

@chassell chassell commented Jul 27, 2017

Lots in here but no memory rework is useful without all the parts: NUMA-binds, hugepages, jemalloc arenas.

  1. --with-jemalloc and --without-jemalloc should both work (tests fine for both)
  2. push -ljemalloc to the end of the LIBS variable so all binaries use it
  3. use hooks for hugepage-size maps made from normal pages, ready for hugepaged to convert
  4. create all nodeset-to-affinity mappings before main(), mapped to later jemalloc arenas
  5. change spawning threads to allocate stack and EThread from node-near memory arenas
  6. #define over all uses of Allocator/ClassAllocator/ThreadAllocator to depend on JM's thread cache
  7. put a barrier on thread spawns until all in a group are done, before binding cpuset
  8. I am sure this kitchen sink I found could be squeezed in given more time ...

This is a big change but mostly has new code for 3, for 4 with some C++-friendly code for jemalloc tuning. Many could probably use its profiling capabilities as well.

[p.s. Yes .. with --without-jemalloc the Allocator code runs as usual]

@zwoop
Copy link
Contributor

zwoop commented Jul 28, 2017

[approve ci]

@chassell
Copy link
Author

I used that to call the "constructor" nullptr_t ... in case we used ::mallocx(). Maybe it's too much a stretch. It can just be nullptr.
^
../../lib/ts/ink_memory.h:61:25: note: expanded from macro 'mallocx'
#define mallocx(...) nullptr_t{}
^

@chassell
Copy link
Author

uploaded a quick (late at night) fix to remove two problems it seems. clang-format tomorrow.

@PSUdaemon PSUdaemon requested review from jpeach and zwoop July 28, 2017 13:52
@PSUdaemon PSUdaemon added the WIP label Jul 28, 2017
@PSUdaemon
Copy link
Contributor

[approve ci]

@jpeach
Copy link
Contributor

jpeach commented Jul 28, 2017

@chassell Please break this up into small, independent commits. Looking briefly at the diff, I'd expect at least 10 separate commits for this change.

Copy link
Contributor

@PSUdaemon PSUdaemon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is a very large PR that touches a lot of stuff. It seems like it could be several PRs, a few of which should be several commits. I've left some specific comments, but I think this main issue needs to be addressed.

#define YY_FLEX_MINOR_VERSION 6
#define YY_FLEX_SUBMINOR_VERSION 0
#define YY_FLEX_MINOR_VERSION 5
#define YY_FLEX_SUBMINOR_VERSION 37
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is committed generated code. You shouldn't be modifying this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wondered WTH that's checked in for. In normal practice we don't check in *.o or Makefile.

Accidentally in there.

{
m_g_pool = new ServerSessionPool;
eventProcessor.schedule_spawn(&initialize_thread_for_http_sessions, ET_NET);
eventProcessor.schedule_spawn([](EThread *thread){ thread->server_session_pool = new ServerSessionPool; }, ET_NET);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to code changes like this, but this PR is already huge and this seems completely unrelated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeh that can get pulled out. I think it was more a "nice thing" that is unlike the others .. and a proof of concept that it worked fine.

It's not critical to any functionality.

#include <openssl/pem.h>
#include <openssl/x509.h>
#include <openssl/ssl.h>
#include <openssl/bio.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really a change you made or do you need to rebase?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't make that at all... and I was rebasing every time. Must have crept in during a cherry pick I did back and forth.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have fiddled with those because I had another include ... and I'm in the habit of putting the "system-wide" or "more-global" includes after the specialized and local ones.

It always teaches me that I didn't put correct global includes that the includes themselves need (and the include file can accidentally be okay because someone upstream included its deps).

However ... that file in specific needed very little change... so I think I can pull that back too. No big.

}

static inline void
ink_get_thread_name(char *name, size_t len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's my browser or the size of this diff, but I don't see where this function is used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was used .. now is not. It's a nice idea, but ... ennhh.

#define INK_MIN_ALIGN 8
/* INK_ALIGN() is only to be used to align on a power of 2 boundary */
#define INK_ALIGN(size, boundary) (((size) + ((boundary)-1)) & ~((boundary)-1))
#define INK_ALIGN(size, boundary) aligned_spacing(size,boundary)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this different behavior or you are just trying to force type safety with templates?

If so, I think this should be a separate PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for what I was doing... I was making really sure that I didn't cut-and-paste a calc of alignment improperly. It's in so many places in the old code ... cut and pasted cut and pasted cut and pasted.

Nothing typesafe about it, to be sure, but it is like a number without using a define or an enum to clarify it. It's not essential .. but I left that in from earlier changes ... when I made aligned_spacing().

It doesn't have to be in there.. but it is nicer.

Thign is .. a lot of the "let's clean this up" ... seems unlike something that'd be approved from me, eh? I hope to see things cleaned up... ever. If they are dead-code or if they're just a far-flung cut-and-paste.

What's the best way to get cleanups in ... when they're nagging and simple or negligibly dangerous at all. Or is that precisely NOT a good PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't buy type safety. It's an identical result.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want raw numbers or values or expressions cut-and-pasted everywhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay ... not everywhere. Whatever the case is ... I may put in another PR to lower the number of cut-pastes that have apparently been around a long time for alignments.

lib/ts/Arena.h Outdated

#ifndef __ARENA_H__
#define __ARENA_H__
#pragma once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on board with this if all our supported platforms can use it (in the past they couldn't) but I think this needs a separate PR where it's done for every header file for consistency.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That policy isn't what I heard. What I heard is "as you touch it, upgrade to current policy / practice but otherwise leave it alone."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. We do want to make the code better. I usually think of that as fixing a loop structure because it's poor when I am just changing some behavior inside of it. This seems more like something we should decide to do as a coding standard and make the changes to all header files as they should be consistent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw it discussed recently and the C++11 (for 8.0.0) seems a natural place to change it over.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to die-on-that-hill btw... but I am wondering what is good cleanup (i.e. a loop structure because the inside needs changing) and what is out-of-bounds busybody cleanup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the #pragma one

It is supported by all compiler I understand we support.
any gcc, icc and clang compiler supports this feature. the versions we care about it ( ie ones that support c++11) support this feature. I already had a PR go in with this same change

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on pragma in this PR. If we do pragma, raise it on dev@, get consensus and apply it to the whole codebase at once. This is not the right PR for this change.

lib/ts/Arena.h Outdated
}

#endif /* __ARENA_H__ */
struct JemallocArena : public std::allocator<uint64_t>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that ATS Arenas and jemalloc Arenas were two different concepts. Is this not the case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "Arena" is a string alloc system ... so this is just a prefix on it. The main reason it's here is to prevent caching/over-allocation from the intermediate system.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JemallocArena is a bad name, agreed, as Jemalloc + Arena means something very precise.

That's because Arena.cc is kinda weird name for a string-storage system .. but that's all it's used for. In order to gain the benefit of Jemalloc and Thread-cache stuff (and to test it also) ... we need to pull out code that pre-allocates and partitions hoarded memory.

Mostly because we cannot track leaks / failures / profiling at all when memory is grabbed and returned to someone's internal cache system.

/// Global singleton.
class EventProcessor eventProcessor;

class ThreadAffinityInitializer : public Continuation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is removing all this correct? This looks like the stuff that @SolidWallOfCode added recently.

Copy link
Author

@chassell chassell Jul 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

he did. but he was frustrated at how huge it was, as I read his comments. It does nearly nothing. All it does is go through a sequence of added callbacks.

No data. No benefit. 15 times bigger ... and really unrelated to anything that EventProcessor does.

He says he did it to help you out.

[EDIT: okay ... maybe only 8 times bigger... and that's LOC that may change after clang-format]

Copy link
Author

@chassell chassell Jul 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is ... BTW .. the main purpose I wrote what I did is (1) to prevent allocating EThread inside the EventProcessor code and (2) NUMA allocation of EThread needs to be done correctly and (3) race conditions abound early on in thread groups.

Huge and crazy Factory/Object creation (including the antipattern "Cancer-Class") is never something to shove way deep down. RAII says that the creation of an object, just like invoking a C module ... should be done with all the customization done way early on.

I can see how EThread has been way way way overburdened because it's so stuck there. It's far easier to let it get customized (and it already has been elsewhere) to a lighter object type and delegate more out to derived types that are specialized to threads. If a C module can't init something ... it needs all the setup done at once with all the attributes, like pthreads.

That means new-ing it near the most-customized-point. MORE relevantly (for Jemalloc) in the correct Arena and on the correct NUMA nodes.

So I gave a try at shrinking it down using lambdas. Worked great. It's able to be seperated out, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's able to be seperated out, though.

That sounds great.

Copy link
Author

@chassell chassell Jul 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't need an opinion now (with bigger fish to fry) ... but I still don't know what you think.

Chris: <blah blah blah... I put it all this a big convenient garbage bag like you asked!>

Phil: That sounds great ....

:->. I'll leave it till a later discussion then.

if ( advice == MADV_DONTDUMP ) {
_arena = jemallctl::proc_arena_nodump;
} else if ( advice == MADV_NORMAL ) {
_arena = jemallctl::proc_arena;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why hardwire these?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the current code is hardwired too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code I see now passes an advice parameter all the way down. There aren't nodump variants of the functions.

Copy link
Author

@chassell chassell Jul 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have several properties on malloc'ed normal pages:

  1. NUMA node binding origin (if more than one exists)
  2. being replaceable within a hugepage (by alignment and equal flags)
  3. having been replaced with a hugepage by hugepaged (keeping equal flags)
  4. NODUMP flag on for single pages or for hugepage-aligned group or for single hugepage
  5. mmap() creation with MAP_GROWSDOWN flag for stacks that can increase as one unit

Essentially arenas can flag groups of pages in case 1, case 2+3 together (doesn't matter), case 4 ... and we can leave case 5 as raw mmap of reasonable alignment. It may even make sense for case 5 to be done in smaller pages instead, as many probably don't use their full stack and could stay small automatically.

So arenas exist in advance for NUMA-interleaved memory, for NUMA-isolated memory, for NODUMP pages only [which always stay as NODUMP]. Each defines a set of pages that don't get shared ... though they can be "replaced" with clean pages (and therefore freed from alloc in the kernel ... which of course can use or release physical pages as desired).

The proc_arena is "process arena" and it is the global NUMA-interleaved memory. The proc_arena_nodump is the same as "process arena" except all its pages (hugepages hopefully) get NODUMP flagged in their use.

Each flagging / reflagging is an annoying system call and, again, if we leave caching to Jemalloc then the set of pages are already prepared for use by the IOCache.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway ... to answer the implied question: the arena was created and its jemalloc hooks are different: all its pages are created and then madvise() flagged. Each arena can have different hooks and there are two varieties. With and without NODUMP.

bool huge_decommit(void *chunk, size_t size, size_t offset, size_t length, unsigned arena_ind);
bool huge_purge(void *chunk, size_t size, size_t offset, size_t length, unsigned arena_ind);
bool huge_split(void *chunk, size_t size, size_t size_a, size_t size_b, bool committed, unsigned arena_ind);
bool huge_merge(void *chunk_a, size_t size_a, void *chunk_b, size_t size_b, bool committed, unsigned arena_ind);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to understand better why all the hooks have to be implemented. I've seen examples where just the alloc and dalloc are implemented.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is over-allocating the pages so that they are (1) aligned on a hugepage boundary and (2) oversized in allocation so that we don't need to commit/decommit individual small-pages.

This is the way we get zero-latency hugepages. It works nicely. :->

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow.

Copy link
Author

@chassell chassell Jul 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a block of 2Meg (512 pages) ... is instant and as fast as mmap can do it. It never will fail unless memory is gone. There is no wait and there is no muss and no fuss and it is transparent. In fact ... jemalloc first demands an mmap() with no pages actually attached .. and then calls the "commit" calls to attach usable pages to the region laid out in a process.

Why do we want hugepages, then? Not because we demand them (for some hardware). We need memory simplified / copied / rearranged down so that contiguous pages are put directly into a single hugepage ... over time. Best of all would be simply for it to happen in the background.

That's what hugepaged does.

Because of that ... everything in these hooks is aligned and allocated in hugepage-sized groups (512 aligned correctly). This system then lets jemalloc still demand changes to segments and offsets within the page. That means commit/decommit are often no-ops that simply return with no effect: a single hugepage is already present (or may be) and de-committing part of it is somewhat silly.

Currently btw... all these calls are basically stateless. They understand one thing: they get a nullptr (huge_alloc) or they get something that huge_alloc returned to work with. Since I don't allow huge_split() to succeed unless distinct hugepages are created ... they may stay in a big single bloc forever, or allocated one 512-page segment at a time.

De-allocating an aligned set of hugepages ... is performed as needed. However arenas are rarely shut down entirely ... to that degree. A group of 512 pages together will still need some page or another left behind.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes ... if we ever demand "this must be a hugepage" ... we are always always going to have to have them assembled for us and will have to wait.

Since we're not in the hardware business, we don't need that kind of latency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a different way to do this with the system, however last I check in windows and Linux this required admin/root access. What is being stated here is that this allows a way for use to get a block of memory ( remember we have 64-bit address space which no hardware can use fully) ideally larger than we need. When used for items like arrays, we can simply grow them by having the system move the virtual size of the memory to commit. So a array of 1024 items can be grown to 2048 as we just say the memory we are using it now 2048 items. given the "virtual address" space is free we can do this. The OS maps the virtual address to random hardware for us on the back end, with no extra cost. However I don't think the cost in jemalloc is so low, as it has to manage chucks etc to be handled out. What i don't get is why we are handling these chunks and not jemalloc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check on some of that ... I am checking more closely and there may be better support for hugepages than I'd thought in jemalloc. Not certain if it is latency-inducing or not.

Copy link
Author

@chassell chassell Jul 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is ... it's not in the (new) first commit and I'll leave it out of this PR until jemalloc shows it cannot do it.

@chassell
Copy link
Author

Eek.. gotta get those tests passing.

My current plan is looking like this .. and I'd like some thoughts on it @PSUdaemon @SolidWallOfCode @jpeach

(1) jemalloc library use and Allocator/ClassAllocator/ThreadAllocator-replace as an experimental commit. A MADV_NODUMP arena for I/O and that's all to add.

No NUMA-node limits using arenas. Same as before ... and kinda broken, but its "experimental". In theory ... the thread cache will handle (and be pre-loaded) with all the same quality of freelist entries as the freelist system did. Probably far more compatible with tools and Jemalloc features.

(2) the Hwloc / NUMA affinity code put in so we have a quick call to "assign by affinity" for memory and then for cpuset ... maybe removing less of UnixEventProcessor's code. It will automatically use Jemalloc Arenas.

No hugepage-replacement by hugepaged .. and EThread is allocated from distant NUMA nodes.

(3) Hugepage-automatic hooks for Jemalloc. Clear and simple for a single commit so all arenas attempt to allocate (if enabled) correct size and alignment, automatically.

Not much downside to this. I have to be conservative in applying madvise() because contradictory flags cannot be put on a single hugepage.

(4) Improved EThread allocation (within correct node and arena) and improve startup of cpuset binding so that no race conditions exist near the start of a thread group. (use a barrier to release them all at once).

One other useful possibility is to improve the cache-MADV_NODUMP arena so that it uses HWLOC_..._REPLICATE and can put read-only data on near NUMA nodes.

(5) Other and wholly unrelated things: [A] a shim to change arenas in and out of plugin code, [B] recording the set of Continuation callbacks as set/called and [C] performing profiling within plugins or within individual callbacks.

All that is a dependent DAG ... but not necessarily monolithic.

@alficles
Copy link
Contributor

@jpeach You say ~10 commits, but I'm not quite as certain. Are you picturing each commit being self-contained, operating properly and independent of the other commits? There are definitely some things in here that are cleanup that could be moved entirely to a separate PR, but most of the rest is fairly interdependent.

@chassell
Copy link
Author

@jpeach @alficles I looked more at the list and talked with @jrushford.

He thinks I could split out maybe an "experimental-and-non-default" jemalloc first of all, get it approved, ... and then add on more independent work for NUMA-init changes etc... above.

only exception may be the thread-spawn code rework that needs both NUMA and jemalloc changes.

lib/ts/Arena.h Outdated
void str_free(char *str) { free(str,0); }

// does it the long hard way
static size_t str_length(const char *str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unclear on why we have all these new functions.

Adding a memory allocator is more about replacing new/malloc and delete/free backends. For example, this function does not seem to me that it should exist. strlen is a compiler intrinsic

Copy link
Contributor

@alficles alficles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gotten through only part of this. It's pretty big and could definitely be smaller. I'd see what we can whittle out of this that isn't necessary. The stuff that comes out can definitely go into separate PRs, where it's easier to digest.

const objpath_t _oid;
};

template <typename T_VALUE, size_t N_DIFF=0>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is N_DIFF for? I may have missed it, but I don't see it used.


using objpath_t = std::vector<size_t>;

struct ObjBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the name, I initially thought (and worried) that this was a new base for all objects. Maybe something like ObjFxnBase would be clearer, in context with (G|S)etObjFxn?

auto mallctl_get(const objpath_t &oid) -> T_VALUE;

template <typename T_VALUE>
auto mallctl_set(const objpath_t &oid, const T_VALUE &v) -> int;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to comment here, but you've used trailing returns unnecessarily, basically everywhere. Am I missing something? Is there a reason to use trailing returns when the return type does not depend on the types of parameters defined with template parameter types?

void *ptr = static_cast<char*>(nullptr) + len; // pointer from zero
// next size >= len to get next a new aligned block
return static_cast<const char*>( ats_align(block, 0, ptr, block) )
- static_cast<const char*>(nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a fan of type-safe things, but I'm pretty sure we're not gaining any type-safety here, and I think this function could be drastically simplified into a single function call. As it is, it takes far too many brain-cycles to figure out that this is just returning the results of one function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can find it ... I'm really wondering wth the C++17 design wanted as the standard "alignment" function. I had it on my machine but not elsewhere ... so std::align() ... really SUCKS.

It's deprecated and it's difficult and looks like it was designed by committee ... and I replaced it with "our own version" ... and it even now makes things tougher than they should be.

Cut and paste is not the way to make things clearer, though... because one messup becomes invisible. Every bit-system-AND-and-NOT is far worse than doing it well somewhere.

Though I have had it with the std::align() interface.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW .. type safety is ... not directly what I'd call it. It is more "obscurity removal" (i.e. enums versus raw numbers and cut-and-paste exprs) and "removing-poorly-handled-#define-cases".

But ... it is not the core issue now. At least I trust this more as it compiles with full safety, but something has to be better.

// workalike to std::align() ... though it's not a great API
//
template <typename T_OBJ>
inline T_OBJ *ats_align(std::size_t alignment, std::size_t size, T_OBJ *&ptr, std::size_t &space)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the use-cases, I'm not sure this is a significant improvement over the old INK_ALIGN. It modifies both ptr and space, but nothing intentionally uses these modified values. Separately, I'm not certain it is necessary for jemalloc. Consider sliding this into its own PR, perhaps re-thinking some of the signatures?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the space argument isn't used for its purpose anywhere either.

return (void *)aligned;
size_t left = alignment;
void *aptr = ats_align(alignment, 0, pointer_, left);
memset( pointer_, '\0', alignment - left ); // zero bytes before new block
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this doesn't work. pointer_ is changed by ats_align in the return, so it and aptr will always be the same. The previous code seems both clearer and correct in this case.

} Thread_Init_Func;
}
auto n = std::find( std::begin(kAffinity_objs), std::end(kAffinity_objs), objtype ) - std::begin(kAffinity_objs);
n %= countof(kAffinity_objs); // remap end to index zero
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be easier ways to accomplish wrapping the end to 0 than re-writing countof.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's used elsewhere and ... the compiler doesn't care at all in many cases.

static_assert(std::extent<T>::value > 0, // [0]
"zero- or unknown-size array");
return std::extent<T>::value;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I'm convinced that the one new use case justifies re-writing countof. Is there a major payoff I'm missing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Countof outside of #define is a lot nicer and handles more case correctly, no matter what is put in there. #define can return a "value" in cases where it shouldn't. I just don't trust it.

return std::extent<T>::value;
}

template<typename T, typename U> constexpr size_t offset_of(U T::*member)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function used?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no... it lived there to be used in another branch for C++-style lists instead of the #define LINK.

This thing has older little stuff to shrink down, to be sure. I will have some minimalization by later Monday.

#endif
#elif __cplusplus
// some signatures need this
using hwloc_obj_type_t = int;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this how it's done in the rest of the code? Most of the places I've seen use the typedef form.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typedef is not deprcated .. but it's much harder to read and it cannot easily work the the "default" situations in a stack of derived classes. It's nice to just "using SubClass::TypeInSubclass_t;" and then ... frankly .. all types should be introduced that way.

@dragon512
Copy link
Contributor

Have you run this with a Vtune profiler. I would be interested on the change in your cache and TLB miss and hit rates.

@jpeach
Copy link
Contributor

jpeach commented Jul 29, 2017

I can't follow this at all. This PR is some tangled web of cleanups and allocation changes. There's no problem statement of what you are trying to do there, no benchmarks and a single giant commit. Every cleanup here should be a separate commit, possibly a separate PR. Substantial changes should be made one at a time. If you have interdependent changes, restructure them to break the dependency. That is always possible.

IMHO this PR is a total non-starter. No-one should be landing this as it stands.

@mingzym
Copy link
Member

mingzym commented Jul 30, 2017

In this PR, I think I can not get what problem this PR is addressing, or how better the performance may benefit from, I think that is what we want from this PR, right?

@see spawn_event_threads
*/
EventType register_event_type(char const *name);
EventType register_event_type(char const *name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this inline? It's not called often, it's cleaner to have it in the .cc file.

@param cont continuation that the spawn thread will call back
immediately.
@return event object representing the start of the thread.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment doesn't seem consistent with the return type.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found not a single soul used that return. However the docs then become misleading as they are.

This part will be separated out for other discussion.

*/
void shutdown() override;
virtual void shutdown() override;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Methods can have at most one of virtual, override, or final.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was my way of making a function look the same every time. I can remove it.

ThreadGroupDescriptor *tg = &(thread_group[ev_type]);
// perform all inits for EThread
for( auto &&cb : _spawnQueue ) {
cb(t); // init calls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will this work correctly if the callbacks are needed to initialize thread local storage, especially the C++ variety?

Copy link
Author

@chassell chassell Jul 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. It's quite precisely identical to ....

for( int i = 0 ; i < countof(_spawnQueue) ; ++i )
{
(_spawnQueue[i])(t);
// i.e. use a temporary rvalue-only-ref of the pointer / functor
// and call the void (*f)(EThread *)
}

@chassell chassell force-pushed the jemalloc-and-thread-init branch 2 times, most recently from 9b68b68 to fa476b4 Compare July 31, 2017 21:59
@chassell
Copy link
Author

Put in jemalloc in a much more raw form. No use of Arenas excepting to divide up normal and NODUMP memory.

This patch is meant for performance, cleanup, maintainability [fewer caches].

@chassell chassell force-pushed the jemalloc-and-thread-init branch from fa476b4 to 452d140 Compare July 31, 2017 22:42
@chassell
Copy link
Author

New forced push ... fewer include reorderings. :->


#define RND16(_x) (((_x) + 15) & ~15)

struct _InkFreeList {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this was moved from ink_queue.h. Why?


#include <sys/types.h>
#include <memory.h>
#include "ts/ink_assert.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to reorder these for a particular reason?

Copy link
Contributor

@alficles alficles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only gotten through about half, so I'll try to get through the rest later. But here are my initial thoughts on round 2. I've re-mentioned stuff still in the queue from the first round, so it stays fresh.

};

#if HAVE_LIBJEMALLOC
#define Arena Arena_NoCache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels odd. You're over-riding the alloc/free functions in a non-virtual class by way of the preprocessor. A person reading the code is unlikely to properly guess which function is actually going to be called. I'd move the #ifdef into each of the alloc and free functions and the body of the class Arena.

As it is, if someone changes the signature of the first alloc (and the places that call it) without changing the jemalloc, not only will it compile, it will mostly work, but with the wrong allocator.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. nope. C++ blocks all signature overloads from a base class that have the same function name.

Lots and lots and lots of different techniques of specialized memory-hoarding doesn't speed things up, but often makes them vulnerable to unchecked and invisible bad-pointer usage.

There's also no way to profile deletions or allocations ... when they're hidden deep inside a memory-hoarder. The thread-cache in Jemalloc is far more efficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair enough. It's still odd. I'm not seeing how the claim about memory-hoarding is relevant to this observation thought? I'm claiming that a reader is unlikely to determine correctly which code is actually going to execute by reading the function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently I'm trying to make a non-mainline Jemalloc install (which is the point of the commit) work such that we can estimate it's performance. As it is ... the burden of #ifdef is worse than the burden of two overridden calls.

{
value_type *preCached[chunk_size];

for (int n = chunk_size; n--;) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n declared signed with unsigned argument. Probably not a problem in reality, but feels wrong.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should auto it. But signed/unsigned mismatches (and casts) are all over the code. :->

public:
using typename std::allocator<T_OBJECT>::value_type;

ObjAllocator(const char *name, unsigned chunk_size = 128) : _name(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a more seasoned ATS hacker chime in here? Do we prefer unsigned or unsigned int? I prefer the latter, but if the community is using the former, that's fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the preference is to be less ambiguous by using unsigned int but it's certainly not consistent throughout the code.

}

private:
const char *_name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I correctly observe that this name is never read? Only written once on initialization? And the caller is expected to maintain the memory allocation, since it's never freed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ looking into that ... as it may be old. it may have been pointing to non-malloc'ed static strings? bad as it is not nullptr'ed anyway.]

aligned_spacing(size_t len, size_t block = INK_MIN_ALIGN)
{
return INK_ALIGN(len, block);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is superfluous. In every case added here, INK_ALIGN does exactly the same job without asking the reader to comprehend a new function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantic burden is much worse with INK_ALIGN. It does no aligning of any pointer. The burden is worse also because INK_ALIGN is the odd man out. All the other functions in that header look like the ones below.

My case is that INK_ALIGN does not do what the name is... and even has the "INK_*" prefix that seems rather redundant and old compared with ats_

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what you mean by "the semantic burden".

You also say, "It does no aligning of any pointer." aligned_spacing does nothing but call INK_ALIGN, so if INK_ALIGN does no aligning, then aligned_spacing does no aligning either.

INK_ALIGN absolutely rounds a pointer (or whatever value) up to the next multiple of the given power of 2. That sounds very much like something I'd use the verb "align" to describe. I'm not clear what you're claiming here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a delicate question as to "what is more complex to use". Minimalism is just a heuristic that may lead to a simpler interface or may not. In this case .. I find minimalism (deferring to legacy #defines) to be a problem.

Simplicity [which is not always minimal] and consistency is the aesthetic I'm looking for: the name of a function relates clearly to what it does.

As for the verb: alignment is for pointers ... by precise distance from address 0x00000000....

Alignment can be extended to elements from an aligned pointer, by spacing them larger than they require so their pointer matches an alignment similar to the first. That's not alignment of a size ... but of the second pointer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevertheless, I'm not certain we need two functions that do the exact same thing, even if the existing thing's name isn't quite as descriptive as one might prefer.

#include "ts/Diags.h"


// in order to compile correctly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment warns against unknown danger. Can you expound upon what "compile correctly" means in this context? Why must we undefine that symbol? Is there no better way?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the most minimal number of lines by far. What is more ideal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not commenting on the number of lines here...

I am confused by this warning comment. It is telling me that we have to undef HAVE_LIBJEMALLOC, but doesn't indicate why, in enough detail for me to know what danger I'm avoiding.

Can you expound upon what "compile correctly" means in this context? Why must we undefine that symbol?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be mistaking the line in ink_queue.cc with these. The ink_queue one may be superfluous now.

iocore/eventsystem/ProxyAllocator.cc +29;: :#undef HAVE_LIBJEMALLOC
lib/ts/Arena.cc +28;: :#undef HAVE_LIBJEMALLOC

void ink_freelists_dump(FILE *f);
void ink_freelists_dump_baselinerel(FILE *f);
void ink_freelists_snap_baseline();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented earlier, but this was relocated. I'm not certain I see the motivation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The freelist system is a very very specific use that is only used in Allocator.h and it is comprehensive and connected and unrelated to anything in ink_queue. Any changes to allocation and any traces for allocation ... will only relate to it.

If, as I needed to, it needs to be ifdef'ed out or checked for dependencies ... it cannot be done while it's sitting next door to incredibly commonly used data structures.


#include <cassert>
#if defined(linux) && !defined(_XOPEN_SOURCE)
#define _XOPEN_SOURCE 600
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there specific features of the more recent POSIX that we're using here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was in the code previously ... and was brought in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously to what? Brought in from where? This entire file appears to be new, and I'm not seeing a corresponding removal that would suggest that it was moved.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ink_memory.cc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you took the top of ink_memory, it looks like. Does this file need all these includes?


int mallctl_void(const objpath_t &oid);

template <typename T_VALUE> auto mallctl_get(const objpath_t &oid) -> T_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already commented on trailing returns. Same observations as before.

{
}

template <typename T_VALUE, size_t N_DIFF>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N_DIFF is always zero, I think. What is it for? It appears to be unused.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It allows multiple variants of the same type of argument. Had to add it earlier but stopped using it when I realized we needed every hook filled.

Connecting template specialization on only the type makes it impossible to use a return / function type for any other purpose ... so adding in an int or something (i.e. an identifier or enum tacked on) allows one to have two versions using the same type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how that helps here. The parameter is unused by both callers and the class itself. Are you picturing a scenario where we need to add a second parameter for some reason?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of other JEmalloc params exist with need to be wrapped more nicely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What scenario are you picturing where this argument might be useful? What does it mean?

@chassell
Copy link
Author

chassell commented Aug 2, 2017

For those who want to know: I will close this PR and bring out the smaller version I have now (passing tests) .... but with some profiling and speed comparisons.

Internally we are using some of this Jemalloc with 6.2.* to load test and to do more profiling, but this is the only target for merging.

@chassell chassell closed this Aug 2, 2017
@chassell chassell deleted the jemalloc-and-thread-init branch August 2, 2017 16:04
@chassell chassell restored the jemalloc-and-thread-init branch August 3, 2017 16:42
@zwoop zwoop removed this from the 8.0.0 milestone May 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants