
Parallelism #41

Open
shelby3 opened this issue May 22, 2018 · 79 comments
@shelby3

shelby3 commented May 22, 2018

Unfortunately this discussion started in the Error handling thread.

And the discussion derived from a post I made about C no longer being a low-level language, because of its mismatch with the fact that all scaling must come from parallelism.

I hope that discussion will continue here instead.

How can we ever find any of our 1000s of comments of past discussion if they are buried in unrelated threads?

@shelby3
Author

shelby3 commented May 22, 2018

@keean wrote:

Actually you have that backwards, there is more information about parallelism available at runtime. This is why VLIW processors failed to outperform x86.

VLIW was based on the idea that the compiler could do more to extract parallelism from the code than the CPU, so they made a highly parallel CPU with many execution units that could run multiple instructions in parallel.

You mentioned this VLIW comparison and then VLSI/VHDL parallelism in our past discussion in the Concurrency thread. And ironically you made the same conflation of “low and high-level parallelism” which I will be correcting you on again herein.

I agree there’s more information at runtime relative to a compile-time approach which is parallelizing the same algorithm.

But your thought process is incorrectly applied to what I wrote. You conflated low-level runtime with high-level compile-time. You must compare apples to apples. Your point is true for low-level runtime versus low-level compile-time.

I did not write “more information”. I wrote “more semantic information”. The distinction apparently means nothing to you but it should.

The “more information” at runtime is limited in its character by the semantics expressed in the code. The higher level semantic information enables the programmer (not the compiler!) to make changes in the algorithm which are impossible for the low-level speculative pipelining parallelism to achieve, because the low-level transformations cannot change the high-level algorithm. A high-level algorithm designed to achieve parallelism can obtain orders-of-magnitude scaling of performance; whereas, the low-level speculative pipelining achieves much less than an order-of-magnitude increase, and this increase is not scaling by Moore’s law over time because the CPU designers have hit a limitation of nature which can’t scale. Advocating the futility of fighting against nature is inane, idiotic, myopic, etc.

In addition to my slamdunk point above, your point about runtime information can also be applied by a compiler+runtime at a higher semantic level which the low-level pipelining can’t do. For example, with pure functions a map operation can parallelize the operation over all elements, but the decision about whether it should or not depends on the ratio of the overhead of parallelization relative to the runtime cost of the operation. So the runtime could measure this ratio and decide whether to parallelize the map. A compiler could estimate it, but for cases near the margin a runtime could measure it.
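For concreteness, here is a minimal Go sketch of that idea (illustrative only; the single-element measurement and the assumed dispatch-overhead constant are my own simplifications, not a specification):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// parallelMap applies a pure function f to every element of xs.
// It measures f on the first element at runtime and only fans out
// across CPUs when the per-element cost outweighs an assumed
// goroutine-dispatch overhead, per the ratio argument above.
func parallelMap(xs []int, f func(int) int) []int {
	out := make([]int, len(xs))
	if len(xs) == 0 {
		return out
	}

	start := time.Now()
	out[0] = f(xs[0]) // measure one element
	perElem := time.Since(start)

	const dispatchOverhead = 2 * time.Microsecond // assumed, tunable
	if perElem < dispatchOverhead {
		for i := 1; i < len(xs); i++ { // not worth parallelizing
			out[i] = f(xs[i])
		}
		return out
	}

	// Parallel path: split the remaining work across CPUs.
	var wg sync.WaitGroup
	chunk := (len(xs) + runtime.NumCPU() - 1) / runtime.NumCPU()
	for lo := 1; lo < len(xs); lo += chunk {
		hi := lo + chunk
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				out[i] = f(xs[i])
			}
		}(lo, hi)
	}
	wg.Wait()
	return out
}

func main() {
	square := func(x int) int { return x * x }
	fmt.Println(parallelMap([]int{1, 2, 3, 4}, square)) // [1 4 9 16]
}
```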

Another tangential point is that when you claim that it’s too difficult for programmers to extract parallelism, you are ignoring the points made in the ACM article that extant PLs lack the proper designs to extract the parallelism that could plausibly be extracted without too much programmer intervention, such as the aforementioned pure function example. But my point has been also that ingenious programmers will find even more clever algorithms for parallelism.

That ACM article argued that C is no longer a low-level language because, lacking these facilities for the compiler to automatically extract the higher-level parallelism which the low-level speculative pipelining cannot extract, CPU designers are forced to create virtual, microcoded pipelining hardware inside the CPU which C does not even expose to the programmer. IOW, parallelism is the only way to continue to scale with Moore’s law, and C encourages CPU designers to instead not scale and to waste most of the transistors on trying to squeeze diminishing incremental improvements out of serial performance. That is a recipe for disaster, with the future of computing becoming stagnant and technology which relies on computing failing to improve. Thus it behooves us as PL designers to address this issue.

Essentially they stripped the speculative out-of-order hardware out of the CPU and used that space for extra execution units. This relied on the compiler to produce the parallel instruction bundles. What they found was it could only outperform the x86 on very specific scientific computation tasks, and failed to be faster for general computation.

Because they were running serial algorithms. Nowhere have I claimed that running serial algorithms on parallel hardware will be faster than running serial algorithms on speculative pipelining. Thus you shouldn’t write this irrelevant information which doesn’t even apply to my point.

My point was about the need to parallelize our performance critical code paths, otherwise our programs will not scale. This is a statement of fact, which none of your obfuscating BS has refuted.

I go with evidence over speculation every time.

Me too: I will also choose speculative parallelism because of the evidence, at a higher semantic level. Which is for example how 3-SAT is made practical for some working sets.

And you will go with evidence? Then why do you ignore the evidence that future improvements in serial speedups are dying? And why am I repeating this for the 4th or 5th time?

I simply can’t believe you’re this slow minded because you’ve demonstrated to me in many instances that you have a sharp intellect. Why the cognitive dissonance here? Is it because you lack divergent thinking? Is it some obstinance issue? Are you preoccupied? Are you being lazy and not really thinking it out, relying on past conclusions you had made instead of reanalyzing them in light of my points?

@keean
Owner

keean commented May 22, 2018

@shelby3

3-SAT is made practical for some working sets.

And 3-SAT is not general computation, it is a very specific algorithm, for a class of problems, but it looks very different from general computation.

The general part of most programs is an event loop that queues sequential processes, that necessarily follow one another, or are connected by streams in the pipelined case.

The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware, we can run these algorithms on the GPU already.

I agree that a new language has to handle parallelism. There are two ways I would do this, one is vectorisation, so high level operations like matrix multiplication can use SSE vector instructions for performance. The other is by 'goroutines' like parallel tasks communicating by streamed channels.

However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.
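For concreteness, a tiny Go sketch of that shape (parallel sequential stages connected by streamed channels); the stage functions are purely illustrative:

```go
package main

import "fmt"

// Each stage is a sequential task; the stages run in parallel,
// connected by streamed channels (Unix-pipe style).
func generate(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			out <- i
		}
	}()
	return out
}

func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

func main() {
	for v := range square(generate(5)) {
		fmt.Println(v) // 1 4 9 16 25
	}
}
```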

@shelby3
Author

shelby3 commented May 23, 2018

@keean wrote:

The general part of most programs is an event loop that queues sequential[ly ordered] processes, that necessarily follow one another, or are [analogously sequentially ordered by being] connected by streams in the [Unix shell paradigm] pipelined case.

ftfy.

Sequentially ordered != sequential process. There may be parallelism within the individual processes each of which are sequentially ordered. I know you know that nature is fractal, so there can be parallelism within serialism and vice versa.

Also the events in the event loop which are in the performance critical path are typically not sequentially ordered and can be performed in any order or even in parallel in some cases. The GUI events which are ordered are typically not performance critical (we do not need to scale the performance of GUIs by orders-of-magnitude).

The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware, we can run these algorithms on the GPU already.

I have not looked at what are the limitations of the designs of existing GPUs. Seems that the ones built into the CPU don’t have the fast GBs of dual-ported main memory that dedicated GPU cards have? And what are the issues with exchanging work between CPU threads and GPU threads? Are the cost barriers high enough that the two processors pretty much can only send messages to each other? I know you’re reasonably knowledgeable about the extant hardware designs, so I will defer to your knowledge on this.

But yeah, maybe our compiler should be doing more to leverage that GPU (and also the CPU’s multicore) hardware which often sits idle.

I agree that a new language has to handle parallelism.

I think my recent point can be recapitulated/reconstituted as: the more parallelism we can push for in code, the more incentive CPU designers will have to prioritize high-level parallelism instead of low-level pipelining. Hopefully resulting in scaling trending back to Moore’s law progress.

My point is we should do all we can and reduce dependence on the low-level speculative pipelining in our design decisions. For example, such as where you argued in the Error Handling thread that exceptions as return values would not be an extra performance cost because of the low-level speculative pipelining. But that design decision if taken would have further cemented the need for low-level speculative pipelining and helped kill future scaling. And upon further analysis of your proposal independent of the concerns with reliance on speculative pipelining, it became clear we didn’t need to accept that design decision.

There are two ways I would do this, one is vectorisation, so high level operations like matrix multiplication can use SSE vector instructions for performance. The other is by 'goroutines' like parallel tasks communicating by streamed channels.

Not just parallelism, but we also need to diminish our reliance on the model of main memory as having latency that is low enough to keep up with a serial processor. Because this requires expending the transistor budget on complex, deleterious caching mechanisms.

The ACM article also points out other things we can improve to reduce for example the reliance on cache coherency, which is necessary in order to enable hardware designers to prioritize parallel computation units instead of the very costly (inefficient!) model of low-latency main memory.

For example, that ACM article pointed out the benefit of our PL allowing the compiler+runtime to reformat arrays of structures into structures of arrays, without intervention from the programmer (a sketch of this reformatting follows the quote below). Also I wrote:


Let me quote what I wrote in my original post:

So the author of the article explained that instead of complex caches (and to accommodate the large working sets that the Transputer didn’t) we perhaps need an architecture where we have so many threads running that the latency to shared memory can be offset by computation these threads have to do in between accesses to main memory such that the CPU is always fully utilized. IOW, while some threads are waiting on main memory, other threads are running computation.

Note AFAIK that most GPU cards include a lot of shared memory with a very fast, wide bus. So perhaps they’re already close to what is optimal even for large working sets.
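Returning to the array-of-structures point above, a minimal Go sketch of the AoS → SoA reformatting the article describes (the Particle type and the explicit conversion function are illustrative; the article’s point is that the compiler+runtime would do this automatically):

```go
package main

import "fmt"

// Array of structures (AoS): the fields of one record are adjacent,
// so a loop touching only one field drags whole records through the cache.
type ParticleAoS struct {
	X, Y, Z float64
	Mass    float64
}

// Structure of arrays (SoA): each field is stored contiguously,
// which is what a vectorizing compiler or a GPU wants to see.
type ParticlesSoA struct {
	X, Y, Z []float64
	Mass    []float64
}

func toSoA(ps []ParticleAoS) ParticlesSoA {
	out := ParticlesSoA{
		X:    make([]float64, len(ps)),
		Y:    make([]float64, len(ps)),
		Z:    make([]float64, len(ps)),
		Mass: make([]float64, len(ps)),
	}
	for i, p := range ps {
		out.X[i], out.Y[i], out.Z[i], out.Mass[i] = p.X, p.Y, p.Z, p.Mass
	}
	return out
}

func main() {
	aos := []ParticleAoS{{1, 2, 3, 10}, {4, 5, 6, 20}}
	soa := toSoA(aos)
	fmt.Println(soa.X, soa.Mass) // [1 4] [10 20]
}
```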


However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.

Not always true. Again nature is fractal.

@shelby3
Author

shelby3 commented May 23, 2018

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

And the corollary is that expending it to speed up serial algorithms that are in the performance critical path of the application is futile non-scaling.

@keean
Owner

keean commented May 23, 2018

@shelby3

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

I disagree with this. Speculative out of order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower.

Having 4 cores does not make up for this, just look at the failure of the Power Architecture. PowerPC was IBM's attempt to follow that approach, lots of simpler cores. It was tried in the PlayStation 2, and the added complexity of programming the thing drove developers away, leading to the dominance of the Xbox, as it provided a proper PC-like CPU with caches and super-scalar out-of-order processing.

The ideas you and that article are proposing are not new, they have been tried, and they failed the test of the free market.

@shelby3
Author

shelby3 commented May 23, 2018

@keean wrote:

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

I disagree with this. Speculative out of order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower.

Do you not understand what I intend by “are not in the performance critical path of the application”?

It can’t be slower in any relevant way, if it is not in the performance critical path.

Whoop-de-doo if you make your highly responsive GUI more responsive and nobody can discern the speedup, because it is not code in the performance critical path. Law of diminishing returns aka marginal (non-uniform) utility.

Another corollary is that only 5% (or maybe up to 20%) of your application’s code is typically in the performance critical path, although there are exceptions. Thus the less stellar programmers will still have a lot of work they’re qualified to do that can be serial algorithms.

Btw, one (or maybe 2) core in the CPU could retain that speculative pipelining and the other 8, 16, or 24 (and increasing in the future) could then not waste their transistor budget on marketing hype. Has AMD already designed a CPU like this, such as with most of the transistor budget for the GPU on the same die?

@shelby3
Author

shelby3 commented May 24, 2018

Yet another example of parallelism I think the compiler can extract without intervention from the programmer.

@keean
Owner

keean commented May 24, 2018

@shelby3

It can’t be slower in any relevant way, if it is not in the performance critical path.

It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency, fine when you are doing a batch job, but bad for interactive desktop computers where low latency is critical.

@shelby3
Author

shelby3 commented May 24, 2018

@keean wrote:

It can’t be slower in any relevant way, if it is not in the performance critical path.

It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency, fine when you are doing a batch job, but bad for interactive desktop computers where low latency is critical.

I did not propose to employ parallelism in code that is not in the throughput performance critical path. Thus latency would not be impacted in the typically up to 95% of the code of a program that is not in the throughput performance critical path. Since that 95% of the code can only run serially, it can’t leverage the other cores. Thus one or two cores with the speculative pipelining are sufficient (on the desktop; servers will eventually discard the speculative pipelining entirely once we improve the software). The rest of the transistor budget can be allocated to parallelism, which is the only way to scale.

You could argue that running multiple copies of the program such as for a webserver where each request runs a separate copy of the program, thus would need every core to have speculative pipelining in order to minimize latency. But I would argue that a 50% reduction in CPU-bounded latency is not worth killing the scaling (by Moore’s Law) of the number of cores and thus number of requests that can be served per CPU. This is why Xeon (which sacrifices clock rate for more cores) instead of desktop Haswell/Broadwell architecture runs on most servers. So Xeon is sacrificing latency for parallelism.

Also Ryzen is kicking ass on Intel precisely because they made the right tradeoff for where computing is headed, “Ryzen CPUs offered stronger multi-threaded performance and weaker single-threaded performance relative to comparable Intel CPUs.”

Sparc was too geared towards parallelism too soon. The software had not caught up yet. AMD is probably hitting the sweet spot for the transition phase. They decided to emphasize their strengths in design and parallelism (GPUs) and outsource the foundry (eventually leveraging the coming strength of China in foundries!). AMD is following Android’s model and Intel is following Apple’s model, so I expect AMD to end up with more market share than Intel unless they fuck up again to snatch defeat from the jaws of victory. Also AMD has rising capital to expend because of the rising demand for GPUs for mining cryptocurrencies. But eventually something like Sparc or GPUs will come back much stronger in the future when the software adapts to more parallelism. And look who continues Sparc now. Japan! The Asians are going to kick ass in the future.

Also the latency is typically not bounded by the CPU but rather bounded by the persistent storage media, e.g. SSDs. So there is sometimes (or maybe even typically?) ample overhead to allow for parallelism and discard the speculative pipelining without increasing latency, although this can vary in each situation. You could attempt to make the argument that latency of the CPU and persistent storage are additive, but smart coders know to mask the CPU latency in the persistent storage latency with interleaving. Ditto the masking the CPU throughput and latency with memory latency via interleaving computation with read/write accesses.

@shelby3
Author

shelby3 commented May 27, 2018

I farted:

If for example that is an image filter, then split up the image file into tiles and parallelize the entire filter chain over tiles. If necessary overlap the tiles slightly if the filter input domain exceeds its output domain.
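A rough Go sketch of that tiling idea, assuming a grayscale image stored as a flat slice and a per-tile filter callback (the names and signature are illustrative, and the slight tile overlap mentioned above is omitted):

```go
package main

import (
	"fmt"
	"sync"
)

// filterTiled runs the filter over the image tile by tile, one goroutine
// per tile; each tile writes a disjoint region of dst, so no locking is needed.
func filterTiled(img []float64, w, h, tile int, filter func(dst, src []float64, w, x0, y0, x1, y1 int)) []float64 {
	dst := make([]float64, len(img))
	var wg sync.WaitGroup
	for y0 := 0; y0 < h; y0 += tile {
		for x0 := 0; x0 < w; x0 += tile {
			x1, y1 := min(x0+tile, w), min(y0+tile, h)
			wg.Add(1)
			go func(x0, y0, x1, y1 int) {
				defer wg.Done()
				filter(dst, img, w, x0, y0, x1, y1)
			}(x0, y0, x1, y1)
		}
	}
	wg.Wait()
	return dst
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	invert := func(dst, src []float64, w, x0, y0, x1, y1 int) {
		for y := y0; y < y1; y++ {
			for x := x0; x < x1; x++ {
				dst[y*w+x] = 1 - src[y*w+x]
			}
		}
	}
	img := make([]float64, 8*8)
	out := filterTiled(img, 8, 8, 4, invert)
	fmt.Println(out[0]) // 1
}
```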

@shelby3 shelby3 mentioned this issue Jun 1, 2018
@shelby3
Author

shelby3 commented Jun 2, 2018

Some parallelism discussion in the Concurrency issues thread #17.

@sighoya

sighoya commented Jun 6, 2018

@shelby, do you know how the RAM capacity of a normal PC board will increase in the future? A massive multi-core increase in the future is one good thing, but to run multiple instances simultaneously we also need more RAM capacity.

@keean
Owner

keean commented Jun 6, 2018

@shelby3 Threadripper!

https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh

It's still speculative out of order with up to 32 cores.

@shelby3
Author

shelby3 commented Jun 7, 2018

@keean wrote:

It's still speculative out of order with up to 32 cores.

64 hyperthreads on a desktop! Wow.

Yeah I had realized that removing the speculative out-of-order doesn’t gain much. But the point remains that further increases due to Moore’s law are going into cores now. There’s no more that can be squeezed out of OoO execution.

Remember I wrote recently in the Concurrency issues thread #17:

So again I ask you, “Do you know what percentage of the silicon of Intel CPUs are for the speculative execution and register rewriting as contrasted with the caching?”

Apparently only about 10 – 15% of the transistor budget is expended on OoO execution (the percentage of the TDP power budget may be higher):

https://hothardware.com/news/amd-ryzen-die-shot-cache-structure-architecture-advantages-exposed

https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Core_2

https://en.wikichip.org/wiki/intel/microarchitectures/sandy_bridge_(client)#Core_2

So apparently there would be not much sense in removing it for the purpose of increasing the number of cores.

Although it would still perhaps be beneficial to turn it off when we don’t need it (i.e. in I/O bound concurrency) in order to conserve power consumption (important on mobile and in data centers), and when we need our code to surely be free from timing attacks that might be lurking in the nondeterministic out-of-order speculation.

Nevertheless the point remains that the transistor budget is increasing faster than transistors can be consumed for extant core counts that would increase performance. IOW, the OoO execution and superscalar features can’t be improved. So Intel and AMD are forced to use those transistors for increased core counts.

So more extensive use of concurrency and parallelism is unavoidable.

@shelby3
Author

shelby3 commented Jun 7, 2018

@sighoya I don’t know the plans about DRAM. I presume they’ll scale everything up. It’s the latency which can’t keep pace. But with massive concurrency then we can coalesce main memory accesses as GPUs do.

@keean wrote:

But we need the L3 cache.

Well actually, one day in the distant future when the concurrency is massive and we can do coalescing of main memory (and also persistent storage) accesses with concurrency, then we won’t even need the L3 cache, but that is I think too far into the future to design for now.

For that sort of use C + OpenMP would be a good target.

Found this:

AMD told us you should theoretically be able to build a Threadripper system with up to 1TB of memory when 128GB LR-DIMMs are used—hopefully enough to hold you for a few years. (AMD also says Threadripper should technically be able to support up to 2TB of RAM, although the company hasn’t validated this because there are no DIMMs that support the capacity yet.)

@keean
Owner

keean commented Jun 7, 2018

@shelby3
Think of it like this: CPUs like Threadripper are best when you have threads that need to take different paths, or run different tasks. GPUs are best when all the threads run in lockstep, all taking the same codepath in parallel.

On a CPU you might launch parallel threads, and then wait for them all to complete (a rendezvous); on a GPU you know all the tasks finish at the same time. So a GPU is going to be more efficient for loop parallelisation (as long as it's a big enough chunk of work to overcome the cost of launching the task and collecting the results). A CPU is going to be more efficient at having one thread per service connection, dealing with client connections etc.

@keean
Owner

keean commented Jun 7, 2018

@shelby3 A good parallel language should have different syntactic structures to represent these different kinds of parallelism, so the programmer is clear how things will execute and how much they will cost (the principle of visibility, and not paying for what you don't use). You could automate this with heuristics, but then the programmer has to play chase-the-heuristic, where small seemingly inconsequential code changes can tip the balance and result in unpredictable changes in performance.

@shelby3
Author

shelby3 commented Jun 21, 2018

@keean made me aware of this:

Microsoft is working on a new paradigm for increased parallelism:

Now Microsoft ports Windows 10, Linux to homegrown CPU

Explicit data graph execution - Wikipedia

Also remember I predicted this earlier in this discussion about the death of speculative superscalar architecture and shared caching between threads:

OpenBSD disables Intel’s hyper-threading over CPU data leak fears

https://www.itwire.com/security/83347-openbsd-chief-fix-for-new-intel-cpu-bug.html

And remember that there is no way to disable timing information (one can merely increment a variable in a loop):

https://gruss.cc/files/fantastictimers.pdf

The only 100% safe way to stop code from doing software timing is to make all code run at a continuously, randomly changing clock speed, which is neither desirable nor realistic.
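To make the “increment a variable in a loop” point concrete, a toy Go sketch of such a software clock (illustrative only, not an actual exploit):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// A toy "software timer": one goroutine spins incrementing a counter,
// and other code reads the counter before/after an operation to obtain
// a relative timing measurement with no hardware timer at all.
var ticks uint64

func main() {
	go func() {
		for {
			atomic.AddUint64(&ticks, 1)
		}
	}()

	before := atomic.LoadUint64(&ticks)
	work := 0
	for i := 0; i < 1000000; i++ { // the operation being "timed"
		work += i
	}
	after := atomic.LoadUint64(&ticks)
	fmt.Println("elapsed (arbitrary ticks):", after-before, work)
}
```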

OS preemptive threading may to some extent foil the use of timers which rely on incrementing a variable and non-preempted execution, but the devil is in the details. I don’t know the minimum interval needed to measure a timing attack with software timers. And if the OS preempts too frequently then overhead increases, which eats into the performance that was supposed to be gained from those hardware performance features (speculative execution and shared caches) which make the timing attacks possible. Thus I’m not sure that removing hardware timers is a complete solution. Also some software needs high-precision timers. Thus I still propose that in the future we will need the option to have code request that it be run only on cores with that vulnerable performance enhancements turned off and/or to dictate which other code can run along with it in the other hyperthreads which share the same vulnerable performance enhancements. Code that prioritizes security over maximum performance or which employs suitable parallelism may want to turn those vulnerable performance enhancements off.

EDIT: if you want some flavor of why correlation attacks can be so insidious and we should not think we can so easily squash them just by removing a hardware counter.

@keean
Owner

keean commented Jun 21, 2018

@shelby3 I am not sure, I think for most people performance is more important than security. I mean you can still exploit "rowhammer" on DRAM, but people are still using DRAM :-)

@shelby3
Author

shelby3 commented Jun 22, 2018

As I wrote previously, I think eventually we will end up with a way for applications to turn it off on a per core basis so that applications can decide whether they need to prioritize security (and battery life) over serial performance.

@shelby3 shelby3 mentioned this issue Jun 26, 2018
@shelby3
Author

shelby3 commented Jul 9, 2018

I want to make my stance clear because I learned in private messages that some misunderstood my stance. In my post that responded to Eric Raymond’s blog post and his claim about the invariant of SICK algorithms requiring serial performance, and in the discussion here in Github that has ensued, I didn’t intend to imply that we will ameliorate mathematical complexity theory. Rather my stance is that we will replace the need for those algorithms by solving problems with different strategies which can be parallelized and that do not require those algorithms. My stance is that we have no choice, and humans necessarily rise to the challenge and are ingenious when the market demands them to be.

We may also find algorithmic strategies to apply to existing SICK algorithms which leverage parallelism, but that wasn’t my central thesis.

@shelby3
Author

shelby3 commented Jul 23, 2018

@keean can you concur or correct my thought that §4. Challenge #‍1: Traffic on pg.3 of the paper Why On-Chip Cache Coherence is Here to Stay (which also appeared in a journal) contains an error in logic:

We now tackle the concerns regarding the scalability of coherence’s traffic on the on-chip interconnection network. To perform a traffic analysis, we consider for each cache miss how many bytes need to be transferred to obtain and relinquish the given block
[…]
In a coherent system, when a core reads a block that is shared, the coherence protocol may need to forward the request, but to at most one core (and thus the traffic for each read miss is independent of the number of cores). However, when a core incurs a write miss to a block that is cached by one or more other cores, the coherence protocol generates extra messages to invalidate the block from the other cores. These invalidation messages are often used to argue for the non-scalability of cache coherence, because when all cores are sharing a block, a coherent system must send an invalidation message to all other cores. However, our analysis shows that when sharers are tracked exactly, the overall traffic per miss of cache coherence is independent of the number of cores
[…]
Consider the access pattern in which a block is read by all cores and then written by one core. The writer core will indeed be forced to send an invalidation message to all cores, and each core will respond with an acknowledgment message (i.e., a cost of 2N messages for N sharers). However, such an expensive write operation can occur only after a read miss by each of those cores. More generally, for every write that invalidates N caches, it must have been preceded by N read misses. The traffic cost of a read miss is independent of the number of cores

Although they’re correct that the bus traffic overhead of the invalidation messages is constant relative to the traffic of the read misses, the cost of coherency for per-core private caches is that every write invalidation is by definition going to cause an additional read miss for each core that still needs access to anything (that wasn’t even invalid but was included) in the block (aka cache line) that was invalidated. IOW, the overall traffic cost for per-core caches due to write invalidation grows with the number of cores, i.e. it doesn’t scale. Or more simply stated, sharing mutable data between cores that have private caches doesn’t scale.
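As a worked illustration (my own numbers, not from the paper): with N = 64 cores all sharing a block, one write costs roughly 64 invalidations plus 64 acknowledgments, which the paper amortizes against the 64 read misses that preceded it, keeping traffic per miss constant. But if all 64 cores still need something in that line and re-read it after the invalidation, that single write also induces 64 fresh read misses (each a full cache-line transfer), so the total traffic generated per write grows linearly with the number of sharers.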


EDIT: I received an email reply from one of the authors of that paper. He concurred but pointed out that my point is only true when there exists the “unusual situation” of “false sharing” of “unrelated data … collocated on the same cache block, leading to needless coherence traffic.” His opinion is that “good data placement, false sharing shouldn't be a big problem.” I responded that I disagree because there will be cases where the record of related fields being written to also needs to be read again, so it’s not necessarily due to false sharing and thus not necessarily unusual or uncommon. Thus I continue to believe that the discovery by mostly @keean with my follow-up insights is very important. However I will say I don’t have any statistics or functioning models to check the frequency of this issue and its quantitative impact.


I don’t necessarily disagree with the title of the paper though once writable shared data is disallowed in the system. There’s still some coherency required…

If we adopt @keean’s suggestion in the Pointers #38 issues thread that all data shared between threads must be immutable (aka read-only), then my realization is that the only time that per-core caches need to be invalidated is when an immutable data structure has been deallocated and the memory it occupied is allocated anew for some other data. Because the invalid data remaining in the cache will not be accessed by any program which respects the allocation until the same area in shared memory has been allocated anew. The allocator could optimize this further (so that other data in the same block which is still valid as cached is not prematurely invalidated) by refusing to allocate anew the memory that the deallocated immutable data structure formerly occupied until all of the surrounding memory which comprises an entire block (regardless whether that entire block is loaded into any per-core cache) that will be invalidated is also deallocated. Thus the invalidation broadcasts (to the per-core caches) would only happen when the entire block in shared memory is no longer in use and has accepted a new allocation.

This does conflate software and hardware though. One way to implement it is to have a system call for invalidating blocks and rely on software to manage this correctly for its virtual memory pages.
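A toy Go sketch of the bookkeeping that deferred-reuse idea implies (my own illustration; the block indices and the invalidation call are stand-ins for real cache lines and the hypothetical system call or message):

```go
package main

import "fmt"

// blockPool refuses to hand out freed space inside a cache block until
// every immutable object in that block is dead; only then is a single
// invalidation broadcast for the whole block and the block made reusable.
type blockPool struct {
	live     map[int]int // block index -> count of live objects in it
	reusable []int       // fully dead blocks, eligible for new allocations
}

func newBlockPool() *blockPool { return &blockPool{live: map[int]int{}} }

// place records that a new immutable object was written into blockIdx.
func (p *blockPool) place(blockIdx int) { p.live[blockIdx]++ }

// release records that an object in blockIdx was deallocated.
func (p *blockPool) release(blockIdx int) {
	p.live[blockIdx]--
	if p.live[blockIdx] == 0 {
		// Hypothetical: one invalidation per block, issued only when
		// the whole block is dead.
		fmt.Println("broadcast invalidation for block", blockIdx)
		p.reusable = append(p.reusable, blockIdx)
	}
}

// nextReusable returns a block that may now be allocated anew, if any.
func (p *blockPool) nextReusable() (int, bool) {
	if n := len(p.reusable); n > 0 {
		idx := p.reusable[n-1]
		p.reusable = p.reusable[:n-1]
		return idx, true
	}
	return 0, false
}

func main() {
	p := newBlockPool()
	p.place(0) // two immutable objects collocated in block 0
	p.place(0)
	p.release(0) // block 0 still has a live object: no invalidation, no reuse
	p.release(0) // now fully dead: one invalidation, then block 0 is reusable
	fmt.Println(p.nextReusable())
}
```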

The conclusion section of that paper states:

First, this paper does not do detailed architectural simulations with specific benchmarks. We instead show that coherence’s per-miss traffic is independent of the miss pattern and number of cores. Although less precise than detailed simulation, our results are actually more robust as they are not limited to the specific benchmarks studied.

They actually failed to consider the holistic issue, which architectural simulations under scaling would presumably have illuminated. They considered each read miss independently without considering (and thus implicitly presuming not) that write invalidation drives more read misses thus amplifying traffic.

Second, we did not compare against multicore chips without caches or without a shared address space.

They also failed to consider the case of an Erlang-like model wherein all shared data is immutable.

P.S. I emailed the authors of that paper a link to this post.


Additionally immutable shared data can facilitate less complexity for cache-to-cache transfers as explained in 3) Cache-to-Cache Transfers of subsection B. Coherence Protocol Independent Design Considerations in §III. IMPLEMENTATION CONSIDERATIONS on pg. 7 of the research paper An Evaluation of Snoop-Based Cache Coherence Protocols:

3) Cache-to-Cache Transfers: To improve the timeliness of requests for data that is cached somewhere, it may be useful to allow another cache to supply the data rather than always going to memory. To support cache-to-cache transfers, each BIU must have a way of indicating whether or not its cache has the data and is able to supply it. This can be accomplished by means of a single wired-OR bus control line, which is asserted by each BIU if its cache is able to supply the data. The memory controller will also know that it should not supply the data if this control line is asserted. If several caches are able to supply a copy of the data, there must either be a way of selecting one of those caches (by using a priority or ownership scheme, for example) or it must be guaranteed that all caches will put the same value on the bus. In [1], it is noted that this additional complexity and the potential for slowing down the processors that have to provide the data from their cache has resulted in a limited use of cache-to-cache transfers.

Another issue with cache-to-cache transfers is whether or not one of the caches has the cache block in question in a Modified state, which can be indicated by another wired-OR bus control signal asserted by the BIU. If one of the caches has a modified copy of the block, it should not only be the cache that is chosen to supply the data, but it should also inhibit memory from supplying the data since memory has a stale copy [1][3]. As the cache is supplying the modified data to the requestor on the system bus, it is possible to write the modified value back to memory at the same time (as opposed to writing it back in a separate bus transaction). While this saves a bus transfer and improves bus performance, it has the additional complexity of having three cooperating members on the bus [1].

And cache-to-cache transfers provide power-savings as explained in subsection F. Energy Efficiency of Cache Coherence Protocols of §II. SURVEY OF EXISTING CACHE COHERENCE PROTOCOLS on pg. 5 of the same research paper. And also explained in §The case for cache coherency of the article How Cache Coherency Impacts Power, Performance.

Read-only shared data is actually faster and uses less power on existing systems which employ the MESI protocol.

@shelby3
Author

shelby3 commented Jul 24, 2018

This post contains a shocking (at least to me) discovery that Java and Go multi-threading with a global heap and tracing GC is entirely incompatible with scaling multi-core. Scaling multi-core is entirely incompatible with multi-threading that shares mutable data between caches. Additionally, tracing GC is incompatible with scaling multi-threading unless each thread (or small group of threads) has its own separately collected heap, because as I explained a collector which isn’t instantaneous can’t be shared between too many threads: it would pause too many threads and make threads dependent on the good allocation behavior of other threads. So scaling multi-core will require reference counting for immutable objects shared between caches (and the immutability will aid in probing and collecting acyclic references, while not stalling any threads during such cycles detection and collection).

The GC of objects which aren’t shared between caches must be congruent with a viable general purpose programming model. Intel’s Xeon Phi (aka Larrabee) suffered from not sufficiently addressing these requirements2. One option is Rust’s zero overhead 100% stack allocation with its lifetime and exclusive borrowing model providing safe reentrancy. Code would be grouped to be run only by hardware threads sharing the same cache (to avoid write-back and cache-to-cache transfer coherency overhead). Another option is similarly to restrict each logical grouping of code to hardware threads sharing the same cache, but employ a separate GC’ed heap for each said grouping. For example, each green thread could have its own separate tracing GC’ed heap. Note this means a M plurality of work stealing green threads per core that has N = # of hyperthreads in M:N threading. However, this latter option lacks Rust’s zero overhead abstraction compile-time provably safe thread reentrancy; thus suffering the race condition (e.g. live and deadlock) errors and runtime overhead of locking. (EDIT: but this deficiency is solved with Actors.)

Additionally, for cache-to-cache transfer efficiency and so that lightweight (e.g. green) thread context switches could occur to mask the latency of such transfers, ideally the programming model would enable the thread scheduler to know via software that such a transfer is required.

Serial core performance can’t scale and many-core doesn’t scale with complex cache coherence. At least it doesn’t scale if writable shared data is employed by the software. And if all the software will be modified to not use writable shared data, then the complex cache coherence circuitry is wasted silicon. In that case perhaps a less complex or even software driven coherence could be employed instead. If all cache-to-cache transfers are only needed when passing an Erlang-style, Actor model message from a thread on one cache to a thread on another cache, then the software can tell the hardware the source and destination caches for the transfer. (Note this means a plurality of Actors per core running on top of M work stealing green threads per core that has N = # of hyperthreads in M:N threading). Remember per the prior post that cache-to-cache transfers are needed so as to avoid (and if not read-only or was newly created, then also avoid the write-back to and) the load from main memory which is expensive both in terms of latency and power consumption.

The Rigel design offers some rough estimate of the improvement in computation density (e.g. FLOPS/mm² of silicon) by removing the:

  • out-of-order speculative execution
  • deep pipelines
  • complex cache coherence
  • cache-to-cache transfers
  • SSE vectorized SIMD
  • virtual paging, virtualization, etc (that a GPU doesn’t have either)

It was modeled on a 45 nm process. It has 1024 RISC cores (with two-wide issue superscalar ILP), 15 MB total SRAM cache, operating at 1.2 GHz with a 100W TDP. Compare that to the Intel i7 Lynnfield on a 45 nm process, which achieves only 4 CISC cores with 8 MB total SRAM cache operating at 2.8 GHz with a 95W TDP.

Ignoring latency differences and considering only throughput density1, I doubt the Lynnfield in 2009 had the 180 instructions in flight simultaneously that current modern Intel processors have. For one thing, Lynnfield didn’t even have hyperthreading. Let’s presume maybe it had 40 in flight and the Rigel only 2. So unless 1024 ÷ (40 ÷ 2) ÷ 4 ≈ 13 RISC instructions are required to match the work achieved by every CISC instruction, the Rigel may have a higher compute density than the Lynnfield. The in-flight instructions on the Intel also include some wasted computation that will be discarded by the speculation.

The Rigel paper discusses some of the other variants2 of parallelism throughput density designs which have been attempted.

One of the key distinctions1 is between the GPU vectorized (aka SIMD) model, which is able to mask main memory latency with a massive number of cheap context-switch threads (affording an order-of-magnitude higher memory bandwidth by trading off increased latency), and the non-vectorized (aka MIMD) general purpose programming model, which prioritizes serialized low latency and serialized performance at the expense of lower computation density and higher power consumption per work completed (i.e. lower overall efficiency). The latter has an order-of-magnitude lower main memory bandwidth with lower but still high latency (c.f. also), requiring a large L3 cache to bring average latency down sufficiently for the serialized performance. As explained in the Rigel paper, the former is not suitable as a fully generalized programming model.

The following is an idea for a variant of a Rigel-esque design (yet with software driven cache-to-cache transfers and cache coherence) that replaces the L3 (and L4) cache with a solution that has comparable latency, by masking most of the latency of main memory reads. Lightweight green threads are a suitable paradigm for masking macro-scale latency1, but they aren’t efficiently preemptive enough to generally mask main memory latency. But main memory latency will only significantly occur (i.e. except for cache spill-over eviction of non-shared objects, which should be minimized by the optimizing programmer) on accesses to reference counted shared objects (i.e. shared between caches), which probably mainly comprise reads of those reference counted shared objects that are cache-to-cache transfers (perhaps with software driven cache coherency driven by Erlang-style, Actor model message passing). So the compiler is enabled to insert a green thread check-point for preemption by the memory controller (which ameliorates the lack of efficiency). Contrary to the incorrect association of green thread context switches with kernel thread/scheduler context switches, green thread context switches are low latency. At a function boundary, they’re only slightly more than the ~30 cycles latency of a jmp instruction, because only the SP register has to be saved+restored (given that the other registers and flags are invalidated at the boundary). But in this case the preemption would not be at a function call boundary, so all the registers and flags have to be stored and restored, which may double the latency. The memory latency is ~100ns (on CPUs, and ~400ns on the GPU), which on a 2.6 GHz CPU is 260 cycles. But this really needs hardware support that the software can query, so that in case the main memory being accessed is already cached (which the software wouldn’t know) the unnecessary context switch would be avoided. The extra latency for storing and restoring the other registers and flags could sometimes be eliminated by prefetching to the cache at the start of the function containing the main memory accesses. No registers or flags are yet in use at the start of a function.

Hardware-based cache coherency (which is required without this software directed coherency we’re contemplating) obfuscates the true costs of the hardware form of coherency.

1 Sufficient threaded parallelism can mask latency so that hardware is not idled. The GPU achieves this with a huge register file (RF) and putting 32 threads in a contiguously context switched warp, so context switching has low overhead (and also reduces the control circuitry needed per execution unit). The modern i7 hyperthreading employs dedicated registers for the hyperthreads which share execution pipeline ports and run simultaneously. Thus hyperthreads although masking some latency are also intertwined with a focus on serial performance. Green thread work stealing which can run on modern CPUs without hardware support employs compiler generated checkpointed pre-emption (not the same as full pre-emption) which could probably be made more efficient with dedicated hardware support. Green thread context-switches are reasonably efficient compared to OS threads, and especially where context switched at function boundaries, but only suitable for masking software macro-scale latency blocked operations, which does not include low-latency main memory accesses. They may be able to also mask efficiently the latency of waiting on a mutex lock to be freed without hardware assistance.

The GPU trades off latency in every facet for increased parallelism by eliminating synchronization assumptions for caches and global memory, as well as for example reducing the power consumption (also) by designing a much higher 24 cycle write latency (archived) (c.f. also) on the register file (RF) than the registers for each core on a GPU. The RF (and shared memory) is shared by all threads in the block so a thread context switch is very low overhead because unlike the CPU no registers needed to be saved.

The GPU’s shared memory also has a much higher latency than L1 cache on the CPU and doesn’t serve the same function as a CPU cache. Also the shared memory is banked (where each 16 or 32 successive words are successive banks) so that if threads in the same block read different locations from the same bank on the same clock cycle then they will run serially instead of in parallel. All 32 threads of a warp must be unblocked in order for the warp to execute, but even if they are unblocked they aren’t guaranteed to execute uninterrupted. Note, given 32 threads attempting uniformly distributed random reads from 32 shared memory banks, probability math informs us that 63% of the threads would run in parallel every cycle. Thus such a divergent, non-vectorized access pattern would require four (4) times the 24-cycle latency for each step forward of the warp (thread block) that requires all 32 threads to access the shared memory, because (0.37 ^ 4) × 32 < 1 so that less than one (1) thread is blocked for each such step (since all threads in the warp must be unblocked in order for the warp to execute). This is a reason the GPU is not suitable for general purpose programming.
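For the record, the 63% figure follows from a standard balls-into-bins estimate (my derivation, not from the cited sources): the expected number of distinct banks hit when 32 threads each pick one of 32 banks uniformly at random is 32 × (1 − (1 − 1/32)^32) ≈ 32 × (1 − e^−1) ≈ 20.2, i.e. roughly 63% of the threads are served per access cycle. The ≈ 37% left over per round gives the (0.37 ^ 4) × 32 ≈ 0.6 < 1 bound used above, which is conservative since contention only decreases as threads are served.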

So on the GPU one needs to mask latency by either running many more than 8 threads per SM (i.e. “More concurrent threads”) and/or serially queuing up in the RF many copies of each step of an algorithm before processing the next step for each of those copies (i.e. “More independent instructions (ILP) within a thread”).

There’s no cache coherency on the GPU. Even the GPU which has L1 and L2 cache of main memory (actually L1 may optionally be configured only for caching local memory which is what the RF spills over into), requires a threadfence() for global coherency of writes to main memory. Even communication between warps requires a shuffle paradigm.

2 @keean had mentioned to me the Xeon Phi as being an example of multi-core that failed to be as performant per watt as a GPU and failed to match the modern Intel CPUs on serial performance. But that was based on Larrabee which was originally intended to compete with a GPU yet also remain more generally programmable. The Xeon Phi has hardware coherent cache and out-of-order execution! So it’s not surprising that it hasn’t really solved any major need. It isn’t optimally designed for maximum general purpose programming parallelism (c.f. the Actor proposal in the next post) nor for vector, SIMD programming. And of course isn’t optimal for serial algorithms either. Thus it excels at nothing.

@shelby3
Author

shelby3 commented Jul 25, 2018

Actors as the paradigm for parallelism

(note having very bad whole body inflammatory attack today with a severe headache, forehead on the keyboard fatigue, mental confusion, so please excuse any lapses in my elocution and logic below)

@keean and I had discussed Actors in 2016, for example in the Concurrency issues thread #17.

@keean and I were recently discussing in private messaging the read-only sharing via messages for Actors. Before that I was initially negative about Actors because they aren’t compatible with a shared memory model exposed in PLs and they don’t inherently resolve all concurrency conflicts (but neither does any paradigm)1.

Analogous to comments made by others when comparing the JVM to Erlang’s VM, I was thinking it’s necessary to have shared memory to, for example, support a global UTXO hashmap where the keys are the SHA256 transaction ids and the values are (pointers to) UTXO (i.e. unspent transaction output) records. There’s a related Q & A How does shared memory vs message passing handle large data structures?. There’s also an explanation of the paradigmatic inferiority of shared memory versus message passing. Message passing also amortizes better (c.f. also) over the long-term and at higher scaling. Erlang has proven to be excellent for creating web servers for example.

@keean pointed out why not just make every UTXO record an Actor. I initially thought it would increase the overhead, but as we discussed the details we both ended up realizing there would be no overhead. I contributed the idea that even a hashmap could be modeled with Actors by making each bucket an Actor which contains for example a linked-list of values. So after deriving the bucket address from the key, the message could be sent to the Actor at that address to check for the value in the bucket or add/remove a value to/from the bucket. For example, if the pointer to the linked-list for each bucket is stored in an array and every Actor for those buckets employs the same code for processing an incoming message, then there’s no extra per-bucket storage needed for the Actor. Simply call the Actor’s function with input arguments consisting of the message pointer and the aforementioned bucket address. The objects pointed to for the message and the linked-list are of course immutable. The Actor modifies the linked-list by creating a new copy with the required modifications and then changes the pointer at the bucket address of aforementioned array.

Note when messages will cause the Actor to have side-effects, then the message must be guarded by a mutex lock to prevent Actor function reentrancy. Thus each Actor really should have pure and impure variant of its function. The impure variant blocks on the mutex lock. These mutex locks can be 1 bit for each Actor when the Actors are in an array (because these adjacent Actors allow for a bit field for these mutex lock bits). But note this mutex lock would be required anyway for mutating any shared memory data with multiple threads, so the mutex bit isn’t additional overhead due to Actors.
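A rough Go sketch of the bucket-as-Actor hashmap just described (types and names are illustrative; the per-Actor 1-bit locks described above are modeled with ordinary mutexes, and the read path also takes the lock here only to stay within Go’s memory model):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// Immutable linked list of key/value pairs for one bucket; nodes are
// never mutated after construction, only the bucket's head pointer moves.
type node struct {
	key, val string
	next     *node
}

// One "Actor" per bucket: its only state is the head pointer, and the
// impure message handler is guarded by a lock.
type bucket struct {
	mu   sync.Mutex
	head *node
}

type actorMap struct{ buckets []bucket }

func newActorMap(n int) *actorMap { return &actorMap{buckets: make([]bucket, n)} }

func (m *actorMap) bucketFor(key string) *bucket {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &m.buckets[h.Sum32()%uint32(len(m.buckets))]
}

// put is the impure "message": it builds a new immutable list head and
// swaps the bucket's pointer under the lock. (A real implementation
// would also handle replacing or deleting existing keys.)
func (m *actorMap) put(key, val string) {
	b := m.bucketFor(key)
	b.mu.Lock()
	defer b.mu.Unlock()
	b.head = &node{key: key, val: val, next: b.head}
}

// get is the pure "message": it only walks the immutable list.
func (m *actorMap) get(key string) (string, bool) {
	b := m.bucketFor(key)
	b.mu.Lock()
	head := b.head
	b.mu.Unlock()
	for n := head; n != nil; n = n.next {
		if n.key == key {
			return n.val, true
		}
	}
	return "", false
}

func main() {
	m := newActorMap(64)
	m.put("txid123", "utxo-record")
	fmt.Println(m.get("txid123")) // utxo-record true
}
```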

@keean pointed out that there’s no need to queue the messages and instead employ the mutex lock for blocking. He also pointed out that no need to store a stack pointer (SP) for each Actor because there’s no context-switching going on at the Actor abstraction and each Actor should return to the top-level of its stack when returning from its message processing function. The code sending the message (which may be inside of another Actor’s message processing function) calls the Actor’s function and thus there’s no context-switch required.

In order to simulate a message queue (so that the calling Actor can complete instead of blocking), have the called (impure) Actor queue it and return. The called Actor must have registered itself to be called periodically by a thread pool (or registered as callback such as of another Actor) to process its queue.

The other key insight I had is that these Actors can run atop of green threads. So for example when blocked on the message mutex lock, the green thread can context-switch. @keean initially had the thought that green threads don’t preempt, but I explained (c.f. footnote 1 in my prior post) they can be made to software “preempt” (aka cooperatively preempt) at key check-points where they can block and Actor message passing calls would be one of those check-points.

If the called Actor is only executable by threads running on another core (or group of cores if there is more than one core per software-driven-coherency cache), then green threading can context-switch from the calling Actor (to another Actor on the same core / cache group). Except if the return type of the called Actor is () (aka void in C), then the calling Actor can continue immediately while the sent message is queued by the green threading on the called Actor’s core / cache group (presuming that failed delivery of a sent message raises an exception, although note that message delivery is not assured in the Actor model, and a failure or timeout exception is not definitive because the message could still have been delivered). Actors shouldn’t presume any semantics based only on the timing of the return of calling (i.e. sending a message to) another Actor, because Actors are supposed to have autonomous state machines so that they can tolerate complex dominoes failures due to indeterminacy over the timing of delivery or failed delivery of messages.

Note although we’re describing the model of how Actors work locally on the same CPU, the message passing Actor model also scales across the bus and the network. For example, the UTXO records could be split up across servers in a cluster.

1 There’s no panacea paradigm which will prevent all cases of dead and live locks, i.e. the solution to the dining philosophers problem requires omniscience. Neither the Actor model, immutability, nor Rust’s exclusive mutable borrowing will eliminate all opportunities for dead or live locks. For example, the issuer of a message could be waiting on a response from the recipient of the message in order to complete some operation, yet the recipient of the message could (even if transitively) end up waiting on the issuer before it can complete its operation and reply to the issuer.


In addition to this amazing insight above, I also realized when writing my prior post that compile-time, provably safe, zero overhead allocation will not require the onerous Rust lifetime analysis2 for Actor function code, because these functions always return (to the top level of their stack, thus totally deallocating all non-shared objects) and, per our model, shared data will always be reference counted.

So a simple model is: any object whose address the code takes and passes as a function call argument or as a return value is forced to be allocated on the bump-pointer young object heap. All the other objects get allocated on the stack. The young object heap can be entirely collected by simply resetting the bump pointer when the Actor function returns (to the top level of its stack)! Voila! Amazing paradigm-shift! Note we should also try to minimize the number of duplicate objects allocated on the bump-pointer heap.

AFAICT, although a very simply concept, this last insight of mine is a paradigm-shift (a technological tour de force) of epic proportions in terms of impact! The performance of Rust and C++ without the clusterfucked complexity.

Note thus we mustn’t store, in any reference counted object, any pointer that points into the stack or the bump-pointer heap. And we mustn’t store, in any object on the stack or the bump-pointer heap, any pointer into a reference counted object that can have its reference count decremented by that Actor before the Actor function returns. For example, a reference counted object whose reference is created by a function would decrement the reference count when the function returns, unless that reference is returned to the caller of the function.
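A toy Go model of this bump-pointer young heap (my illustration of the scheme described above, not a real runtime): escaping objects are bump-allocated, and the entire young heap is freed by resetting one index when the Actor’s message function returns.

```go
package main

import "fmt"

// youngHeap is a per-Actor bump-pointer heap for objects whose
// addresses escape; everything else lives on the stack.
type youngHeap struct {
	buf []byte
	top int
}

func (h *youngHeap) alloc(size int) []byte {
	if h.top+size > len(h.buf) {
		panic("young heap exhausted (toy model has a fixed size)")
	}
	p := h.buf[h.top : h.top+size]
	h.top += size
	return p
}

// reset is the entire "garbage collection" for non-shared objects:
// it runs when the Actor's message-processing function returns.
func (h *youngHeap) reset() { h.top = 0 }

// handleMessage stands in for an Actor's message-processing function.
func handleMessage(h *youngHeap, msg []byte) {
	defer h.reset()              // total deallocation on return
	scratch := h.alloc(len(msg)) // an "escaping" object, bump-allocated
	copy(scratch, msg)
	fmt.Println("processed", len(scratch), "bytes; heap top =", h.top)
}

func main() {
	h := &youngHeap{buf: make([]byte, 1<<16)}
	handleMessage(h, []byte("hello"))
	fmt.Println("after return, heap top =", h.top) // 0: reset, zero-cost collection
}
```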

2 Which is especially undesirable not just because of the tsuris but because Rust can’t model all (c.f. also) forms of lifetimes and thus is inflexible and incomplete.


Someone commented and I replied:

that all sounds very Erlang-ish, but I guess they are missing your type system constraints

Erlang has process-level context switches, which are preemptive (although apparently also partly cooperative) but carry very heavyweight overhead. Thus it can’t eliminate the L3 cache for scaling multi-core as I proposed in the prior post. Also AFAIK Erlang doesn’t guarantee the queue-less message passing that @keean and I devised, which is really critical to lowering the overhead of Actors to zero as I explained, and also critical for implementing software cache coherency as I explained in the prior post. Also Erlang does not have the stack allocation I proposed; instead it has automated generational GC, which I have pointed out is incompatible with multi-core scaling. Yes, Erlang and Actors are the inspiration, but we have proposed significant innovations, and yes the Erlang PL also lacks our ideas for a type system and other features we’d like to have for a complete PL, such as modules. Also I think that with Zer0 I will likely first target Go as the compile target and VM. Although in this case the reference-counted objects (as declared in the code) would actually be garbage collected by Go’s GC on the Go compile target. The important point is that Zer0 can be compiled to more performant code in the future and will support a multi-core scalable future.

Note by targeting Go initially and cooperative preemption (instead of Erlang BEAM VM’s process-level preemption which is apparently also partially cooperative), we could as compared to Erlang (c.f. also) better integrate SIMD and other assembly-level optimizations without corrupting the runtime as NIFs can do in Erlang. We’d still need to compile cooperative preemption into the SIMD code (c.f. footnote 1 in my prior post).
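To illustrate what compiling cooperative preemption into long-running numeric code would mean, here is a hedged Go sketch: the compiler would emit a periodic yield point inside the hot loop; here it is written by hand as a runtime.Gosched call every 4096 iterations, an arbitrary interval chosen only for illustration (recent Go runtimes preempt asynchronously anyway, so this only shows the shape of the transform).

```go
package main

import (
	"fmt"
	"runtime"
)

// dotProduct is a stand-in for a long-running numeric (SIMD-style) kernel.
// The compiler would insert the yield check; here it is written by hand.
func dotProduct(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
		if i&4095 == 0 {
			runtime.Gosched() // cooperative preemption point
		}
	}
	return sum
}

func main() {
	n := 1 << 20
	a := make([]float64, n)
	b := make([]float64, n)
	for i := range a {
		a[i], b[i] = 1, 2
	}
	fmt.Println(dotProduct(a, b)) // 2 * 2^20
}
```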

The Actor-like design contemplated in this post would rectify the following complaints about Erlang:

One thing Erlang isn't really good at: processing big blocks of data.

He means things like decoding mpeg data. There is too much numerical computation which Erlang is not optimized for. If the processing just involves shuffling big blocks of data from one place to another, then Erlang is quite good at it. (Files to TCP Sockets, etc)

You can't update shared blocks of data (there are no pointers in Erlang) and hence data must be shuttled across processes which in turn translate to inefficiencies.

@shelby3
Copy link
Author

shelby3 commented Jul 31, 2018

In response to the Quora question What programming languages do software engineers use in 2018?, this answer is more evidence that the proposed Actor model in this thread is the future:

When Salesforce bought Mulesoft for $6.5 billion, every programmer should have sat up and taken notice. This event adds to other recent API-related developments, especially from IBM, that clearly signal that the new programming paradigm is based on APIs and microservices.

Websites are transitioning from standalone collections of web pages to dynamic microservices hosted in the cloud. This simply means that what used to be a website is now a collection of independent programs running in the cloud. These independent programs “talk” with each other and with other services in the cloud via APIs.

For example, the veterinarian where I take my dog has a traditional static website with all the usual information about hours, personnel, specializations, etc. There are links to other veterinary websites where I can create an account and buy dog food, medications, toys, etc.

Now, with the proliferation of APIs, this architecture is transformed: the formerly static website interfaces directly with other websites using APIs, so the partitioning among websites becomes transparent. Soon I will go to my veterinarian’s “website” and all services and products will be available in one place using a single domain name and a single customer account. There will no longer be hyperlinking among individual, standalone websites (at least insofar as the visitor can tell). This integration of the WWW is happening really, really fast.

@sighoya
Copy link

sighoya commented Jul 31, 2018

This might interest you.

@shelby3
Copy link
Author

shelby3 commented Sep 18, 2018

@keean remember I argued that we’d see an endless stream of speculation execution vulnerabilities. Here’s a new vulnerability:

https://www.zdnet.com/article/beyond-spectre-foreshadow-a-new-intel-security-problem/

@keean
Copy link
Owner

keean commented Sep 18, 2018

@shelby3 The problem is not speculative execution itself, the problem is that the engineers assumed because the results would be thrown away, they did not have to carry out all the security checks on the speculative branch.

I don't think AMD are vulnerable to these new 'Foreshadow' attacks.

@jdh30
Copy link

jdh30 commented Oct 19, 2018

@shelby3 "there’s no bump-pointer heap that I know of that didn’t have the integration overhead of being part of a mark-and-sweep GC scheme"

Cheney semi-space?

@shelby3
Copy link
Author

shelby3 commented Oct 21, 2018

@jdh30 my proposal is that the objects which are shared between actors (and thus can be longer lived) must be ARC. Thus the bump-pointer heap is only applicable to non-shared objects, which presumably are shorter-lived or at least reasonably well correlated w.r.t. lifetimes even if longer lived. If either of those two hypotheses (the young generational or at least correlated generation) is violated, then we need alternatives for the programmer to work around the inefficiency. The entire (local) bump-pointer heap is reclaimed by resetting the bump pointer at the termination of the actor function which processes an incoming message. Thus there’s no need for mark-and-sweep, because there are no objects remaining in the heap, because the collection cycle is upon termination of the processing of the incoming message.

I have also augmented my original proposal by acknowledging we probably also need the ability to instance region-based memory allocation (aka the Arena pattern) each of which are separately bump-pointer heap allocated. Thus a region is only ever freed all at once with no mark-and-sweep. This increases efficiency for use cases that are compatible with such a model.
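A sketch of how such regions might look, assuming an explicit open/free API keyed by a region id (all names here are illustrative): unlike the per-message heap, a region’s lifetime is chosen by the programmer and the whole region is dropped in O(1) with no sweep.

```go
package main

import "fmt"

// arena is a region that is only ever freed all at once (no sweep).
type arena struct {
	buf  []byte
	used int
}

func (a *arena) alloc(n int) []byte {
	if a.used+n > len(a.buf) {
		return nil // a real implementation might chain another block
	}
	p := a.buf[a.used : a.used+n : a.used+n]
	a.used += n
	return p
}

// regions holds separately instanced arenas keyed by a region id, so each can
// have its own lifetime independent of any single message handler.
var regions = map[string]*arena{}

func openRegion(id string, capacity int) { regions[id] = &arena{buf: make([]byte, capacity)} }

func freeRegion(id string) { delete(regions, id) } // whole region dropped at once

func main() {
	openRegion("parser", 1<<16)
	openRegion("cache", 1<<16)

	tok := regions["parser"].alloc(32) // lives only as long as the "parser" region
	_ = regions["cache"].alloc(128)    // survives after "parser" is freed

	fmt.Println("parser bytes used:", regions["parser"].used, "token len:", len(tok))
	freeRegion("parser") // O(1) reclamation, no mark, no sweep
	fmt.Println("cache bytes used:", regions["cache"].used)
}
```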

I posit that these proposed models in addition to avoiding thread synchronization of collection cycles may cope better in some use cases (because they provide alternative assumptions about duration and cordoning of the generation) with the issue you mentioned in your blog as quoted below about violations of the young generational hypothesis assumption (which you’ve also mentioned can be sometimes ameliorated by eliminating superfluous boxing and allocations):

Generational collection works well when the generational hypothesis holds but struggles when values survive the nursery only to become unreachable soon afterwards. This corresponds to common allocation patterns such as cycling values through mutable queues or caches and filling hash tables.

Imagine repeatedly enqueuing and dequeuing values on a queue. The lifetimes of the values are proportional to the length of the queue. Thus, this provides a simple way to quantify the performance overhead of generational garbage collection. If boxed values are enqueued and dequeued on OCaml's built-in mutable Queue data structure then the time taken per element jumps by around a factor of 2-3.5 when the elements reachable from the queue exceed the size of the nursery and, thus, most survive to the old generation rather than being collected efficiently in the young generation.

You noted in another of your blogs that timely young generation collection cycles are important for maximum performance of (what you refer to as ‘symbolic’1) code that produces a lot of young garbage. So my proposed model above would require some intervention by the programmer to subdivide his algorithm between actors such that collection cycles are frequent enough to maximize performance. But this would conflate concerns, because the Pony capabilities actor model is also about data-race-safe threaded sharing. Thus the need for regions, and I think we also need to add the option of regioning by function scope in addition to the proposed regioning by region id.

Will the programmer be capable of this static partitioning of allocation (i.e. is it a deterministic phenomenon), or is AMM needed to stochastically solve a non-deterministic phenomenon? My thought is that although program demands can be non-deterministic, the algorithms can sometimes be broken down into relatively deterministic parts that are combined non-deterministically. This will depend on whether the algorithm can efficiently and deterministically identify which objects it wants to copy out of the short-lived, frequently collected bump-pointer allocated region. For algorithms that can’t, mark-and-sweep is perhaps required to maximize performance. For n-Queens it seems all the young garbage can be eliminated by employing a mutable array instead of creating temporary copies, and then zeroing out the rows when backtracking.
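To make the n-Queens remark concrete, here is a sketch in Go of that allocation-free backtracking style: a single mutable array of column positions is reused, and entries are cleared when backtracking, so no per-candidate copies (young garbage) are produced.

```go
package main

import "fmt"

// solveNQueens counts solutions using a single mutable board (cols[r] is the
// column of the queen on row r), clearing entries on backtrack instead of
// allocating a fresh candidate board at every step.
func solveNQueens(n int) int {
	cols := make([]int, n)
	for i := range cols {
		cols[i] = -1
	}
	var place func(row int) int
	place = func(row int) int {
		if row == n {
			return 1
		}
		count := 0
		for c := 0; c < n; c++ {
			ok := true
			for r := 0; r < row; r++ {
				// same column or same diagonal means the placement conflicts
				if cols[r] == c || row-r == c-cols[r] || row-r == cols[r]-c {
					ok = false
					break
				}
			}
			if ok {
				cols[row] = c
				count += place(row + 1)
				cols[row] = -1 // "zero out" the row when backtracking
			}
		}
		return count
	}
	return place(0)
}

func main() {
	fmt.Println(solveNQueens(8)) // 92
}
```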

Again I want to reiterate that fully AMM seems to not scale to massively multicore which is where the industry is rapidly headed now. The issue is more complicated than just stopping-the-world or not, as I explained up-thread (kudos to @keean for the initial insight) it also involves scaling the CPU’s cache coherence overhead. Pony’s local mark-and-don’t-sweep, global reference counting via message passing AMM model seems to have performance issues. So it seems we need a different model. I’m trying to figure out what the model should be. Note the cache coherency issue is mostly due to sharing between threads, which is mostly resolved by Actors, yet may also have interplay with the GC strategy.

1 Presumably this means immutable objects and pure functional equational reasoning a la Haskell?

@shelby3
Copy link
Author

shelby3 commented Nov 14, 2018

Sorting out which hardware I will buy as I need to upgrade my i7-4770 desktop and prepare to be portable and mobile (will be relocating and need to diversify my work venues so I’m not always stuck inside my cave, such as coding at the beach or in a cafe):

Parallelization Obsoletes the Desktop Computer

@keean
Copy link
Owner

keean commented Nov 14, 2018

@shelby3 I think the zenbook Pro is nice, although a 15" zenbook Pro might be better for extended use.

I agree that it makes more sense to rent CPU time, as either cloud instances (like AWS EC2), or even server-less compute time (like AWS Lambda). I normally keep source code in a remote repository too, and use cloud office software (Office 365 or Google docs). I try to keep nothing of value on the computer, so if it gets lost, stolen, or just breaks I don't lose too much.

However I think a decent CPU and GPU are useful, as edge computing is definitely also a factor, because mobile devices are the most used devices to access online services, as such you want to support offline use as you are not always connected to the network. This doesn't have to be the complete functionality, but should allow people to get work done offline. For example edit a document you previously cached locally offline. For big processes you transfer them to a cloud cluster where they can run the background and notify you when complete.

As a general rule I would run interactive processes locally, and queue batch jobs to run remotely, as well as store shared/long-lived data remotely.

@shelby3
Copy link
Author

shelby3 commented Nov 15, 2018

@keean I don’t need that screen inside the touchpad on the Pro. I would prefer not to have the numeric pad (or in the touchpad instead of¹) on the keyboard. The Pro isn’t yet refreshed for Whiskey Lake and I don’t think I want a hexa-core processor with a 35 TDP in a laptop. So I’m currently thinking the recently refreshed Zenbook classic series, although I really don’t need a GPU so I might just go for a cheaper Vivobook in anticipation of replacing the laptop again in 18 months. The refreshed Vivobook S only has a 42Wh battery though. Frustrating that these laptop manufacturers hold back certain features to force us to buy their most expensive model which has other features we don’t need. I don’t want to buy a Dell because my two experiences with Dell laptops were lemons. And I prefer to buy only in a store where I can test the unit before committing to purchase. The latest generation Thinkpads just aren’t the quality they used to be and also there’s none available retail in the Philippines. I’m in ASUS land.

I agree we still need some compute horsepower locally. My blog is both addressing the current situation and also the future. The eventual future is ubiquitous connectivity. And I think eventually basically nothing will run efficiently locally on the compute resources an individual could afford to own exclusively. There will be peak demand spikes because for example complex AI and data analysis will be incorporated into software. Our programming compilers might become much more complex and do complex (even whole program) analysis.

My basic point for the near-term is that for many people it doesn’t pay to sacrifice mobility by trying to always cart around enough compute resources for every possible software scenario. An eGPU can have much more horsepower and you can cart that around when you really need it, otherwise leave it on your desktop. For us, if we need to run a multi-threaded compiler then we can have a ThreadRipper on our desktop, which is portable but not mobile. We can connect to it from our laptop when we dock.

Some current workstation and gaming users have needs that can be met with a laptop (e.g. Photoshop certified or AutoCAD certified), but I think the parallelization phenomenon is going to unleash software that simply won’t run on any laptop form factor.

Also mobility has become a focus for me lately not only because I need to travel, but because being stuck inside 14 x 7 is taking its toll on my health (at age 53.5). I need to get outdoors more both for my eyes (the blue light of the monitor may impact sleep patterns and hormones as well) and also to get daily sun exposure for vitamin D. Also for mental health.

Btw, I read today that Intel’s original 10nm process would have been transistor density equivalent to AMD’s 7nm. Apparently though Intel is scaling back the sophistication of their 10nm process so that they can get sufficient yield to ship by late 2019. Thus it may not end up actually equivalent.

¹ Although I saw a video which points out that the numeric pad in the touchpad doesn’t work that well because it can’t keep up with very fast typing.


Changing topic a bit, it seems with recent announcements that AMD is moving towards a design in which sharing data between cores (or at least between 8-core chiplets) will have more latency and cost (Infinity Fabric will remain on 14nm for the time being). Perhaps we should deprioritize being able to avoid copying when sending data between Actors, as that will become a nearly useless optimization in the future. This design point may impact the GC issue we have been discussing in the other thread.

Also I read the new Whiskey Lake Intel mobile CPUs have some hardware fixes for Meltdown. I may opt for the i7-8565 with a 15W TDP. Seems to be the sweet spot that retains battery life, mobility and which neither Intel nor AMD will be able to significantly improve upon for at least a year. Will eventually probably also build a ~10L case Ryzen or Threadripper for more local compute resources when I’m at my desk.

https://fuse.wikichip.org/news/1815/amd-discloses-initial-zen-2-details/

http://www.tomshardware.com/answers/id-3824835/noiseless-cooling-system-amd.html

@shelby3
Copy link
Author

shelby3 commented Nov 22, 2018

@jdh30 wrote:

@shelby3 wrote:

there’s no bump-pointer heap that I know of that didn’t have the integration overhead of being part of a mark-and-sweep GC scheme

Cheney semi-space?

I misapplied the terminology. My intention was to state that I know of no bump-pointer heap that isn’t integrated with some form of tracing garbage collection. I conflated the generalized category of a tracing garbage collector with the mark-and-sweep variant of said collector.


Mark-and-don’t-sweep

As I understand it, the mark-and-don’t-sweep variant differs from mark-and-sweep in that only reachable objects are ever visited (i.e. marked) by the collector. The allocator allocates from the memory fragments marked unreachable, changing their mark to reachable as it goes; tracing and marking only occurs once every memory fragment in free space has been marked reachable in this way. By then, some of the objects marked reachable may have become unreachable. So the meaning of the mark bit is flipped to mean unreachable (i.e. free space), and before proceeding, tracing+marking flips the actual bit (not its meaning) for those objects that are still reachable.

Pony’s mark-and-don’t-sweep stores the mark bits in a separate list instead of with the objects (though tracing of course still has to visit the reachable objects in order to follow the references they store). I suppose this has the benefit that tracing and marking doesn’t have to wait until all the memory fragments in free space have been marked as reachable, because the collector will not need to visit all the objects marked unreachable in order to flip those mark bits to reachable before the aforementioned flipping of the meaning of the mark bit.

Related to @jdh30’s point about the importance of cache locality, I also read on pg. 3 of Danko Basch et al, “Mark without much Sweep Algorithm for Garbage Collection”:

Lazy sweep can also reduce the overall GC time due to less coalescing of free objects. It also improves locality because the swept area is small and will probably be immediately used by an allocator and mutator (user program is called mutator because it mutates the connections between objects)

Tri-color abstraction

For readers who are learning about tracing garbage collection, I find the equivalence of the tri-color abstraction easy to understand in the context of Cheney’s algorithm. Cheney’s algorithm is an improvement of the naive algorithm because it reduces the overhead for updating the addresses for references to reachable (i.e. grey or black) objects when they are moved (i.e. copied before from-space is deallocated) to the to-space.
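Here is a minimal Go sketch of Cheney’s algorithm for readers who want to see the tri-color roles fall out of it. It assumes a toy object model (nodes whose fields are all references) and simulates the to-space with a worklist slice rather than real contiguous semi-spaces: entries before the scan index are black, entries at or after it are grey, and anything never copied remains white (garbage); the forwarding pointer is what avoids the naive address-fixup pass.

```go
package main

import "fmt"

type node struct {
	value   int
	fields  []*node // child references
	forward *node   // forwarding pointer once copied to to-space
}

type collector struct {
	toSpace []*node // [0,scan) is black; [scan,len) is grey
	scan    int
}

// copyNode copies a from-space node into to-space (making it grey) unless it
// was already copied, in which case the forwarding pointer is returned.
func (c *collector) copyNode(n *node) *node {
	if n == nil {
		return nil
	}
	if n.forward != nil {
		return n.forward
	}
	dup := &node{value: n.value, fields: append([]*node(nil), n.fields...)}
	n.forward = dup
	c.toSpace = append(c.toSpace, dup) // grey: copied, fields not yet updated
	return dup
}

// collect copies everything reachable from roots, rewriting references as it
// scans, and returns the new roots. Unreached (white) nodes are garbage.
func collect(roots []*node) []*node {
	c := &collector{}
	newRoots := make([]*node, len(roots))
	for i, r := range roots {
		newRoots[i] = c.copyNode(r)
	}
	// Cheney scan: blacken grey nodes by copying their children and
	// updating the fields to point at to-space copies.
	for c.scan < len(c.toSpace) {
		n := c.toSpace[c.scan]
		for i, f := range n.fields {
			n.fields[i] = c.copyNode(f)
		}
		c.scan++
	}
	return newRoots
}

func main() {
	leaf := &node{value: 3}
	root := &node{value: 1, fields: []*node{{value: 2, fields: []*node{leaf}}}}
	garbage := &node{value: 99}
	_ = garbage // unreachable from roots, so never copied
	newRoots := collect([]*node{root})
	fmt.Println(newRoots[0].value, newRoots[0].fields[0].value, newRoots[0].fields[0].fields[0].value) // 1 2 3
}
```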

@shelby3
Copy link
Author

shelby3 commented Nov 28, 2018

Tying all our discussions into a summary and some additional insights:

Message-based concurrency (and parallelism)

Massively multicore + shared memory doesn’t scale

@shelby3
Copy link
Author

shelby3 commented Dec 3, 2018

I wrote on the blog Actors are not a good concurrency model:

The blog post is correct, but as others have noted, for CSP-like channels abstracted as functions whose return values are futures, it’s possible to make them composable. And of course these can be declared pure if they don’t have any side-effects.

Also I agree with @‍Ulf Wiger that composability exists on many levels of semantics in a program. Low-level referential transparency isn’t everything about composability. The vaunted equational reasoning of Haskell is a huge price to pay for making so [many] other facets of programming arguably worse.

@‍Chris Quenelle, willy-nilly threading breaks multicore scalability:

https://gitlab.com/shelby/Zer0/issues/11#sharing-doesnt-scale

And we will have 64 core (128 hyperthreads) AMD processors in 2019.

@‍martin, threading synchronization destroys multicore scaling due to Amdahl’s law.

I forgot to point out explicitly that side-effects occur even in referentially transparent programs, as I explained elsewhere:

Although avoiding locking by sharing only immutable state doesn’t prevent higher-order semantic bugs including deadlocks and livelocks, because mutable state dependencies can be modeled in other ways, there are still multiple benefits gained by ensuring first-level data race safety.
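Regarding the claim quoted above that CSP-like channels abstracted as functions returning futures can be made composable, here is a hedged Go sketch (future, asyncSquare and thenAdd are illustrative names): each call hides its channel behind a function and returns a receive-only channel as the future, and composition is just a function from future to future.

```go
package main

import "fmt"

// future is a single-value receive-only channel standing in for a future.
type future = <-chan int

// asyncSquare hands the request to its own goroutine (the channel behind the
// function) and immediately returns a future for the result.
func asyncSquare(x int) future {
	out := make(chan int, 1)
	go func() { out <- x * x }()
	return out
}

// thenAdd composes one future into another without blocking the caller.
func thenAdd(f future, y int) future {
	out := make(chan int, 1)
	go func() {
		v := <-f
		out <- v + y
	}()
	return out
}

func main() {
	// (3^2) + 5, expressed as a pipeline of composed futures.
	result := thenAdd(asyncSquare(3), 5)
	fmt.Println(<-result) // 14
}
```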

@shelby3
Copy link
Author

shelby3 commented Dec 25, 2018

Remember it was ESR’s blog about SICK algorithms that prompted the discussion that led to me writing the OP of this thread, and now he @eric-s-raymond has blogged again about his skepticism about exploitable parallelism in software.

And again I am going to disagree with some (not all) of his reasoning (and respectfully with @keean if @keean hasn’t yet come around to my perspective).

I will respond to some specific excerpts from his blog as follows.

In the terms of an earlier post on semantic locality, parallel methods seem to be applicable mainly when the data has good locality. And they run best on hardware which – like the systolic-array processors at the heart of GPUs – is designed to support only “near” communication, between close-by elements.

By contrast, writing software that does effective divide-and-conquer for input with bad locality on a collection of general-purpose (Von Neumann architecture) computers is notoriously difficult.

We can sum this up with a heuristic: Your odds of being able to apply parallel-computing techniques to a problem are inversely proportional to the degree of irreducible semantic nonlocality in your input data.

Agreed. Parallelism requires the ability to subdivide the data space.

Another limit on parallel computing is that some important algorithms can’t be parallelized at all – provably so. In the blog post where I first explored this territory I coined the term “SICK algorithm”, with the SICK expanded to “Serial, Intrinscally – Cope, Kiddo!” Important examples include but are not limited to: Dijkstra’s n-least-paths algorithm; cycle detection in directed graphs (with implications for 3-SAT solvers); depth first search; computing the nth term in a cryptographic hash chain; network-flow optimization.

Bad locality in the input data is implicated here, too, especially in graph- and tree-structure contexts. Cryptographic hash chains can’t be parallelized because their entries have to be computed in strict time order – a strictness which is actually important for validating the chain against tampering.

There’s a blocking rule here: You can’t parallelize if a SICK algorithm is in the way.

The crux of my rebuttal has been and remains that the quoted is a narrow-minded framing of a perspective on what is actually blocking parallelism.

We as creative humans who adapt to our resource environment will necessarily lift our paradigms such that we overcome those SICK limitations. Let me give an example to illustrate how this is so.

For example, someone refuted Eric on Dijkstra’s shortest path. Who would have thought that Pi would by now be mathematically proven to be computable in parallel, or recall “The Law of 1998, Why Would Anyone Ever Need 16GB of RAM?”.

Let’s take the necessarily sequential chaining of a cryptographic hash example since reinventing blockchains is my current area of focused research other than my study of programming language design. We subdivide the blockchain into shards instead.

Additionally there’s a more fundamental myopia that plagues this focus on deterministic SICK algorithms— our universe isn’t deterministic. Exact deterministic solutions become less meaningful than stochastic approximations as the context of their application broadens beyond a deterministic partial order to a stochastic reality. Which will become more and more the case in this interconnected world where our applications are not just self-contained in the deterministic space of our personal desktop.

Humans will paradigm-shift the problem set as required. That doesn’t mean there won’t still be legacy paradigms that weren’t well thought out in terms of the new realities about where Moore’s law is still advancing.

I haven’t studied Eric’s Reposurgeon project to know if I would have any insights on whether it’s possible to paradigm-shift that legacy data structure transform. Apparently back references into the DAG are random and deep, so I suppose perhaps some batching, grouping algorithm in order to generate data locality could be contemplated. EDIT: one of the comments on Eric’s blog also had a similar idea.

But I also don’t understand why his DAG couldn’t be subdivided and then have the divisions communicate via message passing. Is the overhead of message passing greater than the performance speedup of parallelizing on multicore? Does the transform algorithm have to run sequentially or can there be parallel transformation pipelining? So as I said, I would need to study his problem set in detail before I could comment intelligently. Even if message passing would work, that wouldn’t make Python’s multi-process model appropriate because message passing between processes has more overhead than between concurrent green threads. Python processes run effectively single-threaded because of the GIL.

Nevertheless my disagreement with his overly pessimistic skepticism remains.

We’re not done. There are at least two other classes of blocker that you will frequently hit.

One is not having the right tools.

Eric seems to not know about the Pony language model.

  1. For most desktop/laptop users the only seriously parallel computing that ever takes place on their computers is in their graphics chips.

Because programmers have not paradigm-shifted yet. How many programmers even know the Pony language model exists? And also Pony has some flaws which need to be rectified, which I have been writing about. This is why I’m contemplating creating the Co programming language (recently renamed from the original Zer0 name).

UPDATE: A commenter on G+ points out that one interesting use case for multicores is compiling code really quickly. Source for a language like C has good locality – it can be compiled in well-separated units (source files) into object files that are later joined by a linker.

I have been saying this for many months also around here.

EDIT: I am not disagreeing with the point that such massively multicore CPUs will be specialized. In this thread for example, I noted that with the Pony model we could in the future have software cache coherency because extant hardware cache coherency is an Amdahl’s law limitation on scaling massively multicore. But note the hardware model remains a general purpose stack machine and not a GPU-like register-file, vector machine. I have stated in this thread we will likely only need a few general purpose CPUs heavily designed for optimizing serial general purpose algorithms.

EDIT#2: Eric is apparently now understanding that games are one example where a massively multicore CPU is utilized (and not just the GPU), and I agree with his statement (quoted below) that the lack of tools (especially the programming language we need, c.f. also and also) has been holding programmers back from making the transition to massively multicore:

The languages/toolchains in which these games are authored in just aren’t sophisticated enough to support serious concurrency on the general-purpose CPUs.


META: Tangentially I explain herein why I am unable to respond directly on Eric’s blog. That is because he banned me. Apology in advance to @keean for needing to insert this here, but if Eric is going to read this then I want him to know my opinion that he needs to be hit upside the head with a metaphorical baseball bat to awaken him from whatever hole he crawled into about the realities of his juvenile banning shit.

Wouldn’t it be more informational if Eric had the courage to face me on a neutral discussion forum like this, where he has to actually win all the arguments and not cut me off in the middle of an argument with his censorship? For example for him to try to refute the facts (c.f. also and also). As I remember, it all sort of began circa 2011 when I tried on the 10th year anniversary of 9/11 to do some justice — to for example my heritage as a descendant of Isaac Shelby who fought for the freedoms Eric now enjoys in our country — on his 9/11 blog about what really happened on that day. Some of his regulars were vocal (and condescending, brow-beating, ridiculing) about how offended they are by the notion that 9/11 could have been an inside job. Which I guess caused Eric to dislike me. Eric refers to such facts as babbling nonsense. I am also a U.S. citizen and I refuse to tolerate the corruption of our country. And I am dismayed that he is unwilling to host the necessary transparency. Mr. open source and all. My study of 9/11 is much more solid by now than it was in 2011. In 2011, I didn’t have the knowledge to concisely refute the bellicose interactions on his blog. Eric might think his bans are justified because of those who he thinks derail with off-topic discussion, but his regulars also do this quite often and he allows it. And the discussion I contributed was also a natural progression from, or fundamentally related to, the thread subject (but according to Eric’s ignorance, it was all babbling nonsense anyway).

I wonder if ESR has ulterior motives or conflicts-of-interest. Does he need to attract more donations to his Patreon by grandstanding to the anti-semitism political correctness totalitarianism? That doesn’t mean I hate Jews! Oh yeah, I know he talks a good talk about his min-anarchist ideology, but he also seems to love politics, grandstanding, and censorship. On the one hand, one can have some appreciation, love and respect for his contributions and talents, but then be entirely frustrated by his Herculean ego and nastiness when it manifests in banning discussion and the written venom he spews out towards people whose opinions he disagrees with (e.g. towards Jeff Read). Btw, I usually ideologically agree with Eric when he ridicules Jeff, but I cringe at the example he sets in terms of etiquette and discourse. Especially when it is political and ideologically based discussion (Jeff being an Apple and walled gardens fanboy). At least when Linus does it in email, it is nearly always ridicule about poor quality code which is not an ad hominem attack but rather just being frank about low quality work. That’s a Southern gentleman issue. In the South, speaking to someone the way Eric does should be resolved with a duel (and no this isn’t a veiled threat). You just don’t treat people that way even if you are correct. He would retort that noise isn’t signal, but then he must refute those facts about 9/11 I linked above which he has been censoring (as one example of the myriad of signal he censors). That is a way of saying I am a person who habitually forgives and forgets to work forward hopefully with harmony. But for some reason that guy can’t tolerate me. Maybe because I’m a Southerner and he is a city slicker Northerner? Every several months or so, I wander over to his blog to see if I can learn anything important. Sometimes I find blogs within my area of knowledge and I want to offer some comments. I am not stalking the man. Cripes, I have more important things to do. Now Freeman Dyson, that is a man I can respect. And sadly he is approaching centenarian age so may not be with us that much longer. Amazing combination of intelligence and also devoid of outwardly nasty ego. Maybe because Freeman doesn’t even put himself in those situations in the first place. After all, Freeman doesn’t need a blog of sycophant regulars to validate his writing. Freeman is busy inside his head and talking with other scientists. As I am busy coding doing my own thing. Anyway for the record, that is why I can’t post my comments on Eric’s blog and instead do so here.

P.S. while the sycophants play with tinkertoys, the plot thickens.

@shelby3
Copy link
Author

shelby3 commented Jan 12, 2019

As I postulated up-thread, moving towards software driven cache coherency and minimizing the moves of data between cores:

http://www.rexcomputing.com/

@keean
Copy link
Owner

keean commented Jan 12, 2019

@shelby3 it's great to see people trying to make progress in this area, however I am not hopeful for success. Cache is very regular (that is, it is easy to construct the blocks from hardware 'macros' when designing the chip), so in terms of hardware design effort it is cheap compared to complex parts like instruction decode, ALUs and control logic, whilst at the same time producing a very large increase in performance (10 - 1000x). In my experience people want CPU performance more than they want battery life, otherwise they could have a laptop that lasts days by using low power ARM cores. Also compatibility with existing software is important. Look at failures like Transmeta for example, and that nobody uses x86 emulation on Arm.

You have to consider whether Microsoft's Windows ARM laptop will succeed (probably not) because it cannot run x86 windows software, compared to a chrome book, that can be viewed as a big phone with a keyboard, and can run all android software.

@shelby3
Copy link
Author

shelby3 commented Jan 13, 2019

Your argument is only true for some markets.

Laptop != mobile. Mobile people typically want (at least a day’s) battery life before prioritizing performance. Business laptop (not gamer) people typically want (at least a few hours) battery life before prioritizing performance. ARM is killing Intel on mobile. ARM is preparing to make a run at the server market.

Rex Computing is I believe targeting specialized supercomputers.

@shelby3
Copy link
Author

shelby3 commented Jan 13, 2019

Rome is not built in a single day. Intel and hardware cache coherency will die in peripheral markets before finally succumbing in other mainstream computing.

@keean
Copy link
Owner

keean commented Jan 13, 2019

@shelby3 Didn't work for the PlayStation 2, which had a non-cache architecture, but the PS3 (and PS4) went back to a traditional architecture because it was too difficult to program for the PS2.

@shelby3
Copy link
Author

shelby3 commented Jan 13, 2019

We need a PL that supports the model and is popular. Rome is not built in a day. We are actively developing Co now.

Also the PS2 did not have enough cores to make it worthwhile. When the s/w cache systems have 20X more cores per watt (and thus per TDP limitation), the incentive increases.

@keean
Copy link
Owner

keean commented Jan 13, 2019

@shelby3 software control of scratch memory is a pain. The problem is that it makes it the compiler's job to control the cache, but the compiler cannot anticipate runtime scheduling. It may save power, but it won't work anywhere near as well as a hardware cache that can account for the actual runtime scheduling of instructions and interrupts.

@shelby3
Copy link
Author

shelby3 commented Jan 13, 2019

@keean did you forget my upthread point that with “Actors” (actually CSP), the cache coherency between cores can be managed in s/w at the juncture of the sending of messages? Removing that h/w responsibility will save power, and with the right PL it will not be hard to program.

Again I reiterate that this requires thinking about global memory as partitioned into “Actors”.

I agree with you that cache set associativity and load/store (i.e. not coherency between cores) still have to be managed in h/w because that is very difficult to manage in s/w generally. But that is not where all the power consumption gains are coming from. Did you fail to read, remember, or assimilate the post I linked upthread:

Massively multicore + shared memory doesn’t scale

That clearly explains that partitioning the global memory for exclusive access to some core(s) drastically reduces the power consumption:

Analogous to how Internet connectivity is much slower than memory bus connectivity, core interconnectivity bandwidth (and latency) won’t scale proportionally with the increase in cores. So software has to written to minimize unnecessary communication between cores so that computing hardware can be designed for the distributed memory variant of MIMD. Just as with the design of the client-server paradigm, multicore software must be designed to minimize distant (i.e. non-local, not cached) memory resource sharing by keeping needed data close to the core that will work with it. For example, the power consumption is ~6X greater (2 vs. 11 pJ/b) for inter-chip versus on-chip Infinity Fabric interconnections— a ratio that will worsen because […]

@shelby3
Copy link
Author

shelby3 commented Jan 13, 2019

I emailed Rex Computing asking them to read my prior post.

@shelby3
Copy link
Author

shelby3 commented Jan 13, 2019

I agree with you that cache set associativity and load/store (i.e. not coherency between cores) still have to be managed in h/w because that is very difficult to manage in s/w generally. But that is not where all the power consumption gains are coming from.

Apparently compiler driven methods are approaching h/w controlled methods:

https://ece.umd.edu/~barua/udayakumaran-TECS-2006.pdf

Note that paper seems to support the fact that most of the power consumption gains are not from eliminating the h/w mgmt of the cache:

Studies [Banakar et al. 2002] have shown that scratch-pad’s use 34% lesser area, consume 40% lower power than a cache of the same capacity. These savings are significant since the on-chip cache typically consumes 25-50% of the processor’s area and energy consumption, a fraction that is increasing with time [Banakar et al. 2002].

Unfortunately I can’t read the Rex Computing paper as it is behind a paywall and I refuse to pay.

@shelby3
Copy link
Author

shelby3 commented May 18, 2020

@shelby3
Copy link
Author

shelby3 commented Sep 15, 2020

Massively multicore has arrived…

From private message:

I am not sure what you are refuting, these new ARM chips have a speculative out-of-order architecture. What they don't have is hyperthreading, but that is just a way to try and increase thread level parallelism. New ARM cores have added these features to enable them to enter their server space. Cortex-A15 has speculative execution for servers, Cortex-A5 does not.

The bolded has been my prediction and my salient point.

I repeat my upthread arguments. The constant speedup of OoOE pales in comparison to the exponential speedup from maximizing the number of cores. And (wasted[1]) power consumption is a huge if not the #1 factor now, which impacts the design of the silicon because power dissipation is a huge problem and constrains the degrees-of-freedom w.r.t. positioning of components on the silicon. OoOE is power-consumption inefficient and growing more insignificant as a facet of the expanding whole (as I anticipated), especially as they discover they can trade off more cores and lower power consumption for at least the intensity of the OoOE, if not the ILP (instruction level parallelism) entirely, at some point in the future.

[1] Wasteful because it is speculative and thus computes some fraction of the whole which is discarded. Attempted mitigations to Spectre will increase the waste. There will never exist a 100% solution to every possible side-channel attack that can be devised, and thus speculative execution will eventually be accepted to be a security bug and discarded. Go ahead and try to debate me on this point.

Another factor is that green threads are becoming the new paradigm for massive concurrency and these essentially render OoOE less effective because they frequently purge the register file and instruction pipeline. As the number of green threads proliferates into the billions and trillions such that user-layer context switches become more frequent and granular, OoOE may become a net negative when factoring in the power and die area costs relative to the thread parallelism that could be gained by repurposing the silicon and lowering the TDP, which impacts the holistic economies-of-design.

And even if I am incorrect about ILP disappearing or diminishing from the silicon, it is a fact that it is diminishing as a portion of any speedup and scaling of Moore’s law. If Moore’s law stops, computing dies. So whether it becomes deprecated or not, ILP is now becoming entirely irrelevant to the future of computing. And this was simply obvious to me.

Given that green threading context switches attempt to align with purges of the instruction pipeline (e.g. branches and function calls) to minimize costs, perhaps ILP/OoOE will become to some extent more explicitly guided by software (statically or dynamically)? Why should the hardware need to attempt to interpret some static knowledge that a loop will not exit (and the runtime will not context switch) until it has looped N times wherein N is known dynamically before the loop begins? There’s semantic knowledge that the hardware is blind to, which it may benefit from, which was one of my key theses.

To expound a bit on why Spectre can never be entirely fixed, let me first quote from the research paper:

Spectre attacks involve inducing a victim to speculatively perform operations that would not occur during correct program execution and which leak the victim’s confidential information via a side channel to the adversary.

So the vulnerability is due to the fact that the speculation is not random and is in fact driven by the runtime data that is going to be pilfered by the attack. You may presume you can foreclose every footprint of the speculative execution available to side-channels (e.g. other cores), but in general this is impossible because timing attacks can even be listened to by means outside the CPU entirely! Wrap your mind around that. So the attacker does not even need to find a footprint within the CPU itself. The mere presence of speculative execution is sufficient. This is why cryptographic algorithms are designed to run at constant speed regardless of the inputs, so that they are immune to timing analysis.
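As a small illustration of that constant-time principle, Go’s standard crypto/subtle package compares secrets without an early exit, so the running time reveals nothing about where the first mismatching byte is, unlike a naive comparison loop:

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// naiveEqual leaks timing: it returns as soon as a byte differs, so an
// attacker measuring latency learns how long the matching prefix is.
func naiveEqual(a, b []byte) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	secret := []byte("correct horse battery staple")
	guess := []byte("correct horse battery stapl3")

	// ConstantTimeCompare inspects every byte regardless of where the
	// mismatch is, so its running time is independent of the inputs.
	fmt.Println(subtle.ConstantTimeCompare(secret, guess) == 1) // false
	fmt.Println(naiveEqual(secret, guess))                      // false, but timing-dependent
}
```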

@keean wrote on Jun 6, 2018

@shelby3 Threadripper!

https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh

It's still speculative out of order with up to 32 cores.


Ampere announces 128-core Arm server processor

Ampere – the semiconductor startup founded by former Intel executive Renee James and not to be confused with the new line of Nvidia cards – just introduced a 128-core, Arm-based server processor

[…]

While Intel and AMD have gone with hyperthreading, allowing for two threads per core, Ampere went for single-threaded cores. Ampere says that as it adds cores, performance scales linearly, according to the Stress-NG benchmark.

“You'd think everyone must scale [as they add cores], but they don't. AMD gets bottlenecked from system level functions. A thread is not the same as a core. You are sharing resources among multiple threads. Scalability was the prime reason we went with single-threaded cores,” Wittich says.

Amazon Launches a Killer ARM Server Chip With the Graviton2

Amazon also launched its own Graviton CPU back in 2018, but the CPU wasn’t all that great a deal. Graviton2 looks to be a different story.

The Graviton2 is based on ARM’s Neoverse N1, with a slightly lower clock than ARM specified (2.5GHz). Each Graviton2 core offers 128KB of L1 (64K i-cache, 64K d-cache), and a 1MB L2. The entire 64-core CPU is backed up with a 32MB L3. The Neoverse N1 is a derivative of the Cortex-A76 and Amazon is using ARM’s CMN-600 mesh interconnect. There are 8 memory channels supporting DDR4-3200.

Graviton2 is a completely different product than Graviton, according to Anandtech, which recently reviewed it against competing solutions from Intel and AMD.

[…]

It’s Official: NVIDIA To Acquire Arm For $40 Billion

Over the last half-decade NVIDIA has undergone significant growth – both in regards to revenue and market capitalization – thanks in big part to NVIDIA’s newfound success in the server and datacenter market with their deep learning accelerators. While the company is well off of its 52-week high that it set earlier this month, NVIDIA now has a sizable $330B market cap that they are leveraging to make this deal possible.

And according to the company, it’s that success in the server market that is driving their interest in and plans for Arm. NVIDIA expects the server market to remain a high-growth opportunity, and that by acquiring Arm they can leverage Arm’s recent success with Neoverse and other server products to reach an even bigger chunk of that market.

[…]

Ultimately, the Arm deal will be a significant shift in the industry, one far bigger than the $40 billion price tag indicates. In one form or another, Arm and its IP are at the heart of billions of chips sold every year, a reach that few other companies can match. For NVIDIA, acquiring Arm will cement their position as a top-tier chip designer, all the while giving them an even larger technology portfolio to go after the server market, and new opportunities to sell IP to other vendors.

Nvidia’s $40 billion Arm acquisition is about bringing AI down from the cloud

A big deal that could be a big deal

As Huang put it: “AI is moving from the cloud to the edge, where smart sensors connected to AI computers can speed checkouts, direct forklifts, orchestrate traffic, save power. In time, there will be trillions of these small autonomous computers, powered by AI, connected to massively powerful cloud data centers in every corner of the world.”
"Nvidia supplies the data centers, Arm supplies the mobile chips"

As the creator of the GPUs used in many AI data centers, Nvidia can supply the first half of this equation, while Arm, designer of cheap and energy-efficient mobile chips, takes care of the second. It doesn’t take a genius to work out there’s potentially profitable overlap between these two businesses. (Though it does take $40 billion to force that overlap into existence.)

[…]

Fittingly enough, Nvidia’s ability to offer extra capitalization is, in part, only due to interest in artificial intelligence. As journalist Alistair Barr noted on Twitter, back in 2016 Nvidia was worth $30 billion, but its valuation has since increased to over $300 billion because of demand generated for its GPUs by machine learning (among other things). So while demand for AI motivated this deal, it also made it possible in the first place.

Nvidia's Arm Deal Would Make It the Center of the Chip World

Combining the two chipmakers would unite leaders in two big tech trends—artificial intelligence and mobile computing.

In recent years, Nvidia has ridden one of the biggest waves in technology, selling chips needed to build increasingly clever artificial intelligence algorithms. Now, the company plans to catch another big swell—mobile computing—with a $40 billion acquisition of Arm, which designs the chips found in virtually all smartphones.

The deal would reshape the chip industry overnight—putting Nvidia at the center of much of the action.

[…]

The two companies are dominant in AI and mobile, but Intel remains king of the data center. Only recently have some companies, including Amazon, begun adapting Arm’s designs to make more efficient new data center chips.

Data center sales are a fast-growing part of Nvidia’s business, generating $3 billion in revenue last year, up from $830 million in 2016. In April, the company acquired Mellanox, an Israeli company that makes hardware transmitting data between chips in cloud computing systems, for $6.9 billion.

In an interview with industry analyst Patrick Moorhead, Nvidia CEO Jensen Huang said adding Arm to Nvidia would help sell more Arm chips to data centers. “What will change is the rate of our road map,” Huang said. “We know for sure that data centers and clouds are clamoring for the Arm microprocessor, the Arm CPU.”

In coming years, more computing is likely to drift into the cloud, especially as more companies make use of AI for business applications and as 5G wireless networks create new possibilities for sharing data and applications.

[…]

“We're about to enter a phase where we're going to create an internet that is thousands of times bigger than the internet that we enjoy today. A lot of people don't realize this,” Huang told Moorhead. “We would like to create a computing company for this age of AI.”

Mike Demler, an analyst at the Linley Group, which specializes in microchip technology, says this needs to be a long-term strategy, as Arm designs are just starting to find their way into data centers.

Demler believes Nvidia’s plan may be to combine Arm’s general purpose chips with its own specialized AI ones, an approach that could improve the capabilities of both. “Nvidia’s intent is to proliferate a new computing model by having a tighter integration of CPU and AI architectures,” he says.


Remember I noted that ARM supports “tagged pointers,” which can make null pointers, tagged unions, dynamic dispatch for existential quantification, ARC and/or interior pointers (i.e. those that point within an allocated block of memory) more efficient. I wrote:

Elimination of tracing and ARC removes the extra overhead tradeoffs required for “interior” pointers into the body of a data structure. But it doesn’t remove the overhead issue for those that persist, which require tracing or ARC.
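To illustrate the tagged-pointer idea (ARM’s top-byte-ignore leaves the high byte of a 64-bit address free for metadata), here is a hedged Go sketch that packs an 8-bit tag into the top byte of a plain integer handle; it deliberately manipulates uint64 values rather than real Go pointers, since tagging live Go pointers would fight the garbage collector.

```go
package main

import "fmt"

const tagShift = 56 // top byte of a 64-bit value, as in ARM's top-byte-ignore

// withTag packs an 8-bit tag (e.g. a union discriminant or a refcount hint)
// into the top byte of a 64-bit handle.
func withTag(addr uint64, tag uint8) uint64 {
	return (addr &^ (0xff << tagShift)) | uint64(tag)<<tagShift
}

// tagOf extracts the tag; addrOf recovers the untagged address the hardware
// (or a masking load) would actually dereference.
func tagOf(h uint64) uint8   { return uint8(h >> tagShift) }
func addrOf(h uint64) uint64 { return h &^ (0xff << tagShift) }

func main() {
	const fakeAddr = 0x0000_7f3a_1c00_40d8 // illustrative address, not a real pointer
	h := withTag(fakeAddr, 3)              // tag 3 might mean "third variant of a tagged union"
	fmt.Printf("tag=%d addr=%#x\n", tagOf(h), addrOf(h))
}
```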


I wrote on Sep 10, 2018

@keean regarding our prior discussion about superscalar versus massively multi-core, the 32 core (64 logical thread) AMD Threadripper seems to only excel at massively parallelized tasks which are not memory (cache coherency) bound:

[…]

I wrote on Oct 20, 2018:

I have also augmented my original proposal by acknowledging we probably also need the ability to instance region-based memory allocation (aka the Arena pattern) each of which are separately bump-pointer heap allocated. Thus a region is only ever freed all at once with no mark-and-sweep. This increases efficiency for use cases that are compatible with such a model.

I posit that these proposed models in addition to avoiding thread synchronization of collection cycles may cope better in some use cases (because they provide alternative assumptions about duration and cordoning of the generation) with the issue you mentioned in your blog as quoted below about violations of the young generational hypothesis assumption (which you’ve also mentioned can be sometimes ameliorated by eliminating superfluous boxing and allocations):

Generational collection works well when the generational hypothesis holds but struggles when values survive the nursery only to become unreachable soon afterwards. This corresponds to common allocation patterns such as cycling values through mutable queues or caches and filling hash tables.
Imagine repeatedly enqueuing and dequeuing values on a queue. The lifetimes of the values are proportional to the length of the queue. Thus, this provides a simple way to quantify the performance overhead of generational garbage collection. If boxed values are enqueued and dequeued on OCaml's built-in mutable Queue data structure then the time taken per element jumps by around a factor of 2-3.5 when the elements reachable from the queue exceed the size of the nursery and, thus, most survive to the old generation rather than being collected efficiently in the young generation.

You noted in another of your blogs that timely young generation collection cycles are important for maximum performance of (what you refer to as ‘symbolic’1) code that produces a lot of young garbage. So my proposed model above would require some intervention by the programmer to subdivide his algorithm between actors such that collection cycles are frequent enough to maximize performance. But this would conflate concerns, because the Pony capabilities actor model is also about data-race-safe threaded sharing. Thus the need for regions, and I think we also need to add the option of regioning by function scope in addition to the proposed regioning by region id.

Will the programmer be capable of this static partitioning of allocation (i.e. is it a deterministic phenomenon), or is AMM needed to stochastically solve a non-deterministic phenomenon? My thought is that although program demands can be non-deterministic, the algorithms can sometimes be broken down into relatively deterministic parts that are combined non-deterministically. This will depend on whether the algorithm can efficiently and deterministically identify which objects it wants to copy out of the short-lived, frequently collected bump-pointer allocated region. For algorithms that can’t, mark-and-sweep is perhaps required to maximize performance. For n-Queens it seems all the young garbage can be eliminated by employing a mutable array instead of creating temporary copies, and then zeroing out the rows when backtracking.

Again I want to reiterate that fully AMM seems to not scale to massively multicore which is where the industry is rapidly headed now. The issue is more complicated than just stopping-the-world or not, as I explained up-thread (kudos to @keean for the initial insight) it also involves scaling the CPU’s cache coherence overhead. Pony’s local mark-and-don’t-sweep, global reference counting via message passing AMM model seems to have performance issues. So it seems we need a different model. I’m trying to figure out what the model should be. Note the cache coherency issue is mostly due to sharing between threads, which is mostly resolved by Actors, yet may also have interplay with the GC strategy.

I wrote on Nov 15, 2018:

Changing topic a bit, it seems with recent announcements that AMD is moving towards a design in which sharing data between cores (or at least between 8-core chiplets) will have more latency and cost (Infinity Fabric will remain on 14nm for the time being). Perhaps we should deprioritize being able to avoid copying when sending data between Actors, as that will become a nearly useless optimization in the future. This design point may impact the GC issue we have been discussing in the other thread.

@keean wrote on Jan 12, 2019:

@shelby3 it's great to see people trying to make progress in this area, however I am not hopeful for success. Cache is very regular (that is, it is easy to construct the blocks from hardware 'macros' when designing the chip), so in terms of hardware design effort it is cheap compared to complex parts like instruction decode, ALUs and control logic, whilst at the same time producing a very large increase in performance (10 - 1000x). In my experience people want CPU performance more than they want battery life, otherwise they could have a laptop that lasts days by using low power ARM cores. Also compatibility with existing software is important. Look at failures like Transmeta for example, and that nobody uses x86 emulation on Arm.

One of the reasons for the improved cost for ARM servers is the lower heat due to lower power consumption.

I will end up being correct. I usually am proven correct over the long-term.
