Parallelism #41
You mentioned this VLIW comparison and then VLSI/VHDL parallelism in our past discussion in the Concurrency thread. And ironically you made the same conflation of “low and high-level parallelism” which I will be correcting you on again herein. I agree there’s more information at runtime relative to a compile-time approach which is parallelizing the same algorithm. But your thought process is incorrectly applied to what I wrote. You conflated low-level runtime with high-level compile-time. You must compare apples to apples. Your point is true for low-level runtime versus low-level compile-time.

I did not write “more information”. I wrote “more semantic information”. The distinction apparently means nothing to you, but it should. The “more information” at runtime is limited in its character by the semantics expressed in the code. The higher-level semantic information enables the programmer (not the compiler!) to make changes in the algorithm which are impossible for the low-level speculative pipelining parallelism to achieve, because the low-level transformations cannot change the high-level algorithm. A high-level algorithm designed to achieve parallelism can obtain orders-of-magnitude scaling of performance; whereas the low-level speculative pipelining achieves much less than an order-of-magnitude increase, and this increase is not scaling with Moore’s law over time because the CPU designers have hit a limitation of nature which can’t scale. Advocating the futility of fighting against nature is inane, idiotic, myopic, etc.

In addition to my slamdunk point above, your point about runtime information can also be applied by a compiler+runtime at a higher semantic level which the low-level pipelining can’t do. For example, with pure functions a compiler+runtime can parallelize independent evaluations automatically.

Another tangential point is that when you claim that it’s too difficult for programmers to extract parallelism, you are ignoring the points made in the ACM article that extant PLs lack the proper designs to extract the parallelism that they could plausibly do without too much programmer intervention, such as the aforementioned pure function example. But my point has also been that ingenious programmers will find even more clever algorithms for parallelism.

That ACM article argued that C is no longer a low-level language because, since it lacks these facilities for the compiler to automatically extract the higher-level parallelism which the low-level speculative pipelining cannot extract, CPU designers are forced to create a virtual microcode pipelining hardware inside the CPU which C does not even expose to the programmer. IOW, parallelism is the only way to continue to scale Moore’s law, and C encourages CPU designers to instead not scale and waste most of the transistors on trying to squeeze diminishing incremental improvements in serial performance. That is a recipe for disaster, with the future of computing becoming stagnant and technology which relies on computing failing to improve. Thus it behooves us as PL designers to address this issue.
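To illustrate the pure-function point, here is a minimal Go sketch of my own (not from the ACM article) of the transformation a compiler+runtime could apply automatically when it knows the mapped function is pure: independent applications can be fanned out across cores without changing the result.

```go
package main

import (
	"runtime"
	"sync"
)

// parMap applies a pure function f to every element of xs in parallel.
// Because f has no side effects, the work can be partitioned across
// cores (here by hand, ideally by the compiler) without changing the result.
func parMap(xs []int, f func(int) int) []int {
	out := make([]int, len(xs))
	workers := runtime.NumCPU()
	chunk := (len(xs) + workers - 1) / workers

	var wg sync.WaitGroup
	for start := 0; start < len(xs); start += chunk {
		end := start + chunk
		if end > len(xs) {
			end = len(xs)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				out[i] = f(xs[i]) // safe: disjoint index ranges, f is pure
			}
		}(start, end)
	}
	wg.Wait()
	return out
}

func main() {
	xs := make([]int, 1000000)
	for i := range xs {
		xs[i] = i
	}
	squares := parMap(xs, func(x int) int { return x * x })
	_ = squares
}
```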
Because they were running serial algorithms. Nowhere have I claimed that running serial algorithms on parallel hardware will be faster than running serial algorithms on speculative pipelining. Thus you shouldn’t write this irrelevant information which doesn’t even apply to my point. My point was about the need to parallelize our performance-critical code paths, otherwise our programs will not scale. This is a statement of fact, which none of your obfuscating BS has refuted.
Me too, I will also choose speculative parallelism because of the evidence: at a higher semantic level, which is for example how 3-SAT is made practical for some working sets. And you will go with evidence? Then why do you ignore the evidence that future improvements in serial speedups are dying? And why am I repeating this for the 4th or 5th time? I simply can’t believe you’re this slow-minded, because you’ve demonstrated to me in many instances that you have a sharp intellect. Why the cognitive dissonance here? Is it because you lack divergent thinking? Is it some obstinance issue? Are you preoccupied? Are you being lazy and not really thinking it out, relying on past conclusions you had made instead of reanalyzing them in light of my points?
And 3-SAT is not general computation; it is a very specific algorithm for a class of problems, but it looks very different from general computation. The general part of most programs is an event loop that queues sequential processes, which necessarily follow one another, or are connected by streams in the pipelined case. The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware; we can run these algorithms on the GPU already. I agree that a new language has to handle parallelism. There are two ways I would do this: one is vectorisation, so high-level operations like matrix multiplication can use SSE vector instructions for performance. The other is by 'goroutines', parallel tasks communicating by streamed channels. However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.
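A minimal Go sketch of the second style described above: sequential stages connected by streamed channels, where the stage functions are illustrative placeholders of my own.

```go
package main

import "fmt"

// generate streams the integers [0, n) into a channel.
func generate(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 0; i < n; i++ {
			out <- i
		}
	}()
	return out
}

// square is itself a sequential stage; the parallelism comes from
// running many such stages concurrently, connected by channels.
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

func main() {
	for v := range square(generate(5)) {
		fmt.Println(v) // each stage is sequential; the pipeline runs in parallel
	}
}
```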
@keean wrote:
ftfy. Sequentially ordered != sequential process. There may be parallelism within the individual processes, each of which is sequentially ordered. I know you know that nature is fractal, so there can be parallelism within serialism and vice versa. Also the events in the event loop which are in the performance critical path are typically not sequentially ordered and can be performed in any order, or even in parallel in some cases. The GUI events which are ordered are typically not performance critical (we do not need to scale the performance of GUIs by orders-of-magnitude).
I have not looked at what are the limitations of the designs of existing GPUs. Seems that the ones built into the CPU don’t have the fast GBs of dual-ported main memory that dedicated GPU cards have? And what are the issues with exchanging work between CPU threads and GPU threads? Are the cost barriers high enough that the two processors pretty much can only send messages to each other? I know you’re reasonably knowledgeable about the extant hardware designs, so I will defer to your knowledge on this. But yeah, maybe our compiler should be doing more to leverage that GPU (and also the CPU’s multicore) hardware which often sits idle.
I think my recent point can be recapitulated/reconstituted as the more parallelism we can push for in code, then the more incentive CPU designers will have to prioritize high-level parallelism instead of low-level pipelining. Hopefully resulting in scaling trending back to Moore’s law progress. My point is we should do all we can and reduce dependence on the low-level speculative pipelining in our design decisions. For example, such as where you argued in the Error Handling thread that exceptions as return values would not be an extra performance cost because of the low-level speculative pipelining. But that design decision if taken would have further cemented the need for low-level speculative pipelining and helped kill future scaling. And upon further analysis of your proposal independent of the concerns with reliance on speculative pipelining, it became clear we didn’t need to accept that design decision.
Not just parallelism: we also need to diminish our reliance on the model of main memory as having latency that is low enough to keep up with a serial processor, because this requires expending the transistor budget on complex, deleterious caching mechanisms. The ACM article also points out other things we can improve, for example reducing the reliance on cache coherency, which is necessary in order to enable hardware designers to prioritize parallel computation units instead of the very costly (inefficient!) model of low-latency main memory. For example, that ACM article pointed out that if our PL allows the compiler+runtime to reformat arrays of structures into structures of arrays without intervention from the programmer, then memory accesses can be laid out for vectorization and coalescing (a sketch follows below the quote). Also I wrote:
Not always true. Again nature is fractal.
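For concreteness, here is a hand-written Go sketch of the array-of-structures to structure-of-arrays reformatting mentioned above; the point of the ACM article is that a compiler+runtime could perform this mechanically, without programmer intervention. The field names are my own illustration.

```go
package main

// ParticleAoS is the array-of-structures layout: iterating over just the X
// field drags every cache line of the whole record through the cache and is
// hard to vectorise.
type ParticleAoS struct {
	X, Y, Z float64
	Mass    float64
}

// ParticlesSoA is the structure-of-arrays layout: each field is contiguous,
// so a loop over X streams through memory and maps naturally onto SIMD lanes.
type ParticlesSoA struct {
	X, Y, Z []float64
	Mass    []float64
}

// toSoA is the mechanical transformation a compiler could apply.
func toSoA(ps []ParticleAoS) ParticlesSoA {
	soa := ParticlesSoA{
		X:    make([]float64, len(ps)),
		Y:    make([]float64, len(ps)),
		Z:    make([]float64, len(ps)),
		Mass: make([]float64, len(ps)),
	}
	for i, p := range ps {
		soa.X[i], soa.Y[i], soa.Z[i], soa.Mass[i] = p.X, p.Y, p.Z, p.Mass
	}
	return soa
}

func main() {
	_ = toSoA(make([]ParticleAoS, 1024))
}
```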
Another point which was implied in my writings on this issue, and which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype. And the corollary is that expending it to speed up serial algorithms that are in the performance critical path of the application is futile non-scaling.
I disagree with this. Speculative out-of-order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower. Having 4 cores does not make up for this; just look at the failure of the Power Architecture. PowerPC was IBM's attempt to follow that approach, lots of simpler cores. It was tried in the PlayStation 3 (the Cell processor), and the added complexity of programming the thing drove developers away, leading to the dominance of the XBox, as it provided a proper PC-like CPU with caches and super-scalar out-of-order processing. The ideas you and that article are proposing are not new; they have been tried, and they failed the test of the free market.
@keean wrote:
Do you not understand what I intend by “are not in the performance critical path of the application”? It can’t be slower in any relevant way, if it is not in the performance critical path. Whoop-de-doo if you make your highly responsive GUI more responsive and nobody can discern the speedup, because it is not code in the performance critical path. Law of diminishing returns aka marginal (non-uniform) utility. Another corollary is that only 5% (or maybe up to 20%) of your application’s code is typically in the performance critical path, although there are exceptions. Thus the less stellar programmers will still have a lot of work they’re qualified to do that can be serial algorithms. Btw, one (or maybe 2) core in the CPU could retain that speculative pipelining and the other 8, 16, or 24 (and increasing in the future) could then not waste their transistor budget on marketing hype. Has AMD already designed a CPU like this, such as with most of the transistor budget for the GPU on the same die? |
Yet another example of parallelism I think the compiler can extract without intervention from the programmer.
It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency; fine when you are doing a batch job, but bad for interactive desktop computers where low latency is critical.
@keean wrote:
I did not propose to employ parallelism in code that is not in the throughput performance critical path. Thus latency would not be impacted in the (typically up to 95% of the) code of a program that is not in the throughput performance critical path. Since that 95% of the code can only run serially, it can’t leverage the other cores. Thus one or two cores with the speculative pipelining are sufficient (on the desktop; servers will eventually discard the speculative pipelining entirely once we improve the software). The rest of the transistor budget can be allocated to parallelism, which is the only way to scale.

You could argue that running multiple copies of the program, such as for a webserver where each request runs a separate copy of the program, would need every core to have speculative pipelining in order to minimize latency. But I would argue that a 50% reduction in CPU-bounded latency is not worth killing the scaling (by Moore’s Law) of the number of cores and thus the number of requests that can be served per CPU. This is why Xeon (which sacrifices clock rate for more cores) instead of the desktop Haswell/Broadwell architecture runs on most servers. So Xeon is sacrificing latency for parallelism.

Also Ryzen is kicking ass on Intel precisely because they made the right tradeoff for where computing is headed: “Ryzen CPUs offered stronger multi-threaded performance and weaker single-threaded performance relative to comparable Intel CPUs.” Sparc was too geared towards parallelism too soon; the software had not caught up yet. AMD is probably hitting the sweet spot for the transition phase. They decided to emphasize their strengths in design and parallelism (GPUs) and outsource the foundry (eventually leveraging the coming strength of China in foundries!). AMD is following Android’s model and Intel is following Apple’s model, so I expect AMD to end up with more market share than Intel unless they fuck up again to snatch defeat from the jaws of victory. Also AMD has rising capital to expend because of the rising demand for GPUs for mining cryptocurrencies. But eventually something like Sparc or GPUs will come back much stronger in the future when the software adapts to more parallelism. And look who continues Sparc now. Japan! The Asians are going to kick ass in the future.

Also the latency is typically not bounded by the CPU but rather by the persistent storage media, e.g. SSDs. So there is sometimes (or maybe even typically?) ample overhead to allow for parallelism and discard the speculative pipelining without increasing latency, although this can vary in each situation. You could attempt to make the argument that the latencies of the CPU and persistent storage are additive, but smart coders know to mask the CPU latency within the persistent storage latency with interleaving. Ditto for masking CPU throughput and latency within memory latency via interleaving computation with read/write accesses.
I farted:
Some parallelism discussion in the Concurrency issues thread #17.
@shelby, do you know how the RAM capacity of a normal PC board will increase in the future? A massive multi-core increase is one good thing for the future, but to run multiple instances simultaneously we also need more RAM capacity.
@shelby3 Threadripper! https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh It's still speculative out-of-order with up to 32 cores.
@keean wrote:
64 hyperthreads on a desktop! Wow. Yeah I had realized that removing the speculative out-of-order doesn’t gain much. But the point remains that further increases due to Moore’s law are going into cores now. There’s no more that can be squeezed out of OoO execution. Remember I wrote recently in the Concurrency issues thread #17:
@sighoya I don’t know the plans about DRAM. I presume they’ll scale everything up. It’s the latency which can’t keep pace. But with massive concurrency then we can coalesce main memory accesses as GPUs do.
Found this:
@shelby3 On a CPU you might launch parallel threads and then wait for them all to complete (a rendezvous); on a GPU you know all the tasks finish at the same time. So a GPU is going to be more efficient for loop parallelisation (as long as it's a big enough chunk of work to overcome the cost of launching the task and collecting the results). A CPU is going to be more efficient at having one thread per service connection, dealing with client connections etc.
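A rough Go sketch of the two CPU-side patterns contrasted above: a fork-join rendezvous over a loop, versus one lightweight thread per service connection. The function names and the connection handler are illustrative assumptions.

```go
package main

import (
	"net"
	"sync"
)

// sumParallel is fork-join loop parallelisation: launch workers, then rendezvous.
func sumParallel(xs []float64, workers int) float64 {
	partial := make([]float64, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := w; i < len(xs); i += workers {
				partial[w] += xs[i]
			}
		}(w)
	}
	wg.Wait() // the rendezvous: all workers must finish here
	total := 0.0
	for _, p := range partial {
		total += p
	}
	return total
}

// serve spawns one goroutine per connection: each task is an independent
// sequential process, which suits a CPU better than a GPU.
func serve(ln net.Listener, handle func(net.Conn)) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		go handle(conn)
	}
}

func main() {
	_ = sumParallel(make([]float64, 1<<20), 8)
}
```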
@shelby3 A good parallel language should have different syntactic structures to represent these different kinds of parallelism, so the programmer is clear how things will execute and how much they will cost (the principle of visibility, and not paying for what you don't use). You could automate this with heuristics, but then the programmer has to play chase-the-heuristic, where small, seemingly inconsequential code changes can tip the balance and result in unpredictable changes in performance.
@keean made me aware of this: Microsoft is working on a new paradigm for increased parallelism:

Now Microsoft ports Windows 10, Linux to homegrown CPU
Explicit data graph execution - Wikipedia

Also remember I predicted earlier in this discussion the death of the speculative superscalar architecture and shared caching between threads:

OpenBSD disables Intel’s hyper-threading over CPU data leak fears
https://www.itwire.com/security/83347-openbsd-chief-fix-for-new-intel-cpu-bug.html

And remember that there is no way to disable timing information (one can merely increment a variable in a loop):

https://gruss.cc/files/fantastictimers.pdf

The only 100% safe way to stop code from software timing is to make all code run at a continuously, randomly changing clock speed, which isn’t desirable nor realistic. OS preemptive threading may to some extent foil the use of timers which rely on incrementing a variable and non-preempted execution, but the devil is in the details. I don’t know the minimum interval needed to measure a timing attack with software timers. And if the OS preempts too frequently then overhead increases, which eats into the performance that was supposed to be gained from those hardware performance features (speculative execution and shared caches) which make the timing attacks possible. Thus I’m not sure that removing hardware timers is a complete solution. Also some software needs high-precision timers.

Thus I still propose that in the future we will need the option to have code request that it be run only on cores with those vulnerable performance enhancements turned off, and/or to dictate which other code can run along with it in the other hyperthreads which share the same vulnerable performance enhancements. Code that prioritizes security over maximum performance, or which employs suitable parallelism, may want to turn those vulnerable performance enhancements off.

EDIT: correlation attacks can be insidious, and we should not think we can so easily squash them just by removing a hardware counter.
@shelby3 I am not sure, I think for most people performance is more important than security. I mean you can still exploit "rowhammer" on DRAM, but people are still using DRAM :-) |
As I wrote previously, I think eventually we will end up with a way for applications to turn it off on a per-core basis so that applications can decide whether they need to prioritize security (and battery life) over serial performance.
I want to make my stance clear because I learned in private messages that some misunderstood my stance. In my post that responded to Eric Raymond’s blog post and claim about the invariant of SICK algorithms requiring serial performance, and in the discussion here in GitHub that has ensued, I didn’t intend to imply that we will ameliorate mathematical complexity theory. Rather my stance is that we will replace the need for those algorithms by solving problems with different strategies which can be parallelized and that do not require those algorithms. My stance is that we have no choice, and humans necessarily rise to the challenge and are ingenious when the market demands them to be. We may also find algorithmic strategies to apply to existing SICK algorithms which leverage parallelism, but that wasn’t my central thesis.
@keean can you concur or correct my thought that §4. Challenge #1: Traffic on pg.3 of the paper Why On-Chip Cache Coherence is Here to Stay (which also appeared in a journal) contains an error in logic:
Although they’re correct that the bus traffic overhead of the invalidation messages is constant relative to the traffic of the read misses, the cost of coherency for per-core private caches is that every write invalidation is by definition going to cause an additional read miss for each core that still needs access to anything (that wasn’t even invalid but was included) in the block (aka cache line) that was invalidated. IOW, the overall traffic cost of write invalidation for per-core caches grows with the number of cores sharing the data, so it doesn’t scale. Or more simply stated: sharing mutable data between cores that have private caches doesn’t scale.

EDIT: I received an email reply from one of the authors of that paper. He concurred but pointed out that my point is only true when there exists the “unusual situation” of “false sharing” of “unrelated data … collocated on the same cache block, leading to needless coherence traffic.” His opinion is that with “good data placement, false sharing shouldn't be a big problem.” I responded that I disagree, because there will be cases where the record of related fields being written to also needs to be read again, so it’s not necessarily due to false sharing and thus not necessarily unusual or uncommon. Thus I continue to believe that the discovery by mostly @keean, with my follow-up insights, is very important. However I will say I don’t have any statistics or functioning models to check the frequency of this issue and its quantitative impact.

I don’t necessarily disagree with the title of the paper though, once writable shared data is disallowed in the system. There’s still some coherency required…

If we adopt @keean’s suggestion in the Pointers #38 issues thread that all data shared between threads must be immutable (aka read-only), then my realization is that the only time that per-core caches need to be invalidated is when an immutable data structure has been deallocated and the memory it occupied is allocated anew for some other data. That is because the invalid data remaining in the cache will not be accessed by any program which respects the allocation until the same area in shared memory has been allocated anew. The allocator could optimize this further (so that other data in the same block which is still validly cached is not prematurely invalidated) by refusing to allocate anew the memory that the deallocated immutable data structure formerly occupied until all of the surrounding memory, which comprises an entire block (regardless whether that entire block is loaded into any per-core cache) that will be invalidated, is also deallocated. Thus the invalidation broadcasts (to the per-core caches) would only happen when the entire block in shared memory is no longer in use and has accepted a new allocation.

This does conflate software and hardware though. One way to implement it is to have a system call for invalidating blocks and rely on software to manage this correctly for its virtual memory pages.

The conclusion section of that paper states:
They actually failed to consider the holistic issue, which architectural simulations under scaling would presumably have illuminated. They considered each read miss independently, without considering (and thus implicitly presuming otherwise) that write invalidation drives more read misses, thus amplifying traffic.
They also failed to consider the case of an Erlang-like model wherein all shared data is immutable. P.S. I emailed the authors of that paper a link to this post. Additionally immutable shared data can facilitate less complexity for cache-to-cache transfers as explained in 3) Cache-to-Cache Transfers of subsection B. Coherence Protocol Independent Design Considerations in §III. IMPLEMENTATION CONSIDERATIONS on pg. 7 of the research paper An Evaluation of Snoop-Based Cache Coherence Protocols:
And cache-to-cache transfers provide power-savings as explained in subsection F. Energy Efficiency of Cache Coherence Protocols of §II. SURVEY OF EXISTING CACHE COHERENCE PROTOCOLS on pg. 5 of the same research paper. And also explained in §The case for cache coherency of the article How Cache Coherency Impacts Power, Performance. Read-only shared data is actually faster and uses less power on existing systems which employ the MESI protocol.
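Returning to the write-invalidation point above, here is a small Go sketch of my own that makes the coherence-traffic effect visible: two counters written by different cores land on the same cache block, so every write by one core invalidates the line in the other core's private cache; padding them onto separate blocks removes that traffic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// shared: both counters sit on the same 64-byte cache block, so a write by
// one core invalidates the block in the other core's private cache.
type shared struct {
	a, b uint64
}

// padded: the second counter is pushed onto its own cache block,
// eliminating the coherence traffic between the two writers.
type padded struct {
	a uint64
	_ [56]byte
	b uint64
}

// hammer repeatedly increments the counter it is given.
func hammer(p *uint64, iters int, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < iters; i++ {
		(*p)++
	}
}

// run times two goroutines hammering two distinct counters.
func run(a, b *uint64, iters int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go hammer(a, iters, &wg)
	go hammer(b, iters, &wg)
	wg.Wait()
	return time.Since(start)
}

func main() {
	const iters = 50000000
	var s shared
	var p padded
	fmt.Println("same cache block:", run(&s.a, &s.b, iters))
	fmt.Println("padded:          ", run(&p.a, &p.b, iters))
}
```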
This post contains a shocking (at least to me) discovery: Java and Go multi-threading with a global-heap, tracing GC is entirely incompatible with scaling multi-core. Scaling multi-core is entirely incompatible with multi-threading that shares mutable data between caches. Additionally, tracing GC is incompatible with scaling multi-threading unless each thread (or small group of threads) has its own separately collected heap, because as I explained, a collector which isn’t instantaneous can’t be shared between too many threads: it would pause too many threads and make threads dependent on the good allocation behavior of other threads.

So scaling multi-core will require reference counting for immutable objects shared between caches (and the immutability will aid in probing and collecting acyclic references, while not stalling any threads during such cycle detection and collection). The GC of objects which aren’t shared between caches must be congruent with a viable general purpose programming model. Intel’s Xeon Phi (aka Larrabee) suffered from not sufficiently addressing these requirements².

One option is Rust’s zero-overhead, 100% stack allocation with its lifetime and exclusive borrowing model providing safe reentrancy. Code would be grouped to be run only by hardware threads sharing the same cache (to avoid write-back and cache-to-cache transfer coherency overhead). Another option is similarly to restrict each logical grouping of code to hardware threads sharing the same cache, but employ a separate GC’ed heap for each said grouping. For example, each green thread could have its own separate tracing GC’ed heap. Note this means a …

Additionally, for cache-to-cache transfer efficiency, and so that lightweight (e.g. green) thread context switches could occur to mask the latency of such transfers, ideally the programming model would enable the thread scheduler to know via software that such a transfer is required.

Serial core performance can’t scale, and many-core doesn’t scale with complex cache coherence. At least it doesn’t scale if writable shared data is employed by the software. And if all the software will be modified to not use writable shared data, then the complex cache coherence circuitry is wasted silicon. In that case perhaps a less complex or even software-driven coherence could be employed instead. If cache-to-cache transfers are only needed when passing an Erlang-style, Actor model message from a thread on one cache to a thread on another cache, then the software can tell the hardware the source and destination caches for the transfer. (Note this means a plurality of Actors per core running on top of green threads.)

The Rigel design offers some rough estimate of the improvement in computation density (e.g. FLOPS/mm² of silicon) by removing the:
It was modeled on a 45 nm process. It has 1024 RISC cores (with two-wide issue superscalar ILP), 15 MB total SRAM cache, operating at 1.2 GHz with a 100W TDP. Compare the Intel i7 Lynnfield on a 45 nm process, which achieves only 4 CISC cores with 8 MB total SRAM cache operating at 2.8 GHz with a 95W TDP. Ignoring latency differences and considering only throughput density¹, I doubt the Lynnfield in 2009 had the 180 instructions in flight simultaneously that the current modern Intel processors have. For one thing, Lynnfield didn’t even have hyperthreading. Let’s presume maybe it had 40 in flight and the Rigel only 2. So unless 1024 ÷ (40 ÷ 2) ÷ 4 = 13 RISC instructions are required to match the work achieved by every CISC instruction, the Rigel may have a higher compute density than the Lynnfield. Also the in-flight instructions on the Intel include some wasted computation that will be discarded by the speculation.

The Rigel paper discusses some of the other variants² of parallelism throughput density designs which have been attempted. One of the key distinctions¹ is between the GPU vectorized (aka SIMD) model, which is able to mask main memory latency with a massive number of cheap context-switch threads (affording an order-of-magnitude higher memory bandwidth by trading off increased latency), and the non-vectorized (aka MIMD) general purpose programming model, which prioritizes serialized low latency and serialized performance at the expense of lower computation density and higher power consumption per work completed (i.e. lower overall efficiency). The latter has an order-of-magnitude lower main memory bandwidth with lower but still high latency (c.f. also), requiring a large L3 cache to bring average latency down sufficiently for the serialized performance. As explained in the Rigel paper, the former is not suitable as a fully generalized programming model.

The following is an idea for a variant of a Rigel-esque design (yet with software-driven cache-to-cache transfers and cache coherence) that replaces the L3 (and L4) cache with a solution that has comparable latency by masking most of the latency of main memory reads. Lightweight green threads are a suitable paradigm for masking macro-scale latency¹, but they aren’t efficiently preemptive enough to generally mask main memory latency. But if main memory latency will only significantly occur (i.e. except for cache spill-over eviction for non-shared objects, which should be minimized by the optimizing programmer) on accessing reference counted shared objects (i.e. shared between caches), which probably mainly means reading those reference counted shared objects via cache-to-cache transfers (perhaps with software-driven cache coherency driven by Erlang-style, Actor model message passing), then the compiler can insert a green thread check-point for preemption by the memory controller (which ameliorates the lack of efficiency). Contrary to the incorrect association of green thread context switches with kernel thread/scheduler context switches, green thread context switches are low latency. At a function boundary, they’re only slightly more than the ~30 cycles latency of a …

Hardware-based cache coherency (which is required without this software-directed coherency we’re contemplating) obfuscates the true costs of the hardware form of coherency.

¹ Sufficient threaded parallelism can mask latency so that hardware is not idled.
The GPU achieves this with a huge register file (RF) and by putting 32 threads in a contiguously context-switched warp, so context switching has low overhead (and also reduces the control circuitry needed per execution unit). The modern i7 hyperthreading employs dedicated registers for the hyperthreads, which share execution pipeline ports and run simultaneously. Thus hyperthreads, although masking some latency, are also intertwined with a focus on serial performance. Green thread work stealing, which can run on modern CPUs without hardware support, employs compiler-generated checkpointed pre-emption (not the same as full pre-emption), which could probably be made more efficient with dedicated hardware support. Green thread context-switches are reasonably efficient compared to OS threads, especially when context-switched at function boundaries, but they are only suitable for masking software macro-scale latency-blocked operations, which does not include low-latency main memory accesses. They may be able to also efficiently mask the latency of waiting on a mutex lock to be freed without hardware assistance.

The GPU trades off latency in every facet for increased parallelism, by eliminating synchronization assumptions for caches and global memory, as well as for example reducing the power consumption (also) by designing a much higher 24-cycle write latency (archived) (c.f. also) on the register file (RF) than the per-core registers of a CPU. The RF (and shared memory) is shared by all threads in the block, so a thread context switch is very low overhead because unlike the CPU no registers need to be saved. The GPU’s shared memory also has a much higher latency than L1 cache on the CPU and doesn’t serve the same function as a CPU cache. Also the shared memory is banked (where each 16 or 32 successive words are in successive banks), so that if threads in the same block read different locations from the same bank on the same clock cycle then they will run serially instead of in parallel. All 32 threads of a warp must be unblocked in order for the warp to execute, but even if they are unblocked they aren’t guaranteed to execute uninterrupted.

Note, given 32 threads attempting to read uniformly-distributed random locations from 32-banked shared memory, probability math informs us that 63% of the threads would run in parallel every cycle. Thus such a divergent, non-vectorized access pattern would require four (4) times the 24-cycle latency for each step forward of the warp (thread block) that requires all 32 threads to access the shared memory, because …

So on the GPU we need to mask latency by either running many more than 8 threads per SM (i.e. “More concurrent threads”) and/or serially queuing up in the RF many copies of each step of an algorithm before processing the next step for each of those copies (i.e. “More independent instructions (ILP) within a thread”). There’s no cache coherency on the GPU. Even the GPU, which has L1 and L2 cache of main memory (actually L1 may optionally be configured only for caching local memory, which is what the RF spills over into), requires a …

² @keean had mentioned to me the Xeon Phi as an example of multi-core that failed to be as performant per watt as a GPU and failed to match the modern Intel CPUs on serial performance. But that was based on Larrabee, which was originally intended to compete with a GPU yet also remain more generally programmable. The Xeon Phi has hardware-coherent cache and out-of-order execution!
So it’s not surprising that it hasn’t really solved any major need. It isn’t optimally designed for maximum general-purpose programming parallelism (c.f. the Actor proposal in the next post) nor for vector, SIMD programming. And of course it isn’t optimal for serial algorithms either. Thus it excels at nothing.
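As a sanity check on the “63% of the threads” figure in the shared-memory banking discussion above (my own derivation): with 32 threads each independently choosing one of 32 banks uniformly at random, and at most one request serviced per bank per cycle, the expected fraction of threads serviced in the first cycle equals the expected fraction of distinct banks hit:

```math
\frac{\mathbb{E}[\text{distinct banks hit}]}{32} \;=\; 1 - \left(\tfrac{31}{32}\right)^{32} \;\approx\; 1 - e^{-1} \;\approx\; 0.63
```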
Actors as the paradigm for parallelism

(Note: having a very bad whole-body inflammatory attack today with a severe headache, forehead-on-the-keyboard fatigue, and mental confusion, so please excuse any lapses in my elocution and logic below.)

@keean and I had discussed Actors in 2016, for example in the Concurrency issues thread #17. @keean and I were recently discussing in private messaging the read-only sharing via messages for Actors, and before that I was initially negative about Actors because they aren’t compatible with a shared memory model exposed in PLs and they don’t inherently resolve all concurrency conflicts (but neither does any paradigm)¹. Analogous to comments made by others when comparing the JVM to Erlang’s VM, I was thinking it’s necessary to have shared memory to, for example, support a global UTXO hashmap where the keys are the SHA256 transaction ids and the values are (pointers to) the UTXO (i.e. unspent transaction output) records. There’s a related Q & A: How does shared memory vs message passing handle large data structures?. There’s also an explanation of the paradigmatic inferiority of shared memory versus message passing. Message passing also amortizes better (c.f. also) over the long term and at higher scaling. Erlang has proven to be excellent for creating web servers, for example.

@keean pointed out: why not just make every UTXO record an Actor? I initially thought it would increase the overhead, but as we discussed the details we both ended up realizing there would be no overhead. I contributed the idea that even a hashmap could be modeled with Actors by making each bucket an Actor which contains for example a linked-list of values. So after deriving the bucket address from the key, the message could be sent to the Actor at that address to check for the value in the bucket or add/remove a value to/from the bucket. For example, if the pointer to the linked-list for each bucket is stored in an array and every Actor for those buckets employs the same code for processing an incoming message, then there’s no extra per-bucket storage needed for the Actor. Simply call the Actor’s function with input arguments consisting of the message pointer and the aforementioned bucket address. The objects pointed to for the message and the linked-list are of course immutable. The Actor modifies the linked-list by creating a new copy with the required modifications and then changes the pointer at the bucket address of the aforementioned array.

Note when messages will cause the Actor to have side-effects, then the message must be guarded by a mutex lock to prevent Actor function reentrancy. Thus each Actor really should have a pure and an impure variant of its function. The impure variant blocks on the mutex lock. These mutex locks can be 1 bit for each Actor when the Actors are in an array (because these adjacent Actors allow for a bit field for these mutex lock bits). But note this mutex lock would be required anyway for mutating any shared memory data with multiple threads, so the mutex bit isn’t additional overhead due to Actors. @keean pointed out that there’s no need to queue the messages; instead employ the mutex lock for blocking. He also pointed out that there’s no need to store a stack pointer (SP) for each Actor, because there’s no context-switching going on at the Actor abstraction and each Actor should return to the top level of its stack when returning from its message processing function.
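To make the bucket-as-Actor hashmap concrete, here is a minimal Go sketch of my own; the names are assumptions, and Go's atomics and mutex stand in for the 1-bit locks and the immutable shared objects described above.

```go
package main

import (
	"hash/fnv"
	"sync"
	"sync/atomic"
)

// node is a cell of an immutable singly linked list; updates copy the spine.
type node struct {
	key, value string
	next       *node
}

// bucketActor is the per-bucket "Actor". Its only mutable state is the head
// pointer; the mutex stands in for the proposal's 1-bit lock and guards only
// the impure (writing) message, so no per-actor stack or queue is needed.
type bucketActor struct {
	mu   sync.Mutex
	head atomic.Pointer[node]
}

type actorMap struct {
	buckets []bucketActor
}

func newActorMap(n int) *actorMap {
	return &actorMap{buckets: make([]bucketActor, n)}
}

// bucketFor derives the bucket "address" from the key, i.e. which Actor
// the message is sent to.
func (m *actorMap) bucketFor(key string) *bucketActor {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &m.buckets[int(h.Sum32())%len(m.buckets)]
}

// Get is the pure message: it only reads the immutable list, so it needs
// no lock and cannot block the sender.
func (m *actorMap) Get(key string) (string, bool) {
	for n := m.bucketFor(key).head.Load(); n != nil; n = n.next {
		if n.key == key {
			return n.value, true
		}
	}
	return "", false
}

// Put is the impure message: it blocks on the bucket's mutex, builds a new
// immutable list sharing the old tail, and publishes the new head.
func (m *actorMap) Put(key, value string) {
	b := m.bucketFor(key)
	b.mu.Lock()
	defer b.mu.Unlock()
	b.head.Store(&node{key: key, value: value, next: b.head.Load()})
}

func main() {
	m := newActorMap(1024)
	m.Put("txid", "utxo-record")
	_, _ = m.Get("txid")
}
```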
The code sending the message (which may be inside of another Actor’s message processing function) calls the Actor’s function, and thus there’s no context-switch required. In order to simulate a message queue (so that the calling Actor can complete instead of blocking), have the called (impure) Actor queue it and return. The called Actor must have registered itself to be called periodically by a thread pool (or registered as a callback, e.g. of another Actor) to process its queue.

The other key insight I had is that these Actors can run atop green threads. So for example when blocked on the message mutex lock, the green thread can context-switch. @keean initially had the thought that green threads don’t preempt, but I explained (c.f. footnote 1 in my prior post) they can be made to software-“preempt” (aka cooperatively preempt) at key check-points where they can block, and Actor message passing calls would be one of those check-points. If the called Actor is only executable by threads running on another core (or group of cores if more than one core per software-driven-coherency cache), then green threading can context-switch from the calling Actor (to another Actor on the same core / cache group). Except if the return type of the called Actor is …

Note although we’re describing the model of how Actors work locally on the same CPU, the message passing Actor model also scales across the bus and the network. For example, the UTXO records could be split up across servers in a cluster.

¹ There’s no panacea paradigm which will prevent all cases of dead and live locks, i.e. the solution to the dining philosophers problem requires omniscience. Neither the Actor model, immutability, nor Rust’s exclusive mutable borrowing will eliminate all opportunities for dead or live locks. For example, the issuer of a message could be waiting on a response from the recipient of the message in order to complete some operation, yet the recipient of the message could (even if transitively) end up waiting on the issuer before it can complete its operation and reply to the issuer.

In addition to this amazing insight above, I also realized when writing my prior post that compile-time, provably safe, zero-overhead allocation will not require the onerous Rust lifetime analysis² for Actor function code, because these functions always return (to the top level of their stack, thus total deallocation of all non-shared objects) and per our model shared data will always be reference counted. So a simple model is: any object whose address the code takes and passes as a pointer in a function call or as a return value is forced to be allocated on the bump-pointer young object heap. All the other objects get allocated on the stack. The young object heap can be entirely collected by simply resetting the bump pointer when the Actor function returns (to the top level of its stack)! Voila! Amazing paradigm-shift! Note we should also try to minimize the number of duplicate objects allocated on the bump-pointer heap.

AFAICT, although a very simple concept, this last insight of mine is a paradigm-shift (a technological tour de force) of epic proportions in terms of impact! The performance of Rust and C++ without the clusterfucked complexity. Note thus one mustn’t store, in any reference counted object, any pointer that points into the stack or the bump-pointer heap.
And one mustn’t store, in any object on the stack or the bump-pointer heap, any pointer that points into any reference counted object that can have its reference count decremented by that Actor before that Actor function returns. For example, a reference counted object whose reference is created by a function would decrement the reference count when the function returns, unless that reference is returned to the caller of the function.

² Which is especially undesirable not just because of the tsuris but because Rust can’t model all (c.f. also) forms of lifetimes and thus is inflexible and incomplete.

Someone commented and I replied:
Erlang has process-level context switches, which is very preemptive (although apparently also cooperative) but has very heavyweight overhead. Thus it can’t eliminate the L3 cache for scaling multi-core as I proposed in the prior post. Also AFAIK Erlang doesn’t guarantee the queue-less message passing that @keean and I devised, which is really critical to lowering the overhead of Actors to zero as I explained, and also critical for implementing software cache coherency as I explained in the prior post. Also Erlang does not have the stack allocation I proposed and instead has automated generational GC, which I have pointed out is incompatible with multi-core scaling. Yes, Erlang and Actors are the inspiration, but we have proposed significant innovations, and yes the Erlang PL is also lacking our ideas for a type system and other features we’d like to have for a complete PL, such as modules.

Also I think that with Zer0 I will likely first target Go as the compile target and VM. Although in this case the reference counted objects (as declared in the code) would actually be garbage collected by Go’s GC on the Go compile-target, the important point is that Zer0 can be compiled in the future to more performant code and will support a multi-core scalable future. Note by targeting Go initially and cooperative preemption (instead of Erlang BEAM VM’s process-level preemption, which is apparently also partially cooperative), we could, as compared to Erlang (c.f. also), better integrate SIMD and other assembly-level optimizations without corrupting the runtime as NIFs can do in Erlang. We’d still need to compile cooperative preemption into the SIMD code (c.f. footnote 1 in my prior post).

The Actor-like design contemplated in this post would rectify the following complaints about Erlang:
In response the Quora question What programming languages do software engineers use in 2018?, this answer is more evidence to think that the proposed Actor model in this thread is the future:
This might interest you.
@keean remember I argued that we’d see an endless stream of speculative execution vulnerabilities. Here’s a new one: https://www.zdnet.com/article/beyond-spectre-foreshadow-a-new-intel-security-problem/
@shelby3 The problem is not speculative execution itself; the problem is that the engineers assumed that because the results would be thrown away, they did not have to carry out all the security checks on the speculative branch. I don't think AMD are vulnerable to these new 'Foreshadow' attacks.
@shelby3 "there’s no bump-pointer heap that I know of that didn’t have the integration overhead of being part of a mark-and-sweep GC scheme" Cheney semi-space? |
@jdh30 my proposal is that the objects which are shared between actors (and thus can be longer lived) must be ARC. Thus the bump-pointer heap is only applicable to non-shared objects, which presumably are shorter-lived or at least reasonably well correlated w.r.t. lifetimes even if longer lived. If either of those two hypotheses (the young generational, or at least correlated, generation) is violated, then we need alternatives for the programmer to work around the inefficiency.

The entire (local) bump-pointer heap is reclaimed by resetting the bump pointer at the termination of the actor function which processes an incoming message. Thus there’s no need for mark-and-sweep, because there are no objects remaining in the heap: the collection cycle is upon termination of the processing of the incoming message.

I have also augmented my original proposal by acknowledging we probably also need the ability to instantiate region-based memory allocation (aka the Arena pattern), where each region is separately bump-pointer allocated. Thus a region is only ever freed all at once, with no mark-and-sweep. This increases efficiency for use cases that are compatible with such a model.

I posit that these proposed models, in addition to avoiding thread synchronization of collection cycles, may cope better in some use cases (because they provide alternative assumptions about the duration and cordoning of the generation) with the issue you mentioned in your blog, as quoted below, about violations of the young generational hypothesis assumption (which you’ve also mentioned can sometimes be ameliorated by eliminating superfluous boxing and allocations):
You noted in another of your blogs that timely young generation collection cycles are important for maximum performance of (what you refer to as ‘symbolic’¹) code that produces a lot of young garbage. So my proposed model above would require some intervention by the programmer to subdivide his algorithm between actors such that collection cycles are frequent enough to maximize performance. But this would conflate concerns, because the Pony capabilities actor model is also about data-race-safe threaded sharing. Thus the need for regions, and I think we also need to add the option of regioning by function scope in addition to the proposed regioning by region id (a sketch of the region idea follows below).

Will the programmer be capable of this static partitioning of allocation (i.e. is it a deterministic phenomenon), or is AMM needed to stochastically solve a non-deterministic phenomenon? My thought is that although program demands can be non-deterministic, the algorithms can sometimes be broken down into relatively deterministic parts that are combined non-deterministically. This will depend on whether the algorithm can efficiently and deterministically identify which objects it wants to copy from the short-lived, frequently collected, bump-pointer-allocated region. For algorithms that can’t, mark-and-sweep is perhaps required to maximize performance. For n-Queens it seems all the young garbage can be eliminated by employing a mutable array instead of creating temporary copies, and then zeroing out the rows when backtracking.

Again I want to reiterate that fully AMM seems to not scale to massively multicore, which is where the industry is rapidly headed now. The issue is more complicated than just stopping-the-world or not; as I explained up-thread (kudos to @keean for the initial insight), it also involves scaling the CPU’s cache coherence overhead. Pony’s local mark-and-don’t-sweep, global reference counting via message passing AMM model seems to have performance issues. So it seems we need a different model. I’m trying to figure out what the model should be. Note the cache coherency issue is mostly due to sharing between threads, which is mostly resolved by Actors, yet may also have interplay with the GC strategy.

¹ Presumably this means immutable objects and pure functional equational reasoning à la Haskell?
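A rough Go sketch of the region idea (illustrative only; a real implementation would live in the compiler's runtime): a bump-pointer arena that is reclaimed all at once, for example when the actor's message handler returns, with no mark phase and no sweep.

```go
package main

// arena is a bump-pointer region: allocation is a pointer increment,
// and the whole region is reclaimed by resetting that pointer.
type arena struct {
	buf  []byte
	next int
}

func newArena(size int) *arena { return &arena{buf: make([]byte, size)} }

// alloc hands out n bytes (8-byte aligned); it never frees individually.
func (a *arena) alloc(n int) []byte {
	n = (n + 7) &^ 7
	if a.next+n > len(a.buf) {
		panic("arena exhausted: spill to a larger region or the shared heap")
	}
	p := a.buf[a.next : a.next+n]
	a.next += n
	return p
}

// reset reclaims everything at once, e.g. when the actor's message
// handler returns; no tracing, no sweep, no per-object work.
func (a *arena) reset() { a.next = 0 }

// handleMessage models an actor's handler: all temporaries live in the
// region and die together when the handler returns.
func handleMessage(a *arena, payload []byte) {
	defer a.reset()
	scratch := a.alloc(len(payload))
	copy(scratch, payload)
	// ... process scratch ...
}

func main() {
	a := newArena(1 << 20)
	handleMessage(a, []byte("incoming message"))
}
```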
Sorting out which hardware I will buy as I need to upgrade my i7-4770 desktop and prepare to be portable and mobile (will be relocating and need to diversify my work venues, not always stuck inside my cave, such as coding at the beach or in a cafe):
@shelby3 I think the zenbook Pro is nice, although a 15" zenbook Pro might be better for extended use. I agree that it makes more sense to rent CPU time, as either cloud instances (like AWS EC2), or even server-less compute time (like AWS Lambda). I normally keep source code in a remote repository too, and use cloud office software (Office 365 or Google docs). I try to keep nothing of value on the computer, so if it gets lost, stolen, or just breaks I don't lose too much. However I think a decent CPU and GPU are useful, as edge computing is definitely also a factor: mobile devices are the most used devices to access online services, and as such you want to support offline use, as you are not always connected to the network. This doesn't have to be the complete functionality, but should allow people to get work done offline. For example, edit a document you previously cached locally offline. For big processes you transfer them to a cloud cluster where they can run in the background and notify you when complete. As a general rule I would run interactive processes locally, and queue batch jobs to run remotely, as well as store shared/long-lived data remotely.
@keean I don’t need that screen inside the touchpad on the Pro. I would prefer not to have the numeric pad (or in the touchpad instead of¹) on the keyboard. The Pro isn’t yet refreshed for Whiskey Lake and I don’t think I want a hexa-core processor with a 35 TDP in a laptop. So I’m currently thinking the recently refreshed Zenbook classic series, although I really don’t need a GPU so I might just go for a cheaper Vivobook in anticipation of replacing the laptop again in 18 months. The refreshed Vivobook S only has a 42Wh battery though. Frustrating that these laptop manufacturers hold back certain features to force us to buy their most expensive model which has other features we don’t need. I don’t want to buy a Dell because my two experiences with Dell laptops were lemons. And I prefer to buy only in a store where I can test the unit before committing to purchase. The latest generation Thinkpads just aren’t the quality they used to be and also there’s none available retail in the Philippines. I’m in ASUS land. I agree we still need some compute horsepower locally. My blog is both addressing the current situation and also the future. The eventual future is ubiquitous connectivity. And I think eventually basically nothing will run efficiently locally on the compute resources an individual could afford to own exclusively. There will be peak demand spikes because for example complex AI and data analysis will be incorporated into software. Our programming compilers might become much more complex and do complex (even whole program) analysis. My basic point for the near-term is for many people it applies to not sacrifice mobility to try to cart around always enough compute resources for every possible software scenario. An eGPU can have much more horsepower and then you can cart that around when you really need it, otherwise leave it on your desktop. For us if we need to run a multi-threaded compiler than we can have a ThreadRipper on our desktop which is even portable, but not mobile. We can connect to it from our laptop when we dock. Some current workstation and gaming users have needs that can be met with a laptop (e.g. Photoshop certified or AutoCAD certified), but I think the parallelization phenomenon is going to unleash software that simply won’t run on any laptop form factor. Also mobility has become a focus for me lately not only because I need to travel, but because being stuck inside 14 x 7 is taking its toll on my health (at age 53.5). I need to get outdoors more both for my eyes (the blue light of the monitor may impact sleep patterns and hormones as well) and also to get daily sun exposure for vitamin D. Also for mental health. Btw, I read today that Intel’s original 10nm process would have been transistor density equivalent to AMD’s 7nm. Apparently though Intel is scaling back the sophistication of their 10nm process so that they can get sufficient yield to ship by late 2019. Thus it may not end up actually equivalent. ¹ Although I saw a video which points out that the numeric pad in the touchpad doesn’t work that well because it can’t keep up with very fast typing. Changing topic a bit, it seems with recent announcements that AMD is moving towards that sharing data between cores (or at least 8-core chiplets) will have more latency and cost (Infinity Fabric will remain on 14nm for the time being). Perhaps we should deprioritize being able to not copy when sending data between Actors as that will become a nearly useless optimization in the future. 
This design point may impact the GC issue we have been discussing in the other thread. Also I read the new Whiskey Lake Intel mobile CPUs have some hardware fixes for Meltdown. I may opt for the i7-8565 with a 15W TDP. Seems to be the sweet spot that retains battery life, mobility and which neither Intel nor AMD will be able to significantly improve upon for at least a year. Will eventually probably also build a ~10L case Ryzen or Threadripper for more local compute resources when I’m at my desk. https://fuse.wikichip.org/news/1815/amd-discloses-initial-zen-2-details/ http://www.tomshardware.com/answers/id-3824835/noiseless-cooling-system-amd.html |
@jdh30 wrote:
I misapplied the terminology. My intention was to state that I know of no bump-pointer heap that isn’t integrated with some form of tracing garbage collection. I conflated the generalized category of a tracing garbage collector with the mark-and-sweep variant of said collector.

Mark-and-don’t-sweep

As I understand it, the mark-and-don’t-sweep variant differs from mark-and-sweep in that only reachable objects are ever visited (i.e. marked) by the collector. Tracing and marking only occurs when all the memory fragments in free space have been marked as reachable objects by the memory allocator, which allocates from fragments marked unreachable, changing their mark to reachable. Some of those objects marked reachable may have since become unreachable. Then the mark bit’s meaning is flipped to mean unreachable (i.e. free space), and before proceeding, tracing+marking flips the actual bit (i.e. not its meaning) for those objects that are still reachable.

Pony’s mark-and-don’t-sweep stores the mark bits in a separate list instead of with the objects (but tracing of course still has to visit the reachable objects in order to trace the references stored in them), which I suppose has the benefit that tracing and marking doesn’t have to wait until all the memory fragments in free space have been marked as reachable objects, because it will not need to visit all the objects marked as unreachable in order to flip those mark bits to reachable before the aforementioned flipping of the meaning of the mark bit.

Related to @jdh30’s point about the importance of cache locality, I also read on pg. 3 of Danko Basch et al., “Mark without much Sweep Algorithm for Garbage Collection”:
Tri-color abstraction

For readers who are learning about tracing garbage collection, I find the equivalence of the tri-color abstraction easy to understand in the context of Cheney’s algorithm. Cheney’s algorithm is an improvement over the naive algorithm because it reduces the overhead of updating the addresses of references to reachable (i.e. grey or black) objects when they are moved (i.e. copied before from-space is deallocated) to the to-space.
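To make the tri-color picture concrete, here is a toy Go sketch of my own of Cheney's semi-space copy, using slice indices in place of raw addresses: objects between the scan position and the end of to-space are grey, objects behind the scan position are black, and anything never copied out of from-space is white garbage.

```go
package main

import "fmt"

// obj is a heap object identified by its index in a space;
// refs holds the indices of the objects it points to.
// forwarded is its index in to-space, or -1 if not yet copied.
type obj struct {
	refs      []int
	forwarded int
}

// cheney copies every object reachable from roots into a fresh to-space.
func cheney(from []obj, roots []int) (to []obj, newRoots []int) {
	copyObj := func(i int) int {
		if from[i].forwarded >= 0 {
			return from[i].forwarded // already copied (grey or black)
		}
		to = append(to, obj{refs: append([]int(nil), from[i].refs...), forwarded: -1})
		from[i].forwarded = len(to) - 1
		return from[i].forwarded
	}

	for _, r := range roots {
		newRoots = append(newRoots, copyObj(r))
	}

	// Everything between scan and len(to) is grey; scanning an object's
	// references (and copying their targets) turns it black.
	for scan := 0; scan < len(to); scan++ {
		for j, ref := range to[scan].refs {
			to[scan].refs[j] = copyObj(ref)
		}
	}
	return to, newRoots
}

func main() {
	// 0 -> 1 -> 2, while 3 is unreachable (white garbage left in from-space).
	from := []obj{
		{refs: []int{1}, forwarded: -1},
		{refs: []int{2}, forwarded: -1},
		{refs: nil, forwarded: -1},
		{refs: []int{0}, forwarded: -1},
	}
	to, roots := cheney(from, []int{0})
	fmt.Println("live objects copied:", len(to), "new roots:", roots)
}
```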
Tying all our discussions into a summary and some additional insights:
I wrote on the blog Actors are not a good concurrency model:
I forgot to point out explicitly that side-effects occur even in referentially transparent programs, as I explained elsewhere:
Remember it was ESR’s blog about SICK algorithms that prompted the discussion that led to me writing the OP of this thread, and now he (@eric-s-raymond) has blogged again about his skepticism about exploitable parallelism in software. And again I am going to disagree with some (not all) of his reasoning (and respectfully with @keean, if @keean hasn’t yet come around to my perspective). I will respond to some specific excerpts from his blog as follows.
Agreed. Parallelism requires the ability to subdivide the data space.
The crux of my rebuttal has been and remains that the quoted is a narrow-minded framing of a perspective on what is actually blocking parallelism. We as creative humans who adapt to our resource environment will necessarily lift our paradigms such that we overcome those SICK limitations.

Let me give an example to illustrate how this is so. Someone refuted Eric on Dijkstra’s shortest path. Who would have thought that Pi would by now be mathematically proven to be computable in parallel, and, “The Law of 1998, Why Would Anyone Ever Need 16GB of RAM?”. Let’s take the necessarily sequential chaining of a cryptographic hash example, since reinventing blockchains is my current area of focused research other than my study of programming language design. We subdivide the blockchain into shards instead.

Additionally there’s a more fundamental myopia that plagues this focus on deterministic SICK algorithms: our universe isn’t deterministic. Exact deterministic solutions become less meaningful than stochastic approximations as the context of their application broadens beyond a deterministic partial order to a stochastic reality. Which will become more and more the case in this interconnected world, where our applications are not just self-contained in the deterministic space of our personal desktop.

Humans will paradigm-shift the problem set as required. That doesn’t mean there won’t still be legacy paradigms that weren’t well thought out in terms of the new realities about where Moore’s law is still advancing. I haven’t studied Eric’s Reposurgeon project to know if I would have any insights on whether it’s possible to paradigm-shift that legacy data structure transform. Apparently back references into the DAG are random and deep, so I suppose perhaps some batching, grouping algorithm in order to generate data locality could be contemplated. EDIT: one of the comments on Eric’s blog also had a similar idea. But I also don’t understand why his DAG couldn’t be subdivided and then have the divisions communicate via message passing. Is the overhead of message passing greater than the performance speedup of parallelizing on multicore? Does the transform algorithm have to run sequentially, or can there be parallel transformation pipelining? So as I said, I would need to study his problem set in detail before I could comment intelligently. Even if message passing would work, that wouldn’t make Python’s multi-process model appropriate, because message passing between processes has more overhead than between concurrent green threads. Python processes run effectively single-threaded because of the GIL.

Nevertheless my disagreement with his overly pessimistic skepticism remains.
Eric seems to not know about the Pony language model.
Because programmers have not paradigm-shifted yet. How many programmers even know the Pony language model exists? And also Pony has some flaws which need to be rectified, which I have been writing about. This is why I’m contemplating creating the Co programming language (recently renamed from the original Zer0 name).
I have been saying this for many months around here as well. EDIT: I am not disagreeing with the point that such massively multicore CPUs will be specialized. In this thread, for example, I noted that with the Pony model we could in the future have software cache coherency, because extant hardware cache coherency is an Amdahl’s-law limitation on scaling massively multicore. But note that the hardware model remains a general-purpose stack machine and not a GPU-like register-file vector machine. I have stated in this thread that we will likely only need a few general-purpose CPUs heavily designed for optimizing serial general-purpose algorithms. EDIT#2: Eric apparently now understands that games are one example where the massively multicore CPU is utilized (and not just the GPU), and I agree with his statement (quoted below) that the lack of tools (especially the programming language we need, c.f. also and also) has been holding programmers back from making the transition to massively multicore:
META: Tangentially, I explain herein why I am unable to respond directly on Eric’s blog: he banned me. Apologies in advance to @keean for needing to insert this here, but if Eric is going to read this then I want him to know my opinion that he needs to be hit upside the head with a metaphorical baseball bat to awaken him from whatever hole he crawled into about the realities of his juvenile banning shit. Wouldn’t it be more informational if Eric had the courage to face me on a neutral discussion forum like this one, where he has to actually win the arguments rather than cut me off in the middle of an argument with his censorship? For example, for him to try to refute the facts (c.f. also and also).

As I remember, it all began circa 2011 when, on the 10th anniversary of 9/11, I tried to do some justice — to, for example, my heritage as a descendant of Isaac Shelby, who fought for the freedoms Eric now enjoys in our country — on his 9/11 blog about what really happened on that day. Some of his regulars were vocal (and condescending, brow-beating, ridiculing) about how offended they were by the notion that 9/11 could have been an inside job, which I guess caused Eric to dislike me. Eric refers to such facts as babbling nonsense. I am also a U.S. citizen and I refuse to tolerate the corruption of our country. And I am dismayed that he is unwilling to host the necessary transparency. Mr. open source and all. My study of 9/11 is much more solid by now than it was in 2011; back then I didn’t have the knowledge to concisely refute the bellicose interactions on his blog. Eric might think his bans are justified because of those who he thinks derail with off-topic discussion, but his regulars also do this quite often and he allows it. And the discussion I engaged in was a natural progression of, or fundamentally related to, the thread subject (though according to Eric’s ignorance it was all babbling nonsense anyway). I wonder if ESR has ulterior motives or conflicts of interest. Does he need to attract more donations to his Patreon by grandstanding to the anti-semitism political-correctness totalitarianism? That doesn’t mean I hate Jews!

Oh yeah, I know he talks a good talk about his min-anarchist ideology, but he also seems to love politics, grandstanding, and censorship. On the one hand, one can have some appreciation, love and respect for his contributions and talents, yet be entirely frustrated by his Herculean ego and the nastiness that manifests in banning discussion and the written venom he spews towards people whose opinions he disagrees with (e.g. towards Jeff Read). Btw, I usually ideologically agree with Eric when he ridicules Jeff, but I cringe at the example he sets in terms of etiquette and discourse, especially when it is political and ideologically based discussion (Jeff being an Apple and walled-gardens fanboy). At least when Linus does it in email, it is nearly always ridicule about poor quality code, which is not an ad hominem attack but rather just being frank about low quality work. That’s a Southern gentleman issue. In the South, speaking to someone the way Eric does should be resolved with a duel (and no, this isn’t a veiled threat). You just don’t treat people that way even if you are correct. He would retort that noise isn’t signal, but then he must refute those facts about 9/11 I linked above which he has been censoring (as one example of the myriad signal he censors).
That is a way of saying I am a person who habitually forgives and forgets, hoping to move forward in harmony. But for some reason that guy can’t tolerate me. Maybe because I’m a Southerner and he is a city-slicker Northerner? Every several months or so, I wander over to his blog to see if I can learn anything important. Sometimes I find blogs within my area of knowledge and I want to offer some comments. I am not stalking the man. Cripes, I have more important things to do. Now Freeman Dyson, that is a man I can respect. And sadly he is approaching centenarian age, so he may not be with us that much longer. An amazing combination of intelligence and a lack of outwardly nasty ego. Maybe because Freeman doesn’t even put himself in those situations in the first place. After all, Freeman doesn’t need a blog of sycophant regulars to validate his writing. Freeman is busy inside his head and talking with other scientists, as I am busy coding, doing my own thing. Anyway, for the record, that is why I can’t post my comments on Eric’s blog and instead do so here. P.S. While the sycophants play with tinkertoys, the plot thickens.
As I postulated upthread, moving towards software-driven cache coherency and minimizing the movement of data between cores:
@shelby3 it's great to see people trying to make progress in this area, however I am not hopeful for success. Cache is very regular (that is, it is easy to construct the blocks from hardware 'macros' when designing the chip), so in terms of hardware design effort it is cheap compared to complex parts like instruction decode, ALUs and control logic, whilst at the same time producing a very large increase in performance (10–1000x). In my experience people want CPU performance more than they want battery life, otherwise they could have a laptop that lasts days by using low-power ARM cores. Also compatibility with existing software is important. Look at failures like Transmeta for example, and note that nobody uses x86 emulation on ARM. You have to consider whether Microsoft's Windows-on-ARM laptop will succeed (probably not, because it cannot run x86 Windows software), compared to a Chromebook, which can be viewed as a big phone with a keyboard and can run all Android software.
Your argument is only true for some markets. Laptop != mobile. Mobile users typically want (at least a day’s) battery life before prioritizing performance. Business laptop users (not gamers) typically want (at least a few hours’) battery life before prioritizing performance. ARM is killing Intel on mobile. ARM is preparing to make a run at the server market. Rex Computing is, I believe, targeting specialized supercomputers.
Rome is not built in a single day. Intel and hardware cache coherency will die first in peripheral markets before finally succumbing in the rest of mainstream computing.
@shelby3 Didn't work for the PlayStation 2, which had a non-cache architecture, but the PS3 (and PS4) went back to a traditional architecture because it was too difficult to program for the PS2.
We need a PL that supports the model and is popular. Rome is not built in a day. We are actively developing Co now. Also the PS2 did not have enough cores to make it worthwhile. When the s/w cache systems have 20X more cores per watt (and thus per TDP limitation), the incentive increases.
@shelby3 software control of scratch memory is a pain. The problem is that it makes it the compiler's job to control the cache, but the compiler cannot anticipate runtime scheduling. It may save power, but it won't work anywhere near as well as a hardware cache that can account for the actual runtime scheduling of instructions and interrupts.
@keean did you forget my upthread point that with "Actors" (actually CSP), cache coherency between cores can be managed in s/w at the juncture of sending messages? Removing that h/w responsibility will save power, and with the right PL it will not be hard to program. Again I reiterate that this requires thinking about global memory as partitioned into "Actors". I agree with you that cache set associativity and load/store (i.e. not coherency between cores) still has to be managed in h/w, because that is very difficult to manage in s/w generally. But that is not where all the power-consumption gains are coming from. Did you fail to read, remember, or assimilate the post I linked upthread? It clearly explains that partitioning the global memory for exclusive access by some core(s) drastically reduces the power consumption:
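To make concrete what I mean by handling coherency in s/w at the message-send juncture, here is a rough sketch of mine in C (this is not Pony’s actual runtime, and writeback_range/invalidate_range are invented placeholder names for primitives that a hypothetical software-coherent machine would expose): ownership of the payload transfers with the message, so coherency work is needed only at the send/receive boundary rather than on every load and store.

```c
#include <pthread.h>
#include <stddef.h>

/* One-slot CSP-style mailbox between two "actors" pinned to different cores. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    void  *payload;                 /* ownership transfers with the message */
    size_t len;
    int    full;
} Mailbox;

#define MAILBOX_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, 0, 0 }

void mailbox_send(Mailbox *mb, void *payload, size_t len) {
    /* writeback_range(payload, len);   <-- hypothetical: flush the sender's
       dirty lines for the transferred region before publishing it */
    pthread_mutex_lock(&mb->lock);
    mb->payload = payload;
    mb->len     = len;
    mb->full    = 1;
    pthread_cond_signal(&mb->ready);
    pthread_mutex_unlock(&mb->lock);
    /* The sender must not touch `payload` after this point. */
}

void *mailbox_recv(Mailbox *mb, size_t *len) {
    pthread_mutex_lock(&mb->lock);
    while (!mb->full)
        pthread_cond_wait(&mb->ready, &mb->lock);
    void *p = mb->payload;
    *len    = mb->len;
    mb->full = 0;
    pthread_mutex_unlock(&mb->lock);
    /* invalidate_range(p, *len);      <-- hypothetical: drop any stale lines
       before reading the region the receiver now exclusively owns */
    return p;
}
```

Because a Pony-style capability discipline guarantees the sender retains no aliases to the payload, nothing outside these two junctures ever needs the hardware to snoop other cores’ caches for this data.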
I emailed Rex Computing asking them to read my prior post.
Apparently compiler-driven methods are approaching h/w-controlled methods: https://ece.umd.edu/~barua/udayakumaran-TECS-2006.pdf
Note that the paper seems to support the fact that most of the power-consumption gains are not from eliminating the h/w management of the cache:
Unfortunately I can’t read the Rex Computing paper, as it is behind a paywall and I refuse to pay.
Massively multicore has arrived… From a private message:
The bolded has been my prediction and my salient point. I repeat my upthread arguments. The constant speedup of OoOE pales in comparison to the exponential speedup from maximizing the number of cores. And (wasted[1]) power consumption is a huge if not the #1 factor now; it impacts the design of the silicon because power dissipation is a huge problem and constrains the degrees-of-freedom w.r.t. positioning of components on the silicon. OoOE is power-consumption inefficient and growing more insignificant as a facet of the expanding whole (as I anticipated), especially as they discover they can trade away at least the intensity of the OoOE, if not the ILP (instruction-level parallelism) entirely, in exchange for more cores and lower power consumption at some point in the future.

[1] Wasteful because it is speculative and thus computes some fraction of the whole which is discarded. Attempted mitigations of Spectre will increase the waste. There will never exist a 100% solution to every possible side-channel attack that can be devised, and thus speculative execution will eventually be accepted to be a security bug and discarded. Go ahead and try to debate me on this point.

Another factor is that green threads are becoming the new paradigm for massive concurrency, and these essentially render OoOE less effective because they frequently purge the register file and instruction pipeline. As the number of green threads proliferates into the billions and trillions, such that user-layer context switches become more frequent and granular, OoOE may become a net negative when factoring in the power and die-area costs relative to the thread parallelism that could be gained by repurposing the silicon and lowering the TDP, which impacts the holistic economies-of-design. And even if I am incorrect about ILP disappearing or diminishing from the silicon, it is a fact that it is diminishing as a portion of any speedup and scaling of Moore’s law. If Moore’s law stops, computing dies. So whether it becomes deprecated or not, ILP is now becoming entirely irrelevant to the future of computing. And this was simply obvious to me.

Given that green-threading context switches attempt to align with purges of the instruction pipeline (e.g. branches and function calls) to minimize costs, perhaps ILP/OoOE will become to some extent more explicitly guided by software (statically or dynamically)? Why should the hardware need to attempt to infer the static knowledge that a loop will not exit (and the runtime will not context switch) until it has looped N times, wherein N is known dynamically before the loop begins? There is semantic knowledge that the hardware is blind to, from which it could benefit, which was one of my key theses.

To expound a bit on why Spectre can never be entirely fixed, let me first quote from the research paper:
So the vulnerability is due to the fact that the speculation is not random and is in fact driven by the runtime data that is going to be pilfered by the attack. You may presume you can foreclose every footprint of the speculative execution available to side channels (e.g. other cores), but in general this is impossible because timing attacks can even be listened to by means outside the CPU entirely! Wrap your mind around that. So the attacker does not even need to find a footprint within the CPU itself. The mere presence of speculative execution is sufficient. This is why cryptographic algorithms are designed to run at constant speed regardless of the inputs, so that they are immune to timing analysis.
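A minimal sketch of what "constant speed regardless of the inputs" looks like in code (the usual textbook pattern, not any particular library’s implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Constant-time equality check: the amount of work does not depend on where
 * (or whether) the inputs differ, unlike an early-exit memcmp, which leaks
 * the position of the first mismatch through timing. */
int ct_equal(const uint8_t *a, const uint8_t *b, size_t n) {
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);   /* accumulate differences, no branch */
    return diff == 0;                     /* 1 if equal, 0 otherwise */
}
```

The point of the comparison is that speculation reintroduces data-dependent work, and therefore data-dependent timing, that the programmer never wrote and cannot audit at the source level.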
Ampere announces 128-core Arm server processor
[…]
Amazon Launches a Killer ARM Server Chip With the Graviton2
[…]
It’s Official: NVIDIA To Acquire Arm For $40 Billion
[…]
Nvidia’s $40 billion Arm acquisition is about bringing AI down from the cloud
[…]
Nvidia's Arm Deal Would Make It the Center of the Chip World
[…]
[…]
Remember I noted that ARM supports "tagged pointers," which can make null pointers, tagged unions, dynamic dispatch for existential quantification, ARC and/or interior pointers (i.e. those that point within an allocated block of memory) more efficient (see the sketch at the end of this comment). I wrote:
I wrote on Sep 10, 2018:
I wrote on Oct 20, 2018:
I wrote on Nov 15, 2018:
One of the reasons for the improved cost-effectiveness of ARM servers is the lower heat due to lower power consumption. I will end up being correct. I usually am proven correct over the long term.
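To make the earlier tagged-pointer point concrete, here is an illustrative sketch of low-bit pointer tagging in C. This is my own generic illustration of the idea, not ARM’s top-byte-ignore feature itself: with 8-byte-aligned allocations the low three bits of every pointer are zero, so they can carry a small tag, e.g. a tagged-union variant or a null/interior-pointer flag, without spending a separate word.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uintptr_t tagged_ptr;

/* Pack a small tag into the low bits of an 8-byte-aligned pointer. */
static tagged_ptr tag(void *p, unsigned t) {
    assert(((uintptr_t)p & 7u) == 0 && t < 8);
    return (uintptr_t)p | t;
}

static unsigned get_tag(tagged_ptr tp) { return (unsigned)(tp & 7u); }
static void    *get_ptr(tagged_ptr tp) { return (void *)(tp & ~(uintptr_t)7u); }

int main(void) {
    int64_t *x = malloc(sizeof *x);    /* malloc result is suitably aligned */
    *x = 42;
    tagged_ptr tp = tag(x, 3);         /* e.g. variant #3 of a tagged union */
    printf("tag=%u value=%lld\n", get_tag(tp), (long long)*(int64_t *)get_ptr(tp));
    free(x);
    return 0;
}
```

Hardware tagged-pointer support (as on ARM) lets a runtime do the same sort of thing in the unused high bits of the address without masking on every dereference, which is where the efficiency claim comes from.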
Unfortunately this discussion started in the Error handling thread.
And the discussion derived from a post I made about C no longer being a low-level language, because of its mismatch with the fact that all scaling must come from parallelism.
I hope that discussion will continue here instead.
How can we ever find any of our 1000s of comments of past discussion if they are buried in unrelated threads?