
Can (non-standard) stacks be preserved with implicit PTC, and does this matter? #6

littledan opened this issue Apr 14, 2016 · 13 comments


@littledan

One concern raised at the March 2016 TC39 meeting was whether, and how much, it hurts developers if stack frames are lost as part of implicit PTC. Although this had been considered in the past and TC39 consensus still came down on the side of implicit PTC, it may warrant new consideration now, both because the DevTools teams of several browsers are being brought into the discussion in a way they hadn't been before, and because @erights is raising the possibility of standardizing error stacks.

The problem: In a naive PTC implementation, the caller's stack frame is completely eliminated for all tail calls, even those which are not part of a loop and which may be present in legacy code. This is visible in

  • DevTools debugging/profiling
  • programmatically in error.stack (which is sometimes used for remote debugging).
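For illustration, a minimal sketch of the effect on error.stack (the names and the exact trace are illustrative, and error.stack itself is non-standard):

```js
'use strict'; // PTC only applies in strict mode

function inner() {
  // Under implicit PTC, outer()'s frame has already been discarded by the time
  // control reaches here, so it cannot appear in any stack trace captured below.
  throw new Error('boom');
}

function outer() {
  return inner(); // a call in tail position: the caller's frame may be elided
}

try {
  outer();
} catch (e) {
  // In an engine that implements PTC, "outer" is typically absent from this trace.
  console.log(e.stack);
}
```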

Possible solutions

  • JSC has the "ShadowChicken" strategy of maintaining a shadow stack that records ECMAScript stack frames and which can be garbage collected. I'm not sure whether this is planned to be on all the time or just when developer tools are open. @msaboff, could you correct any mistakes in my explanation here and give more information?
  • Other algorithms?
  • Explicit syntax for tail calls might not require any ShadowChicken-like strategies because developers would be specifically marking their tail calls, which would mean that they wouldn't have the expectation of maintaining them on the stack. Existing code would keep its current (non-standardized) stacks.
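As a rough illustration of the last option, one strawman spelling that has been discussed is a "return continue" marker. The exact syntax is not settled, and the snippet below will not parse in any current engine; it is purely illustrative:

```js
// Strawman syntax only; nothing here is standardized and this will not parse today.
function sum(n, acc = 0) {
  if (n === 0) return acc;
  // The explicit marker opts this one call in to frame elision:
  return continue sum(n - 1, acc + n);
}

// An unmarked tail call would keep today's (non-standardized) stack behavior:
function sumWithFrames(n, acc = 0) {
  if (n === 0) return acc;
  return sumWithFrames(n - 1, acc + n);
}
```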

Size of the impact
It's unclear how important it is to developers to not have stack frames elided. @jeffmo mentioned at the March 2016 TC39 meeting that Facebook uses collected stack frames for debugging, and would prefer if these stacks maintained their current semantics. It would be great to have feedback from other web developers about how/whether this would impact them as well. If, for example, a ShadowChicken-like strategy comes with a performance regression which is acceptable for DevTools but not on-all-the-time, then this would affect use cases like @jeffmo's.

@littledan

I'm curious whether other implementors and users find ShadowChicken to be an acceptable strategy.

@bmeurer from V8 team expressed this concern about ShadowChicken:

[ShadowChicken] is a best effort approach, which means you might be lucky or not. The shadow stack is only kept alive as long as the GC thinks it's not worth collecting it (or at least parts of it). So the debugging experience depends on GC timing, GC heuristics and memory pressure. This for example can lead to weird situations where you hit a problem in your code and see tail deleted frames in the console stack trace, then you start debugging, but now you no longer see the tail deleted frames because for DevTools we have to allocate mirror objects and other things which triggered flushing of the shadow stack. This might only hit you in 0.1% of the cases, but it makes debugging an unpredictable and therefore unpleasant experience, especially when you don't have (and don't want to have) the insight into the VM to understand why this happens. And even if you have that, you're probably not in the mood to care about VM internals when you have to debug your code.
Well I think [whether ShadowChicken has high performance overhead for always-on use] depends. If we allocate the shadow stack as part of our heap then it's probably too heavy, esp. since you'd need to change code generation to make sure it becomes safe to allocate on tail call (which is not the case currently). If it's malloc'ed somewhere else, then it might be easier and cheaper, but more involved in other places, i.e. linking this information to the parent frame. We'd have to measure. My feeling is that from a user perspective it's more predictable to have it on all the time (i.e. similar allocation profile when debugging, better stack trace on console, etc.). But for functional programmers that actually want tail calls and use them instead of loops, that means a LOT of overhead. It'd probably render PTC useless for them because of performance (which raises the question why it's there at all if we go that way).

@concavelenz

I wanted to add that Google Web products also collect stack traces from the field, and that our optimizer likely produces more than the average number of PTC calls. So implicit PTC is worrisome to us.

@ofrobots

I am very worried by implicit tail call elision, from a Node.js / server side perspective. Post-mortem debugging and in-production profiling are important use cases on the server side.

It is very common practice for server-side applications to do profiling in production. This entails gathering stack traces and associating them with the events being profiled (e.g. CPU ticks, allocations, etc.). TCE would reduce the efficacy of such profiling techniques and make it harder for people to understand performance issues in their production code. Similarly, post-mortem debugging is common practice on the server side. Upon errors you take a core dump, restart the service, and analyze the dumps offline / en masse. Having tail calls elided by default would make it hard to identify the root causes of problems.

Approaches like ShadowChicken are a no-go for me because they cost performance for a 'usually works' solution.

Implicit tail call elision means that I don't get the option to make a trade-off. I just get a degraded debugging and profiling experience. With an explicit opt-in syntax, I can at least make that trade-off and choose not to write code (or use libraries) with tail call elision.

/cc some folks from @nodejs/post-mortem as they might have opinions on by-default tail-call elision: @yunong @davepacheco @mhdawson.

@davepacheco

For what it's worth, the ShadowChicken strategy described above would not be very useful for users of DTrace and core-file-based postmortem debugging for Node.js.

Node.js today (since 0.6.7) supports the DTrace ustack helper mechanism, which allows DTrace to construct JavaScript stack traces from running Node programs without any assistance from the program itself. This is necessary in order to, for example, collect stack traces where a Node.js program goes to sleep, wakes up, issues I/O, or otherwise interacts with the operating system -- or just to profile it. This mechanism relies on there being a relatively straightforward way to translate stack frames on the native stack into JavaScript ones. I've written a lot more about how this mechanism works elsewhere.

There are a couple of tools for debugging Node.js programs from core files. I develop mdb_v8, but there's also llnode and IBM's IDDE. Using core files, of course, we cannot assume any assistance from the running program. mdb_v8 currently reconstructs JavaScript stack traces, most JavaScript values, and closure variables. It could potentially make use of a shadow table representing the "real" stack, but support for that could be arbitrarily complicated, depending on how it's implemented. Anyway, we'd always be enabling such a feature, which seems to defeat the purpose of TCE.

Optimized tail calls have always been a potential source of problems for debugging and profiling tools, and the impact varies depending on the problem. But given the option, particularly for JavaScript, I would generally disable this optimization for production use.

@mknichel

mknichel commented May 5, 2016

To elaborate on the comment from @concavelenz: maintaining stack frames in error.stack is very important for Google applications. We collect errors and analyze them on the server, and this is a critical path for fixing problems found in production. A fix only for DevTools would not be sufficient for our use case. Similarly, it sounds like the ShadowChicken strategy might not be suitable either.
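For context, this kind of field collection is conceptually just a global error handler that ships error.stack to a server; a rough sketch (the endpoint and payload shape below are made up for illustration):

```js
// Rough sketch of field error collection (endpoint and payload shape are made up).
window.addEventListener('error', (event) => {
  const report = {
    message: event.message,
    // If PTC elided a tail caller's frame, that frame is simply absent from this
    // string; nothing in the report indicates that anything is missing.
    stack: event.error && event.error.stack,
    url: location.href,
  };
  navigator.sendBeacon('/errors', JSON.stringify(report));
});
```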

@littledan

@erights I'd be interested to hear your thoughts on this.

@erights

erights commented May 5, 2016

I have the same discomfort with ShadowChicken. Causeway is a post-mortem distributed debugger that I worked on a long time ago (together with @cocoonfx, Tyler Close, and others). See the Related Work section of Causeway: A message-oriented distributed debugger for some other relevant old work on distributed debuggers. Causeway and some of these others could cope with partial information, in the sense that some sites participating in the distributed computation may elect not to also participate in the distributed accumulation of logs. But it would have been very weird for the information collected at participating sites to be non-deterministic depending on their local memory pressure.

Whether or not we do PTC and ShadowChicken, we should adopt STC. Even with PTC and ShadowChicken, the advantages of adopting the explicit syntactic marker of STC are

  • A static error if you (the programmer) were wrong about whether a particular position is in fact tail position.
  • The syntactic marker should waive the accumulation of stack trace info for the marked calls, since the programmer is explicitly indicating that they do not consider these calls to be stacking calls. This helps eliminate noise from the stack trace.

STC has these advantages whether used instead of or in addition to PTC.

@erights

erights commented May 5, 2016

One possible way to reconcile a ShadowChicken-like approach with the points that @concavelenz, @mknichel, and Causeway make is to say that, in contexts where logs are accumulated and reported, the info that GC would painfully drop under ShadowChicken is first logged and then deleted from the local heap. This relieves local memory pressure without losing the utility of this info for instrumentation. This would be a difficult bit of engineering and makes the non-determinacy of GC yet more visible. At least at the present time, I am not advocating it. But it is worth pointing out as a possible way of resolving this tension.

@littledan

If stack frame data is logged before being deleted through GC, it still constitutes linear rather than constant resource usage, and goes against the spirit of proper tail calls. The ES spec doesn't say anywhere that things are held in memory rather than disk :)

@erights

erights commented May 5, 2016

it still constitutes linear rather than constant resource usage, and goes against the spirit of proper tail calls.

I agree. Hopefully people will find this persuasive.

@saambarati

saambarati commented May 17, 2016

There is a lot of misinformation in this thread about how ShadowChicken works and how other theoretical shadow stack implementations could work. I'm going to speak to how we implemented our shadow stack, in the hope of moving the discussion away from arguing against a theoretical shadow stack implementation.

The main arguments I see here against ShadowChicken are that its efficacy is dependent on heap memory pressure and that it uses linear memory. This is not how we on the JSC team designed our shadow stack. Firstly, our stack pruning algorithm is not tied to GC or heap memory pressure. Secondly, JSC's implementation of a shadow stack does not use linear resources. We will keep the last N tail-deleted frames in our shadow stack. For example, if your program contains M naturally occurring frames, JSC will keep between M and M+N frames in its shadow stack when the inspector is open. We chose N=128 because we think it provides a good balance between the two main debugging scenarios:

  • Your program is not written to take advantage of PTC. This means you may have a few places in your program where you do perform a PTC. It's highly unlikely that you will have more than 128 of these on the stack at any given time. So with N=128, when users of JSC use the inspector, they will see all tail-deleted frames and will be able to interact with them as if they were real machine frames.
  • Your program is written to take advantage of PTC. This probably means you're writing some form of loop using PTC recursion. If you pause in the debugger, it's doubtful that you would want to see more than 128 frames. It's even possible you would not want to see any tail-deleted frames.

Again, we chose N=128 because we think it strikes a good balance. We may change this number as we see fit. But the argument that an implementation of a shadow stack implicitly requires linear memory and needs to be cleaned up by GC is provably wrong.
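To make the bounded-memory point concrete, here is an illustrative sketch (plain JS, not JSC's actual implementation) of a shadow stack that retains only the last N tail-deleted frames, independent of GC and heap pressure:

```js
// Illustrative sketch only; JSC's real shadow stack lives inside the VM.
const N = 128;

class ShadowStack {
  constructor() {
    this.frames = new Array(N); // fixed-size ring buffer: constant memory
    this.next = 0;
    this.count = 0;
  }

  // Called when a tail call is about to reuse (and thus delete) the caller's frame.
  recordTailDeletedFrame(frame) {
    this.frames[this.next] = frame;
    this.next = (this.next + 1) % N;
    this.count = Math.min(this.count + 1, N);
  }

  // The inspector merges these (most recent first) with the M real machine frames,
  // so at most M + N frames are ever presented.
  tailDeletedFrames() {
    const out = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.frames[(this.next - 1 - i + N) % N]);
    }
    return out;
  }
}
```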

@littledan

I was not referring to ShadowChicken using linear memory, but rather to Mark's suggestion of logging tail-optimized frames, which would be linear. Sorry for my misunderstanding; I thought it was based on GC because Chicken's TCO is, but this is an interesting-sounding design to use a separate mechanism to ensure limited resource usage. With this design, is the resulting stack deterministic?

Nevertheless, the underlying concern about missing stack frames in Error.stack remains. We have heard in this thread as well as in #6 that developers depend on this stack, and from this thread we know that 4-5% of stack frames on some popular websites would go missing. The lack of bug reports has been mentioned, but a missing stack frame would not cause a page load to fail; it would instead make something harder to debug off-line later, eventually resulting in a worse user experience as other bugs go undiagnosed and unfixed. Not all of the developer experience happens with devtools open.

@yunong

yunong commented Jun 6, 2016

At Netflix, we leverage JS heavily, both on the client (most of our clients use JS) and on the server (our website and data access platform run Node.js).

It's critical to our business that we have access to tools that let us easily observe and debug our services and clients in production.

For our server deployments, CPU profiling via perf, which allows us to sample JS stack frames on production instances without modifications to the Node binary, is vital. Profiling allows us to quickly identify performance problems and resolve them -- which is paramount when we're serving 75+ million customers.

We also use @davepacheco's post-mortem debugging toolchain mdb_v8 in production. The ability to capture a core dump on a live process and then inspect it offline is quite valuable. This technique lets us examine the complete stack frame of the process, as well as all of the function arguments for each frame.

Production is a tricky environment, since we rarely have the luxury to switch toolchains or restart the process. Sometimes restarting the process itself means we've lost the state which caused the issue in the first place. Part of the reason we've chosen Node is that it comes with an excellent suite of observability and debugging tools. We rely on them every day. Losing the ability to sample complete stack frames or see them in a core dump would mean losing essential telemetry -- something we're reluctant to compromise.
