
Refactoring sem code to reduce sources of truth #16

Open
saem opened this issue Feb 1, 2021 · 4 comments
saem commented Feb 1, 2021

I suspect nothing I suggest here is particularly new or groundbreaking, and it might in fact already be the intended direction. I know IC/Packed AST and revisiting the code will naturally encourage parts of it. I'm writing it down merely because I don't presently see this type of refactor discussed.

I'll start by presenting the initial idea; if in principle it sounds close, I'll make some very narrow changes to see what they actually look like, then revise this as more is learned and plans change.

Initial Idea

I've read #5, but this is a broader change to the sem layer. I'll use semexprs.semExpr's handling of nkCall, nkInfix, nkPrefix, nkPostfix, nkCommand, nkCallStrLit as a motivating example: it's used in a number of areas and for surprising things, and it's broken in subtle and difficult-to-debug ways.

One such surprising thing: the change in nim-lang/Nim#16840 causes js/t11166.nim to fail, as the experimental dot operator macro ends up recursing in a cycle trying to apply the dot operator to broken code. The dot operator's return type is made up of undefined symbols; when it goes to resolve it, semTypeExpr delegates to semExpr, which delegates directly or transitively to qualifiedLookup, semDirectOp, builtInFieldAccess, etc. Each of these establishes some ground truth, then calls a delegate, throwing away much of that analysis. There is no single source of truth, and the same issue needs to be handled all over the place, but carefully, as the existing context is unclear.

That pattern of some proc in sem learning a lot internally, then calling another proc that either doesn't take that information into account (omission bugs) or regenerates it (inconsistent/multiple sources of truth), makes informal reasoning rather difficult.

Back to the example: semTypeNode is called by semProcAux, which knows the expression should ultimately provide a return type, but that knowledge is lost by the time we hit semExpr and friends. Since they're all rather general, it's now difficult to know whether any change is safe, as one needs to reason about far more cases than before and it's unclear what will be impacted.

Starting Guiding Principles

I think code should be refactored to reduce the sources of truth, starting with:

  • semExpr should do as little analysis as possible, and instead stick to dispatch based on node and context information
  • reduce the amount of redundant analysis and push more of the results into the AST and Context; semProcAux, for example, should flag the node in the return-type position with that expectation, and that information should then be used downstream
  • where redundant querying is required, extract it into shared procs as today, but have caching controlled by the requester (this is already the case; it just seemed worth emphasizing)
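The second point above could be sketched roughly as follows. This is a hypothetical illustration in Python rather than Nim compiler code: the node kinds, the Expectation flag, and the sem_* function names are all made up for the example, but the shape matches the proposal, where the caller records what it already knows on the node, and downstream passes dispatch on that instead of re-deriving (or losing) it.

```python
# Hypothetical sketch: push the caller's expectation into the node so that
# downstream sem passes can read it back instead of re-deriving it.
from dataclasses import dataclass, field
from enum import Enum, auto

class Expectation(Enum):
    NONE = auto()
    TYPE_EXPR = auto()   # the node sits in a return-type position

@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)
    expectation: Expectation = Expectation.NONE

def sem_proc_aux(return_type_node: Node) -> Node:
    # The caller knows this node must ultimately yield a type; flag it once
    # here instead of letting every delegate rediscover (or lose) that fact.
    return_type_node.expectation = Expectation.TYPE_EXPR
    return sem_expr(return_type_node)

def sem_expr(n: Node) -> Node:
    # Pure dispatch: the expectation travels with the node.
    if n.expectation is Expectation.TYPE_EXPR:
        return sem_type_expr(n)
    return n

def sem_type_expr(n: Node) -> Node:
    # Specialized handling; it can trust the flag set by its caller.
    n.kind = "typeExpr"
    return n
```

The point of the sketch is only the data flow: one place establishes the ground truth, and everyone downstream consumes it rather than reconstituting it.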

Impact

This will lead to a small set of top-level sem procs covering early analysis much as today, but they'll fan out to a wider set of specialized sem procs and/or ones that read more linearly with "flatter" branching logic. The AST and context will likely grow, as more metadata and pushed-down analysis will live there.

Undoubtedly some "fun" challenges will come up: presently the redundant analysis allows callers of those procs to ignore certain lifetime concerns, such as the fidelity of the information available. This will likely lead to a number of annoying bugs as code shifts and the correct pre/post-conditions are re-established.

Long term, better informal reasoning and reduced interdependency between procs should not only help existing compiler devs but also lower the barrier to entry for part-time compiler devs.

saem commented Feb 1, 2021

For those curious about the motivation: I've been poking at sem for over a month, and even with a debugger, informal reasoning remains a challenge. The information loss and continual reconstitution mean that not only is it unclear where a problem should be solved, but many solutions end up partial. Additionally, any time information loss occurs, vast amounts of the language feature space need to be considered, made more difficult by the various combinations of experimental flags. Anyway, I'd like to improve things not just for myself but for others presently working on the compiler and those yet to come.

Araq commented Feb 1, 2021

I don't agree too much with your analysis, but it's worth a try. IMO information that is passed around via side channels (i.e. not part of PNode etc.) just makes it harder and harder to specify what a "sem"ed Nim AST should be, which is important for the evolution of Nim's macro system. Where we can simplify and fix bugs is the conflation of type checking with symbol lookups: with nkSymChoice there is no reason why we cannot do symbol lookups before type checking.
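That two-phase idea could be illustrated roughly like this. This is a toy sketch in Python, not compiler code: the scope representation and the lookup/narrow names are invented for the example, but the split mirrors what nkSymChoice enables, where a pure symbol lookup first yields the full candidate set, and type checking only narrows it afterwards.

```python
# Toy two-phase resolution: lookup first (producing a "sym choice" of all
# overloads), type checking second (narrowing the choice by argument types).
def lookup(scope: dict, name: str) -> list:
    # Phase 1: pure symbol lookup, no type information consulted.
    return scope.get(name, [])

def narrow(candidates: list, arg_types: tuple) -> list:
    # Phase 2: type checking filters the candidate set.
    return [c for c in candidates if c["params"] == arg_types]

scope = {"add": [{"params": ("int", "int"), "ret": "int"},
                 {"params": ("float", "float"), "ret": "float"}]}

choice = lookup(scope, "add")             # both overloads survive phase 1
resolved = narrow(choice, ("int", "int")) # phase 2 picks the int overload
```

Keeping the phases separate means the lookup result is a well-defined artifact in its own right, rather than something interleaved with (and obscured by) type checking.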

Araq commented Feb 1, 2021

Also, there is little reason why addr/deref computations need to be done at all; the backends could do them. And even if they are required, it's unclear why they're not done in sempass2, which is generally cleaner.

saem commented Feb 6, 2021

Rereading what I originally wrote, I didn't emphasize the AST as the preferred home for this information, though even that I would revise as I learn more.

I think much of what I'm describing is basically a crappy version of attribute grammars; I suppose I need to read some lecture notes or a book on the subject and figure it out a little more.

It does jibe with keeping as much as possible in the AST, but it pushes a lot more information into the node than is currently being pushed. Of course, data in the node could be materialized or be a computation over the data available in that node.
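For the attribute-grammar framing, a minimal sketch might look like this. It's illustrative Python only (the node kinds and the annotate function are made up): an inherited attribute (the expected type) flows down the tree, a synthesized attribute (the computed type) flows back up, and both live on the node itself rather than in side channels.

```python
# Toy attribute grammar: "expected" is inherited (set by the parent),
# "actual" is synthesized (computed from the subtree), both stored on nodes.
from dataclasses import dataclass, field

@dataclass
class N:
    kind: str
    value: object = None
    children: list = field(default_factory=list)
    expected: str = None     # inherited attribute, pushed down
    actual: str = None       # synthesized attribute, passed back up

def annotate(node: N, expected: str) -> str:
    node.expected = expected                   # push the inherited attribute
    if node.kind == "intLit":
        node.actual = "int"
    elif node.kind == "add":
        kinds = [annotate(c, expected) for c in node.children]
        node.actual = kinds[0] if len(set(kinds)) == 1 else "error"
    return node.actual                         # return the synthesized one

tree = N("add", children=[N("intLit", 1), N("intLit", 2)])
annotate(tree, "int")
```

After the pass, every node carries both what its context demanded of it and what it turned out to be, so later passes can consult the tree instead of re-running the analysis.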
