Skip to content

Latest commit

 

History

History
407 lines (324 loc) · 15 KB

DESIGN.md

File metadata and controls

407 lines (324 loc) · 15 KB

Ormolu

This document represents our discussions and plans at early stages of development, it is no longer being updated.

This document describes design of a new formatter for Haskell source code. It also includes recommendations for future implementers.

We set for the following goals (mostly taken from brittany):

  • Preserve the meaning of the formatted functions when no CPP is used;
  • Make reasonable use of screen space;
  • Use linear space and computation time on the size of the input;
  • Preserve comments;
  • Be idempotent.

Analysis of the existing solutions

In order to design a new formatter we need to study the existing solutions so we can borrow the good bits and avoid making the same mistakes.

Brittany

Brittany builds on top of ghc-exactprint—a library that uses parser of GHC itself for parsing and thus it guarantees that at least parsing phase is bug-free (which is admittedly the cause of majority of bugs in other projects, see below).

After parsing, Haskell AST and a collection of annotations are available. The annotations are there because Haskell AST doesn't provide enough information to reconstruct source code (for example it doesn't include comments). The AST and the annotations are converted into a BriDoc value. A BriDoc value is a document representation like the Doc from the pretty or the wl-pprint libraries.

Brittany implements its own document type in an attempt to find a satisfactory rendering of the source code that fits a page-width constraint. Because of this, a BriDoc value represents a collection of many candidate layouts (i.e. renderings) of the source code.

This collection is pruned until it contains a single layout. The structure of the chosen layout is then adjusted to leave it in a form which can be easily traversed to produce the final rendering.

Brittany invests the majority of its implementation to manage the BriDoc values. Given that the amount of possible layouts is exponential, the representation is clever enough to fit them in linear space. There are multiple ways to build a BriDoc, not all of which fit in linear space. So care is necessary to keep memory bounded.

The compexities of the BriDoc structure, together with the lack of documentation, make Brittany at least challenging to maintain.

Hindent

Hindent uses haskell-src-exts for parsing like all older projects. haskell-src-exts does not use parser of GHC itself, and is a source of endless parsing bugs. Hindent is affected by these upstream issues as well as Stylish Haskell and Haskell formatter (see below). This already makes all these projects unusable with some valid Haskell source code, but let's continue studying Hindent anyway.

Hindent is quite different from Brittany in the sense that it does not attempt to build a document representation to render afterwards, instead it just prints the parsed code straight away. This means that the 70-80% of what the code does is a printing traversal.

Hindent code is easier to read and debug. Pretty-printing functions are very straightforward. If there is a bug (in pretty-printer, not in parser which Hindent cannot control), it's easy to fix AFAIU.

Hindent is also notable for its ability to handle CPP and inputs that do not constitute complete modules. It splits input stream into so-called “code blocks” recognizing CPP macros and then only pretty-prints “normal code” without touching CPP directives. After that CPP is inserted between pretty-printed blocks of source code. The approach fails when CPP breaks code in such a way that separate blocks do not form valid Haskell expressions, see this for example.

Looking at the bug tracker there are many bugs. Part of them is because of the use of haskell-src-exts, the other part is because the maintainer doesn't care (anymore?) and doesn't fix them. Well it's as simple as that, with any sort of commercial backing the bugs in pretty printer would be fixed long time ago.

Stylish Haskell

Stylish Haskell also uses haskell-src-exts and suffers from the same upstream problems. I haven't studied the transformations it performs, but it looks like it transforms the parsed source code partially by manipulating AST and partially by manipulating raw text (e.g. to drop trailing whitespace from each line). CPP Macros are just filtered out silently as a preprocessing step before feeding the code to haskell-src-exts.

Stylish Haskell is not so invasive as the other formatters and most reported bugs are about parsing issues and CPP. As I understand it, people mostly use it to sort their import lists.

Haskell formatter

Haskell formatter is an older project that didn't get much traction. It also uses haskell-src-ext and also tries to do manipulations on the parsed AST. The issue tracker doesn't have many issues probably because it never got popular enough (only 15 stars on GitHub). All the issues are about upstream problems with haskell-src-exts.

Proposed design

This section describes a solution that combines all the good things from the projects above.

Parsing and CPP

It is clear that ghc-exactprint is better than haskell-src-exts, so we should use that. If we go with ghc-exactprint though, we'll need to specify which parser to use, e.g. the parser that parses whole module or the one which parsers declarations/expressions/etc. It seems that in general it's better to use the parser for modules because it should work with all input files containing complete modules, while with other parsers it's impossible to guess what they'll be called on.

CPP

We allow CPP directives in the input, but we forgo the goal to preserve the meaning of the formatted functions in that case. Instead of supporting CPP better, we hope for a solution to replace CPP to do conditional compilation.

There are the following challenges when formatting a module with CPP:

  • GHC parser won't accept anything but a valid, complete module. Therefore, formatting the Haskell code between CPP directives is not an option.

  • Ignoring the CPP directives and formatting the Haskell code can change its meaning. An example follows.

Let's suppose that we want to format the following program:

$ cat test.hs
{-# LANGUAGE CPP #-}
main = print (g && f1)
  where
        f1 = h
          where
            h = True
#ifdef C1
g = g1
  where
    g1 = g2
      where
        g2 = False
#else
        g = True
#endif

#ifndef C1
g = False
#endif

$ runhaskell test.hs
True

At the time of this writing, formatting this program with Hindent or Ormolu produces the same output we would get if the CPP directives were considered comments:

$ ormolu --version
ormolu 0.0.5.0 HEAD fc64eada5c4da7a5b07d2872e253671b48aec115
using ghc-lib-parser 8.10.1.20200412

$ ormolu --mode inplace test.hs

$ cat test.hs
{-# LANGUAGE CPP #-}

main = print (g && f1)
  where
    f1 = h
      where
        h = True
#ifdef C1
g = g1
  where
    g1 = g2
      where
        g2 = False
#else
        g = True
#endif

#ifndef C1
g = False
#endif

$ runhaskell test.hs
False

Running the formatter causes the output of the program to change from True to False when C1 is not defined.

A solution could be to make the formatter more careful with CPP directives, constraining how directives can be inserted in Haskell code to avoid changing the meaning by reformatting. But this would introduce additional complexity, and the problem would need to be solved repeatedly for every tool out there which wants to parse Haskell modules. If CPP is replaced with some language extension or mechanism to do conditional compilation, all tools will benefit from it.

Printing

Just pretty-printing code (following the approach of Hindent) seems sane. It is straightforward and when complemented with enough tests (see the section about testing below), it should work all right.

Implementation can be just a Reader monad producing something like text builder. The context of Reader can store current indentation and configuration options.

As the pretty-printing library we can use Outputable (and SDoc) from the ghc package itself (at least for pretty-printing basic things like floating point literals and the like). The benefit is that AST components that we'll want to print are already instances of Outputable, so we'll get correct renderings for free.

In order to keep the output of the formatter simple, fast and correct, we introduce the following rule. The pretty-printing code can be in control of every formatting choice, except for two, which are left to the programmer:

  1. location of comments (comments are going to be attached to specific syntactic entities, so moving an entity will move its comment too),
  2. line breaking.

Regarding (2), the idea is that given any syntactic entity, the programmer has a choice:

  1. write it on one line, or
  2. write it on two lines or more.

If (1), then everything is kept in one line. If (2), i.e. a line break appears somewhere in the concrete syntax tree (CST), then additional line breaks are introduced everywhere possible in parent nodes, but not in any sibling or children nodes.

Examples:

-- Stays as is.
data T = A | B

data T
  = A | B
-- Is reformatted to:
data T
  = A
  | B

-- Stays as is.
map :: (a -> b) -> [a] -> [b]

foldr :: (a -> b -> b) ->
  b -> [a] -> [b]
-- Is reformatted to:
foldr ::
  (a -> b -> b) ->
  b ->
  [a] ->
  [b]

t = let x = foo bar
                      baz
  in foo bar baz
-- Is reformatted to:
t =
  let x =
        foo
          bar
          baz
   in foo far baz

Crucially, no effort is made to fit within reasonable line lengths. That's up to the programmer. Style will still be consistent for everyone in every other aspect, and that's what counts.

Configuration

We are not allowing to configure any aspect of the formatter. A module might be used in multiple projects, and we prefer to have it formatted the same in all of them.

See this this post by Chris Done (the author of Hindent) which says that as long as the default style is conventional and good it doesn't really matter how code gets formatted. Consistency is more important.

Handling of language extensions

Some language extensions affect how parsing is done. We are going to deal with those in two ways:

  • When language pragmas are present in source file, we must parse them before we run the main parser (I guess) and they should determine how the main parsing will be done.
  • There also should be configuration file that may enable other language extensions to be used on all files.
  • Later we could try to locate Cabal files and fetch the list of extensions that are enabled by default from there.

Testing

It should be possible to add tests incrementally as we develop pretty-printing code and new issues are discovered. For each Haskell module that we want to test, we perform the following steps:

  1. Given input snippet of source code parse it and pretty print it.
  2. Parse the result of pretty-printing again and make sure that AST is the same as AST of original snippet module span positions. We could make this part of a self-check in the formatter.
  3. Check the output against expected output. Thus all tests should include two files: input and expected output.
  4. Check that running the formatter on the output produces the same output again (the transformation is idempotent).

In order to grow our testsuite, we would borrow test cases from test suites of existing libraries like Brittany and Hindent. Then we may add test cases for opened issues that Brittany and Hindent have.

When we're confident enough, we can start “mining” new issues by running the program on real source code from Hackage till we don't get new issues anymore. For every issue that we find this way, a test case should be added.

Functionality of executable

  • In all cases the program should test if the produced AST is the same as the one we originally parsed and if it differs, an error should be displayed suggesting reporting this on our issue tracker.
  • Check mode: return non-zero exit code if any transformations would be applied.
  • Modification in place and printing of formatted code to stdout.
  • A flag for version/commit information.
  • An option to specify location of config file.
  • Options to specify parameters that come from config files on command line instead (currently this is just dynamic options enabled by default, such as language extensions).

Why not contribute to/fork HIndent or Brittany?

We want to simultaneously optimize three goals:

  1. simplicity of implementation,
  2. efficiency,
  3. predictable and readable output that doesn't overuse vertical spacing.

Hindent aims for (1) and (2) by still producing something palatable in (3). Brittany gives up on (1) but goes a long way towards (3) and presumably does OK on (2). Ormolu goes for (1), (2) and (3), by outsourcing the hard part of (3) to the user. Ormolu is less normative than Brittany, and less normative than Hindent, but arguably stills achieves consistent style.

Forking or contributing to Hindent is not an option because if we replace haskell-src-exts with ghc (or ghc-exact-print) then we'll have to work with a different AST type and all the code in Hindent will become incompatible and there won't be much code to be re-used in that case. It is also possible that we'll find a nicer way to write pretty-printer.

Examples

A list of formatting examples can be found here.