Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Topiary is pretty slow on formatting OCaml codebases #525

Closed
Niols opened this issue Jun 20, 2023 · 4 comments
Closed

Topiary is pretty slow on formatting OCaml codebases #525

Niols opened this issue Jun 20, 2023 · 4 comments
Labels
P1 critical: next release type: bug

Comments

@Niols
Copy link
Member

Niols commented Jun 20, 2023

Describe the bug

This issue is unrelated to #519. Topiary takes about half a second to format OCaml files, even trivial ones. This is reasonable when formatting one file but it becomes very frustrating when formatting a whole code base.

To Reproduce

For instance, on a relatively small codebase, with the latest version compiled in release mode:

$ git clone https://github.com/colis-anr/morsmall
Cloning into 'morsmall'...
remote: Enumerating objects: 1335, done.
remote: Counting objects: 100% (385/385), done.
remote: Compressing objects: 100% (161/161), done.
remote: Total 1335 (delta 228), reused 340 (delta 199), pack-reused 950
Receiving objects: 100% (1335/1335), 393.89 KiB | 1.21 MiB/s, done.
Resolving deltas: 100% (812/812), done.

$ cd morsmall

$ find -name '*.ml' | wc -l
16

$ find -name '*.ml' -exec wc -l {} +
   26 ./src/ASTUtils.ml
   58 ./src/morsmall.ml
  214 ./src/AST.ml
  244 ./src/utilities/testParser.ml
   46 ./src/equality/located.ml
   46 ./src/equality/nonLocated.ml
   46 ./src/printer/debugNonLocated.ml
   49 ./src/printer/json.ml
  346 ./src/printer/safe.ml
   46 ./src/printer/debug.ml
   49 ./src/printer/jsonNonLocated.ml
   37 ./src/location.ml
  875 ./src/CST_to_AST.ml
  134 ./tests/golden/golden.ml
  215 ./tests/qcheck/generator.ml
  138 ./tests/qcheck/qcheck.ml
 2569 total

$ time find -name '*.ml' -print -exec topiary -if {} \;
./src/ASTUtils.ml
./src/morsmall.ml
./src/AST.ml
./src/utilities/testParser.ml
./src/equality/located.ml
./src/equality/nonLocated.ml
./src/printer/debugNonLocated.ml
./src/printer/json.ml
./src/printer/safe.ml
./src/printer/debug.ml
./src/printer/jsonNonLocated.ml
./src/location.ml
./src/CST_to_AST.ml
./tests/golden/golden.ml
./tests/qcheck/generator.ml
./tests/qcheck/qcheck.ml

real	0m6.976s
user	0m6.880s
sys	0m0.090s

Expected behavior

I would expect Topiary to be reasonably fast to format a whole code base (maybe a few seconds). This could be achieved either by having Topiary be very fast (probably under 100 ms) on separate files or by having Topiary able to process several files at once.

Environment

  • OS name + version: NixOS
  • Version of the code: 738a178

Additional context

The situation has already been improved a bit by #523 and its 40% speedup. However, one would need a much more drastic speedup per run to achieve something nicer.

If this helps, I'd imagine a command line interface of:

topiary --in-place FILE [FILE ...]

(or -i for short) handling at least one file (and failing if none were passed), or:

topiary --input-file FILE --output-file FILE

(or -f and -o for short) handling exactly one file and defaulting to stdin and stdout without any flags.

Such an interface would also make for a much easier use in find/xargs and in pre-commit or similar utilities that expect formatters to handle several files at once.

@Niols
Copy link
Member Author

Niols commented Jun 21, 2023

In the stand-up meeting of today we mentioned the idea to have topiary be able to take several --input-file and that would then write everything in the one output file (potentially stdout), concatenating things. However, this would be maybe a bit counter intuitive but it would also behave very badly with a potential future multi-threaded version of Topiary.

@Niols
Copy link
Member Author

Niols commented Jun 21, 2023

Somehow, this reminds me of sed -i that accepts an extension for backing-up files. Typically, one could write sed -i.bak ... to get files to be copied into *.bak files before applying formatting. I am not sure it would make sense for Topiary, though, but I thought I'd throw it in the mix.

As another random idea, if we really wanted to be able to have --input-file and --output-file work with several files at once, we could allow alternating them:

topiary -f file1.ml -o file1-fmt.ml -f file2.ml -o file2-fmt.ml ...

or maybe have --output-file able to take a pattern or something, like:

topiary -f file1.ml -f file2.ml -o '{}-fmt.ml'

but I am really not convinced by this. Again, just throwing them in here.

@aspiwack
Copy link
Member

I think it's time to think about subcommands.

There could be a subcommand topiary format which takes a list of files and format them in place. This is the “standard” behaviour of a formatter. This can be multithread, or single thread. But could share compiled queries.

Then we could have topiary dev (not sure it's the best name), with the current interface. Because things like --visualise or the playground.sh script don't necessarily make sense in the multi-file setting.

Would this avoid the complications?

@nbacquey
Copy link
Member

I think this has been solved by subcommands. In the same repository https://github.com/colis-anr/morsmall, we now have:

$ time topiary format $(find -name '*.ml')

real    0m0.795s
user    0m1.058s
sys     0m0.026s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 critical: next release type: bug
Projects
None yet
Development

No branches or pull requests

4 participants