
Refactor linting code #805

Closed · 4 tasks done
ewels opened this issue Dec 3, 2020 · 5 comments

ewels (Member) commented Dec 3, 2020

The code for nf-core lint is pretty central to the nf-core community and is heavily used across all pipelines. It has grown over time and is now due a substantial refactor / rewrite. As I see it, the main points are:

  • Use text-based test names instead of numeric
    • Test numbers seemed simple at the start, but they mean that you can't insert a new test in a logical spot in the middle of the code without disrupting all of the numbers that follow
    • Could use short IDs, maybe corresponding to the function names?
  • Break up some of the larger lint tests
    • Some of the tests have grown quite a bit over time and a single test can encompass a lot of sub-checks (eg. check_nextflow_config()) - it would be good to break these up into nice small chunks
  • Look for an nf-core linting yaml config file in the pipeline (eg. .nf-core-lint.yml) - a possible layout is sketched after this list
    • Allow any test to be ignored, eg. check_docker: false
    • Allow test-specific sub-tests to be ignored, where applicable. eg. check_files_exist: { ignore: ['README.md'] }
    • Other nice customisation options wherever we can think of them, eg. checking for a string other than nf-core in the manifest.name config (could include global options?)
    • Can model this on how markdownlint rules work - eg. this config file
    • Generally I like hiding dotfiles like this away somewhere so as not to pollute the repo root directory. But I think it's important that this one is super visible so that it's easy to see which rules are being ignored. So maybe .nf-core-lint.yml in the root is a good idea here.
    • Good logging about what's being skipped will be important, maybe also in reporting at end / GitHub comment etc?
  • Rewrite the python tests (test_lint.py) to be much more self-isolated
    • Make each test independent of the rest, as much as possible
    • Remove the need to maintain the lint_examples directories
    • Get rid of the MAX_PASS_CHECKS thing, which is super annoying
    • Will probably need to refactor functions in lint.py to return a set of results, instead of modifying self
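
To illustrate the idea, a minimal sketch of what such a config file could look like - the filename matches the suggestion above, but the keys and layout are placeholders rather than a settled schema:

```yaml
# .nf-core-lint.yml (hypothetical layout - key names are examples only)

# Skip a whole lint test
check_docker: false

# Ignore specific sub-checks of a test
check_files_exist:
  ignore:
    - README.md

# Test-specific customisation, eg. allowing a non nf-core manifest name
check_nextflow_config:
  manifest_name: my-org/my-pipeline
```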

Although of course it would be nice to try to break this work up into a bunch of smaller PRs, I suspect that a lot of it will have to be done all in one go as everything is tied together currently. One of the main aims of all of the above is to make the linting code a lot more atomic and thus easier to update / add to.

This issue supplants #389

apeltzer (Member) commented Dec 3, 2020

@szymonwieloch this might be of interest to you!

KevinMenden self-assigned this Dec 4, 2020
KevinMenden (Contributor) commented

Had a closer look at lint.py now and thought about how to address the issues mentioned above.

Text-based test names

Should be as easy as replacing the numbers with text-based keys (also on the website, of course).
I would furthermore suggest defining a CHECK_ID variable at the very beginning of each check function and using it throughout the function definition. Easier to read and easier to maintain/change.
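
A minimal sketch of the CHECK_ID idea, assuming a hypothetical check function and result structure (not the actual lint.py API):

```python
import os

def files_exist(self):
    """Example lint check keyed by a text-based ID rather than a number."""
    CHECK_ID = "files_exist"  # single place to define/change the test's key
    passed, failed = [], []
    # self.path is assumed to be the pipeline root directory
    if os.path.isfile(os.path.join(self.path, "main.nf")):
        passed.append((CHECK_ID, "main.nf found"))
    else:
        failed.append((CHECK_ID, "main.nf not found"))
    return {"passed": passed, "failed": failed}
```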

Break up larger lint tests

I went through all the tests to see what could be broken up. The check_nextflow_config() function can be broken up into sub-checks. Quite a number of check functions will fall away completely or be shortened, as we don't need to check for conda and docker anymore. Considering this, I could not find another check function that would be suitable to divide into sub-functions - in my opinion the rest are short enough :)

That is, if we decide to outsource the conda/docker checks to the module linting. The module lint function could then be called by lint.py for every module, just iterating over all the modules. I'm pretty sure you guys have thought about this before and have an opinion? :)

Linting configuration file

I think it would be nice to have a field in the nf-core-lint.yml with the key being the same as the text-based key used for the check function - just to stay consistent. It would just need a boolean for whether to run the test or not, and could then have check-specific parameters, like whether a file should be ignored or not, etc. That would be relatively straightforward to implement, I guess. The .yml can also be read automatically when creating the PipelineLint object.

Checking whether to run a function could be done either by filtering the list of tests prior to looping through it, or by checking within the functions themselves. I think the former would be cleaner.
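
A rough sketch of how that could look - load_lint_config, the "set to false to skip" convention and the example test names below are all assumptions for illustration, not the real implementation:

```python
import os
import yaml  # PyYAML

def load_lint_config(pipeline_dir):
    """Read .nf-core-lint.yml from the pipeline root, if present."""
    config_path = os.path.join(pipeline_dir, ".nf-core-lint.yml")
    if not os.path.isfile(config_path):
        return {}
    with open(config_path) as fh:
        return yaml.safe_load(fh) or {}

# Example: filter the list of tests before looping through it
all_lint_tests = ["check_files_exist", "check_nextflow_config", "check_docker"]  # example names
lint_config = load_lint_config(".")
tests_to_run = [t for t in all_lint_tests if lint_config.get(t) is not False]
for test_name in tests_to_run:
    print(f"Running lint test: {test_name}")
```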

Rewrite python tests

I think this is the trickiest one.
The information in MAX_PASS_CHECKS could be stored in the lint object, or in the check functions, and it could be calculated dynamically so we don't need to hardcode it (just a counter, I guess).

Regarding making it more self-isolated, I assume you mean that all tests are using the same directories for testing, right? Otherwise I can't see how the tests depend on one another.
Not sure what's the best way to fix this. One thing I can think of is to dynamically make a directory using the template, which we have anyway - this would probably be a one-liner. Then a few lines to add or remove the specific features we need this directory to have for the tests to behave as we want. And then delete it again after the test (I think there's a clean-up function for such things with python tests). No need to keep lint_examples any longer.
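
A sketch of that approach using unittest's setUp()/tearDown() hooks and a temporary directory; the step that builds the pipeline from the template is only indicated in a comment, as the exact nf_core.create call would need checking:

```python
import shutil
import tempfile
import unittest

class TestLint(unittest.TestCase):
    def setUp(self):
        # Fresh temporary directory for every test case
        self.tmp_dir = tempfile.mkdtemp()
        # Build a minimal pipeline from the template into self.tmp_dir here
        # (eg. via nf_core.create - exact call not shown, assumption only),
        # then add/remove whatever the specific test needs.

    def tearDown(self):
        # Delete the directory again so tests never share state
        shutil.rmtree(self.tmp_dir)
```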

Could of course modify the functions to return results and then store these in a dict. Not sure about the benefit of that, though? 🤔
Could also modify the functions to take a path as an argument, so you can call them without creating a PipelineLint object first. Also not sure if that would improve the code.

Will give it a bit more thought and then get to work, probably on Monday.

ewels (Member, Author) commented Dec 5, 2020

Sorry all, I couldn't help myself and got started. See #809 👈🏻

ewels (Member, Author) commented Dec 5, 2020

Ok, had a bit of fun with this today and have made a start on #809. Still more to do if anyone gets a chance though.

I was finding the lint.py file super massive and impossible to navigate, so I broke all of the tests up into their own files. I think this makes the code far easier to navigate and read. As a bonus it makes the sphinx docs rendering simpler / cleaner (all of the core lint.py helper code can live in one file and each test gets its own page + URL).

I don't think that there is any way around the grunt work for the tests. It's just a case of going through them all and rewriting. I made a start on a few and already found some presumably long-standing problems where tests were being ignored incorrectly. So some of these tests probably haven't been running for ages in some situations (eg. if you're not running from within the pipeline dir). Just because we have code coverage doesn't mean that the tests are necessarily doing anything useful 😓

I did pretty much what you suggested for the tests @KevinMenden - wrote a setUp() function that creates a new pipeline in a temporary directory using nf_core.create. These files can then be used to test the lint functions. As it's dynamically created from the code it should stay in line with the tests without having to update minimal examples as well.

Because I refactored the lint functions to return a dict instead of appending to lists in the main object, it is now pretty easy to just check the results of that one test function. This is what I meant about them being more isolated - before all results were kept globally, so if you added one test anywhere it would mess up the expected counts for all lint test pytests. Now you can inspect the returned results from each individual test which is much better.
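
As an illustration of that pattern (the function name, dict keys and attributes here are indicative only, not necessarily the exact nf_core API), a pytest for a single lint test could now look something like this:

```python
import nf_core.lint

def test_files_exist_pass(self):
    """Check the results of one lint test in isolation - no global counts."""
    lint_obj = nf_core.lint.PipelineLint(self.test_pipeline_dir)  # dir created in setUp()
    results = lint_obj.files_exist()  # assumed to return a dict of results
    assert len(results.get("failed", [])) == 0
    assert len(results.get("passed", [])) > 0
```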

ewels (Member, Author) commented Feb 17, 2021

I think we can close this issue now. It's quite broad but I think that most of it has been implemented and merged now.

The main outstanding thing is that test coverage has taken a big drop, and it would be nice to try to bring it back up at some point if we can.

ewels closed this as completed Feb 17, 2021
ewels mentioned this issue Jul 5, 2021