Creating new CLI for process_pr_entropy, referenced in GitHub action entropy-check.yml #95

Merged
willdavidson05 merged 26 commits into software-gardening:main from CLI-PR
Aug 14, 2024

Conversation

willdavidson05
Member

@willdavidson05 willdavidson05 commented Aug 13, 2024

Description

This PR introduces a new function, process_pr_entropy, in the file processing_repositories.py, which can be referenced via the CLI. This function generates a detailed and informative entropy report that will be utilized within our custom GitHub Action.

To test this, you can run: poetry run almanack process_pr_entropy --repo_path " " --pr_branch " " --main_branch " "

I'm struggling to create a test case for this implementation: now that the use case is more complex (i.e., it involves branches), I can no longer use my existing test repositories.
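
One possible way to get branch-aware test data is sketched below: a throwaway repository with a main branch and a PR branch, built through the git command line in a temporary directory. The helper names (git, make_two_branch_repo), the file name, and the default branch names are hypothetical and not part of almanack, and the sketch assumes a git executable is available on the PATH.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stripped stdout."""
    result = subprocess.run(
        [
            "git",
            # Inline identity so commits work without any global git config.
            "-c", "user.name=Test User",
            "-c", "user.email=test@example.com",
            "-c", "commit.gpgsign=false",
            *args,
        ],
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def make_two_branch_repo(
    base_dir: Path, main_branch: str = "main", pr_branch: str = "pr-branch"
) -> Path:
    """Create a repository with one commit on `main_branch` and one more on `pr_branch`."""
    repo = base_dir / "repo"
    repo.mkdir()
    git(repo, "init")
    # Point HEAD at the desired branch name regardless of the local default.
    git(repo, "symbolic-ref", "HEAD", f"refs/heads/{main_branch}")

    # First commit lives on the main branch.
    (repo / "file.txt").write_text("line 1\n")
    git(repo, "add", "file.txt")
    git(repo, "commit", "-m", "Initial commit on main")

    # Second commit lives only on the PR branch.
    git(repo, "checkout", "-b", pr_branch)
    (repo / "file.txt").write_text("line 1\nline 2\nline 3\n")
    git(repo, "add", "file.txt")
    git(repo, "commit", "-m", "Change file on PR branch")

    # Leave the repository checked out on the main branch.
    git(repo, "checkout", main_branch)
    return repo
```

In a test, the returned path and the two branch names could be supplied as the --repo_path, --pr_branch, and --main_branch values shown above.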

I appreciate any comments or feedback!

Closes #92

What is the nature of your change?

  • Content additions or updates (adds or updates content)
  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own contributions.
  • I have commented my content, particularly in hard-to-understand areas.
  • I have made corresponding changes to related documentation (outside of book content).
  • My changes generate no new warnings.
  • New and existing tests pass locally with my changes.
  • I have added tests that prove my additions are effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@willdavidson05 willdavidson05 marked this pull request as ready for review August 14, 2024 15:55
Contributor

@falquaddoomi falquaddoomi left a comment


Neat! I see that the bot triggered and produced an entropy report, very cool to see that!

I left some refactoring comments if you'd like to address them, but I think this PR is good enough to accept, considering that it works.

from typing import Any, Dict


def compute_pr_data(repo_path: str, pr_branch: str, main_branch: str) -> Dict[str, Any]:
Contributor

@falquaddoomi falquaddoomi Aug 14, 2024


This function's body seems awfully similar to compute_repo_data() above; perhaps you could add most_recent_commit and oldest_commit parameters to compute_repo_data(), with it defaulting to the first and last commit if unspecified, and then use that in compute_pr_data()?

I'm thinking you'd modify compute_repo_data() like so:

def compute_repo_data(repo_path: str, most_recent_commit: pygit2.Commit = None, oldest_commit: pygit2.Commit = None) -> Dict[str, Any]:
    # ...
    
    # Retrieve the list of commits from the repository
    commits = get_commits(repo)
    most_recent_commit = commits[0] if most_recent_commit is None else most_recent_commit
    first_commit = commits[-1] if oldest_commit is None else oldest_commit
    
    # ...

Then, you can invoke it from compute_pr_data() like so:

def compute_pr_data(repo_path: str, pr_branch: str, main_branch: str) -> Dict[str, Any]:
    try:
        # ...
    
        # Get the most recent commits on each branch
        pr_commit = repo.get(pr_ref.target)
        main_commit = repo.get(main_ref.target)
    
        result = compute_repo_data(repo_path, pr_commit, main_commit)
        
        return {
            "pr_branch": pr_branch,
            "main_branch": main_branch,
            "total_entropy_introduced": result["total_normalized_entropy"],
            "number_of_files_changed": result["number_of_files"],
            "entropy_per_file": result["file_level_entropy"],
            "commits": result["time_range_of_commits"]
        }
        
    except Exception as e:
        # If processing fails, return an informative error
        return {"pr_branch": pr_branch, "main_branch": main_branch, "error": str(e)}

There is a small downside in that you have to parse the repo twice; IMHO that isn't a big deal, but if you want to avoid that, there are a number of ways you could do it. You could, for example, have compute_repo_data() take a repo object, then write another small function that takes in a path, constructs the repo object, and calls compute_repo_data(repo). Alternatively, you could pull the entropy calculations out of compute_repo_data() into another function and call it from both compute_repo_data() and compute_pr_data().
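
The last alternative above (a shared function called from both places) might look roughly like the sketch below. Everything beyond the three dictionary keys taken from this PR is an assumption: summarize_entropy, the file_entropy_fn callback, and the simplified signatures are hypothetical, commits are represented as plain strings rather than pygit2 objects, and the averaging step is only an illustrative aggregate, not necessarily almanack's formula.

```python
from typing import Any, Callable, Dict, List

# Hypothetical callback: (repo_path, older_commit, newer_commit) -> {file name: entropy}
FileEntropyFn = Callable[[str, str, str], Dict[str, float]]


def summarize_entropy(
    repo_path: str, older: str, newer: str, file_entropy_fn: FileEntropyFn
) -> Dict[str, Any]:
    """Shared core: per-file entropy plus aggregate figures for one commit range."""
    file_level_entropy = file_entropy_fn(repo_path, older, newer)
    number_of_files = len(file_level_entropy)
    # Illustrative aggregate only: mean of the per-file values.
    total_normalized_entropy = (
        sum(file_level_entropy.values()) / number_of_files if number_of_files else 0.0
    )
    return {
        "total_normalized_entropy": total_normalized_entropy,
        "number_of_files": number_of_files,
        "file_level_entropy": file_level_entropy,
    }


def compute_repo_data(
    repo_path: str, commits: List[str], file_entropy_fn: FileEntropyFn
) -> Dict[str, Any]:
    """Whole-history report; `commits` is ordered newest first."""
    summary = summarize_entropy(repo_path, commits[-1], commits[0], file_entropy_fn)
    return {"repo_path": repo_path, "number_of_commits": len(commits), **summary}


def compute_pr_data(
    repo_path: str,
    pr_commit: str,
    main_commit: str,
    pr_branch: str,
    main_branch: str,
    file_entropy_fn: FileEntropyFn,
) -> Dict[str, Any]:
    """PR report; compares the tip of the main branch with the tip of the PR branch."""
    summary = summarize_entropy(repo_path, main_commit, pr_commit, file_entropy_fn)
    return {"pr_branch": pr_branch, "main_branch": main_branch, **summary}
```

With this shape the repository is only opened wherever file_entropy_fn needs it, and both reports share one code path for the aggregate numbers.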

I'd also suggest using consistent names with your variables and dictionary keys. For example, in compute_repo_data(), you use normalized_total_entropy as a variable name, then assign it to the dict key total_normalized_entropy. Another example: in compute_pr_data() you call the file-level entropy in the results dict entropy_per_file, but in compute_repo_data it's file_level_entropy.
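
As a small illustration of the naming point, the key names could be declared once and mirrored by the local variables that fill them. This is only a sketch: EntropySummary and build_summary are hypothetical names, and the mean used for the total is a placeholder rather than the project's actual calculation.

```python
from typing import Dict, TypedDict


class EntropySummary(TypedDict):
    """Single source of truth for the result keys shared by both reports."""

    total_normalized_entropy: float
    number_of_files: int
    file_level_entropy: Dict[str, float]


def build_summary(file_level_entropy: Dict[str, float]) -> EntropySummary:
    """Build the summary with local variable names that match the dict keys."""
    number_of_files = len(file_level_entropy)
    # Placeholder aggregate: mean of the per-file values.
    total_normalized_entropy = (
        sum(file_level_entropy.values()) / number_of_files if number_of_files else 0.0
    )
    return EntropySummary(
        total_normalized_entropy=total_normalized_entropy,
        number_of_files=number_of_files,
        file_level_entropy=file_level_entropy,
    )
```

A type checker can then flag a mismatched key such as entropy_per_file wherever an EntropySummary is expected.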

Member Author


Thanks for this comment, @falquaddoomi! I definitely agree that there is some serious overlap between those two functions and that this refactoring is needed. I'm going to add this to a new issue. I got things running but ran into some small troubles with test cases and other small errors. With such a tight deadline before my presentation, I figure it makes more sense to save this for later on.



================================================================================
                      Software Information Entropy Report                       
================================================================================

Repository information:
┌────────────────────────────┬─────────────────────────────────────┐
│ Repository Path            │ /home/runner/work/almanack/almanack │
├────────────────────────────┼─────────────────────────────────────┤
│ Total Normalized Entropy   │ 0.0022                              │
├────────────────────────────┼─────────────────────────────────────┤
│ Number of Commits Analyzed │ 74                                  │
├────────────────────────────┼─────────────────────────────────────┤
│ Files Analyzed             │ 99                                  │
├────────────────────────────┼─────────────────────────────────────┤
│ Time Range of Commits      │ 2024-03-05 to 2024-08-14            │
└────────────────────────────┴─────────────────────────────────────┘

Top 5 files with the most entropy:
┌───────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────┐
│ File Name                                                                                 │   Normalized Entropy │
├───────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────┤
│ poetry.lock                                                                               │               0.063  │
├───────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────┤
│ src/book/seed-bank/pubmed-github-repositories/visualize-pubmed-repo-sofware-entropy.ipynb │               0.0296 │
├───────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────┤
│ package-lock.json                                                                         │               0.0289 │
├───────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────┤
│ src/book/seed-bank/pubmed-github-repositories/gather-pubmed-repos/generate_data.py        │               0.0049 │
├───────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────┤
│ src/book/garden-circle/contributing.md                                                    │               0.0045 │
└───────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────┘


{"repo_path": "/home/runner/work/almanack/almanack", "total_normalized_entropy": 0.0022092836141351982, "number_of_commits": 74, "number_of_files": 99, "time_range_of_commits": ["2024-03-05", "2024-08-14"], "file_level_entropy": {".alexignore": 8.253835527495845e-05, ".github/ISSUE_TEMPLATE/bug.yml": 0.0023349605455284246, ".github/ISSUE_TEMPLATE/config.yml": 8.253835527495845e-05, ".github/ISSUE_TEMPLATE/feature.yml": 0.002104028600972535, ".github/PULL_REQUEST_TEMPLATE.md": 0.001192097486247306, ".github/actions/install-node-env/action.yml": 0.0004861877577156774, ".github/actions/install-python-env/action.yml": 0.0009649693275020158, ".github/release-drafter.yml": 0.0007006579175230309, ".github/workflows/deploy-book.yml": 0.0011640435303269991, ".github/workflows/draft-release.yml": 0.000760353478909068, ".github/workflows/entropy-check.yml": 0.0015224212096459065, ".github/workflows/pre-commit-checks.yml": 0.0007006579175230309, ".github/workflows/publish-pypi.yml": 0.0008780354118013581, ".github/workflows/pytest-tests.yml": 0.0011640435303269991, ".gitignore": 0.0038960387682738305, ".linkcheckerrc.ini": 0.00039092240958147305, ".pre-commit-config.yaml": 0.002436493933654525, ".vale.ini": 0.0004861877577156774, "CITATION.cff": 0.0020262326268331316, "CONTRIBUTING.md": 0.00011971843019093978, "LICENSE": 0.0009071320875416535, "LICENSE.txt": 0.0009071320875416535, "README.md": 0.0005484609658173223, "pa11y.json": 0.00025940473583026404, "package-lock.json": 0.028881584389224866, "package.json": 0.0001909446166991543, "poetry.lock": 0.0630431314476572, "pyproject.toml": 0.0035884199150774234, "src/almanack/__init__.py": 0.000760353478909068, "src/almanack/book.py": 0.001895588192916369, "src/almanack/processing/calculate_entropy.py": 0.0026626727258898795, "src/almanack/processing/compute_data.py": 0.004036672559927364, "src/almanack/processing/git_operations.py": 0.0038489770323008043, "src/almanack/processing/processing_repositories.py": 
0.0013586388649582597, "src/almanack/reporting/cli.py": 0.0007899786269540005, "src/almanack/reporting/report.py": 0.0019218189751533945, "src/book/_config.yml": 0.0012757374677957243, "src/book/_static/custom.css": 0.00045469869242032595, "src/book/_toc.yml": 0.0007899786269540005, "src/book/assets/640px-Forgard2-003.gif": 0.0, "src/book/assets/640px-Rundes_Fenster_mit_Gitter.jpeg": 0.0, "src/book/assets/Sundial_2916_HDR.jpeg": 0.0, "src/book/assets/almanack-influencing-software.png": 0.0, "src/book/assets/software-gardening-logo.png": 0.0, "src/book/assets/software-lifecycle.png": 0.0, "src/book/assets/xkcd_dependency.png": 0.0, "src/book/garden-circle/contributing.md": 0.0044998300432500145, "src/book/garden-circle/garden-circle.md": 0.00011971843019093978, "src/book/garden-circle/garden-map.md": 0.0013586388649582597, "src/book/garden-lattice/garden-lattice.md": 0.001247942242637372, "src/book/introduction.md": 0.0022327628564206368, "src/book/references.bib": 0.004292441258266434, "src/book/seed-bank/pubmed-github-repositories/gather-pubmed-repos/generate_data.py": 0.004932515854003226, "src/book/seed-bank/pubmed-github-repositories/gather-pubmed-repos/generate_github_enriched_data.py": 0.003919534779730899, "src/book/seed-bank/pubmed-github-repositories/gather-pubmed-repos/pubmed_github_links.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/gather-pubmed-repos/pubmed_github_links_with_github_data.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/gather-software-information-entropy.ipynb": 0.004199734287477557, "src/book/seed-bank/pubmed-github-repositories/images/pubmed-lines-of-code-and-time.png": 0.0, "src/book/seed-bank/pubmed-github-repositories/images/pubmed-stars-and-forks.png": 0.0, "src/book/seed-bank/pubmed-github-repositories/images/pubmed-stars-and-open-issues.png": 0.0, "src/book/seed-bank/pubmed-github-repositories/images/software-information-entropy-forks.png": 0.0, 
"src/book/seed-bank/pubmed-github-repositories/images/software-information-entropy-gh-stars.png": 0.0, "src/book/seed-bank/pubmed-github-repositories/images/software-information-entropy-open-issues.png": 0.0, "src/book/seed-bank/pubmed-github-repositories/images/software-information-entropy-top-5-langs.png": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_1.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_10.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_11.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_12.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_13.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_14.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_15.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_16.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_17.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_18.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_19.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_2.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_20.parquet": 0.0, 
"src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_3.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_4.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_5.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_6.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_7.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_8.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/repository_analysis_results/repository_analysis_results_batch_9.parquet": 0.0, "src/book/seed-bank/pubmed-github-repositories/visualize-pubmed-repo-sofware-entropy.ipynb": 0.029622799158067335, "src/book/seed-bank/seed-bank.md": 0.00011971843019093978, "src/book/software-forest/software-forest.md": 0.0012757374677957243, "src/book/verdant-sundial/verdant-sundial.md": 0.00130345069321151, "styles/config/vocabularies/almanack/accept.txt": 0.0006705735698113297, "tests/conftest.py": 0.003155272598916929, "tests/data/almanack/repo_setup/create_repo.py": 0.002181404246431413, "tests/data/almanack/repo_setup/insert_code.py": 0.0016837773193039043, "tests/data/jupyter-book/sandbox.md": 0.0005484609658173223, "tests/test_almanack.py": 0.001107660493871501, "tests/test_build.py": 0.0019218189751533945, "tests/test_calculate_entropy.py": 0.0016837773193039043, "tests/test_compute_data.py": 0.0010793262195818538, "tests/test_git_operations.py": 0.002811830176828594, "tests/test_processing_repositories.py": 0.0011358987077730264}}

@willdavidson05
Member Author

Thank you for the speedy review, @falquaddoomi! The compute file definitely needs some refactoring, which I plan to track in a new issue; I'm just on a short time frame for my last week. Similarly, I wanted to create a function that offers different output formats (e.g., JSON, Markdown), but I did not get around to it, which meant I had to delete a test case. Sorry, I know this probably isn't the best way of doing this, but I'm crunched for time. Adding back the test cases and refactoring will be done in the next PR!
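
For reference, a function offering several output formats could be sketched as a small dispatch table like the one below. The names (format_report, FORMATTERS, to_json, to_markdown) and the Markdown layout are hypothetical and are not the project's implementation; only the file_level_entropy key comes from the report data above.

```python
import json
from typing import Any, Callable, Dict


def to_json(data: Dict[str, Any]) -> str:
    """Serialize the full report as indented JSON."""
    return json.dumps(data, indent=2)


def to_markdown(data: Dict[str, Any]) -> str:
    """Render the five highest-entropy files as a Markdown table."""
    top_files = sorted(
        data["file_level_entropy"].items(), key=lambda item: item[1], reverse=True
    )[:5]
    lines = ["| File Name | Normalized Entropy |", "| --- | --- |"]
    lines += [f"| {name} | {value:.4f} |" for name, value in top_files]
    return "\n".join(lines)


# Map each supported format name to its formatter.
FORMATTERS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "json": to_json,
    "md": to_markdown,
}


def format_report(data: Dict[str, Any], output_format: str = "json") -> str:
    """Format report data in the requested format, rejecting unknown formats."""
    try:
        formatter = FORMATTERS[output_format]
    except KeyError:
        supported = ", ".join(sorted(FORMATTERS))
        raise ValueError(
            f"Unsupported format {output_format!r}; expected one of: {supported}"
        ) from None
    return formatter(data)
```

Adding another format would then only require one new function and one new entry in the table, and each formatter can be tested on a plain dictionary without touching a repository.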

@willdavidson05 willdavidson05 merged commit f4bbef1 into software-gardening:main Aug 14, 2024
11 checks passed
@willdavidson05 willdavidson05 deleted the CLI-PR branch August 29, 2024 15:39
Development

Successfully merging this pull request may close these issues.

CLI for PR entropy report
2 participants