Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Quality Checker and Run on GitHub Action #170

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

philcaz
Copy link

@philcaz philcaz commented Oct 20, 2024

Attempt solving #149

Closes #149

Improvement

Added a quality checker for abbreviation csv files that detects errors due to

  • Wrong escape characters
  • Wrong starting abbreviation letters
  • Non-UTF8 characters

and warnings due to

  • Duplicate full name / abbreviation entries
  • Identical abbreviation as original full name
  • Outdated 'Manage.' in abbreviations

Also added a yml file that runs the checker script on GitHub Action, which exits with error code if issues are found with the abbreviation files. A full downloadable error/warning summary is generated at each run (partial reports can be seen in Run Quality Check).
image
image

Limitations:

  • the function check_wrong_beginning_letters(filepath): currently considers abbreviations longer than the full text as invalid, even if the abbreviations seem valid, e.g.
    • Abbrev mismatch full name pattern in ./journals/journal_abbreviations_general.csv at line 424:Full: 'Crystallography', Abbrev: 'Acta Crystallogr., Sect. A: Found. Crystallogr.'
    • Abbrev mismatch full name pattern in ./journals/journal_abbreviations_general.csv at line 1144:Full: 'Metallofizika (Kiev)', Abbrev: 'Metallofizika (Akad. Nauk Ukr. SSR, Otd. Fiz. Astron.)
  • Entries containing multiple errors/warnings are reported multiple times, which increases the total issue count

Phikho-cc and others added 17 commits October 19, 2024 21:29
implement the quality checker that reports errors for
- wrong escape characters
- wrong starting letters
- presence of non-utf-8 characters
and reports warning for
- duplicate entries
- same full forms
- same abbreviations
- outdated 'Manage' abbreviation
prevent the script from stopping by error-triggered exit
- ignore single-name journals with same abbreviation as full name
- generate error summary and deploy checker on GitHub Action
change upload-artifact@v2 to @V3
fix mismatched quality check file name
provide better visualisation of error/warning output
- enhance invalid escape character check
- group full name duplication and abbreviation duplication into same warning
- ignore articles and preposition in check wrong beginning letters
make quality check action exit with code 1 if errors are present
print out error and warning messages on GitHub Action under quality check
Removed force print error summary in quality-check.yml since default GitHub Action console could not accommodate the size of error/warning summary. Partial error message can be seen in Run Quality Check
deleted a redundant quality checker
Add quality checker and set up GitHub Action file to run the checker
the function now considers abbreviations valid if they are similar to full text while not strictly having the same starting letters as the full names
Comment on lines 213 to 222
if filename.endswith(".csv"):
filepath = os.path.join(JOURNALS_FOLDER_PATH, filename)

# Run the checks
check_non_utf8_characters(filepath)
check_wrong_escape(filepath)
check_wrong_beginning_letters(filepath)
check_duplicates(filepath)
check_full_form_identical_to_abbreviation(filepath)
check_outdated_abbreviations(filepath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible to read the CSV file here, so that we can read the file only once and reduce performance-expensive io operations.


# Write to summary file
with open(SUMMARY_FILE_PATH, 'w', encoding='utf-8') as summary_file:
summary_file.writelines(summary_output)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

@philcaz philcaz Oct 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I've improved the csv loading efficiency but I couldn't upload the error summary to GitHub Action directly because the report size is too large.
image

philcaz and others added 9 commits October 27, 2024 03:36
Content of each abbreviation csv is now loaded into memory once and used by check functions, instead of being read multiple times upon function calls
Attempt to write issue report to GITHUB_STEP_SUMMARY
Fix quality checker name error
Try uploading large error report as artifect
Shorten error/warning message for smaller summary size
Shorten message and provide a more efficient error summary
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add quality checker
4 participants