process_data: add comments, refactor code, and add tests #208
I set out to add code comments to make the data preprocessing script easier to understand, e.g., which data are removed and which are kept for the scoring algorithms. I ended up refactoring the code and adding tests for the process_data.py script. I will start with result validation, then explain the changes I made, and finally compare the logs.
Result validation
The preprocess_data function returns three dataframes: notes, ratings, and noteStatusHistory. These dataframes are the input to the scoring algorithms. I validated my results by writing the intermediate dataframes to disk and comparing them with the results from the latest release version (c7db275) using a diff. The intermediate dataframes are exactly the same.
Add code comments and logs to explain what notes/ratings are kept or removed
I think the current code comments and logs can be improved to help users understand the data preprocessing part. As I understand it, the logic in the _filter_misleading_notes function does the following:
I added logs to each of the three steps, showing how the row counts change (see the new log output below). I also added test cases for step 3, because the numbers should add up. The tests are here.
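As a rough illustration of the idea (not the code in this PR), per-step logging and an "adds up" check could look like the sketch below. The helper name split_and_log, the logger name, and the column names are all hypothetical, and the filter condition is made up; it does not reproduce the actual steps in _filter_misleading_notes.

```python
import logging

import pandas as pd

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("process_data_sketch")


def split_and_log(df: pd.DataFrame, keep: pd.Series, step: str):
    """Split df into kept/removed rows by a boolean mask and log the row counts."""
    kept = df[keep]
    removed = df[~keep]
    logger.info(
        "%s: %d rows before, %d removed, %d kept",
        step, len(df), len(removed), len(kept),
    )
    return kept, removed


# Made-up data; the column names and the condition are illustrative only.
ratings = pd.DataFrame({"noteId": [1, 1, 2, 3], "helpful": [1, 0, 1, 1]})
kept, removed = split_and_log(ratings, ratings["noteId"] != 2, "example step")

# The numbers should add up: no row is silently lost or duplicated.
assert len(kept) + len(removed) == len(ratings)
```

Returning both the kept and the removed rows makes the row accounting easy to assert in a test, at the cost of holding both frames in memory.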
Log comparison
The log comparison is based on the data released on 2024-02-07. The new log shows the number of rows in each dataframe read from the provided tsv files, and a detailed history of how the row counts change through Steps 1-3 above.
Previous log output from the latest release version (c7db275)
New log output from my PR