Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor main file/commands and expand dataclass usage #61

Merged
merged 35 commits into from
Jan 21, 2025

Conversation

skyfenton
Copy link
Collaborator

Adds CLI commands for processing data into a csv and triggering a load of the csv into MongoDB.

Also creates dataclasses for input and enriched data (MovieData and EnrichedMovieData) and refactors code in wiki_to_netflix so that we can normalize the netflix data and write data into csvs based on these classes. Additionally, now when we write a list of MovieData objects to a csv file, the first line of the csv will consist of a header naming each column so that the entire file is mostly self-documenting (though, without explicit types). For example, movie_titles.csv writes EnrichedMovieData (matches), so the resulting csv file looks like:

netflix_id,title,year,wikidata_id,genres,director
3,Character,1997,Q928545,drama film; film based on literature,Mike van Diem
5,The Rise and Fall of ECW,2004,Q10381750,None,Kevin Dunn
...

@skyfenton skyfenton linked an issue Jan 17, 2025 that may be closed by this pull request
@skyfenton skyfenton self-assigned this Jan 17, 2025
Copy link
Collaborator

@audiodude audiodude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I think that the ideas here are really good and they put us in a good place to run the code and manage the data.

I think all of the issues I've pointed out are minor and should be easily fixable.

mediabridge/data_processing/wiki_to_netflix.py Outdated Show resolved Hide resolved
mediabridge/dataclasses.py Outdated Show resolved Hide resolved
mediabridge/dataclasses.py Outdated Show resolved Hide resolved
mediabridge/dataclasses.py Outdated Show resolved Hide resolved
mediabridge/db/load.py Outdated Show resolved Hide resolved
mediabridge/db/queries.py Show resolved Hide resolved
mediabridge/main.py Show resolved Hide resolved
mediabridge/db/queries.py Show resolved Hide resolved
@audiodude
Copy link
Collaborator

Also @JamesKohlsRepo , please do leave your own review! I think learning how to review code is a really important skill to have.

@audiodude audiodude changed the title 58 connect wiki to netflixmongo insertionmain file Refactor main file/commands and expand dataclass usage Jan 17, 2025
@jhanley634 jhanley634 self-requested a review January 17, 2025 19:46
Copy link
Collaborator

@jhanley634 jhanley634 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Travis is the one to keep happy, not me! But my tuppence is: LGTM :shipit:

mediabridge/data_processing/wiki_to_netflix.py Outdated Show resolved Hide resolved
@jhanley634
Copy link
Collaborator

Hullo, @skyfenton. Here's a tiny feature request. In wiki_query(), please delete the surrounding pair of SPACE characters.

f"Could not find movie id {movie.netflix_id}: (' {movie.title} ', {movie.year})"

I thought we needed to add a title .strip() call, based on what the log was telling me, but it turns out it's just an output artifact that should be removed.

BTW, here's a pair of handy f-string syntax tips that might be of interest:

f"Hello {name=}"
f"Goodbye {name:r}"

The first is great for quick debugging -- it gives the identifier followed by its value.
In the second, instead of the usual str() we call repr(), which is boring for integers but it will put quotes around a string.


Also, mypy complains the process() signature isn't quite right.
We should declare num_rows: int | None = typer.Option( ..., given that we sometimes assign None to it.

@jhanley634
Copy link
Collaborator

Ok, I finally found a substantive issue.

This PR 61 introduces a new dataclasses.py module. Please don't do that. At a minimum, rename it to data_classes.py. It causes confusion, and type checking lossage for mypy and pyright, which I eventually noticed and diagnosed.

Consider $ python -c 'import test'. It succeeds silently, due to one of python's standard libraries. A great many people have run afoul of that, naively creating their own test.py and then wondering why import behaves oddly. Same thing here.

@jhanley634 jhanley634 mentioned this pull request Jan 19, 2025
Copy link
Collaborator

@audiodude audiodude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work updating this PR! I think the only thing left is testing, but I guess we've already set the (unfortunate) precedent that we don't write tests, so this is fine as is.

@skyfenton
Copy link
Collaborator Author

Also, mypy complains the process() signature isn't quite right.
We should declare num_rows: int | None = typer.Option( ..., given that we sometimes assign None to it.

Fixed in recent change, maybe we should setup mypy or some other type checker (or maybe at least know about errors even if we don't want them to block)?

This PR 61 introduces a new dataclasses.py module. Please don't do that. At a minimum, rename it to data_classes.py. It causes confusion, and type checking lossage for mypy and pyright, which I eventually noticed and diagnosed.

Good catch! Renamed dataclasses.py to schemas.py instead, which I think is fitting, even more so if we eventually use pydantic.

Also removed spaces from outputs with titles and wrapped with double quotes.

@skyfenton
Copy link
Collaborator Author

skyfenton commented Jan 20, 2025

Thanks for all your work updating this PR! I think the only thing left is testing, but I guess we've already set the (unfortunate) precedent that we don't write tests, so this is fine as is.

I added a couple tests for the flatten_values and wiki_query functions using the new dataclasses if you want to take a look. I discovered the order we get the genres of a movie is non-deterministic, so I sort the list before checking for equality.

Maybe a good goal to set for future testing is some percentage of coverage? Is writing tooling/tests for read csv/write csv/process worth it?

@skyfenton
Copy link
Collaborator Author

skyfenton commented Jan 20, 2025

By the way, I'm gonna try to polish the cli a little more so I'll publish the pr once I'm done. Originally this draft was just to get @JamesKohlsRepo's attention so he can merge in the dataclasses we were working on.

@audiodude
Copy link
Collaborator

Sounds good! Nice work.

@skyfenton skyfenton marked this pull request as ready for review January 20, 2025 07:15
@jhanley634
Copy link
Collaborator

Hmm, this is slightly sad. I'm accustomed to "trust the author!" settings. Oh, well, I'll just reapprove.

Review required

Waiting on 1 reapproval from someone other than the last pusher. Reviews from audiodude and jhanley634 are stale because they were submitted before the most recent code changes.

Merging is blocked
Merging can be performed automatically with 1 approving review.

@jhanley634 jhanley634 self-requested a review January 20, 2025 19:35
Copy link
Collaborator

@jhanley634 jhanley634 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-approving...

mediabridge/schemas/movies_test.py Show resolved Hide resolved
@audiodude
Copy link
Collaborator

I wonder if there is a way to require review/approval, but not re-approval?

@jhanley634
Copy link
Collaborator

jhanley634 commented Jan 20, 2025

I wonder if there is a way to require review/approval, but not re-approval?

Yup. Here are two items. And for background, my personal philosophy is that Authors are smart and good, they mean well, the Author is always right, can always choose to ignore a Reviewer's remark. My role as Reviewer is simply to offer a different perspective on the code that perhaps was never considered. Once a PR has a 👍 thumbs up on it, I wish for Author to be able to quickly e.g. fix a typo and merge to main. We assume giant changes wouldn't piggyback on the initial approval -- they should go in a subsequent feature branch.

I usually click github repo Settings to allow this on Pull Requests:

  • [no] Allow merge commits
  • [yes] Allow squash merging
  • [no] Allow rebase merging

Then everyone who clicks the giant green Merge button gets a Squash, so git annotate history is simplified. I tend to do at least a commit an hour, often with tiny changes, and they will be uninteresting a month from now. Showing the granularity of "introduced feature X" in the log makes it more informative, more easily understood.

To answer your original question, it's on branch protection Rules. Make sure that "Dismiss stale pull request approvals when new commits are pushed" is disabled. Which ensures that previous approvals won't be invalidated when new commits are added to the branch.

@audiodude
Copy link
Collaborator

Okay yup, I found the rule, thanks!

@skyfenton skyfenton merged commit 4cc0c22 into main Jan 21, 2025
4 checks passed
@skyfenton skyfenton deleted the 58-connect-wiki_to_netflixmongo-insertionmain-file branch January 21, 2025 02:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Connect wiki_to_netflix/mongo insertion/main file
3 participants