Refactor main file/commands and expand dataclass usage #61

skyfenton · 2025-01-17T18:19:24Z

Adds CLI commands for processing data into a csv and triggering a load of the csv into MongoDB.

Also creates dataclasses for input and enriched data (MovieData and EnrichedMovieData) and refactors the code in wiki_to_netflix so that we can normalize the Netflix data and write it to CSV files based on these classes. Additionally, when we write a list of MovieData objects to a CSV file, the first line is now a header naming each column, so the file is mostly self-documenting (though without explicit types). For example, movie_titles.csv writes EnrichedMovieData (matches), so the resulting CSV file looks like:

netflix_id,title,year,wikidata_id,genres,director
3,Character,1997,Q928545,drama film; film based on literature,Mike van Diem
5,The Rise and Fall of ECW,2004,Q10381750,None,Kevin Dunn
...

mediabridge/data_processing/wiki_to_netflix.py

audiodude

This is great! I think that the ideas here are really good and they put us in a good place to run the code and manage the data.

I think all of the issues I've pointed out are minor and should be easily fixable.

mediabridge/data_processing/wiki_to_netflix.py

mediabridge/dataclasses.py

mediabridge/db/load.py

mediabridge/db/queries.py

mediabridge/main.py

mediabridge/db/queries.py

audiodude · 2025-01-17T18:35:32Z

Also @JamesKohlsRepo , please do leave your own review! I think learning how to review code is a really important skill to have.

jhanley634

Travis is the one to keep happy, not me! But my tuppence is: LGTM

mediabridge/data_processing/wiki_to_netflix.py

jhanley634 · 2025-01-18T00:11:47Z

Hullo, @skyfenton. Here's a tiny feature request. In wiki_query(), please delete the surrounding pair of SPACE characters.

f"Could not find movie id {movie.netflix_id}: (' {movie.title} ', {movie.year})"

I thought we needed to add a title .strip() call, based on what the log was telling me, but it turns out it's just an output artifact that should be removed.

BTW, here's a pair of handy f-string syntax tips that might be of interest:

f"Hello {name=}"
f"Goodbye {name!r}"

The first is great for quick debugging -- it gives the identifier followed by its value.
In the second, instead of the usual str() we call repr(), which is boring for integers but it will put quotes around a string.
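
A small runnable sketch of both tips, with a made-up name value. Note that the repr() conversion is spelled with an exclamation mark, `!r`; a colon followed by `r` would be treated as a format spec and is rejected for strings.

```python
name = "Ada"  # hypothetical value for illustration

# The "=" specifier echoes the expression text, then the repr of its value.
debug_text = f"Hello {name=}"

# The "!r" conversion calls repr() instead of str(), so strings gain quotes.
quoted_text = f"Goodbye {name!r}"

# For an integer, repr() and str() give the same text.
count = 3
number_text = f"{count!r}"

print(debug_text)
print(quoted_text)
print(number_text)
```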

Also, mypy complains the process() signature isn't quite right.
We should declare num_rows: int | None = typer.Option( ..., given that we sometimes assign None to it.

jhanley634 · 2025-01-18T02:26:44Z

Ok, I finally found a substantive issue.

This PR 61 introduces a new dataclasses.py module. Please don't do that. At a minimum, rename it to data_classes.py. It causes confusion, and type checking lossage for mypy and pyright, which I eventually noticed and diagnosed.

Consider $ python -c 'import test'. It succeeds silently, due to one of python's standard libraries. A great many people have run afoul of that, naively creating their own test.py and then wondering why import behaves oddly. Same thing here.

audiodude

Thanks for all your work updating this PR! I think the only thing left is testing, but I guess we've already set the (unfortunate) precedent that we don't write tests, so this is fine as is.

skyfenton · 2025-01-20T02:57:50Z

> Also, mypy complains the process() signature isn't quite right.
> We should declare num_rows: int | None = typer.Option( ..., given that we sometimes assign None to it.

Fixed in a recent change. Maybe we should set up mypy or some other type checker (or at least know about the errors, even if we don't want them to block)?

> This PR 61 introduces a new dataclasses.py module. Please don't do that. At a minimum, rename it to data_classes.py. It causes confusion, and type checking lossage for mypy and pyright, which I eventually noticed and diagnosed.

Good catch! Renamed dataclasses.py to schemas.py instead, which I think is fitting, even more so if we eventually use pydantic.

Also removed spaces from outputs with titles and wrapped with double quotes.

skyfenton · 2025-01-20T04:10:49Z

> Thanks for all your work updating this PR! I think the only thing left is testing, but I guess we've already set the (unfortunate) precedent that we don't write tests, so this is fine as is.

I added a couple tests for the flatten_values and wiki_query functions using the new dataclasses if you want to take a look. I discovered the order we get the genres of a movie is non-deterministic, so I sort the list before checking for equality.
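
The sort-before-compare approach can be sketched as follows. The same_genres helper and the sample lists are hypothetical and are not taken from movies_test.py.

```python
def same_genres(actual: list[str], expected: list[str]) -> bool:
    """Compare two genre lists while ignoring the order of their items."""
    # Sorting both sides makes the check independent of query result order.
    return sorted(actual) == sorted(expected)


expected = ["drama film", "film based on literature"]
shuffled = ["film based on literature", "drama film"]

print(shuffled == expected)             # plain list equality is order-sensitive
print(same_genres(shuffled, expected))  # order-insensitive comparison
```

Sorting, unlike converting to a set, still distinguishes lists that differ only in how many times a genre is repeated.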

Maybe a good goal to set for future testing is some percentage of coverage? Is writing tooling/tests for read csv/write csv/process worth it?

skyfenton · 2025-01-20T04:14:23Z

By the way, I'm gonna try to polish the cli a little more so I'll publish the pr once I'm done. Originally this draft was just to get @JamesKohlsRepo's attention so he can merge in the dataclasses we were working on.

audiodude · 2025-01-20T05:05:30Z

Sounds good! Nice work.

jhanley634 · 2025-01-20T19:35:12Z

Hmm, this is slightly sad. I'm accustomed to "trust the author!" settings. Oh, well, I'll just reapprove.

Review required

Waiting on 1 reapproval from someone other than the last pusher. Reviews from audiodude and jhanley634 are stale because they were submitted before the most recent code changes.

Merging is blocked
Merging can be performed automatically with 1 approving review.

jhanley634

re-approving...

mediabridge/schemas/movies_test.py

audiodude · 2025-01-20T19:42:49Z

I wonder if there is a way to require review/approval, but not re-approval?

jhanley634 · 2025-01-20T20:10:51Z

> I wonder if there is a way to require review/approval, but not re-approval?

Yup. Here are two items. And for background, my personal philosophy is that Authors are smart and good, they mean well, the Author is always right, can always choose to ignore a Reviewer's remark. My role as Reviewer is simply to offer a different perspective on the code that perhaps was never considered. Once a PR has a 👍 thumbs up on it, I wish for Author to be able to quickly e.g. fix a typo and merge to main. We assume giant changes wouldn't piggyback on the initial approval -- they should go in a subsequent feature branch.

I usually click github repo Settings to allow this on Pull Requests:

[no] Allow merge commits
[yes] Allow squash merging
[no] Allow rebase merging

Then everyone who clicks the giant green Merge button gets a Squash, so git annotate history is simplified. I tend to do at least a commit an hour, often with tiny changes, and they will be uninteresting a month from now. Showing the granularity of "introduced feature X" in the log makes it more informative, more easily understood.

To answer your original question, it's under branch protection Rules. Make sure that "Dismiss stale pull request approvals when new commits are pushed" is disabled, which ensures that previous approvals won't be invalidated when new commits are added to the branch.

audiodude · 2025-01-21T00:17:00Z

Okay yup, I found the rule, thanks!

skyfenton added 7 commits December 29, 2024 03:59

Move CLI logic for data processing out of main file

ce48e1a

Fixed comparison to use context object

1b92193

Removed formatting from help message to make message consistent

694bf4f

Create dataclasses based on required data attributes

5f68d9a

add load command to main app cli

ee48002

Refactor wiki_to_netflix functions using dataclasses

8a91ce2

Rename netflix_csv for readability (also we had duplicate variables)

cfec05c

skyfenton requested review from audiodude and JamesKohlsRepo January 17, 2025 18:19

skyfenton linked an issue Jan 17, 2025 that may be closed by this pull request

Connect wiki_to_netflix/mongo insertion/main file #58

Closed

skyfenton self-assigned this Jan 17, 2025

skyfenton commented Jan 17, 2025

mediabridge/data_processing/wiki_to_netflix.py (outdated, resolved)

audiodude requested changes Jan 17, 2025

Change csv output name to distinguish between matches/missing

f6e991f

audiodude changed the title ~~58 connect wiki to netflixmongo insertionmain file~~ Refactor main file/commands and expand dataclass usage Jan 17, 2025

Load rows from matches.csv to insert into Mongo

04ff23d

jhanley634 self-requested a review January 17, 2025 19:46

jhanley634 approved these changes Jan 17, 2025

mediabridge/data_processing/wiki_to_netflix.py (outdated, resolved)

Re-add tqdm and move print message to inside process function

066c259

skyfenton commented Jan 17, 2025

mediabridge/data_processing/wiki_to_netflix.py (resolved)

Fix flatten_values instance check and join

ea1fae2

audiodude reviewed Jan 17, 2025

mediabridge/data_processing/wiki_to_netflix.py (resolved)

mediabridge/data_processing/wiki_to_netflix.py (outdated, resolved)

mediabridge/data_processing/wiki_to_netflix.py (outdated, resolved)

skyfenton added 3 commits January 17, 2025 22:47

Add explicit exc_info parameter for clarity

0c12743

Rename AppContext attr to log_to_file for clarity

8eac102

Save log_to_file flag check from ctx.obj

4888306

skyfenton added 2 commits January 18, 2025 03:13

Update all docstrings and type hints

9c08135

Change .joinpath to / for readability

6291d9f

skyfenton added 5 commits January 18, 2025 22:49

Rename dataclasses.py to schemas.py and rename models directory to recommender

a99ea1c

Use vars instead of __dict__

7c78cc4

Add variable to count total (in case num_rows is None)

9d4b2b6

Fix num_rows type hint

bca5cab

Fix load.py module ordering

e0a957d

jhanley634 mentioned this pull request Jan 19, 2025

62 genres annotation #63

Merged

audiodude approved these changes Jan 19, 2025

skyfenton added 2 commits January 20, 2025 00:15

Remove spaces in title outputs and use double quotes for readability

c6c09d6

Remove unused comment for SPARQL debug

766bf18

skyfenton added 3 commits January 20, 2025 03:14

Move schemas.py into movies.py under schemas dir

c13aca1

Add test for wiki_query

870165d

Change to simpler equality check via sorting

d2f762d

skyfenton and others added 5 commits January 20, 2025 06:38

Fix grammar

ab6f787

Clarify how to use the CLI in the README

9cf3c63

Merge branch 'main' into 58-connect-wiki_to_netflixmongo-insertionmain-file

8dc6c0d

Simplified testing instructions

351c256

Merge branch '58-connect-wiki_to_netflixmongo-insertionmain-file' of https://github.com/noisebridge/MediaBridge into 58-connect-wiki_to_netflixmongo-insertionmain-file

e8ff906

skyfenton marked this pull request as ready for review January 20, 2025 07:15

jhanley634 self-requested a review January 20, 2025 19:35

jhanley634 approved these changes Jan 20, 2025

mediabridge/schemas/movies_test.py (resolved)

skyfenton merged commit 4cc0c22 into main Jan 21, 2025
4 checks passed

skyfenton deleted the 58-connect-wiki_to_netflixmongo-insertionmain-file branch January 21, 2025 02:09
