Interaction matrix of user #10

Merged
audiodude merged 26 commits into main from interaction-matrix-of-user
Dec 6, 2024

siddz415
Collaborator

No description provided.

Collaborator

@audiodude audiodude left a comment

This looks good.

I think you might want to think about writing the code that "glues" this all together. Instead of just writing the library functions, what does it look like to call these functions and produce the final interaction matrix?

data/movie_titles.txt Outdated Show resolved Hide resolved
data/mv_0000001.txt Outdated Show resolved Hide resolved
data/mv_0000002.txt Outdated Show resolved Hide resolved
data_processing/load_data.py Outdated Show resolved Hide resolved
data_processing/load_data.py Outdated Show resolved Hide resolved
data_processing/process_data.py Outdated Show resolved Hide resolved
data_processing/load_data.py Outdated Show resolved Hide resolved
data_processing/process_data.py Outdated Show resolved Hide resolved
@audiodude
Collaborator

Hey @siddz415 , just wanted to check in if you're able to respond to the comments in this PR? If you're too busy with other stuff that's totally cool. We were thinking we would re-assign the PR if there's not any motion before the meeting on Nov 7. Let us know if you're still interested in working on it.

Collaborator

@jhanley634 jhanley634 left a comment

Using latin-1 encoding in load_movie_titles() seems odd, but perhaps we're loading an ancient file which is not in utf8. It's all commented code in any case, so we should probably just delete those lines before merging to main.

The split() on , comma suggests that we might prefer to import csv.

In lightfm_recommendation.py and main.py we have a lot of code up at module level. Recommend burying it within def main(): or whatever. That way it will be safe for some future unit test to import it without side effects. The current code, which lacks a __main__ guard, is innocuous enough. But I'm concerned it will encourage folks to add stuff to this code base which is hard for a "second caller" (such as a unit test) to invoke. Such issues could be addressed in the current PR prior to a merge, or in a subsequent PR.

@audiodude
Collaborator

> Using latin-1 encoding in load_movie_titles() seems odd, but perhaps we're loading an ancient file which is not in utf8. It's all commented code in any case, so we should probably just delete those lines before merging to main.

The file is from 2007, so it's completely possible that it's not UTF-8. It's also possible that it doesn't have any non-ASCII characters anyway, so it might not matter.
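
A small illustration of why latin-1 is a forgiving choice here. The sample bytes are made up for the example and are not taken from movie_titles.txt; whether the real file contains any non-ASCII bytes is not established in this thread.

```python
# Hypothetical sample line: the title contains the single byte 0xE9,
# which is "é" in latin-1 but is not a valid UTF-8 sequence on its own.
raw = b"1,2001,Am\xe9lie\n"

# latin-1 maps every possible byte to a code point, so decoding never fails.
decoded = raw.decode("latin-1")

# A strict UTF-8 decode of the same bytes raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# For pure-ASCII content the two encodings give identical results.
ascii_same = b"1,1994,Forrest Gump".decode("latin-1") == b"1,1994,Forrest Gump".decode("utf-8")
```

If the file turns out to be pure ASCII, either encoding would read it the same way, which matches the point that it might not matter.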

> The split() on , comma suggests that we might prefer to import csv.

These text files are not CSV, they are only CSV-like. Individual fields can have commas in them. Specifically movie_titles.txt often looks like:

1023, 1994, Forrest Gump
1024, 2001, Planes, Trains and Automobiles
...
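
One way to handle titles that contain commas, sketched under the assumption that each line has exactly the ID, YEAR, TITLE layout shown above. The function name parse_title_line is hypothetical and is not taken from the PR. The year is kept as text because the thread does not say whether it is always numeric.

```python
def parse_title_line(line):
    """Split 'ID, YEAR, TITLE' into three fields, keeping commas inside the title."""
    # maxsplit=2 stops after the second comma, so the rest of the line is the title.
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    return int(movie_id), year.strip(), title.strip()


row = parse_title_line("1024, 2001, Planes, Trains and Automobiles")
```

Limiting the split to the first two commas avoids needing quoting rules, which the csv module would expect for a field with embedded commas.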

> In lightfm_recommendation.py and main.py we have a lot of code up at module level. Recommend burying it within def main(): or whatever. That way it will be safe for some future unit test to import it without side effects. The current code, which lacks a __main__ guard, is innocuous enough. But I'm concerned it will encourage folks to add stuff to this code base which is hard for a "second caller" (such as a unit test) to invoke. Such issues could be addressed in the current PR prior to a merge, or in a subsequent PR.

I agree. Better to put things in a main function now and figure out how to call it later. We can always rename main() to create_matrix() or whatever. I think a lot of this comes from the fact that we haven't really discussed how to organize files/modules/libraries/functions, so things are sort of in random places right now.
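
A minimal sketch of the pattern being suggested. The body of create_matrix() is a placeholder invented for illustration; the real module-level code in main.py is not shown in this thread.

```python
def create_matrix():
    """Placeholder for the work that currently runs at module level."""
    return {"status": "matrix built"}


def main():
    # All side effects (I/O, printing) live here rather than at import time.
    result = create_matrix()
    print(result["status"])
    return result


# Runs only when the file is executed as a script, not when a test imports it.
if __name__ == "__main__":
    main()
```

With this layout a unit test can import the module and call create_matrix() directly without triggering the script behavior.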

@jhanley634
Collaborator

This PR is drifting toward being moribund. It is still mergeable at this point, though it proposes adding more than 4 million lines of CSV data, which for git is not quite a Best Practice.

It might be helpful to split Data from Code, merge the code, and move on to other issues and feature requests.

@audiodude
Collaborator

Thanks for updating the branch @jhanley634.

We discussed early on in our in-person meetings that our strategy should not be to commit the Netflix prize data. That's the reason we're exploring the other PR that automatically downloads it.

This has already been addressed in this comment: #10 (comment)

@cocomittens
Collaborator

Ok so I updated it to use a sparse matrix and removed the extra variable. Right now it's getting the files by calling list_rating_files in create_interaction_matrix, but it could also get the files in main like it does currently, and take that list as the parameter instead of the directory.

But I'm not pushing the changes yet because I want to make sure I'm not breaking anything (it throws an error, but it also doesn't seem to work without my changes, for different reasons). So just to make sure:
The create_interaction_matrix function is supposed to go through both of the mv_01 and mv_02 files, right?

But then when going through the file, this line
user_id, rating, _ = line.strip().split(",")
throws an error because it's trying to unpack 3 values, and those files only have 1 or 2 per line (with none of them seeming to be in the format of an id, rating). Is this intended to be the movie_titles file? (I would assume not, because that doesn't have the ratings?) Is there supposed to be another file somewhere that contains the rating information like this, or is that supposed to be created somehow from the information in these files?

It looks like currently, these are the files being used in that function?
movie_data = list_rating_files(movie_titles_file)
Which I don't believe would theoretically work if I'm understanding it correctly, as the argument to that function is a directory not a file, and seems to be intended to find and return the mv_01 and mv_02 files? Or am I missing something?
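
For reference, a sketch of what a directory-based list_rating_files could look like. This is a hypothetical reimplementation written for illustration, not the code in the PR; it assumes the rating files sit directly in one directory and are named mv_*.txt.

```python
from pathlib import Path


def list_rating_files(data_dir):
    """Return the sorted mv_*.txt paths found directly inside data_dir."""
    directory = Path(data_dir)
    # Fail loudly when handed a file path (such as movie_titles.txt) by mistake.
    if not directory.is_dir():
        raise NotADirectoryError(f"{data_dir} is not a directory")
    return sorted(directory.glob("mv_*.txt"))
```

Raising on a non-directory argument would surface the mismatch described above, where a file path is passed to a function that expects a directory.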

@cocomittens
Collaborator

cocomittens commented Nov 23, 2024

I guess my greater question is, what are in these mv files exactly?
The first one looks to be a movie_id: , followed by a bunch of user ids? (Presumably those that rated that movie?)
Then the second one is movie_id: , then user ids and dates?
Where is the rating (presumably on a 1-5 scale?) supposed to come from?

main.py Outdated Show resolved Hide resolved
@audiodude
Collaborator

> I guess my greater question is, what are in these mv files exactly? The first one looks to be a movie_id: , followed by a bunch of user ids? (Presumably those that rated that movie?) Then the second one is movie_id: , then user ids and dates? Where is the rating (presumably on a 1-5 scale?) supposed to come from?

The mv_*.txt files are in the form:

MOVIE_ID:
USER_ID,RATING,DATE_OF_RATING
USER_ID_2,RATING,DATE_OF_RATING
USER_ID_3,RATING,DATE_OF_RATING

There is a file for each movie id, and the id in the file name corresponds to the id of the movie that is referenced.
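
A sketch of how a file in that form could be read and turned into coordinate-style inputs for a sparse matrix. The function names parse_rating_file and to_coo are hypothetical, the sample values are invented, and ratings are assumed to be integers.

```python
def parse_rating_file(lines):
    """Yield (movie_id, user_id, rating) from the lines of one mv_*.txt file."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Header line "MOVIE_ID:" names the movie for the rows that follow.
            movie_id = int(line[:-1])
            continue
        user_id, rating, _date = line.split(",")
        yield movie_id, int(user_id), int(rating)


def to_coo(triples):
    """Map raw ids to dense indices and collect row, column and value lists."""
    user_index, movie_index = {}, {}
    rows, cols, values = [], [], []
    for movie_id, user_id, rating in triples:
        rows.append(user_index.setdefault(user_id, len(user_index)))
        cols.append(movie_index.setdefault(movie_id, len(movie_index)))
        values.append(rating)
    return rows, cols, values, user_index, movie_index


# Made-up sample content in the documented layout.
sample = ["1:\n", "101,4,2005-01-02\n", "202,5,2005-03-04\n"]
rows, cols, values, user_index, movie_index = to_coo(parse_rating_file(sample))
```

The three lists are in the shape that scipy.sparse.coo_matrix((values, (rows, cols))) accepts, if SciPy is what the project ends up using for the sparse interaction matrix.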

@cocomittens
Collaborator

cocomittens commented Nov 23, 2024

> I guess my greater question is, what are in these mv files exactly? The first one looks to be a movie_id: , followed by a bunch of user ids? (Presumably those that rated that movie?) Then the second one is movie_id: , then user ids and dates? Where is the rating (presumably on a 1-5 scale?) supposed to come from?
>
> The mv_*.txt files are in the form:
>
> MOVIE_ID:
> USER_ID,RATING,DATE_OF_RATING
> USER_ID_2,RATING,DATE_OF_RATING
> USER_ID_3,RATING,DATE_OF_RATING
>
> There is a file for each movie id, and the id in the file name corresponds to the id of the movie that is referenced.

Ok that's good to know. TBH I will push my changes for now then, because it should theoretically work if that's true...
However, the files that I deleted from this PR sadly don't follow that format for some reason; not sure where the rest of it went?
But where can I actually get this data?
I see that it is in the same place as the other data

@audiodude
Collaborator

Okay we're clearly working on this at the exact same moment, which probably isn't a great idea. Please pull. I consolidated all of the files and put them in a sane place.

@cocomittens
Collaborator

cocomittens commented Nov 23, 2024

> Okay we're clearly working on this at the exact same moment, which probably isn't a great idea. Please pull. I consolidated all of the files and put them in a sane place.

Will do! Hopefully I didn't create some kind of merge conflict, but yeah, will make sure to avoid any future interactions.

main.py Outdated Show resolved Hide resolved
main.py Outdated Show resolved Hide resolved
@audiodude audiodude requested a review from cocomittens December 6, 2024 03:07
@audiodude audiodude merged commit 35709da into main Dec 6, 2024
4 checks passed
@audiodude audiodude deleted the interaction-matrix-of-user branch December 6, 2024 03:11
5 participants