Write a file of embeddings to Parquet from eval recs generation #124
lgtm!
This looks good to me for doing the saving, but it will leave the code in an incompatible state for doing the measurements and automating the run. Up to you whether to fix that in this PR or a second PR.
I like the design of specifying an output directory; this gives us future flexibility to add more output files.
Things that are needed to finish wiring into measurement:
- `dvc.yaml`: change the `recommend-` stages to specify an output directory, and correct the dependency and output files.
- update `measure.py` to take an input directory, and look for `recommendations.parquet` in it.
- `dvc.yaml`: update the `measure-` stages as well.
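For the `measure.py` item, something along these lines might work. This is only a rough sketch: the `find_recommendations` and `parse_args` names and the positional `input_dir` argument are made up for illustration and are not taken from the existing script; only the `recommendations.parquet` file name comes from the list above.

```python
import argparse
from pathlib import Path

# File name from the review comment; everything else here is hypothetical.
RECS_FILE = "recommendations.parquet"


def find_recommendations(input_dir):
    """Return the path to recommendations.parquet inside input_dir."""
    path = Path(input_dir) / RECS_FILE
    if not path.is_file():
        # Fail early with a clear message if the recommend stage output is missing.
        raise FileNotFoundError(f"expected {RECS_FILE} in {input_dir}")
    return path


def parse_args(argv=None):
    """Parse a single positional input directory (assumed CLI shape)."""
    parser = argparse.ArgumentParser(description="Measure recommendation output (sketch).")
    parser.add_argument("input_dir", type=Path, help="directory written by a recommend stage")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(find_recommendations(args.input_dir))
```

Taking a directory rather than a file path keeps `measure.py` in step with the output-directory design, so later output files can be picked up from the same place.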
Need to add the `embeddings.parquet` files to `outs` in `dvc.yaml` (so DVC will track / push / pull them), but otherwise looks good. One further comment on a now vs. later consistency refactor.
This grabs the candidate embeddings from each pipeline execution and builds up a cache of embeddings over the course of the eval run. Once the eval run finishes, it writes them to a two-column Parquet file: the first column is the article id and the second is the corresponding embedding.
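A minimal sketch of the cache-then-write flow described above, not the code in this PR. The `EmbeddingCache` class, the `article_id` / `embedding` column names, and the keep-the-first-embedding policy are all assumptions for illustration; the `embeddings.parquet` file name is the one mentioned in review. It uses `pyarrow` for the write, imported lazily so the cache itself has no hard dependency on it.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


class EmbeddingCache:
    """Accumulates article embeddings across pipeline executions (hypothetical helper)."""

    def __init__(self) -> None:
        self._by_id: Dict[str, List[float]] = {}

    def update(self, candidates: Iterable[Tuple[str, List[float]]]) -> None:
        # Keep the first embedding seen per article id; this assumes an
        # article's embedding does not change during one eval run.
        for article_id, embedding in candidates:
            self._by_id.setdefault(article_id, list(embedding))

    def __len__(self) -> int:
        return len(self._by_id)

    def columns(self) -> Tuple[List[str], List[List[float]]]:
        # Two parallel columns, in insertion order.
        ids = list(self._by_id)
        return ids, [self._by_id[i] for i in ids]

    def write_parquet(self, out_dir: Path) -> Path:
        # Imported here so the cache works without pyarrow installed.
        import pyarrow as pa
        import pyarrow.parquet as pq

        ids, embeddings = self.columns()
        # Column names are assumptions, not taken from the PR.
        table = pa.table({"article_id": ids, "embedding": embeddings})
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / "embeddings.parquet"
        pq.write_table(table, str(path))
        return path
```

In use, the eval loop would call `cache.update(...)` after each pipeline execution and `cache.write_parquet(output_dir)` once at the end of the run.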