Write a file of embeddings to Parquet from eval recs generation #124

karlhigley · 2024-11-01T15:05:51Z

This grabs the candidate embeddings from each pipeline execution and builds up a cache of embeddings over the course of the eval run. Once the eval run finishes, it writes the cache to a two-column Parquet file: the first column is the article ID and the second is the corresponding embedding.
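For readers skimming the thread, here is a rough sketch of the cache-then-write approach described above. It is not the code from this PR: the `EmbeddingCache` class, its method names, and the `article_id` / `embedding` column names are hypothetical, and it assumes `pyarrow` is available for the final write. Only the `embeddings.parquet` filename comes from the discussion.

```python
from pathlib import Path


class EmbeddingCache:
    """Hypothetical helper: accumulates article embeddings across pipeline runs."""

    def __init__(self) -> None:
        # Maps article ID (as a string) to its embedding vector.
        self._embeddings: dict[str, list[float]] = {}

    def update(self, article_ids, embeddings) -> None:
        # Keep the first embedding seen for each article; repeats are ignored.
        for article_id, embedding in zip(article_ids, embeddings):
            self._embeddings.setdefault(str(article_id), [float(x) for x in embedding])

    def to_columns(self) -> tuple[list[str], list[list[float]]]:
        # Return parallel lists: one of IDs, one of embedding vectors.
        return list(self._embeddings.keys()), list(self._embeddings.values())

    def write_parquet(self, out_dir) -> Path:
        # Imported here so the cache itself can be used without pyarrow installed.
        import pyarrow as pa
        import pyarrow.parquet as pq

        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        ids, vectors = self.to_columns()
        # Column names are assumptions, not taken from the PR.
        table = pa.table({"article_id": ids, "embedding": vectors})
        path = out_dir / "embeddings.parquet"
        pq.write_table(table, str(path))
        return path
```

In this sketch, `update` would be called once per pipeline execution with that request's candidate IDs and embeddings, and `write_parquet` once after the eval loop finishes.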

sophiasun0515

lgtm!

mdekstrand

This looks good to me for doing the saving, but it will leave the code in an incompatible state for doing the measurements and automating the run. It's up to you whether to fix that in this PR or a second PR.

I like the design of specifying an output directory; this will give us future flexibility to add more output files.

Things that are needed to finish wiring into measurement:

- dvc.yaml: change the recommend- stages to specify an output directory, and correct the dependency and output files.
- measure.py: update it to take an input directory, and look for recommendations.parquet in it.
- dvc.yaml: update the measure- stages as well.
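The measurement-side change in the list above could look roughly like the following. This is a minimal sketch under assumptions, not the actual script: the `find_recommendations` function name and its error handling are hypothetical, and only the `recommendations.parquet` filename and the "take an input directory" behavior come from the comment.

```python
from pathlib import Path

# Filename named in the review comment; everything else here is illustrative.
RECS_FILENAME = "recommendations.parquet"


def find_recommendations(input_dir) -> Path:
    """Resolve the recommendations file inside an eval run directory."""
    input_dir = Path(input_dir)
    if not input_dir.is_dir():
        raise NotADirectoryError(f"expected an eval run directory: {input_dir}")

    recs_path = input_dir / RECS_FILENAME
    if not recs_path.is_file():
        raise FileNotFoundError(f"{RECS_FILENAME} not found in {input_dir}")
    return recs_path
```

Resolving files by fixed name inside a run directory means that further outputs, such as the embeddings file, can be added later without changing the script's command-line interface.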

mdekstrand

Need to add the embeddings.parquet files to outs in dvc.yaml (so DVC will track / push / pull them), but otherwise looks good. One further comment on a now vs. later consistency refactor.

Write a file of embeddings to Parquet from eval recs generation

523cd24

karlhigley requested review from mdekstrand, rburke2233 and sophiasun0515 November 1, 2024 15:05

karlhigley self-assigned this Nov 1, 2024

sophiasun0515 approved these changes Nov 1, 2024

Remove outdated pipeline implementation

9520ef9

mdekstrand approved these changes Nov 1, 2024

karlhigley added 2 commits November 1, 2024 15:55

Update evaluate.py to read from a directory

97f955d

Update task definitions in dvc.yaml to use input/output directories

fd4d16c

karlhigley requested a review from mdekstrand November 1, 2024 19:55

mdekstrand approved these changes Nov 1, 2024

karlhigley added 2 commits November 4, 2024 09:51

Add embeddings.parquet to outputs in dvc.yaml

9a055c3

Move metrics output files into eval run directories in dvc.yaml

aa7ffb3

karlhigley merged commit 1874a04 into main Nov 4, 2024
5 checks passed
