Write a file of embeddings to Parquet from eval recs generation #124

Merged
merged 6 commits into from
Nov 4, 2024

Conversation

karlhigley
Collaborator

@karlhigley karlhigley commented Nov 1, 2024

This grabs the candidate embeddings from each pipeline execution and builds up a cache of embeddings over the course of the eval run. Once the eval run finishes, it writes them to a two-column Parquet file, where the first column is the article ID and the second is the corresponding embedding.
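
For readers following along, a minimal sketch of that flow is below. It is not the code from this PR: the `EmbeddingCache` class, its method names, the `article_id` / `embedding` column names, and the keep-first policy for repeated articles are all assumptions made for illustration. Only the `embeddings.parquet` file name comes from the review discussion. The write step assumes `pyarrow` is installed.

```python
from pathlib import Path


class EmbeddingCache:
    """Hypothetical helper: accumulates article embeddings across pipeline runs."""

    def __init__(self) -> None:
        # Maps article id -> embedding vector (plain list of floats).
        self._embeddings: dict[str, list[float]] = {}

    def update(self, article_ids, embeddings) -> None:
        """Add the candidate embeddings from one pipeline execution."""
        for article_id, emb in zip(article_ids, embeddings):
            # Assumption: an article's embedding does not change during the
            # eval run, so the first one seen is kept.
            self._embeddings.setdefault(str(article_id), [float(x) for x in emb])

    def __len__(self) -> int:
        return len(self._embeddings)

    def to_columns(self) -> tuple[list[str], list[list[float]]]:
        """Return the cache as two parallel columns: ids and embeddings."""
        ids = list(self._embeddings.keys())
        return ids, [self._embeddings[i] for i in ids]

    def write_parquet(self, out_dir) -> Path:
        """Write the cache to <out_dir>/embeddings.parquet as a two-column table."""
        # Imported here so the cache itself has no hard pyarrow dependency.
        import pyarrow as pa
        import pyarrow.parquet as pq

        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / "embeddings.parquet"

        ids, embs = self.to_columns()
        # Column names are illustrative, not taken from the PR.
        table = pa.table({"article_id": ids, "embedding": embs})
        pq.write_table(table, str(path))
        return path
```

In this sketch the eval loop would call `cache.update(...)` after each pipeline execution and `cache.write_parquet(output_dir)` once at the end of the run.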

Contributor

@sophiasun0515 sophiasun0515 left a comment

lgtm!

Contributor

@mdekstrand mdekstrand left a comment

This looks good to me for doing the saving, but it will leave the code in an incompatible state for doing the measurements and automating the run. It's up to you whether to fix that in this PR or in a second PR.

I like the design of specifying an output directory; it gives us the flexibility to add more output files in the future.

Things that are needed to finish wiring into measurement:

  1. dvc.yaml: change the recommend- stages to specify an output directory, and correct the dependency and output files.
  2. measure.py: update it to take an input directory and look for recommendations.parquet in it.
  3. dvc.yaml: update the measure- stages as well.
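
To make item 2 concrete, here is a rough sketch of how a script could accept an input directory and locate `recommendations.parquet` inside it. The function names and the positional `input_dir` argument are hypothetical; this is not the actual `measure.py` from the repository, just one way the change could look.

```python
import argparse
from pathlib import Path


def find_recommendations(input_dir) -> Path:
    """Return the path to recommendations.parquet inside input_dir."""
    path = Path(input_dir) / "recommendations.parquet"
    if not path.is_file():
        raise FileNotFoundError(f"no recommendations.parquet found in {input_dir}")
    return path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line; the argument name is illustrative only."""
    parser = argparse.ArgumentParser(
        description="Measure recommendations stored in an output directory."
    )
    parser.add_argument(
        "input_dir",
        type=Path,
        help="directory written by a recommend- stage",
    )
    return parser.parse_args(argv)


def main(argv=None) -> Path:
    """Resolve the recommendations file; measurement itself is out of scope here."""
    args = parse_args(argv)
    return find_recommendations(args.input_dir)
```

Taking a directory rather than a single file path means the same argument keeps working if more output files, such as `embeddings.parquet`, are added next to the recommendations later.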

@karlhigley karlhigley requested a review from mdekstrand November 1, 2024 19:55
Contributor

@mdekstrand mdekstrand left a comment

Need to add the embeddings.parquet files to outs in dvc.yaml (so DVC will track / push / pull them), but otherwise looks good. One further comment on a now vs. later consistency refactor.

@karlhigley karlhigley merged commit 1874a04 into main Nov 4, 2024
5 checks passed

3 participants