Locality calibration + text generation #129

Open

wants to merge 104 commits into base: xinyili/development
Conversation

imrecommender (Contributor)

The main logic in the code includes:

  1. an option to enable or disable text generation
  2. the ability to choose whether topic or other filters are considered when finding similar articles from the click history
  3. an option to apply a time-decay weight when identifying similar articles

I tested different config combinations using the uploaded basic request JSON. No bugs have surfaced so far, but the details of the output still need to be checked.
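For illustration, a minimal sketch of how these three options might look as a config object, plus the kind of time-decay weight described in item 3. The field and function names here are hypothetical, not taken from the actual pipeline code:

```python
from dataclasses import dataclass

@dataclass
class LocalityCalibrationConfig:
    generate_text: bool = True         # option 1: enable/disable text generation
    filter_by_topic: bool = True       # option 2: consider topic (or other) filters
    use_time_decay: bool = False       # option 3: weight similar articles by recency
    time_decay_half_life_days: float = 7.0

def decay_weight(age_days: float, half_life_days: float) -> float:
    """Exponential time-decay weight: more recent clicks count more."""
    return 0.5 ** (age_days / half_life_days)
```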

karlhigley and others added 9 commits October 29, 2024 11:50
This copies embeddings from a candidate set to a selected set of
articles, so that the embeddings can be used in downstream analysis for
metrics like ILD (intra-list diversity).
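For reference, ILD is commonly computed as the mean pairwise distance between the embeddings of the items in a list; a minimal sketch, assuming cosine distance and embeddings in a NumPy array (one row per item):

```python
import numpy as np

def intra_list_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance over a list of item embeddings."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    # Normalize rows so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Average cosine *distance* over the off-diagonal pairs only.
    mask = ~np.eye(n, dtype=bool)
    return float((1.0 - sims[mask]).mean())
```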
Since we're using the pipeline implementation from the `lkpipeline`
module now, it's confusing to have this earlier implementation still
hanging around in the code base. Removing for clarity.
…-POPROX#124)

This grabs the candidate embeddings from each pipeline execution and
builds up a cache of embeddings over the course of the eval run. Once
the eval run finishes, it writes them to a two-column Parquet file where
the first column is the article id and the second is the corresponding
embedding.
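A minimal sketch of writing that kind of two-column file with Pandas (the column names here are assumptions, not necessarily those used in the eval code):

```python
import pandas as pd

# Embedding cache accumulated over the eval run: article id -> vector.
cache = {"a1": [0.1, 0.2], "a2": [0.3, 0.4]}

df = pd.DataFrame(
    {"article_id": list(cache.keys()), "embedding": list(cache.values())}
)
# The list-valued embedding column is stored as an Arrow list type.
df.to_parquet("embeddings.parquet")
```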
…RI-POPROX#127)

This brings us **parallel generation** of recommendations for offline
evaluation, and uses that to regenerate our offline recommendation
outputs (and update our reported timings).

This is a big PR, unfortunately, for a couple of reasons:

- Parallelism requires quite a bit of refactoring to encapsulate the
parallel operation in a worker function.
- `ipyparallel` doesn't work well with code defined in the script being
run, so most of the code (in particular the class definitions it uses,
but also various functions) is refactored into imported modules, so the
`ipyparallel` workers can find it correctly.

Highlights:

- Parallelize generation with the `ipyparallel` package (a minimal
worker sketch follows this list). I have found it to work well without
some of the intermittent bugs I've encountered with multiprocessing or
`ProcessPoolExecutor`, and it also makes it easy to run initialization
and finalization tasks on all workers to set up outputs or finish
writing them.
- Refactor `generate` into a package (with `__main__.py`, so we can
still run it with `python -m poprox_recommender.evaluation.generate`),
with various modules implementing its pieces.
- `recommendations` is now a directory instead of a single Parquet file;
software that reads Parquet (including Pandas `read_parquet` and
DuckDB) generally accepts directories of Parquet files and concatenates
the files they contain, to support sharding. We use this to shard the
output by worker process; each worker writes its own Parquet output.
Otherwise, sending data back to the parent and collecting or writing it
there becomes a bottleneck that severely impairs parallelism (I have
learned this through long experience with LensKit).
- Since each worker has its own output, we can no longer de-duplicate
embeddings while we write them. The solution I implemented: each worker
de-duplicates embeddings within its own output and writes the results to
a shared temporary directory; the parent process then collects all those
embeddings, de-duplicates them once more, and writes the result to a
single Parquet file.
- To support the per-worker writing, and to avoid slurping all outputs
into memory, this adds a batched Parquet writer that accumulates batches
of rows and writes them incrementally to the Parquet file (a sketch
follows this list).
- We can still run in a single process by passing `-j 1` to the
evaluation script. The default is to use the `POPROX_NUM_CPUS`
environment variable, falling back to the minimum of the CPU count and 4.
- This adds `recommend-mind-subset` and `measure-mind-subset` tasks to
generate and measure recommendations for a small subset (the first 1K
rows) of MIND, to facilitate quick testing of the offline eval code.
- The `poprox_recommender.evaluation.generate.outputs` module defines
the layout of the outputs from recommendation generation, so the layout
is defined in one place instead of having to keep the different pieces
of the generator consistent by hand.
- **Renames** the “MRR” user metric to “RR”: the individual per-list
metric is the reciprocal rank, and the mean of those values across lists
is the mean reciprocal rank (see the short sketch below).
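For readers unfamiliar with `ipyparallel`, here is a minimal sketch of the pattern described above, including the default worker-count logic; the module and function names are hypothetical, and the real code lives in `poprox_recommender.evaluation.generate`:

```python
import os

import ipyparallel as ipp

# Default worker count: POPROX_NUM_CPUS if set, else min(CPU count, 4).
n_workers = int(os.environ.get("POPROX_NUM_CPUS", min(os.cpu_count() or 1, 4)))

def main(requests):
    # The worker function must live in an importable module (not in the
    # launching script) so the ipyparallel engines can find it.
    from poprox_eval.worker import recommend_for_request  # hypothetical module

    with ipp.Cluster(n=n_workers) as client:
        view = client.load_balanced_view()
        # Each engine runs independently; per-worker setup/teardown
        # (e.g. opening a sharded Parquet output) can be run on all
        # engines before dispatching the work.
        view.map_sync(recommend_for_request, requests)
```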
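And a minimal sketch of a batched Parquet writer of the kind described above, assuming PyArrow; the actual implementation in this PR may differ in names and details:

```python
import pyarrow as pa
import pyarrow.parquet as pq

class BatchedParquetWriter:
    """Accumulate rows and write them to a Parquet file in batches."""

    def __init__(self, path: str, schema: pa.Schema, batch_size: int = 5000):
        self.writer = pq.ParquetWriter(path, schema)
        self.schema = schema
        self.batch_size = batch_size
        self.rows: list[dict] = []

    def write_row(self, row: dict) -> None:
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.rows:
            table = pa.Table.from_pylist(self.rows, schema=self.schema)
            self.writer.write_table(table)
            self.rows = []

    def close(self) -> None:
        self.flush()
        self.writer.close()
```

Because each worker writes its own shard into the output directory, reading the results back is then just something like `pd.read_parquet("recommendations/")` (or DuckDB's `read_parquet` over the directory), which concatenates the shards.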
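Finally, to make the RR/MRR distinction concrete, a small sketch assuming binary relevance from clicks:

```python
def reciprocal_rank(recs: list[str], clicked: set[str]) -> float:
    """RR for one list: 1/rank of the first relevant item, else 0."""
    for rank, item in enumerate(recs, start=1):
        if item in clicked:
            return 1.0 / rank
    return 0.0

# MRR is then simply the mean of the per-list RR values across users.
```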
@sophiasun0515 sophiasun0515 changed the base branch from xinyili/development to main November 6, 2024 18:46
@sophiasun0515 sophiasun0515 changed the base branch from main to xinyili/development November 6, 2024 18:51
Zentavious and others added 17 commits November 8, 2024 09:17
Add generic poprox export support to generate.py
Several relatively small improvements to our Docker config:

- configure Serverless to build for `linux/amd64` explicitly
- update deploy.sh to run Serverless properly with npx, and only pull
the necessary DVC data
- bump the Pixi version in the Docker images
- clean up the Docker image a bit, removing DNF caches (saves 60MB in
the final image)
- add `outputs/` and `.pixi/` to `.dockerignore`, speeding up Docker
build startup considerably
imrecommender and others added 30 commits December 4, 2024 01:09
Remove stray reference to SentenceTransformers
Adding support for date and pipeline splitting to generate
Bake the MiniLM model into the Docker container for deployment
Aggregate Metrics by Recommender, Locality Theta, and Topic Theta for Locality-Cali Pipeline
Unifying evals