add locality calibration and text generation #120

Open · wants to merge 1 commit into base: main
Conversation

imrecommender (Contributor):

Description:

This PR introduces the basic framework for integrating locality calibration and text generation into the recommender system for the locality experiment. The main components are as follows:

  • Locality Calibration: Calibrates recommendations based on today’s local news distribution to align with relevant geographical content.
  • Text Generation: Generates personalized narratives for recommended articles by referencing similar articles previously read by the user.

To-Do:

  1. Determine the optimal number of clicked articles to use when generating each narrative.
  2. Decide whether to implement a cold-start handling mechanism for narrative generation.
  3. Confirm whether clicked articles should be sorted in chronological or reverse-chronological order for narrative accuracy.

@karlhigley (Collaborator) left a comment:

I think it makes sense to have a variety of components available in the core library, so I'm generally fine with adding locality-based calibration and context generation here, but as written I think this crosses the line where it makes more sense for this code to live in a fork of the repo. In particular, it's fine if you want to use openai in your experiment, but I'm very hesitant to add it as a general dependency of this library that everyone will need to install in order to build recommenders. I'm also a bit concerned about making calls out to an external service—it might work fine but that makes the recommender vulnerable to upstream issues with the OpenAI API that are outside our control.

I can see two potential paths forward, depending on how important OpenAI is for what you want to do:

  • Use a local text generation model from Hugging Face (since we're already depending on transformers)
  • Move this code to a fork of the repo and invest some time in making the platform more resilient to timeouts and failures

Thoughts on which seems more workable?
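
For reference, here's a minimal sketch of the first option: swapping the gpt_generate() call for a local Hugging Face pipeline. The model name and the helper name below are illustrative assumptions, not decisions.

from transformers import pipeline

# Small instruction-tuned model chosen only for illustration
_generator = pipeline("text2text-generation", model="google/flan-t5-small")

def local_generate(system_prompt: str, input_prompt: str, max_new_tokens: int = 128) -> str:
    # Seq2seq instruction models take a single input string, so concatenate the prompts
    prompt = f"{system_prompt}\n\n{input_prompt}"
    outputs = _generator(prompt, max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"]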

self.add_article_to_localities(rec_localities_with_candidate, article)
return normalized_category_count(rec_localities_with_candidate)

def calibration(

Collaborator:

Overriding this method to calibrate across the two specific dimensions you care about for this study is an expedient way to get where you want to go and probably makes sense here. Outside the scope of this PR and not a requested change to this code, but we should also think about how to make the default implementation in the base class flexible enough to handle an arbitrary number of dimensions.

Contributor Author:

A rough idea: in general, the calibration calculation in this function should be the same regardless of the number of dimensions (each dimension has a preference distribution and a defined theta). So, could we make the preference and theta inputs one-to-one matched lists (e.g. [topic preference, locality preference] and [topic theta, locality theta])? Then we could iterate over the list and sum the calibration terms 🤔
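
To make that idea concrete, here's a rough sketch under the assumption that each dimension contributes a divergence-style calibration term; all names are hypothetical.

def calibrated_score(relevance, preferences, slate_dists, thetas, divergence):
    # preferences, slate_dists, and thetas are parallel lists, one entry per
    # dimension (e.g. [topic, locality]); divergence(p, q) measures how far the
    # slate distribution q drifts from the preference distribution p
    score = relevance
    for pref, slate, theta in zip(preferences, slate_dists, thetas):
        score -= theta * divergence(pref, slate)
    return score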

def add_article_to_localities(self, rec_localities, article):
    localities = extract_locality(article)
    for local in localities:
        rec_localities[local] = rec_localities.get(local, 0) + 1

Collaborator:

These methods that modify their parameters should work fine but are a little awkward compared to the methods below that return a value. Maybe the two methods above could also copy the parameter, make modifications, and return the modified copy?
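
Something like this, perhaps (a sketch of the suggestion, reusing the existing extract_locality helper, not a requested implementation):

def add_article_to_localities(self, rec_localities, article):
    # Copy the input mapping, update the copy, and return it so the
    # caller's dict is left untouched
    updated = dict(rec_localities)
    for local in extract_locality(article):
        updated[local] = updated.get(local, 0) + 1
    return updated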

return gpt_generate(system_prompt, input_prompt)


model = SentenceTransformer("all-MiniLM-L6-v2")

Collaborator:

Makes sense that you'd need to embed the subheads separately here since the article embedder only does the headlines right now. I'd probably use the same language model for this that the NRMS model is based on, which I think is distilbert-base-uncased

Also worth thinking about whether the article embedder should be configurable to specify which article text fields (headline, subhead, body text) to embed. You can do it here but this probably isn't the last time we're going to run into this, so might make sense to move it upstream and make it re-usable.
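
As a point of reference, a mean-pooled distilbert-base-uncased embedding of the subheads might look roughly like this; the helper name is a placeholder and this isn't tied to the existing article embedder.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed_texts(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)
    # Mean-pool over non-padding tokens to get one vector per text
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)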

Contributor Author:

For handling headlines, I'm wondering: should we use the original distilbert transformer, or can we easily extract embeddings from our distilbert fine-tuned on the MIND data? I agree that making this part more flexible for selecting configurations and text fields would be beneficial.

Collaborator:

It seems like it should be possible to use the fine-tuned one by loading the fine-tuned weights into a plain distilbert model
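
Roughly like this; the checkpoint path and the state-dict key prefix below are assumptions about how the fine-tuned model was saved.

import torch
from transformers import AutoModel

encoder = AutoModel.from_pretrained("distilbert-base-uncased")
state = torch.load("path/to/nrms_finetuned.pt", map_location="cpu")
# Keep only the encoder weights and strip whatever prefix the training wrapper added
encoder_state = {
    key.removeprefix("news_encoder.plm."): value
    for key, value in state.items()
    if key.startswith("news_encoder.plm.")
}
encoder.load_state_dict(encoder_state, strict=False)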

Comment on lines +57 to +58
recommended_list = text_generation(candidate_articles.articles, article_indices, interest_profile)
return ArticleSet(articles=recommended_list)

Collaborator:

This seems like a whole separate component to me, since the text generation gets applied to every selected article and isn't otherwise coupled to the calibration code. I'd probably use the article_indices and candidate_articles to build and return an ArticleSet here:

Suggested change
recommended_list = text_generation(candidate_articles.articles, article_indices, interest_profile)
return ArticleSet(articles=recommended_list)
return ArticleSet(articles=[candidate_articles.articles[idx] for idx in article_indices])

and then wire that output into a new component (called something like ContextGenerator?) that does the text generation part. More on this in the text_generation code below.
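
A rough skeleton of what that separate component might look like; the class name, the generate_narrative helper, and where the generated text ends up on each article are placeholders, and the InterestProfile import is assumed to sit alongside ArticleSet in poprox_concepts.

from poprox_concepts import ArticleSet, InterestProfile

class ContextGenerator:
    def __call__(self, recommended: ArticleSet, interest_profile: InterestProfile) -> ArticleSet:
        for article in recommended.articles:
            # generate_narrative stands in for however the text gets produced
            # (local model vs. external API); storing it on the subhead is one option
            article.subhead = generate_narrative(article, interest_profile)
        return recommended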

Reply:

This was a challenge I was referring to a few weeks ago over Slack. Text generation shouldn't be applied to every article, so we'd need to set a flag of sorts if we want to pull out this functionality to another component.

@karlhigley (Collaborator) commented on Nov 1, 2024:

Yeah, you can set whatever properties you want on ArticleSet, so you could add a list or tensor with boolean flags to indicate whether you want it applied or not and that'll get sent along to subsequent components. I might be missing something but I don't see where in the current code the equivalent selection is happening—it appears to be applying LLM generated text to all recommended articles?
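
For example, something along these lines, assuming ArticleSet accepts extra properties as described above; needs_context is a hypothetical predicate standing in for however the selection gets decided.

selected = ArticleSet(articles=[candidate_articles.articles[idx] for idx in article_indices])
# Extra properties set on the ArticleSet travel along to subsequent components
selected.generate_context = [needs_context(article) for article in selected.articles]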

Contributor Author:

I've moved text_generation out of the __call__ method.

return top_k_indices


def text_generation(candidate_articles, article_indices, interest_profile):

Collaborator:

It looks like candidate_articles and article_indices are used here to reconstruct the equivalent of an ArticleSet of the selected articles. If this turns into a separate component, then the __call__() method could receive the same information using an ArticleSet and an InterestProfile as inputs.

@@ -3,8 +3,9 @@

from poprox_concepts import ArticleSet
from poprox_concepts.api.recommendations import RecommendationRequest, RecommendationResponse
from poprox_recommender.components.diversifiers.locality_calibration import user_interest_generate

Collaborator:

The handler shouldn't know anything about individual recommenders, so importing code from this recommender-specific component is a clue that the code below is probably in the wrong place. Could these changes live somewhere in the locality_calibration module?

Contributor Author:

Moved to the locality_calibration module.

Contributor Author:

Just to confirm: could we modify the code in handler.py in our forked repository?

@karlhigley (Collaborator):

The tests are failing due to the openai import, since openai hasn't been added to the dependencies yet

@mdekstrand (Contributor) commented Oct 29, 2024:

I agree with @karlhigley's comments on dependencies — let's keep everything in the main recommenders repo self-contained (no external services).

@sophiasun0515 (Contributor) left a comment:

Looks good to me overall! I think we can determine the number of clicked articles (k) after doing some internal testing.

For cold-start handling, I don't see a current need for that (assuming we can exclude inactive users in the pre-experiment observation window).

As for the clicked-article sorting, I think it'll be useful for accounting for user memory decay (i.e. an article clicked last week would be more salient to users than one from 2 months ago). Maybe we can fuse the chronological factor into the embedding similarity computation, so there's another layer to consider for candidate article selection besides content embeddings?

Final note -- regarding Karl and Michael's comments, I think it might be a better idea to migrate this PR to our forked repo.

Comment on lines +237 to +238
similarities = cosine_similarity([target_embedding], candidate_embeddings)[0]
top_k_indices = np.argsort(similarities)[-k:][::-1]

Contributor:

Here's what I'm thinking about how we could fuse chronological order into candidate selection.
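
A minimal sketch of that idea, discounting cosine similarity by an exponential memory-decay factor; the half-life value, the age inputs, and the function name are all illustrative.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_with_recency(target_embedding, candidate_embeddings, ages_in_days, k, half_life_days=30.0):
    similarities = cosine_similarity([target_embedding], candidate_embeddings)[0]
    # Weight of 1.0 for an article clicked today, 0.5 at the half-life, decaying toward 0
    decay = np.exp(-np.log(2) * np.asarray(ages_in_days) / half_life_days)
    scores = similarities * decay
    return np.argsort(scores)[-k:][::-1]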
