News Locality Calibration #103

Merged: sophiasun0515 merged 5 commits from sophia/expt-news-locality-bias into main on Oct 2, 2024
Conversation

@sophiasun0515 (Contributor) commented on Sep 23, 2024:

This PR applies some refactoring by creating a Calibrator base class that both TopicCalibration and the new LocalityCalibration inherit from. It also updates the current test_calibration test suites.
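A minimal sketch of the intended class structure (the constructor parameters and article attribute names below are illustrative assumptions, not the actual POPROX code):

from abc import ABC, abstractmethod


class Calibrator(ABC):
    """Shared calibration scaffolding; subclasses decide what to calibrate on."""

    def __init__(self, theta: float = 0.1, num_slots: int = 10):
        self.theta = theta
        self.num_slots = num_slots

    @abstractmethod
    def add_article_to_categories(self, rec_categories: dict, article) -> None:
        """Update the category counts of the selected set with one article."""


class TopicCalibration(Calibrator):
    def add_article_to_categories(self, rec_categories, article):
        # Assumed shape: each article exposes a list of topic labels.
        for topic in getattr(article, "topics", []):
            rec_categories[topic] += 1


class LocalityCalibration(Calibrator):
    def add_article_to_categories(self, rec_categories, article):
        # Assumed shape: each article exposes a list of locality labels.
        for locality in getattr(article, "localities", []):
            rec_categories[locality] += 1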

Some TODOs we need to figure out for next step:

  • Add tests for the current logic
  • Tune the calibration parameters (the current params are borrowed from topic calibration)
  • Run the code on an S3 instance and call the necessary endpoints from POPROX
  • Figure out how to connect the logic to participant selection (who will be included in this experiment?)
  • Integrate the LLM context generation into the calibrated articles

@karlhigley (Collaborator) left a comment:

Looks good!

        )
        return ArticleSet(articles=[candidate_articles.articles[int(idx)] for idx in article_indices])

    def calibration(self, relevance_scores, articles, preferences, theta, topk) -> list[Article]:
Collaborator:

This method looks pretty much the same as the corresponding method on the TopicCalibrator. Wonder if it could move into the base class?

@sophiasun0515 (Contributor, Author):

Yeah, that's certainly possible. I was a bit hesitant before since the calibration function will call the overridden add_article_to_categories() function, but maybe that's OK as well. I'll try factoring calibration() and normalized_categories_with_candidate() into the parent class and see how it goes!

@@ -32,7 +36,7 @@ def select_articles(
     pipeline state whose ``default`` is the final list of recommendations.
     """
     available_pipelines = recommendation_pipelines(device=default_device())
-    pipeline = available_pipelines["nrms"]
+    pipeline = available_pipelines["cali"]
Collaborator:

This will change the default when no recommender is specified in the request from plain NRMS to NRMS+Locality calibration. That's probably okay if you plan to run this experiment from a separate endpoint, but maybe not something we want to do in the main/default recommender endpoint

Comment:

This raises a good point. Are we expected to fork the repo and run the fork?

Is POPROX planning to have a release cycle so we can pick up changes from the main POPROX repo?

Collaborator:

Current plan is to have people fork the repo and deploy from the fork. No current plans for a defined release cycle since it hasn't become an issue yet.

@sophiasun0515 (Contributor, Author):

Certainly! So based on the conversation, it looks like it's OK to keep the pipeline as cali instead of nrms if we will only run this with allocated participants?

I'm actually also thinking about the possibility of using some experiment flag to determine which pipeline we want to choose here:

if interest_profile.experiment_id == LOCALITY_EXPT_ID:
    pipeline = available_pipelines["cali"]
else:
    pipeline = available_pipelines["nrms"]

But I realized that might not be the way we want to allocate users based on the experiment manifest?

Collaborator:

By the time the endpoint defined in this repo receives a request, it will already be for the appropriate recommender for each user according to the group assignments. The section that defines the experiment recommenders in the manifest will look something like this:

[recommenders.calibration]
endpoint = "https://n07svs3qeh.execute-api.us-east-1.amazonaws.com/?pipeline=calibration"

The platform side sends requests for users assigned to groups that use that recommender to that URL, and the query parameter at the end of the URL gets used by this code to select the appropriate pipeline.

(It's set up that way so that the recommender code doesn't need to know anything about experiments, groups, phases, etc., in order to keep it simple for experimenters.)
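As a rough illustration of that last step (the helper name and fallback behavior here are assumptions, not the actual endpoint code), the handler only has to map the pipeline query parameter onto the dict of available pipelines:

def choose_pipeline(query_params: dict[str, str], available_pipelines: dict):
    # The platform appends ?pipeline=<name> to the recommender URL;
    # fall back to the plain NRMS pipeline when the parameter is
    # missing or names an unknown pipeline.
    requested = query_params.get("pipeline", "nrms")
    return available_pipelines.get(requested, available_pipelines["nrms"])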


        return recommendations

    def normalized_categories_with_candidate(self, rec_categories, article):
Collaborator:

This looks very similar to the method on the other class too and might also be able to move to the base class
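For reference, a sketch of what the shared method might look like on the base Calibrator, assuming the category counts are plain dicts; the real implementation may differ:

from collections import defaultdict


class Calibrator:  # continuing the base-class sketch from the PR description
    def normalized_categories_with_candidate(self, rec_categories, article):
        # Tentatively add the candidate to a copy of the selected set's counts,
        # via the subclass-specific hook (topics vs. localities).
        with_candidate = defaultdict(int, rec_categories)
        self.add_article_to_categories(with_candidate, article)

        # Normalize the counts into a probability distribution over categories.
        total = sum(with_candidate.values()) or 1
        return {category: count / total for category, count in with_candidate.items()}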

@zentavious left a comment:

The only major point we should talk about is calibrating on the user's locality preferences vs. a fixed locality distribution.

Otherwise everything else is primarily clarifying questions.

@@ -43,6 +76,19 @@ def user_topic_preference(past_articles: list[Article], click_history: list[Clic
     return topic_count_dict


+def user_locality_preference(past_articles: list[Article], click_history: list[Click]) -> dict[str, int]:

Comment:

Do we want to calibrate on the user's locality preference? I thought we might want to calibrate on the distribution of today's articles or a fixed distribution from our prior investigation.

Notably, I see an issue at the cold start of our experiment: users might have a very focused click history due to the focus of the baseline POPROX algorithm, so their locality preference will be very biased.

@sophiasun0515 (Contributor, Author):

Could you elaborate on what we might want to calibrate on in this function? I was thinking of it as just a function to compute the "ground truth" of the user's click history with the system; the actual calibration step happens after we get their preference (in the new LocalityCalibration class).

The cold-start issue might exist, but if we have a one-month window to observe users and collect their preferences, would that be a feasible solution?
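For context, a minimal sketch of the kind of counting this function might do (field names such as article_id and localities are assumptions; Article and Click come from the POPROX concepts package in the real module):

from __future__ import annotations

from collections import defaultdict


def user_locality_preference(past_articles: list[Article], click_history: list[Click]) -> dict[str, int]:
    # Resolve clicks to articles via their ids (field names are assumptions).
    articles_by_id = {article.article_id: article for article in past_articles}

    locality_counts: dict[str, int] = defaultdict(int)
    for click in click_history:
        article = articles_by_id.get(click.article_id)
        if article is None:
            continue
        # Assumed: each past article carries one or more locality labels.
        for locality in getattr(article, "localities", []):
            locality_counts[locality] += 1
    return dict(locality_counts)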

best_candidate_score = adjusted_candidate_score
candidate = article_idx

if candidate is not None:

Comment:

Is the only case where candidate would be None the one where we've already appended the entire list of article_idx values to recommendations?

Asking because, if a downstream class expects there to always be 10 recommendations, this could be an ideal spot to throw an error or at least a warning (not sure how much logging POPROX has or wants to have).

Collaborator:

Our intent is to make the downstream code robust in the sense that the expected number of recommendations will always be generated if you construct your recommendation pipeline to do so. That doesn't imply that this code needs to always produce 10 recommendations though, because you can (for example) use the Fill component downstream to fill in however many slots are left with a second source of recommendations (like the UniformSampler, which chooses at random.)
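A rough illustration of that fill idea in plain Python (not the actual Fill or UniformSampler components): a primary list is topped up from a backup candidate pool until the slot count is reached.

import random


def fill_recommendations(primary: list, backup_pool: list, num_slots: int = 10) -> list:
    # Start with whatever the calibrator produced.
    recs = list(primary)

    # Top up remaining slots by sampling uniformly from unused candidates.
    remaining = [article for article in backup_pool if article not in recs]
    random.shuffle(remaining)
    needed = max(0, num_slots - len(recs))
    recs.extend(remaining[:needed])
    return recs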

# R is all candidates (not selected yet)

recommendations = []  # final recommendation (topk index)
rec_categories = defaultdict(int)  # frequency distribution of categories of S

Comment:

Not sure if I understand this comment. Maybe it's just copied from topic_calibration?

Collaborator:

I think the S vs R nomenclature comes from the MMR paper, which then got carried into our calibrator implementation. R is the set of available candidates Remaining; S is the set of items Selected so far.
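To make the nomenclature concrete, a schematic version of the greedy loop (a sketch, not the actual calibration() implementation; score_candidate stands in for the relevance-plus-calibration score):

def greedy_select(num_candidates: int, score_candidate, topk: int) -> list[int]:
    # R: indices of candidates Remaining; S: indices Selected so far.
    R = set(range(num_candidates))
    S: list[int] = []

    while R and len(S) < topk:
        # Each pass moves the best-scoring remaining candidate into S;
        # the score can depend on what is already in S.
        best = max(R, key=lambda idx: score_candidate(idx, S))
        R.remove(best)
        S.append(best)
    return S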

        )
        return ArticleSet(articles=[candidate_articles.articles[int(idx)] for idx in article_indices])

    def calibration(self, relevance_scores, articles, preferences, theta, topk) -> list[Article]:

Comment:

Request: This function could also return a list of flags for articles that have been swapped from the original relevance_scores list for easy manipulation of those articles' slugs in a later stage.

Collaborator:

The relevance scores are sorted in a sense, but not in descending order by score. The order matches the list of candidate articles, so that the same position in the list of articles and in the list of scores correspond to each other.

In theory, this calibration component could compute the top-k list and do some kind of comparison to generate the kind of flags you're talking about, but it currently generates its own list of recs directly from the scores rather than operating on the top-k list and swapping things around.

If you want to identify which positions of the list are different from the plain top-k list, I'd suggest using this component, the TopkRanker component, and comparing the output of the two in a new GenerateExplanations component.
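A minimal sketch of the comparison such a GenerateExplanations component could make (function and field names are assumptions): flag each slot in the calibrated list whose article does not appear in the plain top-k list.

def flag_swapped_articles(calibrated_ids: list[str], topk_ids: list[str]) -> list[bool]:
    # True for slots whose article would not have appeared in the plain
    # top-k ranking; those are the articles whose slugs the later LLM
    # explanation stage might want to annotate.
    topk_set = set(topk_ids)
    return [article_id not in topk_set for article_id in calibrated_ids]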

@sophiasun0515 (Contributor, Author):

Thanks for pointing us to the TopkRanker component; indeed, there should be a new GenerateExplanations component where we'll apply the LLM logic. We'll discuss that with the team.

@sophiasun0515 force-pushed the sophia/expt-news-locality-bias branch from 2766467 to 50e9c76 on October 1, 2024, 21:01
@sophiasun0515 marked this pull request as ready for review on October 1, 2024, 21:02

@karlhigley (Collaborator) commented:

I'd give feedback if I had any, but this LGTM. I like how the Calibrator turned out! 👍🏻

@sophiasun0515 merged commit 2fd4b13 into main on Oct 2, 2024. 5 checks passed.