News Locality Calibration #103
Conversation
Looks good!
```python
)
return ArticleSet(articles=[candidate_articles.articles[int(idx)] for idx in article_indices])

def calibration(self, relevance_scores, articles, preferences, theta, topk) -> list[Article]:
```
This method looks pretty much the same as the corresponding method on the `TopicCalibrator`. Wonder if it could move into the base class?
Yeah, that's certainly possible. I was a bit hesitant before since the `calibration` function will call the overridden `add_article_to_categories()` function, but maybe that's OK as well. I'll try to factor `calibration()` and `normalized_categories_with_candidate()` into the parent class and see how it goes!
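A rough sketch of what that factoring could look like; apart from the method names discussed in this thread, the class names and bodies are assumptions for illustration, not the actual poprox-recommender code:

```python
# Hypothetical sketch of the shared base class; real signatures may differ.
class Calibrator:
    def calibration(self, relevance_scores, articles, preferences, theta, topk):
        # The shared greedy re-ranking loop would live here; it calls the hook
        # below, which each subclass overrides to map an article to categories.
        ...

    def add_article_to_categories(self, rec_categories, article):
        raise NotImplementedError

    def normalized_categories_with_candidate(self, rec_categories, article):
        # Shared normalization logic (discussed further below) could also live here.
        ...


class TopicCalibrator(Calibrator):
    def add_article_to_categories(self, rec_categories, article):
        ...  # count the candidate article's topic labels


class LocalityCalibrator(Calibrator):
    def add_article_to_categories(self, rec_categories, article):
        ...  # count the candidate article's locality labels
```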
```diff
@@ -32,7 +36,7 @@ def select_articles(
     pipeline state whose ``default`` is the final list of recommendations.
     """
     available_pipelines = recommendation_pipelines(device=default_device())
-    pipeline = available_pipelines["nrms"]
+    pipeline = available_pipelines["cali"]
```
This will change the default when no recommender is specified in the request from plain NRMS to NRMS+Locality calibration. That's probably okay if you plan to run this experiment from a separate endpoint, but maybe not something we want to do in the main/default recommender endpoint
This raises a good point. Are we expected to fork the repo and run the fork?
Is POPROX planning to have a release cycle so we can uptake changes to the main POPROX?
Current plan is to have people fork the repo and deploy from the fork. No current plans for a defined release cycle since it hasn't become an issue yet.
Certainly! So based on the conversation, it looks like it's OK to keep the pipeline as `cali` instead of `nrms` if we will only run this with allocated participants?
I'm actually also thinking about the possibility of using some experiment flag to determine which pipeline we want to choose here:
```python
if interest_profile.experiment_id == LOCALITY_EXPT_ID:
    pipeline = available_pipelines["cali"]
else:
    pipeline = available_pipelines["nrms"]
```
But I realized that might not be the way we want to allocate users based on the experiment manifest?
By the time the endpoint defined in this repo receives a request, it will already be for the appropriate recommender for each user according to the group assignments. The section that defines the experiment recommenders in the manifest will look something like this:
```toml
[recommenders.calibration]
endpoint = "https://n07svs3qeh.execute-api.us-east-1.amazonaws.com/?pipeline=calibration"
```
The platform side sends requests for users assigned to groups that use that recommender to that URL, and the query parameter at the end of the URL gets used by this code to select the appropriate pipeline.
(It's set up that way so that the recommender code doesn't need to know anything about experiments, groups, phases, etc. in order to keep it simple for experimenters.)
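As a rough illustration of that dispatch, a sketch along these lines is plausible; the function shape and parameter handling are assumptions rather than the actual endpoint code, and imports for `recommendation_pipelines` and `default_device` (this repo's existing helpers) are omitted:

```python
# Hypothetical sketch: pick a pipeline by the ?pipeline=... query parameter,
# falling back to the plain NRMS pipeline when none is given.
def choose_pipeline(query_params: dict[str, str]):
    available_pipelines = recommendation_pipelines(device=default_device())
    name = query_params.get("pipeline", "nrms")
    return available_pipelines[name]
```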
```python
return recommendations

def normalized_categories_with_candidate(self, rec_categories, article):
```
This looks very similar to the method on the other class too and might also be able to move to the base class
The only major point we should talk about is calibrating on the user's locality preferences vs. a fixed locality distribution.
Otherwise, everything else is primarily clarifying questions.
```diff
@@ -43,6 +76,19 @@ def user_topic_preference(past_articles: list[Article], click_history: list[Clic
     return topic_count_dict


+def user_locality_preference(past_articles: list[Article], click_history: list[Click]) -> dict[str, int]:
```
Do we want to calibrate on the user's locality preference? I thought we might want to calibrate on the distribution of today's articles or a fixed distribution from our prior investigation.
Notably, I see an issue at the cold start of our experiment: users might have a very focused click history due to the focus of the baseline POPROX algorithm, so their locality preference will be very biased.
Could you elaborate on what we might want to calibrate on in this function? I was thinking of it as just a function to compute the "ground truth" of the user's click history with the system, and the actual calibration step happens after we get their preference (in the new `LocalityCalibration` class).
The cold-start issue might exist, but if we have a one-month window to observe users and collect their preferences, could that be a feasible solution?
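For concreteness, a minimal sketch of what this click-history count might look like; the `article_id` and `localities` attributes are assumptions about the data model, not the actual poprox code:

```python
from collections import defaultdict

def user_locality_preference(past_articles, click_history):
    # Count how often each locality appears among the articles the user clicked.
    clicked_ids = {click.article_id for click in click_history}  # assumed attribute
    locality_counts = defaultdict(int)
    for article in past_articles:
        if article.article_id in clicked_ids:  # assumed attribute
            for locality in getattr(article, "localities", []):  # assumed attribute
                locality_counts[locality] += 1
    return dict(locality_counts)
```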
```python
    best_candidate_score = adjusted_candidate_score
    candidate = article_idx

if candidate is not None:
```
Is the only case when candidate would be None if we've already appended the entire list of article_idx to recommendations?
Asking because if a downstream class expects there to always be 10 recommendations this could be an ideal spot to throw an error or at least a warning (not sure how much logging POPROX has or wants to have)
Our intent is to make the downstream code robust in the sense that the expected number of recommendations will always be generated if you construct your recommendation pipeline to do so. That doesn't imply that this code needs to always produce 10 recommendations though, because you can (for example) use the `Fill` component downstream to fill in however many slots are left with a second source of recommendations (like the `UniformSampler`, which chooses at random).
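If a warning at this spot is still wanted, a lightweight option is a standard-library `logging` call like the sketch below; whether and how POPROX configures logging is an assumption:

```python
import logging

logger = logging.getLogger(__name__)

# After the greedy selection loop: note when we ran out of candidates early,
# leaving a downstream component (e.g. Fill) to top up the remaining slots.
if len(recommendations) < topk:
    logger.warning(
        "calibration selected %d of %d requested articles",
        len(recommendations),
        topk,
    )
```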
```python
# R is all candidates (not selected yet)

recommendations = []  # final recommendation (topk index)
rec_categories = defaultdict(int)  # frequency distribution of categories of S
```
Not sure if I understand this comment. Maybe it's just copied from topic_calibration?
I think the `S` vs `R` nomenclature comes from the MMR paper, which then got carried into our calibrator implementation. `R` is the set of available candidates Remaining; `S` is the set of items Selected so far.
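In code terms, the pattern being described is roughly this greedy loop; this is a generic sketch of the S/R selection, not the exact calibrator implementation, and `adjusted_score` stands in for whatever combination of relevance and calibration the component uses:

```python
def greedy_select(num_candidates, topk, adjusted_score):
    # R = candidates Remaining (not selected yet); S = items Selected so far.
    S = []
    R = set(range(num_candidates))
    while R and len(S) < topk:
        # Pick the remaining candidate whose adjusted score is highest,
        # given what has already been selected into S.
        best = max(R, key=lambda idx: adjusted_score(idx, S))
        S.append(best)
        R.remove(best)
    return S
```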
```python
)
return ArticleSet(articles=[candidate_articles.articles[int(idx)] for idx in article_indices])

def calibration(self, relevance_scores, articles, preferences, theta, topk) -> list[Article]:
```
Request: This function could also return a list of flags for articles that have been swapped from the original relevance_scores list for easy manipulation of those articles' slugs in a later stage.
Relevance scores are sorted in a sense, but not in descending order by score. The order matches the list of candidate articles, so that the same position in the list of articles and in the list of scores correspond to each other.
In theory, this calibration component could compute the top-k list and do some kind of comparison to generate the kind of flags you're talking about, but it currently generates its own list of recs directly from the scores rather than operating on the top-k list and swapping things around.
If you want to identify which positions of the list are different from the plain top-k list, I'd suggest using this component, the `TopkRanker` component, and comparing the output of the two in a new `GenerateExplanations` component.
Thanks for pointing us to the `TopkRanker` component; indeed, there should be a new `GenerateExplanations` component where we'll apply the LLM logic. We'll discuss that with the team.
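A rough sketch of the comparison such a `GenerateExplanations` component could perform; the `article_id` attribute and the membership-based notion of "swapped" are assumptions, not existing poprox code:

```python
def swapped_flags(topk_articles, calibrated_articles):
    # True at position i if the calibrated list's i-th article does not appear
    # anywhere in the plain top-k list, i.e. it was swapped in by calibration.
    topk_ids = {a.article_id for a in topk_articles}  # assumed attribute
    return [a.article_id not in topk_ids for a in calibrated_articles]
```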
(Branch force-pushed from 89a7e83 to 2766467, then from 2766467 to 50e9c76.)
I'd give feedback if I had any, but this LGTM. I like how the …
Apply some refactoring by creating a `Calibrator` object that both `TopicCalibration` and the new `LocalityCalibration` inherit from. Updated the current `test_calibration` test suites.
Some TODOs we need to figure out for the next step: