
added code for instance based weighing for rank objectives #3379

Conversation

ngoyal2707
Contributor

@ngoyal2707 ngoyal2707 commented Jun 12, 2018

This PR uses sample_weights per list instance for all the ranking objectives. This is important for use cases where, in a production system, we want to give some sample instances more weight than others. (Here, a sample instance is a query together with its entire document set.)

Testing Done:

  1. Used the Python wrapper to train on the weighted loss and evaluate on the weighted loss. This performs much better than training on the unweighted loss.
    usage:
    train_data_xgboost.set_weight(train_pairs_weight)

With the current implementation, the weights need to be normalized externally, as follows:

train_pairs_weight /= train_pairs_weight.sum()
train_pairs_weight *= train_pairs_weight.shape[0]

which is essentially computing:
w_i = w_i * N / (sum_j w_j)

If you think the normalization should be done inside the source code, let me know; I can easily update the PR accordingly.
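For concreteness, the external normalization described above can be sketched in plain Python (the weight values here are hypothetical; in practice train_pairs_weight would be a NumPy array attached to the DMatrix via set_weight):

```python
# Hypothetical per-group weights for 4 query groups.
train_pairs_weight = [1.0, 2.0, 3.0, 4.0]

n = len(train_pairs_weight)
total = sum(train_pairs_weight)

# w_i <- w_i * n / sum_j(w_j): rescale so the weights sum to n,
# i.e. the average weight is 1, matching an unweighted run.
train_pairs_weight = [w * n / total for w in train_pairs_weight]

# train_data_xgboost.set_weight(train_pairs_weight)  # then attach to the DMatrix
```

After this rescaling the relative importance of the groups is unchanged, but the total weight equals the number of groups.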

EDIT:
This PR should resolve both of the following issues:
#2460
#2561

@codecov-io

codecov-io commented Jun 12, 2018

Codecov Report

Merging #3379 into master will not change coverage.
The diff coverage is 0%.


@@            Coverage Diff            @@
##             master    #3379   +/-   ##
=========================================
  Coverage     44.99%   44.99%           
  Complexity      228      228           
=========================================
  Files           166      166           
  Lines         12787    12787           
  Branches        466      466           
=========================================
  Hits           5754     5754           
  Misses         6841     6841           
  Partials        192      192
Impacted Files              | Coverage Δ      | Complexity Δ
src/objective/rank_obj.cc   | 11.04% <0%> (ø) | 0 <0> (ø) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34e3edf...e00ddec. Read the comment docs.

@ngoyal2707
Contributor Author

Hey @hcho3, @tqchen: I wonder if you can take a look at this and share your opinions.

@hcho3 hcho3 self-requested a review June 16, 2018 03:15
Collaborator

@hcho3 hcho3 left a comment


@ngoyal2707 Thank you for submitting this PR. I understand that the ranking objective should take account of instance weights.

A few things:

  1. I think you should include weight normalization in the ranking objective code, since the user-defined instance weights are usually not expected to sum to N.
  2. Can you add some tests? You said

Used python wrapper to train on weighted loss and evaluate on weighted loss. Performs much better than just training on unweighted loss.

and I think you are right. Can you craft example data to demonstrate it? For instance, you could set zero weights for a few instances and check whether those instances still affect the loss.

Ideally, you should add two tests: 1) one showing that the weighted ranking objective is correct; and 2) another containing an edge case where the old implementation leads to nonsensical results.

Let me know if you need help writing a test. I'd be more than happy to assist.
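The zero-weight check suggested above can be sketched independently of the XGBoost codebase. The snippet below is a toy model, not the actual rank_obj.cc code: a pairwise logistic gradient where each query group's contribution is scaled by its instance weight, so a group with weight 0 must produce all-zero gradients.

```python
import math

def pairwise_gradients(scores, labels, group_weight):
    """Toy pairwise-logistic gradients for one query group,
    scaled by the group's instance weight."""
    grad = [0.0] * len(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:  # instance i should rank above j
                # derivative of log(1 + exp(-(s_i - s_j))) w.r.t. s_i
                g = -1.0 / (1.0 + math.exp(scores[i] - scores[j]))
                grad[i] += group_weight * g
                grad[j] -= group_weight * g
    return grad

# A zero-weight group contributes nothing; a weight-1 group pushes the
# relevant instance up (negative gradient) and the other down.
g0 = pairwise_gradients([0.2, 0.9], [1, 0], group_weight=0.0)
g1 = pairwise_gradients([0.2, 0.9], [1, 0], group_weight=1.0)
```

A real test along these lines would train two boosters, one with a group's weight set to 0 and one with that group removed, and check that they produce the same model.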

@ngoyal2707 ngoyal2707 force-pushed the ngoyal_add_sample_instance_based_weighing_to_rank_objective branch from e00ddec to 95c6292 Compare June 19, 2018 17:58
@ngoyal2707
Contributor Author

@hcho3 Thanks for looking into the PR. As per your comments, I have made the following changes:

  1. Moved weight normalization inside rank_obj.cc, so there is no need to normalize externally.
  2. Added tests showing that weights work with the "rank:pairwise" objective.

Questions:

  1. Is there an alternative to looping over the whole dataset serially to compute sum_weights inside GetGradients? It seems like this should only need to be done once. That said, I saw that other objectives follow the same pattern (https://github.com/dmlc/xgboost/blob/master/src/objective/regression_obj.cc#L219), so I guess it's not too slow?

Side Note:

  1. I think there is a small bug (or oddity) in how rank_obj.cc creates pairs: the way the loop is written, it can create the same pair twice, since it never checks whether a pair has already been added. I don't think it makes much difference, but do you think it is intentional, or should it be corrected?
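To illustrate the side note: if pairs are sampled with replacement and never checked against pairs already added, duplicates are possible. A minimal sketch (the sampling scheme here is illustrative, not the actual rank_obj.cc loop):

```python
import random

def sample_pairs(pos_idx, neg_idx, num_samples, rng):
    """Sample (positive, negative) index pairs with replacement,
    with no check for pairs that were already added."""
    pairs = []
    for _ in range(num_samples):
        pairs.append((rng.choice(pos_idx), rng.choice(neg_idx)))
    return pairs

# With 1 positive and 1 negative candidate, sampling 3 pairs must
# repeat the same pair (pigeonhole principle).
pairs = sample_pairs([0], [1], 3, random.Random(0))
```

Whether duplicate pairs matter in practice depends on the loss: a repeated pair simply counts its gradient contribution twice, which is equivalent to giving that pair double weight.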

@ngoyal2707 ngoyal2707 force-pushed the ngoyal_add_sample_instance_based_weighing_to_rank_objective branch from 95c6292 to b3ee94f Compare June 22, 2018 00:18
@ngoyal2707
Contributor Author

@hcho3 your weight-normalization changes look good to me. Can you please merge them into master now?

@hcho3 hcho3 force-pushed the ngoyal_add_sample_instance_based_weighing_to_rank_objective branch from 032912d to 93a4937 Compare June 22, 2018 21:09
@hcho3
Collaborator

hcho3 commented Jun 22, 2018

@ngoyal2707 I ended up removing my latest change. The normalization factor would need to be re-computed whenever the objective function object is reused on a different dataset, and I could not find a good way to detect the change of dataset. For now, I think we should just recompute the normalization factor each time, to be on the safe side.

@ngoyal2707
Contributor Author

@hcho3 I see, good point; sorry I didn't catch that.
Can you please merge my original request then? I can look into whether that computation can be moved to the constructor as a follow-up refactor.

@hcho3
Collaborator

hcho3 commented Jun 22, 2018

@ngoyal2707 Yes, I will merge it once all tests pass.

@ejalonas

One dumb question @ngoyal2707: can you clarify the structure of train_pairs_weight in your initial example? Are the weights per group of results, or per result within a group? For example, if I have a DMatrix with 5 groups of 10 results each for pairwise ranking, is the weight structure an array of 5 floats (one weight per group), an array of 50 entries (one weight per result), or an array of length 5 where each entry is an array of length 10 (again one weight per result)?

@ngoyal2707
Contributor Author

ngoyal2707 commented Jun 26, 2018

Hey @ejalonas, in my implementation train_pairs_weight is at the group level. So for your example of 5 groups of 10 results each, you would supply 5 weights.

If you want individual instance-based weights, you can use the rank:ndcg objective and encode the weights in the magnitude of the labels, although the ndcg scaling from label magnitude to weight is not linear.
Currently the rank:pairwise objective ignores the magnitude of the label.
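To make the shape concrete for the 5-groups-of-10 example: with group-level weighting, the weight array has one entry per group, matching the group sizes passed to set_group. A sketch (the XGBoost calls are shown commented out; the shape bookkeeping is plain Python, and the weight values are hypothetical):

```python
# 5 query groups of 10 results each -> 50 rows in the DMatrix.
group_sizes = [10, 10, 10, 10, 10]          # dtrain.set_group(group_sizes)
group_weights = [1.0, 2.0, 0.5, 1.0, 3.0]   # dtrain.set_weight(group_weights)

num_rows = sum(group_sizes)

# One weight per group, NOT one per row:
assert len(group_weights) == len(group_sizes)

# If you ever need the effective per-row view, expand by group size:
per_row = [w for w, n in zip(group_weights, group_sizes) for _ in range(n)]
```

So every result in the second group effectively carries weight 2.0, and so on; there is no per-row weight array to build yourself.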

@ejalonas

Thanks @ngoyal2707 ! I really appreciate your clarification and contribution to this repo.

CodingCat pushed a commit to CodingCat/xgboost that referenced this pull request Jul 26, 2018
* added code for instance based weighing for rank objectives

* Fix lint
@sunlei198911

sunlei198911 commented Aug 28, 2018

@ngoyal2707 which release will be the first to support the group weights?

@lock lock bot locked as resolved and limited conversation to collaborators Nov 26, 2018