
added code for instance based weighing for rank objectives #3379

Conversation

ngoyal2707
Contributor

@ngoyal2707 ngoyal2707 commented Jun 12, 2018

This PR uses sample_weights per list instance for all the ranking objectives. This is important for use cases where, in a production system, we want to give some sample instances more weight than others. (Here, a sample instance is a query together with its entire document set.)

Testing Done:

  1. Used the Python wrapper to train on the weighted loss and evaluate on the weighted loss. This performs much better than training on the unweighted loss.
    usage:
    train_data_xgboost.set_weight(train_pairs_weight)

With the current implementation, the weights need to be normalized externally, as follows:

train_pairs_weight /= train_pairs_weight.sum()
train_pairs_weight *= train_pairs_weight.shape[0]

which is essentially computing:
w_i = w_i * N / (sum_j w_j)

If you think the normalization should be done inside the source code, let me know; I can easily update the PR accordingly.
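For concreteness, the external normalization described above can be sketched in plain Python (the weight values here are hypothetical; in practice train_pairs_weight would be a NumPy array attached to the DMatrix via set_weight):

```python
# Hypothetical per-group weights for 4 query groups.
train_pairs_weight = [1.0, 2.0, 3.0, 4.0]

n = len(train_pairs_weight)
total = sum(train_pairs_weight)

# w_i <- w_i * n / sum_j(w_j): rescale so the weights sum to n,
# i.e. the average weight is 1, matching an unweighted run.
train_pairs_weight = [w * n / total for w in train_pairs_weight]

# train_data_xgboost.set_weight(train_pairs_weight)  # then attach to the DMatrix
```

After this rescaling the relative importance of the groups is unchanged, but the total weight equals the number of groups.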

EDIT:
This PR should resolve both of the following issues:
#2460
#2561

@codecov-io

codecov-io commented Jun 12, 2018

Codecov Report

Merging #3379 into master will not change coverage.
The diff coverage is 0%.


@@            Coverage Diff            @@
##             master    #3379   +/-   ##
=========================================
  Coverage     44.99%   44.99%           
  Complexity      228      228           
=========================================
  Files           166      166           
  Lines         12787    12787           
  Branches        466      466           
=========================================
  Hits           5754     5754           
  Misses         6841     6841           
  Partials        192      192
Impacted Files              | Coverage Δ      | Complexity Δ
src/objective/rank_obj.cc   | 11.04% <0%> (ø) | 0 <0> (ø) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34e3edf...e00ddec. Read the comment docs.

@ngoyal2707
Contributor Author

Hey @hcho3, @tqchen: I wonder if you can take a look at this and share your opinions.

@hcho3 hcho3 self-requested a review June 16, 2018 03:15
Collaborator

@hcho3 hcho3 left a comment


@ngoyal2707 Thank you for submitting this PR. I understand that the ranking objective should take account of instance weights.

A few things:

  1. I think you should include weight normalization in the ranking objective code, since the user-defined instance weights are usually not expected to sum to N.
  2. Can you add some tests? You said

Used python wrapper to train on weighted loss and evaluate on weighted loss. Performs much better than just training on unweighted loss.

and I think you are right. Can you craft example data to demonstrate it? For instance, you could set zero weights for a few instances and check whether those instances still affect the loss.

Ideally, you should add two tests: 1) one showing that the weighted ranking objective is correct; and 2) another containing an edge case where the old implementation leads to nonsensical results.

Let me know if you need help writing a test. I'd be more than happy to assist.
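The zero-weight check suggested above can be sketched independently of the XGBoost codebase. The snippet below is a toy model, not the actual rank_obj.cc code: a pairwise logistic gradient where each query group's contribution is scaled by its instance weight, so a group with weight 0 must produce all-zero gradients.

```python
import math

def pairwise_gradients(scores, labels, group_weight):
    """Toy pairwise-logistic gradients for one query group,
    scaled by the group's instance weight."""
    grad = [0.0] * len(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:  # instance i should rank above j
                # derivative of log(1 + exp(-(s_i - s_j))) w.r.t. s_i
                g = -1.0 / (1.0 + math.exp(scores[i] - scores[j]))
                grad[i] += group_weight * g
                grad[j] -= group_weight * g
    return grad

# A zero-weight group contributes nothing; a weight-1 group pushes the
# relevant instance up (negative gradient) and the other down.
g0 = pairwise_gradients([0.2, 0.9], [1, 0], group_weight=0.0)
g1 = pairwise_gradients([0.2, 0.9], [1, 0], group_weight=1.0)
```

A real test along these lines would train two boosters, one with a group's weight set to 0 and one with that group removed, and check that they produce the same model.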

@ngoyal2707 ngoyal2707 force-pushed the ngoyal_add_sample_instance_based_weighing_to_rank_objective branch from e00ddec to 95c6292 Compare June 19, 2018 17:58
@ngoyal2707
Contributor Author

@hcho3 Thanks for looking into the PR. As per your comments, I have made the following changes:

  1. Moved weight normalization inside rank_obj.cc, so there is no need to normalize externally.
  2. Added tests showing that weights work with the "rank:pairwise" objective.

Questions:

  1. Is there an alternative to looping over the whole dataset serially to compute sum_weights inside GetGradients? It seems like this should only need to be done once. That said, I saw that other objectives follow the same pattern (https://github.com/dmlc/xgboost/blob/master/src/objective/regression_obj.cc#L219), so I guess it's not too slow?

Side Note:

  1. I think there is a small bug (or oddity) in how rank_obj.cc creates pairs: the way the loop is written, it can create the same pair twice, since it never checks whether a pair has already been added. I don't think it makes much difference, but do you think it is intentional, or should it be corrected?
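To illustrate the side note: if pairs are sampled with replacement and never checked against pairs already added, duplicates are possible. A minimal sketch (the sampling scheme here is illustrative, not the actual rank_obj.cc loop):

```python
import random

def sample_pairs(pos_idx, neg_idx, num_samples, rng):
    """Sample (positive, negative) index pairs with replacement,
    with no check for pairs that were already added."""
    pairs = []
    for _ in range(num_samples):
        pairs.append((rng.choice(pos_idx), rng.choice(neg_idx)))
    return pairs

# With 1 positive and 1 negative candidate, sampling 3 pairs must
# repeat the same pair (pigeonhole principle).
pairs = sample_pairs([0], [1], 3, random.Random(0))
```

Whether duplicate pairs matter in practice depends on the loss: a repeated pair simply counts its gradient contribution twice, which is equivalent to giving that pair double weight.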

@ngoyal2707 ngoyal2707 force-pushed the ngoyal_add_sample_instance_based_weighing_to_rank_objective branch from 95c6292 to b3ee94f Compare June 22, 2018 00:18
@ngoyal2707
Contributor Author

@hcho3 your weight-normalization changes look good to me. Can you please merge them into master now?

@hcho3 hcho3 force-pushed the ngoyal_add_sample_instance_based_weighing_to_rank_objective branch from 032912d to 93a4937 Compare June 22, 2018 21:09
@hcho3
Collaborator

hcho3 commented Jun 22, 2018

@ngoyal2707 I ended up removing my latest change. The normalization factor would need to be re-computed whenever the objective function object is reused on a different dataset, and I could not find a good way to detect the change of dataset. For now, I think we should just recompute the normalization factor each time, to be on the safe side.

@ngoyal2707
Contributor Author

@hcho3 I see, good point; sorry I didn't catch that.
Can you please merge my original request then? I can look into whether that computation can be moved to the constructor as a follow-up refactor.

@hcho3
Collaborator

hcho3 commented Jun 22, 2018

@ngoyal2707 Yes, I will merge it once all tests pass.

@ejalonas

One dumb question @ngoyal2707: can you clarify the structure of train_pairs_weight in your initial example? Are the weights per group of results, or per result within a group? For example, if I have a DMatrix with 5 groups of 10 results each for pairwise ranking, is the weight structure an array of 5 floats (one weight per group), an array of 50 entries (one weight per result), or an array of length 5 where each entry is an array of length 10 (again one weight per result)?

@ngoyal2707
Contributor Author

ngoyal2707 commented Jun 26, 2018

Hey @ejalonas, in my implementation train_pairs_weight is at the group level. So for your example of 5 groups of 10 results each, you would supply 5 weights.

If you want individual instance-based weights, you can use the rank:ndcg objective and encode the weights in the magnitude of the labels, although the ndcg scaling from label magnitude to weight is not linear.
Currently the rank:pairwise objective ignores the magnitude of the label.
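To make the shape concrete for the 5-groups-of-10 example: with group-level weighting, the weight array has one entry per group, matching the group sizes passed to set_group. A sketch (the XGBoost calls are shown commented out; the shape bookkeeping is plain Python, and the weight values are hypothetical):

```python
# 5 query groups of 10 results each -> 50 rows in the DMatrix.
group_sizes = [10, 10, 10, 10, 10]          # dtrain.set_group(group_sizes)
group_weights = [1.0, 2.0, 0.5, 1.0, 3.0]   # dtrain.set_weight(group_weights)

num_rows = sum(group_sizes)

# One weight per group, NOT one per row:
assert len(group_weights) == len(group_sizes)

# If you ever need the effective per-row view, expand by group size:
per_row = [w for w, n in zip(group_weights, group_sizes) for _ in range(n)]
```

So every result in the second group effectively carries weight 2.0, and so on; there is no per-row weight array to build yourself.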

@ejalonas

Thanks @ngoyal2707 ! I really appreciate your clarification and contribution to this repo.

CodingCat pushed a commit to CodingCat/xgboost that referenced this pull request Jul 26, 2018
* added code for instance based weighing for rank objectives

* Fix lint
@sunlei198911

sunlei198911 commented Aug 28, 2018

@ngoyal2707 which release will be the first to support the group weights?

@lock lock bot locked as resolved and limited conversation to collaborators Nov 26, 2018