[💡SUG] Remove Duplicated User and Item Interaction #487

rowedenny · 2020-11-07T03:16:11Z

Is your feature request related to a problem? Please describe.
In some cases, we would like to merge duplicated user-item interactions, keeping only the earliest one (by timestamp). The rationale is to test how well a method recommends novel items that a user has not consumed before.

Describe the solution you'd like

This should be an optional function, since it may not be appropriate for every recommendation setting.
Looking through the implementation of Dataset, I think adding a new function within data_processing is one possible solution. Here is an example:

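def _remove_duplication(self):
    # Sort by timestamp so the earliest interaction of each (user, item) pair comes first.
    self.inter_feat = self.inter_feat.sort_values(by=[self.time_field], ascending=True)
    # Keep only the first (earliest) row for each (user, item) pair.
    self.inter_feat = self.inter_feat.drop_duplicates(subset=[self.uid_field, self.iid_field], keep='first')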

Describe alternatives you've considered
None

Additional context
None

hyp1231 · 2020-11-07T03:33:53Z

Thanks for your suggestion!

If you use the 27 benchmarks we have collected, deduplication is part of the process of generating atomic files, via the optional argument --duplicate_removal. Specifically, we merge repeated interaction records and record the latest timestamp and the number of repetitions. You can also download the processed atomic files under either processing strategy (not_merged and merged).
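
For readers who want to see what this kind of merge looks like, here is a minimal pandas sketch. It is not the conversion script itself: the column names (`user_id`, `item_id`, `timestamp`) and the `num_repeat` output column are hypothetical and chosen only for illustration.

```python
import pandas as pd


def merge_duplicates(inter, uid="user_id", iid="item_id", time="timestamp"):
    """Collapse repeated (user, item) rows into a single row each.

    Keeps the latest timestamp and adds a hypothetical `num_repeat`
    column holding the number of rows that were merged.
    """
    merged = (
        inter.groupby([uid, iid], sort=False)
        .agg(**{time: (time, "max"), "num_repeat": (time, "count")})
        .reset_index()
    )
    return merged


# Toy data: user 1 interacted with item 10 twice.
inter = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "item_id": [10, 10, 10],
        "timestamp": [5, 9, 3],
    }
)
merged = merge_duplicates(inter)
```

Note that `count` skips missing timestamps, so this sketch assumes every row has one.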

If you want to use your own dataset, you can also generate atomic files in this way.

We will also add this feature in future versions.

rowedenny · 2020-11-07T05:35:13Z

First of all, thanks for the quick response. That is exactly what I am looking for.

May I suggest making this optional, like pandas.DataFrame.drop_duplicates, so that either the first or the last record is kept?
Referring to:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

hyp1231 · 2020-11-07T14:38:00Z

No problem with adding the new feature to _data_processing; that's also what we'd like to see. 😃

hyp1231 · 2020-11-11T06:03:10Z

We have made an improvement (see #492) that removes duplicated interactions if the argument rm_dup_inter is not None.

Note that if TIME_FIELD is set, the interaction records in inter_feat are first sorted by the values of inter_feat[TIME_FIELD]; otherwise their order is left unchanged. After that, if rm_dup_inter == first, we keep the first user-item interaction among the duplicates; if rm_dup_inter == last, we keep the last one.
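
The behavior described above can be sketched as a standalone function. This is an illustration only, not the code merged in #492: the function name and its parameters are hypothetical, and the `keep` argument stands in for the value of rm_dup_inter.

```python
import pandas as pd


def remove_duplication(inter_feat, uid_field, iid_field, time_field=None, keep="first"):
    """Drop duplicated (user, item) rows, keeping the first or last one.

    If `time_field` is given and present, rows are sorted by it first,
    so "first" means earliest and "last" means latest. Otherwise the
    existing row order decides which duplicate survives.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    if time_field is not None and time_field in inter_feat.columns:
        # mergesort is stable, so rows with equal timestamps keep their order.
        inter_feat = inter_feat.sort_values(by=time_field, ascending=True, kind="mergesort")
    return inter_feat.drop_duplicates(subset=[uid_field, iid_field], keep=keep)


# Toy data: user 1 interacted with item 10 at t=9 and again at t=5.
df = pd.DataFrame({"u": [1, 1, 2], "i": [10, 10, 20], "t": [9, 5, 3]})
earliest = remove_duplication(df, "u", "i", time_field="t", keep="first")
latest = remove_duplication(df, "u", "i", time_field="t", keep="last")
unsorted_first = remove_duplication(df, "u", "i", keep="first")
```

Without a time field, "first" and "last" refer to row order in the file, which may not match chronological order.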

Thanks for the insightful suggestion. I hope this can help you.

rowedenny added the enhancement New feature or request label Nov 7, 2020

hyp1231 pinned this issue Nov 7, 2020

linzihan-backforward unpinned this issue Nov 10, 2020

hyp1231 assigned chenyushuo Nov 10, 2020

chenyushuo closed this as completed Dec 6, 2020

hyp1231 linked a pull request Dec 7, 2020 that will close this issue

FEA: Add _remove_duplication to Dataset #492

Merged

hyp1231 mentioned this issue Dec 18, 2020

Merged and not Merged RUCAIBox/RecSysDatasets#82

Closed

Sherry-XLL added the dataset label Feb 7, 2023
