[💡SUG] Remove Duplicated User and Item Interaction #487

Closed
rowedenny opened this issue Nov 7, 2020 · 4 comments · Fixed by #492

@rowedenny
Contributor

rowedenny commented Nov 7, 2020

Is your feature request related to a problem? Please describe.
In some cases, we would like to merge duplicated user-item interactions, keeping only the earliest one (by timestamp). The rationale is to evaluate how well a method recommends novel items that a user has not consumed before.

Describe the solution you'd like

  1. This should be an optional function, since removing duplicates is not appropriate for every recommender system;
  2. Looking through the implementation of Dataset, I think one solution is to add a new function within data_processing. Here is an example:
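def _remove_duplication(self):
    # Sort by timestamp so the earliest interaction of each pair comes first.
    self.inter_feat = self.inter_feat.sort_values(by=[self.time_field], ascending=True)
    # Keep only the first (earliest) record for each (user, item) pair.
    self.inter_feat = self.inter_feat.drop_duplicates(subset=[self.uid_field, self.iid_field], keep='first')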

Describe alternatives you've considered
None

Additional context
None

@rowedenny rowedenny added the enhancement New feature or request label Nov 7, 2020
@hyp1231
Member

hyp1231 commented Nov 7, 2020

Thanks for your suggestion!

If you use the 27 benchmarks we have collected, deduplication is already part of the process that generates the atomic files, controlled by the optional argument --duplicate_removal. Specifically, we merge repeated interaction records and record the latest timestamp and the number of repetitions. You can also download atomic files that were processed under either strategy (not_merged and merged).

If you want to use your own dataset, you can also generate atomic files in this way.

We will also add this feature in future versions.
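
For illustration, the merge described above (collapsing repeated user-item pairs while recording the latest timestamp and the number of repetitions) might be sketched in pandas as follows. This is only a sketch under assumptions: the function name `merge_duplicates` and the column names `user_id`, `item_id`, `timestamp`, and `num_repeat` are hypothetical and are not taken from the conversion scripts.

```python
import pandas as pd


def merge_duplicates(inter, uid="user_id", iid="item_id", time="timestamp"):
    """Collapse repeated (user, item) rows into a single row each.

    Hypothetical helper: the column names are assumptions for this sketch.
    The result keeps the latest timestamp of each pair and adds a
    `num_repeat` column with how many times the pair occurred.
    """
    return inter.groupby([uid, iid], as_index=False).agg(
        **{
            time: (time, "max"),           # latest timestamp of the pair
            "num_repeat": (time, "size"),  # number of occurrences of the pair
        }
    )
```

Each (user, item) pair then appears exactly once, so the result can be written out as a merged interaction file.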

@hyp1231 hyp1231 pinned this issue Nov 7, 2020
@rowedenny
Contributor Author

First of all, thanks for the quick response. That is exactly what I am looking for.

May I suggest making the kept record configurable, as in pandas.DataFrame.drop_duplicates, so that either the first or the last one is kept?
Referring to:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

@hyp1231
Member

hyp1231 commented Nov 7, 2020

Adding this feature to _data_processing is no problem; it is something we would like to see as well. 😃

@hyp1231
Member

hyp1231 commented Nov 11, 2020

We have made an improvement (see #492) that removes duplicated interactions if the argument rm_dup_inter is not None.

Note that if TIME_FIELD is set, the interaction records in inter_feat are first sorted by the values of inter_feat[TIME_FIELD]; otherwise their order is left unchanged. After that, if rm_dup_inter == first, we keep the first user-item interaction among the duplicates; if rm_dup_inter == last, we keep the last one.
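
The behavior described above can be approximated with a short standalone pandas function. This is a sketch of the described logic, not the code merged in #492: the function name `remove_duplicate_interactions` and its signature are hypothetical, and `keep` stands in for the `rm_dup_inter` setting.

```python
import pandas as pd


def remove_duplicate_interactions(inter_feat, uid_field, iid_field,
                                  time_field=None, keep=None):
    """Drop repeated (user, item) rows, keeping the first or the last one.

    Hypothetical helper mirroring the description above:
    - keep=None leaves the data untouched;
    - if a time field is given, rows are sorted by it before deduplication;
    - keep="first" / keep="last" selects which duplicate survives.
    """
    if keep is None:
        return inter_feat
    if keep not in ("first", "last"):
        raise ValueError("keep must be None, 'first', or 'last'")
    if time_field is not None and time_field in inter_feat.columns:
        # A stable sort keeps the original order among equal timestamps.
        inter_feat = inter_feat.sort_values(
            by=time_field, ascending=True, kind="mergesort"
        )
    return inter_feat.drop_duplicates(subset=[uid_field, iid_field], keep=keep)
```

With a time field present, `keep="first"` retains the earliest interaction of each pair and `keep="last"` the most recent one; without a time field, "first" and "last" refer to the order of the rows as loaded.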

Thanks for the insightful suggestion. I hope this can help you.
