SUSTech_DBGroup (Southern University of Science and Technology)
Team members: Weibao FU, Peiqi YIN, Lan LU
Supervisor: Prof. Bo TANG, Prof. Xiao YAN
Task:
Given records extracted from various websites, we aim to solve the entity resolution problem, namely to decide whether two records point to the same real-world object.
Input:
Three datasets (records are only compared with other records in the same dataset)
Output:
All pairs of records pointing to the same real-world object
Measurement:
F-score
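For reference, the F-score is the harmonic mean of precision and recall. A minimal computation (the function name is ours, for illustration):

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. precision 0.995 and recall 0.971 give an F-score of about 0.983
print(round(f_score(0.995, 0.971), 3))
```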
- Install Python 3 if you don't have it already.
- Download and unzip this project.
- Execute `run.py` with `python run.py`, and make sure that the csv files `X2.csv`, `X3.csv` and `X4.csv` for the original datasets are inside the project folder.
- All matched pairs will be placed in `output.csv`.
The architecture of our work can be divided into two parts: data cleaning and entity matching.
In this part, we reorganize and clean the given csv files row by row, returning for each row only the cleaned key field values. Code for `X2.csv`, `X3.csv` and `X4.csv` is provided in `clean_x2.py`, `clean_x3.py`, and `clean_x4.py` respectively. The detailed steps are described as follows.
Preprocessing: Columns in the same row are merged together and converted to lowercase.
Attribute Extraction: Key information such as the brand is extracted from each row with corresponding regular-expression rules. The rules are designed based on the given datasets and on descriptions from e-commerce websites or the brand's official website. Note that `'0'` represents a missing value in a key field.
Attribute Correction: Different expressions of a key field with the same meaning are normalized to a single expression. Moreover, we fill missing values where possible from the other field values of the same row.
Return: Key field values are organized into a dataframe and returned.
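A minimal sketch of the cleaning steps above. The column names, brand list, and regex pattern here are illustrative assumptions; the real rules in `clean_x2.py` etc. are dataset-specific and far more elaborate:

```python
import re
import pandas as pd

# Illustrative brand vocabulary; the actual rules are built from the
# datasets and brand websites, as described above.
KNOWN_BRANDS = ["lenovo", "dell", "acer", "asus", "hp"]
BRAND_RE = re.compile("|".join(KNOWN_BRANDS))

def clean(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, row in df.iterrows():
        # Preprocessing: merge all columns of the row and lowercase.
        text = " ".join(str(v) for v in row if pd.notna(v)).lower()
        # Attribute extraction: '0' marks a missing value in a key field.
        m = BRAND_RE.search(text)
        brand = m.group(0) if m else "0"
        rows.append({"instance_id": row.get("instance_id", "0"),
                     "brand": brand})
    # Return: key field values organized into a dataframe.
    return pd.DataFrame(rows)
```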
In this part, we give each record an identification based on its key field values and perform entity resolution on records according to their identification values. Code for `X2.csv`, `X3.csv` and `X4.csv` is provided in `handler_x2.py`, `handler_x3.py`, and `handler_x4.py` respectively. Note that `clean_x2.py` is called by `handler_x2.py`, and so forth. The detailed steps are given as follows.
Complete Matching: The original csv file is turned into a cleaned dataframe and significant fields are picked out. If a record has no missing value among these fields, they serve as its unique identification and we add the record to the list `solved_spec`. Otherwise, we do not assign an identification and add the record to the list `unsolved_spec`. Note that we can partition the records into several groups and use different fields as the identification in different groups.
Recycle Mechanism: For records in `unsolved_spec`, we try to match them to items in `solved_spec`. Since at least one important field value is missing for records in `unsolved_spec`, we use their secondary key values, with several combinations of secondary keys chosen from observation of the real data. If an item in `unsolved_spec` shares the same nonzero secondary key values with an item in `solved_spec` under one combination, it inherits that item's identification and is moved to `solved_spec`.
Residual Matching: Records still in `unsolved_spec` receive a general identification value that is not as tight or accurate as the previous ones. This gives a rough classification for items left without an identification after steps 1 and 2.
Return: We regard items with the same identification value as the same real-world entity and match them. The matched pairs are saved as our output in `output.csv`.
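The matching steps above can be sketched as follows. The key fields (`brand`, `model`), the secondary-key combination (`cpu` + `ram`), and the way identifications are concatenated are illustrative assumptions; the actual rules live in `handler_x2.py` and its siblings:

```python
from itertools import combinations
import pandas as pd

def match(df: pd.DataFrame) -> list:
    """Group records by identification and emit all matched pairs."""
    solved_spec = {}    # instance_id -> identification
    unsolved_spec = []  # rows still lacking an identification

    # 1. Complete matching: records with no missing key field ('0') get a
    #    unique identification built from those fields.
    for _, row in df.iterrows():
        if row["brand"] != "0" and row["model"] != "0":
            solved_spec[row["instance_id"]] = row["brand"] + "|" + row["model"]
        else:
            unsolved_spec.append(row)

    # 2. Recycle mechanism: match leftover records to solved ones through a
    #    combination of nonzero secondary key values (here: cpu + ram).
    secondary = {}
    for _, row in df.iterrows():
        if (row["instance_id"] in solved_spec
                and row["cpu"] != "0" and row["ram"] != "0"):
            secondary[(row["cpu"], row["ram"])] = solved_spec[row["instance_id"]]
    still_unsolved = []
    for row in unsolved_spec:
        key = (row["cpu"], row["ram"])
        if "0" not in key and key in secondary:
            solved_spec[row["instance_id"]] = secondary[key]
        else:
            still_unsolved.append(row)

    # 3. Residual matching: a looser identification for the remainder.
    for row in still_unsolved:
        solved_spec[row["instance_id"]] = "residual|" + row["brand"]

    # 4. Return: records sharing an identification are matched pairwise.
    groups = {}
    for inst, ident in solved_spec.items():
        groups.setdefault(ident, []).append(inst)
    return [pair for g in groups.values()
            for pair in combinations(sorted(g), 2)]
```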
Dataset | Recall | Precision | F-score |
---|---|---|---|
X2.csv | 0.971 | 0.995 | 0.983 |
X3.csv | 0.991 | 0.980 | 0.986 |
X4.csv | 0.980 | 0.880 | 0.927 |