Clarify and Identify Duplicates By ID #4

theryankelly · 2022-02-25T17:11:05Z

Prior reviews of the data coming out of the aggregator found that a listing that appears in multiple weeks' scrapes also appears multiple times in the data. This is to be expected. Because no unique posting ID is captured, we first need to clarify what our current process is for removing these records.

One key place where this deduplication happens is in the cleaner script, which runs the following after adjusting titles:

  # Remove duplicate titles
  listing <- listing[!(listing$title == 'None'), ]
  listing$uniqueid <- paste(listing$ask, listing$bedrooms, listing$title, listing$latitude, listing$longitude)
  listing <- listing[!duplicated(listing$uniqueid), ]
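
To see what this key does in practice, here is a minimal sketch that runs the same steps on a toy data frame. All values are invented for illustration, and `created` is an assumed name for the capture-date column; the real column names may differ.

```r
# Toy data: every value here is made up; `created` stands in for the capture date.
listing <- data.frame(
  title     = c("Sunny 2BR", "Sunny 2BR", "Sunny 2BR", "None"),
  ask       = c(1500, 1500, 1550, 900),
  bedrooms  = c(2, 2, 2, 1),
  latitude  = c(49.28, 49.28, 49.28, 49.30),
  longitude = c(-123.12, -123.12, -123.12, -123.10),
  created   = as.Date(c("2022-02-04", "2022-02-11", "2022-02-18", "2022-02-18")),
  stringsAsFactors = FALSE
)

# Same steps as the cleaner script
listing <- listing[!(listing$title == "None"), ]
listing$uniqueid <- paste(listing$ask, listing$bedrooms, listing$title,
                          listing$latitude, listing$longitude)
deduped <- listing[!duplicated(listing$uniqueid), ]

nrow(deduped)     # number of rows kept
deduped$created   # which capture survives for each key
```

In this toy case the second capture of the unchanged listing is dropped and the earliest capture is kept, while the row with a different `ask` survives because the price is part of the key. Whether a re-priced listing should count as a duplicate is one of the things this review would need to settle.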

We need to confirm that records that match on every field except the created date (the date we capture, not the posting_date) are being accurately removed before any further cleaning or analysis is completed.
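
One way this confirmation could be sketched is to flag every row that matches another row on all columns except the capture date, then assert that none remain after deduplication. The helper name `flag_repeat_captures`, the column name `created`, and the sample data are all assumptions for illustration, not part of the existing scripts.

```r
# Flag every row that matches another row on all columns except the capture date.
# `date_col = "created"` is an assumed column name.
flag_repeat_captures <- function(df, date_col = "created") {
  compare_cols <- setdiff(names(df), date_col)
  key <- df[, compare_cols, drop = FALSE]
  # Mark both the first occurrence and its later repeats
  duplicated(key) | duplicated(key, fromLast = TRUE)
}

# Invented sample: the first two rows differ only in capture date
raw <- data.frame(
  title   = c("Sunny 2BR", "Sunny 2BR", "Cozy studio"),
  ask     = c(1500, 1500, 950),
  created = as.Date(c("2022-02-04", "2022-02-11", "2022-02-11")),
  stringsAsFactors = FALSE
)

repeats <- flag_repeat_captures(raw)
raw[repeats, ]   # the repeat captures to inspect by hand

# Keep the first capture of each record, then check that no repeats remain
cleaned <- raw[!duplicated(raw[, setdiff(names(raw), "created"), drop = FALSE]), ]
stopifnot(!any(flag_repeat_captures(cleaned)))
```

Running the same check against the cleaner script's output, rather than this toy data, would show whether the five-field `uniqueid` key misses any records that differ only in capture date.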

After this review process, we will move to implement appropriate steps to remove duplicate captures of the same postings.

theryankelly added the data_cleaning label Feb 25, 2022
