
How do I perform deduplication with the Python Record Linkage Toolkit on large data sets? #207


Description

@sidhugithub1

I am deduplicating a single dataset of 1M records on an M5.4xlarge machine (16 cores, 64 GB RAM). I set up the following matching configuration, but it runs out of memory:

  1. SortedNeighbourhood indexing on AddressTypeDescription with window=3
  2. Block indexing on ['Designation', 'Department', 'City', 'Gender', 'Country', 'Region'] (roughly what the sketch below expresses in code)
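For reference, that configuration corresponds to something like the following sketch (assuming the recordlinkage `Index` API; `df` is a placeholder for the 1M-record DataFrame):

```python
import recordlinkage
from recordlinkage.index import Block, SortedNeighbourhood

# df is the 1M-record pandas DataFrame being deduplicated (placeholder name)
indexer = recordlinkage.Index()

# 1. Sorted-neighbourhood indexing on AddressTypeDescription with window=3
indexer.add(SortedNeighbourhood("AddressTypeDescription", window=3))

# 2. Exact blocking on the six attributes
indexer.add(Block(["Designation", "Department", "City",
                   "Gender", "Country", "Region"]))

# Passing a single DataFrame produces candidate pairs for deduplication;
# with two algorithms added, the pairs from both are combined.
candidate_pairs = indexer.index(df)
```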

Out-of-memory errors:

Unable to allocate 165. GiB for an array with shape (22179322464,) and data type int64
Unable to allocate 14.2 GiB for an array with shape (1906374956,) and data type int64
Unable to allocate 23.0 GiB for an array with shape (1, 3092850189) and data type object
Unable to allocate 23.0 GiB for an array with shape (3092850193, 1) and data type object

Basically, the process gets stuck/killed at the indexing step for a large dataset.

Could you please suggest how to overcome this scenario?
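One way to keep memory bounded might be to run the blocking step group by group instead of materialising all candidate pairs at once. A minimal sketch of that idea (an assumption, not something the toolkit prescribes; it uses a pandas groupby on the blocking keys, and `df` and the column names are placeholders from the configuration above):

```python
import recordlinkage

block_cols = ["Designation", "Department", "City",
              "Gender", "Country", "Region"]

pair_chunks = []
# Each group contains only records that agree on all blocking keys, so a
# full index *within* the group yields the same pairs as Block(block_cols)
# over the whole frame, without allocating one giant global pair array.
for _, group in df.groupby(block_cols):  # df: the 1M-record DataFrame
    if len(group) < 2:
        continue  # a single record cannot form a pair
    indexer = recordlinkage.Index()
    indexer.full()  # all pairs within this (small) block
    pair_chunks.append(indexer.index(group))

# Combine the per-block pair indexes (assumes at least one group had pairs)
candidate_pairs = pair_chunks[0].append(pair_chunks[1:])
```

The comparison step could then also be run per chunk, so the full comparison-vector frame never has to exist in memory at once.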

Regards
Sid
