
How do I perform deduplication with the Python Record Linkage Toolkit on large data sets? #207


Description

@sidhugithub1

I am deduplicating a single dataset of 1M records on an M5.4xlarge machine (16 cores, 64 GB RAM). I set up the following matching configuration, but it runs out of memory:

  1. SortedNeighbourhood indexing on AddressTypeDescription with window=3
  2. Block indexing on ['Designation', 'Department', 'City', 'Gender', 'Country', 'Region'] (roughly what the sketch below expresses in code)
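For reference, that configuration corresponds to something like the following sketch (assuming the recordlinkage `Index` API; `df` is a placeholder for the 1M-record DataFrame):

```python
import recordlinkage
from recordlinkage.index import Block, SortedNeighbourhood

# df is the 1M-record pandas DataFrame being deduplicated (placeholder name)
indexer = recordlinkage.Index()

# 1. Sorted-neighbourhood indexing on AddressTypeDescription with window=3
indexer.add(SortedNeighbourhood("AddressTypeDescription", window=3))

# 2. Exact blocking on the six attributes
indexer.add(Block(["Designation", "Department", "City",
                   "Gender", "Country", "Region"]))

# Passing a single DataFrame produces candidate pairs for deduplication;
# with two algorithms added, the pairs from both are combined.
candidate_pairs = indexer.index(df)
```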

Out-of-memory errors:

Unable to allocate 165. GiB for an array with shape (22179322464,) and data type int64
Unable to allocate 14.2 GiB for an array with shape (1906374956,) and data type int64
Unable to allocate 23.0 GiB for an array with shape (1, 3092850189) and data type object
Unable to allocate 23.0 GiB for an array with shape (3092850193, 1) and data type object

Basically, the process gets stuck/killed at the indexing step for a large dataset.

Could you please suggest how to overcome this scenario?
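One way to keep memory bounded might be to run the blocking step group by group instead of materialising all candidate pairs at once. A minimal sketch of that idea (an assumption, not something the toolkit prescribes; it uses a pandas groupby on the blocking keys, and `df` and the column names are placeholders from the configuration above):

```python
import recordlinkage

block_cols = ["Designation", "Department", "City",
              "Gender", "Country", "Region"]

pair_chunks = []
# Each group contains only records that agree on all blocking keys, so a
# full index *within* the group yields the same pairs as Block(block_cols)
# over the whole frame, without allocating one giant global pair array.
for _, group in df.groupby(block_cols):  # df: the 1M-record DataFrame
    if len(group) < 2:
        continue  # a single record cannot form a pair
    indexer = recordlinkage.Index()
    indexer.full()  # all pairs within this (small) block
    pair_chunks.append(indexer.index(group))

# Combine the per-block pair indexes (assumes at least one group had pairs)
candidate_pairs = pair_chunks[0].append(pair_chunks[1:])
```

The comparison step could then also be run per chunk, so the full comparison-vector frame never has to exist in memory at once.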

Regards
Sid
