Get an error message while running match_string #63

iibarant · 2021-08-17T18:29:20Z

Hi there,

I would like to run match_strings on the addresses in a DataFrame with 88510 rows and 3 columns.
All I get is:
OverflowError: value too large to convert to int

Is there a quick fix?

Thank you very much!

ParticularMiner · 2021-08-17T18:35:15Z

Hi @iibarant

Please provide more information. 88510 rows is small enough for my computer to handle (it can process >500,000 records).

iibarant · 2021-08-17T18:45:31Z

Here's the code:

from string_grouper import match_strings
matches = match_strings(check2['full address'])

I run the code on a MacBook Pro (2.4 GHz 8-core Intel Core i9, 32 GB 2667 MHz DDR4).
The dataframe contains 3 columns (name, phone, full address) and I need to run the match on the address only.

Thank you!

ParticularMiner · 2021-08-17T18:53:35Z

Thanks @iibarant

Curious! This is an unexpected error. Can you please provide the traceback log of the error (just copy and paste whatever Python spits out) so that I can determine exactly where in the code the problem is stemming from?

iibarant · 2021-08-17T19:22:50Z

There you go ...

matches = match_strings(check2['full address'])
Traceback (most recent call last):
  File "", line 1, in
    matches = match_strings(check2['full address'])
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 131, in match_strings
    string_grouper = StringGrouper(master,
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 264, in fit
    matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 467, in _build_matches
    return awesome_cossim_topn(
  File "/opt/anaconda3/lib/python3.8/site-packages/sparse_dot_topn/awesome_cossim_topn.py", line 119, in awesome_cossim_topn
    alt_indices, alt_data = ct_thread.sparse_dot_topn_extd_threaded(
  File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 133, in sparse_dot_topn.sparse_dot_topn_threaded.__pyx_fuse_0sparse_dot_topn_extd_threaded
  File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 168, in sparse_dot_topn.sparse_dot_topn_threaded.sparse_dot_topn_extd_threaded
OverflowError: value too large to convert to int

ParticularMiner · 2021-08-17T19:45:35Z

@iibarant

Looks like the error stems from ‘sparse_dot_topn’, a package dependency.
I’m not sure whether it’s platform-dependent (as far as I know, the package has only been tested on Linux and Microsoft Windows) or caused by something else.

Could you try the following command just to see what happens:

matches = match_strings(check2['full address'], max_n_matches=20)

(This limits the output a bit.)
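
A possible reason the cap helps, offered as a guess rather than a confirmed diagnosis: if max_n_matches is left unset and the library sizes its result buffer as roughly (number of rows) x (max_n_matches), with max_n_matches defaulting to the number of rows, then 88510 rows would ask for more entries than a signed 32-bit C int can count. The helper name and the sizing rule below are assumptions made purely for illustration; only the row count and the value 20 come from this thread.

```python
# Back-of-the-envelope check. ASSUMPTION: the buffer size is about
# n_rows * max_n_matches and is stored in a signed 32-bit C int.
# Neither point is confirmed anywhere in this thread.
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold


def fits_in_c_int(n_rows: int, max_n_matches: int) -> bool:
    """Return True if n_rows * max_n_matches fits in a signed 32-bit int."""
    return n_rows * max_n_matches <= INT32_MAX


n_rows = 88510
print(fits_in_c_int(n_rows, n_rows))  # uncapped: 88510 * 88510 = 7_834_020_100
print(fits_in_c_int(n_rows, 20))      # capped:   88510 * 20    = 1_770_200
```

Under those assumptions the first call prints False and the second prints True, which would be consistent with the capped call succeeding.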

iibarant · 2021-08-17T19:58:17Z

Yes, that works. Thank you. Should I check whether the code also works with a greater max_n_matches? I'm planning to keep only matches with a similarity score > 0.9. Would it be possible to apply that in the same call?

ParticularMiner · 2021-08-17T20:11:59Z

@iibarant

Ok good. Yes, I suggest you try successively larger values of max_n_matches until the output size no longer changes. Yes of course, you can also use min_similarity at the same time.
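
The "try successively larger values" advice can be sketched as a small loop. This is only an illustration: grow_until_stable, its start/factor/ceiling parameters, and the fake_match stand-in are hypothetical names introduced here, not part of string_grouper. The helper takes any function that maps a max_n_matches value to a result with a length, so it can be tried out without the library installed.

```python
from typing import Callable, Sized, Tuple


def grow_until_stable(
    run_match: Callable[[int], Sized],
    start: int = 20,
    factor: int = 2,
    ceiling: int = 10_000,
) -> Tuple[int, Sized]:
    """Raise max_n_matches until the number of returned matches stops changing.

    Returns the last value tried before the size stabilised (or the largest
    value not exceeding `ceiling`), together with the corresponding result.
    """
    n = start
    result = run_match(n)
    while n * factor <= ceiling:
        bigger = run_match(n * factor)
        if len(bigger) == len(result):
            return n, result  # a larger cap added nothing new
        n, result = n * factor, bigger
    return n, result


# Stand-in for match_strings: 3 rows, each with 50 true matches.
def fake_match(k: int) -> list:
    return list(range(3 * min(k, 50)))


n, matches = grow_until_stable(fake_match)
print(n, len(matches))  # 80 150

# With the call from this thread (needs string_grouper and the poster's check2):
# from string_grouper import match_strings
# n, matches = grow_until_stable(
#     lambda k: match_strings(check2['full address'],
#                             max_n_matches=k, min_similarity=0.9)
# )
```

Since len() of a DataFrame is its row count, the same helper should work on the output of match_strings. The ceiling is there so the loop stops before the cap grows back toward values that triggered the original error.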
