Get an error message while running match_string #63

Open
iibarant opened this issue Aug 17, 2021 · 7 comments

@iibarant

Hi there,

I would like to run match_strings on the addresses in a DataFrame with 88,510 rows and 3 columns.
All I get is:
OverflowError: value too large to convert to int

Is there a quick fix?

Thank you very much!

@ParticularMiner
Contributor

Hi @iibarant

Please provide more information. 88,510 records is small enough for my computer to handle (it can process more than 500,000 records).

@iibarant
Author

Here's the code:

from string_grouper import match_strings
matches = match_strings(check2['full address'])

I run the code on a MacBook Pro (2.4 GHz 8-core Intel Core i9, 32 GB 2667 MHz DDR4).
The DataFrame contains 3 columns (name, phone, full address), and I need to run the match on the address column only.

Thank you!

@ParticularMiner
Contributor

Thanks @iibarant

Curious! This is an unexpected error. Can you please provide the traceback of the error (just copy and paste whatever Python prints) so that I can determine exactly where in the code the problem originates?

@iibarant
Author

There you go ...

matches = match_strings(check2['full address'])
Traceback (most recent call last):
  File "", line 1, in
    matches = match_strings(check2['full address'])
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 131, in match_strings
    string_grouper = StringGrouper(master,
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 264, in fit
    matches, self._true_max_n_matches = self._build_matches(master_matrix, duplicate_matrix)
  File "/opt/anaconda3/lib/python3.8/site-packages/string_grouper/string_grouper.py", line 467, in _build_matches
    return awesome_cossim_topn(
  File "/opt/anaconda3/lib/python3.8/site-packages/sparse_dot_topn/awesome_cossim_topn.py", line 119, in awesome_cossim_topn
    alt_indices, alt_data = ct_thread.sparse_dot_topn_extd_threaded(
  File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 133, in sparse_dot_topn.sparse_dot_topn_threaded.__pyx_fuse_0sparse_dot_topn_extd_threaded
  File "sparse_dot_topn/sparse_dot_topn_threaded.pyx", line 168, in sparse_dot_topn.sparse_dot_topn_threaded.sparse_dot_topn_extd_threaded
OverflowError: value too large to convert to int

@ParticularMiner
Contributor

@iibarant

It looks like the error stems from ‘sparse_dot_topn’, a package dependency.
I’m not sure whether it is platform-dependent (as far as I know, the package has only been tested on Linux and Microsoft Windows) or caused by something else.

Could you try the following command just to see what happens:

matches = match_strings(check2['full address'], max_n_matches=20)

(This limits the output a bit.)
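
One possible reason the limit helps, stated here as an assumption that nothing in this thread confirms: if max_n_matches defaults to the number of input strings, the worst-case number of matches is rows × rows, and for 88510 rows that product no longer fits in a signed 32-bit integer. The sketch below is plain arithmetic only. It does not call string_grouper or sparse_dot_topn, and the helper names are made up for illustration.

```python
# Assumption (not confirmed by the thread): somewhere in the dependency the
# worst-case match count n_rows * max_n_matches is held in a signed 32-bit int.
INT32_MAX = 2**31 - 1  # 2147483647


def worst_case_matches(n_rows, max_n_matches=None):
    """Upper bound on the match count; None means 'up to n_rows per string'."""
    per_row = n_rows if max_n_matches is None else min(max_n_matches, n_rows)
    return n_rows * per_row


def fits_int32(value):
    """True if value can be stored in a signed 32-bit integer."""
    return value <= INT32_MAX


n_rows = 88510
unlimited = worst_case_matches(n_rows)
limited = worst_case_matches(n_rows, max_n_matches=20)

print(unlimited, fits_int32(unlimited))  # 7834020100 False
print(limited, fits_int32(limited))      # 1770200 True
```

Under that assumption, any max_n_matches up to about 24,000 would keep the product below the 32-bit limit for this dataset.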

@iibarant
Author

Yes, that works. Thank you. Should I check whether the code also works with larger values of max_n_matches? I'm planning to keep only matches with a similarity score > 0.9. Would it be possible to apply that threshold in the same call?

@ParticularMiner
Contributor

@iibarant

OK, good. Yes, I suggest you try successively larger values of max_n_matches until the output size no longer changes. And yes, of course, you can use min_similarity in the same call.
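
The "try successively larger values until the output size no longer changes" advice could be automated roughly as follows. This is a minimal sketch, not part of string_grouper: grow_until_stable and fake_match are hypothetical names, and the matcher is passed in as a callable so the loop can be tried without the library installed. The only assumptions about the real function are the two keyword arguments named in this thread, max_n_matches and min_similarity.

```python
def grow_until_stable(match_fn, strings, start=20, factor=2, **kwargs):
    """Raise max_n_matches until the number of returned matches stops changing.

    match_fn: callable taking (strings, max_n_matches=..., **kwargs) and
              returning an object that supports len().
    Returns (result, max_n_matches_used, stable). stable is False when a
    larger value raised OverflowError before the size could be confirmed.
    """
    n = len(strings)
    max_n = min(start, n)
    result = match_fn(strings, max_n_matches=max_n, **kwargs)
    while max_n < n:
        next_n = min(max_n * factor, n)
        try:
            candidate = match_fn(strings, max_n_matches=next_n, **kwargs)
        except OverflowError:
            return result, max_n, False  # keep the last result that worked
        if len(candidate) == len(result):
            return result, max_n, True   # output size no longer changes
        result, max_n = candidate, next_n
    return result, max_n, True           # nothing larger left to try


def fake_match(strings, max_n_matches, min_similarity=0.8):
    """Stand-in matcher for the demo: identical strings score 1.0, others 0.0."""
    pairs = []
    for left in strings:
        hits = [(left, right) for right in strings
                if (1.0 if left == right else 0.0) >= min_similarity]
        pairs.extend(hits[:max_n_matches])  # cap matches per string
    return pairs


addresses = ["a"] * 5 + ["b"] * 3 + ["c"]
matches, used, stable = grow_until_stable(
    fake_match, addresses, start=2, min_similarity=0.9
)
print(len(matches), used, stable)  # 35 8 True
```

With the real library the call would presumably look like grow_until_stable(match_strings, check2['full address'], min_similarity=0.9), since both a pandas Series and the returned DataFrame support len(). Each iteration repeats the full match, so a larger start or factor trades fewer runs for a coarser answer.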
