INTERNAL Error: Invalid unicode detected in segment statistics update! #664
Related to #553
@Th368MoJ I wonder if this could be to do with the pandas null issue. I think it's an outside chance, but it may be worth Mariana trying first because it's simple. @MarianaBazely Could you try ensuring all null values are pure Python None? Sample code to clear up zero-length strings and any instances of numpy/pandas nulls:
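(The snippet itself wasn't captured in this thread; below is a minimal sketch of the kind of cleanup being suggested, assuming a pandas DataFrame named `df`.)

```python
import numpy as np
import pandas as pd

def clean_nulls(df: pd.DataFrame) -> pd.DataFrame:
    # Turn zero-length strings into NaN first...
    df = df.replace("", np.nan)
    # ...then convert every NaN/NaT/pd.NA into a pure-Python None
    return df.where(pd.notnull(df), None)
```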
When I changed my blocking rules I was able to generate the image...
Did you manage to identify which blocking rule was causing the problem? It would be useful to know, as it may help us track down this issue. Did clearing up the nulls make any difference?
All nulls have been replaced with np.nan. I can remove all rows with nulls instead. These were the blocking rules that "caused" the unicode error (I'm not sure they are guilty):
Anything else is fine, for example this huge one below:
Weird! Please could you try the broken blocking rules with the code I posted above that replaces np.nan and pd.NA with pure Python None?
I am experiencing the same error.
I am working on Windows 10 with an input dataset of 2 million rows and 23 columns. I tried changing blanks/NA/NaN to None with @RobinL's suggested code, but it did not help.
I also tried discarding all non-ASCII characters, which also did not help:
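(Again the snippet wasn't captured; here is a sketch of one way to do this in pandas, assuming string columns are object-typed.)

```python
import pandas as pd

def strip_non_ascii(df: pd.DataFrame) -> pd.DataFrame:
    # Encode to ASCII, dropping anything that doesn't fit, then decode back
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.encode("ascii", errors="ignore").str.decode("ascii")
    return df
```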
These are the blocking rules:
With these blocking rules, the error goes away:
I also tried playing around with the size of the input dataset (under the apparently incorrect assumption that bad character encoding was causing the error). If I narrow the input to the first 14,000 rows, the error still occurs. If I narrow it down to the first 13,000 rows, the error does not occur.
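(For anyone reproducing this, the bisection described above is just a matter of slicing the input; a hypothetical sketch:)

```python
# Row-count bisection as described above; df is the full input DataFrame
df_ok   = df.head(13_000)  # runs without the error in the reporter's case
df_fail = df.head(14_000)  # triggers the INTERNAL Error
```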
Hi, I converted my JSON data to a Parquet file (in Java), then used "CREATE TABLE ... AS SELECT * FROM read_parquet()" SQL to import the data into DuckDB, and hit the same exception. My data has 3 fields; one of them is a number type and contains null values. Looking forward to your reply.
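(For reference, a minimal Python equivalent of that import step; the file and table names here are hypothetical.)

```python
import duckdb

con = duckdb.connect("example.duckdb")  # hypothetical database file
# Same pattern as described: load a Parquet file into a DuckDB table
con.execute(
    "CREATE TABLE my_table AS SELECT * FROM read_parquet('data.parquet')"
)
```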
Thanks for the info. Are you on Windows? Is the source of the data Microsoft SQL Server? One workaround is to use the Spark linker, which doesn't suffer from this problem.
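(A rough sketch of the Spark workaround, assuming the Splink 3.0.x API; the module path, input file, and settings here are assumptions and may differ by version.)

```python
from pyspark.sql import SparkSession
from splink.spark.spark_linker import SparkLinker  # module path assumed for Splink 3.0.x

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("input.csv", header=True)  # hypothetical input file

settings = {"link_type": "dedupe_only"}  # placeholder settings
linker = SparkLinker(df, settings)  # used in place of the DuckDB linker
```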
I am on Windows. The original source of the data is MSSQL, but I am just reading it from a CSV.
Thanks - that seems to be the pattern: every time we've seen this error, the data is from MSSQL Server on Windows. We'll continue to look into it. It's been challenging because I haven't yet been able to come up with a small, simple reproducible example.
Yes, I'm on Win10. Using Spark, it worked.
@leewilson-kmd I don't suppose it would be possible to share the data (the CSV file) that causes the error? If so, it would be best to share it directly with mark@duckdblabs.com (see duckdb/duckdb#1650).
It looks like this may finally be solved on the latest master of duckdb. It would be useful if people who have seen this issue could check whether it has gone away on the latest duckdb, so we can bump our dependencies and close this issue.
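(To test this, upgrade and confirm the installed version before re-running the failing step.)

```python
# First upgrade, e.g.:  pip install --upgrade duckdb
import duckdb
print(duckdb.__version__)  # confirm you're on the latest release before re-testing
```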
Closing this, as no one has reported it for a long time and the duckdb team believe it's fixed.
Hi,
this is the error I get when I run `clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)`. Some extra info:
Splink version 3.0.1
| Name | Version | Build | Channel |
| --- | --- | --- | --- |
| anaconda | 2022.05 | py39_0 | |
| anaconda-client | 1.9.0 | py39haa95532_0 | |
| anaconda-navigator | 2.1.4 | py39haa95532_0 | |
| anaconda-project | 0.10.2 | pyhd3eb1b0_0 | |
Windows specifications:
Edition: Windows 10 Enterprise
Version: 20H2
Thank you very much