
INTERNAL Error: Invalid unicode detected in segment statistics update! #664

Closed · MarianaBazely opened this issue Jul 21, 2022 · 15 comments

MarianaBazely commented Jul 21, 2022

Hi,

This is the error I get when I run `clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)`:


```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [51], in <cell line: 1>()
----> 1 clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:1211, in Linker.cluster_pairwise_predictions_at_threshold(self, df_predict, threshold_match_probability)
   1203 self._initialise_df_concat_with_tf(df_predict)
   1205 edges_table = _cc_create_unique_id_cols(
   1206     self,
   1207     df_predict,
   1208     threshold_match_probability,
   1209 )
-> 1211 cc = solve_connected_components(self, edges_table)
   1213 return cc

File ~\AppData\Roaming\Python\Python310\site-packages\splink\connected_components.py:411, in solve_connected_components(linker, edges_table, _generated_graph)
    409 sql = _cc_update_representatives_first_iter()
    410 # Execute if we have no batching, otherwise add it to our batched process
--> 411 representatives = linker._enqueue_and_execute_sql_pipeline(
    412     sql, "__splink__df_representatives"
    413 )
    414 prev_representatives_table = representatives
    416 # Loop while our representative table still has unsettled nodes

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:364, in Linker._enqueue_and_execute_sql_pipeline(self, sql, output_table_name, materialise_as_hash, use_cache, transpile)
    361 """Wrapper method to enqueue and execute a sql pipeline in a single call."""
    363 self._enqueue_sql(sql, output_table_name)
--> 364 return self._execute_sql_pipeline([], materialise_as_hash, use_cache, transpile)

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:323, in Linker._execute_sql_pipeline(self, input_dataframes, materialise_as_hash, use_cache, transpile)
    319     sql_gen = self._pipeline._generate_pipeline(input_dataframes)
    321     output_tablename_templated = self._pipeline.queue[-1].output_table_name
--> 323     dataframe = self._sql_to_splink_dataframe(
    324         sql_gen,
    325         output_tablename_templated,
    326         materialise_as_hash,
    327         use_cache,
    328         transpile,
    329     )
    330     return dataframe
    331 else:
    332     # In debug mode, we do not pipeline the sql and print the
    333     # results of each part of the pipeline

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:405, in Linker._sql_to_splink_dataframe(self, sql, output_tablename_templated, materialise_as_hash, use_cache, transpile)
    402     print(sql)
    404 if materialise_as_hash:
--> 405     splink_dataframe = self._execute_sql(
    406         sql, output_tablename_templated, table_name_hash, transpile=transpile
    407     )
    408 else:
    409     splink_dataframe = self._execute_sql(
    410         sql,
    411         output_tablename_templated,
    412         output_tablename_templated,
    413         transpile=transpile,
    414     )

File ~\AppData\Roaming\Python\Python310\site-packages\splink\duckdb\duckdb_linker.py:227, in DuckDBLinker._execute_sql(self, sql, templated_name, physical_name, transpile)
    220 logger.log(5, log_sql(sql))
    222 sql = f"""
    223 CREATE TABLE {physical_name}
    224 AS
    225 ({sql})
    226 """
--> 227 self._con.execute(sql).fetch_df()
    229 return DuckDBLinkerDataFrame(templated_name, physical_name, self)

RuntimeError: INTERNAL Error: INTERNAL Error: Invalid unicode detected in segment statistics update!
```

Some extra info:
Splink version 3.0.1

Name                Version   Build           Channel
anaconda            2022.05   py39_0
anaconda-client     1.9.0     py39haa95532_0
anaconda-navigator  2.1.4     py39haa95532_0
anaconda-project    0.10.2    pyhd3eb1b0_0

Windows specifications
Edition: Windows 10 Enterprise
Version: 20H2

Thank you very much

mamonu (Contributor) commented Jul 21, 2022

Related to #553.

RobinL (Member) commented Jul 21, 2022

@Th368MoJ I wonder if this could be to do with the pandas null issue. I think it's an outside chance, but it may be worth Mariana trying first because it's simple.

@MarianaBazely Could you try ensuring all null values are None rather than np.nan or pd.NA and rerunning?

Sample code to clear up zero-length strings and any instances of numpy/pandas nulls:

import numpy as np
import pandas as pd

data = [
    {'a': '', 'b': pd.NA, 'c': np.nan}
]

df = pd.DataFrame(data)

# deal with col a: replace empty/whitespace-only strings with None
df = df.replace(r'^\s*$', None, regex=True)

# deal with cols b and c: convert numpy/pandas nulls to pure-Python None
df2 = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])
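
As a quick sanity check (a sketch, not part of the original suggestion), you can print the concrete type of every cell so any surviving np.nan or pd.NA stands out:

# Every cell should now be a plain string or pure-Python None.
for col in df2.columns:
    print(col, [type(v).__name__ for v in df2[col]])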

MarianaBazely (Author) commented
When I changed my blocking rules, I was able to generate the image...

RobinL (Member) commented Jul 22, 2022

Did you manage to identify which blocking rule was causing the problem? It would be useful to know, as it may help us track this down.

Did clearing up the nulls make any difference?

MarianaBazely (Author) commented

All nulls have been replaced with np.nan. I can remove all rows with nulls instead.

These were the blocking rules that "caused" (I'm not sure if they are guilty) the unicode error:

    "blocking_rules_to_generate_predictions": [
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bday = r.bday",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.byear = r.byear",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla"

Anything else is fine, for example this huge one below:

    "blocking_rules_to_generate_predictions": [
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bday = r.bday",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.byear = r.byear",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla and l.ethnicity = r.ethnicity",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla and l.ethnicity = r.ethnicity and l.sex = r.sex",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and r.ethnicity = l.ethnicity and l.byear = r.byear and l.utla = r.utla",
    "l.sinitial = r.sinitial and l.finitial = r.finitial and l.bday = r.bday and l.byear = r.byear and l.bmonth = r.bmonth \
    and r.ethnicity = l.ethnicity and l.utla = r.utla",
    "l.sinitial = r.sinitial and l.finitial = r.finitial and l.bday = r.bday and l.byear = r.byear and l.bmonth = r.bmonth \
    and r.ethnicity = l.ethnicity and l.utla = r.utla and l.sex = r.sex",
    ]
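
For context, these rules sit inside the Splink settings dictionary passed to the linker; a minimal sketch (link_type is an assumption here, and only the first rule is repeated):

settings = {
    "link_type": "dedupe_only",  # assumption: a deduplication job
    "blocking_rules_to_generate_predictions": [
        "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
        # ... remaining rules as listed above ...
    ],
}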

RobinL (Member) commented Jul 27, 2022

Weird! Could you please try the broken blocking rules together with the code I posted above that replaces np.nan and pd.NA with pure-Python None?

RobinL changed the title from "unicode error" to "INTERNAL Error: Invalid unicode detected in segment statistics update!" on Sep 30, 2022
leewilson-kmd commented Oct 11, 2022

I am experiencing the same error:

Traceback (most recent call last):
  File "c:/path/test_splink.py", line 92, in <module>
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 999, in estimate_parameters_using_expectation_maximisation
    em_training_session._train()
  File "c:/path/to/env/lib\site-packages\splink\em_training_session.py", line 165, in _train
    cvv = self._comparison_vectors()
  File "c:/path/to/env/lib\site-packages\splink\em_training_session.py", line 161, in _comparison_vectors
    return self._training_linker._execute_sql_pipeline(input_dataframes)
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 346, in _execute_sql_pipeline
    dataframe = self._sql_to_splink_dataframe_checking_cache(
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 481, in _sql_to_splink_dataframe_checking_cache
    splink_dataframe = self._execute_sql_against_backend(
  File "c:/path/to/env/lib\site-packages\splink\duckdb\duckdb_linker.py", line 190, in _execute_sql_against_backend
    self._con.execute(sql).fetch_df()
duckdb.InternalException: INTERNAL Error: Invalid unicode detected in segment statistics update!

I am working on Windows 10 with an input dataset of 2 million rows, 23 columns.

I tried changing blanks/NA/NaN to None with @RobinL's suggested code, but it did not help:

df = df.replace(r'^\s*$', None, regex=True)
df = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])

I also tried discarding all non-ASCII characters, which also did not help:

for col in df.columns[1:]:
    df[col] = df[col].str.encode('ascii', 'ignore').str.decode('ascii', 'ignore')

These are the blocking rules:

'blocking_rules_to_generate_predictions': [
  'l.first_name = r.first_name and l.last_name = r.last_name',
  'l.birth_date = r.birth_date',
  'l.properties_city = r.properties_city and l.country = r.country'
]

With these blocking rules, the error goes away:

'blocking_rules_to_generate_predictions': [
  'l.first_name = r.first_name and l.last_name = r.last_name',
  'l.birth_date = r.birth_date'
]

I also tried playing around with the size of the input data set (under the apparently incorrect assumption that it was bad character encoding causing the error). If I narrow the input to the first 14,000 rows the error still occurs. If I narrow it down to the first 13,000 rows the error does not occur.
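
One way to automate that kind of narrowing (a hypothetical sketch; run_linkage stands in for whatever call reproduces the error) is a simple bisection over prefix lengths:

def find_failing_prefix(df, run_linkage, lo=0, hi=None):
    """Return the smallest prefix length of df that still raises the error,
    assuming the full frame fails and a short prefix succeeds."""
    if hi is None:
        hi = len(df)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        try:
            run_linkage(df.iloc[:mid])
            lo = mid  # prefix of length mid runs cleanly
        except Exception:
            hi = mid  # prefix of length mid reproduces the error
    return hi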

Stongtong commented Oct 13, 2022

Hi, I converted my JSON data to a Parquet file (in Java) and used a "CREATE TABLE ... AS SELECT * FROM read_parquet()" statement to import the data into DuckDB, and I hit the same exception. My data has 3 fields; one of them is a numeric type and contains null values. Looking forward to your reply.
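
For reference, the import pattern described can be reproduced from Python roughly like this (file and table names are hypothetical):

import duckdb

con = duckdb.connect()  # in-memory database
# Load a Parquet file written elsewhere (e.g. by Java) into a DuckDB table.
con.execute("CREATE TABLE my_table AS SELECT * FROM read_parquet('data.parquet')")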

RobinL (Member) commented Oct 13, 2022

Thanks for the info. Are you on Windows? Is the source of the data Microsoft SQL Server? One workaround is to use the Spark linker, which doesn't suffer from this problem.
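
A rough sketch of that workaround (the constructor arguments are assumptions and may differ by Splink version; spark is an existing SparkSession and df_spark a Spark DataFrame):

from splink.spark.spark_linker import SparkLinker

linker = SparkLinker(df_spark, settings, spark=spark)
df_predict = linker.predict(threshold_match_probability=0.95)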

leewilson-kmd commented
I am on Windows. The original source of the data is MSSQL, but I am just reading it from a CSV.

RobinL (Member) commented Oct 16, 2022

Thanks, that seems to be the pattern: every time we've seen this error, the data has come from MSSQL Server on Windows. We'll continue to look into it. It's been challenging because I haven't yet been able to come up with a small, simple reproducible example.

Stongtong commented
Yes, I'm on Windows 10. Using Spark, it worked.

RobinL (Member) commented Oct 22, 2022

@leewilson-kmd I don't suppose it would be possible to share the data that causes the error (the CSV file)? If so, it would be best to share it directly with mark@duckdblabs.com (see duckdb/duckdb#1650).

RobinL (Member) commented Nov 12, 2022

It looks like this may finally be solved on the latest master of duckdb:
duckdb/duckdb#1650 (comment)

It would be useful if people who have seen this issue could check whether it's gone away on the latest duckdb, so we can bump our dependencies and close this issue.
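
For anyone checking, the installed DuckDB version can be confirmed from Python (a minimal sketch):

import duckdb

print(duckdb.__version__)  # upgrade first with: pip install --upgrade duckdb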

RobinL (Member) commented Feb 16, 2023

Closing this, as no one has reported it for a long time and the DuckDB team think it's fixed.
