
INTERNAL Error: Invalid unicode detected in segment statistics update! #664

Closed · MarianaBazely opened this issue Jul 21, 2022 · 15 comments

MarianaBazely commented Jul 21, 2022

Hi,

This is the error I get when I run `clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)`:


```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [51], in <cell line: 1>()
----> 1 clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:1211, in Linker.cluster_pairwise_predictions_at_threshold(self, df_predict, threshold_match_probability)
   1203 self._initialise_df_concat_with_tf(df_predict)
   1205 edges_table = _cc_create_unique_id_cols(
   1206     self,
   1207     df_predict,
   1208     threshold_match_probability,
   1209 )
-> 1211 cc = solve_connected_components(self, edges_table)
   1213 return cc

File ~\AppData\Roaming\Python\Python310\site-packages\splink\connected_components.py:411, in solve_connected_components(linker, edges_table, _generated_graph)
    409 sql = _cc_update_representatives_first_iter()
    410 # Execute if we have no batching, otherwise add it to our batched process
--> 411 representatives = linker._enqueue_and_execute_sql_pipeline(
    412     sql, "__splink__df_representatives"
    413 )
    414 prev_representatives_table = representatives
    416 # Loop while our representative table still has unsettled nodes

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:364, in Linker._enqueue_and_execute_sql_pipeline(self, sql, output_table_name, materialise_as_hash, use_cache, transpile)
    361 """Wrapper method to enqueue and execute a sql pipeline in a single call."""
    363 self._enqueue_sql(sql, output_table_name)
--> 364 return self._execute_sql_pipeline([], materialise_as_hash, use_cache, transpile)

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:323, in Linker._execute_sql_pipeline(self, input_dataframes, materialise_as_hash, use_cache, transpile)
    319     sql_gen = self._pipeline._generate_pipeline(input_dataframes)
    321     output_tablename_templated = self._pipeline.queue[-1].output_table_name
--> 323     dataframe = self._sql_to_splink_dataframe(
    324         sql_gen,
    325         output_tablename_templated,
    326         materialise_as_hash,
    327         use_cache,
    328         transpile,
    329     )
    330     return dataframe
    331 else:
    332     # In debug mode, we do not pipeline the sql and print the
    333     # results of each part of the pipeline

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:405, in Linker._sql_to_splink_dataframe(self, sql, output_tablename_templated, materialise_as_hash, use_cache, transpile)
    402     print(sql)
    404 if materialise_as_hash:
--> 405     splink_dataframe = self._execute_sql(
    406         sql, output_tablename_templated, table_name_hash, transpile=transpile
    407     )
    408 else:
    409     splink_dataframe = self._execute_sql(
    410         sql,
    411         output_tablename_templated,
    412         output_tablename_templated,
    413         transpile=transpile,
    414     )

File ~\AppData\Roaming\Python\Python310\site-packages\splink\duckdb\duckdb_linker.py:227, in DuckDBLinker._execute_sql(self, sql, templated_name, physical_name, transpile)
    220 logger.log(5, log_sql(sql))
    222 sql = f"""
    223 CREATE TABLE {physical_name}
    224 AS
    225 ({sql})
    226 """
--> 227 self._con.execute(sql).fetch_df()
    229 return DuckDBLinkerDataFrame(templated_name, physical_name, self)

RuntimeError: INTERNAL Error: INTERNAL Error: Invalid unicode detected in segment statistics update!
```

Some extra info:
Splink version 3.0.1

Name                Version   Build           Channel
anaconda            2022.05   py39_0
anaconda-client     1.9.0     py39haa95532_0
anaconda-navigator  2.1.4     py39haa95532_0
anaconda-project    0.10.2    pyhd3eb1b0_0

Windows specifications
Edition: Windows 10 Enterprise
Version: 20H2

Thank you very much

mamonu (Contributor) commented Jul 21, 2022

Related to #553.

RobinL (Member) commented Jul 21, 2022

@Th368MoJ I wonder if this could be to do with the pandas null issue. I think it's an outside chance, but it may be worth Mariana trying first because it's simple.

@MarianaBazely Could you try ensuring all null values are None rather than np.nan or pd.NA and rerunning?

Sample code to clear up zero-length strings and any instances of numpy/pandas nulls:

import numpy as np
import pandas as pd

data = [
    {'a': '', 'b': pd.NA, 'c': np.nan}
]

df = pd.DataFrame(data)

# deal with col a: replace empty/whitespace-only strings with None
df = df.replace(r'^\s*$', None, regex=True)

# deal with cols b and c: convert numpy/pandas nulls to pure-Python None
df2 = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])
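
As a quick sanity check (a sketch, not part of the original suggestion), you can print the concrete type of every cell so any surviving np.nan or pd.NA stands out:

# Every cell should now be a plain string or pure-Python None.
for col in df2.columns:
    print(col, [type(v).__name__ for v in df2[col]])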

MarianaBazely (Author) commented
When I changed my blocking rules, I was able to generate the image...

RobinL (Member) commented Jul 22, 2022

Did you manage to identify which blocking rule was causing the problem? It would be useful to know, as it may help us track this down.

Did clearing up the nulls make any difference?

MarianaBazely (Author) commented

All nulls have been replaced with np.nan. I can remove all rows with nulls instead.

These were the blocking rules that "caused" (I'm not sure if they are guilty) the unicode error:

    "blocking_rules_to_generate_predictions": [
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bday = r.bday",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.byear = r.byear",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla"

Anything else is fine, for example this huge one below:

    "blocking_rules_to_generate_predictions": [
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bday = r.bday",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.byear = r.byear",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla and l.ethnicity = r.ethnicity",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and l.utla = r.utla and l.ethnicity = r.ethnicity and l.sex = r.sex",
    "l.finitial = r.finitial and l.sinitial = r.sinitial and r.ethnicity = l.ethnicity and l.byear = r.byear and l.utla = r.utla",
    "l.sinitial = r.sinitial and l.finitial = r.finitial and l.bday = r.bday and l.byear = r.byear and l.bmonth = r.bmonth \
    and r.ethnicity = l.ethnicity and l.utla = r.utla",
    "l.sinitial = r.sinitial and l.finitial = r.finitial and l.bday = r.bday and l.byear = r.byear and l.bmonth = r.bmonth \
    and r.ethnicity = l.ethnicity and l.utla = r.utla and l.sex = r.sex",
    ]
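
For context, these rules sit inside the Splink settings dictionary passed to the linker; a minimal sketch (link_type is an assumption here, and only the first rule is repeated):

settings = {
    "link_type": "dedupe_only",  # assumption: a deduplication job
    "blocking_rules_to_generate_predictions": [
        "l.finitial = r.finitial and l.sinitial = r.sinitial and l.bmonth = r.bmonth",
        # ... remaining rules as listed above ...
    ],
}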

RobinL (Member) commented Jul 27, 2022

Weird! Could you please try the broken blocking rules together with the code I posted above that replaces np.nan and pd.NA with pure-Python None?

RobinL changed the title from "unicode error" to "INTERNAL Error: Invalid unicode detected in segment statistics update!" on Sep 30, 2022
leewilson-kmd commented Oct 11, 2022

I am experiencing the same error:

Traceback (most recent call last):
  File "c:/path/test_splink.py", line 92, in <module>
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 999, in estimate_parameters_using_expectation_maximisation
    em_training_session._train()
  File "c:/path/to/env/lib\site-packages\splink\em_training_session.py", line 165, in _train
    cvv = self._comparison_vectors()
  File "c:/path/to/env/lib\site-packages\splink\em_training_session.py", line 161, in _comparison_vectors
    return self._training_linker._execute_sql_pipeline(input_dataframes)
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 346, in _execute_sql_pipeline
    dataframe = self._sql_to_splink_dataframe_checking_cache(
  File "c:/path/to/env/lib\site-packages\splink\linker.py", line 481, in _sql_to_splink_dataframe_checking_cache
    splink_dataframe = self._execute_sql_against_backend(
  File "c:/path/to/env/lib\site-packages\splink\duckdb\duckdb_linker.py", line 190, in _execute_sql_against_backend
    self._con.execute(sql).fetch_df()
duckdb.InternalException: INTERNAL Error: Invalid unicode detected in segment statistics update!

I am working on Windows 10 with an input dataset of 2 million rows, 23 columns.

I tried changing blanks/NA/NaN to None with @RobinL's suggested code, but it did not help:

df = df.replace(r'^\s*$', None, regex=True)
df = df.fillna(np.nan).replace([np.nan, pd.NA], [None, None])

I also tried discarding all non-ASCII characters, which also did not help:

for col in df.columns[1:]:
    df[col] = df[col].str.encode('ascii', 'ignore').str.decode('ascii', 'ignore')

These are the blocking rules:

'blocking_rules_to_generate_predictions': [
  'l.first_name = r.first_name and l.last_name = r.last_name',
  'l.birth_date = r.birth_date',
  'l.properties_city = r.properties_city and l.country = r.country'
]

With these blocking rules, the error goes away:

'blocking_rules_to_generate_predictions': [
  'l.first_name = r.first_name and l.last_name = r.last_name',
  'l.birth_date = r.birth_date'
]

I also tried playing around with the size of the input data set (under the apparently incorrect assumption that it was bad character encoding causing the error). If I narrow the input to the first 14,000 rows the error still occurs. If I narrow it down to the first 13,000 rows the error does not occur.
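
One way to automate that kind of narrowing (a hypothetical sketch; run_linkage stands in for whatever call reproduces the error) is a simple bisection over prefix lengths:

def find_failing_prefix(df, run_linkage, lo=0, hi=None):
    """Return the smallest prefix length of df that still raises the error,
    assuming the full frame fails and a short prefix succeeds."""
    if hi is None:
        hi = len(df)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        try:
            run_linkage(df.iloc[:mid])
            lo = mid  # prefix of length mid runs cleanly
        except Exception:
            hi = mid  # prefix of length mid reproduces the error
    return hi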

Stongtong commented Oct 13, 2022

Hi, I converted my JSON data to a Parquet file (in Java) and used a "CREATE TABLE ... AS SELECT * FROM read_parquet()" statement to import the data into DuckDB, and I hit the same exception. My data has 3 fields; one of them is a numeric type and contains null values. Looking forward to your reply.
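
For reference, the import pattern described can be reproduced from Python roughly like this (file and table names are hypothetical):

import duckdb

con = duckdb.connect()  # in-memory database
# Load a Parquet file written elsewhere (e.g. by Java) into a DuckDB table.
con.execute("CREATE TABLE my_table AS SELECT * FROM read_parquet('data.parquet')")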

RobinL (Member) commented Oct 13, 2022

Thanks for the info. Are you on Windows? Is the source of the data Microsoft SQL Server? One workaround is to use the Spark linker, which doesn't suffer from this problem.
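
A rough sketch of that workaround (the constructor arguments are assumptions and may differ by Splink version; spark is an existing SparkSession and df_spark a Spark DataFrame):

from splink.spark.spark_linker import SparkLinker

linker = SparkLinker(df_spark, settings, spark=spark)
df_predict = linker.predict(threshold_match_probability=0.95)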

leewilson-kmd commented
I am on Windows. The original source of the data is MSSQL, but I am just reading it from a CSV.

RobinL (Member) commented Oct 16, 2022

Thanks, that seems to be the pattern: every time we've seen this error, the data has come from MSSQL Server on Windows. We'll continue to look into it. It's been challenging because I haven't yet been able to come up with a small, simple reproducible example.

Stongtong commented
Yes, I'm on Windows 10. Using Spark, it worked.

RobinL (Member) commented Oct 22, 2022

@leewilson-kmd I don't suppose it would be possible to share the data that causes the error (the CSV file)? If so, it would be best to share it directly with mark@duckdblabs.com (see duckdb/duckdb#1650).

RobinL (Member) commented Nov 12, 2022

It looks like this may finally be solved on the latest master of duckdb:
duckdb/duckdb#1650 (comment)

It would be useful if people who have seen this issue could check whether it's gone away on the latest duckdb, so we can bump our dependencies and close this issue.
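
For anyone checking, the installed DuckDB version can be confirmed from Python (a minimal sketch):

import duckdb

print(duckdb.__version__)  # upgrade first with: pip install --upgrade duckdb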

RobinL (Member) commented Feb 16, 2023

Closing this, as no one has reported it for a long time and the DuckDB team think it's fixed.
