Strange Error On Some CSV Inputs #58

Open
IsaacMilarky opened this issue May 22, 2024 · 1 comment

Labels
dedupliFHIR DedupliFHIR repo tickets

Comments

@IsaacMilarky
Collaborator

Describe the bug
Ran into an out-of-range error when using the tool with this CSV input:

"id","truth_value","family_name","given_name","gender","birth_date","phone","street_address","city","state","postal_code","SSN"
"4_Arlene_Oliver.xml","","oliver","arlene","female","05/26/1972","434-228-8487","845 Collier Pike Center","North Katheryn","WA","99033",""
"21_Raul_Waters.xml","","waters","raul","male","10/31/1972","(878)360-4765","61035 Adell Ranch Wall","Kerlukeburgh","NC","28885",""
"22_Raul_Waters.xml","","waters","raul","male","10/31/1972","(878)360-4765","61035 Adell Ranch Wall","Kerlukeburgh","NC","28885",""
"29_Terrance_Weber.xml","","weber","terrance","male","09/19/1960","903-814-2082","88790 Lemke Trail Road","South Quinton","NE","68843",""
"30_Terrance_Weber.xml","","weber","terrance","male","09/12/1960","903-814-2082","88790 Lemke Trail Road","South Quinton","NE","68843",""

This input has no SSN or truth value, which may have something to do with the error.

However, when I run the script:
python3.11 cli/ecqm_dedupe.py dedupe-data --fmt CSV /tmp/x.csv /tmp/out.csv

Things go very wrong (note that running with the same data I initially sent works fine):

Format is CSV
Data dir is /tmp/x.csv
----- Estimating u probabilities using random sampling -----
u probability not trained for postal_code - Exact match Postcode District (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code - Exact match Postcode Area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
    - given_name (no m values are trained).
    - family_name (no m values are trained).
    - birth_date (no m values are trained).
    - postal_code (some u values are not trained, no m values are trained).
    - street_address (no m values are trained).
    - phone (no m values are trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."ssn" = r."ssn"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - family_name
    - birth_date
    - postal_code
    - street_address
    - phone

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 

WARNING:
Level Within 1 month on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Within 10 years on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Exact match Postcode District on comparison postal_code not observed in dataset, unable to train m value

WARNING:
Level Exact match Postcode Area on comparison postal_code not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.904 in the m_probability of street_address, level `All other comparisons`
Iteration 2: Largest change in params was 0.197 in probability_two_random_records_match
Iteration 3: Largest change in params was 1.06e-09 in probability_two_random_records_match

EM converged after 3 iterations
m probability not trained for birth_date - Within 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Within 10 years (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - Exact match Postcode District (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - Exact match Postcode Area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
    - birth_date (some m values are not trained).
    - postal_code (some u values are not trained, some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."birth_date" = r."birth_date"

Parameter estimates will be made for the following comparison(s):
    - given_name
    - family_name
    - postal_code
    - street_address
    - phone

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: 
    - birth_date

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 691, in _log_and_run_sql_execution
    return self._run_sql_execution(final_sql, templated_name, physical_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splink/duckdb/linker.py", line 225, in _run_sql_execution
    self._con.sql(final_sql)
duckdb.duckdb.OutOfRangeException: Out of Range Error: cannot take logarithm of zero

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cage/public_html/mdinteractive/extraApps/dedupliFHIR-main/cli/ecqm_dedupe.py", line 190, in <module>
    cli()
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cage/public_html/mdinteractive-03/extraApps/dedupliFHIR-main/cli/deduplifhirLib/utils.py", line 150, in wrapper
    return func(*args,**kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/cage/public_html/mdinteractive/extraApps/dedupliFHIR-main/cli/ecqm_dedupe.py", line 35, in dedupe_data
    linker.estimate_parameters_using_expectation_maximisation(
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 1682, in estimate_parameters_using_expectation_maximisation
    em_training_session._train()
  File "/usr/local/lib/python3.11/site-packages/splink/em_training_session.py", line 197, in _train
    expectation_maximisation(self, cvv)
  File "/usr/local/lib/python3.11/site-packages/splink/expectation_maximisation.py", line 256, in expectation_maximisation
    df_params = linker._execute_sql_pipeline([df_comparison_vector_values])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 629, in _execute_sql_pipeline
    raise e
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 623, in _execute_sql_pipeline
    dataframe = self._sql_to_splink_dataframe_checking_cache(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 889, in _sql_to_splink_dataframe_checking_cache
    splink_dataframe = self._execute_sql_against_backend(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/splink/duckdb/linker.py", line 220, in _execute_sql_against_backend
    self._log_and_run_sql_execution(sql, templated_name, physical_name)
  File "/usr/local/lib/python3.11/site-packages/splink/linker.py", line 703, in _log_and_run_sql_execution
    raise SplinkException(
splink.exceptions.SplinkException: Error executing the following sql for table `__splink__m_u_counts`(__splink__m_u_counts_26a1779fb):
CREATE TABLE __splink__m_u_counts_26a1779fb AS
(
  WITH __splink__df_comparison_vectors AS (
    SELECT
      *
    FROM __splink__df_comparison_vectors_a7cd2b9d8
  ), __splink__df_match_weight_parts AS (
    SELECT
      "unique_id_l",
      "unique_id_r",
      gamma_given_name,
      CASE
        WHEN gamma_given_name = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_given_name = 4
        THEN CAST(111.89189187998129 AS DOUBLE)
        WHEN gamma_given_name = 3
        THEN CAST(81.88130481588395 AS DOUBLE)
        WHEN gamma_given_name = 2
        THEN CAST(92.09886357253461 AS DOUBLE)
        WHEN gamma_given_name = 1
        THEN CAST(48.06629076272102 AS DOUBLE)
        WHEN gamma_given_name = 0
        THEN CAST(4.009431336940681e-229 AS DOUBLE)
      END AS bf_given_name,
      CASE
        WHEN gamma_given_name = -1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_given_name = 4
        THEN (
          CASE
            WHEN NOT COALESCE("tf_given_name_l", "tf_given_name_r") IS NULL
            THEN POWER(
              CAST(0.0014492753623188406 AS DOUBLE) / NULLIF(
                (
                  CASE
                    WHEN COALESCE("tf_given_name_l", "tf_given_name_r") >= COALESCE("tf_given_name_r", "tf_given_name_l")
                    THEN COALESCE("tf_given_name_l", "tf_given_name_r")
                    ELSE COALESCE("tf_given_name_r", "tf_given_name_l")
                  END
                ),
                0
              ),
              CAST(1.0 AS DOUBLE)
            )
            ELSE CAST(1 AS DOUBLE)
          END
        )
        WHEN gamma_given_name = 3
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_given_name = 2
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_given_name = 1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_given_name = 0
        THEN CAST(1 AS DOUBLE)
      END AS bf_tf_adj_given_name,
      gamma_family_name,
      CASE
        WHEN gamma_family_name = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_family_name = 4
        THEN CAST(139.19707503507163 AS DOUBLE)
        WHEN gamma_family_name = 3
        THEN CAST(115.4071885370217 AS DOUBLE)
        WHEN gamma_family_name = 2
        THEN CAST(165.4130726624298 AS DOUBLE)
        WHEN gamma_family_name = 1
        THEN CAST(57.22386641775578 AS DOUBLE)
        WHEN gamma_family_name = 0
        THEN CAST(3.992587823932763e-229 AS DOUBLE)
      END AS bf_family_name,
      CASE
        WHEN gamma_family_name = -1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_family_name = 4
        THEN (
          CASE
            WHEN NOT COALESCE("tf_family_name_l", "tf_family_name_r") IS NULL
            THEN POWER(
              CAST(0.0011649825407125862 AS DOUBLE) / NULLIF(
                (
                  CASE
                    WHEN COALESCE("tf_family_name_l", "tf_family_name_r") >= COALESCE("tf_family_name_r", "tf_family_name_l")
                    THEN COALESCE("tf_family_name_l", "tf_family_name_r")
                    ELSE COALESCE("tf_family_name_r", "tf_family_name_l")
                  END
                ),
                0
              ),
              CAST(1.0 AS DOUBLE)
            )
            ELSE CAST(1 AS DOUBLE)
          END
        )
        WHEN gamma_family_name = 3
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_family_name = 2
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_family_name = 1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_family_name = 0
        THEN CAST(1 AS DOUBLE)
      END AS bf_tf_adj_family_name,
      gamma_postal_code,
      CASE
        WHEN gamma_postal_code = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_postal_code = 4
        THEN CAST(266.5650741350906 AS DOUBLE)
        WHEN gamma_postal_code = 3
        THEN CAST(3.970841325158504e-229 AS DOUBLE)
        WHEN gamma_postal_code = 2
        THEN CAST(1.259921049894873 AS DOUBLE)
        WHEN gamma_postal_code = 1
        THEN CAST(0.19842513149602492 AS DOUBLE)
        WHEN gamma_postal_code = 0
        THEN CAST(0.03125 AS DOUBLE)
      END AS bf_postal_code,
      gamma_street_address,
      CASE
        WHEN gamma_street_address = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_street_address = 1
        THEN CAST(259.14714720358563 AS DOUBLE)
        WHEN gamma_street_address = 0
        THEN CAST(0.9640980278331669 AS DOUBLE)
      END AS bf_street_address,
      CASE
        WHEN gamma_street_address = -1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_street_address = 1
        THEN (
          CASE
            WHEN NOT COALESCE("tf_street_address_l", "tf_street_address_r") IS NULL
            THEN POWER(
              CAST(0.00013905627143784185 AS DOUBLE) / NULLIF(
                (
                  CASE
                    WHEN COALESCE("tf_street_address_l", "tf_street_address_r") >= COALESCE("tf_street_address_r", "tf_street_address_l")
                    THEN COALESCE("tf_street_address_l", "tf_street_address_r")
                    ELSE COALESCE("tf_street_address_r", "tf_street_address_l")
                  END
                ),
                0
              ),
              CAST(1.0 AS DOUBLE)
            )
            ELSE CAST(1 AS DOUBLE)
          END
        )
        WHEN gamma_street_address = 0
        THEN CAST(1 AS DOUBLE)
      END AS bf_tf_adj_street_address,
      gamma_phone,
      CASE
        WHEN gamma_phone = -1
        THEN CAST(1.0 AS DOUBLE)
        WHEN gamma_phone = 1
        THEN CAST(1.7515202196470399 AS DOUBLE)
        WHEN gamma_phone = 0
        THEN CAST(0.5579969710188165 AS DOUBLE)
      END AS bf_phone,
      CASE
        WHEN gamma_phone = -1
        THEN CAST(1 AS DOUBLE)
        WHEN gamma_phone = 1
        THEN (
          CASE
            WHEN NOT COALESCE("tf_phone_l", "tf_phone_r") IS NULL
            THEN POWER(
              CAST(0.3703346620932604 AS DOUBLE) / NULLIF(
                (
                  CASE
                    WHEN COALESCE("tf_phone_l", "tf_phone_r") >= COALESCE("tf_phone_r", "tf_phone_l")
                    THEN COALESCE("tf_phone_l", "tf_phone_r")
                    ELSE COALESCE("tf_phone_r", "tf_phone_l")
                  END
                ),
                0
              ),
              CAST(1.0 AS DOUBLE)
            )
            ELSE CAST(1 AS DOUBLE)
          END
        )
        WHEN gamma_phone = 0
        THEN CAST(1 AS DOUBLE)
      END AS bf_tf_adj_phone,
      "ssn_l",
      "ssn_r",
      match_key
    FROM __splink__df_comparison_vectors
  ), __splink__df_predict AS (
    SELECT
      LOG(
        2,
        CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
      ) AS match_weight,
      CASE
        WHEN bf_given_name = CAST('infinity' AS DOUBLE)
        OR bf_tf_adj_given_name = CAST('infinity' AS DOUBLE)
        OR bf_family_name = CAST('infinity' AS DOUBLE)
        OR bf_tf_adj_family_name = CAST('infinity' AS DOUBLE)
        OR bf_postal_code = CAST('infinity' AS DOUBLE)
        OR bf_street_address = CAST('infinity' AS DOUBLE)
        OR bf_tf_adj_street_address = CAST('infinity' AS DOUBLE)
        OR bf_phone = CAST('infinity' AS DOUBLE)
        OR bf_tf_adj_phone = CAST('infinity' AS DOUBLE)
        THEN 1.0
        ELSE (
          CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
        ) / NULLIF(
          (
            1 + (
              CAST(0.02611780678018568 AS DOUBLE) * bf_given_name * bf_tf_adj_given_name * bf_family_name * bf_tf_adj_family_name * bf_postal_code * bf_street_address * bf_tf_adj_street_address * bf_phone * bf_tf_adj_phone
            )
          ),
          0
        )
      END AS match_probability,
      "unique_id_l",
      "unique_id_r",
      gamma_given_name,
      gamma_family_name,
      gamma_postal_code,
      gamma_street_address,
      gamma_phone,
      "ssn_l",
      "ssn_r",
      match_key
    FROM __splink__df_match_weight_parts
    ORDER BY
      1 NULLS LAST
  )
  SELECT
    gamma_given_name AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'given_name' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_given_name
  UNION ALL
  SELECT
    gamma_family_name AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'family_name' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_family_name
  UNION ALL
  SELECT
    gamma_postal_code AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'postal_code' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_postal_code
  UNION ALL
  SELECT
    gamma_street_address AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'street_address' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_street_address
  UNION ALL
  SELECT
    gamma_phone AS comparison_vector_value,
    SUM(match_probability * 1) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) AS u_count,
    'phone' AS output_column_name
  FROM __splink__df_predict
  GROUP BY
    gamma_phone
  UNION ALL
  SELECT
    0 AS comparison_vector_value,
    SUM(match_probability * 1) / NULLIF(SUM(1), 0) AS m_count,
    SUM((
      1 - match_probability
    ) * 1) / NULLIF(SUM(1), 0) AS u_count,
    '_probability_two_random_records_match' AS output_column_name
  FROM __splink__df_predict
)

Error was: Out of Range Error: cannot take logarithm of zero
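
For what it's worth, the failing expression appears to be the LOG(2, …) call in the __splink__df_predict CTE above: with so many m/u values left untrained, the Bayes-factor product can collapse to exactly zero, and DuckDB raises on a zero logarithm instead of returning -inf. A minimal standalone sketch of that DuckDB behaviour (my assumption, calling the duckdb Python package directly; not code from this repo):

import duckdb

try:
    # Same shape as the failing expression: log base 2 of a product that
    # has collapsed to zero.
    duckdb.sql("SELECT LOG(2, 0.0) AS match_weight").fetchall()
except duckdb.OutOfRangeException as exc:
    print(exc)  # Out of Range Error: cannot take logarithm of zero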

To Reproduce
Use the above input with this CLI call:
python3.11 cli/ecqm_dedupe.py dedupe-data --fmt CSV /tmp/x.csv /tmp/out.csv

Expected behavior
The CLI should output the deduplicated data normally.

Actual behavior
The tool throws an out-of-range error.

@IsaacMilarky
Collaborator Author

The problem described in this issue is that the provided data does not have the fields the tool is configured to look for (SSN specifically). This is a problem for how the tool works on the back-end: it uses a combination of deterministic and probabilistic rules to find duplicates, and it can't do that if none of the data satisfies any of the deterministic rules.
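
For context, the back-end follows the usual Splink 3 shape of estimating the prior from deterministic rules and then training u/m probabilities. A rough sketch of that pattern (the paths, rule strings, and recall value here are illustrative, not necessarily what the repo ships):

import json
import pandas as pd
from splink.duckdb.linker import DuckDBLinker

# Illustrative paths; the CLI resolves its own input and settings files.
df = pd.read_csv("/tmp/x.csv")
df.columns = [c.lower() for c in df.columns]  # rules below use lower-case names
settings = json.load(open("splink_settings.json"))
linker = DuckDBLinker(df, settings)

# Deterministic side: these rules feed the prior estimate. If no record pair
# satisfies any of them (e.g. every SSN is blank), the estimate degenerates.
deterministic_rules = [
    'l."ssn" = r."ssn"',
    'l."phone" = r."phone" and l."birth_date" = r."birth_date"',
]
linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)

# Probabilistic side: u from random sampling, m from EM blocked on a column,
# which is what the log output above walks through.
linker.estimate_u_using_random_sampling(max_pairs=1e6)
linker.estimate_parameters_using_expectation_maximisation('l."ssn" = r."ssn"')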

To solve this, the user must change the deterministic rules so that they match at least 5 data points in order to be accepted. This is done by editing splink_settings.json and changing the fields to ones that are present in the data, roughly as sketched below.
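
As a rough illustration of that edit (the key name below follows the standard Splink settings schema; I haven't checked exactly how this repo's splink_settings.json lays out its rules, so treat the field names and path as placeholders):

import json

# Hypothetical sketch: swap rules that reference a field the data never
# populates (SSN) for rules built from fields that are present in the CSV.
with open("splink_settings.json") as f:
    settings = json.load(f)

settings["blocking_rules_to_generate_predictions"] = [
    'l."family_name" = r."family_name" and l."birth_date" = r."birth_date"',
    'l."phone" = r."phone"',
]

with open("splink_settings.json", "w") as f:
    json.dump(settings, f, indent=2)

With rules like those, the sample CSV above should give the deterministic pass enough matching pairs to avoid the degenerate values that end in the logarithm-of-zero error.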
