Surprised by type inference: csvjoin removing underscores in columns containg values like: 1_100 #1246

joeweaver · 2024-05-15T17:09:02Z

Hello,

I've spent a while figuring out why underscores were getting removed in what I believe is a fairly simple use-case. I'm using csvjoin 2.0.0

Here's a minimal example:

file1.csv contents:

Name,Type
MHYS,foo
JABI,bar

file2.csv contents:

Name,ID
MHYS,100_1
JABI,1030_11

Running csvjoin -c Name file1.csv file2.csv produces the following:

Name,Type ,ID
MHYS,foo,1001
JABI,bar,103011

The underscores in the ID field are getting dropped. I've tracked this down to type inference. Running csvjoin with the --no-inference option produces the desired behaviour.

I was a bit surprised by this, as it seems a bit aggressive of a default inference on what I believe to be a very common text
pattern. I've had my share of being bitten by type inference in the tidyverse and when using pandas, but these sort of fields were never an issue.

Finding a type inference method that handles all situations perfectly is a pipe dream, and I don't have an exact solution, but I'd like to point out that:

This happens silently, I was lucky to catch the error while debugging my data pipeline.
Figuring out the root cause was a bit of a time sink. My main reason for filing this issue is to ensure that even if there isn't a good way to fix the issue, this post may help others searching for 'missing/dropped/removed underscores'.

The text was updated successfully, but these errors were encountered:

jpmckinney · 2024-05-15T18:24:09Z

We use Python's Decimal for parsing: https://github.com/wireservice/agate/blob/e6fc5beb65f444a84b02884e9c5e7b3e344599dd/agate/data_types/number.py#L95

For example:

>>> from decimal import Decimal
>>> Decimal('1_1')
Decimal('11')

Decimal works this way because, in Python, the underscore can be used as a grouping character:

>>> 1_000_000
1000000
>>> 1_1_1_1
1111

I'd be curious to know how tidyverse or pandas do this.

In terms of feedback to the user, we can hide it behind the -v (--verbose) flag, or we can figure out a way to keep the feedback minimal (e.g. only warn once, not every time).

joeweaver · 2024-05-16T09:07:13Z

I'll check to confirm my memory that it doesn't occur in pandas/tidyverse. I'm not super familiar with their underlying code, but I can take a quick look to figure out how they're handling it.

I generally lean towards cli tools being as quiet as possible, so in that respect I'm leaning towards the verbose option.

However, I also value no surprises - the root cause of surprise was that csvkit was doing any datatype inference at all.
My mental model had it just treating everything as string, probably because while I use it a lot (thanks!), I don't do any real data analysis with it, just a bunch of joins, stacks, and greps. In that respect, a one-off warning might be better.

jpmckinney added question framework labels May 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Surprised by type inference: csvjoin removing underscores in columns containg values like: 1_100 #1246

Surprised by type inference: csvjoin removing underscores in columns containg values like: 1_100 #1246

joeweaver commented May 15, 2024

jpmckinney commented May 15, 2024 •

edited

Loading

joeweaver commented May 16, 2024

Surprised by type inference: csvjoin removing underscores in columns containg values like: 1_100 #1246

Surprised by type inference: csvjoin removing underscores in columns containg values like: 1_100 #1246

Comments

joeweaver commented May 15, 2024

jpmckinney commented May 15, 2024 • edited Loading

joeweaver commented May 16, 2024

jpmckinney commented May 15, 2024 •

edited

Loading