Nullable int columns being pulled into float columns: Allow format all columns on read_csv #158

vfrank66 · 2020-04-09T17:22:30Z

When importing data from Athena, my data contains nulls, and which columns contain them changes from file to file. Nullable int columns in the CSVs get converted to float64 in the DataFrame returned by pandas.read_sql_athena, which is unexpected behavior.

I suggest that aws-data-wrangler add an option to Session.pandas.read_csv that reads all columns as string columns, as in df = pd.read_csv('/path/to/file.csv', dtype=str).
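For reference, this is roughly what the requested behavior looks like in plain pandas. It is only a sketch: the inline sample and the variable names are made up for illustration, and it calls pandas directly rather than going through aws-data-wrangler.

```python
import io

import pandas as pd

# Small inline sample (illustrative only) with empty fields in col1 and col2.
CSV_TEXT = "col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"

# dtype=str keeps each field as the text that appears in the file,
# so 19 stays "19" instead of becoming 19.0.
str_df = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)

print(str_df)
```

Empty fields are still read as missing values (NaN), so they need separate handling if an empty string is wanted instead.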

Oddly, pandas does not have an easy way to convert an existing float64 column to str and format it in the same step. The suggested approach is to format float64 values when the DataFrame is loaded (see the suggested workaround and, for printing, the inability to format when converting to a string).

What is happening:

Data
col1,col2,col3
19,3,1
20,,1
,5,4

Becomes:

col1,col2,col3
19.0,3.0,1
20.0,,1
,5.0,4
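The conversion above can be reproduced with plain pandas, without Athena or aws-data-wrangler involved. This is a minimal sketch that reads the sample from an in-memory string; the variable names are illustrative.

```python
import io

import pandas as pd

CSV_TEXT = "col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"

# Default parsing: NumPy int64 cannot hold a missing value, so any column
# with an empty field is read as float64. col3 has no gaps and stays int64.
df = pd.read_csv(io.StringIO(CSV_TEXT))
print(df.dtypes)

# Writing the frame back out shows the ".0" suffix on the affected columns.
round_trip = df.to_csv(index=False)
print(round_trip)
```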

I am using aws-data-wrangler like this:

def read_csv(sql, bucket, file, sep, wr_session):
    """Read a CSV file from S3 using AWS Wrangler."""
    # `sql` is unused here; kept so existing callers do not break.
    return wr_session.pandas.read_csv(
        f's3://{bucket}/{file}',
        sep=sep
    )

This forces me to write the following function and call it after the read_csv above:

def _convert_float_columns_to_str(df):
    """Pandas pulls in columns as floats where there is a NaN,
    but that is not correct for our logic, so we find the float64
    columns, format them as int-like values, and convert everything to str.
    """
    df.columns = df.columns.str.lower()
    float_columns = list(df.select_dtypes(include=['float64']).columns)
    # round() does not help here, not sure why
    # df[float_columns] = df[float_columns].round(0)
    # df = df.round()
    # format the float columns with thousands separators and no decimals
    df[float_columns] = df[float_columns].applymap('{:,.0f}'.format)
    df = df.applymap(str)
    return df
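An alternative to string formatting is pandas' nullable integer dtype, "Int64" (capital I), which can hold missing values alongside integers. The sketch below is not part of aws-data-wrangler. `floats_to_nullable_int` is a hypothetical helper name, and it assumes that float64 columns containing only whole numbers were meant to be integers.

```python
import io

import pandas as pd


def floats_to_nullable_int(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cast whole-number float64 columns to nullable Int64."""
    out = df.copy()
    for name in out.select_dtypes(include=["float64"]).columns:
        non_null = out[name].dropna()
        # Convert only when every non-null value is a whole number,
        # so real fractional data is left as float64.
        if (non_null == non_null.round()).all():
            out[name] = out[name].astype("Int64")
    return out


df = pd.read_csv(io.StringIO("col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"))
fixed = floats_to_nullable_int(df)
print(fixed.dtypes)
```

With this, a comparison such as row['col1'] == 19 works on integer values, and missing entries appear as pd.NA rather than forcing the column to float.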

Why can't I just specify the column(s) I want? Because this is data I receive externally, and I don't know when a column will contain a null and the int column will therefore be implicitly converted to float64. When I then run code against it, such as if row['col'] == 19, the source data keeps changing between the CSV and the application.

igorborgest · 2020-04-11T14:06:14Z

Thanks @vfrank66! Awesome troubleshooting.

We actually considered all these issues in the new 1.0.0 version. Could you give it a try?

Please, let us know any feedback.

igorborgest · 2020-04-11T18:48:33Z

Thank you for the contribution. Please, reopen the issue if it persists even after the version 1.0.0.

vfrank66 changed the title ~~Null Int columns being pulled in as string columns: Allow format for float64 on read_sql_athena~~ Athena BigInt columns being pulled Float64 columns: Allow format all columns on read_sql_athena Apr 9, 2020

vfrank66 changed the title ~~Athena BigInt columns being pulled Float64 columns: Allow format all columns on read_sql_athena~~ Nullable int columns being pulled Float64 columns: Allow format all columns on read_sql_athena Apr 9, 2020

vfrank66 changed the title ~~Nullable int columns being pulled Float64 columns: Allow format all columns on read_sql_athena~~ Nullable int columns being pulled into float columns: Allow format all columns on read_sql_athena Apr 9, 2020

vfrank66 changed the title ~~Nullable int columns being pulled into float columns: Allow format all columns on read_sql_athena~~ Nullable int columns being pulled into float columns: Allow format all columns on read_csv Apr 9, 2020

igorborgest self-assigned this Apr 11, 2020

igorborgest added the bug (Something isn't working), enhancement (New feature or request), and major release (Will be addressed in the next major release) labels Apr 11, 2020

igorborgest closed this as completed Apr 11, 2020
