Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nullable int columns being pulled into float columns: Allow format all columns on read_csv #158

Closed
vfrank66 opened this issue Apr 9, 2020 · 2 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request major release Will be addressed in the next major release

Comments

@vfrank66
Copy link

vfrank66 commented Apr 9, 2020

When importing data from athena my data has nulls which is a change from file to file. Csv's nullable int columns get converted to float64 when returned as a dataframe from pandas.read_sql_athena, which is unexpected behavior.

I suggest that aws-data-wrangler adds an option to Session.pandas.read_csv that can convert all columns to string columns such as df = pd.read_csv('/path/to/file.csv', dtype=str).

Oddly pandas does not have an easy way to convert an already existing dataframe float64 to str AND format. The suggested way is at load format the dataframe for float64 (suggested workaround and for printing, the inability to format when convert to a string )

What is happening:

Data
col1,col2,col3
19,3,1
20,,1
,5,4

Becomes:

col1,col2,col3
19.0,3.0,1
20.0,,1
,5.0,4

Using aws data wrangler like:

def read_csv(sql, bucket, files, sep, wr_session):
    """ Read csv table using AWS Wrangler"""
    return wr_session.pandas.read_csv(
            f's3://{bucket}/{file}',
            sep=sep
        )

Forcing me to write a following function after the above read_csv

def _convert_float_columns_to_str(df):
	""" Pandas pulls in columns as floats where there is a nan
	but that is not correct for our logic, so we want to find float 
	datatypes and round to int 'like' datatype and convert to str.
	"""
	df.columns = map(str.lower, df.columns)
	float_columns = list(df.select_dtypes(include=['float64']).columns)
	# round does not, not sure why
	# df[float_columns] = df[float_columns].round(0)
	# df = df.round()
	# format all the columns to 
	df[float_columns] = df[float_columns].applymap('{:,.0f}'.format)
	df = df.applymap(str)
	return df

Why can I just specify the column(s) I want? Because this is data I am receiving externally and I don't know when a column will have null and thus the int column will implicitly be converted to float64 and when I run code against it if row['col'] == 19 the data source data keeps changing from csv to application.

@vfrank66 vfrank66 changed the title Null Int columns being pulled in as string columns: Allow format for float64 on read_sql_athena Athena BigInt columns being pulled Float64 columns: Allow format all columns on read_sql_athena Apr 9, 2020
@vfrank66 vfrank66 changed the title Athena BigInt columns being pulled Float64 columns: Allow format all columns on read_sql_athena Nullable int columns being pulled Float64 columns: Allow format all columns on read_sql_athena Apr 9, 2020
@vfrank66 vfrank66 changed the title Nullable int columns being pulled Float64 columns: Allow format all columns on read_sql_athena Nullable int columns being pulled into float columns: Allow format all columns on read_sql_athena Apr 9, 2020
@vfrank66 vfrank66 changed the title Nullable int columns being pulled into float columns: Allow format all columns on read_sql_athena Nullable int columns being pulled into float columns: Allow format all columns on read_csv Apr 9, 2020
@igorborgest igorborgest self-assigned this Apr 11, 2020
@igorborgest igorborgest added bug Something isn't working enhancement New feature or request major release Will be addressed in the next major release labels Apr 11, 2020
@igorborgest
Copy link
Contributor

Thanks @vfrank66! Awesome troubleshooting.

We actually considered all these issues in the new 1.0.0 version. Could you give it a try?

Please, let us know any feedback.

@igorborgest
Copy link
Contributor

Thank you for the contribution. Please, reopen the issue if it persists even after the version 1.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request major release Will be addressed in the next major release
Projects
None yet
Development

No branches or pull requests

2 participants