
write to hbase from pyspark shell #34

Open
jyothirmai2309 opened this issue Jun 21, 2018 · 3 comments

Comments

@jyothirmai2309

While writing from PySpark to HBase, the data is stored in a hex/binary format. What should I do in the PySpark write command to store the data as integers?

@jyothirmai2309
Author

Is there any option to keep integer values as integers while writing a DataFrame to HBase through PySpark? By default, integer values are converted to byte arrays in the HBase table.

Below is the code:

```python
catalog2 = {
    "table": {"namespace": "default", "name": "trip_test1"},
    "rowkey": "key1",
    "columns": {
        "serial_no":    {"cf": "rowkey", "col": "key1",          "type": "string"},
        "payment_type": {"cf": "sales",  "col": "payment_type",  "type": "string"},
        "fare_amount":  {"cf": "sales",  "col": "fare_amount",   "type": "string"},
        "surcharge":    {"cf": "sales",  "col": "surcharge",     "type": "string"},
        "mta_tax":      {"cf": "sales",  "col": "mta_tax",       "type": "string"},
        "tip_amount":   {"cf": "sales",  "col": "tip_amount",    "type": "string"},
        "tolls_amount": {"cf": "sales",  "col": "tolls_amount",  "type": "string"},
        "total_amount": {"cf": "sales",  "col": "total_amount",  "type": "string"}
    }
}

import json
cat2 = json.dumps(catalog2)

df.write.option("catalog", cat2) \
    .option("newtable", "5") \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()
```

Output:

```
\x00\x00\x03\xE7 column=sales:payment_type, timestamp=1529495930994, value=CSH
\x00\x00\x03\xE7 column=sales:surcharge, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7 column=sales:tip_amount, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7 column=sales:tolls_amount, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7 column=sales:total_amount, timestamp=1529495930994, value=@!\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE8 column=sales:fare_amount, timestamp=1529495930994, value=@\x18\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE8 column=sales:mta_tax, timestamp=1529495930994, value=?\xE0\x00\x00\x00\x00\x00\x00
```

Expected output:

```
999 column=sales:fare_amount, timestamp=1529392479358, value=8.0
999 column=sales:mta_tax, timestamp=1529392479358, value=0.5
999 column=sales:payment_type, timestamp=1529392479358, value=CSH
999 column=sales:surcharge, timestamp=1529392479358, value=0.0
999 column=sales:tip_amount, timestamp=1529392479358, value=0.0
999 column=sales:tolls_amount, timestamp=1529392479358, value=0.0
999 column=sales:total_amount, timestamp=1529392479358, value=8.5
```
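[Editor's note] The hex values in the scan output are consistent with HBase's standard big-endian encodings (4-byte ints, 8-byte IEEE 754 doubles, as produced by `Bytes.toBytes`). A quick check with Python's `struct` module, independent of Spark and HBase, confirms the mapping between the values above and the bytes shown:

```python
# Verify the hex values from the scan output: HBase encodes ints as 4-byte
# big-endian and doubles as 8-byte big-endian IEEE 754, which is exactly what
# struct produces with the ">" (big-endian) prefix.
import struct

# Row key 999 -> \x00\x00\x03\xE7 (4-byte big-endian int)
assert struct.pack(">i", 999) == b"\x00\x00\x03\xe7"

# total_amount 8.5 -> @!\x00... ('@' is 0x40, '!' is 0x21)
assert struct.pack(">d", 8.5) == b"@!\x00\x00\x00\x00\x00\x00"

# mta_tax 0.5 -> ?\xE0\x00... ('?' is 0x3F)
assert struct.pack(">d", 0.5) == b"?\xe0\x00\x00\x00\x00\x00\x00"

# Going the other way, decode a cell value straight from the scan output:
value = struct.unpack(">d", b"@!\x00\x00\x00\x00\x00\x00")[0]
print(value)  # 8.5
```

So nothing is corrupted; the shell is simply printing raw binary cell values as escaped hex.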

@bobomeng

If you are using our package here, integer values are converted to binary byte arrays while writing to HBase.

@nathamsr11

(Quotes jyothirmai2309's question, code, and output above verbatim.)

Thanks, I had the same problem.


3 participants