earthquakes.csv has different schema than sample expects #24

Open
ddkaiser opened this issue Mar 5, 2015 · 2 comments

ddkaiser commented Mar 5, 2015

The “create table earthquakes” instructions given at: https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/point-in-polygon-aggregation-hive no longer align with the schema of the data located at: https://github.com/Esri/gis-tools-for-hadoop/tree/master/samples/data/earthquake-data

(I’m guessing that the earthquake-data is occasionally pulled from a USGS or similar source, and they changed their column definitions?)

I had to insert an additional column “unknown” of type double in front of the Magnitude column.

For example, the instructions provide the following schema:

(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, magnitude DOUBLE)

and a random sample line from the file (the unknown column is 80.0 and the magnitude is 6.5):

1930/12/06 07:03:28.00,53.0,-172.0,80.0,6.5,ML,0,,,,AK,

The schema that I used:

(earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, unknown DOUBLE, magnitude DOUBLE)

With the corrected schema, this query returns plausible magnitudes:

hive> select min(magnitude), max(magnitude) from earthquakes;
OK
5.0 9.1

If magnitude still points to the wrong column, you will see:

hive> select min(magnitude), max(magnitude) from earthquakes;
OK
-5.0    700.0
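A quick way to check both columns at once is to query the ranges side by side. This is only a sketch: it assumes the five-column schema given above, including the placeholder column name `unknown` (backtick-quoted in case a Hive version treats the word as a keyword). If the mapping is right, the fourth column should show roughly the -5.0 to 700.0 range from the mismatched output, and magnitude should stay within a plausible earthquake scale.

```sql
-- Sanity check for the column mapping.
-- Assumes the five-column schema from this comment, with the fourth
-- column named `unknown`; the aliases are illustrative only.
SELECT min(`unknown`)  AS min_col4,
       max(`unknown`)  AS max_col4,
       min(magnitude)  AS min_magnitude,
       max(magnitude)  AS max_magnitude
FROM earthquakes;
```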

randallwhitman commented

The version of earthquakes.csv with a header row contains the following header:

datetime,latitude,longitude,depth,magnitude,magtype,nbstations,gap,distance,rms,source,eventid

randallwhitman commented

The DDL (in the README and in run-sample.sql) matches a column-subset variant of the data that we also had. The mismatch can be resolved either by updating the DDL in both files or by uploading the column-subset version of the earthquake data.
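
For the first option, a DDL that follows the twelve-column header quoted earlier in this thread might look like the sketch below. The first five column types come from the schemas discussed above, with the fourth column named depth as in the header. The STRING types for the remaining columns, the ROW FORMAT clause, and the commented-out header-skip property are assumptions, not taken from the README or run-sample.sql.

```sql
-- Sketch of a DDL matching the full 12-column earthquakes.csv header.
-- Columns after magnitude are declared STRING as a conservative
-- assumption, since the sample row leaves several of them empty.
CREATE TABLE earthquakes (
  earthquake_date STRING,
  latitude        DOUBLE,
  longitude       DOUBLE,
  depth           DOUBLE,
  magnitude       DOUBLE,
  magtype         STRING,
  nbstations      STRING,
  gap             STRING,
  distance        STRING,
  rms             STRING,
  source          STRING,
  eventid         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- If the loaded file includes the header row, Hive 0.13 and later can
-- skip it by adding this clause before the semicolon above:
--   TBLPROPERTIES ("skip.header.line.count"="1")
```

Queries that only use the date, latitude, longitude, and magnitude columns should not need to change with this layout, because those column names are unchanged.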

randallwhitman added a commit that referenced this issue Mar 10, 2015
data as shared with the custom MapReduce sample (#24)
randallwhitman added this to the v2.1 milestone Mar 10, 2016