
Spark SQL backend (to support Elasticsearch, Cassandra, etc) #241

Closed
sscarduzio opened this issue Apr 2, 2016 · 50 comments
Labels
validation:required A committer should validate the issue

Comments

@sscarduzio

I can't resist saying Caravel looks much neater than Kibana, plus the user management doesn't cost money and it's not an afterthought.
It would be amazing to see Caravel replacing my Kibana dashboard, using the data I've got currently in Elasticsearch.

You use an SQL interface to query the data store, is there any chance Caravel can speak to Elasticsearch through Spark SQL?
Spark has a mature Elasticsearch connector, so it should be OK.

And wait... if you support Spark SQL, you'll immediately be able to support HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source!

Is this a path worth exploring for this project? I think it's quite exciting.

@gbrian
Contributor

gbrian commented Apr 2, 2016

+1
I'm looking for an Apache Drill connector as well.

@ariepratama

+1
on this feature too

@mistercrunch
Member

Totally worth doing. There are two paths for it: either creating a SQLAlchemy dialect (might not be possible if Spark SQL is funky), or creating a new datasource and implementing the query interface. For now we have two datasources: sqlalchemy or druid. It's totally doable to add a third one; it just needs to implement something like:
https://github.com/airbnb/caravel/blob/master/caravel/models.py#L460

Basically you need to receive these parameters and return a pandas dataframe.
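A rough sketch of what such a third datasource could look like, with hypothetical class and parameter names (the real parameter list lives in the models.py link above; the result rows here are faked for illustration):

```python
import pandas as pd

class SparkSqlDatasource:
    """Hypothetical third datasource: receives query parameters and
    returns a pandas DataFrame, mirroring the interface linked above."""

    def __init__(self, connection):
        # e.g. a PyHive/Thrift connection to a Spark SQL endpoint
        self.connection = connection

    def query(self, groupby, metrics, filters=None, row_limit=100):
        # A real implementation would translate these parameters into a
        # Spark SQL statement and execute it; here we fake a result set.
        rows = [{"country": "US", "cnt": 42}, {"country": "FR", "cnt": 7}]
        df = pd.DataFrame(rows)
        return df.head(row_limit)

ds = SparkSqlDatasource(connection=None)
df = ds.query(groupby=["country"], metrics=["cnt"])
print(list(df.columns))  # ['country', 'cnt']
```

The contract is deliberately small: whatever the backend, the datasource just has to turn the chart's query parameters into a DataFrame.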

@mistercrunch
Member

We use Spark at Airbnb and have some SparkSql in places, we might have use cases for it internally, but I'm not sure where it fits in the priority list.

@mistercrunch mistercrunch added the enhancement:request Enhancement request submitted by anyone from the community label Apr 2, 2016
@sscarduzio
Author

Cool thanks for the pointers! This new connector would surely unlock a wealth of valuable contributions from other businesses which happen to not use Druid or a plain RDBMS.

Sounds like a good investment to me :)

@joshwalters
Contributor

I am really interested in adding Hive support, I may take a crack at it sometime in the next few weeks. Dropbox has a Python/Hive project that I was looking at: https://github.com/dropbox/PyHive

@gbrian
Contributor

gbrian commented Apr 6, 2016

Does it mean Impala as well? Thanks

@guang

guang commented Apr 7, 2016

+1

@csalperwyck

+1 for Hive

@joshwalters
Contributor

@gbrian Yes, the package I am looking at would add support for Hive and Impala. I opened an issue to track this: #339

@OElesin

OElesin commented Apr 23, 2016

Great work guys, but can I load data from Elasticsearch?

@rahulgagrani

+1 to addition of Elasticsearch support.

@philippfrenzel

+1

1 similar comment
@povilasb

+1

@nabilblk

nabilblk commented May 6, 2016

+1 for Hive

@bwboy

bwboy commented May 11, 2016

+1 for Hive and Elasticsearch

@JohnOmernik

I am working on an Apache Drill SQLAlchemy dialect. I have some basic things working, and have been working with others on the Drill mailing list. There has been talk of plugging Drill into Elasticsearch, which seems a bit convoluted; however, since Elasticsearch doesn't have a SQL interface, Drill works really nicely here. If we get a dialect working for Drill, then other storage plugins will (hopefully) just work. Some of the work can be found here:

Docker container with pyodbc, unixodbc, Drill ODBC, and caravel all working:

https://github.com/JohnOmernik/caraveldrill

Drill dialect (work in progress; feel free to play with it and try it, and please report issues as you find them; this is iterative brute-force programming at this point!):
https://github.com/JohnOmernik/sqlalchemy-drill

@sathieu

sathieu commented Jun 1, 2016

I've taken a different approach and started a native backend.

WIP is at https://github.com/sathieu/caravel/tree/elasticsearch (beware: I'll squash commits and force push).

Not much is working yet, and I don't have dedicated time on it. We'll see what comes.

@tninja
Contributor

tninja commented Jun 1, 2016

+1 to sparksql

@bolkedebruin
Contributor

For what it's worth: Spark 2 will be SQL compliant, so a SQLAlchemy dialect is feasible.

@benvogan

benvogan commented Jul 6, 2016

+1 for spark SQL. That will get you connected to most data sources these days.

@giaosudau

+1 for Spark SQL, Hive.

@shkr
Contributor

shkr commented Jul 20, 2016

You can connect it to Spark SQL. If it uses a Hive back-end, then refer to this documentation page for instructions on how to connect to Spark SQL via a jdbc+hive connector: https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#01%20Databricks%20Overview/14%20Third%20Party%20Integrations/05%20Beeline.html. The one I prefer is dropbox/pyhive to connect to Spark SQL in my Python projects. For Scala or Java, the jdbc+hive route is preferable.
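For reference, those pieces map onto the hive:// SQLAlchemy URI that gets pasted into Superset/Caravel. A minimal sketch (host, port, and database are placeholders; 10000 is the default Thrift server port, so adjust to your deployment):

```python
def spark_sql_uri(host, port=10000, database="default"):
    """Build the SQLAlchemy URI for a Spark Thrift Server reachable
    over the Hive protocol (PyHive dialect)."""
    return f"hive://{host}:{port}/{database}"

# With pyhive installed, SQLAlchemy can then create an engine from it:
#   from sqlalchemy import create_engine
#   engine = create_engine(spark_sql_uri("localhost"))
print(spark_sql_uri("localhost"))  # hive://localhost:10000/default
```

The same string is what goes into the "SQLAlchemy URI" field when adding the database in the UI.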

@sbookworm

+1 for spark sql

@mistercrunch
Member

Sweet! Can others confirm that SparkSQL works for them through SQLAlchemy?

@mistercrunch mistercrunch added the validation:required A committer should validate the issue label Jul 22, 2016
@mistercrunch
Member

mistercrunch commented Jul 22, 2016

Giving hints about how to use SparkSQL in the docs: #803

@giaosudau

@mistercrunch Right now it does.
But with long queries it stalls the whole process. I think it's related to threading.

@maver1ck
Contributor

I used Spark Thrift Server with PyHive and it almost works (I needed to change one line in the Hive dialect).

@kaiosama

kaiosama commented Feb 6, 2017

@shkr Hi, I am trying to achieve the same thing with PyHive and have not been able to make it work. What is the URI you are using for setting up the Superset data source? I am trying something like jdbc+hive://localhost:10000/, and it gives an error: "Can't load plugin: sqlalchemy.dialects:jdbc.hive". I am sure I must be missing something here. Thanks in advance for any instructions on this.

-- update --
Looks like I had a hiveserver2 problem, I restarted it and then I was able to use this URI:
hive://user@localhost:10000/database
However I can't get what is listed on the wiki to work (jdbc+hive://), the error message is "Can't load plugin: sqlalchemy.dialects:jdbc.hive"

I have another question: what do you mean when you say use SparkSQL as a backend? I am fairly new to this, but AFAIK I can save dataframes in SparkSQL to a Hive table, from which I can then create a Superset table/slice using the above connector. But is there more that I can do to make this process better? My overall goal is to be able to create tables/slices from parquet files on HDFS.

@ChethanChandra

+1 for Elasticsearch support.

@santhavathi

santhavathi commented Feb 17, 2017

@giaosudau, what is the SQLAlchemy URI I should give in Superset to connect to SparkSQL?
I used the below and it is not working; 172.31.12.201 is where the Spark 1.6.2 master runs:
hive://172.31.12.201:7077/test_database

@shkr
Contributor

shkr commented Feb 20, 2017

@santhavathi when you open the Spark UI dashboard, there is an IP printed at the top, which is the hostname of the head of the cluster. You have to use that as your hostname in the hive URL.

example: hive://<spark-cluster-master>/

@santhavathi

@shkr, thanks so much for the reply.
I had to start the hive server (spark thrift server) on my spark cluster.
Also, giving hive:// gives the below error:
ERROR: Connection failed!

The error message returned was:
Could not locate column in row for column 'tab_name'

I used impala:// and it works now.

@cduverne

Hello guys, I see in the documentation that SparkSQL is supported: http://airbnb.io/superset/installation.html#database-dependencies.

What does this concretely mean? Which databases can we query then?

Thanks a lot in advance.

@kaiosama

@shkr according to your latest comment, I tried the following URI: hive://172.17.0.2, where 172.17.0.2 is what I got from spark UI.

It allows me to add it as a database, so far so good. However when I query against a table in this database, the job tracker shows a MapReduce job. I would expect the job to be a Spark job though, is it true in your case?
I was able to connect to local hive using hive://localhost:10000, so far these two work like the same thing to me.

@santhavathi

@kaiosama, when you said you are connecting to hive://172.17.0.2, what is the port you used here, and are you directly connecting to spark master without hiveserver running?

@kaiosama

@santhavathi that is the full URI I used, without port #. I tried using some port #s from the spark UI page but none of them works.

It was with a running Hive server. Maybe I am missing something here, but it seems to me that Spark SQL is supposed to be used against Hive, i.e. you always need a running Hive server? Or can the Spark SQL connector be used against other sources? As @cduverne mentioned, it's not very clear to me. And I have not gotten any replies about how to get "jdbc+hive" to work as described in the documentation.

@oblamine

+1 for Hbase support :)

@mistercrunch
Member

At Airbnb we can do Hbase through Presto with the HBase Presto connector.

@oblamine

oblamine commented Mar 2, 2017

would you please give me a link so i can follow install steps?

@balchandra

Hi, can someone please list the steps to connect Elasticsearch from Superset?
It would be a great help.

@mistercrunch
Member

@shkr
Contributor

shkr commented Mar 7, 2017

@kaiosama The hostname directs SQLAlchemy to the SQL endpoint at the given port. It's hard to say whether a MapReduce job is the normal behavior to expect without knowing details about your setup of Hive, MapReduce, and Spark.

@balchandra

@mistercrunch I tried using the same, connecting Superset with sqlalchemy-elasticquery.
I was able to connect when both Superset and Elasticsearch are installed on the same server.
However, I was not able to view tables/indices once connected.
Can you tell me how exactly it is supposed to be used?
It would help me to a great extent.
Thanks in advance.

@mistercrunch
Member

Looks like sqlalchemy-elasticquery isn't what I thought it was. Depending on how ANSI compliant ElasticSearch's SQL is, it may be possible to create your own sqlalchemy dialect. If not, someone would have to create a new connector for it. Luckily I recently refactored and formalized the connector abstraction.
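For anyone exploring the dialect route: SQLAlchemy supports registering third-party dialects at runtime. A minimal sketch, assuming a hypothetical `sqlalchemy_elastic` package exposing an `ElasticDialect` class:

```python
from sqlalchemy.dialects import registry

# Register the (hypothetical) dialect under the URI scheme "elastic",
# so a URI like elastic://localhost:9200/ would resolve to it.
registry.register("elastic", "sqlalchemy_elastic.dialect", "ElasticDialect")
```

Packaged dialects usually declare a `sqlalchemy.dialects` entry point in their setup.py instead, so no explicit registration call is needed at import time.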

@xycloud

xycloud commented Mar 30, 2017

+1 for elasticsearch

1 similar comment
@zbidi

zbidi commented Apr 3, 2017

+1 for elasticsearch

@hongqp

hongqp commented Apr 18, 2017

+1 for Hive and Elasticsearch

@apache apache locked and limited conversation to collaborators Apr 18, 2017
@kristw kristw added the inactive Inactive for >= 30 days label Mar 20, 2019
@mistercrunch
Member

Good news about ElasticSearch here! #8441

@stale stale bot removed the inactive Inactive for >= 30 days label Oct 25, 2019
@srinify
Contributor

srinify commented Apr 9, 2021

Closing since Superset now works with Elasticsearch!

https://superset.apache.org/docs/databases/elasticsearch

@srinify srinify closed this as completed Apr 9, 2021