BigQuery Interpreter for Apache Zeppelin [ZEPPELIN-1153] #1170
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
| # Overview | ||
| BigQuery interpreter for Apache Zeppelin | ||
|
|
||
| # Prerequisites | ||
| You can follow the instructions at [Apache Zeppelin on Dataproc](https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/apache-zeppelin/README.MD) to bring up Zeppelin on Google Cloud Dataproc. | ||
| Alternatively, you can install and bring up Zeppelin on Google Compute Engine. | ||
|
|
||
| # Unit Tests | ||
| The BigQuery unit tests are excluded from the default build because they depend on the external BigQuery service, which does not have a local mock at this point. | ||
|
|
||
| To run these tests manually, follow these steps: | ||
| * [Create a new project](https://support.google.com/cloud/answer/6251787?hl=en) | ||
| * [Create a Google Compute Engine instance](https://cloud.google.com/compute/docs/instances/create-start-instance) | ||
| * Copy the project ID that you created and add it to the property "projectId" in `resources/constants.json` | ||
| * Run the command `mvn <options> -Dbigquery.test.exclude='' test -pl bigquery -am` | ||
|
|
||
|
|
||
| # Interpreter Configuration | ||
|
|
||
| Configure the following properties during Interpreter creation. | ||
|
|
||
| <table class="table-configuration"> | ||
| <tr> | ||
| <th>Name</th> | ||
| <th>Default Value</th> | ||
| <th>Description</th> | ||
| </tr> | ||
| <tr> | ||
| <td>zeppelin.bigquery.project_id</td> | ||
| <td> </td> | ||
| <td>Google Project Id</td> | ||
| </tr> | ||
| <tr> | ||
| <td>zeppelin.bigquery.wait_time</td> | ||
| <td>5000</td> | ||
| <td>Query Timeout in Milliseconds</td> | ||
| </tr> | ||
| <tr> | ||
| <td>zeppelin.bigquery.max_no_of_rows</td> | ||
| <td>100000</td> | ||
| <td>Max result set size</td> | ||
| </tr> | ||
| </table> | ||
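As a rough illustration of how the three properties in the table could be read with their documented defaults, here is a minimal Java sketch. The class `BigQuerySettings` and its field names are hypothetical and are not taken from the interpreter source; rejecting a blank project ID is likewise an assumption of this sketch.

```java
import java.util.Properties;

/** Hypothetical holder for the properties listed in the configuration table. */
final class BigQuerySettings {
    final String projectId;
    final long waitTimeMillis;
    final long maxRows;

    private BigQuerySettings(String projectId, long waitTimeMillis, long maxRows) {
        this.projectId = projectId;
        this.waitTimeMillis = waitTimeMillis;
        this.maxRows = maxRows;
    }

    /** Reads the settings, falling back to the defaults documented in the table. */
    static BigQuerySettings from(Properties props) {
        String projectId = props.getProperty("zeppelin.bigquery.project_id", "").trim();
        if (projectId.isEmpty()) {
            // Assumption of this sketch: a blank project ID is treated as an error.
            throw new IllegalArgumentException("zeppelin.bigquery.project_id must be set");
        }
        long waitTime = Long.parseLong(
            props.getProperty("zeppelin.bigquery.wait_time", "5000").trim());
        long maxRows = Long.parseLong(
            props.getProperty("zeppelin.bigquery.max_no_of_rows", "100000").trim());
        return new BigQuerySettings(projectId, waitTime, maxRows);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("zeppelin.bigquery.project_id", "my-sample-project"); // placeholder ID
        BigQuerySettings settings = from(props);
        // prints "my-sample-project 5000 100000"
        System.out.println(settings.projectId + " " + settings.waitTimeMillis + " " + settings.maxRows);
    }
}
```

Only the property names and default values come from the table; everything else is illustrative.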
|
|
||
| # Connection | ||
| The Interpreter opens a connection with the BigQuery Service using the supplied Google project ID and the compute environment variables. | ||
|
|
||
| # Google BigQuery API Javadoc | ||
| [API Javadocs](https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/) | ||
| [Source](http://central.maven.org/maven2/com/google/apis/google-api-services-bigquery/v2-rev265-1.21.0/google-api-services-bigquery-v2-rev265-1.21.0-sources.jar) | ||
|
|
||
| We built the interpreter with the curated veneer version of the Java APIs rather than the [Idiomatic Java client](https://github.com/GoogleCloudPlatform/gcloud-java/tree/master/gcloud-java-bigquery), mainly for usability reasons. | ||
|
|
||
| # Enabling the BigQuery Interpreter | ||
|
|
||
| In a notebook, to enable the **BigQuery** interpreter, click the **Gear** icon and select **bigquery**. | ||
|
|
||
| # Using the BigQuery Interpreter | ||
|
|
||
| In a paragraph, use `%bigquery.sql` to select the **BigQuery** interpreter and then input SQL statements against your datasets stored in BigQuery. | ||
| You can use the [BigQuery SQL Reference](https://cloud.google.com/bigquery/query-reference) to build your own SQL. | ||
|
|
||
| For example, the following SQL finds the 10 airports with the most departure delays in the flights public dataset: | ||
|
|
||
| ```bash | ||
| %bigquery.sql | ||
| SELECT departure_airport, sum(case when departure_delay>0 then 1 else 0 end) as no_of_delays | ||
| FROM [bigquery-samples:airline_ontime_data.flights] | ||
| group by departure_airport | ||
| order by 2 desc | ||
| limit 10 | ||
| ``` | ||
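A note on counting conditional rows: in SQL, `COUNT(expr)` counts every row where `expr` is not NULL, so a `CASE` expression with `ELSE 0` is counted for all rows, while summing the 1/0 flag (or dropping the `ELSE` branch) counts only the delayed flights. The Java sketch below mimics both aggregations on made-up delay values; the class name and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrates COUNT(CASE ... ELSE 0 END) versus SUM(CASE ... ELSE 0 END). */
final class DelayCounting {

    /** Mimics COUNT over a 1/0 flag: every non-null flag is counted, including the zeros. */
    static long countOfFlags(List<Integer> delays) {
        return delays.stream()
            .map(d -> d > 0 ? 1 : 0)   // CASE WHEN delay > 0 THEN 1 ELSE 0 END
            .filter(flag -> flag != null)
            .count();
    }

    /** Mimics SUM over the same flag: only rows with a positive delay contribute. */
    static long sumOfFlags(List<Integer> delays) {
        return delays.stream()
            .mapToInt(d -> d > 0 ? 1 : 0)
            .sum();
    }

    public static void main(String[] args) {
        List<Integer> delays = Arrays.asList(15, 0, -3, 42, 0); // made-up delays in minutes
        System.out.println(countOfFlags(delays)); // prints 5 (all rows)
        System.out.println(sumOfFlags(delays));   // prints 2 (delayed flights only)
    }
}
```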
|
|
||
| As another example, the following SQL finds the most commonly used Java packages in the GitHub data hosted in BigQuery: | ||
|
|
||
| ```bash | ||
| %bigquery.sql | ||
| SELECT | ||
| package, | ||
| COUNT(*) count | ||
| FROM ( | ||
| SELECT | ||
| REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, | ||
| id | ||
| FROM ( | ||
| SELECT | ||
| SPLIT(content, '\n') line, | ||
| id | ||
| FROM | ||
| [bigquery-public-data:github_repos.sample_contents] | ||
| WHERE | ||
| content CONTAINS 'import' | ||
| AND sample_path LIKE '%.java' | ||
| HAVING | ||
| LEFT(line, 6)='import' ) | ||
| GROUP BY | ||
| package, | ||
| id ) | ||
| GROUP BY | ||
| 1 | ||
| ORDER BY | ||
| count DESC | ||
| LIMIT | ||
| 40 | ||
| ``` | ||
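To make the logic of this query easier to follow, the sketch below reproduces it locally in Java under a few assumptions: each string stands in for one file's `content`, the same regular expression is applied to lines that start with `import`, each package is counted once per file (the inner `GROUP BY package, id`), and the totals are taken across files (the outer `COUNT(*)`). The class and method names are hypothetical, and Java's regex engine stands in for BigQuery's.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Local approximation of the package-counting query; not BigQuery code. */
final class ImportPackageCount {
    // Same pattern as in the query: a space, the package, then a dot.
    private static final Pattern PACKAGE = Pattern.compile(" ([a-z0-9\\._]*)\\.");

    /** Returns the package of an import line, or null if the line is not an import. */
    static String extractPackage(String line) {
        if (!line.startsWith("import")) {   // LEFT(line, 6) = 'import'
            return null;
        }
        Matcher m = PACKAGE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Counts, for each package, the number of files that import from it. */
    static Map<String, Integer> countPackages(List<String> fileContents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String content : fileContents) {
            Set<String> packagesInFile = new HashSet<>();   // one entry per (package, file)
            for (String line : content.split("\n")) {       // SPLIT(content, '\n')
                String pkg = extractPackage(line);
                if (pkg != null) {
                    packagesInFile.add(pkg);
                }
            }
            for (String pkg : packagesInFile) {
                counts.merge(pkg, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> files = List.of(
            "import java.util.List;\nimport java.util.Map;\nclass A {}",
            "import java.util.List;\nimport java.io.File;\nclass B {}");
        System.out.println(countPackages(files)); // prints {java.io=1, java.util=2}
    }
}
```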
|
|
||
| # Sample Screenshot | ||
|
|
||
|  | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,177 @@ | ||
| <?xml version="1.0" encoding="UTF-8"?> | ||
| <!-- | ||
| ~ Licensed to the Apache Software Foundation (ASF) under one or more | ||
| ~ contributor license agreements. See the NOTICE file distributed with | ||
| ~ this work for additional information regarding copyright ownership. | ||
| ~ The ASF licenses this file to You under the Apache License, Version 2.0 | ||
|
Member
This is the standard license header that is used in the Apache Zeppelin project. |
||
| ~ (the "License"); you may not use this file except in compliance with | ||
| ~ the License. You may obtain a copy of the License at | ||
| ~ | ||
| ~ http://www.apache.org/licenses/LICENSE-2.0 | ||
| ~ | ||
| ~ Unless required by applicable law or agreed to in writing, software | ||
| ~ distributed under the License is distributed on an "AS IS" BASIS, | ||
| ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| ~ See the License for the specific language governing permissions and | ||
| ~ limitations under the License. | ||
| --> | ||
|
|
||
| <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | ||
| xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> | ||
| <modelVersion>4.0.0</modelVersion> | ||
|
|
||
| <parent> | ||
| <artifactId>zeppelin</artifactId> | ||
| <groupId>org.apache.zeppelin</groupId> | ||
| <version>0.7.0-SNAPSHOT</version> | ||
| </parent> | ||
|
|
||
| <groupId>org.apache.zeppelin</groupId> | ||
| <artifactId>zeppelin-bigquery</artifactId> | ||
| <packaging>jar</packaging> | ||
| <version>0.7.0-SNAPSHOT</version> | ||
| <name>Zeppelin: BigQuery interpreter</name> | ||
| <url>http://www.apache.org</url> | ||
|
|
||
| <dependencies> | ||
|
|
||
| <dependency> | ||
| <groupId>com.google.apis</groupId> | ||
| <artifactId>google-api-services-bigquery</artifactId> | ||
| <version>v2-rev265-1.21.0</version> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>com.google.oauth-client</groupId> | ||
| <artifactId>google-oauth-client</artifactId> | ||
| <version>${project.oauth.version}</version> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>com.google.http-client</groupId> | ||
| <artifactId>google-http-client-jackson2</artifactId> | ||
| <version>${project.http.version}</version> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>com.google.oauth-client</groupId> | ||
| <artifactId>google-oauth-client-jetty</artifactId> | ||
| <version>${project.oauth.version}</version> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>com.google.code.gson</groupId> | ||
| <artifactId>gson</artifactId> | ||
| <version>2.6</version> | ||
| </dependency> | ||
|
|
||
| <dependency> | ||
| <groupId>org.apache.zeppelin</groupId> | ||
| <artifactId>zeppelin-interpreter</artifactId> | ||
| <version>${project.version}</version> | ||
| <scope>provided</scope> | ||
| </dependency> | ||
|
|
||
| <dependency> | ||
| <groupId>org.slf4j</groupId> | ||
| <artifactId>slf4j-api</artifactId> | ||
| </dependency> | ||
|
|
||
| <dependency> | ||
| <groupId>org.slf4j</groupId> | ||
| <artifactId>slf4j-log4j12</artifactId> | ||
| </dependency> | ||
|
|
||
| <dependency> | ||
| <groupId>junit</groupId> | ||
| <artifactId>junit</artifactId> | ||
| <scope>test</scope> | ||
| </dependency> | ||
| </dependencies> | ||
|
|
||
| <properties> | ||
| <project.http.version>1.21.0</project.http.version> | ||
| <project.oauth.version>1.21.0</project.oauth.version> | ||
| <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> | ||
| <bigquery.test.exclude>**/BigQueryInterpreterTest.java</bigquery.test.exclude> | ||
|
Member
Excluding tests will fix the CI, but for future maintainers, assuming they will not know much about it at first, I think we need to have a few things documented in
I tried on my local env and got which does not leave many clues about what went wrong. What do you think, does it make sense or did I miss something here?
Member
A way to run it is documented under README.md now! |
||
| </properties> | ||
|
|
||
| <build> | ||
| <plugins> | ||
| <plugin> | ||
| <artifactId>maven-enforcer-plugin</artifactId> | ||
| <version>1.3.1</version> | ||
| <executions> | ||
| <execution> | ||
| <id>enforce</id> | ||
| <phase>none</phase> | ||
| </execution> | ||
| </executions> | ||
| </plugin> | ||
|
|
||
| <plugin> | ||
| <groupId>org.apache.maven.plugins</groupId> | ||
| <artifactId>maven-surefire-plugin</artifactId> | ||
| <configuration> | ||
| <excludes> | ||
| <exclude>${bigquery.test.exclude}</exclude> | ||
| </excludes> | ||
| </configuration> | ||
| </plugin> | ||
|
|
||
| <plugin> | ||
| <artifactId>maven-dependency-plugin</artifactId> | ||
| <version>2.8</version> | ||
| <executions> | ||
| <execution> | ||
| <id>copy-dependencies</id> | ||
| <phase>package</phase> | ||
| <goals> | ||
| <goal>copy-dependencies</goal> | ||
| </goals> | ||
| <configuration> | ||
| <outputDirectory>${project.build.directory}/../../interpreter/bqsql</outputDirectory> | ||
| <overWriteReleases>false</overWriteReleases> | ||
| <overWriteSnapshots>false</overWriteSnapshots> | ||
| <overWriteIfNewer>true</overWriteIfNewer> | ||
| <includeScope>runtime</includeScope> | ||
| </configuration> | ||
| </execution> | ||
| <execution> | ||
| <id>copy-artifact</id> | ||
| <phase>package</phase> | ||
| <goals> | ||
| <goal>copy</goal> | ||
| </goals> | ||
| <configuration> | ||
| <outputDirectory>${project.build.directory}/../../interpreter/bqsql</outputDirectory> | ||
| <overWriteReleases>false</overWriteReleases> | ||
| <overWriteSnapshots>false</overWriteSnapshots> | ||
| <overWriteIfNewer>true</overWriteIfNewer> | ||
| <includeScope>runtime</includeScope> | ||
| <artifactItems> | ||
| <artifactItem> | ||
| <groupId>${project.groupId}</groupId> | ||
| <artifactId>${project.artifactId}</artifactId> | ||
| <version>${project.version}</version> | ||
| <type>${project.packaging}</type> | ||
| </artifactItem> | ||
| </artifactItems> | ||
| </configuration> | ||
| </execution> | ||
| </executions> | ||
| </plugin> | ||
| <plugin> | ||
| <artifactId>maven-assembly-plugin</artifactId> | ||
| <configuration> | ||
| <archive> | ||
| <manifest> | ||
| <mainClass> | ||
| org.apache.zeppelin.bigquery.BigQueryInterpreter | ||
| </mainClass> | ||
| </manifest> | ||
| </archive> | ||
| <descriptorRefs> | ||
| <descriptorRef>jar-with-dependencies</descriptorRef> | ||
| </descriptorRefs> | ||
| </configuration> | ||
| </plugin> | ||
| </plugins> | ||
| </build> | ||
| </project> | ||
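The `bigquery.test.exclude` property above feeds the Surefire `<exclude>` pattern, so overriding it with an empty value on the command line is what re-enables the BigQuery tests. The Java sketch below approximates that behaviour with the JDK's glob matcher. It is only an analogy: Surefire uses its own Ant-style matching, the treatment of an empty pattern as "exclude nothing" is an assumption, and the class name and test paths are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

/** Approximates an exclude pattern with a JDK glob; not Surefire's actual matcher. */
final class TestExcludeSketch {

    /** Returns true if the relative test path matches the exclude pattern. */
    static boolean isExcluded(String excludePattern, String relativeTestPath) {
        if (excludePattern == null || excludePattern.isEmpty()) {
            return false; // assumption: an empty pattern excludes nothing
        }
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + excludePattern);
        return matcher.matches(Paths.get(relativeTestPath));
    }

    public static void main(String[] args) {
        // Illustrative path; the real location of the test class may differ.
        String test = "org/apache/zeppelin/bigquery/BigQueryInterpreterTest.java";
        System.out.println(isExcluded("**/BigQueryInterpreterTest.java", test)); // default value
        System.out.println(isExcluded("", test)); // overridden with -Dbigquery.test.exclude=''
    }
}
```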
This is JavaDoc for the artefact
right?
AFAIK it's an open-source library, so would you be so kind as to add a link here to its source code, please? This could help future maintainers keep up with changes, etc.
Got it. These packages are licensed under Apache 2.0. I have asked around to see if the code is publicly available.
Any updates on this one?