Merge pull request #13 from nextflow-io/abhinav/google-bigquery
Add Google BigQuery support
pditommaso authored Nov 17, 2022
2 parents 3a0ac32 + 9f6ad93 commit df3976d
Showing 18 changed files with 539 additions and 28 deletions.
1 change: 1 addition & 0 deletions .github/workflows/build.yml
@@ -42,6 +42,7 @@ jobs:
run: ./gradlew check
env:
GRADLE_OPTS: '-Dorg.gradle.daemon=false'
NXF_SMOKE: 1

- name: Publish
if: failure()
10 changes: 10 additions & 0 deletions README.md
@@ -12,6 +12,7 @@ The current version provides out-of-the-box support for the following databases:
* [SQLite](https://www.sqlite.org/index.html)
* [DuckDB](https://duckdb.org/)
* [AWS Athena](https://aws.amazon.com/athena/) (Setup guide [here](/docs/aws-athena.md))
* [Google BigQuery](https://cloud.google.com/bigquery) (Setup guide [here](/docs/google-bigquery.md))

NOTE: THIS IS A PREVIEW TECHNOLOGY, FEATURES AND CONFIGURATION SETTINGS CAN CHANGE IN FUTURE RELEASES.

@@ -30,6 +31,15 @@ plugins {
The above declaration allows the use of the SQL plugin functionalities in your Nextflow pipelines.
See the section below to configure the connection properties with a database instance.

For a BigQuery data source, use the `nf-bigquery` plugin instead:

```
plugins {
    id 'nf-bigquery@0.0.1'
}
```


## Configuration

The target database connection coordinates are specified in the `nextflow.config` file using the
6 changes: 3 additions & 3 deletions docs/aws-athena.md
@@ -22,7 +22,7 @@ params {
plugins {
- id 'nf-sqldb@0.5.0'
+ id 'nf-sqldb@0.6.0'
}
@@ -40,7 +40,7 @@ sql {

### Pipeline

- Once the configuration has been setup correctly, you can use it in the Nextlow code as shown below
+ Once the configuration has been setup correctly, you can use it in the Nextflow code as shown below

```nextflow
include { fromQuery } from 'plugin/nf-sqldb'
@@ -59,7 +59,7 @@ Channel.fromQuery(sqlQuery, db: 'athena')

### Output

- When you execute the above code, you'll see the AWS Athena query results on the console
+ When you execute the above code, you'll see the query results on the console

```console
[SRR6797500, WGS, SAN RAFFAELE, public, SRX3756197, 131677, Illumina HiSeq 2500, PAIRED, RANDOM, GENOMIC, ILLUMINA, SRS3011891, SAMN08629009, Mycobacterium tuberculosis, SRP128089, 2018-03-02, PRJNA428596, 165, null, 201, 383, null, 131677_WGS, Pathogen.cl, null, uncalculated, uncalculated, null, null, null, bam, sra, s3, s3.us-east-1, {k=assemblyname, v=GCF_000195955.2}, {k=bases, v=383901808}, {k=bytes, v=173931377}, {k=biosample_sam, v=MTB131677}, {k=collected_by_sam, v=missing}, {k=collection_date_sam, v=2010/2014}, {k=host_disease_sam, v=Tuberculosis}, {k=host_sam, v=Homo sapiens}, {k=isolate_sam, v=Clinical isolate18}, {k=isolation_source_sam_ss_dpl262, v=Not applicable}, {k=lat_lon_sam, v=Not collected}, {k=primary_search, v=131677}, {k=primary_search, v=131677_210916_BGD_210916_100.gatk.bam}, {k=primary_search, v=131677_WGS}, {k=primary_search, v=428596}, {k=primary_search, v=8629009}, {k=primary_search, v=PRJNA428596}, {k=primary_search, v=SAMN08629009}, {k=primary_search, v=SRP128089}, {k=primary_search, v=SRR6797500}, {k=primary_search, v=SRS3011891}, {k=primary_search, v=SRX3756197}, {k=primary_search, v=bp0}, {"assemblyname": "GCF_000195955.2", "bases": 383901808, "bytes": 173931377, "biosample_sam": "MTB131677", "collected_by_sam": ["missing"], "collection_date_sam": ["2010/2014"], "host_disease_sam": ["Tuberculosis"], "host_sam": ["Homo sapiens"], "isolate_sam": ["Clinical isolate18"], "isolation_source_sam_ss_dpl262": ["Not applicable"], "lat_lon_sam": ["Not collected"], "primary_search": "131677"}]
65 changes: 65 additions & 0 deletions docs/google-bigquery.md
@@ -0,0 +1,65 @@
# Google BigQuery integration setup

## Pre-requisites

1. A Google Cloud project with BigQuery APIs enabled
2. A service account with sufficient permissions

## Usage

The example below assumes that the [NCBI SRA Metadata](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery/) is the data source. Refer to the official [NCBI docs](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery/) to set up the `nih-sra-datastore` within your BigQuery console.

*NOTE*: For Google BigQuery, you do not need to specify the `user` and `password` fields, as the credentials are provided by your service account credentials JSON file.

### Configuration

```nextflow config
//NOTE: Replace the values in the config file as per your setup
params {
    google_bigquery_db = "nih-sra-datastore.sra.metadata"
    google_project_id = "<YOUR_GOOGLE_PROJECT_ID>"
    google_service_account_email = "<YOUR_GOOGLE_SERVICE_ACCOUNT_EMAIL>"
    google_service_account_key = "<YOUR_GOOGLE_SERVICE_ACCOUNT_KEY_LOCATION>"
}
plugins {
    id 'nf-bigquery@0.0.1'
}
sql {
    db {
        googlebigquery {
            url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=${params.google_project_id};OAuthType=0;OAuthServiceAcctEmail=${params.google_service_account_email};OAuthPvtKeyPath=${params.google_service_account_key};"
        }
    }
}
```
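
The `url` value above is an endpoint followed by semicolon-separated `key=value` properties. As a rough illustration of how the pieces fit together, the following Groovy sketch assembles the same string from its parts. The helper name `buildBigQueryUrl` and the sample argument values are hypothetical and are not part of `nf-bigquery`; the property names are the ones used in the configuration above.

```groovy
// Hypothetical helper (not part of nf-bigquery): assembles the JDBC URL
// shown in the configuration above from its individual properties.
String buildBigQueryUrl(String projectId, String serviceAccountEmail, String keyPath) {
    def endpoint = 'jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443'
    // A Groovy map literal preserves insertion order, so the properties
    // come out in the same order as in the documented URL.
    def props = [
        ProjectId            : projectId,
        OAuthType            : '0',   // value used in the config above (service account auth)
        OAuthServiceAcctEmail: serviceAccountEmail,
        OAuthPvtKeyPath      : keyPath
    ]
    return endpoint + ';' + props.collect { k, v -> "${k}=${v}" }.join(';') + ';'
}

// Sample values are placeholders only
println buildBigQueryUrl('my-project', 'sa@my-project.iam.gserviceaccount.com', '/path/to/key.json')
```

In a real `nextflow.config` the string interpolation shown in the configuration block does the same job; the sketch only makes the structure of the URL explicit.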

### Pipeline

Once the configuration has been set up correctly, you can use it in the Nextflow code as shown below:

```nextflow
include { fromQuery } from 'plugin/nf-bigquery'

def googleSqlQuery = """
    SELECT *
    FROM `nih-sra-datastore.sra.metadata`
    WHERE organism = 'Mycobacterium tuberculosis'
    AND bioproject = 'PRJNA670836'
    LIMIT 2;
    """

Channel.fromQuery(googleSqlQuery, db: 'googlebigquery')
    .view()
```

### Output

When you execute the above code, you'll see the query results on the console

```console
[SRR6797500, WGS, SAN RAFFAELE, public, SRX3756197, 131677, Illumina HiSeq 2500, PAIRED, RANDOM, GENOMIC, ILLUMINA, SRS3011891, SAMN08629009, Mycobacterium tuberculosis, SRP128089, 2018-03-02, PRJNA428596, 165, null, 201, 383, null, 131677_WGS, Pathogen.cl, null, uncalculated, uncalculated, null, null, null, bam, sra, s3, s3.us-east-1, {k=assemblyname, v=GCF_000195955.2}, {k=bases, v=383901808}, {k=bytes, v=173931377}, {k=biosample_sam, v=MTB131677}, {k=collected_by_sam, v=missing}, {k=collection_date_sam, v=2010/2014}, {k=host_disease_sam, v=Tuberculosis}, {k=host_sam, v=Homo sapiens}, {k=isolate_sam, v=Clinical isolate18}, {k=isolation_source_sam_ss_dpl262, v=Not applicable}, {k=lat_lon_sam, v=Not collected}, {k=primary_search, v=131677}, {k=primary_search, v=131677_210916_BGD_210916_100.gatk.bam}, {k=primary_search, v=131677_WGS}, {k=primary_search, v=428596}, {k=primary_search, v=8629009}, {k=primary_search, v=PRJNA428596}, {k=primary_search, v=SAMN08629009}, {k=primary_search, v=SRP128089}, {k=primary_search, v=SRR6797500}, {k=primary_search, v=SRS3011891}, {k=primary_search, v=SRX3756197}, {k=primary_search, v=bp0}, {"assemblyname": "GCF_000195955.2", "bases": 383901808, "bytes": 173931377, "biosample_sam": "MTB131677", "collected_by_sam": ["missing"], "collection_date_sam": ["2010/2014"], "host_disease_sam": ["Tuberculosis"], "host_sam": ["Homo sapiens"], "isolate_sam": ["Clinical isolate18"], "isolation_source_sam_ss_dpl262": ["Not applicable"], "lat_lon_sam": ["Not collected"], "primary_search": "131677"}]
```
140 changes: 140 additions & 0 deletions plugins/nf-bigquery/build.gradle
@@ -0,0 +1,140 @@
/*
* Copyright 2020-2022, Seqera Labs
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

plugins {
    id 'java-library'
    id 'groovy'
    id 'idea'
    id 'de.undercouch.download' version '4.1.2'
}

group = 'io.nextflow'
// DO NOT SET THE VERSION HERE
// THE VERSION FOR PLUGINS IS DEFINED IN THE `/resources/META-INF/MANIFEST.MF` file
java {
    toolchain {
        languageVersion = JavaLanguageVersion.of(11)
    }
}

idea {
    module.inheritOutputDirs = true
}

repositories {
    mavenCentral()
    maven { url = 'https://s3-eu-west-1.amazonaws.com/maven.seqera.io/releases' }
    maven { url = 'https://s3-eu-west-1.amazonaws.com/maven.seqera.io/snapshots' }
}

configurations {
    // see https://docs.gradle.org/4.1/userguide/dependency_management.html#sub:exclude_transitive_dependencies
    runtimeClasspath.exclude group: 'org.slf4j', module: 'slf4j-api'
}

sourceSets {
    main.java.srcDirs = []
    main.groovy.srcDirs = ['src/main']
    main.resources.srcDirs = ['src/resources']
    test.groovy.srcDirs = ['src/test']
    test.java.srcDirs = []
    test.resources.srcDirs = []
}

ext {
    nextflowVersion = '22.08.1-edge'
}

dependencies {
    compileOnly "io.nextflow:nextflow:$nextflowVersion"
    compileOnly 'org.slf4j:slf4j-api:1.7.10'
    compileOnly 'org.pf4j:pf4j:3.4.1'

    api("org.codehaus.groovy:groovy-sql:3.0.10") { transitive = false }

    api project(":plugins:nf-sqldb")

    // JDBC driver setup for Google BigQuery - the third-party JARs are downloaded and set up by the Gradle tasks below.
    // Reference https://cloud.google.com/bigquery/docs/reference/odbc-jdbc-drivers
    api files('src/dist/lib/GoogleBigQueryJDBC42.jar')
    // NOTE: Had to remove the slf4j jar due to a conflict
    implementation fileTree(dir: 'src/dist/lib/libs', include: '*.jar')


    testImplementation "io.nextflow:nextflow:$nextflowVersion"
    testImplementation "org.codehaus.groovy:groovy:3.0.10"
    testImplementation "org.codehaus.groovy:groovy-nio:3.0.10"
    testImplementation("org.codehaus.groovy:groovy-test:3.0.10") { exclude group: 'org.codehaus.groovy' }
    testImplementation("cglib:cglib-nodep:3.3.0")
    testImplementation("org.objenesis:objenesis:3.2")
    testImplementation("org.spockframework:spock-core:2.1-groovy-3.0") {
        exclude group: 'org.codehaus.groovy';
        exclude group: 'net.bytebuddy'
    }
    testImplementation('org.spockframework:spock-junit4:2.1-groovy-3.0') {
        exclude group: 'org.codehaus.groovy';
        exclude group: 'net.bytebuddy'
    }
    testImplementation('com.google.jimfs:jimfs:1.1')

    testImplementation(testFixtures("io.nextflow:nextflow:$nextflowVersion"))
    testImplementation(testFixtures("io.nextflow:nf-commons:$nextflowVersion"))
}

test {
    useJUnitPlatform()
}

/**
* Google BigQuery
* The following tasks download and confirm the MD5 checksum of the ZIP archive
* for Simba BigQuery JDBC driver and extract its contents to the build directory
* Reference: https://cloud.google.com/bigquery/docs/reference/odbc-jdbc-drivers
*/
task downloadBigqueryDep(type: Download) {
    src 'https://storage.googleapis.com/simba-bq-release/jdbc/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip'
    dest new File(buildDir, 'downloads/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip')
    overwrite false
}

task verifyBigqueryDep(type: Verify, dependsOn: downloadBigqueryDep) {
    src new File(buildDir, 'downloads/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip')
    algorithm 'MD5'
    checksum '2e54169cfba2050f0a0f01bcf12c8aa7'
}

task unzipBigqueryDep(dependsOn: verifyBigqueryDep, type: Copy) {
    from zipTree(new File(buildDir, 'downloads/SimbaJDBCDriverforGoogleBigQuery42_1.3.0.1001.zip'))
    into "${buildDir}/downloads/unzip/googlebigquery"
}
unzipBigqueryDep.doLast {
    file("${buildDir}/downloads/unzip/googlebigquery/libs/slf4j-api-1.7.36.jar").delete()
}

// Files under src/dist are included into the distribution zip
// https://docs.gradle.org/current/userguide/application_plugin.html
task copyBigqueryDep(dependsOn: unzipBigqueryDep, type: Copy) {
    from file(new File(buildDir, '/downloads/unzip/googlebigquery/GoogleBigQueryJDBC42.jar'))
    into "src/dist/lib"
}

task copyBigqueryLibs(dependsOn: copyBigqueryDep, type: Copy) {
    from file(new File(buildDir, '/downloads/unzip/googlebigquery/libs'))
    into "src/dist/lib/libs"
}

project.copyPluginLibs.dependsOn('copyBigqueryLibs')
project.compileGroovy.dependsOn('copyBigqueryLibs')
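
Note on the `verifyBigqueryDep` task above: it compares the MD5 digest of the downloaded ZIP against a pinned value. A minimal plain-Groovy sketch of that kind of check is shown below, for readers who want to reproduce it outside Gradle. The helper names `md5Hex` and `matchesChecksum` are hypothetical, and this is not how the `de.undercouch.download` plugin is implemented internally.

```groovy
import java.security.MessageDigest

// Hypothetical helper: streams the file through an MD5 digest and
// returns the result as a lowercase hex string.
String md5Hex(File file) {
    def digest = MessageDigest.getInstance('MD5')
    file.eachByte(8192) { byte[] buffer, int length ->
        digest.update(buffer, 0, length)
    }
    return digest.digest().encodeHex().toString()
}

// Hypothetical helper: true when the file's MD5 matches the expected hex value.
boolean matchesChecksum(File file, String expectedHex) {
    return md5Hex(file).equalsIgnoreCase(expectedHex)
}

// Usage with a throwaway file; a real check would point at the downloaded ZIP
def sample = File.createTempFile('checksum-demo', '.bin')
sample.bytes = 'hello'.getBytes('UTF-8')
println matchesChecksum(sample, '5d41402abc4b2a76b9719d911017c592')
sample.delete()
```

An MD5 check of this kind mainly guards against truncated or corrupted downloads; a stronger digest such as SHA-256 would be needed to make any claim about tampering.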
@@ -0,0 +1,17 @@
package nextflow.sql

import nextflow.sql.config.DriverRegistry


/**
* @author : jorge <jorge.aguilera@seqera.io>
*
*/
class BigQueryDriverRegistry extends DriverRegistry {

    BigQueryDriverRegistry() {
        super()
        addDriver('bigquery', 'com.simba.googlebigquery.jdbc.Driver')
    }

}
34 changes: 34 additions & 0 deletions plugins/nf-bigquery/src/main/nextflow/sql/BigQuerySqlPlugin.groovy
@@ -0,0 +1,34 @@
/*
* Copyright 2020-2022, Seqera Labs
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/

package nextflow.sql

import nextflow.sql.config.DriverRegistry
import org.pf4j.PluginWrapper

/**
* Implements BigQuerySQL plugin for Nextflow
*
* @author : jorge <jorge.aguilera@seqera.io>
*/
class BigQuerySqlPlugin extends SqlPlugin {

    BigQuerySqlPlugin(PluginWrapper wrapper) {
        super(wrapper)
        DriverRegistry.DEFAULT = new BigQueryDriverRegistry()
    }
}
6 changes: 6 additions & 0 deletions plugins/nf-bigquery/src/resources/META-INF/MANIFEST.MF
@@ -0,0 +1,6 @@
Manifest-Version: 1.0
Plugin-Class: nextflow.sql.BigQuerySqlPlugin
Plugin-Id: nf-bigquery
Plugin-Provider: Seqera Labs
Plugin-Version: 0.0.1
Plugin-Requires: >=22.08.1-edge
17 changes: 17 additions & 0 deletions plugins/nf-bigquery/src/resources/META-INF/extensions.idx
@@ -0,0 +1,17 @@
#
# Copyright 2020-2022, Seqera Labs
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

nextflow.sql.ChannelSqlExtension
@@ -0,0 +1,49 @@
/*
* Copyright 2020-2022, Seqera Labs
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/

package nextflow.sql.bigquery

import nextflow.sql.BigQueryDriverRegistry
import nextflow.sql.config.DriverRegistry
import nextflow.sql.config.SqlDataSource
import spock.lang.Specification
/**
*
* @author Paolo Di Tommaso <paolo.ditommaso@gmail.com>
*/
class BigQuerySqlDataSourceTest extends Specification {

    def 'should map url to driver' () {
        given:
        DriverRegistry.DEFAULT = new BigQueryDriverRegistry()
        def helper = new SqlDataSource([:])

        expect:
        helper.urlToDriver(JDBC_URL) == DRIVER

        where:
        JDBC_URL                     | DRIVER
        'jdbc:postgresql:database'   | 'org.postgresql.Driver'
        'jdbc:sqlite:database'       | 'org.sqlite.JDBC'
        'jdbc:h2:mem:'               | 'org.h2.Driver'
        'jdbc:mysql:some-host'       | 'com.mysql.cj.jdbc.Driver'
        'jdbc:mariadb:other-host'    | 'org.mariadb.jdbc.Driver'
        'jdbc:duckdb:'               | 'org.duckdb.DuckDBDriver'
        'jdbc:awsathena:'            | 'com.simba.athena.jdbc.Driver'
        'jdbc:bigquery:'             | 'com.simba.googlebigquery.jdbc.Driver'
    }

}
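
Note on the test above: it checks that a JDBC URL is resolved to a driver class by the name that follows `jdbc:`. The following self-contained Groovy sketch shows one way such a lookup could work. `SketchDriverRegistry` is a hypothetical stand-in written for illustration; the real `DriverRegistry` and `SqlDataSource.urlToDriver` in `nf-sqldb` may be implemented differently.

```groovy
// Illustrative stand-in only; not the nf-sqldb DriverRegistry.
class SketchDriverRegistry {
    private final Map<String, String> drivers = [:]

    void addDriver(String name, String driverClass) {
        drivers[name] = driverClass
    }

    // Returns the driver class for a URL such as 'jdbc:bigquery://...',
    // or null when the URL is not a JDBC URL or the name is not registered.
    String urlToDriver(String url) {
        if (!url?.startsWith('jdbc:')) return null
        def parts = url.split(':')
        return parts.length > 1 ? drivers[parts[1]] : null
    }
}

def registry = new SketchDriverRegistry()
registry.addDriver('sqlite', 'org.sqlite.JDBC')
registry.addDriver('bigquery', 'com.simba.googlebigquery.jdbc.Driver')
println registry.urlToDriver('jdbc:bigquery:')
```

Keeping the mapping in a registry object is what lets a plugin such as `nf-bigquery` add its own driver name in a subclass constructor without touching the lookup code.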