Datalake Studio

Datalake Studio is an enhanced data exploration and management tool.

Key Features of Datalake Studio:

Quick for big data: Datalake Studio is built on top of DuckDB, a high-performance, embedded SQL OLAP database management system. DuckDB is designed to handle large datasets, making it ideal for data exploration and analysis.

See your data: Plot your data automatically or view it on a map as points, H3 aggregations, etc.

Versatile Data Loading Options: Users can upload data from several sources: directly from the local computer, via a URL, or from an Amazon S3 bucket. It also supports loading data directly from PostgreSQL databases, which makes it useful for database administrators and data analysts.

Several data formats: Datalake Studio is compatible with CSV, TSV, Parquet and Shapefile formats, so you can load data without tedious conversions.

ChatGPT Integration with SQL Assistants: Users with ChatGPT credentials can use the power of SQL assistants. These assistants provide contextual understanding about your tables and fields, making data manipulation and query formulation more intuitive and efficient.

Enhancement through Remote APIs: Users have the ability to enrich their data by integrating information from remote APIs.

API Exposure for Data Sharing: After completing data transformation processes, users can expose their data through APIs. This feature allows for easy sharing and collaboration, making Datalake Studio not just a tool for data exploration, but also a platform for data distribution.

Project build with Docker

docker-compose up --build

Open http://localhost:8080/ in your browser.

## If you don't want to use compose

docker build -t datalakestudioserver .
docker run --name datalakestudioserver -p 8000:8000 datalakestudioserver

docker build -t datalakestudiofront .
docker run --name datalakestudiofront -p 8080:8080 datalakestudiofront

Project build without Docker

Server

Inside the server folder, run:

pip3 install -r requirements.txt
python3 server.py

If you want to use venv:

python3 -m venv venv
source venv/bin/activate

Exit venv:

deactivate

Client

Inside the client folder of the project, run these commands to build the Vue UI project:

npm install
npm run dev -- --port 8080

Open http://localhost:8080/ in your browser.

Configuration files

Server

Inside the server folder, create a file named config.yml. Example:

port: 8000
database: "data/datalakeStudio.db"

Create another file named secrets.yml with these properties:

# Optional, for DuckDB access to S3. If not defined, the user's AWS credentials are loaded through the AWS Default Credentials Provider Chain
s3_access_key_id: "YOUR_S3_ACCESS_KEY_ID"
s3_secret_access_key: "YOUR_S3_SECRET_ACCESS_KEY"

# For OpenAI
openai_organization: "YOUR_OPENAI_ORGANIZATION"
openai_api_key: "YOUR_OPENAI_API_KEY"

# For API search
api_domain: "YOUR_API_DOMAIN"
api_context: "YOUR_API_CONTEXT"

# Database connections
pgpass_file: "YOUR_PG_PASS_FILE"

# Mapbox
mapbox_access_token: "YOUR_MAPBOX_ACCESS_TOKEN"

docker-compose also picks up the credentials in .aws for AWS access.
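Before starting the server it can help to check that no placeholder was left in secrets.yml. The sketch below is only an illustration: `read_flat_settings` and `unfilled_keys` are hypothetical helpers, not part of Datalake Studio, and the parser handles only flat `key: "value"` lines like the ones shown above (a real YAML library is the better choice for anything more complex).

```python
def read_flat_settings(text: str) -> dict[str, str]:
    """Parse flat `key: "value"` lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {raw!r}")
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def unfilled_keys(settings: dict[str, str]) -> list[str]:
    """Return the keys whose value is still a YOUR_... placeholder."""
    return sorted(k for k, v in settings.items() if v.startswith("YOUR_"))


example = '''
# For OpenAI
openai_organization: "YOUR_OPENAI_ORGANIZATION"
openai_api_key: "sk-example"

# Mapbox
mapbox_access_token: "YOUR_MAPBOX_ACCESS_TOKEN"
'''
missing = unfilled_keys(read_flat_settings(example))
print(missing)  # ['mapbox_access_token', 'openai_organization']
```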

If you want to use a remote database, copy your pgpass file to the server folder. A pgpass file has one line per connection, in the following format:

hostname:port:database:username:password
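To make the format concrete, here is a minimal parser sketch. `PgPassEntry` and `parse_pgpass` are names chosen for this example and are not taken from the Datalake Studio code. It follows the PostgreSQL password file rules: lines starting with `#` are comments, and a literal `:` or `\` inside a field is escaped with a backslash.

```python
from typing import NamedTuple


class PgPassEntry(NamedTuple):
    hostname: str
    port: str
    database: str
    username: str
    password: str


def split_pgpass_line(line: str) -> list[str]:
    """Split on unescaped colons; a backslash escapes the next character."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            current.append(next(chars, ""))  # keep the escaped character as-is
        elif ch == ":":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_pgpass(text: str) -> list[PgPassEntry]:
    """Parse the contents of a pgpass file into entries."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = split_pgpass_line(line)
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {raw!r}")
        entries.append(PgPassEntry(*fields))
    return entries


sample = "# analytics replica\ndb.example.com:5432:sales:alice:s3cr\\:et\n"
entries = parse_pgpass(sample)
print(entries[0].hostname, entries[0].password)  # db.example.com s3cr:et
```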

Usage

Load data

You can load data from the local filesystem, from any URL or from S3. Try loading this example: https://raw.githubusercontent.com/javitorres/GenericCross/main/public/data/iris.csv
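Since Datalake Studio is built on DuckDB, loading a file plausibly comes down to a `CREATE TABLE ... AS SELECT` over one of DuckDB's reader functions. The sketch below only builds such a statement from the file extension. `build_load_sql` is a hypothetical helper, and the assumption that the server works this way is not confirmed by this README. `read_csv_auto`, `read_parquet` and `ST_Read` (from the spatial extension) are DuckDB table functions.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# DuckDB table function per file extension.
READERS = {
    ".csv": "read_csv_auto",
    ".tsv": "read_csv_auto",    # the CSV sniffer detects the tab delimiter
    ".parquet": "read_parquet",
    ".shp": "ST_Read",          # requires DuckDB's spatial extension
}


def build_load_sql(table_name: str, source: str) -> str:
    """Return a CREATE TABLE statement that loads `source` into `table_name`."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    # urlparse copes with local paths, http(s):// URLs and s3:// URLs alike.
    suffix = PurePosixPath(urlparse(source).path).suffix.lower()
    reader = READERS.get(suffix)
    if reader is None:
        raise ValueError(f"unsupported format: {source!r}")
    quoted = source.replace("'", "''")  # escape quotes for the SQL string literal
    return f"CREATE TABLE {table_name} AS SELECT * FROM {reader}('{quoted}')"


sql = build_load_sql(
    "iris",
    "https://raw.githubusercontent.com/javitorres/GenericCross/main/public/data/iris.csv",
)
print(sql)
```

The resulting string could then be passed to a DuckDB connection's `execute` method.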

Table explorer

Inspect loaded data and export it to CSV or Parquet.

Get a data profile, or use crossfilter to play with your data.

If your data has spatial information, you can view it on a map.

Query panel

Query your data and generate new tables. Save or load your queries. Use ChatGPT to create new queries.

Load data from APIs

Enrich your datasets by calling external APIs. The enriched result is stored as a new table.
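Conceptually, enrichment means calling an API once per distinct key value and merging the response fields into each row. The sketch below illustrates the idea with plain dictionaries. `enrich` and `fake_country_api` are hypothetical names, and the stub stands in for a real HTTP call, so nothing here reflects the actual Datalake Studio implementation.

```python
from typing import Callable


def enrich(rows: list[dict], key: str, fetch: Callable[[object], dict]) -> list[dict]:
    """Return new rows with the fields returned by `fetch(row[key])` merged in."""
    cache: dict = {}
    enriched = []
    for row in rows:
        value = row[key]
        if value not in cache:
            cache[value] = fetch(value)  # one call per distinct key value
        enriched.append({**row, **cache[value]})
    return enriched


# Stand-in for a remote API; a real version would issue an HTTP request.
calls = []

def fake_country_api(code):
    calls.append(code)
    names = {"ES": "Spain", "FR": "France"}
    return {"country_name": names.get(code, "unknown")}


customers = [
    {"id": 1, "country": "ES"},
    {"id": 2, "country": "FR"},
    {"id": 3, "country": "ES"},
]
new_table = enrich(customers, "country", fake_country_api)
print(new_table[2])  # {'id': 3, 'country': 'ES', 'country_name': 'Spain'}
print(calls)         # ['ES', 'FR']
```

Caching by key value keeps the number of remote calls proportional to the number of distinct values instead of the number of rows.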

Load data from remote databases

Explore your external databases and load data into Datalake Studio for local analysis

Expose your data via API

Publish endpoints that serve your data through parametrized queries.

Keep control of the endpoints you have published.
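A parametrized query keeps the SQL text fixed and binds the caller's values separately, which is what makes it safe to expose behind an endpoint. The sketch below shows the pattern with Python's built-in `sqlite3` module as a stand-in for DuckDB, so that it runs without extra dependencies. `run_endpoint` is a hypothetical function, not the project's API.

```python
import sqlite3


def run_endpoint(con: sqlite3.Connection, sql: str, params: list) -> list[dict]:
    """Run a fixed query with bound parameters and return JSON-ready rows."""
    cur = con.execute(sql, params)  # values are bound, never pasted into the SQL
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE iris (species TEXT, sepal_length REAL)")
con.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [("setosa", 5.1), ("setosa", 4.9), ("virginica", 6.3)],
)

# The endpoint definition: the SQL is fixed, only `species` comes from the caller.
ENDPOINT_SQL = (
    "SELECT species, sepal_length FROM iris "
    "WHERE species = ? ORDER BY sepal_length"
)
rows = run_endpoint(con, ENDPOINT_SQL, ["setosa"])
print(rows)
# [{'species': 'setosa', 'sepal_length': 4.9}, {'species': 'setosa', 'sepal_length': 5.1}]
```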

Explore your S3 buckets

Browse your S3 buckets and write descriptions.

Preview files or load them into Datalake Studio.

Talk to ChatGPT

Chat with ChatGPT to explore your data (experimental).
