Data Weaver is an ETL (Extract, Transform, Load) tool built on Apache Spark. It lets you define data pipelines in YAML configuration files and execute them on Spark for data transformation and integration.
Before using Data Weaver, make sure you have the following prerequisites installed:
- Apache Spark 3.5.0: download and install Spark.
- Java 11 or later: Data Weaver requires Java to run.
Clone the Data Weaver repository to your local machine:
git clone https://github.com/yourusername/data-weaver.git
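If you plan to build Data Weaver from source, a Spark/Scala project of this kind typically builds with sbt; the exact build tool and task here are assumptions, not confirmed by the project:

cd data-weaver
sbt package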
Data pipelines are defined using YAML configuration files. You can create your pipeline configurations and place them in a directory of your choice. Each configuration should define data sources, transformations, and sinks.
Here's an example of a simple pipeline configuration:
name: ExamplePipeline
tag: example
dataSources:
  - id: testSource
    type: MySQL
    query: >
      SELECT name, column1
      FROM test_table
    config:
      readMode: ReadOnce # ReadOnce, Incremental...
      connection: testConnection # Connection name, defined in application.conf
transformations:
  - id: transform1
    type: SQLTransformation
    sources:
      - testSource # Id of a data source defined in this file
    query: >
      SELECT name AS id
      FROM testSource
      WHERE column1 = 'value'
  - id: transform2
    type: ScalaTransformation
    sources:
      - transform1 # Id of a data source or earlier transformation in this file
    action: dropDuplicates
sinks:
  - id: sink1
    type: BigQuery
    config:
      saveMode: Append # Append, Overwrite, Merge...
      profile: testProfile # Profile name, defined in application.conf
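To make the execution model concrete, here is a rough Scala sketch of the Spark operations the example above corresponds to, assuming Spark's standard JDBC source and the spark-bigquery-connector. The connection and profile details are placeholders (resolved from application.conf by the real tool), and the sketch is illustrative rather than Data Weaver's actual implementation:

import org.apache.spark.sql.SparkSession

object ExamplePipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExamplePipeline").getOrCreate()

    // dataSource testSource: read once from MySQL over JDBC (readMode: ReadOnce).
    val testSource = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db") // placeholder for the testConnection settings
      .option("query", "SELECT name, column1 FROM test_table")
      .load()
    testSource.createOrReplaceTempView("testSource")

    // transform1: SQLTransformation over the registered view.
    val transform1 = spark.sql(
      "SELECT name AS id FROM testSource WHERE column1 = 'value'")

    // transform2: ScalaTransformation applying the dropDuplicates action.
    val transform2 = transform1.dropDuplicates()

    // sink1: append to BigQuery via the spark-bigquery-connector
    // (connector options such as the GCS staging bucket are omitted).
    transform2.write
      .format("bigquery")
      .option("table", "dataset.table") // placeholder for the testProfile settings
      .mode("append")
      .save()
  }
}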
To run data pipelines, use the Data Weaver command-line interface (CLI). The --tag option selects which pipelines to execute, matching the tag field in each pipeline's configuration. For example:
weaver run --pipelines /path/to/pipelines/folder --tag 1d
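For instance, to run the ExamplePipeline defined above, pass the tag from its configuration:

weaver run --pipelines /path/to/pipelines/folder --tag example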
You can configure Data Weaver by editing the flow.conf file in the config directory. This file holds Data Weaver's general settings, including the Spark configuration; the connections and profiles referenced by pipelines are defined in application.conf, as noted in the example above.
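For reference, here is a minimal sketch of what the testConnection and testProfile entries referenced by the example pipeline might look like in application.conf, assuming HOCON syntax; every key and value below is an assumption, not the actual schema:

connections {
  testConnection {
    type     = "MySQL"                               # hypothetical key
    url      = "jdbc:mysql://localhost:3306/test_db"
    user     = "weaver"
    password = ${?MYSQL_PASSWORD}                    # taken from the environment, if set
  }
}

profiles {
  testProfile {
    type    = "BigQuery"    # hypothetical key
    project = "my-gcp-project"
    dataset = "analytics"
  }
}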