[FEATURE]Flint/PPL Tutorial based End2End sample #1010

Open
YANG-DB opened this issue Jan 8, 2025 · 0 comments
Labels: enhancement (New feature or request), infrastructure (Changes to infrastructure, testing, CI/CD, pipelines, etc.), testing (test related feature)

Comments

YANG-DB commented Jan 8, 2025

Is your feature request related to a problem?
To help the community and users learn Flint, PPL, and their functionality, we would like to introduce a framework for setting up a simple tutorial-based experience that lets users explore and experiment with Flint, the Flint API, PPL queries, and more.

Containerized Testing Framework

Spark

This guide will get you up and running with OpenSearch Flint using Apache Spark / EMR, including sample code to highlight some powerful features.

We will use docker-compose to generate an End2End running sample containing:

  • Spark / EMR with Flint's deployed job
  • OpenSearch server container
  • OpenSearch Dashboards container
  • S3-compatible object store (MinIO) container

The Spark container is configured with both the Flint and PPL extensions, enabling it to both execute PPL queries and query indices on the OpenSearch server.

  spark:
    image: bitnami/spark:${SPARK_VERSION:-3.5.3}
    container_name: spark
    ports:
      - "${MASTER_UI_PORT:-8080}:8080"
      - "${MASTER_PORT:-7077}:7077"
      - "${UI_PORT:-4040}:4040"
      - "${SPARK_CONNECT_PORT}:15002"
    entrypoint: /opt/bitnami/scripts/spark/master-entrypoint.sh
    user: root
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_PUBLIC_DNS=localhost
      - AWS_ENDPOINT_URL_S3=http://minio-S3
      - OPENSEARCH_ADMIN_PASSWORD=${OPENSEARCH_ADMIN_PASSWORD}
    volumes:
      - type: bind
        source: ./spark-master-entrypoint.sh
        target: /opt/bitnami/scripts/spark/master-entrypoint.sh
      - type: bind
        source: ./spark-defaults.conf
        target: /opt/bitnami/spark/conf/spark-defaults.conf
      - type: bind
        source: ./log4j2.properties
        target: /opt/bitnami/spark/conf/log4j2.properties
      - type: bind
        source: $PPL_JAR
        target: /opt/bitnami/spark/jars/ppl-spark-integration.jar
      - type: bind
        source: $FLINT_JAR
        target: /opt/bitnami/spark/jars/flint-spark-integration.jar
      - type: bind
        source: ./s3.credentials
        target: /opt/bitnami/spark/s3.credentials
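
The compose snippet above reads several variables that have no ":-default" fallback: SPARK_CONNECT_PORT, OPENSEARCH_ADMIN_PASSWORD, PPL_JAR and FLINT_JAR. Below is a minimal Python sketch of a pre-flight check, assuming all four should be set before running docker-compose. The helper name missing_compose_vars is hypothetical and not part of Flint.

```python
import os

# Variables the compose snippet references without a ":-default" fallback.
REQUIRED_VARS = (
    "SPARK_CONNECT_PORT",
    "OPENSEARCH_ADMIN_PASSWORD",
    "PPL_JAR",
    "FLINT_JAR",
)


def missing_compose_vars(env=None):
    """Return the required variables that are unset or empty in `env`.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_compose_vars() with no arguments inspects the current shell environment; an empty list means every variable the snippet depends on is present.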

The OpenSearch Dashboards container is configured to connect to the OpenSearch server container.

The Spark container is started up as a driver and runs the Spark application.

Spark uses MinIO as an S3-compatible object store, which lets Flint query long-term storage locally. The relevant spark-defaults.conf settings are:

spark.datasource.flint.auth           basic
spark.datasource.flint.auth.username  admin
spark.datasource.flint.auth.password  C0rrecthorsebatterystaple.
spark.sql.warehouse.dir               s3a://integ-test/
spark.hadoop.fs.s3a.impl              org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.access.key        Vt7jnvi5BICr1rkfsheT
spark.hadoop.fs.s3a.secret.key        5NK3StGvoGCLUWvbaGN0LBUf9N6sjE94PEzLdqwO
spark.hadoop.fs.s3a.endpoint          minio-S3:9000
spark.hadoop.fs.s3a.connection.ssl.enabled false
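
As a rough illustration of how these settings could be inspected from a notebook, the sketch below parses text in the whitespace-separated form shown above into a dictionary. It assumes one property per line with the key and value separated by whitespace; the function name parse_spark_defaults and the SAMPLE text are illustrative only.

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text into a {property: value} dict.

    Blank lines and lines starting with '#' are skipped. The key is
    everything up to the first run of whitespace; the rest is the value.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


# A few of the properties listed above, used here as sample input.
SAMPLE = """\
# object store settings
spark.sql.warehouse.dir               s3a://integ-test/
spark.hadoop.fs.s3a.endpoint          minio-S3:9000
spark.hadoop.fs.s3a.path.style.access true
"""

conf = parse_spark_defaults(SAMPLE)
```

In a PySpark session the resulting pairs could be applied one by one with SparkSession.builder.config(key, value), which is handy for checking that the notebook and the Spark container agree on the MinIO endpoint.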

Jupyter Notebook based tutorial

The following Dockerfile adds support for Jupyter notebooks and the tutorial folder library:

FROM python:3.10-bullseye

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
      sudo \
      curl \
      vim \
      unzip \
      openjdk-11-jdk \
      build-essential \
      software-properties-common \
      ssh && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip3 install -r requirements.txt

RUN python3 -m spylon_kernel install

RUN curl https://github.com/SpencerPark/IJava/releases/download/v1.3.0/ijava-1.3.0.zip -Lo ijava-1.3.0.zip \
  && unzip ijava-1.3.0.zip \
  && python3 install.py --sys-prefix \
  && rm ijava-1.3.0.zip

# Optional env variables
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH

WORKDIR ${SPARK_HOME}

ENV SPARK_VERSION=3.5.2
ENV SPARK_MAJOR_VERSION=3.5
ENV ICEBERG_VERSION=1.6.0

# Download spark
RUN mkdir -p ${SPARK_HOME} \
 && curl https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz -o spark-${SPARK_VERSION}-bin-hadoop3.tgz \
 && tar xvzf spark-${SPARK_VERSION}-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
 && rm -rf spark-${SPARK_VERSION}-bin-hadoop3.tgz

# Add spark runtime jar to IJava classpath
ENV IJAVA_CLASSPATH=/opt/spark/jars/*

# Download the sample datasets into the directory created here
RUN mkdir -p /home/demo/data \
 && curl https://data.cityofnewyork.us/resource/tg4x-b46p.json > /home/demo/data/nyc_film_permits.json \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-04.parquet -o /home/demo/data/yellow_tripdata_2022-04.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet -o /home/demo/data/yellow_tripdata_2022-03.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet -o /home/demo/data/yellow_tripdata_2022-02.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -o /home/demo/data/yellow_tripdata_2022-01.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-12.parquet -o /home/demo/data/yellow_tripdata_2021-12.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-11.parquet -o /home/demo/data/yellow_tripdata_2021-11.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-10.parquet -o /home/demo/data/yellow_tripdata_2021-10.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-09.parquet -o /home/demo/data/yellow_tripdata_2021-09.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-08.parquet -o /home/demo/data/yellow_tripdata_2021-08.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-07.parquet -o /home/demo/data/yellow_tripdata_2021-07.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-06.parquet -o /home/demo/data/yellow_tripdata_2021-06.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-05.parquet -o /home/demo/data/yellow_tripdata_2021-05.parquet \
 && curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-04.parquet -o /home/demo/data/yellow_tripdata_2021-04.parquet

RUN mkdir -p /home/demo/localwarehouse /home/demo/notebooks /home/demo/warehouse /home/demo/spark-events /home/demo
COPY notebooks/ /home/demo/notebooks

# Add a notebook command
RUN echo '#! /bin/sh' >> /bin/notebook \
 && echo 'export PYSPARK_DRIVER_PYTHON=jupyter-notebook' >> /bin/notebook \
 && echo "export PYSPARK_DRIVER_PYTHON_OPTS=\"--notebook-dir=/home/demo/notebooks --ip='*' --NotebookApp.token='' --NotebookApp.password='' --port=8888 --no-browser --allow-root\"" >> /bin/notebook \
 && echo "pyspark" >> /bin/notebook \
 && chmod u+x /bin/notebook

# Add a pyspark-notebook command (alias for notebook command for backwards-compatibility)
RUN echo '#! /bin/sh' >> /bin/pyspark-notebook \
 && echo 'export PYSPARK_DRIVER_PYTHON=jupyter-notebook' >> /bin/pyspark-notebook \
 && echo "export PYSPARK_DRIVER_PYTHON_OPTS=\"--notebook-dir=/home/demo/notebooks --ip='*' --NotebookApp.token='' --NotebookApp.password='' --port=8888 --no-browser --allow-root\"" >> /bin/pyspark-notebook \
 && echo "pyspark" >> /bin/pyspark-notebook \
 && chmod u+x /bin/pyspark-notebook

COPY spark-defaults.conf /opt/spark/conf
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"

RUN chmod u+x /opt/spark/sbin/* && \
    chmod u+x /opt/spark/bin/*

COPY .pyiceberg.yaml /root/.pyiceberg.yaml

COPY entrypoint.sh .

ENTRYPOINT ["./entrypoint.sh"]
CMD ["notebook"]

The /home/demo/data mapped volume would contain the Python Jupyter notebook tutorials for getting started with Flint / PPL on Spark:

  • An Introduction to the Flint API.ipynb
  • PPL Getting Started.ipynb
  • PPL Data Projections.ipynb
  • SQL Data Accelerations.ipynb

NYC Taxi Dataset

The NYC Taxi Dataset provides a rich source of real-world data for experimentation with Flint, PPL, and Spark. It contains yellow taxi trip records with pickup and drop-off times, locations, trip distances, fare amounts, and other relevant metadata.
The dataset is used to demonstrate Flint's capabilities in querying, data indexing, and analytics for both SQL and PPL.

Data Setup

The NYC Taxi Dataset is included in the Docker setup as .parquet files located in the /home/demo/data directory of the container. Each file corresponds to a specific month and year, enabling experimentation with partitioned data and time-series queries.

The .parquet files are preloaded for the following months:

  • 2021: April to December
  • 2022: January to April

These files can be accessed from Spark or directly via MinIO (S3-compatible object storage).
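
The monthly naming scheme makes the file list easy to generate rather than spell out. The sketch below enumerates the file names and download URLs for the preloaded range (April 2021 to April 2022), using the base URL from the Dockerfile; the function names are illustrative.

```python
# Base URL taken from the curl commands in the Dockerfile.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_files(start=(2021, 4), end=(2022, 4)):
    """List yellow taxi parquet file names for each month in [start, end]."""
    year, month = start
    names = []
    while (year, month) <= end:
        names.append(f"yellow_tripdata_{year}-{month:02d}.parquet")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return names


def trip_data_urls(start=(2021, 4), end=(2022, 4)):
    """Build the download URL for each monthly file in the range."""
    return [f"{BASE_URL}/{name}" for name in trip_data_files(start, end)]
```

With the default range, trip_data_files() yields 13 names, matching the 13 parquet downloads in the Dockerfile.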

Tutorials Featuring NYC Taxi Dataset

The dataset is used as the basis for hands-on tutorials available in the /home/demo/notebooks folder:

  • An Introduction to the Flint API.ipynb: Learn how to query and manipulate data.
  • PPL Getting Started.ipynb: Explore Flint's PPL capabilities with real-world data.
  • PPL Data Projections.ipynb: Project and filter key metrics from the dataset.
  • SQL Data Accelerations.ipynb: Accelerate data processing with OpenSearch indices using Flint optimizations.
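
To give a flavor of what a projection-and-filter exercise might look like, here is a small sketch that assembles a PPL pipeline as a string. The table name nyc_taxi and the helper ppl_query are hypothetical; the column names come from the public yellow taxi schema. It assumes the notebooks would hand the resulting string to spark.sql(...) on a session with the PPL extension loaded, which is not confirmed by this issue.

```python
def ppl_query(table, where=None, fields=None, head=None):
    """Assemble a PPL pipeline: source, then optional where, fields and head."""
    stages = [f"source = {table}"]
    if where:
        stages.append(f"where {where}")
    if fields:
        stages.append("fields " + ", ".join(fields))
    if head is not None:
        stages.append(f"head {head}")
    return " | ".join(stages)


# Hypothetical table name; columns follow the yellow taxi trip schema.
query = ppl_query(
    "nyc_taxi",
    where="trip_distance > 5",
    fields=["tpep_pickup_datetime", "trip_distance", "fare_amount"],
    head=10,
)
# query == "source = nyc_taxi | where trip_distance > 5 | fields tpep_pickup_datetime, trip_distance, fare_amount | head 10"
```

Building the query from stages keeps each tutorial step (filter, project, limit) visible on its own, which mirrors how PPL pipelines are read left to right.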


General purpose testing facilities

To enhance flexibility and support a wide range of use cases, the Docker setup includes a general-purpose data folder located at /home/demo/data.
This folder is designed to house datasets and accompanying resources tailored for specific tutorials and learning scenarios. Each dataset resides in its own subfolder, containing:

  • Dataset Files: The raw or preprocessed data required for the tutorial, such as .parquet, .csv, or .json files.
  • Loading Script: A Jupyter Notebook (load_dataset.ipynb) that demonstrates how to load and prepare the dataset using Spark or other tools.
  • Tutorial-Specific Notebooks: A collection of Jupyter Notebooks designed to guide users through specific functionalities and use cases related to Flint, PPL, or Spark. These notebooks provide step-by-step instructions for tasks such as querying, data transformation, and visualization.

Example Structure
For the NYC Taxi Dataset, the folder structure would look like this:

/home/demo/data/nyc_taxi/
  ├── yellow_tripdata_2021-12.parquet
  ├── yellow_tripdata_2022-01.parquet
  ├── load_dataset.ipynb
  ├── An_Introduction_to_the_Flint_API.ipynb
  ├── PPL_Getting_Started.ipynb
  ├── PPL_Data_Projections.ipynb
  └── SQL_Data_Accelerations.ipynb
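
A layout like this lends itself to a quick sanity check when new datasets are added. The sketch below is one possible way to verify that a dataset subfolder has the three kinds of content described above; check_dataset_folder is a hypothetical helper, and the accepted data suffixes are just the ones named in this proposal.

```python
from pathlib import Path

# File types named in the proposal as dataset files.
DATA_SUFFIXES = {".parquet", ".csv", ".json"}
LOADER_NAME = "load_dataset.ipynb"


def check_dataset_folder(folder):
    """Return a list of problems found in a dataset subfolder (empty if none)."""
    folder = Path(folder)
    files = [p for p in folder.iterdir() if p.is_file()]
    problems = []
    if not any(p.suffix in DATA_SUFFIXES for p in files):
        problems.append("no dataset files (.parquet, .csv, .json)")
    if not (folder / LOADER_NAME).is_file():
        problems.append(f"missing {LOADER_NAME}")
    # Any other notebook in the folder counts as a tutorial notebook.
    tutorials = [p for p in files if p.suffix == ".ipynb" and p.name != LOADER_NAME]
    if not tutorials:
        problems.append("no tutorial notebooks")
    return problems
```

Running check_dataset_folder("/home/demo/data/nyc_taxi") inside the container would return an empty list if the folder matches the structure shown above.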

Do you have any additional context?
