Add support for Output preview of Parquet files #2926

Open
anna-geller opened this issue Jan 25, 2024 · 1 comment
Labels: area/frontend (Needs frontend code changes), enhancement (New feature or request), good first issue (Great issue for new contributors)

anna-geller (Member) commented Jan 25, 2024

Feature description

Parquet is such an important file format in data lakehouse architectures that the Output preview should support Parquet files:

image
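For anyone picking this up, a preview first needs to recognize that an output file is Parquet at all. Below is a minimal sketch of that check, not Kestra's implementation. It relies only on the Parquet file layout (the 4-byte magic `PAR1` at the start and end of the file, preceded at the end by a 4-byte little-endian footer length). The function names `looks_like_parquet` and `footer_length` are hypothetical, and encrypted-footer files (which use a different magic) are not handled.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"
# Smallest plausible file: header magic + footer length + trailing magic.
MIN_SIZE = 12


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < MIN_SIZE:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def footer_length(path):
    """Read the footer metadata length stored just before the trailing magic."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (length,) = struct.unpack("<I", f.read(4))
    return length
```

A check like this only decides whether a Parquet preview is worth attempting. Decoding the footer and the first rows would still need a Parquet reader library.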

Example flow: https://kestra.io/blueprints/ingest/company.team.zip_to_parquet

id: zip_to_parquet
namespace: company.team
variables:
  file_id: "{{ execution.startDate | dateAdd(-3, 'MONTHS') | date('yyyyMM') }}"
tasks:
  - id: get_zipfile
    type: io.kestra.plugin.core.http.Download
    uri: "https://divvy-tripdata.s3.amazonaws.com/{{ render(vars.file_id) }}-divvy-tripdata.zip"
  - id: unzip
    type: io.kestra.plugin.compress.ArchiveDecompress
    algorithm: ZIP
    from: "{{ outputs.get_zipfile.uri }}"
  - id: parquet_output
    type: io.kestra.plugin.scripts.python.Script
    warningOnStdErr: false
    taskRunner:
      type: io.kestra.plugin.scripts.runner.docker.Docker
    containerImage: ghcr.io/kestra-io/pydata:latest
    env:
      FILE_ID: "{{ render(vars.file_id) }}"
    inputFiles: "{{ outputs.unzip.files }}"
    script: |
      import os
      import pandas as pd

      file_id = os.environ["FILE_ID"]
      file = f"{file_id}-divvy-tripdata.csv"

      df = pd.read_csv(file)
      df.to_parquet(f"{file_id}.parquet")
    outputFiles:
      - "*.parquet"
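To make the `file_id` variable in the flow above easier to follow: it takes the execution start date, shifts it back three months, and formats it as `yyyyMM`. A rough Python equivalent of that arithmetic is sketched below. The helper name `file_id_for` is hypothetical and is not part of Kestra. Because only the year and month are emitted, the day of the month does not affect the result.

```python
from datetime import date


def file_id_for(start, months_back=3):
    """Format the month `months_back` months before `start` as yyyyMM."""
    # Work on a zero-based month index so that borrowing from the year is automatic.
    index = start.year * 12 + (start.month - 1) - months_back
    year, month0 = divmod(index, 12)
    return f"{year:04d}{month0 + 1:02d}"


print(file_id_for(date(2024, 1, 25)))  # prints 202310
```

With that value, an execution started on Jan 25, 2024 would download `202310-divvy-tripdata.zip` and write `202310.parquet`.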
anna-geller added the enhancement label Jan 25, 2024
anna-geller added this to the v0.16.0 milestone Jan 25, 2024
anna-geller modified the milestones: v0.16.0, v0.18.0 Feb 5, 2024
anna-geller modified the milestones: v0.18.0, v0.22.0 Feb 14, 2024
anna-geller removed this from the v0.22.0 milestone Jul 23, 2024
anna-geller added the area/frontend label Aug 20, 2024
anna-geller added the good first issue label Oct 10, 2024
abhishekkhairnar (Contributor) commented

@anna-geller Please assign this issue to me.

Projects
Status: Backlog
Development

No branches or pull requests

2 participants