DE1 Python Package

A curated collection of DE1's favorite Kedro utilities.

Installation

pip install de1

DataSets

EmptyPartitionedDataSet

Use this when data is not yet available in a particular folder, or when having no data is a valid state. In that case, loading returns an empty dictionary.

Particularly useful when doing sub-node parallelization.

Example Usage

# catalog.yml
empty_json_collection:
    type: de1.empty.EmptyPartitionedDataSet
    path: data/02_intermediate/json_collection
    dataset: json.JSONDataSet

# catalog.load already returns the loaded data, so compare it directly
empty_json = catalog.load('empty_json_collection')
assert empty_json == {}
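
The idea can be pictured with a small standard-library sketch. This is not the package's implementation: `load_partitions_or_empty` is a hypothetical helper that only mimics the behavior described above, where a missing folder yields an empty dictionary so that downstream code can loop over zero partitions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def load_partitions_or_empty(folder: str) -> Dict[str, Callable[[], Any]]:
    """Map each JSON file in `folder` to a function that loads it.

    Hypothetical helper for illustration only: returns {} when the
    folder does not exist instead of raising.
    """
    path = Path(folder)
    if not path.is_dir():
        return {}
    return {
        file.stem: (lambda f=file: json.loads(f.read_text()))
        for file in sorted(path.glob("*.json"))
    }


with tempfile.TemporaryDirectory() as tmp:
    # The sub-folder has not been created yet, so there is nothing to load.
    missing = load_partitions_or_empty(str(Path(tmp) / "json_collection"))
    # A node can iterate over the result without special-casing "no data".
    loaded = {key: load() for key, load in missing.items()}
```

Because the result is still a dictionary, a node that processes partitions one by one does not need a separate branch for the case where nothing has been written yet.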

LazyPartitionedDataSet

Use this when the data is too large to compute all at once and some clean-up is needed along the way. Each partition is passed as a zero-argument callable rather than as a precomputed value.

Example Usage

# catalog.yml
lazy_json_collection:
    type: de1.lazy.LazyPartitionedDataSet
    path: data/02_intermediate/json_collection
    dataset: json.JSONDataSet

data = {
    'key1': lambda: 'HI',
    'key2': lambda: 'BYE',
}
catalog.save('lazy_json_collection', data)
lazy_json_collection = catalog.load('lazy_json_collection')
assert lazy_json_collection['key1']() == 'HI'
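
As a rough illustration of why callables help, the sketch below evaluates and writes one partition at a time using only the standard library. `save_lazily` is a hypothetical helper, and the assumption that each value is computed immediately before it is written is mine; the package's actual save logic may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def save_lazily(folder: str, partitions: Dict[str, Callable[[], Any]]) -> None:
    """Evaluate and write one partition at a time.

    Hypothetical helper for illustration only: each result is released
    before the next callable runs, so only one partition needs to be
    held in memory at a time.
    """
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    for key, compute in partitions.items():
        result = compute()  # computed only now, not up front
        (target / f"{key}.json").write_text(json.dumps(result))
        del result  # drop the reference before the next partition


with tempfile.TemporaryDirectory() as tmp:
    save_lazily(tmp, {"key1": lambda: "HI", "key2": lambda: "BYE"})
    written = sorted(p.name for p in Path(tmp).iterdir())
    first = json.loads((Path(tmp) / "key1.json").read_text())
```

Passing a dictionary of precomputed values would require all of them to exist in memory before the first write; passing callables defers that work to the loop.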

PDFDataSet

A dataset that uses pdfplumber to extract text and tables from PDF files.

Data is returned as a PDFPage object.

Example Usage

# catalog.yml
invoice_pdf:
    type: de1.pdf.PDFDataSet
    filepath: data/01_raw/invoice.pdf

from de1.pdf import PDFPage

pdf_page: PDFPage = catalog.load('invoice_pdf')
assert isinstance(pdf_page.table, list)
assert isinstance(pdf_page.text, str)
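
For readers unfamiliar with pdfplumber, the following sketch shows the kind of calls such a dataset could wrap. `PageContent` and `extract_pages` are hypothetical names chosen for illustration and are not the package's `PDFPage`; pdfplumber is a third-party dependency, so it is imported inside the function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageContent:
    """Hypothetical container; de1's own PDFPage may differ."""

    text: str
    table: list


def extract_pages(filepath: str) -> List[PageContent]:
    """Extract text and tables from every page of a PDF."""
    import pdfplumber  # third-party; imported here so the sketch loads without it

    with pdfplumber.open(filepath) as pdf:
        return [
            PageContent(
                # extract_text can be None or empty when a page has no text layer
                text=page.extract_text() or "",
                # extract_tables returns a list of tables, each a list of rows
                table=page.extract_tables(),
            )
            for page in pdf.pages
        ]
```

With pdfplumber installed, `extract_pages('data/01_raw/invoice.pdf')` would return one `PageContent` per page of the file.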

ZipFileDataSet

A dataset that extracts a single file from a zip archive. By default it returns the file's contents as bytes, but a dataset can be passed in to control how the extracted file is loaded.

Example Usage

Check out the video: Handling Zip Files in Kedro Using the de1 python package!

# catalog.yml
invoice_zip:
    type: de1.zip.ZipFileDataSet
    filepath: data/01_raw/invoice.zip
    zipped_filename: invoice.pdf
    dataset: de1.pdf.PDFDataSet

from de1.pdf import PDFPage

pdf_page: PDFPage = catalog.load('invoice_zip')
assert isinstance(pdf_page.table, list)
assert isinstance(pdf_page.text, str)
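
The two modes described above (raw bytes by default, or a parsed object when a dataset is supplied) can be sketched with the standard `zipfile` module. `read_zipped_file` and its `parser` argument are hypothetical stand-ins for illustration, not the package's API.

```python
import io
import zipfile
from typing import Any, Callable, Optional


def read_zipped_file(
    zip_source,
    zipped_filename: str,
    parser: Optional[Callable[[bytes], Any]] = None,
) -> Any:
    """Read one member of a zip archive.

    Hypothetical helper for illustration only: returns raw bytes unless
    a `parser` is given, loosely mirroring the optional `dataset` setting.
    """
    with zipfile.ZipFile(zip_source) as archive:
        raw = archive.read(zipped_filename)
    return raw if parser is None else parser(raw)


# Build a small archive in memory so the sketch runs without any files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("invoice.txt", "total: 42")

as_bytes = read_zipped_file(buffer, "invoice.txt")
as_text = read_zipped_file(buffer, "invoice.txt", parser=lambda b: b.decode("utf-8"))
```

Only the requested member is read, so the rest of the archive never has to be unpacked to disk.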
