Getting Started

Installation

Python 3.8+ is required.

pip install smallpond
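
Before installing, it can help to confirm that the interpreter meets the version requirement above. The following is a minimal sketch using only the standard library; the helper name `meets_minimum` is hypothetical and is not part of smallpond.

```python
import sys

MIN_VERSION = (3, 8)  # minimum version stated in this guide


def meets_minimum(version=None, minimum=MIN_VERSION):
    """Return True if `version` (default: the running interpreter) is at least `minimum`.

    Hypothetical helper for illustration only; not a smallpond API.
    """
    if version is None:
        version = sys.version_info
    # Compare only (major, minor), e.g. (3, 12) >= (3, 8).
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python OK" if meets_minimum() else "Python is too old for smallpond")
```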

Initialization

The first step is to initialize the smallpond session:

import smallpond

sp = smallpond.init()

Loading Data

Create a DataFrame from a set of files:

df = sp.read_parquet("path/to/dataset/*.parquet")

To learn more about loading data, please refer to :ref:`loading_data`.

Partitioning Data

Smallpond does not partition data automatically yet, so you must specify the partitions yourself:

df = df.repartition(3)                 # repartition by files
df = df.repartition(3, by_row=True)    # repartition by rows
df = df.repartition(3, hash_by="host") # repartition by hash of column

To learn more about partitioning data, please refer to :ref:`partitioning_data`.
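
To make the idea behind hash partitioning concrete, here is a minimal pure-Python sketch of how rows could be assigned to partitions by hashing one column. It illustrates the concept only and is not smallpond's implementation: the function `hash_partition`, the sample rows, and the use of `zlib.crc32` as the hash are all assumptions made for this example.

```python
import zlib


def hash_partition(rows, key, npartitions):
    """Group dict rows into `npartitions` buckets by a stable hash of row[key].

    Hypothetical helper for illustration; smallpond's real hashing may differ.
    """
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        # crc32 is chosen only because it is stable across runs
        # (the built-in hash() of a str is randomized per process).
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % npartitions
        partitions[bucket].append(row)
    return partitions


# Made-up sample rows with a "host" column, mirroring hash_by="host".
rows = [
    {"host": "a.example", "bytes": 10},
    {"host": "b.example", "bytes": 20},
    {"host": "a.example", "bytes": 30},
]
parts = hash_partition(rows, key="host", npartitions=3)
```

The property that matters is that rows with the same key value always land in the same partition, which is what makes per-key processing within a single partition possible.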

Transforming Data

Apply Python functions or SQL expressions to transform data:

df = df.map('a + b as c')
df = df.map(lambda row: {'c': row['a'] + row['b']})

To learn more about transforming data, please refer to :ref:`transformations`.
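
The row function from the example above is ordinary Python, so it can be tried on plain dictionaries before it is handed to smallpond. This is a minimal sketch with made-up sample rows; the name `add_columns` is hypothetical, and the snippet does not call smallpond at all.

```python
# The same function body as in the map() example, bound to a name for testing.
add_columns = lambda row: {'c': row['a'] + row['b']}

# Made-up sample rows standing in for records of a DataFrame.
rows = [{'a': 1, 'b': 2}, {'a': 10, 'b': -4}]

# Apply the function to each row, one row at a time.
result = [add_columns(row) for row in rows]
print(result)  # [{'c': 3}, {'c': 6}]
```

Checking the function this way catches key errors and type mistakes early, before a distributed job is launched.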

Saving Data

Save the transformed data to a set of files:

df.write_parquet("path/to/output")

To learn more about saving data, please refer to :ref:`consuming_data`.

Monitoring

Smallpond uses Ray Core as its task scheduler, so you can monitor task execution in the Ray Dashboard.

When smallpond starts, it prints the Ray Dashboard URL:

... Started a local Ray instance. View the dashboard at http://127.0.0.1:8008
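
If the startup output is captured to a log, the dashboard URL can be extracted from a line like the one above. This is a minimal sketch that assumes the message keeps the wording "View the dashboard at"; the helper `find_dashboard_url` is hypothetical and is not part of smallpond or Ray.

```python
import re

# Matches the URL that follows the phrase shown in the startup message.
URL_PATTERN = re.compile(r"View the dashboard at (https?://\S+)")


def find_dashboard_url(log_text):
    """Return the first dashboard URL found in `log_text`, or None if absent."""
    match = URL_PATTERN.search(log_text)
    return match.group(1) if match else None


line = "... Started a local Ray instance. View the dashboard at http://127.0.0.1:8008"
print(find_dashboard_url(line))  # http://127.0.0.1:8008
```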
