Add consolidated structure family #668

Draft
wants to merge 24 commits into
base: main
b22f7ab
REL: v0.1.0a113
danielballan Jan 3, 2024
d5f7ddd
REL: v0.1.0a114
danielballan Feb 5, 2024
68a1696
First pass at 'union' structure family
danielballan Feb 23, 2024
2ad9f3a
Fix mismatch (from rebase, likely).
danielballan Dec 12, 2024
a1a027a
Make grouping clear
danielballan Dec 12, 2024
6ed1167
TMP Fix usage, but this needs re-examined.
danielballan Dec 12, 2024
63ddea6
FIX: pass all query params as kwargs
genematx Dec 12, 2024
210dc65
Merge pull request #4 from genematx/consolidated-structure
danielballan Dec 12, 2024
eb0c9f2
Container may have structure (inlined contents).
danielballan Dec 12, 2024
9dc3a1f
Set all_keys correctly.
danielballan Dec 12, 2024
16c79d2
MNT: rename UnionStructure to ConsolidatedStructure
genematx Dec 12, 2024
af06a43
MNT: rename UnionStructure to ConsolidatedStructure
genematx Dec 12, 2024
d8ef72c
MNT: rename CatalogUnionAdapter and UnionLinks
genematx Dec 12, 2024
2380800
MNT: typing and lint
genematx Dec 12, 2024
b76e8fd
MNT: typing and lint
genematx Dec 12, 2024
1fcc9dc
ENH: refactor creation of ConsolidatedStructure as a classmethod
genematx Dec 13, 2024
bdbb965
ENH: allow iterating over ConsolidatedClient and its parts
genematx Dec 13, 2024
275f42c
DOC: add Consolidated Structure to the docs
genematx Dec 13, 2024
bb8fcaf
MNT: remove dims from the Container client signature
genematx Dec 13, 2024
f3ec5f5
TST: add tests for writing/reading consolidated structures
genematx Dec 13, 2024
7e74997
MNT: lint
genematx Dec 13, 2024
04efd91
MNT: typing
genematx Dec 13, 2024
c7ee962
FIX: reading string-dtype columns from dataframes individually
genematx Dec 14, 2024
2548cea
Merge pull request #5 from genematx/consolidated-structure
danielballan Dec 14, 2024
3 changes: 2 additions & 1 deletion docs/source/explanations/catalog.md
@@ -54,7 +54,8 @@ and `assets`, describes the format, structure, and location of the data.
to the Adapter
- `management` --- enum indicating whether the data is registered `"external"` data
or `"writable"` data managed by Tiled
- `structure_family` --- enum of structure types (`"container"`, `"array"`, `"table"`, ...)
- `structure_family` --- enum of structure types (`"container"`, `"array"`, `"table"`,
  etc., except for `"consolidated"`, which cannot be assigned to a Data Source)
- `structure_id` --- a foreign key to the `structures` table
- `node_id` --- foreign key to `nodes`
- `id` --- integer primary key
76 changes: 75 additions & 1 deletion docs/source/explanations/structures.md
@@ -11,7 +11,8 @@ The structure families are:

* array --- a strided array, like a [numpy](https://numpy.org) array
* awkward --- nested, variable-sized data (as implemented by [AwkwardArray](https://awkward-array.org/))
* container --- a of other structures, akin to a dictionary or a directory
* consolidated --- a container-like structure to combine tables and arrays in a common namespace
* container --- a collection of other structures, akin to a dictionary or a directory
* sparse --- a sparse array (i.e. an array which is mostly zeros)
* table --- tabular data, as in [Apache Arrow](https://arrow.apache.org) or
[pandas](https://pandas.pydata.org/)
@@ -575,3 +576,76 @@ response.
"count": 5
}
```

### Consolidated

This is a specialized container-like structure designed to link together multiple tables and arrays that store
related scientific data. It does not support nesting, but it provides a common namespace across all columns of the
contained tables along with the arrays; name collisions are therefore forbidden. This abstracts away
the disparate internal storage mechanisms (e.g. Parquet for tables and zarr for arrays) and presents the user with a
uniform interface for data access. Consolidated structures do not support pagination and are not
recommended for "wide" datasets with more than ~1000 items (columns and arrays) in the namespace.

Below is an example of a Consolidated structure that describes two tables and two arrays of various sizes. Their
respective structures are specified in the `parts` list, and `all_keys` defines the internal namespace of directly
addressable columns and arrays.

```json
{
    "parts": [
        {
            "structure_family": "table",
            "structure": {
                "arrow_schema": "data:application/vnd.apache.arrow.file;base64,/////...FFFF",
                "npartitions": 1,
                "columns": ["A", "B"],
                "resizable": false
            },
            "name": "table1"
        },
        {
            "structure_family": "table",
            "structure": {
                "arrow_schema": "data:application/vnd.apache.arrow.file;base64,/////...FFFF",
                "npartitions": 1,
                "columns": ["C", "D", "E"],
                "resizable": false
            },
            "name": "table2"
        },
        {
            "structure_family": "array",
            "structure": {
                "data_type": {
                    "endianness": "little",
                    "kind": "f",
                    "itemsize": 8,
                    "dt_units": null
                },
                "chunks": [[3], [5]],
                "shape": [3, 5],
                "dims": null,
                "resizable": false
            },
            "name": "F"
        },
        {
            "structure_family": "array",
            "structure": {
                "data_type": {
                    "endianness": "not_applicable",
                    "kind": "u",
                    "itemsize": 1,
                    "dt_units": null
                },
                "chunks": [[5], [7], [3]],
                "shape": [5, 7, 3],
                "dims": null,
                "resizable": false
            },
            "name": "G"
        }
    ],
    "all_keys": ["A", "B", "C", "D", "E", "F", "G"]
}
```
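
The relationship between `parts` and `all_keys` can be illustrated with a short sketch. It assumes that each table contributes its column names to the namespace and each array contributes its own name, which is consistent with the example JSON. The helper `collect_all_keys` is hypothetical and written only for illustration; it is not Tiled's implementation.

```python
from collections import Counter
from typing import Any


def collect_all_keys(parts: list[dict[str, Any]]) -> list[str]:
    """Flatten the parts of a consolidated structure into one namespace.

    Hypothetical helper: tables contribute their column names, every other
    part (here, arrays) contributes its own name. A repeated key raises
    ValueError, since name collisions are forbidden.
    """
    keys: list[str] = []
    for part in parts:
        if part["structure_family"] == "table":
            keys.extend(part["structure"]["columns"])
        else:
            keys.append(part["name"])
    duplicates = sorted(key for key, count in Counter(keys).items() if count > 1)
    if duplicates:
        raise ValueError(f"Name collision in consolidated namespace: {duplicates}")
    return keys


# Trimmed-down version of the parts from the example structure.
parts = [
    {"structure_family": "table", "name": "table1", "structure": {"columns": ["A", "B"]}},
    {"structure_family": "table", "name": "table2", "structure": {"columns": ["C", "D", "E"]}},
    {"structure_family": "array", "name": "F", "structure": {"shape": [3, 5]}},
    {"structure_family": "array", "name": "G", "structure": {"shape": [5, 7, 3]}},
]
all_keys = collect_all_keys(parts)
print(all_keys)  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```

Note that the table names (`table1`, `table2`) do not appear in `all_keys`; only their columns do. An array named `"A"` added to these parts would clash with column `A` of `table1` and be rejected.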
38 changes: 37 additions & 1 deletion docs/source/how-to/register.md
@@ -72,7 +72,10 @@ Sometimes it is necessary to take more manual control of this registration
process, such as if you want to take advantage of particular knowledge
about the files to specify particular `metadata` or `specs`.

Use the Python client, as in this example.
#### Registering external data

To register data from external files in Tiled, use the Python client and
construct a Data Source object explicitly, passing the list of assets, as in the following example.

```py
import numpy
@@ -112,3 +115,36 @@ client.new(
specs=[],
)
```

#### Writing a consolidated structure

Similarly, to create a consolidated structure, specify
its constituents as separate Data Sources. The following example consolidates a table
and an array:

```python
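import numpy
import pandas

# Import paths as used in this PR's tests (tiled/_tests/test_consolidated.py).
from tiled.structures.array import ArrayStructure
from tiled.structures.core import StructureFamily
from tiled.structures.data_source import DataSource
from tiled.structures.table import TableStructure

# `client` is the Tiled Python client from the previous example.
rng = numpy.random.default_rng(12345)
arr = rng.random(size=(3, 5), dtype="float64")
df = pandas.DataFrame({"A": ["one", "two", "three"], "B": [1, 2, 3]})

node = client.create_consolidated(
    [
        DataSource(
            structure_family=StructureFamily.table,
            structure=TableStructure.from_pandas(df),
            name="table1",
        ),
        DataSource(
            structure_family=StructureFamily.array,
            structure=ArrayStructure.from_array(arr),
            name="C",
        ),
    ]
)

# Write the data, one part at a time.
node.parts["table1"].write(df)
# The array is stored as a single chunk, so its only block index is (0, 0).
node.parts["C"].write_block(arr, (0, 0))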
```
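
The block index `(0, 0)` passed to `write_block` refers to the position of a chunk in the chunk grid, not to an element offset. The sketch below shows how block indices follow from the chunk layout, assuming chunks are given as per-dimension lists of chunk sizes (for example `[[3], [5]]` for a `(3, 5)` array stored as one chunk). The helpers `block_indices` and `chunks_match_shape` are hypothetical names used only for illustration and are not part of Tiled.

```python
from itertools import product


def block_indices(chunks):
    """List every block index for per-dimension chunk sizes (hypothetical helper)."""
    return list(product(*(range(len(dim_chunks)) for dim_chunks in chunks)))


def chunks_match_shape(shape, chunks):
    """Check that the chunk sizes along each dimension add up to the shape."""
    return len(shape) == len(chunks) and all(
        sum(dim_chunks) == size for dim_chunks, size in zip(chunks, shape)
    )


# One chunk per dimension: a single block.
print(block_indices([[3], [5]]))  # [(0, 0)]

# Rows split into two chunks (2 + 1): two blocks.
print(block_indices([[2, 1], [5]]))  # [(0, 0), (1, 0)]
```

Under this layout, an array split into several chunks would need one `write_block` call per block index, each with the matching slice of the data.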
2 changes: 2 additions & 0 deletions docs/source/reference/service.md
@@ -104,6 +104,8 @@ See {doc}`../explanations/structures` for more context.
tiled.structures.array.BuiltinDtype
tiled.structures.array.Endianness
tiled.structures.array.Kind
tiled.structures.consolidated.ConsolidatedStructure
tiled.structures.consolidated.ConsolidatedStructurePart
tiled.structures.core.Spec
tiled.structures.core.StructureFamily
tiled.structures.table.TableStructure
88 changes: 88 additions & 0 deletions tiled/_tests/test_consolidated.py
@@ -0,0 +1,88 @@
import numpy
import pandas
import pandas.testing
import pytest

from ..catalog import in_memory
from ..client import Context, from_context
from ..server.app import build_app
from ..structures.array import ArrayStructure
from ..structures.core import StructureFamily
from ..structures.data_source import DataSource
from ..structures.table import TableStructure

rng = numpy.random.default_rng(12345)

df1 = pandas.DataFrame({"A": ["one", "two", "three"], "B": [1, 2, 3]})
df2 = pandas.DataFrame(
    {
        "C": ["red", "green", "blue", "white"],
        "D": [10.0, 20.0, 30.0, 40.0],
        "E": [0, 0, 0, 0],
    }
)
arr1 = rng.random(size=(3, 5), dtype="float64")
arr2 = rng.integers(0, 255, size=(5, 7, 3), dtype="uint8")
md = {"md_key1": "md_val1", "md_key2": 2}


@pytest.fixture(scope="module")
def tree(tmp_path_factory):
    return in_memory(writable_storage=tmp_path_factory.getbasetemp())


@pytest.fixture(scope="module")
def context(tree):
    with Context.from_app(build_app(tree)) as context:
        client = from_context(context)
        x = client.create_consolidated(
            [
                DataSource(
                    structure_family=StructureFamily.table,
                    structure=TableStructure.from_pandas(df1),
                    name="table1",
                ),
                DataSource(
                    structure_family=StructureFamily.table,
                    structure=TableStructure.from_pandas(df2),
                    name="table2",
                ),
                DataSource(
                    structure_family=StructureFamily.array,
                    structure=ArrayStructure.from_array(arr1),
                    name="F",
                ),
                DataSource(
                    structure_family=StructureFamily.array,
                    structure=ArrayStructure.from_array(arr2),
                    name="G",
                ),
            ],
            key="x",
            metadata=md,
        )
        # Write by data source.
        x.parts["table1"].write(df1)
        x.parts["table2"].write(df2)
        x.parts["F"].write_block(arr1, (0, 0))
        x.parts["G"].write_block(arr2, (0, 0, 0))

        yield context


def test_iterate_parts(context):
    client = from_context(context)
    for part in client["x"].parts:
        client["x"].parts[part].read()


def test_iterate_columns(context):
    client = from_context(context)
    for col in client["x"]:
        client["x"][col].read()
        client[f"x/{col}"].read()


def test_metadata(context):
    client = from_context(context)
    assert client["x"].metadata == md
22 changes: 22 additions & 0 deletions tiled/_tests/test_dataframe.py
@@ -41,6 +41,17 @@
            pandas.DataFrame({f"column_{i:03d}": i * numpy.ones(5) for i in range(10)}),
            npartitions=1,
        ),
        # a dataframe with mixed types
        "diverse": DataFrameAdapter.from_pandas(
            pandas.DataFrame(
                {
                    "A": numpy.array([1, 2, 3], dtype="|u8"),
                    "B": numpy.array([1, 2, 3], dtype="<f8"),
                    "C": ["one", "two", "three"],
                }
            ),
            npartitions=1,
        ),
    }
)

@@ -100,6 +111,17 @@ def test_dataframe_single_partition(context):
    pandas.testing.assert_frame_equal(actual, expected)


def test_reading_diverse_dtypes(context):
    client = from_context(context)
    expected = tree["diverse"].read()
    actual = client["diverse"].read()
    pandas.testing.assert_frame_equal(actual, expected)

    for col in expected.columns:
        actual = client["diverse"][col].read()
        assert numpy.array_equal(expected[col], actual)


def test_dask(context):
    client = from_context(context, "dask")["basic"]
    expected = tree["basic"].read()