Functionality enhancements to address lazy loading of chunked data, variable length strings, and other minor bug fixes #68

Open

wants to merge 121 commits into base: master
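The headline change is lazy, chunk-aware reading: a selection on a dataset should only fetch the chunks that intersect it, rather than the whole variable. A minimal sketch of the intended usage, assuming the Dataset object supports h5py-style slicing and that `File` can be used as a context manager, as the laziness commits and tests below imply; `example.nc` and `tas` are placeholder names, not files from this branch:

```python
import pyfive

# Sketch only: placeholder file and variable names.
with pyfive.File('example.nc') as f:
    tas = f['tas']              # instantiating the variable reads metadata, not data
    block = tas[0, 10:20, :]    # lazy read: only chunks overlapping this slice are fetched
    print(block.shape, block.dtype)
```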
Changes from all commits
Commits (121)
02fca54
Using s3 to get at some real data for testing
Feb 22, 2024
df3669a
Getting the address as well as size into the index
Feb 22, 2024
16c0e81
With timer
Feb 22, 2024
c464be8
Not working yet. Don't reckon I have the arguments to OrthogonalIndex…
Feb 23, 2024
afaa4f5
A few more notes in the code so I can come back to it anon.
Feb 23, 2024
18bc37c
Woops. Need this.
Feb 23, 2024
4b0ac08
First working lazy read (only reads chunks needed for selection)
Feb 24, 2024
5356aa0
Whoops, didn't commit the real oil
Feb 24, 2024
9fe2394
Should now support filtering chunks in the partial chunk loading. H…
Feb 24, 2024
dafb3c9
Some additional documentation
Feb 25, 2024
53e4ebe
Seems to work, prior to re-integration
Feb 29, 2024
9ac0bbd
Moved chunk support into standard API
Mar 1, 2024
a88a150
removing playing code
Mar 1, 2024
89aafe3
Merge branch 'jjhelmus:master' into issue6
bnlawrence Mar 1, 2024
96dc178
Fixes bug which stops the selection read from actually occurring and …
Mar 3, 2024
eb44c15
Hack to avoid reference datatypes in chunk by chunk selections.
Mar 3, 2024
51f7cca
Remove obsolete function
Mar 5, 2024
1f61d6c
Support for third party access to contiguous data address and size. A…
Mar 5, 2024
e6217b5
First cut, fails references and classic, even with new stuff turned off?
Mar 7, 2024
67c93e0
This version appears to now support failing over from a memory map to…
Mar 7, 2024
a08ee20
First cut, no tests yet
Apr 21, 2024
dc00503
Improvements
Apr 22, 2024
9ffb5b2
With some failing tests
Apr 22, 2024
223a931
Fixed one test
Apr 22, 2024
3a256ab
All tests for new functionality pass, but I've broken something old
Apr 22, 2024
32d83dd
Now passing all tests
Apr 22, 2024
f5f89c5
Checking coverage of get_chunk_info_by_coord(method)
Apr 22, 2024
2c8f59c
Missing docstring
Apr 22, 2024
013ce62
Cleaning up
Apr 26, 2024
400c798
Merge pull request #5 from bnlawrence/h5pyapi
bnlawrence Jul 9, 2024
c21ee63
Ok, these were the pull request fixes that I thought I'd merged
Jul 9, 2024
f87b9d1
Merge branch 'h5pyapi'
Jul 9, 2024
3c3f6d6
Adding Datatype and check_enum_dtype in a minimal manner - closes #8
Jul 10, 2024
9994598
Basic support for elements of h5netcdf and what it expects to be able…
Jul 12, 2024
c12b5b3
Test support for graceful enum failure
Jul 12, 2024
c80ed92
Committing to the dtype returned as a numpy dtype, and the extra h5t …
Jul 14, 2024
04bbef6
Test for reference_list
Jul 19, 2024
552c463
(New reference list test still broken) H5D has been disconnected from…
Dec 19, 2024
e32a1b0
Interim commit so we have something to point to in a discussion aroun…
Dec 20, 2024
2d88101
Transition to H5D cached backend is complete, though we still have th…
Dec 21, 2024
2678022
Removed obsolete DatasetDataObjects
Dec 21, 2024
0078956
Expose package version in code, and separate testing requirements out…
Dec 22, 2024
503cb45
Attempt to get b-tree logging in h5d
Dec 22, 2024
ac96f46
Cleared a few bugs and misunderstandings which arose from working wit…
Dec 28, 2024
b586db0
Continue to use open file in h5d, closes #18
Dec 30, 2024
7f17cc8
Test for true bytes-io testing (needed for h5netcdf test compatibility)
Jan 1, 2025
fd13670
Deals with filename issues (closes #19) (and deals with another ioby…
Jan 2, 2025
32ad75d
Addressing, I think, upstream issue 53, and includes a test case I sh…
Jan 2, 2025
48b7b9a
Fix location of files so tests run properly from parent directory.
Jan 3, 2025
2233395
Well, I think this is a fix for #23, and it's so complex I'm committi…
Jan 7, 2025
1e2c424
Cleaned up issue23 fixes, all tests pass
Jan 8, 2025
6da5fda
Test localisation, and a new test for laziness outside a context manager
Jan 8, 2025
59e8667
Changes to support out of context variable access as described in #24
Jan 8, 2025
34a684a
Removing the pseudo chunking stuff that snuck into the last commit
Jan 8, 2025
64827c4
catching up to the main trunk in h5netcdf
Jan 8, 2025
20693b9
Starting to sketch out the pseudo chunking
Jan 8, 2025
4126e2b
threadsafe data access
davidhassell Jan 8, 2025
8a7e1dc
merge from h5netcdf
davidhassell Jan 8, 2025
87a1980
add deps for mock s3 test
valeriupredoi Jan 9, 2025
c7058b6
add mock s3 test
valeriupredoi Jan 9, 2025
df81faf
posix & s3
davidhassell Jan 9, 2025
53ff9df
tidy
davidhassell Jan 9, 2025
ee0995b
tidy up
davidhassell Jan 10, 2025
43a8e9c
tidy up
davidhassell Jan 10, 2025
7462033
add test reports to gitignore for now
valeriupredoi Jan 13, 2025
0c8ffc5
add conftest
valeriupredoi Jan 13, 2025
8cc2363
minimize conftest
valeriupredoi Jan 13, 2025
3086211
make use of conftest and add minimal test for mock s3 fs
valeriupredoi Jan 13, 2025
ed0f117
upgrade actions versions
valeriupredoi Jan 13, 2025
6843567
add flask dep
valeriupredoi Jan 13, 2025
88752d1
restrict to python 3.10
valeriupredoi Jan 13, 2025
ddeb0ea
add flask-cors
valeriupredoi Jan 13, 2025
522bf7a
add h5 modules
valeriupredoi Jan 13, 2025
22476e8
mark test as xfailed
valeriupredoi Jan 13, 2025
f28c68d
add docstrings
valeriupredoi Jan 13, 2025
03183b7
Merge pull request #26 from NCAS-CMS/mock_s3fs
bnlawrence Jan 15, 2025
2d312c1
Minor changes following V's S3 testing merge
Jan 15, 2025
742faf8
A framework for testing laziness.
Jan 15, 2025
6cd74e7
Merge branch 'h5netcdf' into h5netcdf-fh-2
davidhassell Jan 15, 2025
7a13108
add test for threadsafe data access on posix and s3
davidhassell Jan 15, 2025
1b6d670
note on number of threadsafe test iterations
davidhassell Jan 15, 2025
0ec45d7
Test framework for pseudochunking plus starting to migrate test data …
Jan 15, 2025
344573b
Ok, we pass the pseudochunking test, but we don't actually do it yet.
Jan 15, 2025
f450776
Pseudo chunking in, with test support (and a missing make data file t…
Jan 15, 2025
007ac56
Merge branch 'pseudo_chunking' into h5netcdf
bnlawrence Jan 15, 2025
33e21e1
Merge pull request #28 from NCAS-CMS/h5netcdf
bnlawrence Jan 15, 2025
89ebd2f
Merge branch 'h5netcdf' into h5netcdf-fh-2
bnlawrence Jan 15, 2025
fee0759
Merge pull request #27 from NCAS-CMS/h5netcdf-fh-2
bnlawrence Jan 15, 2025
13e5e39
Tidy up dependencies for testing
Jan 15, 2025
c4a38b9
Minor changes which come from upstream advice on my two pull requests…
Jan 15, 2025
49aa794
Suppress reference list warning. It's useless
Jan 15, 2025
dba2683
Using context manager for threadsafe test
Jan 16, 2025
2ba27dc
no returned memory maps
davidhassell Jan 20, 2025
e6b518b
Test for #16 and #29
Jan 20, 2025
b1ae323
More versions of the #29 tests
Jan 20, 2025
c7e157c
Better .gitignore
Jan 20, 2025
40c898b
Giving up on in-memory netcdf tests for #29
Jan 20, 2025
8acf067
explicitly close POSIX files
davidhassell Jan 20, 2025
526c642
vlen strings data test case, vanilla version, and version with missin…
Jan 21, 2025
0e4a45b
add extra posix test for file closure
davidhassell Jan 21, 2025
298edc7
Merge branch 'h5netcdf' into fix-memmap
davidhassell Jan 21, 2025
1fb9c98
More on h5d and testing. The iter_chunks method is broken and we now …
Jan 21, 2025
82dc2a9
Support for pyactivestorage via a bespoke `get_chunk_info_from_chunk_…
Jan 21, 2025
838b0a5
better ignore
Jan 21, 2025
a633683
The first vlen data test passes with this code
Jan 21, 2025
fbdda40
closer to a solution for #29. These tests pass, but we need to deal w…
Jan 21, 2025
a20763f
Partially working vlen string support, issues with global heap usage …
Jan 22, 2025
e7c465e
Passing all vlen tests for #29, though we are ignoring the dtype of …
Jan 22, 2025
580e3df
Merge pull request #33 from NCAS-CMS/vlen
bnlawrence Jan 22, 2025
0a4c801
Merge remote-tracking branch 'refs/remotes/origin/h5netcdf' into h5ne…
Jan 22, 2025
b955b4b
Merge branch 'fix-memmap' into h5netcdf
Jan 22, 2025
e40c7d7
Remaining tests for vlen and iterchunks, support for vlen dtypes (clo…
Jan 22, 2025
599db7b
dev
davidhassell Jan 23, 2025
4b4fbc3
dev
davidhassell Jan 23, 2025
a2cfaeb
dev
davidhassell Jan 23, 2025
bd16147
vlen related fixes
davidhassell Jan 24, 2025
a50204d
Update pyfive/indexing.py
davidhassell Jan 28, 2025
1f9b2c0
Merge pull request #37 from NCAS-CMS/vlen-dtype
bnlawrence Jan 28, 2025
6c02408
Merge branch 'master' into wacasoft
Jan 30, 2025
eed7e99
install only in test mode
valeriupredoi Jan 30, 2025
6255fc0
actual correct name for testing regime
valeriupredoi Jan 30, 2025
4 changes: 0 additions & 4 deletions .coveragerc

This file was deleted.

10 changes: 5 additions & 5 deletions .github/workflows/pytest.yml
@@ -5,7 +5,7 @@ name: Python package

 on:
   push:
-    branches: [ master ]
+    branches: [ master, mock_s3fs ]
   pull_request:
     branches: [ master ]

@@ -16,19 +16,19 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12", "3.13"]
+        python-version: ["3.10", "3.11", "3.12"]

     steps:
-    - uses: actions/checkout@v3
+    - uses: actions/checkout@v4
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v3
+      uses: actions/setup-python@v5
       with:
         python-version: ${{ matrix.python-version }}
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
         python -m pip install flake8 pytest
-        python -m pip install -e .
+        python -m pip install .[testing] # install in test mode
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
     - name: Lint with flake8
       run: |
8 changes: 8 additions & 0 deletions .gitignore
@@ -1,2 +1,10 @@
.coverage
.pyc
build
__pycache__/
*.egg-info
.idea
.DS_Store
test-reports/
<_io.Bytes*>
Review comment on the `<_io.Bytes*>` entry: If filenames with < in them are generated, I'd like to see them.

tests/__pycache__/
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,2 +1,3 @@
# Include the license file
include LICENSE.txt
include README.rst
2 changes: 1 addition & 1 deletion README.rst
@@ -23,7 +23,7 @@ implemented.
Dependencies
============

-pyfive is tested to work with Python 3.8 to 3.13. It may also work
+pyfive is tested to work with Python 3.10 to 3.13. It may also work
with other Python versions.

The only dependency to run the software besides Python is NumPy.
112 changes: 112 additions & 0 deletions bnl/opening_speed.py
@@ -0,0 +1,112 @@
import h5py
import pyfive
from pathlib import Path
import time
import s3fs

S3_URL = 'https://uor-aces-o.s3-ext.jc.rl.ac.uk/'
S3_BUCKET = 'bnl'

def test_speed(s3=False):

    mypath = Path(__file__).parent
    fname1 = 'da193o_25_day__grid_T_198807-198807.nc'
    vname1 = 'tos'
    p1 = mypath/fname1

    fname2 = 'ch330a.pc19790301-def-short.nc'
    vname2 = 'UM_m01s16i202_vn1106'
    p2 = Path.home()/'Repositories/h5netcdf/h5netcdf/tests/'/fname2

    do_run(p1, fname1, vname1, s3)

    do_run(p2, fname2, vname2, s3)


def do_s3(package, fname, vname):

    fs = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': S3_URL})
    uri = S3_BUCKET + '/' + fname
    with fs.open(uri,'rb') as p:
        t_opening, t_var, t_calc, t_tot = do_inner(package, p, vname)

    return t_opening, t_var, t_calc, t_tot

def do_inner(package, p, vname, withdask=False):
    h0 = time.time()
    pf1 = package.File(p)
    h3 = time.time()
    t_opening = 1000* (h3-h0)

    h5a = time.time()
    vp = pf1[vname]
    h5 = time.time()
    t_var = 1000* (h5-h5a)

    h6a = time.time()
    sh = sum(vp)
    h6 = time.time()
    t_calc = 1000* (h6-h6a)

    t_tot = t_calc+t_var+t_opening

    pf1.close()
    return t_opening, t_var, t_calc, t_tot



def do_run(p, fname, vname, s3):

    if s3:
        import s3fs

    # for posix force this to be a comparison from memory
    # by ensuring file is in disk cache and ignore first access
    # but we then do an even number of accesses to make sure we are not
    # biased by caching.
    n = 0
    datanames = ['h_opening','p_opening','h_var','p_var','h_calc','p_calc','h_tot','p_tot']
    results = {x:0.0 for x in datanames}
    while n < 2:
        n += 1

        if s3:
            h_opening, h_var, h_calc, h_tot = do_s3(h5py, fname, vname)
            p_opening, p_var, p_calc, p_tot = do_s3(pyfive, fname, vname)

        else:
            h_opening, h_var, h_calc, h_tot = do_inner(h5py, p, vname)
            p_opening, p_var, p_calc, p_tot = do_inner(pyfive, p, vname)

        if n > 1:
            for x, r in zip(datanames, [h_opening, p_opening, h_var, p_var, h_calc, p_calc, h_tot, p_tot]):
                results[x] += r

    for x in results:
        results[x] /= (n - 1)  # average the accumulated timings over the measured (non-warm-up) passes


print("File Opening Time Comparison ", fname, f' (ms, S3={s3})')
print(f"h5py: {results['h_opening']:9.6f}")
print(f"pyfive: {results['p_opening']:9.6f}")

print(f'Variable instantiation for [{vname}]')
print(f"h5py: {results['h_var']:9.6f}")
print(f"pyfive: {results['p_var']:9.6f}")

print('Access and calculation time for summation')
print(f"h5py: {results['h_calc']:9.6f}")
print(f"pyfive: {results['p_calc']:9.6f}")

print('Total times')
print(f"h5py: {results['h_tot']:9.6f}")
print(f"pyfive: {results['p_tot']:9.6f}")

if __name__=="__main__":
test_speed()
test_speed(s3=True)




8 changes: 7 additions & 1 deletion pyfive/__init__.py
@@ -1,7 +1,13 @@
"""
pyfive : a pure python HDF5 file reader.
This is the public API exposed by pyfive,
which is a small subset of the H5PY API.
"""

from .high_level import File
from pyfive.high_level import File, Group, Dataset
from pyfive.h5t import check_enum_dtype, check_string_dtype, check_dtype
from pyfive.h5py import Datatype, Empty
from importlib.metadata import version

__version__ = '0.5.0.dev'
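The expanded `__init__` re-exports a handful of h5py-compatible helpers alongside `File`, `Group` and `Dataset`. A short sketch of how those exports could be combined, assuming `check_string_dtype` and `check_enum_dtype` mirror their h5py counterparts (returning a string-info object or an enum name-to-value mapping, and `None` otherwise); the file and dataset names are placeholders:

```python
import pyfive
from pyfive import check_enum_dtype, check_string_dtype

# Sketch only: 'data.nc' and 'labels' are placeholder names.
with pyfive.File('data.nc') as f:
    ds = f['labels']
    if check_string_dtype(ds.dtype) is not None:
        print('variable-length strings:', ds[:5])
    elif check_enum_dtype(ds.dtype) is not None:
        print('enum mapping:', check_enum_dtype(ds.dtype))

print('pyfive version:', pyfive.__version__)
```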

37 changes: 37 additions & 0 deletions pyfive/btree.py
@@ -8,6 +8,7 @@

from .core import _padded_size
from .core import _unpack_struct_from_file
from .core import _unpack_struct_from
from .core import Reference


@@ -440,6 +441,25 @@ def _parse_record(self, record):
        creationorder = struct.unpack_from("<Q", record, 0)[0]
        return {'creationorder': creationorder, 'heapid':record[8:8+7]}

class BTreeV2AttrCreationOrder(BTreeV2):
    """
    HDF5 version 2 B-Tree storing attribute creation orders (type 9).
    See the Type 9 Record Layout, note the different ordering from type 6.
    """
    NODE_TYPE = 9

    def _parse_record(self, record):
        return _unpack_struct_from(V2_BTREE_NODE_TYPE_9_LAYOUT, record)

class BTreeV2AttrNames(BTreeV2):
    """
    HDF5 version 2 B-Tree storing attribute names (type 8).
    """
    NODE_TYPE = 8

    def _parse_record(self, record):
        return _unpack_struct_from(V2_BTREE_NODE_TYPE_8_LAYOUT, record)


# IV.A.2.l The Data Storage - Filter Pipeline message
RESERVED_FILTER = 0
@@ -449,3 +469,20 @@ def _parse_record(self, record):
SZIP_FILTER = 4
NBIT_FILTER = 5
SCALEOFFSET_FILTER = 6


# Attribute message B-Tree node types
# haven't tested type 8 yet, not sure how to get some.
#
V2_BTREE_NODE_TYPE_8_LAYOUT = OrderedDict((
    ('heapid','8s'),
    ('flags','B'),
    ('creationorder','I'),
    ('namehash','I')
))

V2_BTREE_NODE_TYPE_9_LAYOUT = OrderedDict((
    ('heapid','8s'),
    ('flags','B'),
    ('creationorder','I')
))
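Both new node types are fixed-width records, so parsing reduces to a single struct unpack against the layouts above. The snippet below is a self-contained sketch of what that unpack looks like for a type 9 record; `unpack_struct_from` here is a stand-in for `pyfive.core._unpack_struct_from`, not the library's actual implementation, and the record bytes are fabricated for illustration:

```python
import struct
from collections import OrderedDict

V2_BTREE_NODE_TYPE_9_LAYOUT = OrderedDict((
    ('heapid', '8s'),
    ('flags', 'B'),
    ('creationorder', 'I'),
))

def unpack_struct_from(layout, buf, offset=0):
    # Stand-in for pyfive.core._unpack_struct_from: little-endian, no padding.
    fmt = '<' + ''.join(layout.values())
    values = struct.unpack_from(fmt, buf, offset)
    return OrderedDict(zip(layout.keys(), values))

# A fabricated 13-byte type 9 record: 8-byte heap id, 1-byte flags, 4-byte creation order.
record = bytes(range(8)) + b'\x00' + struct.pack('<I', 42)
parsed = unpack_struct_from(V2_BTREE_NODE_TYPE_9_LAYOUT, record)
print(parsed['creationorder'])  # 42
```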
2 changes: 2 additions & 0 deletions pyfive/core.py
@@ -60,3 +60,5 @@ def _unpack_integer(nbytes, buf, offset=0):
    fmt = "{}s".format(nbytes)
    values = struct.unpack_from(fmt, buf, offset=offset)
    return int.from_bytes(values[0], byteorder="little", signed=False)


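For reference, `_unpack_integer` (unchanged in this hunk, which only appends blank lines to the end of the file) reads an arbitrary-width little-endian unsigned integer out of a buffer. A quick, self-contained illustration with made-up bytes:

```python
import struct

def _unpack_integer(nbytes, buf, offset=0):
    # Same logic as the pyfive.core helper shown in the diff above.
    fmt = "{}s".format(nbytes)
    values = struct.unpack_from(fmt, buf, offset=offset)
    return int.from_bytes(values[0], byteorder="little", signed=False)

buf = b'\x34\x12\x00\xff'
print(hex(_unpack_integer(2, buf)))            # 0x1234 (2-byte little-endian)
print(hex(_unpack_integer(3, buf, offset=1)))  # 0xff0012
```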