Implement lazy loading of raw data for local files #48

marcoross · 2024-05-02T11:29:43Z

No description provided.

CHANGELOG.md

cbrnr · 2024-05-02T13:10:37Z

This looks really good! Not entirely related, but I was wondering if there is some public attribute that reflects the status of the data (i.e. loaded vs. not loaded). Sometimes, it would also be useful to know the total size of the Edf object in memory. I don't think this is currently possible, right?

Also, I assume loaded vs. not loaded applies to all signals that are part of an Edf object? That is, either the entire object is loaded or not (as opposed to some selected signals). Correct?

Finally, and this is now least related to this PR (so let me know if you would like me to open a new issue), what is the public way to access the actual data? There's no such thing as Edf.data, and Edf.get_signal() requires a label. Sorry if I'm missing something obvious!

edfio/edf.py

edfio/edf_signal.py

hofaflo · 2024-05-02T14:28:07Z

> This looks really good! Not entirely related, but I was wondering if there is some public attribute that reflects the status of the data (i.e. loaded vs. not loaded).

Yes, this could be useful, I'm not sure how to best show it though, since (with this PR) lazy loading is done on a per-signal basis 🤔

> Sometimes, it would also be useful to know the total size of the Edf object in memory. I don't think this is currently possible, right?

Currently not, no... maybe this could be combined with your first suggestion, i.e. display the size of the loaded signals and the total size on disk? E.g. <Edf 20 signals 0 annotations 50/100 MB loaded>
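
A minimal sketch of how such a repr string could be assembled. `SignalInfo` and `describe` are hypothetical names made up for this illustration and are not part of edfio; sizes are counted in units of 1,000,000 bytes.

```python
from dataclasses import dataclass


@dataclass
class SignalInfo:
    """Hypothetical stand-in for a signal; not edfio's EdfSignal."""

    nbytes: int   # size of the signal's samples on disk
    loaded: bool  # whether the samples are currently held in memory


def describe(signals: list[SignalInfo], num_annotations: int) -> str:
    """Build a repr-like summary showing loaded and total size in MB."""
    mb = 1_000_000
    total = sum(s.nbytes for s in signals)
    loaded = sum(s.nbytes for s in signals if s.loaded)
    return (
        f"<Edf {len(signals)} signals {num_annotations} annotations "
        f"{loaded // mb}/{total // mb} MB loaded>"
    )


# 20 signals of 5 MB each, the first 10 of which are loaded.
signals = [SignalInfo(nbytes=5_000_000, loaded=i < 10) for i in range(20)]
print(describe(signals, num_annotations=0))
# prints "<Edf 20 signals 0 annotations 50/100 MB loaded>"
```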

> Also, I assume loaded vs. not loaded applies to all signals that are part of an Edf object? That is, either the entire object is loaded or not (as opposed to some selected signals). Correct?

With the implementation suggested in this PR, it could be anything in between as well.

> Finally, and this is now least related to this PR (so let me know if you would like me to open a new issue), what is the public way to access the actual data? There's no such thing as Edf.data, and Edf.get_signal() requires a label. Sorry if I'm missing something obvious!

Right, currently there's nothing like that. While this could be nice for recordings with uniform sampling frequencies (just return a 2d-array), I'm not sure how to best treat those with differing ones (a list of arrays? fail? ...?). What use case are you thinking about for this? (-> a new issue would be nice here, yes)
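
To make the two options concrete, here is a rough sketch of one possible behaviour: stack into a 2d-array when all signals have the same length, otherwise fall back to a list of arrays. The helper `as_array_or_list` is hypothetical and only illustrates the trade-off; it is not an edfio function and not a decision made in this thread.

```python
import numpy as np


def as_array_or_list(signals):
    """Return a 2d-array for equally long signals, else a list of 1d-arrays."""
    lengths = {len(s) for s in signals}
    if len(lengths) == 1:
        # Uniform sampling frequency: one row per signal.
        return np.vstack(signals)
    # Differing sampling frequencies: the signals cannot share one 2d-array.
    return [np.asarray(s) for s in signals]


uniform = as_array_or_list([np.zeros(3), np.ones(3)])
mixed = as_array_or_list([np.zeros(3), np.ones(6)])
print(type(uniform).__name__, uniform.shape)
print(type(mixed).__name__, [len(s) for s in mixed])
```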

cbrnr · 2024-05-02T14:53:57Z

I've created a new issue to discuss accessing the data array in #49.

Regarding the other points, I think it would be nice not to overcomplicate the API. So you are saying it is currently possible to have some signals loaded in memory and some not, because each signal is treated as a separate EdfSignal (which is basically a memory-mapped NumPy array)? I'm thinking that for users, it might be useful to know how much memory their entire Edf object currently uses.

On the other hand, when loading an EDF file, either no signals or all signals are loaded in memory, so there is no way to influence individual signals being loaded with read_edf(). I guess the only way to load individual signals (from a lazy object) is to access a signal, but then the question is which methods trigger/require loading the data?

hofaflo · 2024-05-03T15:44:31Z

> [...] each signal is treated as a separate EdfSignal (which is basically a memory-mapped NumPy array)?

Exactly, with this PR the array stays memory mapped until the data is accessed for the first time.
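
As a rough illustration of that behaviour (not the PR's actual code), the sketch below keeps a `numpy.memmap` until the samples are first requested and then replaces it with an in-memory copy. The class `LazySignal`, its `is_loaded` property, and the headerless int16 file are assumptions made for this example only.

```python
import os
import tempfile

import numpy as np


class LazySignal:
    """Hypothetical illustration of lazy loading; not edfio's EdfSignal."""

    def __init__(self, path, offset, count):
        # np.memmap only maps the file; no samples are read into memory yet.
        self._digital = np.memmap(
            path, dtype="<i2", mode="r", offset=offset, shape=(count,)
        )

    @property
    def is_loaded(self):
        return not isinstance(self._digital, np.memmap)

    @property
    def digital(self):
        if not self.is_loaded:
            # First access: copy the mapped samples into an ordinary array.
            self._digital = np.array(self._digital)
        return self._digital


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.bin")
    np.arange(10, dtype="<i2").tofile(path)

    # Skip the first two 2-byte samples and map the remaining eight.
    signal = LazySignal(path, offset=4, count=8)
    loaded_before = signal.is_loaded
    values = signal.digital.tolist()
    loaded_after = signal.is_loaded

print(loaded_before, loaded_after, values)
```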

> I'm thinking that for users, it might be useful to know how much memory their entire Edf object currently uses.

Definitely! Is the suggestion for the extended repr in my above comment more or less what you're thinking about here?

> [...] which methods trigger/require loading the data?

With the currently suggested implementation, this would be

EdfSignal.data
EdfSignal.digital_data
EdfSignal.update_data()
Edf.update_data_record_duration() (all signals)
Edf.write() (all signals)
Edf.annotations (all annotation signals)
Edf.starttime (timekeeping annotation signal)

EDIT: slicing operations would also require loading the data (thanks @cbrnr!):

Edf.slice_between_annotations() (all signals)
Edf.slice_between_seconds() (all signals)

cbrnr · 2024-05-05T07:58:11Z

> Definitely! Is the suggestion for the extended repr in my above comment more or less what you're thinking about here?

Yes, this would be useful! I'd also expose this in an attribute for convenience. Plus, each underlying EdfSignal should also show its memory consumption in its repr (and attribute).

> With the currently suggested implementation, this would be
> ...

What about slicing between seconds or annotations? In any case, if the current memory consumption is available, users will always be able to find out which operation loads the data!

hofaflo · 2024-05-05T08:05:34Z

> Yes, this would be useful! I'd also expose this in an attribute for convenience. Plus, each underlying EdfSignal should also show its memory consumption in its repr (and attribute).

👍 Feel free to open a PR for that, ideally once this one is merged!

> What about slicing between seconds or annotations?

Right, thanks for pointing this out! I'll edit the above list. For non-annotation signals this could even be done without loading the data (as long as the desired slice only contains complete datarecords), though that would complicate the implementation a bit.
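
A sketch of why whole-record slices need not load anything: EDF stores data record by record, with each signal's samples for that record laid out one after another, so the data section can be mapped as a 2d-array of records, and basic slicing along the record axis returns another memory-mapped view. The headerless file, the two-signal layout, and all variable names below are assumptions for illustration; this is not how the PR implements slicing.

```python
import os
import tempfile

import numpy as np

samples_per_record = [4, 2]  # two hypothetical signals
num_records = 5
record_len = sum(samples_per_record)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    np.arange(num_records * record_len, dtype="<i2").tofile(path)

    # One row per data record, still backed by the file.
    records = np.memmap(
        path, dtype="<i2", mode="r", shape=(num_records, record_len)
    )

    # Keep records 1..3: basic slicing yields a view, not a copy.
    sliced = records[1:4]
    still_mapped = isinstance(sliced, np.memmap)

    # Only now copy the first signal's samples into memory.
    first_signal = np.array(sliced[:, : samples_per_record[0]]).ravel().tolist()

    # Drop the mapped views so the temporary file can be removed.
    del records, sliced

print(still_mapped, first_signal)
```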

cbrnr · 2024-05-07T15:57:46Z

Another question I had: if read_edf() loads the signal into memory, it actually loads the digital data, right? So I'm not sure how surprising it will be that the physical data consumes four times as much memory.
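
The factor of four can be checked with a small sketch, assuming the digital samples are 16-bit integers (as in EDF) and the physical values are held as float64. The conversion uses the standard EDF linear scaling between digital and physical ranges; the range values and variable names are made up for the example.

```python
import numpy as np

digital = np.array([-32768, 0, 32767], dtype=np.int16)

digital_min, digital_max = -32768, 32767
physical_min, physical_max = -500.0, 500.0  # hypothetical range, e.g. in µV

# Linear mapping from the digital range onto the physical range.
gain = (physical_max - physical_min) / (digital_max - digital_min)
physical = (digital.astype(np.float64) - digital_min) * gain + physical_min

# 2 bytes per int16 sample vs. 8 bytes per float64 sample.
ratio = physical.nbytes // digital.nbytes
print(digital.nbytes, physical.nbytes, ratio)
```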

What I'm trying to say is that this is becoming a bit complicated already, given that the original intention of lazy loading was that people sometimes work with EDF headers only, so they don't need the data. Maybe a simpler option would be to add a separate function read_edf_header() instead?

hofaflo · 2024-05-13T11:02:24Z

@cbrnr, let's move the discussion about memory consumption information into a new issue!

cbrnr · 2024-05-13T11:10:38Z

Sure! I just thought that maybe the current implementation is not even necessary because it adds too much complexity...

hofaflo · 2024-05-13T11:14:30Z

For the original use case described in #47 you're definitely right :D However, it's a nice way to also speed up working with only a (small) subset of signals.

marcoross linked an issue May 2, 2024 that may be closed by this pull request

Lazy load raw data #47

Closed

Implement lazy loading of raw data for local files

91e6348

marcoross force-pushed the 47-lazy-load-raw-data branch from b40dddb to 91e6348 Compare May 2, 2024 11:33

Update Changelog

4c33f55

marcoross requested a review from hofaflo May 2, 2024 11:38

cbrnr reviewed May 2, 2024

hofaflo reviewed May 2, 2024

cbrnr mentioned this pull request May 2, 2024

Add public interface to access data array #49

Open

marco.ross added 3 commits May 13, 2024 11:45

Refactor lazyloading, replace unnecessary _LazyLoader class by function.

a5d568b

Updated changelog to include entry for digital property

25350d2

simplified code for 'auto' mode for lazy_loading

0bfbad6

marcoross requested a review from hofaflo May 13, 2024 09:58

marcoross self-assigned this May 13, 2024

hofaflo approved these changes May 13, 2024

marcoross merged commit 70e9de8 into main May 13, 2024
9 checks passed

marcoross deleted the 47-lazy-load-raw-data branch May 13, 2024 11:23
