Return a structured array for the spikes in python #285

Merged: 11 commits into master from jblanco/spikes_structured_array, Aug 23, 2023

Conversation

jorblancoa
Contributor

  • Modify Spike object from std::pair to struct

Following the discussions in [BBPBGLIB-1044]
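A minimal sketch of the interface change (the file name and the 'All' population are illustrative, taken from the benchmarks later in this thread):

    import libsonata

    spikes = libsonata.SpikeReader("spikes.h5")  # hypothetical file name
    pop = spikes["All"]

    # existing interface: a list of (node_id, timestamp) tuples
    pairs = pop.get()

    # interface added by this PR: columnar arrays in one dict
    data = pop.get_dict()
    node_ids, timestamps = data["node_ids"], data["timestamps"]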

@jorblancoa jorblancoa marked this pull request as ready for review July 26, 2023 14:21
@ferdonline
Contributor

Thanks @jorblancoa. I think the interface is really nice. I'm however not very happy with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library".
To recap: we are converting from two columns in HDF5 to a vector of structs, then here back again to columns. Besides, in get(), there is an unconditional copy of all spikes for a filtering that is rarely requested. (report_reader.cpp#L136)
Maybe we should go back to requirements. Do we need the get() interface as is? Is it really worth having columnar data for data that is invariably two columns?

@mgeplf
Contributor

mgeplf commented Jul 28, 2023

I'm however not very happy with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library".

I expect there needs to be a new C++ method that builds these separated vectors, which is what I was getting at here: https://github.com/BlueBrain/neurodamus/pull/8/files#discussion_r1273085740

Besides, in get(), there is an unconditional copy of all spikes for a filtering that is rarely requested. (report_reader.cpp#L136)

That copy should be fixed. The filtering, however, is important.

Maybe we should go back to requirements. Do we need the get() interface as is? Is it really worth having columnar data for data that is invariably two columns?

On the C++ side, I think it's fine; keeping that API stable is important. On the Python side, a list of tuples is not great, but API stability is more important.

Once we have the new Python methods in place, we can start issuing warnings that people may get better performance by switching to them. Then, perhaps, we can deprecate the old interface in the future.
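A minimal sketch of what such a warning could look like (a hypothetical Python-side shim; in practice libsonata would presumably emit this from its bindings):

    import warnings

    def get_spikes(population, selection):
        # hypothetical shim: steer callers toward the faster columnar API
        warnings.warn(
            "get() returns a list of (node_id, timestamp) tuples; "
            "get_dict() returns columnar arrays and may be faster",
            PendingDeprecationWarning,
        )
        return population.get(selection)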

@jorblancoa jorblancoa force-pushed the jblanco/spikes_structured_array branch from d5be763 to 4b30e15 on August 2, 2023 09:49
@jorblancoa
Contributor Author

jorblancoa commented Aug 2, 2023

Shall we merge this in order to create a tag together with the LFP changes? @mgeplf

@mgeplf
Contributor

mgeplf commented Aug 3, 2023

I'm however not very happy with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library".

I think @ferdonline's point still stands: we should do this with a less convoluted data path before it gets committed.

@jorblancoa
Contributor Author

Does that belong in this PR?

@mgeplf
Contributor

mgeplf commented Aug 4, 2023

I would say so; the extra array created by https://github.com/BlueBrain/libsonata/pull/285/files#diff-cc07100b7c7235ddf263fda0d515a7b40ed1fe5b871714e0ab5c1ec5298ddb0dR1193 can be avoided if some internal methods were added to get the ids and timestamps separately.

@jorblancoa
Contributor Author

I would say so; the extra array created by https://github.com/BlueBrain/libsonata/pull/285/files#diff-cc07100b7c7235ddf263fda0d515a7b40ed1fe5b871714e0ab5c1ec5298ddb0dR1193 can be avoided if some internal methods were added to get the ids and timestamps separately.

Makes sense, but then it would be best to have them as members in report_reader.h, right? Otherwise spikes_ needs to be used anyway for filtering and then be converted to the two arrays.

@mgeplf
Contributor

mgeplf commented Aug 7, 2023

Makes sense, but then it would be best to have them as members in report_reader.h, right?

Yeah, I think I understand what you mean and that would make sense.

@jorblancoa
Contributor Author

Since we can't reuse the existing code for filtering, and Fernando's use case only needs the two raw arrays, would you be OK with that? @ferdonline @mgeplf

ferdonline previously approved these changes Aug 9, 2023
@mgeplf
Contributor

mgeplf commented Aug 10, 2023

Don't we still have the convoluted path of the data, which was one of the main things that needed to be improved?
It looks like getArrays calls get(), which makes an unconditional copy of all spikes via createSpikes creating pairs from the underlying node_ids and timestamps.
Back in the bindings, these pairs are destructured into two new arrays. Can’t we create arrays of only the data that is required (ie: based on the gids/times), so they aren’t large, and create them in the format that is required? Having an iterator would probably make that easier.
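A conceptual sketch of the columnar filtering being asked for here (numpy standing in for the C++ internals; all names are illustrative only):

    import numpy as np

    # the reader keeps two parallel columns instead of a vector of pairs
    node_ids = np.array([3, 1, 2, 1, 3])
    timestamps = np.array([0.1, 0.2, 0.25, 0.7, 1.3])

    # one mask selects from both columns: only the requested subset is
    # materialized, and no intermediate pair objects are built
    mask = np.isin(node_ids, [1, 3]) & (timestamps < 1.0)
    filtered = {"node_ids": node_ids[mask], "timestamps": timestamps[mask]}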

@ferdonline
Contributor

@mgeplf IIUC, at line 162 there's the fast path. If the client wants to filter the data, then the copy to vector<pair> happens, which IMO is fair, since that layout is ideal for sorting the two fields together. We could potentially further optimize the case of filtering without ordering and avoid the copy, but I don't think we have that use case now, so I'd agree to leave it as is and revisit it at a later stage.
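For reference, columnar data can also be sorted "together" without materializing pairs, via an index sort; a minimal numpy sketch of the idea (not the library's actual implementation):

    import numpy as np

    node_ids = np.array([3, 1, 2])
    timestamps = np.array([0.3, 0.1, 0.2])

    # index sort: order both columns by time without creating pairs
    order = np.argsort(timestamps)
    node_ids, timestamps = node_ids[order], timestamps[order]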

@mgeplf
Contributor

mgeplf commented Aug 10, 2023

But we want to be able to filter things; that’s an important use case. It seems weird to me that users of a library have to know there is a fast path, and that sorting the data takes them off it, rather than the library doing the right thing on its side.

On the neurodamus side, isn’t it useful to only load spikes that are going to be used in the simulation rather than loading all of them? For instance, if the simulation is going to run for 1s, but the spikes file contains 10s worth of spikes? What about the case where the simulation is only doing a subset of node_ids? Wouldn’t filtering the data before it all gets loaded be valuable? These are the sorts of optimizations that can be done here, and then they would benefit everyone using the library.

They’re also useful right now, as when people do analysis, usually only a subset of the data is looked at.
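A minimal sketch of that kind of up-front filtering from Python (the tstart/tstop time-window arguments are assumed from the gids/times discussion above; file and population names are illustrative):

    import libsonata

    spikes = libsonata.SpikeReader("spikes.h5")  # hypothetical file
    pop = spikes["All"]

    # load only the first second of spikes for a subset of node_ids,
    # instead of loading everything and filtering afterwards
    subset = pop.get(libsonata.Selection([1, 2, 3]), tstart=0.0, tstop=1.0)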

@jorblancoa
Contributor Author

Don't we still have the convoluted path of the data, which was one of the main things that needed to be improved?
It looks like getArrays calls get(), which makes an unconditional copy of all spikes via createSpikes creating pairs from the underlying node_ids and timestamps.
Back in the bindings, these pairs are destructured into two new arrays. Can’t we create arrays of only the data that is required (ie: based on the gids/times), so they aren’t large, and create them in the format that is required? Having an iterator would probably make that easier.

It makes sense. I wanted to reuse code, but it's true that when calling getArrays it doesn't make sense to call get() to build the pairs only for the filtering. I pushed some changes to filter directly in the getArrays() method.

@ferdonline
Contributor

On the neurodamus side, isn’t it useful to only load spikes that are going to be used in the simulation rather than loading all of them?

In neurodamus we can't always filter ahead of time because of setups like coreneuron + save-restore, so it was OK. But I get that maybe for other uses we want filters more often.

@jorblancoa jorblancoa closed this Aug 11, 2023
@jorblancoa jorblancoa reopened this Aug 11, 2023
@mgeplf
Contributor

mgeplf commented Aug 22, 2023

Comparing the get_dict() method to the get() method, it seems like there is a performance regression when getting a subset of ids:

    import libsonata

    # sr (a libsonata.SpikeReader), ids (the node-id subset), and the
    # timeit() helper come from the surrounding benchmark harness
    def read_all_subset_nodes():
        sp = sr['All']
        node_ids = libsonata.Selection(ids)
        spikes = sp.get(node_ids)

    def read_all_subset_nodes_structure():
        sp = sr['All']
        node_ids = libsonata.Selection(ids)
        spikes = sp.get_dict(node_ids)

    print("read_all_subset_nodes")
    timeit(read_all_subset_nodes)

    print("read_all_subset_nodes_structure")
    timeit(read_all_subset_nodes_structure)

Gives:

************* spikes-decimated-unsorted.h5
read_all_subset_nodes
1 loop, best of 5: 130.22718537412584 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.7277841931209 per loop

************* spikes-decimated-sorted-by-time.h5
read_all_subset_nodes
1 loop, best of 5: 130.0559630645439 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.6187489759177 per loop

************* spikes-decimated-sorted-by-ids.h5
read_all_subset_nodes
1 loop, best of 5: 96.04782155901194 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.36246633064 per loop

I'm not a big fan of different access mechanisms having very different execution profiles, because it means library consumers have to benchmark everything to know what to use, which isn't very ergonomic.

In this case, we'll have to fix it later.

@mgeplf
Contributor

mgeplf commented Aug 22, 2023

Comparing the get_dict() method to the get() method, it seems like there is a performance regression when getting a subset of ids:

I should also add that the fast path (ie: get_dict()) is much faster (new vs old): 0.36195594631135464 vs 9.994007629342377, so it's very worthwhile.

 - Move createSpikes to private scope
 - Use a struct instead of a pair for the SpikeTimes
 - Return a const ref when getting the raw arrays
@mgeplf
Contributor

mgeplf commented Aug 23, 2023

Nice, fixing the regression makes a big difference:

************* spikes-decimated-unsorted.h5
read_all_subset_nodes
1 loop, best of 5: 131.17122020898387 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.58019144897116 per loop
************* spikes-decimated-sorted-by-time.h5
read_all_subset_nodes
1 loop, best of 5: 131.06627149000997 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.91921105398796 per loop
************* spikes-decimated-sorted-by-ids.h5
read_all_subset_nodes
1 loop, best of 5: 96.16886089701438 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.54882664396428 per loop

@mgeplf mgeplf merged commit ab6c782 into master Aug 23, 2023
27 checks passed
@mgeplf mgeplf deleted the jblanco/spikes_structured_array branch August 23, 2023 08:58
@mgeplf
Contributor

mgeplf commented Aug 23, 2023

Thanks @jorblancoa!

jorblancoa added a commit to BlueBrain/neurodamus that referenced this pull request Oct 26, 2023
## Context
Use libsonata instead of h5py in order to read the spikes.
The new 'get_dict()' method is used to retrieve the 'node_ids' and the
'timestamps' of a spikes report.
(BlueBrain/libsonata#285)
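A minimal sketch of the new read path (file and population names are illustrative, not taken from the neurodamus code):

    import libsonata

    # replaces the previous h5py-based reading of the spikes file
    spikes = libsonata.SpikeReader("out.h5")
    data = spikes["All"].get_dict()
    node_ids, timestamps = data["node_ids"], data["timestamps"]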

## Review
* [x] PR description is complete
* [x] Coding style (imports, function length, New functions, classes or
files) are good
* [x] Unit/Scientific test added
* [ ] Updated Readme, in-code, developer documentation