Return a structured array for the spikes in python #285

Merged: 11 commits into master from jblanco/spikes_structured_array, Aug 23, 2023

Conversation

jorblancoa
Contributor

  • Modify Spike object from std::pair to struct

Following the discussions in [BBPBGLIB-1044]
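A minimal sketch of the interface change (the file name and the 'All' population are illustrative, taken from the benchmarks later in this thread):

    import libsonata

    spikes = libsonata.SpikeReader("spikes.h5")  # hypothetical file name
    pop = spikes["All"]

    # existing interface: a list of (node_id, timestamp) tuples
    pairs = pop.get()

    # interface added by this PR: columnar arrays in one dict
    data = pop.get_dict()
    node_ids, timestamps = data["node_ids"], data["timestamps"]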

@jorblancoa jorblancoa marked this pull request as ready for review July 26, 2023 14:21
@ferdonline
Contributor

Thanks @jorblancoa. I think the interface is really nice. I'm however not very happy with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library".
To recap: we are converting from two columns in HDF5 to a vector of structs, then here back again to columns. Besides, in get(), there is an unconditional copy of all spikes for a filtering that is rarely requested. (report_reader.cpp#L136)
Maybe we should go back to requirements. Do we need the get() interface as is? Is it really worth having columnar data for data that is invariably two columns?

@mgeplf
Contributor

mgeplf commented Jul 28, 2023

I'm however not very happy with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library".

I expect there needs to be a new C++ method that builds these separated vectors, which is what I was getting at here: https://github.com/BlueBrain/neurodamus/pull/8/files#discussion_r1273085740

Besides, in get(), there is an unconditional copy of all spikes for a filtering that is rarely requested. (report_reader.cpp#L136)

That copy should be fixed. The filtering, however, is important.

Maybe we should go back to requirements. Do we need the get() interface as is? Is it really worth having columnar data for data that is invariably two columns?

On the C++ side, I think it's fine; keeping that API stable is important. On the Python side, a list of tuples is not great, but API stability is more important.

Once we have the new Python methods in place, we can start issuing warnings that people may get better performance by switching to them. Then, perhaps, we can deprecate the old interface in the future.
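A minimal sketch of what such a warning could look like (a hypothetical Python-side shim; in practice libsonata would presumably emit this from its bindings):

    import warnings

    def get_spikes(population, selection):
        # hypothetical shim: steer callers toward the faster columnar API
        warnings.warn(
            "get() returns a list of (node_id, timestamp) tuples; "
            "get_dict() returns columnar arrays and may be faster",
            PendingDeprecationWarning,
        )
        return population.get(selection)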

@jorblancoa jorblancoa force-pushed the jblanco/spikes_structured_array branch from d5be763 to 4b30e15 on August 2, 2023 09:49
@jorblancoa
Contributor Author

jorblancoa commented Aug 2, 2023

Shall we merge this in order to create a tag together with the LFP changes? @mgeplf

@mgeplf
Contributor

mgeplf commented Aug 3, 2023

I'm however not very happy with the convoluted path of the data, which I know is not from this PR, but I would expect better from an "optimized library".

I think @ferdonline's point still stands: we should do this with a less convoluted data path before it gets committed.

@jorblancoa
Contributor Author

Does that belong in this PR?

@mgeplf
Contributor

mgeplf commented Aug 4, 2023

I would say so; the extra array created by https://github.com/BlueBrain/libsonata/pull/285/files#diff-cc07100b7c7235ddf263fda0d515a7b40ed1fe5b871714e0ab5c1ec5298ddb0dR1193 can be avoided if some internal methods were added to get the ids and timestamps separately.

@jorblancoa
Contributor Author

I would say so; the extra array created by https://github.com/BlueBrain/libsonata/pull/285/files#diff-cc07100b7c7235ddf263fda0d515a7b40ed1fe5b871714e0ab5c1ec5298ddb0dR1193 can be avoided if some internal methods were added to get the ids and timestamps separately.

Makes sense, but then it would be best to have them as members in report_reader.h, right? Otherwise spikes_ needs to be used anyway for filtering and then be converted to the two arrays.

@mgeplf
Contributor

mgeplf commented Aug 7, 2023

Makes sense, but then it would be best to have them as members in report_reader.h, right?

Yeah, I think I understand what you mean and that would make sense.

@jorblancoa
Contributor Author

Since we can't reuse the existing code for filtering, and Fernando's use case only needs the two raw arrays, would you be OK with that? @ferdonline @mgeplf

ferdonline previously approved these changes Aug 9, 2023
@mgeplf
Contributor

mgeplf commented Aug 10, 2023

Don't we still have the convoluted path of the data, which was one of the main things that needed to be improved?
It looks like getArrays calls get(), which makes an unconditional copy of all spikes via createSpikes creating pairs from the underlying node_ids and timestamps.
Back in the bindings, these pairs are destructured into two new arrays. Can’t we create arrays of only the data that is required (ie: based on the gids/times), so they aren’t large, and create them in the format that is required? Having an iterator would probably make that easier.
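A conceptual sketch of the columnar filtering being asked for here (numpy standing in for the C++ internals; all names are illustrative only):

    import numpy as np

    # the reader keeps two parallel columns instead of a vector of pairs
    node_ids = np.array([3, 1, 2, 1, 3])
    timestamps = np.array([0.1, 0.2, 0.25, 0.7, 1.3])

    # one mask selects from both columns: only the requested subset is
    # materialized, and no intermediate pair objects are built
    mask = np.isin(node_ids, [1, 3]) & (timestamps < 1.0)
    filtered = {"node_ids": node_ids[mask], "timestamps": timestamps[mask]}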

@ferdonline
Contributor

@mgeplf IIUC, at line 162 there's the fast path. If the client wants to filter the data, then the copy to vector<pair> happens, which IMO is fair, since that layout is ideal for sorting the two fields together. We could potentially further optimize the case of filtering without ordering and avoid the copy, but I don't think we have that use case now, so I'd agree to leave it as is and revisit it at a later stage.
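For reference, columnar data can also be sorted "together" without materializing pairs, via an index sort; a minimal numpy sketch of the idea (not the library's actual implementation):

    import numpy as np

    node_ids = np.array([3, 1, 2])
    timestamps = np.array([0.3, 0.1, 0.2])

    # index sort: order both columns by time without creating pairs
    order = np.argsort(timestamps)
    node_ids, timestamps = node_ids[order], timestamps[order]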

@mgeplf
Contributor

mgeplf commented Aug 10, 2023

But we want to be able to filter things; that’s an important use case. It seems weird to me that users of a library have to know there is a fast path, and that sorting the data takes them off it, rather than the library doing the right thing on its side.

On the neurodamus side, isn’t it useful to only load spikes that are going to be used in the simulation rather than loading all of them? For instance, if the simulation is going to run for 1s, but the spikes file contains 10s worth of spikes? What about the case where the simulation is only doing a subset of node_ids? Wouldn’t filtering the data before it all gets loaded be valuable? These are the sorts of optimizations that can be done here, and then they would benefit everyone using the library.

They’re also useful right now, as when people do analysis, usually only a subset of the data is looked at.
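A minimal sketch of that kind of up-front filtering from Python (the tstart/tstop time-window arguments are assumed from the gids/times discussion above; file and population names are illustrative):

    import libsonata

    spikes = libsonata.SpikeReader("spikes.h5")  # hypothetical file
    pop = spikes["All"]

    # load only the first second of spikes for a subset of node_ids,
    # instead of loading everything and filtering afterwards
    subset = pop.get(libsonata.Selection([1, 2, 3]), tstart=0.0, tstop=1.0)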

@jorblancoa
Contributor Author

Don't we still have the convoluted path of the data, which was one of the main things that needed to be improved?
It looks like getArrays calls get(), which makes an unconditional copy of all spikes via createSpikes creating pairs from the underlying node_ids and timestamps.
Back in the bindings, these pairs are destructured into two new arrays. Can’t we create arrays of only the data that is required (ie: based on the gids/times), so they aren’t large, and create them in the format that is required? Having an iterator would probably make that easier.

It makes sense. I wanted to reuse code, but it's true that when calling getArrays it doesn't make sense to call get() to build the pairs only for the filtering. I pushed some changes to filter directly in the getArrays() method.

@ferdonline
Contributor

On the neurodamus side, isn’t it useful to only load spikes that are going to be used in the simulation rather than loading all of them?

In neurodamus we can't always filter ahead of time because of setups like coreneuron + save-restore, so it was OK. But I get that maybe for other uses we want filters more often.

@jorblancoa jorblancoa closed this Aug 11, 2023
@jorblancoa jorblancoa reopened this Aug 11, 2023
@mgeplf
Contributor

mgeplf commented Aug 22, 2023

Comparing the get_dict() method to the get() method, it seems like there is a performance regression when getting a subset of ids:

    import libsonata

    # sr (a libsonata.SpikeReader), ids (the node-id subset), and the
    # timeit() helper come from the surrounding benchmark harness
    def read_all_subset_nodes():
        sp = sr['All']
        node_ids = libsonata.Selection(ids)
        spikes = sp.get(node_ids)

    def read_all_subset_nodes_structure():
        sp = sr['All']
        node_ids = libsonata.Selection(ids)
        spikes = sp.get_dict(node_ids)

    print("read_all_subset_nodes")
    timeit(read_all_subset_nodes)

    print("read_all_subset_nodes_structure")
    timeit(read_all_subset_nodes_structure)

Gives:

************* spikes-decimated-unsorted.h5
read_all_subset_nodes
1 loop, best of 5: 130.22718537412584 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.7277841931209 per loop

************* spikes-decimated-sorted-by-time.h5
read_all_subset_nodes
1 loop, best of 5: 130.0559630645439 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.6187489759177 per loop

************* spikes-decimated-sorted-by-ids.h5
read_all_subset_nodes
1 loop, best of 5: 96.04782155901194 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 972.36246633064 per loop

I'm not a big fan of different access mechanisms having very different execution profiles, because it means library consumers have to benchmark everything to know what to use, which isn't very ergonomic.

In this case, we'll have to fix it later.

@mgeplf
Contributor

mgeplf commented Aug 22, 2023

Comparing the get_dict() method to the get() method, it seems like there is a performance regression when getting a subset of ids:

I should also add that the fast path (ie: get_dict()) is much faster (new vs old): 0.36195594631135464 vs 9.994007629342377, so it's very worthwhile.

 - Move createSpikes to private scope
 - Use a struct instead of a pair for the SpikeTimes
 - Return a const ref when getting the raw arrays
@mgeplf
Contributor

mgeplf commented Aug 23, 2023

Nice, fixing the regression makes a big difference:

************* spikes-decimated-unsorted.h5
read_all_subset_nodes
1 loop, best of 5: 131.17122020898387 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.58019144897116 per loop
************* spikes-decimated-sorted-by-time.h5
read_all_subset_nodes
1 loop, best of 5: 131.06627149000997 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.91921105398796 per loop
************* spikes-decimated-sorted-by-ids.h5
read_all_subset_nodes
1 loop, best of 5: 96.16886089701438 per loop
read_all_subset_nodes_structure
1 loop, best of 5: 129.54882664396428 per loop

@mgeplf mgeplf merged commit ab6c782 into master Aug 23, 2023
27 checks passed
@mgeplf mgeplf deleted the jblanco/spikes_structured_array branch August 23, 2023 08:58
@mgeplf
Contributor

mgeplf commented Aug 23, 2023

Thanks @jorblancoa!

jorblancoa added a commit to BlueBrain/neurodamus that referenced this pull request Oct 26, 2023
## Context
Use libsonata instead of h5py in order to read the spikes.
The new 'get_dict()' method is used to retrieve the 'node_ids' and the
'timestamps' of a spikes report.
(BlueBrain/libsonata#285)
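A minimal sketch of the new read path (file and population names are illustrative, not taken from the neurodamus code):

    import libsonata

    # replaces the previous h5py-based reading of the spikes file
    spikes = libsonata.SpikeReader("out.h5")
    data = spikes["All"].get_dict()
    node_ids, timestamps = data["node_ids"], data["timestamps"]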

## Review
* [x] PR description is complete
* [x] Coding style (imports, function length, New functions, classes or
files) are good
* [x] Unit/Scientific test added
* [ ] Updated Readme, in-code, developer documentation