Investigate using different Collections libraries for filter speed-up #230

pjeli · 2019-05-31T17:09:02Z

I have been reading up on and looking into the following repo for a while now: https://github.com/lacuna/bifurcan/.

There are some interesting properties here similar to the optimization done in the NameNode's LightweightGSet implementation; namely the use of contiguous memory regions to reduce memory size and speed up iteration.

I believe it may be possible to use this library to speedup:
(1) Bootstrap time.
(2) General collection time for all queries.
(3) Reduce memory usage overall from intermittent Map collections.

I believe #224 has done enough optimization for now and is not necessary for the next immediate release, but perhaps this can be for a 1.6.4 or 1.7.x release.

pjeli · 2019-05-31T17:11:08Z

Some early numbers show that for a GSet of 1M inodes, the lacuna.bifurcan.Map implementation is 1/4 the memory cost of a java.util.HashMap and also about 4x as fast at filtering and collecting files and directories.

This could be pretty substantial.

pjeli · 2019-06-05T01:45:22Z

Unfortunately the current lacuna.bifurcan Map, List, and Set implementations, and their derivatives, do not implement the java.util.Collection interface and therefore supporting them would involve re-writing the majority of NNA. Something I am not too keen on doing given the QueryEngine re-write I just went through.

I think I will hold off on this for now. Here's hoping that I find either a similar library that maps to Collection or that the author of lacuna.bifurcan's repo adds such implementations. 🤞

pjeli · 2020-01-06T21:58:12Z

I've been experimenting with microbenchmarks and a couple libraries. I think the path forward is going to be allowing a configuration option to experiment with different Collections.

I think a simple path forward is, by default, to use ConcurrentHashMap (the current Collection). Then to allow regular HashMap. Then Trove.

Trove has shown to take about as long as ConcurrentHashMap for loading while using alot less memory. It make take longer to run queries on however; need to experiment at scale.

Meanwhile, HashMap is the fastest for loading - uses less memory but close to Trove. However I expect to see issues when iterating over HashMap for queries.

…lter speed-up

…peed-up

pjeli added enhancement New feature or request help wanted Extra attention is needed labels May 31, 2019

pjeli added this to the 1.6.4 milestone Jun 4, 2019

pjeli removed this from the 1.6.4 milestone Jun 5, 2019

pjeli changed the title ~~Investigate using lacuna.bifurcan Maps for filter speed-up~~ Investigate using different Collections libraries for filter speed-up Jun 13, 2019

pjeli added a commit to pjeli/NNAnalytics that referenced this issue Jan 7, 2020

[paypal#230] Investigate using different Collections libraries for fi…

1329b31

…lter speed-up

pjeli mentioned this issue Jan 7, 2020

[#230] Investigate using different Collections libraries for filter speed-up #292

Merged

4 tasks

pjeli closed this as completed in #292 Jan 7, 2020

pjeli added a commit that referenced this issue Jan 7, 2020

[#230] Investigate using different Collections libraries for filter s…

5958598

…peed-up

pjeli mentioned this issue Apr 6, 2022

Introduce AirConcurrentMap as possible backing HashMap #308

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate using different Collections libraries for filter speed-up #230

Investigate using different Collections libraries for filter speed-up #230

pjeli commented May 31, 2019

pjeli commented May 31, 2019

pjeli commented Jun 5, 2019 •

edited

Loading

pjeli commented Jan 6, 2020

Investigate using different Collections libraries for filter speed-up #230

Investigate using different Collections libraries for filter speed-up #230

Comments

pjeli commented May 31, 2019

pjeli commented May 31, 2019

pjeli commented Jun 5, 2019 • edited Loading

pjeli commented Jan 6, 2020

pjeli commented Jun 5, 2019 •

edited

Loading