-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement: caching version of org.spdx.storage.listedlicense.SpdxListedLicenseWebStore and associated classes #193
Comments
Thanks @pmonks for the suggested solution. I've also run into this performance problem and would welcome a solution. If we could make the opt-in version a superclass of the existing webstore and just add caching it would be a pretty straight-forward implementation - we would need to investigate a bit more to see if this approach would work. Once implemented, we could add an option or environment variable for tools-java to use the cache - it would speed up the command line utility quite a bit. |
FWIW I implemented exactly this kind of caching before I discovered Am I right in thinking that this enhancement might be possible solely via internal implementation changes within the |
I think that should work. I do want to make this optional with the default being uncached - there are some scenerios where having a cache could introduce security vulnerabilities. |
Oh yeah I agree with making it optional - was more just trying to understand the effort involved in implementing this. If it's just internal implementation of a single method I can probably give it a go, despite my very rusty Java. |
@pmonks - I think your implementation approach should work, so feel free to create a PR. We should also add documentation to the README on the caching and how to enable / disable it. |
So before I launch into this, I wanted to discuss some micro-design questions with you @goneall:
|
@goneall I went ahead with a first cut at this, based on my thoughts on each of the above 5 points - you can find it here. Would love any/all comments you might like to share! You can test this branch out with the command: $ mvn -Dorg.spdx.storage.listedlicense.SpdxListedLicenseWebStore.enableCache=true test -Ptest |
@pmonks - sorry for the late reply - really busy day yesterday. I'll take a look at the code later today - but I do want to point you to some existing code to handle properties you can leverage: Spdx-Java-Library/src/main/java/org/spdx/library/model/license/ListedLicenses.java Line 77 in ae14e24
It solves a similar problem where we have a property to force using a local copy of the licenses - it was requested by someone who wanted to use the library totally offline (air gap security). I forgot about that implementation until you raised the questions above. |
@goneall absolutely no problem at all! None of this work is urgent on my end, so please don't feel compelled to even reply until you have an appropriate time to look at this!
I might be misunderstanding, but that sounds like a different use case to mine. I do want to use the online versions of the various files (rather than the ones baked into the library's JAR file), but I want them cached locally on disk so that the download cost is only incurred once per SPDX license list version (rather than once per JVM invocation, as is currently the case). Separately (and unrelated to this branch), I've noticed that some URLs are downloaded repeatedly during the unit tests. Does that suggest there's a bug in the in-memory storage of this downloaded content (i.e. something doesn't realise that that content is already downloaded, and therefore downloads it again)? It seems to mostly (only?) be the JSON files for individual licenses and exception (e.g. https://spdx.org/licenses/Apache-2.0.json is downloaded several times, redundantly). While this new caching logic mitigates that bug, I'm pretty certain there would be a benefit to fixing it in the non-caching code paths too. Perhaps I should raise this in a separate issue? |
Agree - different use case, but I think you can use the same infrastructure for properties since they both involve how we store and access the license information - e.g., add an additional property to the property file.
I don't think the individual JSON files are cached, so it may re-fetch every time. I'm not sure if I could call it a bug - but it certainly could be optimized. |
Ah ok - I'll check that code out. That said, property files are difficult to manipulate from downstream code that wants to programmatically set these behaviours (rather than relying on a human operator to do so), so that rings little alarm bells for me. A more typical way to do this is to use Java properties, or (better yet) provide "init" type methods (or even constructor args) that accept these options - those are also a lot more amenable for use in DI frameworks (such as Spring). |
Latest update adds a "Configuration options" section to the readme, and also enables control over how often each cache entry gets rechecked for staleness (via an ETag request). |
@pmonks Thanks for pulling together the caching implementation. I took a look at the proposed changes and the overall approach looks good. I have a couple of suggestions:
I'm not sure if we need to add any locks around the caching - the class should support multi-threading and we are accessing some share metadata files, but I can't think of any scenarios where we would have a damaging race condition. If we want to be safe and add locking, we can use the |
BTW - I looked into the previously mentioned properties file implementation for the offline license model store. It turns out that it does accept system property It's been a while, but I think the reason we used both approach is to allow someone to make a change in the local properties file and not have to always specify a system property when running. I should probably document this in the README similar to how you documented the caching properties. |
The behaviour I've implemented is faithful to the XDG Base Directory Specification, which says that if the
✅ - I'll make that change now.
✅ - I'll make that change now. Also, interesting trivia from my testing this morning: enabling the cache avoids downloading 613 files totaling ~20MB on each use of the library, which (on my laptop) takes the initialisation time from ~4 minutes down to ~1 second. |
Ok cool - I'll go look into that now and try to replicate it. As long as downstream code can programmatically modify the behaviour, I'm good with whatever approach you decide we should go with.
Yeah makes sense. Putting this in constructor args / setters and then punting the specifics of how those methods get called right out of the library is probably ideal going forward - that's the most DI-friendly approach, and lets DI users leverage all of the weird and wonderful mechanisms DI frameworks support for setting these kinds of configuration options (which go well beyond just properties files and Java system properties). That's probably a longer term / bigger refactor type of activity though, so I'd probably prefer not to tackle it as part of this cache work.
I was wondering whether the GitHub project might benefit from a wiki for this kind of thing? While the README is great, it might become unwieldy as we start adding this kind of information, and a wiki lets us start to structure the information into separate categories / pages (e.g. "usage guide", "configuration guide", etc.). |
Actually - the code there is fine. I missed the enclosing parenthesis around the conditional - I originally thought it was just defaulting to I did notice one new issue when looking at this. Your should change the hard coded file separators to system dependent separators - you can use Java Path or replace the |
Good catch, and now fixed! It's clearly been too long since I had to use Windows. ;-) |
@goneall I've taken a look at how the existing properties configuration logic functions, and think it will become pretty messy to extend it in place to support the WebStore. Instead, I think we should create a separate, dedicated configuration management class that any other classes ( |
Completely agree with refactoring out the config into a separate class. Your proposed naming sounds good. |
Ok I think that change is now done too. Let me know if you have any feedback on it! I also merged in the latest changes to |
I'm thinking there may be a cleaner way to handle the confusing different prefixes on the system properties. In the properties file, we could just use a simple unqualified string (e.g. Let me know what you think. BTW - the old prefix SPDXParser was the name of the java package years ago - we just kept it in the property name for compatibility. It would be nice to change it to match the current code, but I'm afraid it may break some implementations. |
My sense is that having "smart" logic like that may actually be confusing for an implementer who's looking at someone's configuration i.e. it wouldn't be obvious that Then there's also the possibility of possible conflicts in the file itself. If some new class elsewhere in the library also has a configuration property called And then there's the bigger question of potentially ditching all of this anyway in favour of a more DI-compatible approach (which wouldn't use system properties or property files at all - that gets externalised out of the library and becomes the sole responsibility of the DI framework). IOW how much do we want to invest in code that might eventually go away anyway? tl;dr - while I'm ok with keeping that one property special cased (to preserve backwards compatibility), I think going forward we should discourage that kind of thing |
All good points - and I agree. If you can create a PR, I'll take one more review before merging. |
@goneall ok PR created, after some more refactoring and a lot of polishing of JavaDocs. 😉 Probably the biggest change since you last looked was that I took the liberty of refactoring out the entire DownloadCache into its own class (a bit like the Configuration class) - this makes both the cache logic and the WebStore logic a lot more self-explanatory, and also opens up the possibility of using the DownloadCache for any other classes elsewhere in the library that might retrieve content from URLs (though there aren't any other examples of that at this point, as best I can tell). |
I'm using
org.spdx.storage.listedlicense.SpdxListedLicenseWebStore
as a way to access all of the listed license and listed exception information (including templates etc.), and am finding it quite slow to download everything each time it runs - over a minute on my laptop. This is obviously a problem for short-lived processes (such as unit tests) that use SPDX listed license and exception information.In profiling the code I see there's an opportunity for a substantial performance saving by caching all of the files the
SpdxListedLicenseWebStore
and other supporting classes download locally, and subsequently reading them from the local cache (if present), instead of re-downloading them each time.Proposed Solution
Implement a new opt-in version of
org.spdx.storage.listedlicense.SpdxListedLicenseWebStore
that utilises a local cache. This would involve:If-None-Match
HTTP requests for the license and exception list JSON files, using the ETags from the metadata files saved in step 1. Basically we use these two JSON files as a proxy for whether any of the online data has changed and therefore the local cache is stale.Footnotes
* the XDG Base Directory Specification should be followed here, so the cache directory might be determined using code such as:
The text was updated successfully, but these errors were encountered: