fix: check artifacts in DB to reduce number of sha1 queries #50

Merged

Conversation

DmitriyLewen
Collaborator

@DmitriyLewen DmitriyLewen commented Dec 26, 2024

Description

Recently, we’ve been hitting "too many requests" errors more frequently (https://github.com/aquasecurity/trivy-java-db/actions/runs/12458901714).
To reduce the number of requests, we need to check which versions are already in the DB.
If a version of an artifact has already been added, do not request the sha1 for that version.
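
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `artifactKey` type, the `versionsToFetch` function, and the in-memory `known` set are all hypothetical names, and it assumes the versions already stored in the DB have been loaded into a map beforehand.

```go
package main

import "fmt"

// artifactKey identifies one artifact version.
// Hypothetical type for this sketch, not taken from trivy-java-db.
type artifactKey struct {
	GroupID, ArtifactID, Version string
}

// versionsToFetch returns only the versions that are not yet in the known set,
// so a sha1 request is needed for new versions only.
func versionsToFetch(known map[artifactKey]struct{}, groupID, artifactID string, versions []string) []string {
	var missing []string
	for _, v := range versions {
		if _, ok := known[artifactKey{groupID, artifactID, v}]; ok {
			continue // already in the DB: skip the sha1 request
		}
		missing = append(missing, v)
	}
	return missing
}

func main() {
	// Assumed state: two versions were saved by an earlier run.
	known := map[artifactKey]struct{}{
		{"org.example", "demo", "1.0.0"}: {},
		{"org.example", "demo", "1.1.0"}: {},
	}
	fmt.Println(versionsToFetch(known, "org.example", "demo", []string{"1.0.0", "1.1.0", "1.2.0"}))
}
```

With this shape, the number of sha1 requests per run scales with the number of new versions rather than with the total number of versions.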

test builds:

@DmitriyLewen DmitriyLewen self-assigned this Dec 26, 2024
@DmitriyLewen
Collaborator Author

This is weird...
Even with these changes I still get "too many requests" errors: https://github.com/DmitriyLewen/trivy-java-db/actions/runs/12502776627/job/34882197681

But an hour ago I ran the action with the wrong DB path (so the scanner checked and saved all versions) and there were no errors: https://github.com/DmitriyLewen/trivy-java-db/actions/runs/12502425532/job/34881350678

@DmitriyLewen DmitriyLewen marked this pull request as ready for review January 9, 2025 04:29
@DmitriyLewen DmitriyLewen marked this pull request as draft January 9, 2025 04:32
@knqyf263
Collaborator

knqyf263 commented Jan 9, 2025

This is weird... even with these changes I get too many requests... - https://github.com/DmitriyLewen/trivy-java-db/actions/runs/12502776627/job/34882197681

But an hour ago I ran the action with the wrong DB path (so the scanner checked and saved all versions) and there were no errors -https://github.com/DmitriyLewen/trivy-java-db/actions/runs/12502425532/job/34881350678

Does it mean Maven Central Repository counts the overall number of requests, not by IP address?

@DmitriyLewen
Collaborator Author

Does it mean Maven Central Repository counts the overall number of requests, not by IP address?

I'm not sure about that.
Is it possible that some runners use the same IP?

I have seen this before and rechecked just now:
Maven Central currently returns a 429 error - https://github.com/DmitriyLewen/trivy-java-db/actions/runs/12683666619/job/35351108235

But locally it works for me without errors.

@knqyf263
Collaborator

knqyf263 commented Jan 9, 2025

OK, anyway, it's better to reduce HTTP requests. Did you confirm the database with this change is identical to the existing one?

@DmitriyLewen
Collaborator Author

I still can't build a new DB - https://github.com/DmitriyLewen/trivy-java-db/actions/runs/12683666619/job/35351108235

I will let you know once I have built and compared the DBs.

@DmitriyLewen
Collaborator Author

@knqyf263
I tested the DBs built from this PR.

I can confirm that the new DBs contain all artifacts from the current DB (plus new artifacts, of course).
The new DBs also build faster (1.30 - 2.30 hours): https://github.com/DmitriyLewen/trivy-java-db/actions
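
A comparison of this kind can be sketched as a superset check. This is only an illustration under stated assumptions: the `missingFrom` function is hypothetical, and both DBs are assumed to have been exported to maps from a `group:artifact:version` key to a sha1 string. The thread does not describe how the comparison was actually done.

```go
package main

import (
	"fmt"
	"sort"
)

// missingFrom returns the keys that exist in oldDB but are absent from newDB,
// or present with a different sha1. An empty result means newDB covers oldDB.
func missingFrom(oldDB, newDB map[string]string) []string {
	var missing []string
	for key, sha1 := range oldDB {
		if got, ok := newDB[key]; !ok || got != sha1 {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing) // map iteration order is random, so sort for stable output
	return missing
}

func main() {
	// Placeholder data, not real artifacts or digests.
	oldDB := map[string]string{
		"org.example:demo:1.0.0": "aaa",
		"org.example:demo:1.1.0": "bbb",
	}
	newDB := map[string]string{
		"org.example:demo:1.0.0": "aaa",
		"org.example:demo:1.1.0": "bbb",
		"org.example:demo:1.2.0": "ccc",
	}
	fmt.Println(len(missingFrom(oldDB, newDB)) == 0)
}
```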

@knqyf263
Collaborator

I'm looking into the central index, but the unpacking never finished. I kicked off the unpacking on the server again and am now waiting for it to complete.

$ java -jar indexer-cli.jar --unpack nexus-maven-repository-index.gz   --destination ./data/central-lucene-index --type full

@DmitriyLewen
Collaborator Author

DmitriyLewen commented Jan 13, 2025

IIRC we already thought about that when we started working on trivy-java-db.
I even started writing code in Java (https://github.com/DmitriyLewen/maven-indexes-saver).

But IIRC there were problems with Luke and with converting the index into JSON or saving it into an SQL DB.
And for Go there was no decent adapter for working with Luke.

@DmitriyLewen DmitriyLewen marked this pull request as ready for review January 13, 2025 07:05
@knqyf263
Collaborator

Since the purpose is different this time, I thought it would be great if we could simply retrieve the updated artifact with the Luke CLI. In any case, it's been two years since we started trivy-java-db, so I'm willing to give it another try.

@DmitriyLewen
Collaborator Author

Okay 👍
Let me know if you need help.
Maybe I can even remember something about my work on this 😄

@knqyf263
Collaborator

My server has 40 GB of space, but the unpacking failed because it ran out of disk space...

Exception in thread "Lucene Merge Thread #150" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: No space left on device
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:509)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
Caused by: java.io.IOException: No space left on device
        at java.base/java.io.RandomAccessFile.writeBytes(Native Method)
        at java.base/java.io.RandomAccessFile.write(RandomAccessFile.java:559)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:448)
        at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:99)
        at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:88)
        at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:113)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.close(FSDirectory.java:458)
        at org.apache.lucene.index.TermInfosWriter.close(TermInfosWriter.java:248)
        at org.apache.lucene.index.TermInfosWriter.close(TermInfosWriter.java:251)
        at org.apache.lucene.util.IOUtils.close(IOUtils.java:141)
        at org.apache.lucene.index.FormatPostingsFieldsWriter.finish(FormatPostingsFieldsWriter.java:70)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:432)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:108)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4263)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3908)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:388)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:456)
        Suppressed: java.io.IOException: No space left on device
                at java.base/java.io.RandomAccessFile.writeBytes(Native Method)
                at java.base/java.io.RandomAccessFile.write(RandomAccessFile.java:559)
                at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:448)
                at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:99)
                at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:88)
                at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:113)
                at org.apache.lucene.store.FSDirectory$FSIndexOutput.close(FSDirectory.java:458)
                at org.apache.lucene.util.IOUtils.close(IOUtils.java:141)
                at org.apache.lucene.index.FormatPostingsDocsWriter.close(FormatPostingsDocsWriter.java:134)
                at org.apache.lucene.index.FormatPostingsTermsWriter.close(FormatPostingsTermsWriter.java:71)
                ... 8 more
                Suppressed: java.io.IOException: No space left on device
                        at java.base/java.io.RandomAccessFile.writeBytes(Native Method)
                        at java.base/java.io.RandomAccessFile.write(RandomAccessFile.java:559)
                        at org.apache.lucene.store.FSDirectory$FSIndexOutput.flushBuffer(FSDirectory.java:448)
                        at org.apache.lucene.store.BufferedIndexOutput.flushBuffer(BufferedIndexOutput.java:99)
                        at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:88)
                        at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:113)
                        at org.apache.lucene.store.FSDirectory$FSIndexOutput.close(FSDirectory.java:458)
                        at org.apache.lucene.util.IOUtils.close(IOUtils.java:141)
                        at org.apache.lucene.index.FormatPostingsPositionsWriter.close(FormatPostingsPositionsWriter.java:87)
                        ... 11 more

@DmitriyLewen
Collaborator Author

DmitriyLewen commented Jan 13, 2025

I downloaded nexus-maven-repository-index.866.gz.
The archive size is 9.7 MB; after extraction it is 139 MB.

It looks like all the indexes together need at least 35 GB.
But this is just an estimate. It is possible that you will need more space (possibly much more).

PS
Maven Central is listed at about 50 TB (https://mvnrepository.com/repos/central).
Also, goneall mentioned about 45 TB (aquasecurity/trivy#8118 (reply in thread)).

Collaborator

@knqyf263 knqyf263 left a comment

I gave up on the Maven index again in 2025 😄 Feel free to merge this PR.

@DmitriyLewen DmitriyLewen merged commit 793673c into aquasecurity:main Jan 14, 2025
3 checks passed
@DmitriyLewen DmitriyLewen deleted the fix/check-artifacts-in-db branch January 14, 2025 08:44
2 participants