refactor: add only new artifacts #48

Closed

Collaborator

@DmitriyLewen DmitriyLewen commented Dec 23, 2024

Description

Recently, we’ve been hitting the "too many requests" error more frequently (see https://github.com/aquasecurity/trivy-java-db/actions/runs/12458901714).
Increasing the number of parallel goroutines or reducing the delay between requests doesn't resolve the issue.

Therefore, this PR makes the crawler add only new artifacts to the database. Whether an artifact needs to be added is decided using the updateAt field from the cache/db/metadata.json file (that date minus one day, to account for the time needed to build the database).
If the cache/db directory is empty, the crawler saves all indexes.
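
The start-date rule described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `metadata` struct, the `UpdatedAt` JSON key, and the `crawlStartDate` helper are assumptions made for the example, and an empty input stands in for an empty cache/db directory.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// metadata mirrors only the field this sketch needs.
// The JSON key name is an assumption, not taken from the PR.
type metadata struct {
	UpdatedAt time.Time `json:"UpdatedAt"`
}

// crawlStartDate returns the date from which artifacts should be crawled.
// The boolean is false when there is no previous database, which means
// the crawler should save all indexes.
func crawlStartDate(raw []byte) (time.Time, bool, error) {
	if len(raw) == 0 {
		// No metadata.json content: treat it as an empty cache/db directory.
		return time.Time{}, false, nil
	}
	var m metadata
	if err := json.Unmarshal(raw, &m); err != nil {
		return time.Time{}, false, fmt.Errorf("decode metadata: %w", err)
	}
	if m.UpdatedAt.IsZero() {
		return time.Time{}, false, nil
	}
	// Step back one day to cover the time spent building the database.
	return m.UpdatedAt.AddDate(0, 0, -1), true, nil
}

func main() {
	start, ok, err := crawlStartDate([]byte(`{"UpdatedAt":"2024-12-23T00:10:00Z"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(ok, start.Format("2006-01-02"))
}
```

Returning a separate boolean rather than a zero time keeps the "full crawl" case explicit for the caller.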

Test runs:

  1. add new artifacts from 2024-10-13 - https://github.com/DmitriyLewen/trivy-java-db/actions/runs/12465006507/job/34790062252
  2. add new artifacts from 2024-12-22 - https://github.com/DmitriyLewen/trivy-java-db/actions/runs/12465457228/job/34791277401

@DmitriyLewen DmitriyLewen self-assigned this Dec 23, 2024
@@ -221,6 +227,27 @@ func (c *Crawler) Visit(ctx context.Context, url string) error {
return nil
}

// To avoid a large number of requests to the server, we should skip already saved artifacts (if the start date is specified).
// P.S. We do not need to check for updates, since artifacts are immutable
// see https://central.sonatype.org/publish/requirements/immutability
Collaborator

The linked document seems to apply to the Sonatype repository. Is this repository also immutable?
https://mvnrepository.com/repos/central

Collaborator Author

@DmitriyLewen DmitriyLewen Dec 25, 2024

I didn't find any official information about that, but I think the rules should be the same.
I did see an answer about it: https://stackoverflow.com/questions/40739939/dropping-a-release-from-public-maven-central

Another piece of indirect evidence is this answer (they had the same SHA-1 for several artifacts):
instead of changing the file, they released a new version.

Collaborator

@knqyf263 knqyf263 left a comment

Did you compare a database built from scratch with one built on top of the previous database? Did they match?

@knqyf263
Collaborator

I remember we created a small script to compare two databases, but I couldn't find it...
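
A comparison of this kind could be sketched as a set difference over the artifact keys of the two databases. This is not the script mentioned above: the `diffKeys` helper and the `group:artifact:version` key format are hypothetical, and loading the keys from the actual database files is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// diffKeys returns the keys found in only one of the two databases.
// Each result slice is de-duplicated and sorted.
func diffKeys(full, incremental []string) (onlyFull, onlyIncremental []string) {
	inFull := make(map[string]struct{}, len(full))
	for _, k := range full {
		inFull[k] = struct{}{}
	}
	inIncremental := make(map[string]struct{}, len(incremental))
	for _, k := range incremental {
		inIncremental[k] = struct{}{}
	}
	for k := range inFull {
		if _, ok := inIncremental[k]; !ok {
			onlyFull = append(onlyFull, k)
		}
	}
	for k := range inIncremental {
		if _, ok := inFull[k]; !ok {
			onlyIncremental = append(onlyIncremental, k)
		}
	}
	sort.Strings(onlyFull)
	sort.Strings(onlyIncremental)
	return onlyFull, onlyIncremental
}

func main() {
	// Hypothetical keys, used only to show the shape of the result.
	full := []string{"g:a:1.0", "g:a:1.1"}
	incremental := []string{"g:a:1.0", "g:b:2.0"}
	onlyFull, onlyIncremental := diffKeys(full, incremental)
	fmt.Println("missing from incremental build:", onlyFull)
	fmt.Println("missing from full build:", onlyIncremental)
}
```

If the incremental approach is sound, both result slices should be empty for databases built at the same time.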

@DmitriyLewen
Collaborator Author

Did you compare a database built from scratch with one built on top of the previous database? Did they match?

I will do that and report back.

@DmitriyLewen
Collaborator Author

I have bad news...
I built multiple DBs and compared them.

I found a problem with this idea: it looks like this date is not the date the artifact was added to Maven Central.
For example:

  1. https://repo.maven.apache.org/maven2/io/github/lubase/lubase-orm/1.4.4/ shows the date 2024-11-20,
    but the DBs from 2024-12-18 and 2024-12-08 don't have this artifact.
    This artifact was added between 2024-12-18 and 2024-12-20.
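
The failure can be illustrated with a small sketch. It assumes the filter skips any artifact whose listed date is earlier than the start date; the `shouldCrawl` helper and that exact comparison are assumptions for the example, not the code from this PR. The start date 2024-12-17 is the 2024-12-18 database date minus one day.

```go
package main

import (
	"fmt"
	"time"
)

const dateLayout = "2006-01-02"

// shouldCrawl reports whether an artifact whose directory listing shows
// listedDate would be picked up by a crawl that starts at startDate.
// Artifacts listed before the start date are skipped.
func shouldCrawl(listedDate, startDate string) (bool, error) {
	listed, err := time.Parse(dateLayout, listedDate)
	if err != nil {
		return false, fmt.Errorf("parse listed date: %w", err)
	}
	start, err := time.Parse(dateLayout, startDate)
	if err != nil {
		return false, fmt.Errorf("parse start date: %w", err)
	}
	return !listed.Before(start), nil
}

func main() {
	// lubase-orm 1.4.4 is listed with 2024-11-20 but was missing from the
	// database built on 2024-12-18, so the next incremental run would need it.
	ok, err := shouldCrawl("2024-11-20", "2024-12-17")
	if err != nil {
		panic(err)
	}
	fmt.Println(ok) // false: the artifact would be skipped and never added
}
```

Because the listed date can predate the moment the artifact became visible in the repository, a filter of this shape would permanently miss such artifacts.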

So I am closing this PR.

I have another idea.
I will check it and open a new PR.

@DmitriyLewen
Collaborator Author

@knqyf263 I opened #50
