Improve how geolocation DB files are downloaded/updated #2308

acelaya · 2024-12-15T09:06:29Z

Part of #2124

Change logic to determine if the GeoLite2 db file needs to be downloaded, in an attempt to address recurrent issues where people report they are hitting GeoLite's download API limits.

Current approach has two main problems:

- The condition to perform the download is based on the existing GeoLite db file metadata, which includes the time at which it was built. If it's older than 35 days, Shlink tries to download it.
  This has caused problems in the past, due to stateful services that keep referencing an old version of the metadata, and database builds released with incorrect metadata, which make Shlink think they are outdated.
- There's no historical information about previous downloads, which means Shlink could retry and fail indefinitely in case of an error.

This PR introduces a new table in the database where Shlink tracks download attempts, and their results (success or error).

Using this, Shlink can avoid subsequent attempts if there has been a number of consecutive download errors, making sure there's no more than a small number of attempts per day.

Additionally, it can also use this to know when the last successful attempt happened, and download again after a reasonable number of days.
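As a rough illustration of the tracking table, here is a minimal sketch using Python's built-in `sqlite3` module. Only the table name `geolocation_db_updates` and the `date_created` / `date_updated` columns come from this PR; the `status` and `filesystem_id` columns, the index name and both helper functions are assumptions made for the example, not Shlink's actual schema or code.

```python
import sqlite3
from datetime import datetime

# Hypothetical schema: the table name and the two date columns appear in this
# PR, everything else (status, filesystem_id, index name) is assumed.
SCHEMA = """
CREATE TABLE geolocation_db_updates (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    date_created TEXT NOT NULL,
    date_updated TEXT NOT NULL,
    status TEXT NOT NULL,
    filesystem_id TEXT NOT NULL
);
CREATE INDEX idx_geolocation_date_updated
    ON geolocation_db_updates (date_updated);
"""


def record_attempt(conn: sqlite3.Connection, status: str, filesystem_id: str, when: datetime) -> None:
    """Store one download attempt and its result ('success' or 'error')."""
    stamp = when.isoformat()
    conn.execute(
        "INSERT INTO geolocation_db_updates"
        " (date_created, date_updated, status, filesystem_id)"
        " VALUES (?, ?, ?, ?)",
        (stamp, stamp, status, filesystem_id),
    )


def latest_attempts(conn: sqlite3.Connection, filesystem_id: str, limit: int = 15) -> list:
    """Return the most recent attempts for one filesystem, newest first."""
    cursor = conn.execute(
        "SELECT status, date_updated FROM geolocation_db_updates"
        " WHERE filesystem_id = ?"
        " ORDER BY date_updated DESC LIMIT ?",
        (filesystem_id, limit),
    )
    return cursor.fetchall()
```

Sorting on the indexed `date_updated` column keeps the "last N attempts" lookup cheap as the table grows. The timestamps are stored as ISO 8601 strings, which sort chronologically as long as they all use the same UTC offset.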

The exact rules of the algorithm introduced here are:

- Get the last 15 download attempts.
- If the max amount of consecutive errors has been reached (15), skip the download if the last one is less than 2 days old.
- If there are no attempts at all or the database file itself does not exist, try to download the database.
- If the last attempt is an error but we haven't reached the max amount of errors, try to download the database.
- If the last attempt was successful and is older than 30 days, try to download the database.
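The rules above can be sketched as a single decision function. This is an illustrative Python sketch, not the code in this PR: the `Attempt` class, the `download_reason` function and the constant names are hypothetical. The rules do not say what happens once the max amount of errors has been reached and the last one is 2 or more days old; the sketch assumes a retry is allowed in that case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence

MAX_CONSECUTIVE_ERRORS = 15
ERROR_COOLDOWN = timedelta(days=2)
MAX_SUCCESS_AGE = timedelta(days=30)


@dataclass(frozen=True)
class Attempt:
    """One tracked download attempt (hypothetical shape)."""
    succeeded: bool
    date_updated: datetime


def download_reason(attempts: Sequence[Attempt], db_file_exists: bool, now: datetime) -> Optional[str]:
    """Return why a download should be tried, or None to skip it.

    `attempts` must be ordered newest first; only the last 15 are considered.
    """
    recent = list(attempts)[:MAX_CONSECUTIVE_ERRORS]
    # With only 15 attempts inspected, the max is reached when all of them failed.
    max_errors_reached = (
        len(recent) == MAX_CONSECUTIVE_ERRORS
        and all(not attempt.succeeded for attempt in recent)
    )
    if max_errors_reached and now - recent[0].date_updated < ERROR_COOLDOWN:
        return None
    if not recent:
        return "No download attempts exist"
    if not db_file_exists:
        return "The database file does not exist"

    last = recent[0]
    if not last.succeeded:
        if max_errors_reached:
            # Assumption: the rules do not cover this case explicitly;
            # the sketch retries once the 2-day cooldown has passed.
            return "Max consecutive errors were reached, but the last one is at least 2 days old"
        return "Last attempt was error but no max consecutive errors were reached"
    if now - last.date_updated > MAX_SUCCESS_AGE:
        return "Last attempt was successful but it's older than 30 days"
    return None
```

Returning the reason instead of a plain boolean lets the caller log why a download was triggered. The sketch ignores attempts that are still in progress.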

Todo

- Test performance as the table grows (EXPLAIN in MySQL shows the index on date_updated may not be used when ordering the result. Maybe id needs to be used instead).
  EDIT: The index is in fact being used, which makes the query very efficient. With 2M rows, sorting by date_updated (which is indexed) takes 0.001s, while sorting by date_created (which is NOT indexed) takes more than 3s.
- Wrap process in transaction and lock using the database rather than a symfony locker.
  EDIT: I tried to lock via SELECT ... FOR UPDATE and a database transaction, but it didn't result in properly locked rows. I need to investigate further, but I'll keep the filesystem lock for now.
- Verify the mechanism to resolve the filesystem ID is working as expected and resolves different values for, e.g., docker containers.
- Log reason for which a database download was triggered:
  - No download attempts exist.
  - The database file does not exist.
  - Last attempt was error but no max consecutive errors were reached.
  - Last attempt was successful but it's older than 30 days.

codecov · 2024-12-16T18:53:06Z

Codecov Report

Attention: Patch coverage is 98.86364% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.71%. Comparing base (9e34183) to head (509ef66).
Report is 9 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ore/src/Geolocation/Entity/GeolocationDbUpdate.php | 96.15% | 1 Missing ⚠️ |

Additional details and impacted files

@@              Coverage Diff              @@
##             develop    #2308      +/-   ##
=============================================
+ Coverage      93.66%   93.71%   +0.05%     
- Complexity      1660     1680      +20     
=============================================
  Files            275      276       +1     
  Lines           5791     5839      +48     
=============================================
+ Hits            5424     5472      +48     
  Misses           367      367

acelaya added 7 commits December 13, 2024 10:33

Create new table to track geolocation updates

d4d97c3

Refactor geolocation download logic based on database table

a77e07f

Simplify geolocation_db_updates indexes

f10a9d3

Fix some cases of database download in GeolocationDbUpdater

853c50a

Handle differently when trying to update geolocation and already in progress

72a962e

Track reason for which a geolocation db download was attempted

e715a0f

Fix GeolocationDbUpdater test

509ef66

acelaya marked this pull request as ready for review December 16, 2024 19:15

acelaya changed the title ~~Feature/geolocation updates~~ Improve how geolocation DB files are downloaded/updated Dec 16, 2024

acelaya merged commit d533adf into shlinkio:develop Dec 16, 2024
23 checks passed

acelaya deleted the feature/geolocation-updates branch December 16, 2024 19:25
