upload: make "digesting" in-band for zarr uploads OR just speed/cache zarr checksumming? #903
Comments
@yarikoptic Is … (lines 790 to 793 in 49e33af)

While the digest for the Zarr as a whole is not cached, the digests for the individual files are cached via fscacher.
oh -- so it is just to check that the file (well -- directory) was not modified since the time we "looked" at it to get the digest in metadata, right? So maybe we should sacrifice that, and instead compare to the final checksum (after confirming that the server has the same) and, if they mismatch, just issue a warning (it would be lame to completely discard the upload of possibly many GBs).
ah, good to know. But I think that in the longer term we should fscache the state/checksum of the entire zarr, since caching individual files of a zarr might not really be a benefit: it comes at the high cost of the cache needing thousands of individual records (inodes), whereas modifications of individual files within a zarr are rare/non-existent, so fscaching individual files would generally be "too costly". WDYT?
@yarikoptic If we don't cache any of the digests for the files in a Zarr, then whenever a single file in a Zarr is changed, we would have to redigest every file again.
Correct. We do the same thing for blobs.
i'll "digest" the rest of the post later, but on the caching issue, i think there should be a single tree-cache in a datastructure stored in a binary file that's efficiently edited when files change (added/deleted). could be something as simple as an sqlite record. storing each file's checksum is indeed an inode catastrophe on most filesystems. i'm sure there are solutions out there, and it seems this will also be similar to the API server. however, s3 itself doesn't have any inode limits.
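For concreteness, a minimal sketch of the kind of single-file tree cache suggested above, assuming an SQLite backing store; the schema and helper names here are hypothetical and are not part of fscacher or dandi-cli:

```python
import hashlib
import sqlite3
from pathlib import Path


def open_cache(cache_path: Path) -> sqlite3.Connection:
    """Open (or create) a single-file SQLite cache of per-file digests."""
    conn = sqlite3.connect(str(cache_path))
    conn.execute(
        """CREATE TABLE IF NOT EXISTS digests (
               relpath  TEXT PRIMARY KEY,  -- path relative to the zarr root
               size     INTEGER,           -- st_size when the digest was taken
               mtime_ns INTEGER,           -- st_mtime_ns when the digest was taken
               md5      TEXT               -- cached digest of the file contents
           )"""
    )
    return conn


def cached_file_digest(conn: sqlite3.Connection, root: Path, relpath: str) -> str:
    """Return the MD5 of root/relpath, re-reading the file only if its
    size or mtime changed since the digest was cached."""
    path = root / relpath
    st = path.stat()
    row = conn.execute(
        "SELECT size, mtime_ns, md5 FROM digests WHERE relpath = ?", (relpath,)
    ).fetchone()
    if row and row[0] == st.st_size and row[1] == st.st_mtime_ns:
        return row[2]  # cache hit: one SQLite row instead of one cache inode per file
    md5 = hashlib.md5(path.read_bytes()).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO digests VALUES (?, ?, ?, ?)",
        (relpath, st.st_size, st.st_mtime_ns, md5),
    )
    conn.commit()
    return md5
```

When only a few files inside a zarr change, only the corresponding rows get rewritten, which is the "efficiently edited when files change (added/deleted)" property being asked for.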
@yarikoptic Can this be considered resolved by #915 and #923?
let's consider it resolved indeed for the moment
Original issue description

A .zarr folder could be large in size/number of files and require a substantial investment to checksum (even once) before the upload even starts to commence.
Some summary notes on current design

- blobs: …
- zarr: … files
  - `GET /zarr` records to find one "matching", but those could mutate in place ATM, thus it must not be relied on (a rough client-side sketch follows this list)
  - `publish`ing … `zarr_id` (UUID) for any given `name`
  - `name` provided by dandi-cli is just a file name (it lacks the path to it)
  - `zarr_id` is used for the "prefix", not the `name`, so we cannot conflict with/modify an existing zarr
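As a rough illustration of the "find one matching record" step noted above, here is what a client-side lookup could look like; the endpoint shape, query parameter, and response fields are assumptions made for this sketch, not a documented dandi-archive contract:

```python
from typing import Iterator, Optional

import requests

API_URL = "https://api.dandiarchive.org/api"  # instance URL assumed for the sketch


def iter_zarr_records(name: str, api_url: str = API_URL) -> Iterator[dict]:
    """Yield zarr records whose name matches, following pagination.

    Such records can mutate in place, so a match found here must not be
    treated as a stable identity for the zarr's content.
    """
    url: Optional[str] = f"{api_url}/zarr/"
    params: Optional[dict] = {"name": name}
    while url:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        page = resp.json()
        yield from page.get("results", [])
        url = page.get("next")  # absolute URL of the next page, or None
        params = None  # the "next" URL already carries the query parameters
```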
Upload logic and checksumming

Overall, ATM `upload` relies on the computed digest to decide whether a given asset (blob or zarr) is to be uploaded at all, since we first get information from the server/metadata on whether an asset with that etag already exists. I think this logic is generally applicable/beneficial to both blobs and zarr folders, e.g. in cases of copying/moving zarr folders within/across dandisets. But we might need to make some sacrifices in the case of zarr, especially since we are not "deduplicating" them ATM.

In detail, while going through the upload code logic, some notes/observations:
- C1: Having an etag (unconditionally) computed, `upload` code then uses it, in case of an asset already existing at the path (in `overwrite` and `refresh` modes), to decide whether that upload/change to the asset should be skipped entirely. So, in principle, … `get_dandietag` …, so not really). The server can return 409 (and a blob_id) if a blob with that etag exists, and we produce an asset which uses that blob_id.
- C2: we do use the full zarr checksum to error out if the zarr checksum is different from the one ~~on the server, since ATM we do not support "updates"~~ computed before, i.e. to check that the zarr did not mutate while we were at it. ~~It seems there is no fscacher involved, so~~ we do it a 2nd time (in addition to C1), but fscacher cached results per each file in the zarr, so some speed-up might be there, at the cost of many inodes!
- … `dandi-zarr-checksum` already to be known to metadata (on the server?), so checksumming could in principle be delayed (but shouldn't be done a 2nd time if it was done in C1) till this point, to decide, similarly as in C1, whether we need to error out (see the sketch after this list for the general shape of such a tree checksum)
- (… `/complete` for the last batch) we verify that the locally known checksum is identical to the one on the server
- … C1 for zarr, always `iter_upload`ing even if there is a zarr asset at the path already. Then we can avoid doing expensive "digesting". `upload` logic will guard against uploading unless "overwrite" or "refresh" modes are specified.
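To make the cost under discussion concrete, below is a simplified sketch of tree-style checksumming of a zarr directory: per-file digests rolled up into per-directory digests. It illustrates the general approach only and does not reproduce the exact dandi-zarr-checksum format; the point is that every file must be read once, which is what makes pre-checksumming a many-GB, many-thousand-file zarr expensive and per-file caching attractive.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """MD5 of a single file's contents (the expensive, I/O-bound part)."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_checksum(directory: Path) -> str:
    """Roll per-entry digests up into a single digest for the directory.

    Every file under the tree must be read, so computing this up front for a
    large zarr is costly, while caching per-file digests pays off when only a
    few files change between uploads.
    """
    entries = []
    for child in sorted(directory.iterdir()):
        digest = tree_checksum(child) if child.is_dir() else file_md5(child)
        entries.append(f"{child.name}:{digest}")
    return hashlib.md5("\n".join(entries).encode()).hexdigest()
```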
Solution #1: avoid pre-checksumming of zarr files

- C1 should completely avoid checksumming of zarr archives (a rough sketch of such a flow follows this section)
- C2 safeguard for zarrs, …

@jwodder - is my analysis correct above (note the Q1 question), or what other aspects am I missing, and do the needed changes/developments make sense to avoid computing the zarr checksum before upload?
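A rough sketch of the flow that Solution #1 (together with the "just issue a warning" suggestion in the comments above) seems to point at: digest each file in-band while it is uploaded, combine the digests at the end, and compare against the server's checksum instead of pre-checksumming the whole tree. All function names here are hypothetical placeholders, not existing dandi-cli APIs.

```python
import logging
from pathlib import Path
from typing import Callable, Dict

log = logging.getLogger(__name__)


def upload_zarr_without_prechecksum(
    zarr_dir: Path,
    upload_file: Callable[[Path], str],  # uploads one file, returns its digest
    combine_digests: Callable[[Dict[str, str]], str],  # per-file digests -> zarr checksum
    fetch_server_checksum: Callable[[], str],  # checksum reported by the server at the end
) -> None:
    """Upload every file first, computing digests in-band, then verify once."""
    digests: Dict[str, str] = {}
    for path in sorted(p for p in zarr_dir.rglob("*") if p.is_file()):
        relpath = str(path.relative_to(zarr_dir))
        digests[relpath] = upload_file(path)  # digest falls out of the upload itself

    local = combine_digests(digests)
    remote = fetch_server_checksum()
    if local != remote:
        # Don't discard an upload of possibly many GBs; surface the mismatch instead.
        log.warning("zarr checksum mismatch: local %s != server %s", local, remote)
```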
Solution #2: do not make "zarr" too special, just speed things up

In principle, if checksumming of entire directories is made more efficient, and fscaching of them more efficient (con/fscacher#66), I do not see how checksumming of a 'zarr' directory of e.g. X GB is performance-wise different from checksumming of a blob of the same X GB, and we do/require it for blobs. The above "avoidance" of checksumming zarr directories prior to upload makes them "special", but maybe not for so good a reason -- in principle we can already make them "deduplicated" quite easily, and maybe we could eventually add some flag "immutable: bool" which would be set to `True` whenever a full upload is finalized, thus letting them be immutable forever (the API should then refuse to modify an immutable zarr_id) -- needed for publishing anyway. Maybe we should just keep that digesting, speed it up / fscache the result, and add an API to dandi-archive to get a zarr given a checksum (as we do for blobs)?

Also, in some cases pre-checksumming of zarrs could be avoided, e.g. if we know that it is different (based on size, number of files), so we could instead optimize the upload logic to delay checksumming until really needed.
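Since the "get a zarr given a checksum" API proposed here does not exist yet, the following is a purely hypothetical sketch of how client-side deduplication could look if it did; the endpoint, query parameter, and field names are invented for illustration.

```python
from typing import Optional

import requests

API_URL = "https://api.dandiarchive.org/api"  # instance URL assumed for the sketch


def find_zarr_by_checksum(checksum: str, api_url: str = API_URL) -> Optional[str]:
    """Return the zarr_id of an existing, immutable zarr with this checksum, if any.

    Mirrors how blobs are deduplicated by etag today; the /zarr/?checksum=...
    query and the "immutable" flag are the proposed additions, not current API.
    """
    resp = requests.get(f"{api_url}/zarr/", params={"checksum": checksum})
    resp.raise_for_status()
    for record in resp.json().get("results", []):
        if record.get("immutable"):  # only reuse zarrs frozen after a finalized upload
            return record["zarr_id"]
    return None  # no match: proceed with a fresh upload
```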
WDYT @satra? Also … `dandiset_id` field for the point of this particular issue.