
upload: make "digesting" in-band for zarr uploads OR just speed/cache zarr checksumming? #903

Closed
yarikoptic opened this issue Feb 9, 2022 · 6 comments

Comments

@yarikoptic
Member

yarikoptic commented Feb 9, 2022

A .zarr folder could be large in size and number of files, and computing its checksum (even once) requires a substantial investment before the upload even starts.

Some summary notes on the current design

  • blobs
    • globally de-duplicated across dandisets based on their checksums
  • zarr files
    • are not intended to be de-duplicated ATM, so there is per se no API to find a "matching" one
      • it is possible to go through the checksums of GET /zarr records to find a "matching" one, but those could mutate in place ATM, so this must not be relied on
      • eventually we must make some zarrs immutable for publishing
    • the current implementation always mints a new zarr_id (UUID) for any given name
    • the name provided by dandi-cli is just a file name (it lacks the path to it)
    • on S3 the zarr_id is used as the "prefix", not the name, so we cannot conflict with/modify an existing zarr
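
For orientation, the zarr checksum referred to below is conceptually a digest over the per-file digests of the whole tree. Here is a minimal illustrative sketch of that idea; the simplification is mine (the actual dandi-zarr-checksum scheme nests per-directory checksums and entry counts), so do not treat this as the real algorithm:

```python
import hashlib
import os

def file_md5(path: str) -> str:
    """MD5 of a single file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_checksum(root: str) -> str:
    """Combine per-file digests (sorted by relative path) into one digest.

    Illustrative only -- not the actual dandi-zarr-checksum algorithm.
    """
    entries = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            entries.append((rel, file_md5(full)))
    h = hashlib.md5()
    for rel, digest in sorted(entries):
        h.update(f"{rel}:{digest}\n".encode())
    return h.hexdigest()
```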

Upload logic and checksumming

Overall, ATM the upload relies on the computed digest to decide whether a given asset (blob or zarr) needs to be uploaded at all, since we first get information from the server/metadata about whether an asset with that etag already exists. I think this logic is generally applicable/beneficial to both blobs and zarr folders, e.g. in cases of copying/moving zarr folders within/across dandisets. But we might need to make some sacrifices in the case of zarr, especially since we are not "deduplicating" them ATM.

In detail, some notes/observations from going through the upload code logic:

  • C1: Having an etag (unconditionally) computed, the upload code then uses it, when an asset already exists at the path (in overwrite and refresh modes), to decide whether the upload/change to the asset should be skipped entirely. So, in principle,
    • etag computation could be delayed until/if needed -- i.e. until we know an asset at the path exists, is of the same size, and, in the case of zarr, has the same number of files (if those differ, we know the content is different and upload regardless of any checksum match); see the sketch after this list
  • if we are to proceed with the upload,
    • for a blob we do need the etag to initiate the upload, so we "compute" it again (an fscached call of get_dandietag, so not really). The server can return 409 (and a blob_id) if a blob with that etag exists, and we produce an asset which uses that blob_id.
      • so for blobs we cannot avoid pre-computing the etag, but it too could be delayed until/if needed
    • for a zarr upload
      • C2: we do use the full zarr checksum computed before to error out if it differs from the one on the server (since ATM we do not support "updates"), i.e. to check that the zarr did not mutate while we were at it.
        • It seems there is no fscacher involved for the whole-zarr digest, so we do it a 2nd time (in addition to C1), but fscacher caches results per each file in the zarr, so some speedup might be there at the cost of many inodes!
        • It is conditional on dandi-zarr-checksum already being known in the metadata (on the server?), so checksumming could in principle be delayed until this point (but shouldn't be done a 2nd time if it was done in C1) to decide, similarly to C1, whether we need to error out
        • Q1: it is not clear to me (@jwodder?) why we would want to proceed with the upload if the checksum is known and "the same", since how would the server know the full checksum unless we had finished uploading the entire zarr properly?
      • We do not really need the full zarr checksum/digest to initiate the upload. We do provide per-file digests here.
      • At the end (after the /complete for the last batch) we verify that the locally known checksum is identical to the one on the server
      • If the checksums are identical -- we create an asset; if not -- we error out
      • Let's sacrifice the checksum check at C1 for zarr and always do the iter_upload even if there is already a zarr asset at the path. Then we can avoid the expensive "digesting". The upload logic will guard against uploading unless "overwrite" or "refresh" mode is specified
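
A rough sketch of the delayed-digest decision described in C1 above. All names here (the remote_asset attributes and the callables) are hypothetical stand-ins for illustration, not dandi-cli's actual API:

```python
def should_skip_zarr_upload(remote_asset, local_size, local_file_count,
                            compute_checksum):
    """Decide whether an existing remote zarr asset makes the upload unnecessary.

    remote_asset is assumed to expose .size, .file_count and .zarr_checksum
    (attribute names are hypothetical); compute_checksum is a zero-argument
    callable performing the expensive whole-tree digest, invoked only when
    cheaper comparisons cannot already rule out a match.
    """
    if remote_asset is None:
        return False          # nothing at that path: upload, no digest needed yet
    if remote_asset.size != local_size:
        return False          # sizes differ: content differs, skip digesting
    if remote_asset.file_count != local_file_count:
        return False          # file counts differ: content differs
    # only now pay for the expensive whole-tree checksum
    return compute_checksum() == remote_asset.zarr_checksum
```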

Solution #1: avoid pre-checksumming of zarr files

  • C1 should completely avoid checksumming of zarr archives
  • proceed to upload in "overwrite" unconditionally (no checksum for zarr),
  • and in "refresh" only if newer (but again - no checksum)
  • remove the C2 safeguard for zarrs,
  • compute the zarr checksum "in-band" while uploading (see the sketch below),
  • verify correctness of the upload based on that. UX will be
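
A minimal sketch of the "in-band" digesting idea: accumulate per-file digests as batches are uploaded and only combine/compare at the very end. The callables (upload_batch, combine_digests, fetch_remote_checksum) are hypothetical stand-ins, not dandi-cli functions:

```python
def upload_zarr_in_band(batches, upload_batch, combine_digests,
                        fetch_remote_checksum):
    """Upload all batches, computing the whole-zarr checksum as a by-product.

    batches: iterable of lists of (relative_path, local_file) entries;
    upload_batch: uploads one batch and returns {relative_path: file_digest};
    combine_digests: folds per-file digests into the whole-zarr checksum;
    fetch_remote_checksum: asks the server for its checksum after /complete.
    All four callables are assumptions for illustration.
    """
    per_file_digests = {}
    for batch in batches:
        # digests come back from the upload itself -- no pre-digesting pass
        per_file_digests.update(upload_batch(batch))
    local_checksum = combine_digests(per_file_digests)
    remote_checksum = fetch_remote_checksum()
    if local_checksum != remote_checksum:
        # per Solution #1: complain only after the fact
        raise RuntimeError(
            f"zarr checksum mismatch: local {local_checksum} != "
            f"remote {remote_checksum}"
        )
    return local_checksum
```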

@jwodder - is my analysis above correct (note question Q1), or what other aspects am I missing? And do the needed changes/developments make sense in order to avoid computing the zarr checksum before upload?

Solution #2: do not make "zarr" too special, just speed things up

In principle, if checksumming of entire directories is made more efficient, and fscaching of them more efficient (con/fscacher#66), I do not see how checksumming a 'zarr' directory of e.g. X GB is performance-wise any different from checksumming a blob of the same X GB, which we already do/require. The above "avoidance" of checksumming zarr directories prior to upload makes them "special", but maybe not for a good reason: in principle we can already make them "deduplicated" quite easily, and maybe could eventually add some flag "immutable: bool" which would be set to True whenever a full upload is finalized, thus letting them be immutable forever (the API should then refuse to modify an immutable zarr_id) -- needed for publishing anyways. Maybe we should just keep that digesting, speed it up/fscache the result, and add an API to dandi-archive to get a zarr given a checksum (as we do for blobs)?
Also, in some cases pre-checksumming of zarrs could be avoided, e.g. if we know that it is different (based on size, number of files), so we could instead optimize the upload logic to delay checksumming until really needed.
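
To illustrate the Solution #2 direction, here is a sketch of caching the whole-directory checksum under a cheap stat-based fingerprint, so that re-checksumming an unchanged zarr costs only a directory walk of stat calls. The fingerprint scheme and cache layout are assumptions for illustration, not how fscacher currently works:

```python
import json
import os

def dir_fingerprint(root: str) -> str:
    """Cheap stat-based fingerprint of a directory tree (assumption: file
    count, total size and latest mtime suffice to detect modifications)."""
    n_files, total_size, latest_mtime = 0, 0, 0.0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            n_files += 1
            total_size += st.st_size
            latest_mtime = max(latest_mtime, st.st_mtime)
    return f"{n_files}:{total_size}:{latest_mtime}"

def cached_tree_checksum(root: str, cache_file: str, compute) -> str:
    """Return the cached whole-tree checksum if the fingerprint still matches,
    otherwise recompute via compute(root) and update the cache."""
    fp = dir_fingerprint(root)
    try:
        with open(cache_file) as f:
            cache = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        cache = {}
    entry = cache.get(root)
    if entry and entry["fingerprint"] == fp:
        return entry["checksum"]
    checksum = compute(root)
    cache[root] = {"fingerprint": fp, "checksum": checksum}
    with open(cache_file, "w") as f:
        json.dump(cache, f)
    return checksum
```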

WDYT @satra? Also

  • I do not think there is much of a point in providing a longer "name" (e.g. including the folder from the top of the dataset) or demanding a dandiset_id field for the purposes of this particular issue.
@jwodder
Member

jwodder commented Feb 9, 2022

@yarikoptic Is C2 referring to this code?

dandi-cli/dandi/files.py, lines 790 to 793 in 49e33af:

```python
digest = metadata.get("digest", {})
if "dandi:dandi-zarr-checksum" in digest:
    if digest["dandi:dandi-zarr-checksum"] != filetag:
        raise RuntimeError(
```

metadata is here the metadata for the asset that's about to be uploaded, not the metadata for a pre-existing asset on the server.

> It seems there is no fscacher involved so we do it 2nd time (in addition to C1)!

While the digest for the Zarr as a whole is not cached, the digests for the individual files are cached via fscacher.
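
For context, the per-file caching pattern amounts to memoizing each file's digest under its (size, mtime), so that changing one file forces only one re-digest. The sketch below is a self-contained stand-in for that pattern, not fscacher's actual API:

```python
import hashlib
import os

# in-memory stand-in for a persistent cache: path -> ((size, mtime), digest)
_digest_cache: dict[str, tuple[tuple[int, float], str]] = {}

def cached_file_digest(path: str) -> str:
    """Return the file's MD5, recomputing only when its size or mtime changed."""
    st = os.stat(path)
    key = (st.st_size, st.st_mtime)
    cached = _digest_cache.get(path)
    if cached and cached[0] == key:
        return cached[1]
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    _digest_cache[path] = (key, digest)
    return digest
```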

@yarikoptic
Member Author

> metadata is here the metadata for the asset that's about to be uploaded, not the metadata for a pre-existing asset on the server.

oh -- so it is just to check that the file (well -- directory) was not modified since the time we "looked" at it to get the digest in the metadata, right? So maybe we should sacrifice that, and instead compare to the final checksum (after confirming that the server has the same one) and, on mismatch, just issue a warning (it would be lame to completely discard the upload of possibly many GBs).

> It seems there is no fscacher involved so we do it 2nd time (in addition to C1)!

> While the digest for the Zarr as a whole is not cached, the digests for the individual files are cached via fscacher.

ah, good to know. But I think that in the longer term we should fscache the state/checksum of the entire zarr: caching individual files of a zarr might not really be a benefit, since it comes with the high cost of the cache needing thousands of individual records (inodes), while modifications of individual files within a zarr are rare/non-existent, so fscaching individual files would generally be "too costly". WDYT?

@jwodder
Member

jwodder commented Feb 9, 2022

@yarikoptic If we don't cache any of the digests for the files in a Zarr, then whenever a single file in a Zarr is changed, we would have to redigest every file again.

> oh -- so it is just to check that the file (well -- directory) was not modified since the time we "looked" at it to get the digest in the metadata, right?

Correct. We do the same thing for blobs.

@satra
Member

satra commented Feb 9, 2022

i'll "digest" the rest of the post later, but on the caching issue, i think there should be a single tree-cache in a data structure stored in a binary file that's efficiently edited when files change (added/deleted). could be something as simple as an SQLite record.

storing each file's checksum is indeed an inode catastrophe on most filesystems. i'm sure there are solutions out there, and it seems this will also be similar to the API server. however, s3 itself doesn't have any inode limits.
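
A minimal sketch of such a single tree-cache using SQLite (one database file per tree, so a handful of inodes total regardless of how many files the zarr has); the table and column names are made up for illustration:

```python
import sqlite3

def open_checksum_db(db_path: str) -> sqlite3.Connection:
    """One SQLite file holds the per-file checksums for the whole tree."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS checksums (
               relpath TEXT PRIMARY KEY,
               size INTEGER,
               mtime REAL,
               digest TEXT
           )"""
    )
    return conn

def upsert_checksum(conn, relpath, size, mtime, digest):
    """Insert or update a single file's record when it is added or changed."""
    conn.execute(
        "INSERT INTO checksums (relpath, size, mtime, digest) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(relpath) DO UPDATE SET "
        "size = excluded.size, mtime = excluded.mtime, digest = excluded.digest",
        (relpath, size, mtime, digest),
    )
    conn.commit()

def forget_checksum(conn, relpath):
    """Drop the record when a file is deleted from the tree."""
    conn.execute("DELETE FROM checksums WHERE relpath = ?", (relpath,))
    conn.commit()
```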

@jwodder
Member

jwodder commented Jul 27, 2022

@yarikoptic Can this be considered resolved by #915 and #923?

@yarikoptic
Member Author

let's consider it resolved indeed for the moment
