-
Notifications
You must be signed in to change notification settings - Fork 789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PVF: Avoid clearing the artifacts cache on restart #685
Comments
As far as I understand, the main concern about not clearing the artifact cache is:
Not that we can guarantee nobody will overwrite them during the node run time. I believe the cache pruning was implemented as a measure to mitigate those unwanted outcomes. But I don't have a strong opinion if it makes any sense. If someone wants to screw things up, he will. |
This will then require that we don't dispute when loading from the cache fails (which should be implemented nevertheless) and that invalid cache values are being recreated. |
I'm curious, how deterministic are artifact builds? |
Right now, they are not deterministic at all, but we're actively fighting for that. There are several levels of non-determinism, and we're trying to find a way to handle each of them.
|
Around 1, is there some need for randomness like to prevent the PVF doing rowhammer attacks upon nodes or whatever? If so, we could select the epoch randomness from the epoch after the PVF was uploaded, so the flow goes: PVF uploaded during epoch n. Wait for randomness during epoch n+1. Artifact builds start in epoch n+2 after the end of epoch n+1 is finalized. We could optimize this flow considerably if you actually want this. Around 3 & 4, there are not so many combinations here really, assuming 1 gets solved, right? If so, then we could sign a hash of our artifact build when we vote for the PVF. I'm not sure if this really helps that much, but it's maybe useful information when debugging, and maybe relevant elsewhere. I donno.. |
Yeah, this can already happen if the artifact goes missing for some reason. We assume that if we prepared an artifact, it remains there on-disk until we prune it, i.e. we never check again if it's still there. We can change it so that instead of artifact-not-found triggering a dispute, we retry once (like we do for AmbiguousWorkerDeath). And when enqueuing an execute job we check for the artifact on-disk, and start preparation if not found. Of course, on node restart we would just recreate the list of known artifacts based on what's on-disk. But someone could still wipe the cache between that point and an execution attempt, which indeed would lead to disputes.
Yep, kind of similar to above point, as someone can screw it up by just modifying/deleting files. But the concern of overwritten files can also be mitigated. One simple way is to keep track of exact file modified time, and we could theoretically even make this persist across restart. (Though at that point we should move to using a persistent database.) When loading a file we make sure the times match what we expect. Someone would really have to be intentionally sabotaging their own system to change the file attributes. And speaking of time attributes, we could also use the last-read attribute to recompute the TTL of artifacts on restart. |
A dispute should never be raised if the local cache doesn't provide a certain artifact. You can not dispute based on this reason, as it is a local hardware issue and not related to the candidate to check. |
Yes, we believe we, or someone like us, built the artifact in the past. We'll de facto no-show if we cannot rebuild the artifact, which sounds fine I guess. |
Well, currently I'm not even sure we can influence the randomness inside Cranelift, but the idea is interesting, I'll investigate.
I hope 3) is not a problem anymore, and 4) is not a problem yet. But if we see at some point in time that people try to run validators on ARM or whatever, we'll need to find a way to handle it properly.
Do you mean the filesystem's "last access" attribute? It cannot be relied on. In server configurations, its usage is often switched off through filesystem mounting options to increase the filesystem's throughput. |
Ah yeah, absolutely. Raised paritytech/polkadot#6959 as this should be fixed regardless.
Maybe we can use it if present, and otherwise just reset the artifact to the default TTL. Alternatively, we could address this, as well as data integrity concerns, by storing last time accessed and a hash of the file contents + last time accessed, in the file name itself. |
As a hack, you could cargo patch getrandom to replace As an aside, I find this cap-rand crate bizarre btw, like you might derandomize something, but it should always be secure to give out system randomness, so restricting it as "ambient authority" makes no sense. |
We cannot tell if it's active or not If we want to be on the safe side, it makes sense to implement @koute's proposal to add version number to the artifact path, it solves all the possible issues |
I don't think it solves all the data integrity issues that have been brought up, though IMO those are almost orthogonal to this issue. I raised a separate ticket: #677. And IMO, to keep the scope low here we should just reset the artifact TTL to the default on restart. For this issue we can add the version number to the filename, simple enough. Did I miss any other objections to clearing the cache? |
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/exploring-alternatives-to-wasm-for-smart-contracts/2434/11 |
…h#699) * SNO-324: Enable multiple accounts in outbound channel contract (paritytech#685) * Track nonces by source address * Remove principal from basic outbound channel * Update relayer bindings * Track nonces by origin address * Fix outbound channel tests * SNO-325: Enable multiple accounts in inbound channel pallet (paritytech#687) * Add origin field to message decoding * Track nonces by message origin * Include origin address in basic channel message id * Update basic inbound fixture data * Add note about updating inbound channel test data * Mention message data encoding in comment * Update test data in envelope test * Remove channel id from message id * s/origin/user/g * Swap the order of source & user * Reword test data generation section * s/value/nonce * Use named fields in MessageId enum * Switch to Twox64Concat hasher This prevents accidental expensive lookups while remaining relatively fast vs Blake2_128 and resistant to attacks that cause prefix collisions, thanks to the security of keccak used to create the Ethereum address. * Remove rogue print statement * SNO-326: Forward messages by address in relayer (paritytech#695) * Remove redundant slice * Add initial address filtering * Remove extra nonce var * Clean up basic channel address config * Rename parachain relayer account config * Remove unused method * Fix up comments * Add default eth addresses for basic channel * Improve map log readability * Fix parachain writer error messages * Bump lodestar version in example command * Switch to default eth account for E2E script * Increase default timeout from 6m 40s to 25m This helps with some of the longer running tests that are timing out when given 400s. * Rename mapping to nonce * Rename origin to account in basic outbound channel * Remove principal from outbound channel contract * Rename user to account in basic inbound channel * Fix event account field rename in test
Bumps [clap](/~https://github.com/clap-rs/clap) from 3.1.6 to 3.1.18. - [Release notes](/~https://github.com/clap-rs/clap/releases) - [Changelog](/~https://github.com/clap-rs/clap/blob/master/CHANGELOG.md) - [Commits](clap-rs/clap@v3.1.6...v3.1.18) --- updated-dependencies: - dependency-name: clap dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
ISSUE
Overview
Avoid clearing the artifacts cache on restart when the host did not change
Issue extracted from discussion in paritytech/polkadot#6551 (review). Let's continue here to unblock that PR.
Previous discussion
We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.
We should've timings of course, but we'll never stop people building wasm blobs that screw up build times intentionally.
Interpreting kinda works. We have consensus upon who gets compiled vs interpreted, so interpreted then runs with different approval time parameters. We could similarly adjust approval parameters to include recompiling parathreads each block. This makes parathreads more expensive and second class though. We could've parathreads that "buy" being compiled in advance like parachains.
We do still have everyone compile the parathread when the PVF initially gets uploaded though, yes? I'd think this suggests parathreads and parachains should be all be precompiled, which just makes uploading a PVF more expensive. Implicitly then host upgrades become relatively more expensive, but this makes sense too.
Originally posted by @burdges in paritytech/polkadot#6551 (comment)
[...] I actually don't see why we want to clear the artifacts cache. On start-up, we could instead re-populate the
Artifacts
table from the compiled artifacts on disk -- the PVF hash should already be in the filename -- and re-start the 24-hour TTL timers for each artifact. Or we could even use the system's last-modified/accessed metadata for the files (with some sanity checks). Then instead of lazily re-compiling the PVFs, we would lazily delete the ones we end up not needing, which seems a lot more efficient. 🙂Originally posted by @mrcnski in paritytech/polkadot#6551 (comment)
[...] Do I understand correctly that the proposal is to keep artifacts only if the node is not upgraded? In that case, it might work. But we should always purge the artifacts if the node is upgraded.
Originally posted by @s0me0ne-unkn0wn in paritytech/polkadot#6551 (comment)
Related issue
/~https://github.com/paritytech/polkadot/issues/6941 (PVF preparation in advance)
The text was updated successfully, but these errors were encountered: