Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF: Avoid clearing the artifacts cache on restart #685

Closed
mrcnski opened this issue Mar 23, 2023 · 14 comments · Fixed by #1918
Closed

PVF: Avoid clearing the artifacts cache on restart #685

mrcnski opened this issue Mar 23, 2023 · 14 comments · Fixed by #1918
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments

@mrcnski
Copy link
Contributor

mrcnski commented Mar 23, 2023

ISSUE

Overview

Avoid clearing the artifacts cache on restart when the host did not change

Issue extracted from discussion in paritytech/polkadot#6551 (review). Let's continue here to unblock that PR.

Previous discussion

We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.

We should've timings of course, but we'll never stop people building wasm blobs that screw up build times intentionally.

Interpreting kinda works. We have consensus upon who gets compiled vs interpreted, so interpreted then runs with different approval time parameters. We could similarly adjust approval parameters to include recompiling parathreads each block. This makes parathreads more expensive and second class though. We could've parathreads that "buy" being compiled in advance like parachains.

We do still have everyone compile the parathread when the PVF initially gets uploaded though, yes? I'd think this suggests parathreads and parachains should be all be precompiled, which just makes uploading a PVF more expensive. Implicitly then host upgrades become relatively more expensive, but this makes sense too.

Originally posted by @burdges in paritytech/polkadot#6551 (comment)


We should imho avoid clearing the artifacts cache when the host did not change. We could recompile only the parachain blocks when the host version did change, then lazily recompile the parathread ones.

[...] I actually don't see why we want to clear the artifacts cache. On start-up, we could instead re-populate the Artifacts table from the compiled artifacts on disk -- the PVF hash should already be in the filename -- and re-start the 24-hour TTL timers for each artifact. Or we could even use the system's last-modified/accessed metadata for the files (with some sanity checks). Then instead of lazily re-compiling the PVFs, we would lazily delete the ones we end up not needing, which seems a lot more efficient. 🙂

Originally posted by @mrcnski in paritytech/polkadot#6551 (comment)


[...] Do I understand correctly that the proposal is to keep artifacts only if the node is not upgraded? In that case, it might work. But we should always purge the artifacts if the node is upgraded.

Originally posted by @s0me0ne-unkn0wn in paritytech/polkadot#6551 (comment)

Related issue

/~https://github.com/paritytech/polkadot/issues/6941 (PVF preparation in advance)

@s0me0ne-unkn0wn
Copy link
Contributor

As far as I understand, the main concern about not clearing the artifact cache is:

  1. Unsafe functions are used to import the artifact
  2. Their safety guarantees rely on the fact the node has produces them with prepare() and they are not just some random files put there by someone.

Not that we can guarantee nobody will overwrite them during the node run time. I believe the cache pruning was implemented as a measure to mitigate those unwanted outcomes. But I don't have a strong opinion if it makes any sense. If someone wants to screw things up, he will.

@bkchr
Copy link
Member

bkchr commented Mar 24, 2023

This will then require that we don't dispute when loading from the cache fails (which should be implemented nevertheless) and that invalid cache values are being recreated.

@burdges
Copy link

burdges commented Mar 25, 2023

I'm curious, how deterministic are artifact builds?

@s0me0ne-unkn0wn
Copy link
Contributor

I'm curious, how deterministic are artifact builds?

Right now, they are not deterministic at all, but we're actively fighting for that. There are several levels of non-determinism, and we're trying to find a way to handle each of them.

  1. Non-determinism inside a single Wasmtime version on a single platform. When doing register allocation, Cranelift shuffles registers. We don't care much about this one. From the point of view of the CPU, general-purpose registers are equal. So we don't care which one is used for a concrete value. But that means that on two different nodes, the same Wasm code compiled with the same version of Wasmtime on the same platform produces different artifacts; that is, their execution result is the same, but when compared byte-to-byte, they are different. We're taking that into account (e.g., we're not trying to calculate artifact hashes and compare them), but we're not trying to avoid this non-determinism.
  2. Non-determinism between different Wasmtime versions. That's something that should be handled by Explicit versioning of PVF execution environment #917. Right now, when we upgrade the Wasmtime version, we're just praying for the best. Nobody really guarantees that a function compiled with Wasmtime version N would use the same amount of native stack as a function compiled with version N-1. Thus, having half of the nodes upgraded to a new version and half of the nodes on an old version, we can theoretically get a situation when one-half of the nodes execute PVF successfully, and the other half fails. Executor environment parametrization is the first step to mitigate that, the next will be the versification that hopefully will allow us to eliminate this non-determinism completely.
  3. Non-determinism between Wasmtimes with different environment semantics. It's not a problem right now as we don't change environment semantics, but we want to be able to. That is why execution environment parametrization was introduced in the first place in Executor Environment parameterization  polkadot#6161. For now, this type of non-determinism is hopefully adequately handled.
  4. Non-determinism between different platforms. There's no clear pathway to how to handle that yet, but we don't care much right now as the vast majority of validators are running on the same platform.

@burdges
Copy link

burdges commented Mar 26, 2023

Around 1, is there some need for randomness like to prevent the PVF doing rowhammer attacks upon nodes or whatever? If so, we could select the epoch randomness from the epoch after the PVF was uploaded, so the flow goes: PVF uploaded during epoch n. Wait for randomness during epoch n+1. Artifact builds start in epoch n+2 after the end of epoch n+1 is finalized. We could optimize this flow considerably if you actually want this.

Around 3 & 4, there are not so many combinations here really, assuming 1 gets solved, right? If so, then we could sign a hash of our artifact build when we vote for the PVF. I'm not sure if this really helps that much, but it's maybe useful information when debugging, and maybe relevant elsewhere. I donno..

@mrcnski
Copy link
Contributor Author

mrcnski commented Mar 26, 2023

This will then require that we don't dispute when loading from the cache fails (which should be implemented nevertheless) and that invalid cache values are being recreated.

Yeah, this can already happen if the artifact goes missing for some reason. We assume that if we prepared an artifact, it remains there on-disk until we prune it, i.e. we never check again if it's still there. We can change it so that instead of artifact-not-found triggering a dispute, we retry once (like we do for AmbiguousWorkerDeath). And when enqueuing an execute job we check for the artifact on-disk, and start preparation if not found.

Of course, on node restart we would just recreate the list of known artifacts based on what's on-disk. But someone could still wipe the cache between that point and an execution attempt, which indeed would lead to disputes.

Not that we can guarantee nobody will overwrite them during the node run time. I believe the cache pruning was implemented as a measure to mitigate those unwanted outcomes. But I don't have a strong opinion if it makes any sense. If someone wants to screw things up, he will.

Yep, kind of similar to above point, as someone can screw it up by just modifying/deleting files. But the concern of overwritten files can also be mitigated. One simple way is to keep track of exact file modified time, and we could theoretically even make this persist across restart. (Though at that point we should move to using a persistent database.) When loading a file we make sure the times match what we expect. Someone would really have to be intentionally sabotaging their own system to change the file attributes.

And speaking of time attributes, we could also use the last-read attribute to recompute the TTL of artifacts on restart.

@bkchr
Copy link
Member

bkchr commented Mar 26, 2023

Yeah, this can already happen if the artifact goes missing for some reason. We assume that if we prepared an artifact, it remains there on-disk until we prune it, i.e. we never check again if it's still there. We can change it so that instead of artifact-not-found triggering a dispute, we retry once (like we do for AmbiguousWorkerDeath). And when enqueuing an execute job we check for the artifact on-disk, and start preparation if not found.

A dispute should never be raised if the local cache doesn't provide a certain artifact. You can not dispute based on this reason, as it is a local hardware issue and not related to the candidate to check.

@burdges
Copy link

burdges commented Mar 26, 2023

Yes, we believe we, or someone like us, built the artifact in the past. We'll de facto no-show if we cannot rebuild the artifact, which sounds fine I guess.

@s0me0ne-unkn0wn
Copy link
Contributor

s0me0ne-unkn0wn commented Mar 27, 2023

Around 1, is there some need for randomness

Well, currently I'm not even sure we can influence the randomness inside Cranelift, but the idea is interesting, I'll investigate.

Around 3 & 4, there are not so many combinations here really, assuming 1 gets solved, right?

I hope 3) is not a problem anymore, and 4) is not a problem yet. But if we see at some point in time that people try to run validators on ARM or whatever, we'll need to find a way to handle it properly.

And speaking of time attributes, we could also use the last-read attribute to recompute the TTL of artifacts on restart.

Do you mean the filesystem's "last access" attribute? It cannot be relied on. In server configurations, its usage is often switched off through filesystem mounting options to increase the filesystem's throughput.

@mrcnski
Copy link
Contributor Author

mrcnski commented Mar 27, 2023

A dispute should never be raised if the local cache doesn't provide a certain artifact. You can not dispute based on this reason, as it is a local hardware issue and not related to the candidate to check.

Ah yeah, absolutely. Raised paritytech/polkadot#6959 as this should be fixed regardless.

Do you mean the filesystem's "last access" attribute? It cannot be relied on. In server configurations, its usage is often switched off through filesystem mounting options to increase the filesystem's throughput.

Maybe we can use it if present, and otherwise just reset the artifact to the default TTL.

Alternatively, we could address this, as well as data integrity concerns, by storing last time accessed and a hash of the file contents + last time accessed, in the file name itself.

@burdges
Copy link

burdges commented Mar 27, 2023

I'm not even sure we can influence the randomness inside Cranelift

As a hack, you could cargo patch getrandom to replace OsRng by thread local SeedableRng which you seed, but maybe only if you've some deterministic thread identifier, otherwise all threads return the same, or else their startup might create non-determinism. Just please never do this inside code that signs anything. lol

As an aside, I find this cap-rand crate bizarre btw, like you might derandomize something, but it should always be secure to give out system randomness, so restricting it as "ambient authority" makes no sense.

@s0me0ne-unkn0wn
Copy link
Contributor

s0me0ne-unkn0wn commented Mar 27, 2023

Maybe we can use it if present, and otherwise just reset the artifact to the default TTL.

We cannot tell if it's active or not ☹️ It's always present, just not getting updated if switched off.

If we want to be on the safe side, it makes sense to implement @koute's proposal to add version number to the artifact path, it solves all the possible issues

@mrcnski
Copy link
Contributor Author

mrcnski commented Mar 28, 2023

If we want to be on the safe side, it makes sense to implement @koute's proposal to add version number to the artifact path, it solves all the possible issues

I don't think it solves all the data integrity issues that have been brought up, though IMO those are almost orthogonal to this issue. I raised a separate ticket: #677. And IMO, to keep the scope low here we should just reset the artifact TTL to the default on restart.

For this issue we can add the version number to the filename, simple enough. Did I miss any other objections to clearing the cache?

@Polkadot-Forum
Copy link

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/exploring-alternatives-to-wasm-for-smart-contracts/2434/11

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. T8-parachains_engineering and removed I10-optimisation labels Aug 25, 2023
bkchr pushed a commit that referenced this issue Oct 15, 2023
closes #695

Could potentially be helpful to preserving caches when applicable, as
discussed in #685

kusama address: FvpsvV1GQAAbwqX6oyRjemgdKV11QU5bXsMg9xsonD1FLGK
@s0me0ne-unkn0wn s0me0ne-unkn0wn moved this from Backlog to In Progress in parachains team board Nov 2, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
…h#699)

* SNO-324: Enable multiple accounts in outbound channel contract (paritytech#685)

* Track nonces by source address

* Remove principal from basic outbound channel

* Update relayer bindings

* Track nonces by origin address

* Fix outbound channel tests

* SNO-325: Enable multiple accounts in inbound channel pallet (paritytech#687)

* Add origin field to message decoding

* Track nonces by message origin

* Include origin address in basic channel message id

* Update basic inbound fixture data

* Add note about updating inbound channel test data

* Mention message data encoding in comment

* Update test data in envelope test

* Remove channel id from message id

* s/origin/user/g

* Swap the order of source & user

* Reword test data generation section

* s/value/nonce

* Use named fields in MessageId enum

* Switch to Twox64Concat hasher

This prevents accidental expensive lookups while remaining relatively
fast vs Blake2_128 and resistant to attacks that cause prefix
collisions, thanks to the security of keccak used to create the Ethereum
address.

* Remove rogue print statement

* SNO-326: Forward messages by address in relayer (paritytech#695)

* Remove redundant slice

* Add initial address filtering

* Remove extra nonce var

* Clean up basic channel address config

* Rename parachain relayer account config

* Remove unused method

* Fix up comments

* Add default eth addresses for basic channel

* Improve map log readability

* Fix parachain writer error messages

* Bump lodestar version in example command

* Switch to default eth account for E2E script

* Increase default timeout from 6m 40s to 25m

This helps with some of the longer running tests that are timing out
when given 400s.

* Rename mapping to nonce

* Rename origin to account in basic outbound channel

* Remove principal from outbound channel contract

* Rename user to account in basic inbound channel

* Fix event account field rename in test
@eskimor eskimor moved this from In Progress to Completed in parachains team board Jan 30, 2024
helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024
Bumps [clap](/~https://github.com/clap-rs/clap) from 3.1.6 to 3.1.18.
- [Release notes](/~https://github.com/clap-rs/clap/releases)
- [Changelog](/~https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@v3.1.6...v3.1.18)

---
updated-dependencies:
- dependency-name: clap
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.
Projects
Status: Completed
Development

Successfully merging a pull request may close this issue.

6 participants