Manage tx validation and propagation with its files #85
Conversation
I went through most of the PR, but there are some random spots I had to skip because I ran out of time. The most notable thing I didn't have time to carefully go through was the handling of transactions when they originate from different sources. The type system denotes these differences, but the event loop in `txvalidation` only seemed to handle P2P transactions. I'll get back to this, probably tomorrow.
The general direction here is very good. I like this a lot.
A couple of notes on:
- Minor nitpicks on comment style.
- Some `From`/`Into` trait implementations could probably make the code cleaner?
- The type modeling for the tx state domain, and similarly for the file source as well.
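The `From`/`Into` suggestion above could look roughly like this. All type names here are illustrative, not the actual types in the PR: the point is only that modeling the conversion as a `From` impl replaces ad-hoc constructor calls and gives `Into` for free.

```rust
// Hypothetical sketch: convert a validated-tx wrapper into the mempool's
// tx type via `From` instead of ad-hoc constructor calls.
#[derive(Debug, Clone, PartialEq)]
pub struct ValidatedTx {
    pub hash: String,
    pub payload: Vec<u8>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct MempoolTx {
    pub hash: String,
    pub payload: Vec<u8>,
}

impl From<ValidatedTx> for MempoolTx {
    fn from(tx: ValidatedTx) -> Self {
        MempoolTx {
            hash: tx.hash,
            payload: tx.payload,
        }
    }
}

fn main() {
    let validated = ValidatedTx {
        hash: "deadbeef".to_string(),
        payload: vec![1, 2, 3],
    };
    // The blanket `Into` impl comes for free with `From`.
    let pooled: MempoolTx = validated.into();
    assert_eq!(pooled.hash, "deadbeef");
}
```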
Compare 5e76ada to 3a1f086
I finally went through everything here. I haven't had time for a test run yet, but I reviewed the whole PR.
I think right now there's only one major thing here: the file storage scheme. Instead of a UUID, I feel like we should use the checksum and leverage filesystem features to allow natural de-duplication of data (by storing the data under the checksum and only creating symlinks from the human-readable name).
Rest of the points are just very minor nitpicks on formatting.
The naming in the `txvalidation` module is a bit backwards I think, but I'd like to make this code work first and then fix it separately afterwards, especially as I don't have a strong idea for good naming right now.
Overall I second the earlier verdict that in general this looks good 👍
@@ -116,4 +116,4 @@ CREATE TABLE proof_key (
CONSTRAINT fk_transaction
FOREIGN KEY (tx)
REFERENCES transaction (hash) ON DELETE CASCADE
);
);
Maybe fix this EOF diff so the final commit doesn't have an unnecessary entry for this file, as there are no changes here. 🙂
Done
crates/node/src/main.rs
Outdated
self.add_asset(&tx_hash).await?;
self.mark_asset_complete(&tx_hash).await
Is there something behind this or is this just a leftover? If the asset manager is now deleted, shouldn't we also cleanup the DB + drop the references into it (like in this case)?
I removed the asset manager because its role was taken over by the validation process. For the database, I don't know. It depends on why assets are stored in the db. If we need to keep track of them (history), we should keep it. If we only need the file part, described in the Tx data, they are not needed.
I think we should keep them for now until we define the Tx execution and storage domains.
crates/node/src/scheduler/mod.rs
Outdated
@@ -423,13 +422,84 @@ impl TaskManager for Scheduler {
);
}

//move and build output files of the Tx execution
//move and build output files of the Tx execution
// Handle tx execution's result files so that they are available as an input for the next task if needed.
Done
crates/node/src/scheduler/mod.rs
Outdated
//format file_name to keep the current filename and the uuid.
//put uuid at the end so that the uuid is use to save the file.
//format file_name to keep the current filename and the uuid.
//put uuid at the end so that the uuid is use to save the file.
// Format `file_name` to keep the current filename and the uuid.
// Put uuid at the end so that the uuid is used as the actual filename.
Done
crates/node/src/scheduler/mod.rs
Outdated
let new_file_name = format!("{}/{}", file.path, uuid);
let dest = File::<ProofVerif>::new(
    file.path,
    self.http_download_host.clone(),
    file.checksum[..].into(),
);
(vm_file, dest)
Design-wise, what's the reason for the UUID when the checksum is available?
I'd personally prefer using the checksum and then having a symlink from the filename, so that we can deduplicate the data automatically.
Possible schemes:
- `{checksum}/file.data` and then a symlink to that from `(unknown)/{checksum}`
- `files/{checksum}` and `files/(unknown)`, though this is vulnerable to file name collisions in some cases.
- `{checksum}/file.data` and a symlink to that from `{checksum}/(unknown)`
Requirements that the scheme must fulfill:
- Don't store same data twice.
- Don't allow collisions for same file name with different content.
- Make it as ubiquitous as possible. Checksum as a filename allows for content verification without the system.
WDYT?
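A minimal sketch of the checksum-plus-symlink idea discussed above (Unix-only, checksum computation elided, all paths illustrative): the bytes live exactly once under their checksum, and each human-readable name is just a symlink to that blob, which satisfies the "don't store the same data twice" requirement automatically.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Store `data` once under its checksum and expose `name` as a symlink.
// Storing a second name for the same checksum costs only a symlink.
pub fn store_dedup(root: &Path, checksum: &str, name: &str, data: &[u8]) -> io::Result<PathBuf> {
    fs::create_dir_all(root)?;
    let blob = root.join(checksum);
    if !blob.exists() {
        // First time this content is seen: write it exactly once.
        fs::write(&blob, data)?;
    }
    let link = root.join(name);
    if !link.exists() {
        // Human-readable alias pointing at the content-addressed file.
        std::os::unix::fs::symlink(&blob, &link)?;
    }
    Ok(link)
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join("dedup-sketch");
    store_dedup(&root, "cafebabe", "proof.bin", b"same bytes")?;
    store_dedup(&root, "cafebabe", "other-name.bin", b"same bytes")?;
    // Both names resolve to the single content-addressed blob.
    assert_eq!(
        fs::read(root.join("proof.bin"))?,
        fs::read(root.join("other-name.bin"))?
    );
    Ok(())
}
```

A real version would also have to guard against a colliding `name` that already points at a different checksum, per the second requirement above.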
I use the uuid because, from my understanding, all nodes execute the same program that generates the same file, and all the generated files are downloaded on all nodes, so I need a way to have a unique name for the same file content (same hash). I only use the uuid and not a more complex file path because I suppose that files are not directly accessed by the user but only via an RPC request that maps the content to the name. So all file data are stored in the db and the filename is only a uuid.
> from my understanding all nodes execute the same program that generate the same file and all the generated files are downloaded on all nodes, so I need a way to have a unique name for the same file content (same hash).

But why does the file name have to be unique if the content is the same? What use case requires that there are copies of the same file for each node's execution, if the file content is bit-for-bit the same?

> So all file data are stored in the db and the filename is only an uuid.

File data or metadata? Please, put the file contents in files on disk and only use the DB for possibly helpful metadata 🙂
This makes debugging, backups and tool development a bit easier 🙂
Ok, I was thinking that we need to keep all node-generated data for verification / history reasons. In my understanding, 2 identical files generated on 2 distinct nodes weren't the same because we need to be able to prove or provide all the data of both nodes.
So I changed to the checksum. I put the file in `{txhash}/{checksum}/(unknown)`
crates/node/src/workflow/mod.rs
Outdated
@@ -245,13 +245,14 @@ impl WorkflowEngine {
continue;
}
Payload::Verification { parent, .. } => {
//if we return the parent Tx it's reexecuted an generate a duplicate key value violates unique constraint error in the db
//if we return the parent Tx it's reexecuted an generate a duplicate key value violates unique constraint error in the db
// XXX: If we return the parent tx, it gets re-executed and would generate
// a duplicate key violation (unique constraint error in the DB).
Done
self.0.send(tx).await.expect("sink send");
tracing::debug!("sink submitted tx to channel");
Ok(())
//TODO change by impl From when module declaration between main and lib are solved.
//TODO change by impl From when module declaration between main and lib are solved.
// TODO: Change to `impl From` form when module declaration between main and lib is solved.
Done
crates/node/src/txvalidation/mod.rs
Outdated
pub struct P2pSender;
pub struct TxResultSender;

//use to send a tx to the event process.
//use to send a tx to the event process.
// `TxEventSender` holds the received transaction of a specific state together with an optional callback interface.
…work/gevulot into manage_tx_consistency
crates/node/src/main.rs
Outdated
// self.add_asset(&tx_hash).await?;
self.mark_asset_complete(&tx_hash).await
There was a conflict with an asset entry in the DB. Now the asset creation is commented out, so should all of this just be deleted?
// self.add_asset(&tx_hash).await?;
self.mark_asset_complete(&tx_hash).await
.gitignore
Outdated
@@ -3,6 +3,7 @@
.DS_Store
.idea
*.pki
**/.sqlx
Hmm... Why is this here? AFAIK you need these files in order to be able to compile the code that uses sqlx. They also need to be updated regularly when changing the DB (though the build usually fails when there's a need).
See /~https://github.com/launchbadge/sqlx/?tab=readme-ov-file#compile-time-verification and /~https://github.com/launchbadge/sqlx/blob/main/sqlx-cli/README.md#enable-building-in-offline-mode-with-query
I thought it was an sqlx temporary file folder. I saw one deleted by sqlx, so I thought it was a working directory of sqlx and not part of the code. I'll remove it.
// pub async fn add_asset(&self, tx_hash: &Hash) -> Result<()> {
//     sqlx::query!(
//         "INSERT INTO assets ( tx ) VALUES ( $1 ) RETURNING *",
//         tx_hash.to_string(),
//     )
//     .fetch_one(&self.pool)
//     .await?;
//     Ok(())
// }
// pub async fn add_asset(&self, tx_hash: &Hash) -> Result<()> {
//     sqlx::query!(
//         "INSERT INTO assets ( tx ) VALUES ( $1 ) RETURNING *",
//         tx_hash.to_string(),
//     )
//     .fetch_one(&self.pool)
//     .await?;
//     Ok(())
// }
Two major changes in this commit:
- Terminate the VM if it queries for a new task when it already has a task assigned (in `running_tasks`).
- Fix the `task_queue` operation: `front()` -> `pop_front()`. When pulling a task from the queue, it must be removed from it to prevent re-execution. This mistake happened to accidentally work because the VM was stopped after successful task execution (so it didn't ask for a new task again).
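The `front()` vs `pop_front()` distinction from the commit message, shown in isolation (the queue element type and function name are illustrative, not the scheduler's actual types):

```rust
use std::collections::VecDeque;

// `front()` only peeks at the head, so the same task would be handed out
// again on every poll; `pop_front()` removes it as it is dispatched.
fn next_task(queue: &mut VecDeque<u64>) -> Option<u64> {
    // Before the fix this was effectively `queue.front().copied()`,
    // which left the task queued for accidental re-execution.
    queue.pop_front()
}

fn main() {
    let mut queue: VecDeque<u64> = VecDeque::from([1, 2]);
    assert_eq!(next_task(&mut queue), Some(1));
    assert_eq!(next_task(&mut queue), Some(2));
    // Queue is drained; a third poll yields nothing instead of task 1 again.
    assert_eq!(next_task(&mut queue), None);
}
```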
…work/gevulot into manage_tx_consistency
LGTM in general.
If you could go through the comment-style notes and check them in one sweep, that would be great. You can just click "Add suggestion to batch" when going through them in the PR and then commit in one go. No need to make the changes separately in an editor.
The Tx and file propagation works (I've tested with 3 nodes). The changes are:
It seems to me that Tx belong to 3 domains:
This PR implements the Validate domain.
To separate the Validate modules from the other domains, I define a stream or sink (channel) interface between the TxValidation domain and the others. I've updated the P2P, RPC and mempool accordingly.
I've removed the File Storage and AssetManager, which are useless now. I still save the asset in the Db, but it's not used.
From my tests, the Tx and files are propagated among all connected nodes.
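The sink (channel) boundary described above can be sketched roughly as follows, using a synchronous std channel for brevity (the node itself uses async channels); the type and method names are illustrative:

```rust
use std::sync::mpsc;
use std::thread;

// P2P/RPC hold the sender side of the boundary; the txvalidation event
// loop owns the receiver and dispatches each incoming tx.
pub struct TxEventSink(mpsc::Sender<String>);

impl TxEventSink {
    pub fn send_tx(&self, tx: String) -> Result<(), mpsc::SendError<String>> {
        self.0.send(tx)
    }
}

fn main() {
    let (sender, receiver) = mpsc::channel();
    let sink = TxEventSink(sender);

    // A producer (e.g. the RPC server) submits a tx through the sink.
    let producer = thread::spawn(move || {
        sink.send_tx("tx-hash-1".to_string()).unwrap();
    });

    // The validation event loop consumes it on the other side.
    let tx = receiver.recv().unwrap();
    assert_eq!(tx, "tx-hash-1");
    producer.join().unwrap();
}
```

The design benefit is that the producing modules never touch validation internals; they only see the sink type.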
There's still an issue that belongs to the execution domain. When 2 or more nodes are connected, a loop is created by the way Tx are created. The loop ends with an error when the re-created Tx are saved. The db generates the error `duplicate key value violates unique constraint "assets_pkey"`. The scenario I see is:
The error doesn't block the process, because the Db stops the loop and the data are fine on all nodes at the end, but I think it should be solved.
Besides what has been done, I think we should discuss the execution model and see how to manage: