Manage tx validation and propagation with its files #85

musitdev · 2024-02-13T10:30:18Z

The Tx and file propagation works (I've tested with 3 nodes). The changes are:

Define a state transition process for Tx validation in the txvalidation module, using type-state programming to model the different states of a Tx during its validation.
Change file management to differentiate the files' use cases.
Change Transaction to carry its state. This affects every Tx. The Tx structure should be differentiated by domain, see below.
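
For readers unfamiliar with type-state programming, here is a minimal sketch of the idea. The names `Received`, `Validated`, `signature_ok` and `propagate` are hypothetical and are not the types used in this PR; the point is only that a state transition consumes the value and returns one with a different type parameter.

```rust
use std::marker::PhantomData;

// Hypothetical marker types, one per validation state.
#[derive(Debug)]
pub struct Received;
#[derive(Debug)]
pub struct Validated;

// A transaction tagged with its current state at the type level.
#[derive(Debug)]
pub struct Tx<State> {
    pub payload: Vec<u8>,
    pub signature_ok: bool,
    _state: PhantomData<State>,
}

impl Tx<Received> {
    pub fn new(payload: Vec<u8>, signature_ok: bool) -> Self {
        Tx { payload, signature_ok, _state: PhantomData }
    }

    // Consumes the received Tx; only a successful check yields a `Tx<Validated>`.
    pub fn validate(self) -> Result<Tx<Validated>, String> {
        if !self.signature_ok {
            return Err("invalid signature".to_string());
        }
        if self.payload.is_empty() {
            return Err("empty payload".to_string());
        }
        Ok(Tx { payload: self.payload, signature_ok: true, _state: PhantomData })
    }
}

impl Tx<Validated> {
    // Only callable once validation has succeeded (stand-in for real propagation).
    pub fn propagate(&self) -> usize {
        self.payload.len()
    }
}
```

With this shape, calling `propagate` on a `Tx<Received>` is a compile error, so a Tx that skipped validation cannot reach the next step by mistake.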

It seems to me that a Tx belongs to 3 domains:

Validate: the incoming data domain, which checks that everything needed to execute a Tx is OK.
Execute: executes the Tx. This is the current workflow, scheduler, VM, ... implementation.
Storage: answers history (RPC get_tx) and bootstrap needs.

This PR implements the Validate domain.
To separate the Validate modules from the other domains, I defined a stream or sink (channel) interface between the TxValidation domain and the others. I've updated the P2P, RPC and mempool accordingly.
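
A minimal sketch of such a channel boundary, assuming a standard-library `mpsc` channel rather than the async channels the node actually uses. `ValidatedTx`, `TxValidator`, `submit` and `drain` are hypothetical names introduced for illustration only.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical validated transaction handed over to the next domain.
#[derive(Debug, Clone, PartialEq)]
pub struct ValidatedTx {
    pub hash: String,
}

// Validate-domain side: checks incoming raw txs and forwards valid ones.
pub struct TxValidator {
    to_execution: Sender<ValidatedTx>,
}

impl TxValidator {
    // Returns the validator plus the receiving end owned by the other domain.
    pub fn new() -> (Self, Receiver<ValidatedTx>) {
        let (to_execution, from_validation) = channel();
        (TxValidator { to_execution }, from_validation)
    }

    // Returns true when the tx was accepted and sent through the channel.
    pub fn submit(&self, hash: &str) -> bool {
        if hash.is_empty() {
            return false; // Stand-in for the real validation rules.
        }
        self.to_execution
            .send(ValidatedTx { hash: hash.to_string() })
            .is_ok()
    }
}

// Execute-domain side: drains whatever the validator has forwarded so far.
pub fn drain(from_validation: &Receiver<ValidatedTx>) -> Vec<ValidatedTx> {
    from_validation.try_iter().collect()
}
```

The only thing the two sides share is the message type, so the P2P, RPC and mempool code can each hold a sender or a receiver without depending on the validation internals.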

I've removed File Storage and AssetManager, which are no longer needed. I still save the asset in the DB, but it's not used.

In my tests, the Tx and its files are propagated to all connected nodes.

There's still an issue that belongs to the execution domain. When 2 or more nodes are connected, the way Txs are created produces a loop. The loop ends with an error when a recreated Tx is saved: the DB returns the error duplicate key value violates unique constraint "assets_pkey". The scenario I see is:

Node 1 receives a Run Tx and executes the proof task.
Node 1 propagates the Run Tx to node 2.
Node 1 creates a Proof Tx after the Run Tx execution.
Node 1 propagates the Proof Tx to node 2 and saves it.
Node 2 receives the Run Tx and executes it.
Node 2 creates a Proof Tx after the Run Tx execution, with the same data as node 1's Proof Tx (same execution).
Node 2 propagates the Proof Tx to node 1.
Node 1 receives node 2's Proof Tx, which has the same data as the first one created; only the signature differs. When it is saved, the DB returns the error, which stops the Tx execution and the loop.

The error doesn't block the process, because the DB stops the loop and the data is fine on all nodes in the end, but I think it should be solved.
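
One possible direction, sketched here only to make the scenario concrete and not something this PR implements: detect the duplicate before it reaches the DB by keying Proof Txs on their content and leaving the signature out. `ProofTx`, `content_key` and `SeenProofs` are hypothetical names.

```rust
use std::collections::HashSet;

// Hypothetical Proof Tx: the same execution gives the same `parent` and `proof`,
// while `signature` differs per node.
#[derive(Debug, Clone)]
pub struct ProofTx {
    pub parent: String,
    pub proof: Vec<u8>,
    pub signature: String,
}

impl ProofTx {
    // Key built from the content only; the signature is deliberately left out.
    pub fn content_key(&self) -> (String, Vec<u8>) {
        (self.parent.clone(), self.proof.clone())
    }
}

#[derive(Default)]
pub struct SeenProofs {
    seen: HashSet<(String, Vec<u8>)>,
}

impl SeenProofs {
    // Returns true if the tx is new and should be saved,
    // false if an equivalent proof was already recorded.
    pub fn should_save(&mut self, tx: &ProofTx) -> bool {
        self.seen.insert(tx.content_key())
    }
}
```

Whether two Proof Txs with the same content but different signers should really be treated as one is exactly the execution-model question raised below, so this is an illustration of the collision rather than a recommended fix.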

Besides what has been done, I think we should discuss the execution model and see how to manage:

the execution of several Tasks for one Tx, and output management
Tx creation by the nodes: which and why
the dependencies between Tasks or Txs
resource management (how to share one GPU between several images that use it)
Tx end-of-execution management
Tx execution resource release: zombie tasks, how long to keep files, ...

tuommaki

I went through most of the PR, but there are some random spots I had to skip because I ran out of time. Most notable thing I didn't have time to carefully go through was the handling of transactions when they originate from different sources. The type system denotes these differences, but the event loop in txvalidation only seemed to handle P2P transactions. I'll get back to this probably tomorrow.

The general direction here is very good. I like this a lot.

Couple notes on:

Minor nitpicks on comment style.
Some From / Into trait implementation could probably make the code cleaner?
The type modeling for tx state domain & similarly for file source as well.

crates/node/migrations/20231009111925_tasks-table.sql

crates/node/migrations/20231128120351_transactions-table.sql

crates/node/src/cli.rs

crates/node/src/main.rs

crates/node/src/types/file.rs

crates/node/src/txvalidation/event.rs

crates/node/src/txvalidation/mod.rs

crates/node/src/storage/database/postgres.rs

tuommaki

I finally went through everything here. I haven't had time for a test run yet, but I reviewed the whole PR.

I think right now there's only one major thing here: the file storage scheme. Instead of a UUID, I feel like we should use the checksum and leverage filesystem features to allow natural de-duplication of data (by storing the data under the checksum and only creating symlinks from the human-readable name).

Rest of the points are just very minor nitpicks on formatting.

The naming in the txvalidation module is a bit backwards I think, but I'd like to make this code work first and then fix it separately afterwards - especially as I don't have strong idea for a good naming right now.

Overall I second the earlier verdict that in general this looks good 👍

tuommaki · 2024-02-18T14:53:43Z

crates/node/migrations/20231128120351_transactions-table.sql

@@ -116,4 +116,4 @@ CREATE TABLE proof_key (
    CONSTRAINT fk_transaction
        FOREIGN KEY (tx)
            REFERENCES transaction (hash) ON DELETE CASCADE
-);
+);


Maybe fix this EOF diff so the final commit doesn't have unnecessary entry to this file as there are no changes here. 🙂

tuommaki · 2024-02-18T14:56:38Z

crates/node/src/main.rs

+        self.add_asset(&tx_hash).await?;
+        self.mark_asset_complete(&tx_hash).await


Is there something behind this or is this just a leftover? If the asset manager is now deleted, shouldn't we also cleanup the DB + drop the references into it (like in this case)?

musitdev

I removed the asset manager because its role was taken over by the validation process. For the database, I don't know. It depends on why assets are stored in the DB. If we need to keep a trace of them (history), we should keep them. If we only need the file part, described in the Tx data, they are not needed.
I think we should keep them for now, until we define the Tx execution and storage domains.

tuommaki · 2024-02-18T15:16:34Z

crates/node/src/scheduler/mod.rs

@@ -423,13 +422,84 @@ impl TaskManager for Scheduler {
                );
            }

+            //move and build output files of the Tx execution


Suggested change

-            //move and build output files of the Tx execution
+            // Handle tx execution's result files so that they are available as an input for next task if needed.

tuommaki · 2024-02-18T15:18:45Z

crates/node/src/scheduler/mod.rs

+                    //format file_name to keep the current filename and the uuid.
+                    //put uuid at the end so that the uuid is use to save the file.


Suggested change

-                    //format file_name to keep the current filename and the uuid.
-                    //put uuid at the end so that the uuid is use to save the file.
+                    // Format `file_name` to keep the current filename and the uuid.
+                    // Put uuid at the end so that the uuid is used as the actual filename.

tuommaki · 2024-02-18T15:32:27Z

crates/node/src/scheduler/mod.rs

+                    let new_file_name = format!("{}/{}", file.path, uuid);
+                    let dest = File::<ProofVerif>::new(
+                        file.path,
+                        self.http_download_host.clone(),
+                        file.checksum[..].into(),
+                    );
+                    (vm_file, dest)


Design wise, what's the reason for the UUID when the checksum is available?

I'd personally prefer the use of the checksum and then having a symlink to filename so that we can deduplicate the data automatically.

Possible schemes:

{checksum}/file.data and then symlink to that from {filename}/{checksum}

files/{checksum} and files/{filename}, though this is vulnerable to file name collisions in some cases.

{checksum}/file.data and symlink to that from {checksum}/{filename}

Requirements that the scheme must fulfill:

Don't store same data twice.

Don't allow collisions for same file name with different content.

Make it as ubiquitous as possible. Checksum as a filename allows for content verification without the system.

WDYT?
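
A minimal sketch of the third scheme above ({checksum}/file.data with a symlink from {checksum}/{filename}), to show how it meets the requirements. The `checksum` function here uses the standard library's `DefaultHasher` purely as a stand-in so the example has no dependencies; it is not cryptographic, and a real node would use the digest it already computes for files. `store` is a hypothetical helper and the symlink call is Unix-only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

// Stand-in checksum: NOT cryptographic. A real node would use a proper digest.
pub fn checksum(data: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

// Stores `data` at `{root}/{checksum}/file.data` (written once) and adds a
// symlink `{root}/{checksum}/{filename}` pointing at it. Unix only.
pub fn store(root: &Path, filename: &str, data: &[u8]) -> io::Result<PathBuf> {
    let dir = root.join(checksum(data));
    fs::create_dir_all(&dir)?;
    let blob = dir.join("file.data");
    if !blob.exists() {
        fs::write(&blob, data)?;
    }
    let link = dir.join(filename);
    if fs::symlink_metadata(&link).is_err() {
        // Relative target, so the directory can be moved as a whole.
        std::os::unix::fs::symlink("file.data", &link)?;
    }
    Ok(link)
}
```

Identical content always lands in the same directory and is written once, while the same file name with different content ends up under a different checksum, so the two cannot collide.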

musitdev

I used the UUID because, from my understanding, all nodes execute the same program, which generates the same file, and all the generated files are downloaded on all nodes, so I need a way to have a unique name for the same file content (same hash). I only use the UUID and not a more complex file path because I assume that files are not accessed directly by the user, but only through an RPC request that maps the content to the name. So all file data are stored in the DB and the filename is only a UUID.

tuommaki

> from my understanding, all nodes execute the same program, which generates the same file, and all the generated files are downloaded on all nodes, so I need a way to have a unique name for the same file content (same hash).

But why does the file name have to be unique if the content is the same? What use case requires copies of the same file for each node's execution, if the file content is bit-for-bit the same?

> So all file data are stored in the DB and the filename is only a UUID.

File data or metadata? Please put the file contents on disk in a file and only use the DB for possibly helpful metadata 🙂

This makes debugging, backups and tool development a bit easier 🙂

musitdev

OK, I was thinking that we need to keep all node-generated data for verification / history reasons. In my understanding, 2 identical files generated on 2 distinct nodes weren't the same, because we need to be able to prove or provide all the data of both nodes.
So I've changed to the checksum. I put the file in {txhash}/{checksum}/{filename}

tuommaki · 2024-02-18T19:01:45Z

crates/node/src/workflow/mod.rs

@@ -245,13 +245,14 @@ impl WorkflowEngine {
                    continue;
                }
                Payload::Verification { parent, .. } => {
+                    //if we return the parent Tx it's reexecuted an generate a duplicate key value violates unique constraint error in the db


Suggested change

-                    //if we return the parent Tx it's reexecuted an generate a duplicate key value violates unique constraint error in the db
+                    // XXX: If we return the parent tx, it gets re-executed and would generate
+                    // a duplicate key violation (unique constraint error in the DB).

tuommaki · 2024-02-18T19:05:10Z

crates/node/src/networking/p2p/pea2pea.rs

-            self.0.send(tx).await.expect("sink send");
-            tracing::debug!("sink submitted tx to channel");
-            Ok(())
+    //TODO change by impl From when module declaration between main and lib are solved.


Suggested change

-    //TODO change by impl From when module declaration between main and lib are solved.
+    // TODO: Change to `impl From` form when module declaration between main and lib is solved.

tuommaki · 2024-02-18T19:06:16Z

crates/node/src/txvalidation/mod.rs

+pub struct P2pSender;
+pub struct TxResultSender;
+
+//use to send a tx to the event process.


Suggested change

-//use to send a tx to the event process.
+// `TxEventSender` holds the received transaction of a specific state together with an optional callback interface.

tuommaki · 2024-02-20T09:57:48Z

crates/node/src/main.rs

+        //        self.add_asset(&tx_hash).await?;
+        self.mark_asset_complete(&tx_hash).await


There was the conflict with an asset entry in DB. Now the asset creation is commented out so should all of this be just deleted?

Suggested change

-        //        self.add_asset(&tx_hash).await?;
-        self.mark_asset_complete(&tx_hash).await

tuommaki · 2024-02-21T09:36:50Z

.gitignore

@@ -3,6 +3,7 @@
 .DS_Store
 .idea
 *.pki
+**/.sqlx


Hmm... Why is this here? AFAIK you need these files in order to be able to compile the code that uses sqlx. These also need to be updated regularly when changing DB (though the build usually fails when there's need).

See https://github.com/launchbadge/sqlx/?tab=readme-ov-file#compile-time-verification and https://github.com/launchbadge/sqlx/blob/main/sqlx-cli/README.md#enable-building-in-offline-mode-with-query

musitdev

I thought it was sqlx's temporary file folder. One of them was deleted by sqlx, so I assumed it was an sqlx working directory and not part of the code. I'll remove it.

tuommaki · 2024-02-21T09:42:33Z

crates/node/src/storage/database/postgres.rs

+    // pub async fn add_asset(&self, tx_hash: &Hash) -> Result<()> {
+    //     sqlx::query!(
+    //         "INSERT INTO assets ( tx ) VALUES ( $1 ) RETURNING *",
+    //         tx_hash.to_string(),
+    //     )
+    //     .fetch_one(&self.pool)
+    //     .await?;
+    //     Ok(())
+    // }


Suggested change

-    // pub async fn add_asset(&self, tx_hash: &Hash) -> Result<()> {
-    //     sqlx::query!(
-    //         "INSERT INTO assets ( tx ) VALUES ( $1 ) RETURNING *",
-    //         tx_hash.to_string(),
-    //     )
-    //     .fetch_one(&self.pool)
-    //     .await?;
-    //     Ok(())
-    // }

Two major changes in this commit:

- Terminate the VM if it queries for a new task when it already has a task assigned (in running_tasks).
- Fix the `task_queue` operation: `front()` -> `pop_front()`.
  - When pulling a task from the queue, it must be removed from it to prevent re-execution. This mistake happened to accidentally work because the VM was stopped after successful task execution (so it didn't ask for a new task again).
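
The difference between the two `VecDeque` calls can be sketched as follows. `Task`, `next_task_peek` and `next_task_pop` are hypothetical names for illustration; the real scheduler's task type and queue handling carry much more.

```rust
use std::collections::VecDeque;

// Hypothetical task type; the real scheduler's task carries much more.
#[derive(Debug, Clone, PartialEq)]
pub struct Task {
    pub id: u32,
}

// Buggy variant: `front()` only peeks, so the same task is handed out again.
pub fn next_task_peek(queue: &VecDeque<Task>) -> Option<Task> {
    queue.front().cloned()
}

// Fixed variant: `pop_front()` removes the task from the queue when it is handed out.
pub fn next_task_pop(queue: &mut VecDeque<Task>) -> Option<Task> {
    queue.pop_front()
}
```

With the peeking variant, a VM that asks twice gets the same task twice, which matches the re-execution described in the commit message.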

tuommaki

LGTM in general.

If you could go through the comment style notions and check them in one sweep that would be great. You can just click "Add suggestion to batch" when going through them in PR and then commit in one go. No need to do changes separately in editor.

musitdev requested a review from tuommaki February 13, 2024 10:30

tuommaki reviewed Feb 14, 2024

musitdev added 27 commits February 16, 2024 11:40

first implementation

ece7b89

correct file management. Work locally. Before remote test

00831fd

correct p2p peer list init

0056630

add some logs to follow tx

53d5423

correct rebase error

30f9e48

add checksum to VM file and test

557bc58

add type state to file and Tx

992de81

correct some issues in file path

4fc59b2

add some logs

072235e

update proof generated file path

6428170

remove comments

3dd91cb

add comment, add timeout and error management to http download connection

2a714a1

correct http url definition

fdf618f

add timeout to http send

ff9e291

add better download error login and log file checksum

c58ad9d

correct build

b19c14f

desactivate checksum verification for downloaded files

423d573

add some logs for checksum

14bbd3c

change hasher for VM file and activate checksum verification

4557493

correct host detection for download

48aec37

disable proof and verifcation workflow and remove some logs

32c5c0e

rebase from master and correct download file issue

7bce8e6

rebase from master and correct download file issue

f8be0df

pass clippy and tests

3b59b32

correct clippy

803cfe7

do some cleaning

a9a4d87

correct PR remarks

3a1f086

musitdev force-pushed the manage_tx_consistency branch from 5e76ada to 3a1f086 Compare February 16, 2024 10:40

tuommaki reviewed Feb 18, 2024

tuommaki assigned musitdev Feb 18, 2024

tuommaki and others added 13 commits February 19, 2024 09:31

Merge branch 'main' into manage_tx_consistency

2a76a8e

add newline to be like the original file

f422f14

change generated file name from uuid to checksum

a0216cc

Avoid to move download existing file

02a1347

correct clippy

d45fb19

remove asset manager call to test

d68428b

correct some path issue

d44e527

add verify tx execution in the workflow.

ff72aed

add mark deploy tx as executed

ee8a376

remove the dbg! that panic

6d563e0

Merge branch 'main' into manage_tx_consistency

e8cce96

change verif tx workflow management and change logs

29a99d8

Merge branch 'manage_tx_consistency' of https://github.com/gevulotnetwork/gevulot into manage_tx_consistency

a168896

tuommaki reviewed Feb 21, 2024

musitdev and others added 8 commits February 21, 2024 12:26

add some logs

8766908

Merge branch 'main' into manage_tx_consistency

9952880

revert .gitignore change

e4481f6

Merge branch 'manage_tx_consistency' of https://github.com/gevulotnetwork/gevulot into manage_tx_consistency

d9b2683

correct mutex lock issue

80998df

add some logs

3aaab1f

remove asset management and some logs

37d7eb0

tuommaki approved these changes Feb 22, 2024

musitdev merged commit 0149a30 into main Feb 22, 2024
4 checks passed

musitdev deleted the manage_tx_consistency branch February 22, 2024 08:51
