Mempool Collector Stats + Discussions #1

Closed
metachris opened this issue Aug 4, 2023 · 5 comments

@metachris

metachris commented Aug 4, 2023

Some early stats about transactions collected and stored with the mempool-collector:

Hourly stats:

  • 80k - 120k transactions
  • CSV file
    • <timestampMillis>,<hash>,<rawTx>
    • 150MB uncompressed
    • 54MB gzipped

Extrapolated to a day:

  • Up to 3M transactions
  • Up to 1.5 GB compressed CSV raw data (per collector instance)
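A minimal sketch of the hourly CSV layout described above, assuming one gzipped file per UTC hour. The file naming scheme and the helper names (`hourly_file_name`, `encode_rows`, `decode_rows`) are made up for illustration and are not taken from the collector's code:

```python
import csv
import gzip
import io
from datetime import datetime, timezone


def hourly_file_name(timestamp_ms: int) -> str:
    """Hypothetical naming scheme: one file per UTC hour."""
    dt = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d_%H.csv.gz")


def encode_rows(rows) -> bytes:
    """Serialize (timestampMillis, hash, rawTx) tuples as gzipped CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for timestamp_ms, tx_hash, raw_tx in rows:
        writer.writerow([timestamp_ms, tx_hash, raw_tx])
    return gzip.compress(buf.getvalue().encode("ascii"))


def decode_rows(blob: bytes):
    """Inverse of encode_rows: return a list of (int, str, str) tuples."""
    text = gzip.decompress(blob).decode("ascii")
    return [(int(ts), h, raw) for ts, h, raw in csv.reader(io.StringIO(text))]
```

The hash and rawTx values are hex strings, so plain ASCII CSV without quoting is enough for this sketch.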

Note 2023-08-07: The stats below are outdated as they are based on the test storage method of one JSON file per transaction. Storage has now been updated to write into one CSV file per hour, which has very different compression characteristics.

Data collection

JSON file example: https://github.com/flashbots/mempool-archiver/blob/main/docs/example-tx-summary.json

Per hour:

  • 70k - 100k transactions
  • 150 - 500MB of JSON files written

Extrapolated to a day:

  • Up to 2.5M transactions
  • Up to 12 GB disk usage

Data size & compression

Looking at one specific hour, 2023-08-04 from 01:00 to 02:00 UTC (end exclusive):

  • Unique tx: 78,757 (find ./ -type f | wc -l)
  • Disk usage: 373 MB (du --si -s)
  • Apparent size: 134 MB (du --si -s --apparent-size)
  • Average file size: 1.584 KB (ls -l | gawk '{sum += $5; n++;} END {print n" "sum" "sum/n;}')

gzip individual JSON files:

  • Typical file size reduction: 50%
  • But: since the files are very small, this barely decreases actual disk usage (373 MB -> 350 MB, about 6%):
$ du --si -s *
373M    h01
350M    h01_gz

$ du --si -s * --apparent-size
134M    h01
76M     h01_gz

more about "apparent-size": https://man7.org/linux/man-pages/man1/du.1.html

   --apparent-size
              print apparent sizes rather than device usage; although
              the apparent size is usually smaller, it may be larger due
              to holes in ('sparse') files, internal fragmentation,
              indirect blocks, and the like
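The two numbers that `du` and `du --apparent-size` report can also be read per file from Python. This is a rough sketch for Linux/Unix only: it assumes `st_blocks` is counted in 512-byte units (as on Linux), and the helper names `file_sizes` and `demo` are made up for illustration:

```python
import os
import tempfile


def file_sizes(path: str):
    """Return (apparent_bytes, allocated_bytes) for one file."""
    st = os.stat(path)
    apparent = st.st_size            # byte length, what --apparent-size sums up
    allocated = st.st_blocks * 512   # space reserved on disk, what plain du sums up
    return apparent, allocated


def demo(payload_len: int = 1584):
    """Write a small throwaway file and report both sizes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * payload_len)
        path = f.name
    try:
        return file_sizes(path)
    finally:
        os.remove(path)
```

The allocated value depends on the filesystem (block size, inline data, delayed allocation), so it can differ between machines even for identical files.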

zipping an hourly folder:

  • 80% reduction in disk space needed (373 MB -> 77 MB)
$ zip -r h01 h01

$ ls -alh h01.zip
-rw-r--r-- 1 ubuntu ubuntu 74M Aug  4 10:30 h01.zip

$ du --si h01.zip
77M     h01.zip

$ du --si h01.zip --apparent-size
77M     h01.zip

@0x416e746f6e

Indeed, disk usage reports the space a file actually occupies on disk. Since the filesystem stores data in blocks, a non-sparse file occupies diskBlockSize * roundUp(fileSize / diskBlockSize) >= fileSize. Therefore, individually compressing many small files (each smaller than diskBlockSize) is not going to change much with respect to actual disk usage.
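The formula above can be worked through with a small sketch. The 4096-byte block size is an assumption (a common default, not stated in this thread), and the 1584-byte input takes the average file size reported earlier as bytes:

```python
def allocated_bytes(file_size: int, block_size: int = 4096) -> int:
    """Space taken by a non-sparse file: block_size * ceil(file_size / block_size)."""
    blocks = (file_size + block_size - 1) // block_size  # integer ceiling
    return blocks * block_size


original = 1584              # roughly the average JSON file size reported above
compressed = original // 2   # assuming the ~50% gzip reduction reported above

print(allocated_bytes(original))    # 4096
print(allocated_bytes(compressed))  # 4096
```

Both files still occupy one full block under this assumption, which is consistent with gzip halving the apparent size while leaving disk usage nearly unchanged.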

@metachris

metachris commented Aug 6, 2023

Some stats after creating a summary file of 1,423,508 transactions in CSV and Parquet format:

Format    Size     Signature included   Compression
CSV       314 MB   No                   -
Parquet   118 MB   No                   Snappy
CSV       529 MB   Yes                  -
Parquet   248 MB   Yes                  Snappy
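For a quick read of the relative savings, the sizes in the table can be compared directly. This is plain arithmetic on the numbers above; the `reduction` helper and the dictionary layout are made up for illustration:

```python
# Sizes in MB, copied from the table above: (format, signature included) -> size.
SIZES_MB = {
    ("CSV", False): 314,
    ("Parquet", False): 118,
    ("CSV", True): 529,
    ("Parquet", True): 248,
}


def reduction(before: float, after: float) -> float:
    """Fraction of space saved when going from `before` to `after`."""
    return 1 - after / before


for with_sig in (False, True):
    saved = reduction(SIZES_MB[("CSV", with_sig)], SIZES_MB[("Parquet", with_sig)])
    print(f"CSV -> Parquet, signature={with_sig}: {saved:.1%} smaller")

for fmt in ("CSV", "Parquet"):
    saved = reduction(SIZES_MB[(fmt, True)], SIZES_MB[(fmt, False)])
    print(f"{fmt}, dropping the signature: {saved:.1%} smaller")
```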

@flashbots flashbots deleted a comment Aug 6, 2023
@metachris

Perhaps the individual tx JSON file should also not contain the signature, to save 20-40% of the storage space, and because the signature is part of the rawTx anyway 🤔

@ra--

ra-- commented Aug 6, 2023

everything is part of the rawtx, apart from timestamp and chainId

@metachris

metachris commented Aug 6, 2023

It's still convenient for the summarizer service not to have to parse every single rawTx and extract the fields, although that doesn't seem too much to ask either.

I'm still undecided whether it's preferable to have the collector store some fields, or only store rawTx + timestamp (leaning towards only rawTx+timestamp, and batched+gzipped)
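To get a feel for the "batched+gzipped" option, here is a rough sketch comparing gzip applied per record with gzip applied to one batch. The records are synthetic (made-up fields and hash-derived hex strings, not the collector's real schema or real transactions), so the absolute numbers say nothing about real mempool data; only the direction of the effect is of interest:

```python
import gzip
import hashlib
import json


def synthetic_tx(i: int) -> dict:
    """Made-up record for illustration only; not the collector's real schema."""
    tx_hash = hashlib.sha256(str(i).encode()).hexdigest()
    payload = hashlib.sha512(str(i).encode()).hexdigest()
    return {
        "timestamp": 1691110800000 + i,
        "hash": "0x" + tx_hash,
        "rawTx": "0x02f8" + payload,
    }


def compare(n: int = 500):
    """Return (total bytes of per-record gzip, bytes of one batched gzip)."""
    records = [json.dumps(synthetic_tx(i)).encode() for i in range(n)]
    individual = sum(len(gzip.compress(r)) for r in records)
    batched = len(gzip.compress(b"\n".join(records)))
    return individual, batched


individual, batched = compare()
print(f"individual: {individual} bytes, batched: {batched} bytes")
```

Batching avoids paying the gzip header and trailer once per record and lets the compressor reuse repeated field names across records, on top of avoiding the per-file block overhead discussed earlier in the thread.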

@metachris metachris changed the title Mempool Collector Stats Mempool Collector Stats + Discussions Aug 6, 2023