Performance numbers for Triton VM proving

Transaction throughput on the Neptune Cash blockchain is limited by computational power: the more proving power that comes online, the more transactions each block can include.

For this reason, several power users run neptune-core on powerful machines, which earns them income in the form of composer fees and transaction fees. We own a couple of machines with powerful processors and have just run several benchmarks on Triton VM version 0.48.0, the version used by neptune-core.

For the benefit of our power users, we present these numbers below.

Triton VM performance

Threadripper 7995wx, RAM-hungry path

Log_2(padded height)                        16     17     18     19     20      21      22      23
Proving time (seconds)                    2.65   5.36  11.06  22.80  46.52   99.17  203.43  417.75
Max RAM consumption, approximated (GiB)    4.9    8.6   16.1   29.7   58.4   107.8   199.9   399.5

If you want to participate in competitive composition and transaction upgrading, you should probably be able to hit these numbers, or be close to them.
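To get a feel for how cost grows with padded height, the table can be reduced to step-to-step growth factors. The sketch below only reuses the numbers from the table; the helper name `growth_factors` is made up for illustration.

```python
# Numbers copied from the table above (Triton VM 0.48.0, RAM-hungry path).
log2_heights = [16, 17, 18, 19, 20, 21, 22, 23]
prove_seconds = [2.65, 5.36, 11.06, 22.80, 46.52, 99.17, 203.43, 417.75]
max_ram_gib = [4.9, 8.6, 16.1, 29.7, 58.4, 107.8, 199.9, 399.5]


def growth_factors(values):
    """Ratio of each entry to the one before it."""
    return [b / a for a, b in zip(values, values[1:])]


time_factors = growth_factors(prove_seconds)
ram_factors = growth_factors(max_ram_gib)

for h, t, r in zip(log2_heights[1:], time_factors, ram_factors):
    print(f"2^{h - 1} -> 2^{h}: time x{t:.2f}, RAM x{r:.2f}")
```

On these measurements, each doubling of the padded height slightly more than doubles the proving time and roughly doubles the RAM requirement. Treat this as a rule of thumb for this machine, not a guarantee for other hardware.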

From these numbers, we can estimate the time it takes to create a block proposal. A block proposal requires three proofs:

  • one proof of padded height 2^{21} (raising a ProofCollection to a SingleProof)
  • one 2^{20} proof (a merge of two SingleProofs)
  • one 2^{19} proof (BlockProgram).

Summing the time for each proof (99.17 + 46.52 + 22.80) gives a total of 168.49 seconds, a little less than three minutes. This assumes your node’s mempool is non-empty, that is, that there is a synced SingleProof transaction ready to be merged into your coinbase transaction.
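The estimate can be reproduced from the table, or redone with your own benchmark results. This is a minimal sketch: the step labels are just the descriptions from the list above, and the timings are the Threadripper numbers.

```python
# Proving time in seconds per log2(padded height), from the table above.
prove_seconds = {19: 22.80, 20: 46.52, 21: 99.17}

# The three proofs that make up a block proposal, with their padded heights.
steps = [
    ("raise ProofCollection to SingleProof", 21),
    ("merge two SingleProofs", 20),
    ("BlockProgram", 19),
]

total_seconds = sum(prove_seconds[log2_height] for _, log2_height in steps)

for label, log2_height in steps:
    print(f"2^{log2_height}  {prove_seconds[log2_height]:7.2f} s  {label}")
print(f"total  {total_seconds:.2f} s  ({total_seconds / 60:.1f} min)")
```

Replacing the values in `prove_seconds` with your own measurements gives the corresponding estimate for your setup.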

To benchmark proving speeds on your setup, you can follow these instructions that @jfs just added.

We are happy to report that the newly released Triton VM v0.49.0 reduces proving time by 5-10%. We can now build a proof with a padded height of 2^{21} in 93 seconds, down from 99 seconds with version 0.48.0.
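For the 2^{21} case, the improvement can be checked against the earlier table. The sketch assumes the 0.48.0 table entry and the 0.49.0 profile time are directly comparable (same machine, similar workload), which the post implies but does not spell out.

```python
# Proving time for padded height 2^21, in seconds.
seconds_v0_48 = 99.17  # from the Triton VM 0.48.0 table
seconds_v0_49 = 93.30  # from the Triton VM 0.49.0 profile

reduction_percent = (seconds_v0_48 - seconds_v0_49) / seconds_v0_48 * 100
print(f"reduction: {reduction_percent:.1f} %")
```

The result lands at the lower end of the quoted 5-10% range.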

Profile:

$ triton-cli --profile prove --program spin.tasm --input 21
### Triton VM – Prove                          93.30s    #Reps   Share  Category          108.2 GiB
├─trace execution                             900.76ms       1   0.97%  (gen  –  5.49%)  +511.8 MiB
├─Fiat-Shamir: claim                            5.66µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─derive additional parameters                132.85µs       1   0.00%                       ±0 B  
├─main tables                                  45.09s        1  48.33%                    +68.4 GiB
│ ├─create                                      7.21s        1   7.73%  (gen  – 43.91%)    +6.3 GiB
│ ├─pad                                       402.28ms       1   0.43%  (gen  –  2.45%)    +5.9 MiB
│ │ ├─pad original tables                     264.67ms       1   0.28%                     +5.9 MiB
│ │ └─fill degree-lowering table              137.46ms       1   0.15%                       ±0 B  
│ ├─LDE                                        23.18s        1  24.84%  (LDE  – 55.00%)   +62.2 GiB
│ │ ├─polynomial zero-initialization            2.23µs       1   0.00%                       ±0 B  
│ │ ├─interpolation                             1.90s        1   2.04%                    +14.5 GiB
│ │ ├─resize                                    1.25s        1   1.34%                    +47.3 GiB
│ │ ├─evaluation                               20.03s        1  21.47%                   +402.5 MiB
│ │ └─memoize                                 931.00ns       1   0.00%                       ±0 B  
│ ├─Merkle tree                                 6.06s        1   6.50%                     +1.7 GiB
│ │ ├─leafs                                     5.82s        1   6.24%                   +496.5 MiB
│ │ │ └─hash rows                               5.82s        1   6.24%  (hash – 51.26%)  +496.5 MiB
│ │ └─Merkle tree                             208.04ms       1   0.22%  (hash –  1.83%)    +1.2 GiB
│ ├─Fiat-Shamir                                24.67µs       1   0.00%  (hash –  0.00%)      ±0 B  
│ └─extend                                      7.91s        1   8.48%  (gen  – 48.16%)    +4.1 GiB
│   ├─initialize master table                 148.73ms       1   0.16%                     +4.1 GiB
│   ├─slice master table                        4.91µs       1   0.00%                       ±0 B  
│   ├─all tables                                7.68s        1   8.24%                     +1.5 MiB
│   └─fill degree lowering table               75.26ms       1   0.08%                       ±0 B  
├─aux tables                                   19.26s        1  20.64%                    +34.3 GiB
│ ├─LDE                                        14.20s        1  15.22%  (LDE  – 33.69%)   +37.2 GiB
│ │ ├─polynomial zero-initialization            1.66µs       1   0.00%                       ±0 B  
│ │ ├─interpolation                             1.83s        1   1.96%                     +4.2 GiB
│ │ ├─resize                                  866.18ms       1   0.93%                    +32.1 GiB
│ │ ├─evaluation                               11.50s        1  12.33%                   +125.8 MiB
│ │ └─memoize                                 991.00ns       1   0.00%                       ±0 B  
│ ├─Merkle tree                                 4.85s        1   5.20%                     +1.2 GiB
│ │ ├─leafs                                     4.62s        1   4.96%                   +547.5 MiB
│ │ │ └─hash rows                               4.62s        1   4.96%  (hash – 40.71%)  +547.5 MiB
│ │ └─Merkle tree                             197.17ms       1   0.21%  (hash –  1.74%)    +1.2 GiB
│ └─Fiat-Shamir                               124.27µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─quotient calculation (cached)                 6.00s        1   6.43%  (CC   – 68.85%)  +379.7 MiB
│ ├─zerofier inverse                            1.66s        1   1.78%                   +510.1 MiB
│ └─evaluate AIR, compute quotient codeword     4.30s        1   4.61%                   +378.0 MiB
├─quotient LDE                                  4.77s        1   5.11%  (LDE  – 11.32%)    +1.4 GiB
├─hash rows of quotient segments              292.75ms       1   0.31%  (hash –  2.58%)  +633.0 MiB
├─Merkle tree                                 212.98ms       1   0.23%  (hash –  1.88%)    +1.3 GiB
├─out-of-domain rows                            4.82s        1   5.17%                   +151.1 MiB
├─Fiat-Shamir                                  66.43µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─linear combination                            3.76s        1   4.03%                   +603.4 MiB
│ ├─main                                      379.16ms       1   0.41%  (CC   –  4.35%)   +18.6 MiB
│ ├─aux                                       332.63ms       1   0.36%  (CC   –  3.82%)    +8.5 MiB
│ └─quotient                                    1.62s        1   1.74%  (CC   – 18.58%)  +239.1 MiB
├─DEEP                                          1.13s        1   1.21%                     +1.4 GiB
│ ├─main&aux curr row                         365.74ms       1   0.39%                   +294.8 MiB
│ ├─main&aux next row                         379.78ms       1   0.41%                   +442.2 MiB
│ └─segmented quotient                        386.25ms       1   0.41%                   +330.4 MiB
├─combined DEEP polynomial                    382.87ms       1   0.41%                   -721.6 MiB
│ └─sum                                       382.82ms       1   0.41%  (CC   –  4.39%)  -721.6 MiB
├─FRI                                           1.65s        1   1.77%                    +32.1 MiB
└─open trace leafs                            846.33µs       1   0.00%                       ±0 B  

### Categories
LDE     42.15s  45.18%
gen     16.42s  17.60%
hash    11.36s  12.17%
CC       8.71s   9.34%

Clock frequency is 7545 Hz (704027 clock cycles / (93301 ms / 1 iterations))
Optimal clock frequency is 22477 Hz (2097152 padded height / (93301 ms / 1 iterations))
FRI domain length is 2^24
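
The two clock-frequency lines at the end of the profile are plain ratios of the inputs shown in their parentheses. The sketch below recomputes them. The function name and the `cycles_per_padded_row` ratio are illustrative additions, not part of the triton-cli output.

```python
def frequency_hz(count: int, elapsed_ms: int) -> float:
    """Items (cycles or rows) processed per second of proving time."""
    return count / (elapsed_ms / 1000)


clock_cycles = 704_027      # "clock cycles" from the profile footer
padded_height = 2_097_152   # 2^21, "padded height" from the profile footer
elapsed_ms = 93_301         # total proving time from the profile footer

clock_hz = frequency_hz(clock_cycles, elapsed_ms)
optimal_hz = frequency_hz(padded_height, elapsed_ms)
cycles_per_padded_row = clock_cycles / padded_height

# Truncating to whole Hz reproduces the figures printed in the profile.
print(int(clock_hz), int(optimal_hz), f"{cycles_per_padded_row:.1%}")
```

The last figure shows that the executed cycles account for about a third of the padded height in this run, which is why the reported clock frequency is well below the "optimal" one.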

With version 0.49.0, we can also prove padded heights of 2^{24} on our machine, provided the right environment variables are set (RAYON_NUM_THREADS and TVM_LDE_TRACE in the run below). This will allow bigger transactions to be mined without going through the merge path.

$ RAYON_NUM_THREADS=90 TVM_LDE_TRACE="no_cache" triton-cli --profile prove --program spin.tasm --input 24
### Triton VM – Prove                             1417.02s    #Reps   Share  Category          395.8 GiB
├─trace execution                                   7.50s        1   0.53%  (gen  –  4.34%)    +3.1 GiB
├─Fiat-Shamir: claim                                7.25µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─derive additional parameters                      1.01ms       1   0.00%                       ±0 B  
├─main tables                                     455.55s        1  32.15%                    +94.9 GiB
│ ├─create                                         95.74s        1   6.76%  (gen  – 55.36%)   +49.9 GiB
│ ├─pad                                             3.92s        1   0.28%  (gen  –  2.27%)  -800.0 KiB
│ │ ├─pad original tables                           2.19s        1   0.15%                   -340.0 KiB
│ │ └─fill degree-lowering table                    1.73s        1   0.12%                   -460.0 KiB
│ ├─Merkle tree                                   290.11s        1  20.47%                    +11.1 GiB
│ │ ├─leafs                                       288.14s        1  20.33%                     +7.1 GiB
│ │ │ ├─LDE                                       205.14s        5  14.48%  (LDE  – 26.88%)  +420.5 MiB
│ │ │ └─hash rows                                  42.05s        5   2.97%  (hash – 49.46%)      ±0 B  
│ │ └─Merkle tree                                   1.72s        1   0.12%  (hash –  2.03%)    +9.1 GiB
│ ├─Fiat-Shamir                                    26.40µs       1   0.00%  (hash –  0.00%)      ±0 B  
│ └─extend                                         65.78s        1   4.64%  (gen  – 38.04%)   +33.3 GiB
│   ├─initialize master table                       1.08s        1   0.08%                    +32.1 GiB
│   ├─slice master table                            4.71µs       1   0.00%                       ±0 B  
│   ├─all tables                                   63.75s        1   4.50%                    +52.1 MiB
│   └─fill degree lowering table                  950.31ms       1   0.07%                     -1.2 MiB
├─aux tables                                      258.34s        1  18.23%                     +9.1 GiB
│ ├─Merkle tree                                   258.34s        1  18.23%                     +9.1 GiB
│ │ ├─leafs                                       256.35s        1  18.09%                     +4.1 GiB
│ │ │ ├─LDE                                       121.23s        1   8.56%  (LDE  – 15.88%)   +44.4 MiB
│ │ │ └─hash rows                                  35.09s        1   2.48%  (hash – 41.28%)      ±0 B  
│ │ └─Merkle tree                                   1.74s        1   0.12%  (hash –  2.05%)    +9.1 GiB
│ └─Fiat-Shamir                                   104.15µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─quotient calculation (just-in-time)             544.45s        1  38.42%                    +15.4 GiB
│ ├─zero-initialization                             2.07s        1   0.15%                    +83.4 GiB
│ ├─fetch trace randomizers                       389.83µs       1   0.00%                       ±0 B  
│ ├─poly interpolate                               22.80s        1   1.61%  (LDE  –  2.99%)      ±0 B  
│ ├─calculate quotients                           466.82s        1  32.94%                    +14.2 MiB
│ │ ├─poly evaluate                               183.02s        8  12.92%  (LDE  – 23.98%)  +102.9 MiB
│ │ ├─trace randomizers                           209.07s        8  14.75%  (LDE  – 27.39%)    -2.1 MiB
│ │ └─AIR evaluation                               74.74s        8   5.27%  (AIR  –100.00%)   -98.1 MiB
│ │   ├─zerofier inverse                           13.80s        8   0.97%                   +432.7 MiB
│ │   └─evaluate AIR, compute quotient codeword    60.35s        8   4.26%                   +363.6 MiB
│ ├─segmentify                                     26.98s        1   1.90%                    +12.4 GiB
│ └─restore original trace                         21.95s        1   1.55%  (LDE  –  2.88%)      ±0 B  
├─hash rows of quotient segments                    2.67s        1   0.19%  (hash –  3.14%)    +4.1 GiB
├─Merkle tree                                       1.74s        1   0.12%  (hash –  2.04%)    +9.1 GiB
├─out-of-domain rows                               43.33s        1   3.06%                   +111.1 MiB
├─Fiat-Shamir                                      68.37µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─linear combination                               38.32s        1   2.70%                     +4.2 GiB
│ ├─main                                            3.34s        1   0.24%  (CC   – 12.38%)  +303.1 MiB
│ ├─aux                                             3.22s        1   0.23%  (CC   – 11.95%)  +363.9 MiB
│ └─quotient                                       17.62s        1   1.24%  (CC   – 65.34%)  +768.3 MiB
├─DEEP                                             11.08s        1   0.78%                     +8.1 GiB
│ ├─main&aux curr row                               3.61s        1   0.26%                     +2.1 GiB
│ ├─main&aux next row                               3.66s        1   0.26%                     +2.1 GiB
│ └─segmented quotient                              3.81s        1   0.27%                     +3.1 GiB
├─combined DEEP polynomial                          2.78s        1   0.20%                     -5.1 GiB
│ └─sum                                             2.78s        1   0.20%  (CC   – 10.32%)    -5.1 GiB
├─FRI                                              11.38s        1   0.80%                     -4.8 MiB
└─open trace leafs                                 32.88s        1   2.32%                     +5.9 GiB
  └─recompute rows                                 32.86s        2   2.32%                     +3.8 GiB

### Categories
LDE    763.21s  53.86%
gen    172.95s  12.20%
hash    85.01s   6.00%
AIR     74.74s   5.27%
CC      26.96s   1.90%

Clock frequency is 3974 Hz (5632027 clock cycles / (1417016 ms / 1 iterations))
Optimal clock frequency is 11839 Hz (16777216 padded height / (1417016 ms / 1 iterations))
FRI domain length is 2^27
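
One way to read the two profiles side by side is to scale the 2^{21} run linearly by 8 and compare it with the 2^{24} run. This is a rough sketch. Linear scaling is an assumption (the 0.48.0 table suggests time grows slightly faster than that), the memory values are the figures from each profile's header line, and the 2^{24} run also changed RAYON_NUM_THREADS, so the difference is not attributable to `TVM_LDE_TRACE="no_cache"` alone.

```python
# Total time and header memory figure of the two v0.49.0 profiles.
runs = {
    21: {"seconds": 93.30, "mem_gib": 108.2},    # default settings
    24: {"seconds": 1417.02, "mem_gib": 395.8},  # RAYON_NUM_THREADS=90, no_cache
}

scale = 2 ** (24 - 21)  # 8 times as many rows

# Hypothetical linear extrapolation of the 2^21 run to 2^24.
linear_seconds = runs[21]["seconds"] * scale
linear_gib = runs[21]["mem_gib"] * scale

time_penalty = runs[24]["seconds"] / linear_seconds
memory_saving = linear_gib / runs[24]["mem_gib"]

print(f"extrapolated: {linear_seconds:.0f} s, {linear_gib:.0f} GiB")
print(f"measured:     {runs[24]['seconds']:.0f} s, {runs[24]['mem_gib']:.0f} GiB")
print(f"time x{time_penalty:.2f}, memory /{memory_saving:.2f}")
```

Under these assumptions, the 2^{24} run takes a bit under twice the extrapolated time while needing less than half the extrapolated memory. The extrapolated memory figure would also exceed the 768 GB installed in the benchmark machine.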

As with all other benchmarks in this post, this was performed on a Threadripper 7995wx with 768 GB of RAM.