Performance numbers for Triton VM proving

With version 0.49.0, we can now prove at a padded height of 2^24 on our machine, provided the right environment variables are set (see the command below). This allows bigger transactions to be mined without going through the merge path.

$ RAYON_NUM_THREADS=90 TVM_LDE_TRACE="no_cache" triton-cli --profile prove --program spin.tasm --input 24
### Triton VM – Prove                             1417.02s    #Reps   Share  Category          395.8 GiB
├─trace execution                                   7.50s        1   0.53%  (gen  –  4.34%)    +3.1 GiB
├─Fiat-Shamir: claim                                7.25µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─derive additional parameters                      1.01ms       1   0.00%                       ±0 B  
├─main tables                                     455.55s        1  32.15%                    +94.9 GiB
│ ├─create                                         95.74s        1   6.76%  (gen  – 55.36%)   +49.9 GiB
│ ├─pad                                             3.92s        1   0.28%  (gen  –  2.27%)  -800.0 KiB
│ │ ├─pad original tables                           2.19s        1   0.15%                   -340.0 KiB
│ │ └─fill degree-lowering table                    1.73s        1   0.12%                   -460.0 KiB
│ ├─Merkle tree                                   290.11s        1  20.47%                    +11.1 GiB
│ │ ├─leafs                                       288.14s        1  20.33%                     +7.1 GiB
│ │ │ ├─LDE                                       205.14s        5  14.48%  (LDE  – 26.88%)  +420.5 MiB
│ │ │ └─hash rows                                  42.05s        5   2.97%  (hash – 49.46%)      ±0 B  
│ │ └─Merkle tree                                   1.72s        1   0.12%  (hash –  2.03%)    +9.1 GiB
│ ├─Fiat-Shamir                                    26.40µs       1   0.00%  (hash –  0.00%)      ±0 B  
│ └─extend                                         65.78s        1   4.64%  (gen  – 38.04%)   +33.3 GiB
│   ├─initialize master table                       1.08s        1   0.08%                    +32.1 GiB
│   ├─slice master table                            4.71µs       1   0.00%                       ±0 B  
│   ├─all tables                                   63.75s        1   4.50%                    +52.1 MiB
│   └─fill degree lowering table                  950.31ms       1   0.07%                     -1.2 MiB
├─aux tables                                      258.34s        1  18.23%                     +9.1 GiB
│ ├─Merkle tree                                   258.34s        1  18.23%                     +9.1 GiB
│ │ ├─leafs                                       256.35s        1  18.09%                     +4.1 GiB
│ │ │ ├─LDE                                       121.23s        1   8.56%  (LDE  – 15.88%)   +44.4 MiB
│ │ │ └─hash rows                                  35.09s        1   2.48%  (hash – 41.28%)      ±0 B  
│ │ └─Merkle tree                                   1.74s        1   0.12%  (hash –  2.05%)    +9.1 GiB
│ └─Fiat-Shamir                                   104.15µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─quotient calculation (just-in-time)             544.45s        1  38.42%                    +15.4 GiB
│ ├─zero-initialization                             2.07s        1   0.15%                    +83.4 GiB
│ ├─fetch trace randomizers                       389.83µs       1   0.00%                       ±0 B  
│ ├─poly interpolate                               22.80s        1   1.61%  (LDE  –  2.99%)      ±0 B  
│ ├─calculate quotients                           466.82s        1  32.94%                    +14.2 MiB
│ │ ├─poly evaluate                               183.02s        8  12.92%  (LDE  – 23.98%)  +102.9 MiB
│ │ ├─trace randomizers                           209.07s        8  14.75%  (LDE  – 27.39%)    -2.1 MiB
│ │ └─AIR evaluation                               74.74s        8   5.27%  (AIR  –100.00%)   -98.1 MiB
│ │   ├─zerofier inverse                           13.80s        8   0.97%                   +432.7 MiB
│ │   └─evaluate AIR, compute quotient codeword    60.35s        8   4.26%                   +363.6 MiB
│ ├─segmentify                                     26.98s        1   1.90%                    +12.4 GiB
│ └─restore original trace                         21.95s        1   1.55%  (LDE  –  2.88%)      ±0 B  
├─hash rows of quotient segments                    2.67s        1   0.19%  (hash –  3.14%)    +4.1 GiB
├─Merkle tree                                       1.74s        1   0.12%  (hash –  2.04%)    +9.1 GiB
├─out-of-domain rows                               43.33s        1   3.06%                   +111.1 MiB
├─Fiat-Shamir                                      68.37µs       1   0.00%  (hash –  0.00%)      ±0 B  
├─linear combination                               38.32s        1   2.70%                     +4.2 GiB
│ ├─main                                            3.34s        1   0.24%  (CC   – 12.38%)  +303.1 MiB
│ ├─aux                                             3.22s        1   0.23%  (CC   – 11.95%)  +363.9 MiB
│ └─quotient                                       17.62s        1   1.24%  (CC   – 65.34%)  +768.3 MiB
├─DEEP                                             11.08s        1   0.78%                     +8.1 GiB
│ ├─main&aux curr row                               3.61s        1   0.26%                     +2.1 GiB
│ ├─main&aux next row                               3.66s        1   0.26%                     +2.1 GiB
│ └─segmented quotient                              3.81s        1   0.27%                     +3.1 GiB
├─combined DEEP polynomial                          2.78s        1   0.20%                     -5.1 GiB
│ └─sum                                             2.78s        1   0.20%  (CC   – 10.32%)    -5.1 GiB
├─FRI                                              11.38s        1   0.80%                     -4.8 MiB
└─open trace leafs                                 32.88s        1   2.32%                     +5.9 GiB
  └─recompute rows                                 32.86s        2   2.32%                     +3.8 GiB

### Categories
LDE    763.21s  53.86%
gen    172.95s  12.20%
hash    85.01s   6.00%
AIR     74.74s   5.27%
CC      26.96s   1.90%
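
Each share in the Categories table is that category's time divided by the total proving time of 1417.02s. The short Python check below uses only numbers copied from the profile above (the variable names are ours) and also shows how much of the run is not attributed to any category. Because the profile prints rounded times, a recomputed share can differ from the printed one in the last digit.

```python
# Numbers copied from the profile above; names are illustrative only.
TOTAL_S = 1417.02
CATEGORIES_S = {"LDE": 763.21, "gen": 172.95, "hash": 85.01, "AIR": 74.74, "CC": 26.96}


def share(seconds, total=TOTAL_S):
    """Percentage of the total proving time."""
    return 100.0 * seconds / total


shares = {name: share(s) for name, s in CATEGORIES_S.items()}
categorized_s = sum(CATEGORIES_S.values())
uncategorized_s = TOTAL_S - categorized_s

for name, pct in shares.items():
    print(f"{name:<5}{CATEGORIES_S[name]:>9.2f}s {pct:6.2f}%")
print(f"uncategorized {uncategorized_s:.2f}s ({share(uncategorized_s):.2f}%)")
```

By this arithmetic, about 294s, roughly a fifth of the run, is spent in steps that carry no category label.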

Clock frequency is 3974 Hz (5632027 clock cycles / (1417016 ms / 1 iterations))
Optimal clock frequency is 11839 Hz (16777216 padded height / (1417016 ms / 1 iterations))
FRI domain length is 2^27
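
Both frequencies are a count divided by the wall-clock proving time. The printed values match a result that is truncated instead of rounded; that is inferred from the numbers above, not taken from the profiler's source. A minimal Python sketch that reproduces them, with variable names of our own choosing:

```python
# Figures copied from the profiler summary above.
CLOCK_CYCLES = 5_632_027
PADDED_HEIGHT = 16_777_216  # 2^24
PROVE_TIME_MS = 1_417_016
FRI_DOMAIN_LENGTH = 2**27


def frequency_hz(count, millis):
    """Events per second, truncated to an integer (assumed from the printed values)."""
    return count * 1000 // millis


clock_hz = frequency_hz(CLOCK_CYCLES, PROVE_TIME_MS)
optimal_hz = frequency_hz(PADDED_HEIGHT, PROVE_TIME_MS)

# Ratio of executed cycles to padded height.
cycles_per_row = CLOCK_CYCLES / PADDED_HEIGHT

# How much larger the FRI domain is than the padded height.
fri_ratio = FRI_DOMAIN_LENGTH // PADDED_HEIGHT

print(clock_hz, optimal_hz, f"{cycles_per_row:.1%}", fri_ratio)
```

The executed cycles amount to about a third of the padded height, which is why the "optimal" figure is about three times the measured one. The FRI domain is 8 times the padded height.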

As with all other benchmarks, this was run on a Threadripper 7995WX with 768 GB of RAM.