holindauer
2024-09-03
Hi, quick update:
As of yesterday, I was able to get .low_degree_extend_all_columns() for the MasterExtTable running on the GPU.
Initial results show it is much faster than running on the CPU. So far, I have done a few preliminary benchmarks to get a sense of the speedup, using the factorial program from the triton-vm examples dir.
Here is a printout of the elapsed times for LDE on factorials of increasing powers of 2.
CPU: Factorial(2 ** 0): 0 s / 3 ms – randomize_trace_table dim: (512, 88)
CPU: Factorial(2 ** 1): 0 s / 2 ms – randomize_trace_table dim: (512, 88)
CPU: Factorial(2 ** 2): 0 s / 2 ms – randomize_trace_table dim: (512, 88)
CPU: Factorial(2 ** 3): 0 s / 2 ms – randomize_trace_table dim: (512, 88)
CPU: Factorial(2 ** 4): 0 s / 1 ms – randomize_trace_table dim: (512, 88)
CPU: Factorial(2 ** 5): 0 s / 3 ms – randomize_trace_table dim: (1024, 88)
CPU: Factorial(2 ** 6): 0 s / 5 ms – randomize_trace_table dim: (2048, 88)
CPU: Factorial(2 ** 7): 0 s / 7 ms – randomize_trace_table dim: (4096, 88)
CPU: Factorial(2 ** 8): 0 s / 13 ms – randomize_trace_table dim: (8192, 88)
Here is what the times become when calling the futhark entry point for LDE on the GPU:
GPU: Factorial(2 ** 0): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
GPU: Factorial(2 ** 1): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
GPU: Factorial(2 ** 2): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
GPU: Factorial(2 ** 3): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
GPU: Factorial(2 ** 4): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
GPU: Factorial(2 ** 5): 0 s / 1 ms – randomize_trace_table dim: (1024, 88)
GPU: Factorial(2 ** 6): 0 s / 1 ms – randomize_trace_table dim: (2048, 88)
GPU: Factorial(2 ** 7): 0 s / 2 ms – randomize_trace_table dim: (4096, 88)
GPU: Factorial(2 ** 8): 0 s / 4 ms – randomize_trace_table dim: (8192, 88)
For factorials of (2 ** 9) and above, I begin to run into high memory consumption that leads to an unsuccessful exit. I think a potential solution would be to run LDE on batches of columns, capping each batch at the number of individual XFieldElements from the factorial(2 ** 8) run.
For example, it would work something like this (see the sketch after this list):
1. On the rust side, gather a chunk of randomized trace table columns whose total count does not exceed (8192 * 88) XFieldElements.
2. Run LDE on those selected columns.
3. Repeat steps 1 and 2 for the rest of the table.
4. Assemble the outputs into the expected format.
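Here is a minimal sketch of that batching loop, assuming the table lives in an ndarray Array2 and treating an XFieldElement as three u64 limbs; lde_on_gpu is a hypothetical stand-in for the genfut-backed entry point call:

```rust
use ndarray::{Array2, Axis};

// Hypothetical stand-ins for triton-vm's types: an XFieldElement as its
// three u64 limbs, a "polynomial" as its list of coefficients.
type XFieldElement = [u64; 3];
type Polynomial = Vec<XFieldElement>;

// Placeholder for the real genfut-backed call: convert the batch to
// [][][3]u64, run the futhark LDE entry point, convert the result back.
fn lde_on_gpu(batch: &Array2<XFieldElement>) -> Vec<Polynomial> {
    batch.axis_iter(Axis(1)).map(|col| col.to_vec()).collect()
}

// Run LDE in column batches so that no single GPU call sees more than
// `max_elements` XFieldElements at once (e.g. max_elements = 8192 * 88).
fn lde_in_batches(table: &Array2<XFieldElement>, max_elements: usize) -> Vec<Polynomial> {
    let rows = table.nrows();
    // Largest number of columns whose total element count stays within budget.
    let cols_per_batch = (max_elements / rows).max(1);

    let mut interpolants = Vec::with_capacity(table.ncols());
    for batch in table.axis_chunks_iter(Axis(1), cols_per_batch) {
        // The final batch may hold fewer columns; outputs stay in column order.
        interpolants.extend(lde_on_gpu(&batch.to_owned()));
    }
    interpolants
}
```

The trade-off is one GPU launch per batch instead of one for the whole table, but it bounds peak GPU memory by the batch size.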
I should also note that the GPU times above only include the time to run the futhark entry point from rust. They do not include the time it takes to convert rust types into the types futhark expects.
A bit of context for those who are unfamiliar with this pipeline: to provide interop between rust and futhark, we are using a library called genfut. Genfut generates a rust library with bindings to entry points in futhark code that has been compiled to CUDA/OpenCL/C.
The biggest bottleneck right now for the LDE accelerator is the time it takes to do these conversions for the randomized_trace_table. The futhark entry point accepts the randomized trace table as a 3d array of u64 (with the last dimension bound to 3: [][][3]u64 in futhark); genfut represents this intermediary type as Array_u64_3d. On the rust side, before calling the entry point, the MasterExtTable.randomized_trace_table is converted from an Array2 into [][][3]u64. The return value, the interpolation polynomials from the LDE, is converted from [][][3]u64 back into Vec<Polynomial>.
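For illustration, here is a minimal sketch of the flattening half of that conversion, assuming an XFieldElement is three u64 limbs; the flat buffer and shape are what would then get wrapped into genfut's Array_u64_3d:

```rust
use ndarray::Array2;

// Hypothetical stand-in: an XFieldElement as its three u64 limbs.
type XFieldElement = [u64; 3];

// Flatten an (n_rows, n_cols) table of XFieldElements into the row-major
// u64 buffer backing a [][][3]u64 futhark array, along with its shape.
fn to_futhark_buffer(table: &Array2<XFieldElement>) -> (Vec<u64>, [usize; 3]) {
    let (rows, cols) = table.dim();
    let mut flat = Vec::with_capacity(rows * cols * 3);
    for xfe in table.iter() {
        // Array2::iter visits elements in logical row-major order.
        flat.extend_from_slice(xfe);
    }
    (flat, [rows, cols, 3])
}
```

Each direction copies rows * cols * 3 u64 values, which is consistent with the conversion times below growing with the table size.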
The times associated with these conversions are listed here:
Conversion to genfut types: 0 s / 0 ms
GPU: Factorial(2 ** 0): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
Conversion back to rust types: 0 s / 2 ms
Conversion to genfut types: 0 s / 0 ms
GPU: Factorial(2 ** 1): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
Conversion back to rust types: 0 s / 1 ms
Conversion to genfut types: 0 s / 0 ms
GPU: Factorial(2 ** 2): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
Conversion back to rust types: 0 s / 2 ms
Conversion to genfut types: 0 s / 0 ms
GPU: Factorial(2 ** 3): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
Conversion back to rust types: 0 s / 1 ms
Conversion to genfut types: 0 s / 0 ms
GPU: Factorial(2 ** 4): 0 s / 0 ms – randomize_trace_table dim: (512, 88)
Conversion back to rust types: 0 s / 1 ms
Conversion to genfut types: 0 s / 1 ms
GPU: Factorial(2 ** 5): 0 s / 1 ms – randomize_trace_table dim: (1024, 88)
Conversion back to rust types: 0 s / 3 ms
Conversion to genfut types: 0 s / 2 ms
GPU: Factorial(2 ** 6): 0 s / 2 ms – randomize_trace_table dim: (2048, 88)
Conversion back to rust types: 0 s / 6 ms
Conversion to genfut types: 0 s / 5 ms
GPU: Factorial(2 ** 7): 0 s / 3 ms – randomize_trace_table dim: (4096, 88)
Conversion back to rust types: 0 s / 13 ms
Conversion to genfut types: 0 s / 9 ms
GPU: Factorial(2 ** 8): 0 s / 4 ms – randomize_trace_table dim: (8192, 88)
Conversion back to rust types: 0 s / 25 ms
I think this could be optimized primarily by reducing the number of conversions that happen between rust and futhark.
My next immediate step will probably be updating the Merkle Tree entry points from the original ruthark fork. I believe they were using a hash function other than Tip5; it shouldn't be too bad to make that change.
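To make the scope of that change concrete, here is a toy sketch of one Merkle tree layer; the Digest type and hash_pair combiner are hypothetical placeholders, and swapping the hash function in the entry points essentially means replacing the two-to-one compression with Tip5's:

```rust
// Hypothetical placeholder digest; Tip5 digests are five field elements.
type Digest = [u64; 5];

// Placeholder two-to-one compression; the real entry point would call Tip5.
fn hash_pair(left: &Digest, right: &Digest) -> Digest {
    let mut out = [0u64; 5];
    for i in 0..5 {
        out[i] = left[i].wrapping_mul(31).wrapping_add(right[i]);
    }
    out
}

// One Merkle tree layer: hash each adjacent pair of nodes. The hash
// function is the only hash-specific piece to swap out.
fn merkle_layer(nodes: &[Digest]) -> Vec<Digest> {
    assert!(nodes.len() % 2 == 0, "layer size must be even");
    nodes.chunks(2).map(|pair| hash_pair(&pair[0], &pair[1])).collect()
}
```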
Let me know your thoughts on the project at this stage and my suggested improvements.
Thank you,
-Hunter