N-Body Simulations
INTRODUCTION
In physics, the n-body problem is the problem of predicting the individual motions of a group of celestial objects interacting with each other gravitationally. Solving this problem is motivated by the desire to understand the motions of the sun, moon, planets, and stars. There are three main equations in our algorithm. The first equation calculates the acceleration of a body with respect to all other bodies in the system. This equation comes from the NVIDIA GPU Gems 3 textbook.1
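For reference, the softened acceleration on body i as presented in GPU Gems 3 can be written as

\[
\mathbf{a}_i \approx G \sum_{1 \le j \le N} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^{2} + \epsilon^{2} \right)^{3/2}},
\qquad \mathbf{r}_{ij} = \mathbf{x}_j - \mathbf{x}_i .
\]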
This equation is derived from the equation for gravitational force. The epsilon-squared term is a dampening (softening) factor that prevents values from going to infinity when two bodies get very close. We obtain acceleration from force by dividing by mass. Finally, our team will not be using the gravitational constant, since the values in our simulation are relative and do not represent real-world distances or masses.
The implementation of this equation from GPU Gems 3 is shown below.1
#define EPS2 0.01 // dampening factor from the textbook

__device__ __host__ float3 bodyBodyInteraction(float4 bi, float4 bj, float3 ai)
{
    float3 r;
    // r_ij [3 FLOPS]
    r.x = bj.x - bi.x;
    r.y = bj.y - bi.y;
    r.z = bj.z - bi.z;
    // distSqr = dot(r_ij, r_ij) + EPS^2 [6 FLOPS]
    float distSqr = r.x * r.x + r.y * r.y + r.z * r.z + EPS2;
    // invDistCube = 1/distSqr^(3/2) [4 FLOPS (2 mul, 1 sqrt, 1 inv)]
    float distSixth = distSqr * distSqr * distSqr;
    float invDistCube = 1.0f / sqrtf(distSixth);
    // s = m_j * invDistCube [1 FLOP]
    float s = bj.w * invDistCube;
    // a_i = a_i + s * r_ij [6 FLOPS]
    ai.x += r.x * s;
    ai.y += r.y * s;
    ai.z += r.z * s;
    return ai;
}
We will also be using leapfrog integration to obtain the change in velocity and the change in position. Using the acceleration calculated above, we multiply it by the time step, which is 0.25 in our simulation, and add that value to the current velocity. The same logic is then applied to the position using the newly calculated velocity.
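Concretely, with a time step of 0.25, the per-step updates applied to each body i in both the CPU and GPU versions are

\[
\mathbf{v}_i(t + \Delta t) = \mathbf{v}_i(t) + \mathbf{a}_i(t)\,\Delta t,
\qquad
\mathbf{x}_i(t + \Delta t) = \mathbf{x}_i(t) + \mathbf{v}_i(t + \Delta t)\,\Delta t,
\qquad
\Delta t = 0.25 .
\]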
EXAMPLE OUTPUT
To run our n-body simulation, the number of bodies must be passed in as well as the number of steps for the simulation to run, as seen below.
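A hypothetical invocation is sketched below; the executable name and argument order are assumptions for illustration, not taken from our build.

./nbody <number_of_bodies> <number_of_steps>
./nbody 1024 100

The second line would simulate 1,024 bodies for 100 time steps.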
The CPU implementation outputs a CSV file after the bodies’ positions are updated at the end of each time step. Each CSV file contains a row for each body, with the columns representing the body’s x, y, z, and mass, respectively. These CSV files are then iterated over to create the renderings. Below is an embedded video of one of the renderings. A playlist has been created with 3 different videos using various rendering libraries and varying numbers of bodies.
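A minimal sketch of this output step is shown below; the file-naming scheme, struct layout, and helper name are assumptions for illustration.

#include <stdio.h>

/* Mirrors the float4 layout used on the GPU: w holds the body's mass. */
typedef struct { float x, y, z, w; } body_t;

/* Hypothetical helper: writes one CSV per time step,
   one row per body, with columns x, y, z, mass. */
static void write_step_csv(const body_t *bodies, int n, int step)
{
    char name[64];
    snprintf(name, sizeof(name), "step_%04d.csv", step);
    FILE *f = fopen(name, "w");
    if (f == NULL) return;
    for (int i = 0; i < n; i++)
        fprintf(f, "%f,%f,%f,%f\n", bodies[i].x, bodies[i].y, bodies[i].z, bodies[i].w);
    fclose(f);
}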
CPU IMPLEMENTATION
The CPU implementation has 3 nested for loops. The outermost loop iterates over each time step until the maximum number of steps is reached, and within it all of the bodies’ accelerations, then velocities, then positions are computed in series. The acceleration computation contains the other 2 nested loops: one iterates over every body, and the inner loop iterates over all bodies again to calculate the current outer-loop body’s acceleration. This happens in O(n^2) time. The velocity and position functions iterate over each body to perform the leapfrog updates, so they run in O(n) time. The final time equation is T(n) = k*(n^2 + 2n), where k is the number of steps. Since k can be treated as a constant, the final time complexity is O(n^2). The randomization of the starting positions is also done on the CPU, since GPU setup and initialization seemed overly complicated for the limited scope of this single function. The largest bottleneck in the CPU code is the computation of the bodies’ accelerations, which is what forces the time complexity to O(n^2). Parallelizing this should be straightforward and would reduce the largest bottleneck of the CPU implementation. A sketch of this loop structure is shown below.
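This is only a sketch of the structure described above; the array and variable names (pos, vel, accel, dt, numSteps) are assumptions, and bodyBodyInteraction is callable here because it is declared __device__ __host__.

void cpu_simulate(float4 *pos, float3 *vel, float3 *accel, int n, int numSteps, float dt)
{
    for (int step = 0; step < numSteps; step++) {
        // O(n^2) acceleration pass: every body against every other body.
        for (int i = 0; i < n; i++) {
            float3 acc = {0.0f, 0.0f, 0.0f};
            for (int j = 0; j < n; j++)
                acc = bodyBodyInteraction(pos[i], pos[j], acc);
            accel[i] = acc;
        }
        // O(n) leapfrog velocity update.
        for (int i = 0; i < n; i++) {
            vel[i].x += accel[i].x * dt;
            vel[i].y += accel[i].y * dt;
            vel[i].z += accel[i].z * dt;
        }
        // O(n) leapfrog position update.
        for (int i = 0; i < n; i++) {
            pos[i].x += vel[i].x * dt;
            pos[i].y += vel[i].y * dt;
            pos[i].z += vel[i].z * dt;
        }
        // Write this step's CSV here.
    }
}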
PARALLEL IMPLEMENTATION
N-body simulations are an inherently parallelizable problem because each body's acceleration calculation is independent of every other body's. This means that each body can be handled by a single thread, which brings the per-step runtime of O(n^2) down to O(n), since each body’s acceleration is now calculated in parallel. The loop that iterates over each time step of the CUDA implementation was left on the CPU and runs the calculate_forces, calculate_velocity, and calculate_position kernels in order, as sketched below. This was done to ensure that the acceleration was calculated for every body before any body's position was modified.
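A minimal sketch of that host-side time-step loop follows; the launch configuration (256 threads per block) and the device pointer names are assumptions.

// Host-side driver (sketch): kernels launched on the same stream execute in
// order, so every acceleration is computed before any velocity or position changes.
int threads = 256;
int blocks  = (n + threads - 1) / threads;
size_t shMem = threads * sizeof(float4);   // shared-memory tile of positions (tiled kernel only)

for (int step = 0; step < numSteps; step++) {
    gpu_calculate_forces<<<blocks, threads, shMem>>>(d_X, d_A, n);
    gpu_calculate_velocity<<<blocks, threads>>>(d_A, d_V, n, 0.25f);
    gpu_calculate_position<<<blocks, threads>>>(d_X, d_V, n, 0.25f);
}
cudaDeviceSynchronize();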
CALCULATE FORCES KERNELS
The below 2 methods are our implementation of calculating the forces for each body using tiling. Body positions and accelerations live in global memory, and each tile of positions is loaded into shared memory before the body-body interaction calculations for that tile are performed. This is a modified version of the implementation found in GPU Gems 3.1 The main differences our team introduced were more boundary checks to avoid the illegal memory access errors we experienced while debugging the code.
__global__ void gpu_calculate_forces(void *d_X, void *d_A, int n)
{
    extern __shared__ float4 shPosition[];
    float4 *globalX = (float4 *)d_X;
    float4 *globalA = (float4 *)d_A;
    float4 myPosition;
    int i, tile;
    float3 acc = {0.0f, 0.0f, 0.0f};
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;

    // Tiling: each iteration loads one block-sized tile of positions into shared memory.
    for (i = 0, tile = 0; i < n; i += blockDim.x, tile++) {
        int idx = tile * blockDim.x + threadIdx.x;
        if (idx < n) {            // boundary check: do not read past the last body
            shPosition[threadIdx.x] = globalX[idx];
        }
        __syncthreads();
        if (gtid < n) {           // boundary check: extra threads in the last block do no work
            myPosition = globalX[gtid];
            acc = tile_calculation(myPosition, acc, n - i);
        }
        __syncthreads();
    }

    // Save the result in global memory for the integration step.
    if (gtid < n) {
        float4 acc4 = {acc.x, acc.y, acc.z, 0.0f};
        globalA[gtid] = acc4;
    }
}
__device__ float3 tile_calculation(float4 myPosition, float3 accel, int bodiesLeft)
{
    int i;
    extern __shared__ float4 shPosition[];
    // Accumulate interactions with every body in the current shared-memory tile.
    for (i = 0; i < blockDim.x && i < bodiesLeft; i++) {
        accel = bodyBodyInteraction(myPosition, shPosition[i], accel);
    }
    return accel;
}
For our tileless implementation, the acceleration for each body in the simulation is calculated in parallel, where each thread computes the total acceleration for its body using our body-body interaction method, reading every other body's position directly from global memory.
__global__ void tileless_gpu_calculate_forces(float4 *d_X, float4 *d_A, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        float4 myPosition = d_X[id];
        float3 acc = {0.0f, 0.0f, 0.0f};
        for (int i = 0; i < n; i++) {
            acc = bodyBodyInteraction(myPosition, d_X[i], acc);
        }
        float4 acc4 = {acc.x, acc.y, acc.z, 0.0f};
        d_A[id] = acc4;
    }
}
CALCULATE VELOCITY KERNEL
Next, using the net acceleration calculated for each n-body, we perform a leapfrog integration to calculate the velocities: the acceleration vector is multiplied by our time step of 0.25 and added to the current velocity vector of that n-body. This runs in parallel, where each thread calculates the updated velocity of a particular n-body.
/*
    Updates Velocity based on Computed Acceleration
    Leapfrog Integration
*/
__global__ void gpu_calculate_velocity(float4 *d_A, float4 *d_V, int bodyCount, float time)
{
    int currentBody = blockIdx.x * blockDim.x + threadIdx.x;
    if (currentBody < bodyCount) {
        float3 vel = {0.0f, 0.0f, 0.0f};
        vel.x = d_V[currentBody].x + d_A[currentBody].x * time;
        vel.y = d_V[currentBody].y + d_A[currentBody].y * time;
        vel.z = d_V[currentBody].z + d_A[currentBody].z * time;
        float4 vel4 = {vel.x, vel.y, vel.z, 0.0f};
        d_V[currentBody] = vel4;
    }
}
CALCULATE POSITION KERNEL
Lastly, the position of each n-body is calculated, where each thread computes the new position of a particular n-body, again using leapfrog integration.
/*
    Calculates Position based on Velocity
    Leapfrog Integration
*/
__global__ void gpu_calculate_position(float4 *d_X, float4 *d_V, int bodyCount, float time)
{
    int currentBody = blockIdx.x * blockDim.x + threadIdx.x;
    if (currentBody < bodyCount) {
        float4 currentPos = d_X[currentBody];
        float4 pos = {0.0f, 0.0f, 0.0f, currentPos.w};   // preserve the mass in the w component
        pos.x = currentPos.x + d_V[currentBody].x * time;
        pos.y = currentPos.y + d_V[currentBody].y * time;
        pos.z = currentPos.z + d_V[currentBody].z * time;
        d_X[currentBody] = pos;
    }
}
PROFILING AND BOTTLENECKS
OVERALL RUNTIME RESULTS (LOG SCALE)
TILED VS. NON-TILED RESULTS
ANALYSIS
Our main remaining bottleneck appears to be the tiled implementation of calculating forces. As seen in our profiling, the tiled GPU kernel is faster than the CPU implementation; however, it is slower than our GPU implementation without tiling. We did not have enough time to discover the reason for this, even after thorough debugging and a deep code analysis. One issue may be the extra branches we added, which could be forcing threads to run in serial. It is also worth noting that the GPU Gems textbook contained little to no explanation of how to launch the code and what the global variables should be set to, so we may be launching the kernel slightly incorrectly, which could also be slowing it down. Interestingly, however, when our application is run through compute-sanitizer, our non-tiled kernel consistently runs much slower than the tiled version.
Another interesting note is that, under compute-sanitizer, the runtime increases at a much higher rate for the non-tiled implementation than for the tiled version. Compared to running the program normally, the tiled version's runtime increased by only about 4,000 milliseconds, while the non-tiled implementation's increased by over 10,000 milliseconds, as shown below.
While our team did not reach a conclusion about this discrepancy, if there were more time for the project, this is the question we would pursue next.
PITFALLS AND CHALLENGES
One of the main challenges our team noticed while beginning our profiling was that the final CPU results and the final GPU results diverged more as the number of floating-point calculations increased, whether through more bodies or more steps. This can be seen in the chart below.
Through some quick internet research, we found that one potential reason for this divergence may be the fundamental differences between CPU and GPU hardware. The GPU performs floating-point operations in either 32-bit or 64-bit mode, corresponding to single or double precision. The CPU, on the other hand, uses an 80-bit intermediate representation for floating-point operations, since Roise’s CPU is x86-based, as seen below.1
Another possible explanation for the differences in the floating-point numbers is the GPU’s use of the fused multiply-add optimization. Instead of computing a multiply, then an add, and rounding after each operation, a fused multiply-add performs both the multiply and the add and only rounds once after both have completed2 (a short illustration follows at the end of this section). Our team believes that the divergence of results is due to a combination of these architecture differences and the GPU’s use of fused multiply-add.

Rendering speed could also be improved in how we load our n-body simulation results into a visualization tool. With the tools that we used to visualize our n bodies, loading the results takes a very long time, as seen in our Manim render of 4096 bodies, which took ~18 hours. We could experiment with other visualization libraries or with different ways to store the results, such as storing all frames in a single CSV as opposed to a separate CSV per frame. Most likely, the best way to visualize this would be to use a graphics API like OpenGL or Vulkan; however, our team did not believe we would have the time to learn this technology for the project.
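As a minimal sketch of the fused multiply-add point above (host-side C, compiled with floating-point contraction disabled, e.g. -ffp-contract=off, so the compiler does not silently fuse the plain expression):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 1.0f + 1e-7f;          // rounds to 1 + 2^-23
    float b = 1.0f - 1e-7f;          // rounds to 1 - 2^-23
    float c = -1.0f;
    float separate = a * b + c;      // a*b rounds to 1.0f before the add, so the result is 0
    float fused    = fmaf(a, b, c);  // one rounding after multiply-and-add: about -1.4e-14
    printf("separate: %g\nfused:    %g\n", separate, fused);
    return 0;
}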
CITATIONS
1. Chapter 31. Fast N-Body Simulation with CUDA | NVIDIA Developer