Performance
suanPan prioritizes performance and designs the analysis logic in a parallel context.
Although most common analysis types can be parallelized, some parts have strong data dependencies and cannot be parallelized. According to Amdahl's law, this serial portion places an upper bound on the theoretical speedup.
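As a brief illustration of that bound, Amdahl's law can be written as follows, where $p$ is the fraction of the run time that can be parallelized and $n$ is the number of threads. Both symbols and the numbers below are introduced here only as an example and are not measured values.

$$
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
$$

Assuming $p = 0.95$, sixteen threads give $S(16) = 1 / (0.05 + 0.059375) \approx 9.1$, and no number of threads can exceed $1 / 0.05 = 20$.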
For example, consider a static analysis of a simple model with a sufficiently large number of elements. No local iteration is required to update element status, so the major tasks are to assemble the global stiffness matrix and solve the resulting system. In such a case, the performance is likely governed by the CPU capacity, and a high GFLOPS value (close to the practical limit) can often be achieved.
However, if one chooses to perform a dynamic analysis of the same model with a fairly sophisticated time integration algorithm, such as GSSSS, the effective stiffness is the sum of scaled versions of several global matrices. The analysis may then be bound by memory operations, which eventually leads to a lower GFLOPS value.
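A rough sketch of why this step tends to be memory bound is given below. The symbols are generic placeholders (stiffness $K$, damping $C$, mass $M$, and scalar factors $c_K$, $c_C$, $c_M$ set by the integrator and the time step), not the names used in the suanPan source.

$$
\hat{K} = c_K K + c_C C + c_M M
$$

Assuming double-precision storage and a plain streaming update, each entry of $\hat{K}$ needs three reads and one write (32 bytes) for five floating-point operations, that is, roughly 0.16 operations per byte. This is far lower than what a matrix factorization achieves, so the memory bandwidth rather than the arithmetic units is likely to be the limit.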
In the nonlinear context, it is even more complicated. Several additional factors, such as the complexity of the material models used, the use of constraints, and the element type, can all affect the performance.
Nevertheless, experience has shown that the performance is generally good enough for most cases.
Users are encouraged to profile the performance of various analysis types, for example, with perf.
Analysis Configurations
Here are some tips that may improve the performance.
- If the analysis is known to be linear elastic, use `set linear_system true` to skip the convergence test and iteration. Note that the analysis must be both materially and geometrically linear.
- If the global system is known to be symmetric, use `set symm_mat true` to enable symmetric storage. Analyses involving 1D materials are mostly (not always) symmetric. Analyses involving 2D and 3D materials are mostly (not always) not symmetric.
- Consider a proper stepping strategy. A fixed step size may be unnecessarily expensive. A proper adaptive stepping strategy can significantly improve the performance.
- Prefer a dense solver over a sparse solver if the system is small. A dense solver is generally faster than a sparse solver for small systems.
- Prefer a mixed-precision algorithm (`set precision mixed`) over a full-precision algorithm if the system is large. A mixed-precision algorithm is generally faster than a full-precision algorithm for large systems. See the Mixed-Precision Algorithm section for details.
- The performance of different sparse solvers can vary significantly. It is recommended to try several solvers to find the best one.
Mixed-Precision Algorithm
On some platforms, the performance of the mixed-precision algorithm can be significantly better than that of the full-precision algorithm. The mixed-precision algorithm converts the full-precision matrix to a lower-precision matrix, solves the system with the lower-precision matrix, and then refines the solution iteratively. Typically, only two to three iterations are required, as each iteration reduces the relative error by a factor of around the machine epsilon of the lower precision.
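The iteration count can be motivated by a short sketch of iterative refinement, the usual scheme behind such mixed-precision solvers. It assumes double precision as the full precision, single precision as the lower one, and a reasonably well-conditioned system. The symbols are introduced here for illustration and do not describe the exact implementation.

$$
\begin{aligned}
r_k &= b - A x_k && \text{(residual, full precision)} \\
\tilde{A}\, d_k &= r_k && \text{(correction, lower precision)} \\
x_{k+1} &= x_k + d_k && \text{(update, full precision)}
\end{aligned}
$$

With a single-precision machine epsilon of about $1.2 \times 10^{-7}$, the relative error after $k$ lower-precision solves is roughly $(1.2 \times 10^{-7})^k$: about $1.4 \times 10^{-14}$ for $k = 2$ and about $1.7 \times 10^{-21}$ for $k = 3$. The latter is already below the double-precision machine epsilon of about $2.2 \times 10^{-16}$, which is consistent with two to three iterations being sufficient.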
The built-in tests include benchmarks for the mixed-precision algorithm. One can execute the following command to run the tests.
Bash | |
---|---|
One can find the following information.
The mixed-precision algorithm is around three times faster than the full-precision algorithm. Note that the results are obtained with MKL on a platform with a 13th-generation Intel CPU. For platforms with low memory bandwidth, the performance gain may not be as significant.
One could always benchmark the platform to find the best algorithm.
Tweaks
It is possible to tweak the performance in the following ways, which may or may not improve the performance.
OpenMP Threads
OpenMP is used by MKL and OpenBLAS to parallelize the matrix operations, alongside SIMD instructions. It is possible to manually set OMP_NUM_THREADS to control the number of threads used. Avoid over-subscription, that is, requesting more threads than there are cores available.
OMP_DYNAMIC may affect cache locality and thus the performance. For computation-intensive tasks, it is recommended to set it to false.
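A minimal sketch of these two settings is given below. The thread count of 4 is only an assumed value and should be matched to the number of physical cores on the actual machine. The launch command in the comment is an illustration with a hypothetical input file name.

```bash
# Assumed value for illustration; use the number of physical cores.
export OMP_NUM_THREADS=4
# Keep the thread count fixed instead of letting the runtime adjust it.
export OMP_DYNAMIC=false
# Launch the analysis from the same shell so it inherits the settings, e.g.:
# suanpan -f model.supan   (hypothetical input file)
```

The variables only affect processes started after they are exported, so they need to be set in the shell or script that launches the analysis.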
Affinity
CPU affinity can also affect the performance. Tweaking affinity, for example, with KMP_AFFINITY, can improve performance.
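As a hedged example, the following setting pins threads when the Intel OpenMP runtime is in use. The chosen value is one common configuration rather than a recommendation specific to suanPan, and whether it helps depends on the platform. The commented lines show the standard OpenMP variables that other runtimes understand.

```bash
# Intel OpenMP runtime: bind threads at fine granularity, packed closely.
export KMP_AFFINITY=granularity=fine,compact,1,0
# Portable alternative using standard OpenMP variables:
# export OMP_PLACES=cores
# export OMP_PROC_BIND=close
```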
Memory Allocation
Memory fragmentation may degrade analysis performance, especially for finite element analysis, in which there are a large number of small matrices and vectors. It is recommended to use a performant memory allocator, for example, a general-purpose allocator like mimalloc.
On Linux, it is fairly easy to replace the default memory allocator. For example,
Bash | |
---|---|
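One common way to do this on Linux is to preload the allocator's shared library through LD_PRELOAD. The sketch below assumes a hypothetical install path for mimalloc, so the library should be located on the actual system first. The launch command in the comment uses a hypothetical input file name.

```bash
# Hypothetical install path; adjust to where libmimalloc.so actually lives.
MIMALLOC_LIB=/usr/lib/x86_64-linux-gnu/libmimalloc.so
# Only preload the library if it is present, otherwise keep the default allocator.
if [ -f "$MIMALLOC_LIB" ]; then
    export LD_PRELOAD="$MIMALLOC_LIB"
fi
# Then launch the analysis from the same shell, e.g.:
# suanpan -f model.supan   (hypothetical input file)
```

Exporting LD_PRELOAD affects every program started from that shell, so it may be preferable to set it only for the single command that runs the analysis.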