Performance Tuning
This chapter discusses the various factors that impact GYRE’s execution time (i.e., how long it takes to complete a given calculation), and presents strategies for minimizing this time.
Calculation Resolution
GYRE’s execution time depends on the resolution of the spatial and frequency grids it
uses. Specifically, for the gyre frontend, the time to
process a single &mode namelist group can be approximated
by
where \(N\) is the number of spatial grid points, \(\Nfreq\) is the number of frequency grid points, \(\Nmode\) is the number of modes found, and \(C_{\rm b}\) and \(C_{\rm s}\) are constants. The first (\(C_{\rm b}\)) term represents the time taken to bracket roots of the discriminant function, and the second (\(C_{\rm s}\)) the time taken to solve for these roots (see the Numerical Methods chapter for details).
For the gyre_tides frontend, the time to process a single
&orbit namelist group can likewise be approximated by
where \(C_{\rm t}\) is a constant.
These expressions reveal that execution times can be shortened by
minimizing \(N\) and \(\Nfreq\) — but this optimization
process must account for particular requirements for calculation
accuracy and completeness. Townsend et al. (2025) suggest an
strategy for adjusting \(N\) to achieve a target accuracy, and the
Missing Modes section explains how completeness can be
impacted by option choices in the &scan namelist group.
A subtlety here involves the fact that \(N\) is typically not
specified directly. When scaffold_src = MODEL (the default), the
baseline \(N\) is established by the resolution of the stellar
model. However, the iterative refinement algorithm can increase \(N\) by inserting additional
spatial grid points. This process implicitly depends on the
freq_min and freq_max options that
establish the range of the frequency scan; if this range includes
parts of the star’s oscillation spectrum containing modes with very
large radial orders (whether p modes or g modes), then the refinement
algorithm will insert many points in order to resolve the modes’
wavefunctions. This can ultimately lead to huge \(N\) and very
long execution times.
Calculation Numerics
Another approach to reducing execution times is to change the numerical algorithms employed by GYRE; in effect, this alters the constants \(C_{\rm b}\), \(C_{\rm s}\) and \(C_{\rm t}\) appearing in the above expressions for \(\tau\).
Discretization Scheme
GYRE offers a number of schemes for discretizing the oscillation equations. The choice of scheme
(selected using the diff_scheme option) represents a
trade-off between three factors: speed, accuracy and stability. In
general, higher-order discretization schemes are slower and
potentially less stable, but more accurate.
The performance of a given discretization scheme depends on many
factors, including the properties (e.g., CPU architecture, cache size,
memory bandwidth) of the hardware platform that GYRE runs
on. Therefore, empirical measurement remains the most practical
approach to characterizing behavior. Table 2
summarizes execution times for adiabatic test calculations[1] using the various choices for the
diff_scheme option. These data are calculated by
averaging over 10 separate runs on a single core of an Intel Xeon
E5-2690 v4 CPU. Different
results will be obtained on other hardware, but the relative
performance of each scheme will likely remain similar.
diff_scheme |
order |
runtime |
COLLOC_GL2 |
2 |
0.99 |
COLLOC_GL4 |
4 |
6.51 |
COLLOC_GL6 |
6 |
13.69 |
MAGNUS_GL2 |
2 |
30.65 |
MAGNUS_GL4 |
4 |
32.66 |
MAGNUS_GL6 |
6 |
35.93 |
TRAPZ |
2 |
3.03 |
MIRK |
4 |
2.76 |
As discussed by Townsend et al. (2025), the accuracy of
calculations involving evolutionary and
polytropic stellar models is typically limited by
the truncation error from interpolation, rather the truncation error
due to discretization. In such cases, there is little to be gained
from higher-order discretization schemes, and the
COLLOC_GL2 scheme will remain the best choice for
performance-critical applications.
However, one notable exception to this recommendation concerns
non-adiabatic calculations. These are numerically challenging because
the oscillation equations deep in the stellar interior can be
extremely stiff. In such cases, the
MAGNUS_GL2 scheme, although computationally expensive,
appears to give superior results.
Matrix Solver
To solve the linear system formed by
the discretized oscillation equations, GYRE uses Gaussian
elimination strategies that exploit the sparse matrix structure[2]. These solvers (selected using the
ad_matrix_solver and nad_matrix_solver
options) are numerically equivalent, but lead to different execution
times due to their differing approaches to accessing and manipulating
matrix elements.
Table 3 summarizes execution times for
adiabatic calculations[3] using the various
choices for the ad_matrix_solver option, on the same
Xeon CPU as before.
ad_matrix_solver |
exec. time (s) |
BANDED |
3.19 |
CYCLIC |
2.39 |
ROWPP |
0.99 |
The winner is the ROWPP solver, recommending it as the
choice for all single-core calculations.
Parallel Execution
Parallel execution in GYRE is achieved using OpenMP multi-threading. For adiabatic calculations, GYRE parallelizes each stage of the mode search, yielding speed-ups that scale with the thread count \(\Nthr\). This is illustrated in Table 4, which reprises Table 3 for different choices of \(\Nthr\)[4]. Note that the speed-up doesn’t scale exactly linearly with \(\Nthr\), due to dynamic frequency scaling of the CPU.
ad_matrix_solver |
threads |
exec. time (s) |
BANDED |
1 |
3.20 |
4 |
0.90 |
|
8 |
0.48 |
|
CYCLIC |
1 |
2.38 |
4 |
0.69 |
|
8 |
0.37 |
|
ROWPP |
1 |
0.99 |
4 |
0.31 |
|
8 |
0.18 |
For non-adiabatic calculations, the root-finding stage cannot be
parallelized when deflation is enabled (deflate_roots =
.TRUE., the default), . The resulting performance loss
can partly be mitigated by using the CYCLIC matrix
solver, which is parallelized to yield a speed-up of order
\(\Nthr \log \Nthr\). This is illustrated in
Table 5, which reprises
Table 4 for non-adiabatic
calculations. Clearly, as \(\Nthr\) increases, the
CYCLIC matrix solver overtakes ROWPP as the
fastest.
nad_matrix_solver |
threads |
exec. time (s) |
BANDED |
1 |
3.02 |
4 |
2.60 |
|
8 |
2.49 |
|
CYCLIC |
1 |
2.47 |
4 |
0.84 |
|
8 |
0.54 |
|
ROWPP |
1 |
1.39 |
4 |
1.32 |
|
8 |
1.30 |
To summarize this analysis:
For single-threaded calculations, an optimal choice is
ad_matrix_solver=ROWPPandnad_matrix_solver=ROWPP.For multi-threaded calculations, an optimal choice is
ad_matrix_solver=ROWPPandnad_matrix_solver=CYCLIC.
Edge-cases can certainly arise where these heuristics don’t apply; then, as here, empirical measurement can guide the way.
Footnotes