Performance Tuning

This chapter discusses the various factors that impact GYRE’s execution time (i.e., how long it takes to complete a given calculation), and presents strategies for minimizing this time.

Calculation Resolution

GYRE’s execution time depends on the resolution of the spatial and frequency grids it uses. Specifically, for the gyre frontend, the time to process a single &mode namelist group can be approximated by

\[\tau \approx C_{\rm b} \, N \, \Nfreq + C_{\rm s} \, N \, \Nmode,\]

where \(N\) is the number of spatial grid points, \(\Nfreq\) is the number of frequency grid points, \(\Nmode\) is the number of modes found, and \(C_{\rm b}\) and \(C_{\rm s}\) are constants. The first (\(C_{\rm b}\)) term represents the time taken to bracket roots of the discriminant function, and the second (\(C_{\rm s}\)) the time taken to solve for these roots (see the Numerical Methods chapter for details).

For the gyre_tides frontend, the time to process a single &orbit namelist group can likewise be approximated by

\[\tau \approx C_{\rm t} \, N,\]

where \(C_{\rm t}\) is a constant.

These expressions reveal that execution times can be shortened by minimizing \(N\) and \(\Nfreq\) — but this optimization process must account for particular requirements for calculation accuracy and completeness. Townsend et al. (2025) suggest an strategy for adjusting \(N\) to achieve a target accuracy, and the Missing Modes section explains how completeness can be impacted by option choices in the &scan namelist group.

A subtlety here involves the fact that \(N\) is typically not specified directly. When scaffold_src = MODEL (the default), the baseline \(N\) is established by the resolution of the stellar model. However, the iterative refinement algorithm can increase \(N\) by inserting additional spatial grid points. This process implicitly depends on the freq_min and freq_max options that establish the range of the frequency scan; if this range includes parts of the star’s oscillation spectrum containing modes with very large radial orders (whether p modes or g modes), then the refinement algorithm will insert many points in order to resolve the modes’ wavefunctions. This can ultimately lead to huge \(N\) and very long execution times.

Calculation Numerics

Another approach to reducing execution times is to change the numerical algorithms employed by GYRE; in effect, this alters the constants \(C_{\rm b}\), \(C_{\rm s}\) and \(C_{\rm t}\) appearing in the above expressions for \(\tau\).

Discretization Scheme

GYRE offers a number of schemes for discretizing the oscillation equations. The choice of scheme (selected using the diff_scheme option) represents a trade-off between three factors: speed, accuracy and stability. In general, higher-order discretization schemes are slower and potentially less stable, but more accurate.

The performance of a given discretization scheme depends on many factors, including the properties (e.g., CPU architecture, cache size, memory bandwidth) of the hardware platform that GYRE runs on. Therefore, empirical measurement remains the most practical approach to characterizing behavior. Table 2 summarizes execution times for adiabatic test calculations[1] using the various choices for the diff_scheme option. These data are calculated by averaging over 10 separate runs on a single core of an Intel Xeon E5-2690 v4 CPU. Different results will be obtained on other hardware, but the relative performance of each scheme will likely remain similar.

Table 2 Execution times vs. diff_scheme
diff_scheme	order	exec. time (s)
COLLOC_GL2	2	0.84
COLLOC_GL4	4	1.61
COLLOC_GL6	6	2.51
MAGNUS_GL2	2	27.81
MAGNUS_GL4	4	29.41
MAGNUS_GL6	6	31.96
TRAPZ	2	0.98
MIRK	4	1.87

As discussed by Townsend et al. (2025), the accuracy of calculations involving evolutionary and polytropic stellar models is typically limited by the truncation error from interpolation, rather the truncation error due to discretization. In such cases, there is little to be gained from higher-order discretization schemes, and the COLLOC_GL2 scheme will remain the best choice for performance-critical applications.

However, one notable exception to this recommendation concerns non-adiabatic calculations. These are numerically challenging because the oscillation equations deep in the stellar interior can be extremely stiff. In such cases, the MAGNUS_GL2 scheme, although computationally expensive, appears to give superior results.

Matrix Solver

To solve the linear system formed by the discretized oscillation equations, GYRE uses Gaussian elimination strategies that exploit the sparse matrix structure[2]. These solvers (selected using the ad_matrix_solver and nad_matrix_solver options) are numerically equivalent, but lead to different execution times due to their differing approaches to accessing and manipulating matrix elements.

Table 3 summarizes execution times for adiabatic calculations[3] using the various choices for the ad_matrix_solver option, on the same Xeon CPU as before.

Table 3 Execution times vs. ad_matrix_solver choice
ad_matrix_solver	exec. time (s)
BANDED	2.64
CYCLIC	2.05
ROWPP	0.84

The winner is the ROWPP solver, recommending it as the choice for all single-core calculations.

Parallel Execution

Parallel execution in GYRE is achieved using OpenMP multi-threading. For adiabatic calculations, GYRE parallelizes each stage of the mode search, yielding speed-ups that scale with the thread count \(\Nthr\). This is illustrated in Table 4, which reprises Table 3 for different choices of \(\Nthr\)[4]. Note that the speed-up doesn’t scale exactly linearly with \(\Nthr\), due to dynamic frequency scaling of the CPU.

Table 4 Execution times vs. ad_matrix_solver choice for differing thread counts
ad_matrix_solver	threads	exec. time (s)
BANDED	1	2.64
	4	0.75
	8	0.41
CYCLIC	1	2.06
	4	0.60
	8	0.33
ROWPP	1	0.84
	4	0.26
	8	0.16

For non-adiabatic calculations, the root-finding stage cannot be parallelized when deflation is enabled (deflate_roots = .TRUE., the default), . The resulting performance loss can partly be mitigated by using the CYCLIC matrix solver, which is parallelized to yield a speed-up of order \(\Nthr \log \Nthr\). This is illustrated in Table 5, which reprises Table 4 for non-adiabatic calculations. Clearly, as \(\Nthr\) increases, the CYCLIC matrix solver overtakes ROWPP as the fastest.

Table 5 Execution times vs. nad_matrix_solver choice for differing thread counts
nad_matrix_solver	threads	exec. time (s)
BANDED	1	2.51
	4	2.24
	8	2.21
CYCLIC	1	2.17
	4	0.73
	8	0.48
ROWPP	1	1.25
	4	1.19
	8	1.16

To summarize this analysis:

For single-threaded calculations, an optimal choice is ad_matrix_solver = ROWPP and nad_matrix_solver = ROWPP.
For multi-threaded calculations, an optimal choice is ad_matrix_solver = ROWPP and nad_matrix_solver = CYCLIC.

Edge-cases can certainly arise where these heuristics don’t apply; then, as here, empirical measurement can guide the way.

Footnotes