.. _perfomance:

.. nml:group:: num
   :no-target:

******************
Performance Tuning
******************

This chapter discusses the various factors that impact GYRE's
execution time (i.e., how long it takes to complete a given
calculation), and presents strategies for minimizing this time.

Calculation Resolution
======================

GYRE's execution time depends on the resolution of the
:ref:`spatial ` and :ref:`frequency ` grids it uses. Specifically,
for the :program:`gyre` frontend, the time to process a single
:nml:group:`mode` namelist group can be approximated by

.. math::

   \tau \approx C_{\rm b} \, N \, \Nfreq + C_{\rm s} \, N \, \Nmode,

where :math:`N` is the number of spatial grid points, :math:`\Nfreq`
is the number of frequency grid points, :math:`\Nmode` is the number
of modes found, and :math:`C_{\rm b}` and :math:`C_{\rm s}` are
constants. The first (:math:`C_{\rm b}`) term represents the time
taken to bracket roots of the discriminant function, and the second
(:math:`C_{\rm s}`) term the time taken to solve for these roots (see
the :ref:`numerical` chapter for details). For the
:program:`gyre_tides` frontend, the time to process a single
:nml:group:`orbit` namelist group can likewise be approximated by

.. math::

   \tau \approx C_{\rm t} \, N,

where :math:`C_{\rm t}` is a constant.

These expressions reveal that execution times can be shortened by
minimizing :math:`N` and :math:`\Nfreq`. However, this optimization
must respect the accuracy and completeness requirements of the
calculation. :ads_citet:`townsend:2025` suggest a strategy for
adjusting :math:`N` to achieve a target accuracy, and the
:ref:`troubleshoot-miss` section explains how completeness can be
impacted by option choices in the :nml:group:`scan` namelist group.

.. nml:group:: scan
   :no-target:

A subtlety here is that :math:`N` is typically not specified directly.
When :nml:option:`scaffold_src ` = :nml:value:`MODEL` (the default),
the baseline :math:`N` is established by the resolution of the stellar
model. However, the :ref:`iterative refinement algorithm ` can
increase :math:`N` by inserting additional spatial grid points. This
process implicitly depends on the :nml:option:`freq_min` and
:nml:option:`freq_max` options that establish the range of the
frequency scan. If this range includes parts of the star's oscillation
spectrum containing modes with very large radial orders (whether p
modes or g modes), then the refinement algorithm will insert many
points in order to resolve the modes' wavefunctions. This can
ultimately lead to a very large :math:`N` and correspondingly long
execution times.

Calculation Numerics
====================

Another approach to reducing execution times is to change the
numerical algorithms employed by GYRE; in effect, this alters the
constants :math:`C_{\rm b}`, :math:`C_{\rm s}` and :math:`C_{\rm t}`
appearing in the above expressions for :math:`\tau`.

Discretization Scheme
---------------------

.. nml:group:: num
   :no-target:

GYRE offers a number of schemes for :ref:`discretizing ` the
oscillation equations. The choice of scheme (selected using the
:nml:option:`diff_scheme` option) represents a trade-off between three
factors: speed, accuracy and stability. In general, higher-order
discretization schemes are slower and potentially less stable, but
more accurate.

The performance of a given discretization scheme depends on many
factors, including the properties (e.g., CPU architecture, cache size,
memory bandwidth) of the hardware platform that GYRE runs on.
Therefore, empirical measurement remains the most practical approach
to characterizing behavior. :numref:`diff_scheme-table` summarizes
execution times for adiabatic test calculations\ [#test-diff_scheme]_
using the various choices for the :nml:option:`diff_scheme` option.
These data are averages over 10 separate runs on a single core of an
`Intel Xeon E5-2690 v4 CPU `__. Different results will be obtained on
other hardware, but the *relative* performance of each scheme will
likely remain similar.

.. _diff_scheme-table:

.. csv-table:: Execution times vs. diff_scheme
   :file: performance-tuning/stress.diff_scheme.csv
   :widths: auto
   :align: center

As discussed by :ads_citet:`townsend:2025`, the accuracy of
calculations involving :ref:`evolutionary ` and :ref:`polytropic `
stellar models is typically limited by the truncation error from
interpolation, rather than the truncation error due to discretization.
In such cases, there is little to be gained from higher-order
discretization schemes, and the :nml:value:`COLLOC_GL2` scheme remains
the best choice for performance-critical applications.

However, one notable exception to this recommendation concerns
non-adiabatic calculations. These are numerically challenging because
the oscillation equations deep in the stellar interior can be
extremely :wiki:`stiff `. In such cases, the :nml:value:`MAGNUS_GL2`
scheme, although computationally expensive, appears to give superior
results.

Matrix Solver
-------------

To solve the :ref:`linear system ` formed by the discretized
oscillation equations, GYRE uses :wiki:`Gaussian elimination`
strategies that exploit the sparse matrix structure\ [#sparse-scaling]_.
These solvers (selected using the :nml:option:`ad_matrix_solver` and
:nml:option:`nad_matrix_solver` options) are numerically equivalent,
but lead to different execution times due to their differing
approaches to accessing and manipulating matrix elements.

:numref:`ad_matrix_solver-table` summarizes execution times for
adiabatic calculations\ [#test-matrix_solver]_ using the various
choices for the :nml:option:`ad_matrix_solver` option, on the same
Xeon CPU as before.

.. _ad_matrix_solver-table:

.. csv-table:: Execution times vs. ad_matrix_solver choice
   :file: performance-tuning/stress.ad_matrix_solver.csv
   :widths: auto
   :align: center

The :nml:value:`ROWPP` solver is the fastest, making it the
recommended choice for all single-core calculations.

.. _performance-parallel:

Parallel Execution
==================

Parallel execution in GYRE is achieved using :wiki:`OpenMP`
multi-threading. For adiabatic calculations, GYRE parallelizes each
stage of the :ref:`mode search `, yielding speed-ups that scale with
the thread count :math:`\Nthr`. This is illustrated in
:numref:`ad_matrix_solver-multi-table`, which reprises
:numref:`ad_matrix_solver-table` for different choices of
:math:`\Nthr`\ [#nthread]_. Note that the speed-up does not scale
exactly linearly with :math:`\Nthr`, due to
:wiki:`dynamic frequency scaling` of the CPU.

.. _ad_matrix_solver-multi-table:

.. csv-table:: Execution times vs. ad_matrix_solver choice for differing thread counts
   :file: performance-tuning/stress.ad_matrix_solver.multi.csv
   :widths: auto
   :align: center

For non-adiabatic calculations, the root-finding stage cannot be
parallelized when deflation is enabled (:nml:option:`deflate_roots` =
:nml:value:`.TRUE.`, the default). The resulting performance loss can
partly be mitigated by using the :nml:value:`CYCLIC` matrix solver,
which is parallelized to yield a speed-up of order
:math:`\Nthr \log \Nthr`. This is illustrated in
:numref:`nad_matrix_solver-multi-table`, which reprises
:numref:`ad_matrix_solver-multi-table` for non-adiabatic calculations.
Clearly, as :math:`\Nthr` increases, the :nml:value:`CYCLIC` matrix
solver overtakes :nml:value:`ROWPP` as the fastest.

.. _nad_matrix_solver-multi-table:

.. csv-table:: Execution times vs. nad_matrix_solver choice for differing thread counts
   :file: performance-tuning/stress.nad_matrix_solver.multi.csv
   :widths: auto
   :align: center

To summarize this analysis:

* For single-threaded calculations, an optimal choice is
  :nml:option:`ad_matrix_solver` = :nml:value:`ROWPP` and
  :nml:option:`nad_matrix_solver` = :nml:value:`ROWPP`.

* For multi-threaded calculations, an optimal choice is
  :nml:option:`ad_matrix_solver` = :nml:value:`ROWPP` and
  :nml:option:`nad_matrix_solver` = :nml:value:`CYCLIC`.

Edge cases can certainly arise where these heuristics do not apply;
then, as here, empirical measurement can guide the way.

.. rubric:: Footnotes

.. [#test-diff_scheme] See the
   :file:`$GYRE_DIR/test/stress/diff_scheme/run_stress_tests.sh`
   script for details on how the test calculations are run.

.. [#sparse-scaling] This is why the expressions for :math:`\tau`
   scale linearly with :math:`N`, rather than exhibiting the
   :math:`N^{3}` scaling that usually applies to Gaussian elimination.

.. [#test-matrix_solver] See the
   :file:`$GYRE_DIR/test/stress/matrix_solver/run_stress_tests.sh`
   script for details on how the test calculations are run.

.. [#nthread] As set by the :envvar:`OMP_NUM_THREADS` environment
   variable.
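
.. rubric:: Example

As a concrete illustration of the multi-threaded recommendations
summarized in this chapter, the following sketch shows how the
corresponding options might appear in the :nml:group:`num` namelist
group of an input file. The option names and values are those
discussed in this chapter; combining them in a single group, and
pairing them with :nml:value:`COLLOC_GL2`, is an assumption made for
illustration only. For stiff non-adiabatic problems,
:nml:value:`MAGNUS_GL2` may be the better discretization choice.

.. code-block:: none

   &num
     diff_scheme = 'COLLOC_GL2'     ! performance-oriented discretization scheme
     ad_matrix_solver = 'ROWPP'     ! fastest adiabatic solver in the tests reported here
     nad_matrix_solver = 'CYCLIC'   ! parallelized solver for non-adiabatic calculations
   /

The thread count is then set through the :envvar:`OMP_NUM_THREADS`
environment variable before launching GYRE (for instance,
``export OMP_NUM_THREADS=4`` in a Bourne-type shell, where the value 4
is arbitrary). Timing a short test run with and without these settings
is the most reliable way to confirm that they help on a given
platform.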