Determine Hotspots involves identifying areas in your code that consume a significant amount of computational resources.
First of all, we run the code to understand how it behaves without any optimization:
$ icc -qopenmp -diag-disable=10441 -g omp_homework.c
DFTW calculation with N = 10000
DFTW computation in 14.835124 seconds
Xre[0] = 10000.000000
With the -g
flag, which removes all optimizations, we obtain the result in about 15 seconds.
Upon code analysis and leveraging a profiler to generate a report, we identified a performance hotspot in the DFT function. Specifically, there are two nested loops with $O(N^2)$ complexity.
With -no-vec
and optimization factor -O2
:
$ icc -qopenmp -diag-disable=10441 -no-vec -qopt-report=5 -qopt-report-phase=vec
-O2 -g omp_homework.c
DFTW calculation with N = 10000
DFTW computation in 2.037114 seconds
Xre[0] = 10000.000000
Applying the -O2
compiler flag led to a performance improvement, reducing the execution time to 2 seconds. Although this compiler optimization was successful, it's noteworthy that vectorization was not applied due to the -no-vec
flag.
Removing the -no-vec
flag further decrease the execution time.
$ icc -qopenmp -diag-disable=10441 -qopt-report=5 -qopt-report-phase=vec omp_homework.c
DFTW calculation with N = 10000
DFTW computation in 1.018836 seconds
Xre[0] = 10000.000000
With the previous command we obtain the report to understand where the vectorization impacts and how can we organize better the code:
Begin optimization report for: main(int, char **)
...
LOOP BEGIN at omp_homework.c(71,3) inlined into omp_homework.c(43,5)
remark #15389: vectorization support: reference Xi_o[k] has unaligned access
[ omp_homework.c(77,11) ]
remark #15389: vectorization support: reference Xi_o[k] has unaligned access
[ omp_homework.c(77,11) ]
remark #15388: vectorization support: reference Xr_o[k] has aligned access
[ omp_homework.c(75,11) ]
remark #15388: vectorization support: reference Xr_o[k] has aligned access
[ omp_homework.c(75,11) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 0.123
remark #15301: PERMUTED LOOP WAS VECTORIZED
remark #15442: entire loop may be executed in remainder
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 563
remark #15477: vector cost: 130.000
remark #15478: estimated potential speedup: 4.320
remark #15482: vectorized math library calls: 2
remark #15486: divides: 2
remark #15487: type converts: 2
remark #15488: --- end vector cost summary ---
LOOP END
...
===========================================================================
Begin optimization report for: DFT(int, double *, double *, double *, double *, int)
LOOP BEGIN at omp_homework.c(71,3)
remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
LOOP BEGIN at omp_homework.c(73,7)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence:
assumed OUTPUT dependence between Xr_o[k] (75:11) and Xi_o[k] (77:11)
remark #15346: vector dependence:
assumed OUTPUT dependence between Xi_o[k] (77:11) and Xr_o[k] (75:11)
LOOP END
LOOP END
The report identifies a problem in the loop starting at line 71, incorporating a nested loop at line 73, wherein a vector dependence appears to prevent vectorization, specifically involving vectors Xr_o and Xi_o.
This happens because we are inside the function scope where the compiler lacks precise information about the pointers' targets and references (because they are passed as parameters), creating a potential problem with aliasing.
Instead, in the main function, the compiler accurately identifies the declaration and initialization of those vectors enabling the compiler to correctly perform Loop Permutation.
Despite potential unaligned memory accesses, the code execution can still achieve a significant efficiency advantage.