HPC Final Assignment 1

Hotspot analysis

Determine Hotspots involves identifying areas in your code that consume a significant amount of computational resources.

First of all, we run the code to understand how it behaves without any optimization:

$ icc -qopenmp -diag-disable=10441 -g omp_homework.c

DFTW calculation with N = 10000 
DFTW computation in 14.835124 seconds
Xre[0] = 10000.000000

With the -g flag, which removes all optimizations, we obtain the result in about 15 seconds.

Upon code analysis and leveraging a profiler to generate a report, we identified a performance hotspot in the DFT function. Specifically, there are two nested loops with $O(N^2)$ complexity.

With -no-vec and optimization factor -O2:

$ icc -qopenmp -diag-disable=10441 -no-vec -qopt-report=5 -qopt-report-phase=vec 
	-O2 -g omp_homework.c

DFTW calculation with N = 10000 
DFTW computation in 2.037114 seconds
Xre[0] = 10000.000000

Applying the -O2 compiler flag led to a performance improvement, reducing the execution time to 2 seconds. Although this compiler optimization was successful, it's noteworthy that vectorization was not applied due to the -no-vec flag.

Vectorization

Removing the -no-vec flag further decrease the execution time.

$ icc -qopenmp -diag-disable=10441 -qopt-report=5 -qopt-report-phase=vec omp_homework.c

DFTW calculation with N = 10000 
DFTW computation in 1.018836 seconds
Xre[0] = 10000.000000

Vectorization issues

With the previous command we obtain the report to understand where the vectorization impacts and how can we organize better the code:

Begin optimization report for: main(int, char **)
...
LOOP BEGIN at omp_homework.c(71,3) inlined into omp_homework.c(43,5)
      remark #15389: vectorization support: reference Xi_o[k] has unaligned access   
			[ omp_homework.c(77,11) ]
      remark #15389: vectorization support: reference Xi_o[k] has unaligned access   
			[ omp_homework.c(77,11) ]
      remark #15388: vectorization support: reference Xr_o[k] has aligned access   
			[ omp_homework.c(75,11) ]
      remark #15388: vectorization support: reference Xr_o[k] has aligned access   
			[ omp_homework.c(75,11) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15309: vectorization support: normalized vectorization overhead 0.123
      remark #15301: PERMUTED LOOP WAS VECTORIZED
      remark #15442: entire loop may be executed in remainder
      remark #15448: unmasked aligned unit stride loads: 1 
      remark #15449: unmasked aligned unit stride stores: 1 
      remark #15450: unmasked unaligned unit stride loads: 1 
      remark #15451: unmasked unaligned unit stride stores: 1 
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 563 
      remark #15477: vector cost: 130.000 
      remark #15478: estimated potential speedup: 4.320 
      remark #15482: vectorized math library calls: 2 
      remark #15486: divides: 2 
      remark #15487: type converts: 2 
      remark #15488: --- end vector cost summary ---
   LOOP END

...

===========================================================================

Begin optimization report for: DFT(int, double *, double *, double *, double *, int)
LOOP BEGIN at omp_homework.c(71,3)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at omp_homework.c(73,7)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: 
					assumed OUTPUT dependence between Xr_o[k] (75:11) and Xi_o[k] (77:11)
      remark #15346: vector dependence: 
					assumed OUTPUT dependence between Xi_o[k] (77:11) and Xr_o[k] (75:11)
   LOOP END
LOOP END

The report identifies a problem in the loop starting at line 71, incorporating a nested loop at line 73, wherein a vector dependence appears to prevent vectorization, specifically involving vectors Xr_o and Xi_o.

This happens because we are inside the function scope where the compiler lacks precise information about the pointers' targets and references (because they are passed as parameters), creating a potential problem with aliasing.

Instead, in the main function, the compiler accurately identifies the declaration and initialization of those vectors enabling the compiler to correctly perform Loop Permutation.

Despite potential unaligned memory accesses, the code execution can still achieve a significant efficiency advantage.