# SIMD function vectorization example (polynomial evaluation)

## Files

The `simd` subdirectory contains the sources for C and Fortran:

* C
  * `main.c`: main program
  * `poly-eval.c`: polynomial evaluation function
* Fortran
  * `main.F90`: main program
  * `mod-poly-eval.F90`: polynomial evaluation function


## Build

### General

C version:

```bash
# Intel compiler
icc -Ofast -qopenmp-simd -xHost -o main.c.exe main.c poly-eval.c

# GCC compiler
gcc -Ofast -fopenmp-simd -march=native -o main.c.exe main.c poly-eval.c

# clang compiler
clang -Ofast -fopenmp-simd -march=native -o main.c.exe main.c poly-eval.c
```

Fortran version:

```bash
# Intel compiler
ifort -Ofast -qopenmp-simd -xHost -o main.F90.exe mod-poly-eval.F90 main.F90

# GCC compiler
gfortran -Ofast -fopenmp-simd -march=native -o main.F90.exe mod-poly-eval.F90 main.F90
```

### Additional flags for using full AVX512 registers

* for icc, icpc, ifort add `-qopt-zmm-usage=high`
* for gcc, gfortran, clang, icx, ifx add `-mprefer-vector-width=512`

### For the Alex cluster

C version:

```bash
# Intel compiler
module load intel
icc -Ofast -mavx2 -mfma -o main.c.exe main.c poly-eval.c

# GCC compiler
module load gcc
gcc -Ofast -fopenmp-simd -march=native -o main.c.exe main.c poly-eval.c

# clang compiler
clang -Ofast -fopenmp-simd -march=native -o main.c.exe main.c poly-eval.c
```

Fortran version:

```bash
# Intel compiler
module load intel
ifort -Ofast -mavx2 -mfma -o main.F90.exe mod-poly-eval.F90 main.F90

# GCC compiler
module load gcc
gfortran -Ofast -fopenmp-simd -march=native -o main.F90.exe mod-poly-eval.F90 main.F90
```


## Run

By default, the code evaluates a polynomial of degree 10 on 1000 x values.
Optionally, the array length and the polynomial degree can be set as
command-line arguments:

```bash
./main.c.exe [<length> [<degree>]]
```

The program prints the performance in polynomial evaluations per second. It
also prints the sum of all elements of the result array so that you can check
your code for correctness.
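
As a rough illustration of these two reported quantities, the following sketch
shows one way to compute them. It is not the code from `main.c` or
`poly-eval.c`: the function names `horner`, `checksum`, and `evals_per_second`
are hypothetical, and Horner's scheme is only assumed as the evaluation method.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: evaluate c[0] + c[1]*x + ... + c[degree]*x^degree
   with Horner's scheme (assumed here, the exercise code may differ). */
double horner(const double *c, int degree, double x)
{
    assert(degree >= 0);
    double y = c[degree];
    for (int k = degree - 1; k >= 0; --k)
        y = y * x + c[k];
    return y;
}

/* Hypothetical helper: sum of the result array, used as a checksum
   to compare a modified version against the original one. */
double checksum(const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += y[i];
    return s;
}

/* Hypothetical helper: polynomial evaluations per second for n points
   that were evaluated `repeats` times within `seconds` of wall time. */
double evals_per_second(size_t n, size_t repeats, double seconds)
{
    return (double)n * (double)repeats / seconds;
}
```

With `-Ofast` the compiler may reorder floating-point operations, so the sum
of a vectorized version can differ from the scalar one in the last digits.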


## Exercises

1. Run the code and note the performance for the default array size and
   polynomial degree. Fix the clock frequency to 2 GHz for this benchmark by
   running it through `srun`:
   ```bash
   srun --cpu-freq=2000000-2000000:performance ./main.c.exe
   ```

2. Modify the function in `poly-eval.c` and the code in `main.c` to enable SIMD
   vectorization of the main loop (marked in the code). Make sure the code
   still computes the correct result (check the sum).

   1. Compare the performance at different SIMD lengths (2, 4, 8, 16) with the
      scalar code. What speedups do you get?
   2. Why is there a speedup from SIMD length 8 to 16 even though the SIMD
      width in the hardware is only 8?
   3. Increase the polynomial degree to 40 and repeat the SIMD scaling
      experiment. What do you observe?
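
For reference, the general OpenMP syntax for SIMD function vectorization can
be sketched as follows. The example deliberately uses a hypothetical function
`scale_add` and a hypothetical caller `apply` rather than the exercise code,
and the clauses shown are one possible choice, not necessarily the best one
for the polynomial function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element-wise function, not the exercise code.
   `declare simd` asks the compiler to also generate a vector variant:
   - simdlen(8):    the variant processes 8 values of x per call
   - uniform(a, b): a and b are the same for all SIMD lanes
   - notinbranch:   the function is never called from conditional code */
#pragma omp declare simd simdlen(8) uniform(a, b) notinbranch
double scale_add(double a, double b, double x)
{
    return a * x + b;
}

/* Hypothetical caller: `omp simd` requests vectorization of the loop,
   which allows the compiler to call the vector variant of scale_add. */
void apply(double a, double b, const double *x, double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    #pragma omp simd simdlen(8)
    for (size_t i = 0; i < n; ++i)
        y[i] = scale_add(a, b, x[i]);
}
```

When the function lives in a separate source file, the `declare simd` directive
also has to appear on the declaration that the calling file sees. Without
`-fopenmp-simd` (or `-qopenmp-simd`) the pragmas are ignored and the code runs
as plain scalar C.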

