Dear PTFS team, I am attempting Task 3 of Assignment 5:
Compiling with -O1 -no-vec actually increases the time to solution as the number of threads goes up.
On the other hand, -O3 -no-vec gives the expected behaviour. This is curious because the main loop assembly is almost identical (OpenMP prevents vectorization and unrolling in both cases). I am using icx.
Can somebody explain what is happening?
Additional Info:
Run: srun --cpu-freq=2400000-2400000 likwid-pin -c S0:0-x build/executable (x = 0, 1, 2, ...)
Forcing pinning:
size_t thread_numbers = 0;
#pragma omp parallel reduction(+: thread_numbers) // Dummy to force pinning
{
    thread_numbers = 1;
}
Kernel:
#pragma omp parallel for reduction(+:sum)
for (size_t i = 0; i < intr_n; i++)
{
x = (i+0.5)*delta;
sum += 4.0 * sqrt(1.0 - x * x);
}
-O3 output on 2 threads (on 1 thread 6 cy/it):
@ Assuming default frequency: 2.400000 Ghz
-> Threads : 2
-> Time : 1.255974688974675
-> Pi : 3.141592653589679
-> Cycles Per Iteration : 3.014339
-O1 output on 2 threads (on 1 thread 6 cy/it):
@ Assuming default frequency: 2.400000 Ghz
-> Threads : 2
-> Time : 16.138914049995947
-> Pi : 3.141592653589679
-> Cycles Per Iteration : 38.733394
-O3 assembly:
vcvtusi2sd xmm5, xmm7, rax
vaddsd xmm5, xmm5, xmm2
vmulsd xmm5, xmm1, xmm5
vmovapd xmm6, xmm5
vfnmadd213sd xmm6, xmm5, xmm3 # xmm6 = -(xmm5 * xmm6) + xmm3
vsqrtsd xmm6, xmm6, xmm6
vfmadd231sd xmm0, xmm6, xmm4 # xmm0 = (xmm6 * xmm4) + xmm0
inc rax
cmp rax, r14
jbe .LBB3_10
-O1 assembly:
vcvtusi2sd xmm4, xmm5, rax
vaddsd xmm4, xmm4, xmm1
vmulsd xmm4, xmm4, qword ptr [rcx]
vmovsd qword ptr [rdx], xmm4
vfnmadd213sd xmm4, xmm4, xmm2 # xmm4 = -(xmm4 * xmm4) + xmm2
vsqrtsd xmm4, xmm4, xmm4
vfmadd231sd xmm0, xmm4, xmm3 # xmm0 = (xmm4 * xmm3) + xmm0
inc rax
cmp r14, rax
jne .LBB3_3