Compiling Sqrt benchmark with -O1

by Erik Fabrizzi -
Number of replies: 3


Dear PTFS team, I am attempting Task 3 of Assignment 5:

Compiling with -O1 -no-vec actually increases the time to solution as the number of threads grows.
On the other hand, -O3 -no-vec gives the expected behaviour. This is curious because the main loop assembly is almost identical (OpenMP prevents both from using vectorization and unrolling). I am using icx!
Can somebody explain what is happening?

Additional Info: 
Run:  srun --cpu-freq=2400000-2400000 likwid-pin -c S0:0-x build/executable (x=0,1,2....)
Forcing pinning:
    size_t thread_numbers = 0;
    #pragma omp parallel reduction(+: thread_numbers) // Dummy to force pinning
    {
        thread_numbers=1;
    }
Kernel:

            #pragma omp parallel for reduction(+:sum)
            for (size_t i = 0; i < intr_n; i++)
            {
                x = (i+0.5)*delta;
                sum += 4.0 * sqrt(1.0 - x * x);
            }
-O3 output on 2 threads (on 1 thread 6 cy/it):
@ Assuming default frequency: 2.400000 Ghz
 -> Threads              : 2 
 -> Time                 : 1.255974688974675 
 -> Pi                   : 3.141592653589679
 -> Cycles Per Iteration : 3.014339

-O1 output on 2 threads (on 1 thread 6 cy/it):
@ Assuming default frequency: 2.400000 Ghz
 -> Threads              : 2 
 -> Time                 : 16.138914049995947 
 -> Pi                   : 3.141592653589679
 -> Cycles Per Iteration : 38.733394
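The cycles-per-iteration figures above are consistent with runtime times clock frequency divided by the iteration count. A minimal sketch, assuming 1e9 iterations (a value inferred from the printed numbers, not stated in the post) and a hypothetical helper name cycles_per_iteration:

```c
/* Hypothetical helper: cycles per iteration from wall-clock time.
   seconds    : measured runtime in seconds
   freq_hz    : fixed core clock in Hz (2.4e9 in the runs above)
   iterations : loop trip count (1e9 is an assumption, inferred from the output) */
double cycles_per_iteration(double seconds, double freq_hz, double iterations)
{
    return seconds * freq_hz / iterations;
}
```

Under these assumptions, the -O3 time of 1.255974688974675 s gives about 3.01 cy/it and the -O1 time of 16.138914049995947 s gives about 38.73 cy/it, matching the printed values.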
-O3 assembly:
vcvtusi2sd xmm5, xmm7, rax
vaddsd xmm5, xmm5, xmm2
vmulsd xmm5, xmm1, xmm5
vmovapd xmm6, xmm5
vfnmadd213sd xmm6, xmm5, xmm3        # xmm6 = -(xmm5 * xmm6) + xmm3
vsqrtsd xmm6, xmm6, xmm6
vfmadd231sd xmm0, xmm6, xmm4        # xmm0 = (xmm6 * xmm4) + xmm0
inc rax
cmp rax, r14
jbe .LBB3_10
-O1 assembly:
vcvtusi2sd xmm4, xmm5, rax
vaddsd xmm4, xmm4, xmm1
vmulsd xmm4, xmm4, qword ptr [rcx]
vmovsd qword ptr [rdx], xmm4
vfnmadd213sd xmm4, xmm4, xmm2        # xmm4 = -(xmm4 * xmm4) + xmm2
vsqrtsd xmm4, xmm4, xmm4
vfmadd231sd xmm0, xmm4, xmm3        # xmm0 = (xmm4 * xmm3) + xmm0
inc rax
cmp r14, rax
jne .LBB3_3



In reply to Erik Fabrizzi

Re: Compiling Sqrt benchmark with -O1

by Erik Fabrizzi -
Adding private(x) seems to have solved the issue:
#pragma omp parallel for private(x) reduction(+:sum)
for (size_t i = 0; i < intr_n; i++)
{
    x = (i+0.5)*delta;
    sum += 4.0 * sqrt(1.0 - x * x);
}
Still, some more info would be appreciated, thanks!
In reply to Erik Fabrizzi

Re: Compiling Sqrt benchmark with -O1

by Georg Hager -

Hi,

yes, the missing private(x) was the culprit. As to why this was the case, we won't tell but instead make it a homework exercise of its own (maybe the week after this one because this week we are all about GPUs).

Best,

Georg.