2.5 sec runtime for pi calculation

by Arjun Lenan Sandhya -
Number of replies: 15

We have been running the pi estimation code since assignment 0 and have always measured a wall-clock runtime of ~2.5 s. However, it was mentioned in the tutorial class that the actual runtime should be nearly half of that for double precision with the frequency locked at 2.4 GHz. We are using the Intel icx compiler and running the job using a job script.
Any help is appreciated. A screenshot of the job script is attached.


In reply to Arjun Lenan Sandhya

Re: 2.5 sec runtime for pi calculation

by Georg Hager -

What is the measured number of cycles per iteration? Did you perchance set N=2e9?

It would be helpful if you had posted your source code. And not as a screenshot, since it's awfully hard to copy code from a screenshot.

In reply to Georg Hager

Re: 2.5 sec runtime for pi calculation

by Arjun Lenan Sandhya -
So sorry, professor. I wanted to include the code as well but couldn't find a "code block" option. I'll share it here. The number of iterations is the same, 1e9.

##########################
#include <stdio.h>
#include <math.h>
#include "timing.h"
double getTimeStamp();

int main(){
    float slices;
    double sum;
    float delta_x;
    double wcTime,wcTimeStart,wcTimeEnd;

    slices = 1000000000.0;
    sum = 0.0;
    delta_x = 1.0f/slices;
    wcTimeStart = getTimeStamp();
    for (int i=0; i<slices; i++){
        double x = (i+0.5f)*delta_x;
        sum = sum + 4*sqrt(1.0f-x*x);
    }
    wcTimeEnd = getTimeStamp();
    wcTime = wcTimeEnd - wcTimeStart;
    double Pi = sum * delta_x;
    printf("Pi is %.18f\n",Pi);
    printf("Walltime: %.3lf s\n",wcTime);
    return 0;
}
#############################
In reply to Arjun Lenan Sandhya

Re: 2.5 sec runtime for pi calculation

by Jannik Hausladen -

I think the problem is that you have the variable slices declared as a float.

Why? You only use it in the loop, where it is compared against an int. Because it is declared as float, a type conversion has to take place in each iteration. This costs a lot of CPU time.

Try running the code with "int slices= 1e9". It should increase the performance to the desired value.

In reply to Jannik Hausladen

Re: 2.5 sec runtime for pi calculation

by Georg Hager -
Yes, that is probably the main problem, but there is more: delta_x is float, and so are the numerical constants 1.0f and 0.5f. It is unclear what the compiler can do at compile time to eliminate the float-to-double conversions, but it is always a good idea to avoid unnecessary conversions if possible.
In reply to Georg Hager

Re: 2.5 sec runtime for pi calculation

by Arjun Lenan Sandhya -
Just putting it here for information, I tried running the code with and without float constants, there wasn't any significant change in runtime.
In reply to Arjun Lenan Sandhya

Re: 2.5 sec runtime for pi calculation

by Razvan Vass -
Setting the frequency may not necessarily double your speed; in fact, it can sometimes lead to poorer results, as shown in the attached screenshot. This occurs because setting the frequency imposes limitations, although it can sometimes push the processor to perform more computations.

Not setting the frequency results in random values within the processor's capable range, making performance comparisons harder due to varying results.

The 2.4GHz value isn't arbitrary; it represents the minimum assured speed achievable by the processor architecture at any given time.

If I said something wrong, please correct me :)

Attachment FrequencyTest.png
In reply to Razvan Vass

Re: 2.5 sec runtime for pi calculation

by Georg Hager -
Don't get confused. This code's performance scales linearly with the clock speed. In your example above, using "srun" without any option obviously allowed the core to run at about 3.45 GHz; your code did not know that, of course, so the cy/it calculation is wrong in your first run. Fixing the clock speed to 2.4 GHz made the code slower, of course.
However, if you don't give srun any option then the CPU will run with the default governor, which may be "powersave" (this should be the case on Fritz, but your experiment suggests otherwise; I have to look into that). This is why we told you in Assignment 0.4 to use the "--cpu-freq=performance" option to force the performance governor. That way, Turbo Mode is assured.
In reply to Georg Hager

Re: 2.5 sec runtime for pi calculation

by Razvan Vass -
Then it is weird, because I obtained different performance numbers.
The number of cycles is not calculated correctly, I agree, because I printed it like this, with the frequency hard-coded:
printf("Cycle / iteration: %lf\n\n", total_time * 2400000000 / SLICES);

But with the same .exe I obtained different results when I didn't set the frequency, sometimes worse. You can find another screenshot with more benchmarks.

Attachment Measurements.png
In reply to Razvan Vass

Re: 2.5 sec runtime for pi calculation

by Georg Hager -
As I mentioned, if you use srun without any further option, the powersave governor is active. The CPU is then in a low-power state when idle, and when you start a program it can take some time until the frequency ramps up. Also it is not known which frequency will be attained eventually. Hence, your observation of fluctuating performance is no surprise.

If you use the "--cpu-freq=performance" option to srun, the performance governor is activated and you should see stable clock speed (although the actual value can still vary across nodes).

Note that you forgot the "=performance" bit in your first run above, hence the error message.

In reply to Georg Hager

Re: 2.5 sec runtime for pi calculation

by Razvan Vass -
Indeed, with --cpu-freq=performance the speed is constant. I just wanted to make sure that it is normal to receive random values (better and worse) in case I don't put anything and indeed, you are right. If I run several times, the speed tends to increase, but sometimes the processor is just lazy.

Thank you.
In reply to Arjun Lenan Sandhya

Re: 2.5 sec runtime for pi calculation

by Erik Fabrizzi -
Take this as a wild guess:
Make sure that you are using int as the datatype for your loop counter. Since a conversion to double is part of the computation, a friend and I observed that the instruction for converting unsigned ints to double can bump the computation time up to that of the scalar version!
In reply to Erik Fabrizzi

Re: 2.5 sec runtime for pi calculation

by Georg Hager -

Using unsigned int instead of int (with all else being equal) didn't make a difference for me. I think that is because the loop length is hard coded so the compiler knows there will be no overflows. 

In general, you are right. This has to do with the fact that unsigned int overflow has a defined behavior in the language standard (it wraps around), while signed int overflow is undefined behavior, so the compiler may assume it never happens.

In reply to Georg Hager

Re: 2.5 sec runtime for pi calculation

by Erik Fabrizzi -

Thanks a lot for this info! It explains a lot of related "mysterious" behaviour that I observed: I declare the function that runs the benchmark in a header file and define it in its own source file; it takes the number of slices as a parameter. I observed that moving the definition next to the main function (where it is only ever called with a hardcoded number of slices) got rid of the issue, probably because the compiler could take into consideration that no overflow will happen. It is very interesting stuff, and it is good to be aware of it! Again, thanks a lot!

In reply to Erik Fabrizzi

Re: 2.5 sec runtime for pi calculation

by Arjun Lenan Sandhya -
Aha! That worked! Thank you Erik! My average runtime is now 1.255 s after changing the datatype of N from float to int. But it is very weird that something like that could affect the runtime so much (almost 2x the normal runtime!), and it is not even part of the main computationally "intensive" loop.