We have been running the Pi estimation code since Assignment 0 and we always get a wall-clock runtime of ~2.5 s. However, it was mentioned in the tutorial class that the actual runtime should be nearly half of that for double precision with the frequency locked at 2.4 GHz. We are using the Intel icx compiler and running the job via a job script.
Any help is appreciated. A screenshot of the job script is attached.
What is the measured number of cycles per iteration? Did you perchance set N=2e9?
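(For reference: at a fixed 2.4 GHz and with N = 10^9 iterations, 2.5 s works out to roughly 2.5 s × 2.4 GHz / 10^9 ≈ 6 cycles per iteration, while the expected ~1.2 s would be about 3 cycles per iteration.)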
It would help if you posted your source code, and not as a screenshot, since it's awfully hard to copy code from a screenshot.
##########################
#include <stdio.h>
#include <math.h>
#include "timing.h"

double getTimeStamp();

int main(){
    float slices;
    double sum;
    float delta_x;
    double wcTime, wcTimeStart, wcTimeEnd;

    slices = 1000000000.0;
    sum = 0.0;
    delta_x = 1.0f/slices;

    wcTimeStart = getTimeStamp();
    for (int i=0; i<slices; i++){
        double x = (i+0.5f)*delta_x;
        sum = sum + 4*sqrt(1.0f-x*x);
    }
    wcTimeEnd = getTimeStamp();
    wcTime = wcTimeEnd - wcTimeStart;

    double Pi = sum * delta_x;
    printf("Pi is %.18f\n", Pi);
    printf("Walltime: %.3lf s\n", wcTime);
    return 0;
}
I think the problem is that you have the variable slices declared as a float.
Why? You only use it in the loop, where it is compared against the integer loop counter. Because slices is declared as float, a type conversion has to happen in every iteration, and that costs a lot of CPU time.
Try running the code with "int slices = 1e9;" instead. That should bring the performance up to the expected value.
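For what it's worth, here is a minimal sketch of the loop with an integer slices variable and the rest of the arithmetic kept in double. This is just an illustration, not the official solution; I am assuming getTimeStamp() comes from the course's timing.h as in your post:

#include <stdio.h>
#include <math.h>
#include "timing.h"   /* assumed to provide getTimeStamp() */

int main(){
    int slices = 1000000000;        /* integer loop bound: no per-iteration conversion of the limit */
    double delta_x = 1.0 / slices;  /* keep the arithmetic in double */
    double sum = 0.0;

    double wcTimeStart = getTimeStamp();
    for (int i = 0; i < slices; i++){
        double x = (i + 0.5) * delta_x;
        sum += 4.0 * sqrt(1.0 - x * x);
    }
    double wcTime = getTimeStamp() - wcTimeStart;

    printf("Pi is %.18f\n", sum * delta_x);
    printf("Walltime: %.3lf s\n", wcTime);
    return 0;
}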
If you don't set the frequency, the clock can end up anywhere within the range the processor is capable of, which makes performance comparisons harder because the results keep varying.
The 2.4 GHz value isn't arbitrary; it is the minimum clock speed the processor guarantees at any given time, i.e. the base frequency of this architecture.
If I said something wrong here, please correct me.
However, if you don't give srun any option then the CPU will run with the default governor, which may be "powersave" (this should be the case on Fritz, but your experiment suggests otherwise, so I have to look into that). This is why we told you in Assignment 0.4 to use the "--cpu-freq=performance" option to force the performance governor. That way, Turbo Mode is assured.
If you use the "--cpu-freq=performance" option to srun, the performance governor is activated and you should see stable clock speed (although the actual value can still vary across nodes).
Note that you forgot the "=performance" bit in your first run above, hence the error message.
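For completeness, a typical invocation inside the job script would look something like this (the executable name ./pi is just a placeholder for whatever your script actually runs):

srun --cpu-freq=performance ./pi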
Thank you.
Make sure that you are using int as the data type for your loop counter. Since a conversion to double is part of the computation, a friend and I observed that the instruction for casting unsigned int to double can push the computation time up to that of the scalar version!
Using unsigned int instead of int (with all else being equal) didn't make a difference for me. I think that is because the loop length is hard coded so the compiler knows there will be no overflows.
In general, you are right. This has to do with the fact that unsigned int overflow has defined behavior in the language standard (it wraps around), while signed int overflow is undefined behavior, so the compiler is allowed to assume it never happens.
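To make that concrete, here is a small illustration (not the assignment code, just a sketch of the two loop variants):

/* Signed counter: signed overflow is undefined behavior, so the compiler
   may assume i never wraps when it widens it to double, which keeps the
   int-to-double conversion easy to vectorize. */
double sum_signed(int slices, double delta_x){
    double sum = 0.0;
    for (int i = 0; i < slices; i++)
        sum += (i + 0.5) * delta_x;
    return sum;
}

/* Unsigned counter: wraparound is defined (modulo 2^32), so the compiler
   has to preserve that semantics, which can get in the way of the same
   optimization on some compilers. */
double sum_unsigned(unsigned int slices, double delta_x){
    double sum = 0.0;
    for (unsigned int i = 0; i < slices; i++)
        sum += (i + 0.5) * delta_x;
    return sum;
}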
Thanks a lot for this info! It explains some related "mysterious" behaviour that I observed: I declare the function that runs the benchmark in a header file and define it in its own source file, and it takes a parameter for the number of slices. Moving the definition next to the main function (where it is only ever called with a hard-coded number of slices) got rid of the issue, probably because the compiler could then take into account that no overflow will happen. Very interesting stuff, and good to be aware of! Again, thanks a lot!
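In case it helps anyone else, the setup was roughly like this (file and function names are made up for illustration):

/* bench.h */
double run_bench(unsigned int slices);

/* bench.c: a separate translation unit, so without inlining the compiler
   sees an arbitrary runtime value for slices and has to honor unsigned
   wraparound in the loop. */
#include <math.h>
#include "bench.h"
double run_bench(unsigned int slices){
    double sum = 0.0, delta_x = 1.0 / slices;
    for (unsigned int i = 0; i < slices; i++){
        double x = (i + 0.5) * delta_x;
        sum += 4.0 * sqrt(1.0 - x * x);
    }
    return sum * delta_x;
}

/* main.c calls run_bench(1000000000u). Once the definition is moved next
   to main(), the compiler can inline it, see the constant trip count, and
   (presumably) prove that the counter never wraps. */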