PTfS26: Question on Sheet 1, Ex 2 | NHR Learning Platform

Hello,

I'm currently working on ex. 2) of this week's assignment sheet. I have some questions about the microarchitectural assumptions, which I'm not sure the assignment statement gives any hint about.

As far as I understood, any iteration is composed of the following procedure:

Under the assumption that the 16 registers suffice, an updated value a[i], which will become a[i-2] once the counter has been incremented, continues to reside in a register after an iteration has been completed. Hence, we do not need to load a[i-2] from the L1 cache again.
The value a[i-2] is multiplied by s, which takes a total of 8 cycles, and the result is written back into a register. At the same time, we can load a[i] and write the updated value a[i-2] from the previous iteration back to L1/MM, because the load and the store each take only one cycle.
Next, we add our result "s*a[i-2]" to a[i] over the course of 6 cycles and write it back to some register.
Finally, this updated value of a[i] needs to be written back into L1/MM. However, we can already start new operations on this value immediately since its value resides in a register.
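
The steps above can be sketched in code. This is only an illustration under an assumption: the kernel is taken to be a[i] = a[i] + s*a[i-2], which is inferred from the description in this post and not quoted from the exercise sheet. The function names and the variables r_prev2 and r_prev1 (standing in for registers) are hypothetical. The sketch shows that carrying the two most recent updated values in registers gives the same result as reloading a[i-2] from memory in every iteration.

```python
def update_naive(a, s):
    """Reload a[i-2] from 'memory' (the list) in every iteration."""
    a = list(a)
    for i in range(2, len(a)):
        a[i] = a[i] + s * a[i - 2]
    return a


def update_register_carried(a, s):
    """Keep the last two updated values in 'registers' instead of reloading.

    Assumed kernel (not taken from the exercise sheet): a[i] += s * a[i-2].
    """
    a = list(a)
    if len(a) < 2:
        return a
    r_prev2, r_prev1 = a[0], a[1]   # initial loads before the loop
    for i in range(2, len(a)):
        prod = s * r_prev2          # MULT on the register copy of a[i-2]
        new = a[i] + prod           # LOAD a[i], then ADD
        a[i] = new                  # STORE; 'new' also stays in a register
        r_prev2, r_prev1 = r_prev1, new  # rotate registers for the next i
    return a


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(update_naive(data, 2.0))
    print(update_register_carried(data, 2.0))
```

Both functions return the same list; the second one never reads a[i-2] from the list inside the loop, which is the register reuse described above.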

This model now relies on two assumptions for which I'm not sure whether they are sensible or not:

Assuming the 16 registers suffice, may we assume that "intermediate" values a[i-2] do not need to be fetched from L1 if they still reside in a register from a previously calculated iteration?
Can multiple instructions, e.g. MULT, STORE, LOAD, ADD, access a register's value simultaneously? I ask because I assume that a new multiplication with a[i] can be started while the value is simultaneously being written back to L1/MM.

I'd be grateful if anyone could clarify these points for me.

Best Regards

Max Jordan

Re: Question on Sheet 1, Ex 2

by Jan Laukemann - Wednesday, 6 May 2026, 2:56 PM

Hi Max,

You are absolutely right: a smart CPU would keep a[i] in a register for two more iterations and then reuse it as a[i-2]. ~~However, assuming each Load and Store takes 1 cycle, performance wouldn't change either if the CPU reloads a[i-2] each iteration from cache in this case~~ (the last sentence is more complicated and, I guess, more confusing than helpful, so let's forget about it, sorry)
As soon as any arithmetic or load instruction finishes, the result will be in a register and can be used by any other instruction that requires it as an input. I.e., two ADDs on the same variable would take exactly 2x6 cycles. Any instruction that is independent of all instructions currently in flight (i.e., currently being executed) can be executed in parallel, provided the first stage of the required pipeline is free (meaning, of course, that I can't start two independent ADD instructions in the same cycle due to the given hardware limitation).
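
To make the timing rule above concrete, here is a minimal scheduling sketch. The latencies (MULT 8, ADD 6, LOAD and STORE 1 cycle) are the ones mentioned in this thread. Everything else is an assumption for illustration: the function name `schedule`, one pipeline per instruction type, in-order issue, and the rule that each pipeline accepts one new instruction per cycle.

```python
# Latencies in cycles, as quoted in this thread.
LATENCY = {"MULT": 8, "ADD": 6, "LOAD": 1, "STORE": 1}


def schedule(instrs):
    """Greedy in-order schedule.

    instrs: list of (name, unit, deps) in program order, where deps is a
            list of names of earlier instructions whose results are needed.
    Returns a dict: name -> (start_cycle, finish_cycle).

    Assumptions (not from the exercise sheet): one pipeline per unit type,
    and each pipeline can start one new instruction per cycle.
    """
    next_free = {}  # unit -> first cycle in which its first stage is free
    times = {}
    for name, unit, deps in instrs:
        # An instruction can start once all of its inputs are in registers ...
        ready = max((times[d][1] for d in deps), default=0)
        # ... and the first stage of its pipeline is free.
        start = max(ready, next_free.get(unit, 0))
        times[name] = (start, start + LATENCY[unit])
        next_free[unit] = start + 1
    return times


if __name__ == "__main__":
    # Two ADDs on the same variable: the second waits for the first.
    dependent = schedule([("add1", "ADD", []), ("add2", "ADD", ["add1"])])
    print(dependent["add2"])    # (6, 12), i.e. 2 x 6 cycles in total

    # Two independent ADDs: the second starts one cycle later.
    independent = schedule([("add1", "ADD", []), ("add2", "ADD", [])])
    print(independent["add2"])  # (1, 7)
```

Under these assumptions, the dependent pair finishes after 12 cycles, while the independent pair overlaps in the pipeline and finishes after 7.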

Did this clarify the task for you?

Best,
Jan