The LOAD and STORE instructions are not part of the loop-carried dependency chain and can therefore overlap with the MUL-ADD chain.
For the LOAD, consider the earliest point at which each element can be loaded. We can load a[0] in the first cycle, a[1] in the second, a[2] in the third, and so on. In the warm-up phase of the first iteration we have to wait one cycle before we can continue with the MUL, but for every following iteration the data we need is either already loaded (in the case of a[i]) or already in a register because we computed its value earlier (in the case of a[i-2]).
The STORE does have to be executed, but the value of a[i] is already in a register and stays there until we need it again two iterations later, so the STORE can also fully overlap with the limiting MUL-ADD chain.
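
The argument above can be checked with a small dataflow model. The following is only a sketch under stated assumptions: the kernel is taken to be a[i] = a[i] + s * a[i-2] (the text does not give it explicitly), the function name `simulate` is hypothetical, and all latencies are placeholder values rather than numbers for any real CPU. Each operation starts as soon as its inputs are ready, and one LOAD is issued per cycle, as described above.

```python
def simulate(n_iters, lat_load=1, lat_mul=4, lat_add=4, lat_store=1):
    """Return (value_ready, add_stalls, store_done) for a toy dataflow model.

    Assumed kernel (not given in the text): a[i] = a[i] + s * a[i-2].
    All latencies are hypothetical placeholders, in cycles.
    """
    first, last = 2, n_iters + 1
    # One LOAD is issued per cycle starting at cycle 0: a[0], a[1], a[2], ...
    load_ready = {i: i + lat_load for i in range(last + 1)}

    # a[0] and a[1] are never computed, so they come straight from memory.
    value_ready = {0: load_ready[0], 1: load_ready[1]}
    add_stalls = []
    store_done = {}

    for i in range(first, last + 1):
        # MUL needs a[i-2], which sits in a register once it is available.
        mul_done = value_ready[i - 2] + lat_mul
        # ADD needs the MUL result and the loaded a[i].
        add_start = max(mul_done, load_ready[i])
        add_stalls.append(add_start - mul_done)  # cycles spent waiting on the LOAD
        value_ready[i] = add_start + lat_add
        # STORE consumes the new a[i]; nothing in the loop waits for it.
        store_done[i] = value_ready[i] + lat_store

    return value_ready, add_stalls, store_done


if __name__ == "__main__":
    ready, stalls, _ = simulate(6)
    print("ADD stalls caused by LOADs:", stalls)
    print("cycles between a[i-2] and a[i]:",
          [ready[i] - ready[i - 2] for i in range(2, 8)])
```

With these placeholder latencies, the ADD never waits for a LOAD, and each a[i] becomes available exactly lat_mul + lat_add cycles after a[i-2]. Changing lat_store alters only store_done, not value_ready, which is the sense in which the STORE stays off the limiting chain.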