As you can see in the lecture, we generally assume that SIMD variants of instructions have the same throughput and latency as their scalar counterparts. There are exceptions, as you know from the SQRT problem, but whenever an exception applies it will be clearly stated.
Horizontal add: Most of this was shown in the lecture on 04/30. We assume here that it is a single instruction that takes some extra amount of latency. If the compiler performs unrolling on top of SIMD, we need it multiple times (once per partial-sum register), plus a final scalar ADD to combine the results.
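
To make the counting concrete, here is a rough sketch in plain C, assuming a sum reduction over an array, 4 lanes per SIMD register, and 2-way unrolling. None of this is from the lecture: `vec_t`, `hadd`, `sum_simd_unroll2` and `LANES` are hypothetical names, and the "registers" are emulated with small arrays rather than real intrinsics. It only illustrates where the horizontal adds and the final scalar ADD show up.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumed SIMD width: 4 doubles per register */

/* Hypothetical stand-in for a SIMD register (emulated, no intrinsics). */
typedef struct { double lane[LANES]; } vec_t;

/* Models the horizontal add: reduces all lanes of one register to a scalar.
   In the model above this counts as ONE instruction with extra latency. */
static double hadd(vec_t v)
{
    double s = 0.0;
    for (int l = 0; l < LANES; ++l)
        s += v.lane[l];
    return s;
}

/* Sum reduction with SIMD plus 2-way unrolling: two independent accumulators. */
double sum_simd_unroll2(const double *a, size_t n)
{
    assert(n % (2 * LANES) == 0);  /* sketch only: no remainder loop */
    vec_t acc0 = {{0.0}}, acc1 = {{0.0}};

    for (size_t i = 0; i < n; i += 2 * LANES) {
        for (int l = 0; l < LANES; ++l) {
            acc0.lane[l] += a[i + l];          /* models one SIMD ADD */
            acc1.lane[l] += a[i + LANES + l];  /* models a second, independent SIMD ADD */
        }
    }

    double s0 = hadd(acc0);  /* horizontal add no. 1 */
    double s1 = hadd(acc1);  /* horizontal add no. 2 */
    return s0 + s1;          /* final scalar ADD */
}
```

With 2-way unrolling this gives two horizontal adds and one scalar ADD after the loop. Calling `sum_simd_unroll2` on an array holding 1, 2, ..., 16 returns 136.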