Load latency tolerance in dynamically scheduled processors
This paper provides a quantitative evaluation of load latency tolerance in a dynamically scheduled processor. To determine the latency tolerance of each memory load operation, our simulations use flexible load completion policies instead of a fixed memory hierarchy that dictates the latency. Although our policies delay load completion as long as possible, they produce performance, measured in instructions committed per cycle (IPC), comparable to that of a processor with an ideal memory system in which all loads complete in one cycle. Our simulations reveal that, to produce IPC values within 12% of a processor with an ideal memory system, between 1% and 71% of loads need to be satisfied within a single cycle, and that up to 74% can be satisfied in as many as 32 cycles, depending on the benchmark and processor configuration. Load latency tolerance is largely determined by two factors: whether a mispredicted branch is in the load's data dependence graph, and the depth of that graph. Our results show that up to 36% of all loads miss in the level one cache yet have latency demands lower than second level cache access times. We also show that a similar percentage of loads hit in the level one cache even though they possess enough latency tolerance to be satisfied by lower levels of the memory hierarchy.
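The two factors named above (a mispredicted branch among a load's dependents, and the depth of the dependence graph) can be illustrated with a toy slack calculation. The sketch below is not the simulator or the load completion policies used in this paper. It assumes a small hypothetical instruction window with unlimited issue width, treats every instruction as taking one cycle, and models a mispredicted branch as a node that must resolve at its earliest possible time. The function name `slack_per_instruction` and the instruction names are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def slack_per_instruction(producers, latency, urgent=frozenset()):
    """Return, per instruction, how many cycles it can be delayed.

    producers: maps each instruction to the instructions it depends on.
    latency:   maps each instruction to its execution latency in cycles.
    urgent:    instructions that must finish as early as possible
               (assumption: used here to model mispredicted branches).
    """
    # Producers come before consumers in this order.
    order = list(TopologicalSorter(producers).static_order())

    # Forward pass: earliest finish time of each instruction.
    earliest = {}
    for node in order:
        start = max((earliest[p] for p in producers.get(node, ())), default=0)
        earliest[node] = start + latency[node]
    horizon = max(earliest.values())  # length of the critical path

    # Invert the graph so each node knows its consumers.
    consumers = {node: [] for node in order}
    for node, deps in producers.items():
        for dep in deps:
            consumers[dep].append(node)

    # Backward pass: latest finish time that does not stretch the horizon.
    latest = {}
    for node in reversed(order):
        if node in urgent:
            latest[node] = earliest[node]
        else:
            latest[node] = min(
                (latest[c] - latency[c] for c in consumers[node]),
                default=horizon,
            )
    return {node: latest[node] - earliest[node] for node in order}


# Hypothetical window: three loads with different dependents.
producers = {
    "ld_a": [], "add": ["ld_a"], "br": ["add"],           # feeds a branch
    "ld_b": [], "st": ["ld_b"],                           # shallow chain
    "ld_c": [], "m1": ["ld_c"], "m2": ["m1"], "m3": ["m2"],  # deep chain
}
latency = {name: 1 for name in producers}

slack = slack_per_instruction(producers, latency, urgent={"br"})
slack_no_mispredict = slack_per_instruction(producers, latency)
```

In this toy window, `ld_c` has no slack because it heads the deepest chain, `ld_b` can be delayed by two cycles, and `ld_a` loses its single cycle of slack once `br` is marked as mispredicted. The percentages reported in the abstract come from the paper's simulations, not from a model like this one.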