Why is Mean Time to Recovery (MTTR) treated as the default metric for incident response efficiency when it ignores the wide variance in outage severity? A single prolonged failure can skew the monthly average enough to overshadow dozens of rapid, automated self-healing resolutions. How do you balance raw recovery speed against thorough root-cause elimination?
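
To make the skew concern concrete, here is a minimal sketch with invented numbers: 30 self-healed incidents at 2 minutes each plus one 8-hour outage. These figures are assumptions for illustration only, not real incident data. It compares the mean (what MTTR reports) with the median for the same month:

```python
from statistics import mean, median

# Hypothetical recovery times in minutes for one month (illustrative only):
# 30 automated self-healing recoveries of 2 minutes each, plus one 8-hour outage.
long_outage = 480.0
recovery_minutes = [2.0] * 30 + [long_outage]

mttr = mean(recovery_minutes)       # the figure an MTTR dashboard would show
typical = median(recovery_minutes)  # what a typical incident looked like
share = long_outage / sum(recovery_minutes)  # downtime owed to the one outage

print(f"MTTR (mean): {mttr:.1f} min")    # MTTR (mean): 17.4 min
print(f"Median:      {typical:.1f} min") # Median:      2.0 min
print(f"Long outage share of downtime: {share:.0%}")  # 89%
```

With these made-up inputs the mean is almost nine times the median, even though 30 of the 31 incidents recovered in 2 minutes. That gap is why I am unsure the average alone says much about response efficiency.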