SysAdmin1138 is a user on octodon.social.
SysAdmin1138 @sysadmin1138

There is a thing you see a lot in incident management, and in the customer response to it: the demand for a Root Cause Analysis and the identification of a root cause.

This is misleading.

Most systems are distributed systems these days. They're complex. Which makes for complex failures. Complex failures are interlocking failures.

Failure-Mode Analysis is a better term than RCA.

The Interstate 35W bridge in Minneapolis fell into the Mississippi in 2007. The failures there were many.
* A bad design (known)
* Lots of salt used on MN roads (known)
* Inspections missed a few things (unknown)
* Incorrect understanding about how cracks move in that particular structure (unknown)
All of it failed.