Error logs matter. More than any other log level, they’re what you rely on to proactively catch issues before your users complain and to reactively debug incidents when things go wrong. They deserve to be taken seriously — which is why it annoys me that we don’t have a good, standard consensus on what deserves an ERROR log.
Time and again, I come across codebases that over-log errors. This is especially true in older codebases where ownership has changed hands and dozens of developers have contributed over the lifetime of the service. The problem isn’t that any one team was careless. It’s that nobody ever settled on a shared mental model for what an ERROR actually means and then stuck to it.
Here’s one that works for me and has been discussed quite a lot: an ERROR log means the operation failed unexpectedly, and a human needs to look at it.
Two conditions, both required:
- The operation failed unexpectedly — if it completed, even through a fallback or retry, it’s not an error.
- A human needs to look at it — if nobody needs to investigate or fix something, it’s not an error either.
Severity should encode actionability, not merely “something went wrong.”
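To make the two-condition rule concrete, here is a minimal sketch in Python. The function name `choose_level` and the choice of WARNING as the non-error level are my own assumptions for illustration; the rule itself only says what does not qualify as an ERROR.

```python
import logging


def choose_level(failed_unexpectedly: bool, needs_human: bool) -> int:
    """Return logging.ERROR only when both conditions hold."""
    if failed_unexpectedly and needs_human:
        return logging.ERROR
    # Assumption: everything else is logged at WARNING. The rule only says
    # what is NOT an ERROR, not which lower level to pick instead.
    return logging.WARNING
```

In practice nobody would call a function like this at every log site. The point is that the check is an AND, not an OR.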
These are error logs:
- An external API changed its schema and broke your integration
- A database connection failed because credentials expired
- An unhandled exception crashed a request
These are not:
- A request that timed out but succeeded on retry
- A cache miss that fell back to the source database
- A user submitting malformed JSON over your API
- A failed password attempt
The second list is where most noise comes from. Each of those is a local failure — something didn’t work on the first try, or an input was bad. But the system handled it. The request finished. The user got a response. Nothing is broken. Logging these as ERROR means your alerts fire when the system is working as designed.
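Here is roughly what that looks like for the retry case, as a sketch rather than a recommended implementation. The helper `call_with_retry`, the logger name `orders`, and the default of three attempts are all hypothetical, and I’m assuming the helper lives in application code that owns the outcome of the request.

```python
import logging
import time

logger = logging.getLogger("orders")  # hypothetical logger name


def call_with_retry(operation, attempts=3, delay_seconds=0.0):
    """Run operation(), retrying on TimeoutError, and return its result."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError as exc:
            if attempt == attempts:
                # Nothing recovered: the operation as a whole failed,
                # and a human should look at it.
                logger.error("giving up after %d attempts: %s", attempts, exc)
                raise
            # A local failure the system is about to handle: not an ERROR.
            logger.warning("attempt %d/%d timed out, retrying", attempt, attempts)
            time.sleep(delay_seconds)
```

A call that times out once and then succeeds leaves a single WARNING behind. Only the case where every attempt fails produces an ERROR.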
The deeper issue is one of perspective. There’s a difference between an operation failing and the program failing. A single timeout that gets retried is an operational hiccup. Five consecutive failed retries with no recovery is a program-level error.
This perspective also explains why library code should almost never log errors directly. A library doesn’t know whether its caller will retry, fall back, or give up. What’s a fatal failure in one context is a routine branch in another. Pass the status up. Let the code with the full picture decide the severity.
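The library point can be sketched like this. Everything here is made up for illustration: `lookup_rate` stands in for library code, and `price_preview` and `settle_invoice` are two hypothetical callers that see the same failure very differently.

```python
import logging

logger = logging.getLogger("app")  # hypothetical application logger


class RateLookupError(Exception):
    """Raised by the (hypothetical) library, which logs nothing itself."""


def lookup_rate(rates: dict, currency: str) -> float:
    # Library code: report the failure to the caller and stay silent.
    try:
        return rates[currency]
    except KeyError:
        raise RateLookupError(f"no rate for {currency}") from None


def price_preview(rates: dict, currency: str) -> float:
    # Caller 1: a missing rate is a routine branch, so fall back and move on.
    try:
        return lookup_rate(rates, currency)
    except RateLookupError as exc:
        logger.info("using default rate: %s", exc)
        return 1.0


def settle_invoice(rates: dict, currency: str, amount: float) -> float:
    # Caller 2: there is no safe fallback, so someone has to fix this.
    try:
        return amount * lookup_rate(rates, currency)
    except RateLookupError:
        logger.error("cannot settle invoice in %s", currency)
        raise
```

The same exception ends up as an INFO line in one caller and an ERROR in the other, which is only possible because the library left the decision to them.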