Why Observability Is the Most Underrated Engineering Skill
· 4 min read

If I could teach one skill to every engineer, it wouldn't be system design or algorithms. It would be observability: the ability to understand what a system is doing in production, right now, from the outside.
Most engineers treat observability as an afterthought. "We'll add logging later." "We'll set up dashboards when we have time." Then something breaks in production, and they're flying blind.
What Observability Actually Is
Observability is the ability to answer questions about your system's behavior without deploying new code. If you have to add a log statement, build, deploy, and wait for the issue to happen again, your system isn't observable.
The three pillars:
Logs: What happened, in sequence. Good for understanding specific events and error paths.
Metrics: What's happening right now, in aggregate. Good for detecting trends, anomalies, and the overall health of the system.
Traces: How a single request flowed through the system. Good for understanding latency, dependencies, and failure propagation.
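As a rough sketch, the three pillars usually appear together around a single operation. The `Logger`, `Metrics`, and `Tracer` interfaces below are hypothetical stand-ins for illustration, not any specific library's API:

```kotlin
// Hypothetical minimal interfaces; a real system would use a logging
// framework, a metrics client, and a tracing SDK instead.
interface Logger { fun info(msg: String, ctx: Map<String, Any?> = emptyMap()) }
interface Metrics {
    fun increment(name: String)
    fun timing(name: String, ms: Long)
}
interface Tracer { fun <T> span(name: String, block: () -> T): T }

class CheckoutService(
    private val logger: Logger,
    private val metrics: Metrics,
    private val tracer: Tracer
) {
    fun checkout(orderId: String) {
        val start = System.currentTimeMillis()
        // Trace: how this request flows through downstream calls
        tracer.span("checkout") {
            // ... call payment, inventory, shipping, etc.
        }
        // Metrics: aggregate signals for dashboards and alerts
        metrics.increment("checkout.success")
        metrics.timing("checkout.duration", System.currentTimeMillis() - start)
        // Log: the specific event, with enough context to reproduce it
        logger.info("Checkout completed", mapOf("orderId" to orderId))
    }
}
```

The point is that each pillar answers a different question about the same operation, so instrumenting all three at once costs very little extra code.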
Why Most Logging Is Bad
// Bad: tells you nothing useful
logger.log("Error occurred")

// Bad: too verbose, drowns out signal
logger.log("Entering function loadFeed")
logger.log("Fetched 47 items from API")
logger.log("Filtering items")
logger.log("Filtered to 42 items")
logger.log("Mapping items to UI models")
logger.log("Returning 42 UI models")

// Good: structured, contextual, actionable
logger.warn("Feed load failed", mapOf(
    "userId" to userId,
    "errorType" to error.javaClass.simpleName,
    "errorMessage" to error.message,
    "retryCount" to retryCount,
    "networkType" to networkInfo.type
))

Good logs answer: What happened? To whom? Under what conditions? Can we reproduce it?
The Metrics That Matter
I think about metrics in layers:
Business metrics: Revenue, DAU, retention. These tell you if the product is healthy.
Product metrics: Feature adoption, funnel conversion, engagement depth. These tell you if features are working.
System metrics: Latency, error rate, throughput, resource utilization. These tell you if the infrastructure is healthy.
Code-level metrics: Cache hit rates, function execution time, queue depth. These tell you if specific components are performing well.
Each layer depends on the ones below it. A drop in business metrics usually traces back through product metrics to system metrics to a specific code-level issue.
Building Observable Systems
The best time to add observability is when you're writing the code. Not after. The same way you write tests alongside features, you should write instrumentation alongside features.
class FeedRepository(
    private val api: FeedApi,
    private val cache: FeedCache,
    private val metrics: MetricsReporter
) {
    suspend fun getFeed(userId: String): FeedResult {
        val startTime = System.currentTimeMillis()

        // Try cache first
        val cached = cache.get(userId)
        if (cached != null) {
            metrics.increment("feed.cache.hit")
            metrics.timing("feed.load.cached", System.currentTimeMillis() - startTime)
            return FeedResult.Success(cached)
        }
        metrics.increment("feed.cache.miss")

        // Fetch from API
        return try {
            val feed = api.fetchFeed(userId)
            cache.put(userId, feed)
            metrics.increment("feed.api.success")
            metrics.timing("feed.load.api", System.currentTimeMillis() - startTime)
            FeedResult.Success(feed)
        } catch (e: IOException) {
            metrics.increment("feed.api.failure", tags = mapOf("error" to e.javaClass.simpleName))
            FeedResult.Error(e)
        }
    }
}

This code is barely longer than the version without metrics. But it lets me answer questions like:
- What's the cache hit rate?
- How fast is the cached path vs the API path?
- What percentage of API calls are failing?
- What types of failures are most common?
Without these metrics, the answer to "why is the feed slow?" requires adding instrumentation, deploying, and waiting. With them, I can answer it in minutes from a dashboard.
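Those answers are just arithmetic over the counters the repository already emits. A sketch, using the counter names from the example above (the numbers are invented for illustration):

```kotlin
// Derive dashboard answers from raw counters. Counter values are invented;
// in practice they come from your metrics backend.
fun ratio(numerator: Long, denominator: Long): Double =
    if (denominator == 0L) 0.0 else numerator.toDouble() / denominator

fun main() {
    val cacheHits = 8_200L    // feed.cache.hit
    val cacheMisses = 1_800L  // feed.cache.miss
    val apiSuccess = 1_650L   // feed.api.success
    val apiFailure = 150L     // feed.api.failure

    val hitRate = ratio(cacheHits, cacheHits + cacheMisses)
    val apiFailureRate = ratio(apiFailure, apiSuccess + apiFailure)
    println("cache hit rate = $hitRate")        // 0.82
    println("API failure rate = $apiFailureRate")
}
```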
Alerts Done Right
Bad alerts: "CPU is above 80%." Is that bad? What should I do about it?
Good alerts: "Feed load p95 latency has exceeded 3 seconds for 10 consecutive minutes, affecting approximately 15% of users in the US-West region."
Good alerts are:
- Actionable. The alert tells you what's wrong and implies what to investigate.
- Not noisy. If an alert fires and doesn't require action, it's training you to ignore alerts. That's dangerous.
- Tied to user impact. CPU usage doesn't matter. User-facing latency does. Alert on what users experience.
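The "sustained for N minutes" part of a good alert is simple to sketch. Assuming per-minute p95 latency samples in milliseconds (the function name and thresholds are illustrative, not from any particular alerting system):

```kotlin
// Page only when p95 latency has stayed above the threshold for a full
// window of consecutive minutes -- a single spike should not wake anyone.
fun shouldPage(p95PerMinuteMs: List<Long>, thresholdMs: Long, windowMinutes: Int): Boolean {
    if (p95PerMinuteMs.size < windowMinutes) return false
    return p95PerMinuteMs.takeLast(windowMinutes).all { it > thresholdMs }
}

fun main() {
    val oneSpike = listOf(800L, 3_500L, 900L, 700L, 850L)
    val sustained = listOf(3_200L, 3_400L, 3_600L, 3_300L, 3_500L)
    println(shouldPage(oneSpike, 3_000L, 3))   // false: spike, not sustained
    println(shouldPage(sustained, 3_000L, 3))  // true: sustained degradation
}
```

This is the logic behind the 3-second/10-minute alert above: it filters out transient noise so the alerts that do fire deserve attention.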
The Mindset Shift
Observability isn't a tool or a framework. It's a way of thinking. When I write code, I ask: "If this breaks at 3 AM, will the on-call engineer have enough information to diagnose and fix it without waking me up?"
If the answer is no, I add more context. More structured logging. More metrics. Better error messages. The 30 seconds it takes to write a good log line saves hours of debugging later.
The most productive engineers I know aren't the ones who write the most code. They're the ones who can diagnose production issues the fastest. And that speed comes from building observable systems from day one.