Why I Always Start With the Failure Case

When I start building something new, I don't think about the happy path first. I think about what happens when things go wrong.

What if the network is unavailable? What if the API returns an error? What if the user taps the button twice? What if the device is low on memory? What if the data is corrupted?

This isn't pessimism. It's the single most valuable habit I've developed as an engineer. The happy path usually takes care of itself. The failure cases are where systems break, users get frustrated, and teams lose sleep at 2 AM.

The Problem With Happy-Path-First

When you design for the happy path first, failure handling becomes an afterthought. You bolt on error handling at the end, usually in a rush, usually incomplete.

// Happy-path-first thinking:
fun loadUserProfile(userId: String): UserProfile {
    val response = api.getProfile(userId)
    return response.toUserProfile()
}
 
// "Oh right, I should handle errors..."
fun loadUserProfile(userId: String): UserProfile? {
    return try {
        val response = api.getProfile(userId)
        response.toUserProfile()
    } catch (e: Exception) {
        null // Now every caller has to deal with null
    }
}

The problem compounds. Every downstream function now receives a nullable type. The UI layer has to guess what "null" means. Was it a network error? A 404? A malformed response? The information is lost because error handling was an afterthought.

Failure-First Design

When I start with failure cases, the code structure naturally accommodates both success and failure from the beginning.

import java.io.IOException

sealed class ProfileResult {
    data class Success(val profile: UserProfile) : ProfileResult()
    data class NotFound(val userId: String) : ProfileResult()
    data class NetworkError(val cause: Throwable) : ProfileResult()
    data class ServerError(val code: Int, val message: String) : ProfileResult()
}
 
fun loadUserProfile(userId: String): ProfileResult {
    val response = try {
        api.getProfile(userId)
    } catch (e: IOException) {
        return ProfileResult.NetworkError(e)
    }
 
    return when (response.code) {
        200 -> ProfileResult.Success(response.body.toUserProfile())
        404 -> ProfileResult.NotFound(userId)
        else -> ProfileResult.ServerError(response.code, response.message)
    }
}

Now every caller knows exactly what can go wrong and can handle each case appropriately. The UI can show "no internet" for network errors, "user not found" for 404s, and "something went wrong" for server errors. No guessing.
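Here is a minimal sketch of the UI side that shows why there is no guessing. It assumes a simple `UserProfile` with a `displayName` field; that class and the `messageFor` helper are hypothetical. Because `ProfileResult` is sealed, the `when` needs no `else` branch. If a new failure case is added later, every call site that doesn't handle it becomes a compile error.

```kotlin
// Hypothetical stand-in so the sketch compiles on its own.
data class UserProfile(val id: String, val displayName: String)

sealed class ProfileResult {
    data class Success(val profile: UserProfile) : ProfileResult()
    data class NotFound(val userId: String) : ProfileResult()
    data class NetworkError(val cause: Throwable) : ProfileResult()
    data class ServerError(val code: Int, val message: String) : ProfileResult()
}

// Exhaustive over the sealed hierarchy: no else branch, so a new subclass
// forces this function to be updated before the code compiles again.
fun messageFor(result: ProfileResult): String = when (result) {
    is ProfileResult.Success -> "Welcome, ${result.profile.displayName}"
    is ProfileResult.NotFound -> "User not found"
    is ProfileResult.NetworkError -> "No internet connection"
    is ProfileResult.ServerError -> "Something went wrong (HTTP ${result.code})"
}
```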

My Failure-Case Checklist

For every feature I build, I run through these categories:

Network failures

Connection timeout
Request timeout
No internet connectivity
Partial response (connection drops mid-transfer)
SSL/certificate errors
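Several of these network failures can be distinguished by exception type. The sketch below assumes a JVM client that surfaces the standard `java.net` and `javax.net.ssl` exceptions. The `NetworkFailure` enum and the `classify` function are hypothetical, and the exact exceptions depend on the HTTP client. For example, a truncated body may arrive as a plain `IOException` rather than an `EOFException`. Connection and request timeouts both usually appear as `SocketTimeoutException`, so telling them apart typically needs client-specific details.

```kotlin
import java.io.EOFException
import java.io.IOException
import java.net.ConnectException
import java.net.SocketTimeoutException
import java.net.UnknownHostException
import javax.net.ssl.SSLException

// Hypothetical categories loosely mirroring the checklist above.
enum class NetworkFailure { TIMEOUT, NO_CONNECTIVITY, PARTIAL_RESPONSE, TLS, OTHER }

// Maps JDK exceptions to checklist categories. All of these are IOException
// subclasses, so the catch-all else branch must stay last.
fun classify(e: IOException): NetworkFailure = when (e) {
    is SocketTimeoutException -> NetworkFailure.TIMEOUT
    // DNS failure or refused connection: often, though not always, means offline.
    is UnknownHostException, is ConnectException -> NetworkFailure.NO_CONNECTIVITY
    is EOFException -> NetworkFailure.PARTIAL_RESPONSE
    is SSLException -> NetworkFailure.TLS // also covers SSLHandshakeException
    else -> NetworkFailure.OTHER
}
```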

Data failures

Missing required fields in API response
Unexpected data types (a string where you expect an int)
Empty collections where you expect at least one item
Stale cache data that conflicts with fresh data
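Data failures are easiest to handle when a loosely typed payload is validated once, at the boundary, into a typed result. The following is a minimal sketch. It assumes the response has already been decoded into a `Map<String, Any?>`. The field names, `SavedCollection`, `ParseResult` and `parseCollection` are all made up for illustration. Stale-cache conflicts need a separate policy and are not covered here.

```kotlin
// Hypothetical validated model; field names are illustrative only.
data class SavedCollection(val id: String, val itemCount: Int, val itemIds: List<String>)

sealed class ParseResult {
    data class Valid(val collection: SavedCollection) : ParseResult()
    data class Invalid(val reason: String) : ParseResult()
}

// Turns missing fields, wrong types and empty lists into explicit results
// instead of crashes later on. Note: some JSON decoders produce Long or
// Double for numbers, in which case the Int cast would need adjusting.
fun parseCollection(raw: Map<String, Any?>): ParseResult {
    val id = raw["id"] as? String
        ?: return ParseResult.Invalid("missing or non-string 'id'")
    val itemCount = raw["itemCount"] as? Int
        ?: return ParseResult.Invalid("missing or non-int 'itemCount'")
    val rawItems = raw["itemIds"] as? List<*>
        ?: return ParseResult.Invalid("missing or non-list 'itemIds'")
    val items = rawItems.filterIsInstance<String>()
    if (items.size != rawItems.size) return ParseResult.Invalid("'itemIds' contains non-string entries")
    if (items.isEmpty()) return ParseResult.Invalid("'itemIds' is empty")
    return ParseResult.Valid(SavedCollection(id, itemCount, items))
}
```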

User behavior failures

Double-tap on submit buttons
Navigating away mid-operation
Rotating the device during a long-running task
Backgrounding the app during a critical flow
Rapidly switching between screens
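For double-taps, one option is an in-flight guard rather than a time-based debounce. The first tap claims the guard, and later taps are ignored until the request completes. The button's enabled state can be bound to `isBusy`. `SubmitGuard` and `onSaveTapped` are hypothetical names. `AtomicBoolean.compareAndSet` keeps the claim safe even if taps are delivered on different threads.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical in-flight guard: at most one submit runs at a time.
class SubmitGuard {
    private val inFlight = AtomicBoolean(false)

    // Bind the button's enabled state to !isBusy so it is visibly disabled too.
    val isBusy: Boolean
        get() = inFlight.get()

    // Atomically claims the guard; returns false if a submit is already running.
    fun tryStart(): Boolean = inFlight.compareAndSet(false, true)

    // Releases the guard; call it on success and on failure, or the button stays locked.
    fun finish() = inFlight.set(false)
}

// Hypothetical tap handler: `startRequest` launches the async save and must
// invoke the callback it receives when the request completes, either way.
fun onSaveTapped(guard: SubmitGuard, startRequest: (onDone: () -> Unit) -> Unit): Boolean {
    if (!guard.tryStart()) return false // duplicate tap: ignore it
    startRequest { guard.finish() }
    return true
}
```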

Device failures

Low memory (the OS may kill your process)
Low storage (can't write to disk)
Slow CPU (operations take longer than expected)
Screen sizes you didn't test on
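Low storage is one device failure that shows up directly in code as a failed write. Below is a minimal sketch using `java.io.File`. The `WriteResult` type, the `writeSafely` function and the `headroomBytes` parameter are hypothetical. The free-space check is only a hint, because space can run out between the check and the write. The `IOException` branch is what actually turns a failed write into a typed result instead of a crash.

```kotlin
import java.io.File
import java.io.IOException

sealed class WriteResult {
    object Written : WriteResult()
    data class Failed(val reason: String) : WriteResult()
}

// Treats "can't write to disk" as an expected outcome rather than a crash.
fun writeSafely(file: File, bytes: ByteArray, headroomBytes: Long = 0): WriteResult {
    val dir = file.absoluteFile.parentFile
    // Early, best-effort check; usableSpace reports 0 for paths that don't exist,
    // so only consult it when the directory is really there.
    if (dir != null && dir.exists() && dir.usableSpace < bytes.size + headroomBytes) {
        return WriteResult.Failed("not enough free space")
    }
    return try {
        file.writeBytes(bytes)
        WriteResult.Written
    } catch (e: IOException) {
        // Disk full, missing directory, permission problems surfaced as I/O errors, etc.
        WriteResult.Failed(e.message ?: "write failed")
    }
}
```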

Timing failures

Race conditions between concurrent operations
Operations completing after the UI is destroyed
Stale callbacks from previous screen instances
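A generation counter can filter out stale callbacks. Each request captures the current generation, and its result is delivered only if no newer request has started and the screen has not been torn down in the meantime. `LatestOnly` is a hypothetical helper. The sketch assumes results and `invalidate()` run on the same thread, such as the main thread. Otherwise, the check and the delivery would need additional synchronization.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper: only the most recent request may deliver its result.
class LatestOnly<T>(private val deliver: (T) -> Unit) {
    private val generation = AtomicLong(0)

    // Call when starting a request; hand the returned callback to the async work.
    fun begin(): (T) -> Unit {
        val token = generation.incrementAndGet()
        // Deliver only if nothing newer started (and no teardown) since this request began.
        return { result -> if (generation.get() == token) deliver(result) }
    }

    // Call when the screen is destroyed so in-flight results are dropped.
    fun invalidate() {
        generation.incrementAndGet()
    }
}
```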

A Real Example

I was building a feature that let users save content to a collection. The happy path was simple: tap save, call API, show confirmation.

Before writing any code, I listed the failure cases:

1. User taps save while offline
2. User taps save twice quickly
3. API call succeeds but the UI has already been dismissed
4. Collection is full (server-side limit)
5. Content was deleted by the time the save request reaches the server
6. User's auth token expired between tapping save and the request firing

Each of these required a different response. For #1, I queue the save and retry when connectivity returns. For #2, I debounce the tap and disable the button during the request. For #3, I use a ViewModel that survives the UI lifecycle. For #4 and #5, I show specific error messages. For #6, I silently refresh the token and retry.
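Cases 1, 4 and 5 can be expressed in the same result-type style as `ProfileResult`. The sketch below is an assumption-heavy illustration, not the feature's actual code. `SaveCoordinator`, `SaveOutcome`, `ServerReply`, the `isOnline` check and the `send` function are all hypothetical stand-ins for a real connectivity signal and API client. The queue lives only in memory, so a production version would likely need to persist it.

```kotlin
// Hypothetical outcomes the UI can render with specific messages.
sealed class SaveOutcome {
    object Saved : SaveOutcome()
    object QueuedOffline : SaveOutcome()
    object CollectionFull : SaveOutcome()
    object ContentDeleted : SaveOutcome()
}

// What the (assumed) server call reports for a single save.
enum class ServerReply { OK, COLLECTION_FULL, CONTENT_DELETED }

class SaveCoordinator(
    private val isOnline: () -> Boolean,
    private val send: (contentId: String) -> ServerReply
) {
    private val pending = ArrayDeque<String>() // in-memory only in this sketch

    val pendingCount: Int
        get() = pending.size

    fun save(contentId: String): SaveOutcome {
        if (!isOnline()) {
            // Case 1: remember the intent instead of failing; skip duplicates
            // so repeated taps while offline queue a single save.
            if (contentId !in pending) pending.addLast(contentId)
            return SaveOutcome.QueuedOffline
        }
        return toOutcome(send(contentId))
    }

    // Call when connectivity returns; replays queued saves in order.
    fun onReconnected(): List<SaveOutcome> {
        val results = mutableListOf<SaveOutcome>()
        while (pending.isNotEmpty() && isOnline()) {
            val id = pending.first()
            results += toOutcome(send(id))
            pending.removeFirst() // removed only after send returns, so a throw keeps it queued
        }
        return results
    }

    private fun toOutcome(reply: ServerReply): SaveOutcome = when (reply) {
        ServerReply.OK -> SaveOutcome.Saved
        ServerReply.COLLECTION_FULL -> SaveOutcome.CollectionFull // case 4
        ServerReply.CONTENT_DELETED -> SaveOutcome.ContentDeleted // case 5
    }
}
```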

If I had started with the happy path, most of these would have been bugs discovered in production.

The Counterargument

People push back on this approach: "You're over-engineering. YAGNI (You Aren't Gonna Need It). Ship it and fix bugs later."

There's some truth to that. Not every failure case needs handling upfront. But there's a difference between "I know about this edge case and I'm choosing to defer it" and "I never considered this edge case and it bit me in production."

Failure-first thinking isn't about handling every possible error. It's about being aware of every possible error. You still make trade-offs. You still ship incrementally. But you make those trade-offs consciously, with a list of known risks.

What This Looks Like in Practice

I keep a simple format in my design notes:

Feature: Save to Collection

Failure cases:
[HANDLE] Offline - queue and retry
[HANDLE] Double-tap - debounce + disable
[HANDLE] UI dismissed - ViewModel survives
[HANDLE] Collection full - show error with suggestion
[DEFER]  Content deleted - show generic error, revisit if frequent
[DEFER]  Token expired - rely on existing auth refresh middleware

The [HANDLE] items get built into the implementation. The [DEFER] items get documented so future engineers (including future me) know they exist.

This takes 10 minutes and prevents hours of debugging. More importantly, it builds a habit of defensive thinking that compounds over time. The engineer who thinks about failure first doesn't produce perfect code. They produce resilient code, which is much more valuable.