Designing an Offline-First Sync Engine for Mobile Apps
CONTEXT
Mobile apps operate in unreliable network environments. Users expect instant feedback regardless of connectivity. An offline-first sync engine treats the local database as the source of truth and syncs with the server asynchronously.
PROBLEM
Most mobile apps treat the network as a given. They show a spinner, make a request, render the response. This breaks in three common scenarios:
- Flaky connections: elevators, tunnels, rural areas, crowded venues
- High latency: emerging markets where round trips take 2 to 5 seconds
- Aggressive battery optimization: the OS kills background connections on both Android and iOS
The core problem: how do you keep the app fully functional offline while ensuring data consistency when connectivity returns?
CONSTRAINTS
- Local database must be the single source of truth for reads
- Mutations must be captured and queued for async sync
- Conflict resolution must be deterministic and predictable
- Sync must be idempotent (safe to retry any operation)
- Battery and bandwidth must be respected (no sync on every keystroke)
- The engine must recover from mid-sync crashes without data loss
DESIGN
The sync engine sits between the app's data layer and the remote API. Four responsibilities:
- Local persistence: all reads and writes hit a local database
- Change tracking: mutations captured as an append-only operation log
- Sync scheduling: background process pushes and pulls when connectivity allows
- Conflict resolution: deterministic strategy when local and remote diverge
OPERATION LOG
Every mutation gets written to an append-only log before touching the local database. Each entry contains:
- Unique operation ID
- Entity type and entity ID
- Operation type (create / update / delete)
- Logical timestamp (monotonically increasing counter, not wall clock)
- Payload (for creates and updates)
data class SyncOperation(
val id: String = UUID.randomUUID().toString(),
val entityType: String,
val entityId: String,
val type: OperationType,
val timestamp: Long,
val payload: Map<String, Any?>?,
val status: SyncStatus = SyncStatus.PENDING
)
enum class OperationType { CREATE, UPDATE, DELETE }
enum class SyncStatus { PENDING, IN_FLIGHT, SYNCED, FAILED }

Logical clocks avoid issues with users changing device time or timezone drift across devices.
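One minimal way to implement such a logical timestamp is a Lamport-style counter, a sketch of which follows. In practice the counter must be persisted across app restarts; that is elided here.

```kotlin
// A Lamport-style logical clock: strictly increasing, immune to wall-clock changes.
class LogicalClock(private var counter: Long = 0L) {
    // Called for every local mutation; returns the timestamp to store on the operation.
    fun tick(): Long = ++counter

    // Called when a remote operation is observed, so local time never falls behind peers.
    fun observe(remoteTimestamp: Long) {
        if (remoteTimestamp > counter) counter = remoteTimestamp
    }

    fun current(): Long = counter
}
```

Advancing past any remote timestamp seen during pull guarantees that a device's next local edit is ordered after everything it has already merged.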
SYNC SCHEDULING
Batch operations. Sync when conditions are favorable:
| Trigger | Strategy |
|---|---|
| Network available | ConnectivityManager (Android) / NWPathMonitor (iOS) |
| Debounce | Wait 2 to 5 seconds after last write |
| Retry | Exponential backoff: 1s, 2s, 4s, 8s, capped at 60s |
| Periodic fallback | WorkManager / BGTaskScheduler every 15 minutes |
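The retry strategy in the table can be computed by a small pure function, sketched below. Real schedulers usually add random jitter on top of this to avoid thundering herds.

```kotlin
// Exponential backoff matching the table: 1s, 2s, 4s, 8s, ... capped at 60s.
fun backoffMillis(attempt: Int, baseMillis: Long = 1_000, capMillis: Long = 60_000): Long {
    require(attempt >= 0) { "attempt must be non-negative" }
    val shifted = baseMillis shl attempt.coerceAtMost(30) // clamp shift to avoid Long overflow
    return shifted.coerceAtMost(capMillis)
}
```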
class SyncScheduler(
    private val connectivityMonitor: ConnectivityMonitor,
    private val syncEngine: SyncEngine,
    private val scope: CoroutineScope // app-lifetime coroutine scope, injected
) {
    private var debounceJob: Job? = null

    fun onLocalWrite() {
        debounceJob?.cancel() // restart the debounce window on every write
        debounceJob = scope.launch {
            delay(3_000) // let rapid writes settle before syncing
            if (connectivityMonitor.isConnected()) {
                syncEngine.push()
            }
        }
    }
}

CONFLICT RESOLUTION
Two devices edit the same record while both are offline. Three strategies, ordered by complexity:
- Last-Write-Wins (LWW): highest logical timestamp wins. Simple. Silently discards changes. Acceptable for user preferences, read receipts.
- Field-Level Merge: merge at field level. Device A changes name, device B changes email, both survive. Conflict only when the same field is modified on both sides.
fun mergeFields(
base: Map<String, Any?>,
local: Map<String, Any?>,
remote: Map<String, Any?>
): Map<String, Any?> {
val merged = base.toMutableMap()
for (key in (local.keys + remote.keys)) {
val localChanged = local[key] != base[key]
val remoteChanged = remote[key] != base[key]
merged[key] = when {
localChanged && !remoteChanged -> local[key]
!localChanged && remoteChanged -> remote[key]
localChanged && remoteChanged -> remote[key] // LWW fallback per field
else -> base[key]
}
}
return merged
}
- Application-Level Resolution: domain-specific logic. Inventory systems sum deltas. Collaborative editors use CRDTs. Financial transactions require explicit user resolution.
HANDLING DELETES
Physical deletion creates a re-creation problem: if one device deletes a record while another still holds an unsynced copy of it, that second device's next sync will re-create the record.
Solution: tombstones. Mark records as deleted with a deletedAt timestamp. Propagate the tombstone via sync. Purge tombstones older than 30 days.
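The tombstone lifecycle reduces to two small helpers, sketched here. Row is a stand-in for the entity record, and the 30-day window matches the purge policy above.

```kotlin
// Stand-in record mirroring the entity shape: null deletedAt = alive, non-null = tombstone.
data class Row(val id: String, val updatedAt: Long, val deletedAt: Long?)

// 30-day tombstone retention window, in milliseconds.
val TOMBSTONE_TTL_MS = 30L * 24 * 60 * 60 * 1_000

// Reads filter tombstones out so deleted records never reach the UI.
fun liveRows(rows: List<Row>): List<Row> = rows.filter { it.deletedAt == null }

// A periodic job physically removes tombstones older than the retention window.
fun purgeTombstones(rows: List<Row>, nowMs: Long): List<Row> =
    rows.filterNot { it.deletedAt != null && nowMs - it.deletedAt > TOMBSTONE_TTL_MS }
```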
data class Entity(
val id: String,
val data: Map<String, Any?>,
val updatedAt: Long,
val deletedAt: Long? = null // null = alive, non-null = tombstone
)

ORDERING GUARANTEES
Operations on the same entity must be applied in order. Operations on different entities can be applied in any order.
- Group pending operations by entity ID
- Sort each group by logical timestamp
- Send sequentially per entity, wait for acknowledgment
- Different entities can sync concurrently
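The four steps above can be sketched as a push planner. Op is a slim stand-in for the SyncOperation defined earlier, carrying only the fields the planner needs.

```kotlin
// Slim stand-in for SyncOperation (only the fields needed for ordering).
data class Op(val entityType: String, val entityId: String, val timestamp: Long)

// Group pending operations by entity; within each group, order by logical timestamp.
// Groups are independent: push each group sequentially, but distinct groups concurrently.
fun planPush(pending: List<Op>): Map<String, List<Op>> =
    pending
        .groupBy { "${it.entityType}:${it.entityId}" }
        .mapValues { (_, ops) -> ops.sortedBy { it.timestamp } }
```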
TRADE-OFFS
| Gain | Cost |
|---|---|
| Works offline | Local database + operation log storage overhead |
| Instant UI feedback | Eventual consistency, UI may show stale data |
| Resilient to network failures | Conflict resolution complexity is domain-specific |
| Battery-friendly batching | Sync delay means data is not immediately available on other devices |
FAILURE MODES
| Failure | Mitigation |
|---|---|
| Network drops mid-sync | Idempotent operations with operation ID as server-side idempotency key |
| App killed by OS during sync | Transactional batches: local DB update + queue insertion in one transaction |
| Double-send of operations | Mark as IN_FLIGHT during sync, reset to PENDING on failure |
| Permanently failing operations | Dead letter queue after N retries for manual inspection |
| Clock skew between devices | Logical clocks instead of wall-clock timestamps |
| Tombstone not propagated | Periodic full-state reconciliation as fallback |
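The idempotency mitigation in the first row can be sketched as follows, assuming the server keeps a set of already-applied operation IDs. An in-memory set stands in here; a real server would persist applied IDs alongside the write.

```kotlin
// Server-side idempotency: each operation ID is applied at most once, so client
// retries after a mid-sync network drop are harmless.
class IdempotentApplier(private val apply: (String) -> Unit) {
    private val seen = mutableSetOf<String>()

    // Returns true if the operation was applied, false if it was a duplicate retry.
    fun applyOnce(operationId: String): Boolean {
        if (!seen.add(operationId)) return false
        apply(operationId)
        return true
    }
}
```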
SCALING CONSIDERATIONS
- Operation log growth: compact the log periodically. Merge consecutive updates to the same entity into a single operation.
- Large backlogs: if a device comes online after an extended offline period, paginate the sync. Do not send 10,000 operations in one batch.
- Server-side fan-out: when multiple devices sync for the same user, the server must handle concurrent writes with proper locking or CAS (compare-and-swap).
- Selective sync: not all entities need to be synced. Allow per-entity-type opt-in to reduce bandwidth and storage.
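The log-compaction idea from the first bullet can be sketched as a pure function over update operations. Update is a simplified stand-in for SyncOperation, and the merge assumes later payloads overwrite earlier fields.

```kotlin
// Simplified update operation: entity ID, logical timestamp, and changed fields.
data class Update(val entityId: String, val timestamp: Long, val payload: Map<String, Any?>)

// Compact the log: fold all pending updates per entity into a single operation
// carrying the merged payload and the latest timestamp.
fun compact(updates: List<Update>): List<Update> =
    updates
        .groupBy { it.entityId }
        .map { (id, ops) ->
            val ordered = ops.sortedBy { it.timestamp }
            val merged = ordered.fold(emptyMap<String, Any?>()) { acc, op -> acc + op.payload }
            Update(id, ordered.last().timestamp, merged)
        }
```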
OBSERVABILITY
Track these metrics to understand sync health in production:
- Sync latency: time between local mutation and server acknowledgment.
- Queue depth: number of pending operations per device (alerts if consistently growing).
- Conflict rate: percentage of sync operations that trigger conflict resolution.
- Failure rate: percentage of operations that enter the dead letter queue.
- Tombstone accumulation: count of active tombstones (indicates deletion patterns).
KEY TAKEAWAYS
- Local database is the source of truth. The server is a peer that eventually catches up.
- Use logical clocks, not wall clocks.
- Conflict resolution strategy depends on the domain. Start with LWW, graduate to field-level merge when needed.
- Tombstones solve the delete propagation problem.
- Idempotency is non-negotiable. Every operation must be safe to retry.
FINAL THOUGHTS
The best sync engines are invisible. The user edits data, puts the phone in a pocket, and everything converges. Building that experience requires careful thinking about operation logs, conflict resolution, ordering guarantees, and failure recovery.
Start with the minimum viable sync: local persistence, an operation queue, last-write-wins. Layer in field-level merging, smarter scheduling, and observability as usage patterns emerge. The architecture should grow with the product, not ahead of it.