When a system breaks at 2am, you have two options: guess at the cause or know it. Observability is what makes the second option possible.
## What Is Observability?
Observability is the ability to understand what is happening inside your system by examining its outputs. A doctor orders blood tests and X-rays to understand what is going on inside a patient’s body. Engineers use observability tools to do the same with software.
Without it, your system is a black box. You see inputs and outputs, but nothing in between. That works until something breaks, and then you have no idea where to start.
## Why It Matters
Good observability gives you three concrete advantages:
- You identify and fix bugs faster because you can trace the exact sequence of events that led to the failure.
- You find performance bottlenecks before they become user-facing problems.
- You understand the system as a whole, not just the part you wrote.
That last point matters more than it sounds. A developer who can read a production system is more useful than one who can only read their own code.
## The Three Pillars
### Logs
Logs are the first thing you should add to any system. They are a continuous stream of text entries that capture events as they happen, in the order they happened: a flight recorder for your software.
Every log entry should include:
| Field | Description |
|---|---|
| Level | INFO, WARN, ERROR, CRITICAL — indicates severity |
| Timestamp | When the event occurred, as an exact date and time |
| Message | Description of what happened, including relevant IDs and context |
One hard rule: never log secrets. API keys, tokens, passwords, cookies, and card numbers must never appear in logs. Anyone with log access can exploit them.
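The fields above map directly onto Python's standard `logging` module. The sketch below is illustrative: the logger name, the regex, and the `redact` helper are assumptions for this example, not a standard recipe, but the idea of scrubbing secrets before they reach the log stream applies everywhere.

```python
import logging
import re

# Configure a format that carries the three fields: timestamp, level, message.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("checkout")

# Hypothetical redaction helper: strip anything that looks like a bearer
# token before it can be written to the log.
TOKEN_RE = re.compile(r"Bearer\s+\S+")

def redact(text: str) -> str:
    return TOKEN_RE.sub("Bearer [REDACTED]", text)

header = "Authorization: Bearer sk_live_abc123"
log.info("payment request received, order_id=%s, %s", "ord_42", redact(header))
```

In a real system you would redact at the logging-pipeline level rather than call a helper at every log site, so one forgotten call cannot leak a credential.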
### Metrics
Metrics give you quantitative data about your system’s behavior over time. Three are worth tracking from the start.
Latency measures how long a part of your system takes to respond. Track it at these percentiles:
| Percentile | What it tells you |
|---|---|
| P50 | 50% of requests respond within X ms (your typical case) |
| P95 | 95% of requests respond within X ms |
| P99 | 99% of requests respond within X ms |
The slowest 1% of requests are outliers; stopping at P99 keeps a handful of extreme values from distorting your picture of typical performance, which is exactly what they would do to an average.
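To make the percentile definitions concrete, here is a minimal sketch using the nearest-rank method. Real systems compute percentiles from histograms in a metrics backend rather than sorting raw samples; the function and sample data are assumptions for illustration.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample value such that
    at least p% of all samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Response times in ms; two slow outliers among otherwise fast requests.
latencies = [12, 15, 14, 13, 200, 16, 15, 14, 13, 900]

p50 = percentile(latencies, 50)  # 14 ms: the typical case
p99 = percentile(latencies, 99)  # 900 ms: the tail the average would hide
```

Note how the mean of this sample (about 121 ms) describes no actual request, while P50 and P99 describe the typical case and the tail directly.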
Error rate tracks the rate of 5xx HTTP responses (500, 502, 503, etc.). It answers concrete questions: are errors spiking at a specific time of day? Is one service failing more than others? Are certain users consistently triggering failures?
Throughput measures requests per second or per minute. It answers capacity questions: should you scale up and add replicas? Are you handling far fewer requests than your infrastructure was designed for?
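Both error rate and throughput fall out of the same window of observed requests. The sketch below shows the arithmetic over an in-memory list; in practice a metrics library maintains these as counters, and the `Request` record here is an assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    timestamp: float  # seconds since epoch
    status: int       # HTTP status code

def error_rate(requests: list[Request]) -> float:
    # Only 5xx responses count as server errors.
    errors = sum(1 for r in requests if 500 <= r.status < 600)
    return errors / len(requests)

def throughput(requests: list[Request], window_seconds: float) -> float:
    return len(requests) / window_seconds

window = [Request(t, s) for t, s in
          [(0.0, 200), (0.5, 200), (1.1, 502), (1.9, 200),
           (2.4, 500), (3.0, 200), (3.8, 200), (4.5, 404)]]

print(error_rate(window))       # 2 of 8 responses are 5xx -> 0.25
print(throughput(window, 5.0))  # 8 requests over a 5 s window -> 1.6 req/s
```

Note that the 404 does not count toward the error rate: it is a client error, not a sign that your service is failing.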
### Traces
Traces follow the full journey of a single request across multiple services. Each step in the journey is a span, capturing the server, IP, latency, and errors for that step. A trace is the complete sequence of spans for one execution, stitched together by a shared Trace ID. Visualization tools render traces as a connected graph of boxes, each with its own timing and log data.
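The structure described above can be sketched in a few lines. This is a toy model, not how you would instrument a real system: production tracing libraries such as OpenTelemetry handle context propagation and export, and the `Span` class and service names here are assumptions for illustration.

```python
import time
import uuid
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str      # shared by every span in one trace
    name: str          # which service or step this span covers
    start: float
    duration_ms: float

def handle_request() -> list[Span]:
    trace_id = uuid.uuid4().hex  # one ID stitches the whole journey together
    spans = []
    for name, cost in [("api-gateway", 0.002),
                       ("auth-service", 0.005),
                       ("db-query", 0.010)]:
        start = time.time()
        time.sleep(cost)  # stand-in for the real work done at each step
        spans.append(Span(trace_id, name, start, (time.time() - start) * 1000))
    return spans

trace = handle_request()
assert len({s.trace_id for s in trace}) == 1  # every span shares one trace ID
```

Given a set of spans like this, a visualization tool groups them by `trace_id` and orders them by start time to render the connected graph described above.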
Traces are most useful in microservice architectures or flows that involve several external services. For a small application or a monolith with simple operations, logs and metrics are enough. Adding traces before you need them adds instrumentation complexity with little practical return.
## When to Add Each Pillar
| Pillar | What it answers | When to add it |
|---|---|---|
| Logs | What happened? | Always, start here |
| Metrics | How is the system performing? | Early on |
| Traces | Where exactly did it fail or slow down? | Once you have logs and metrics |
Good observability turns debugging from guesswork into diagnosis. You do not need all three pillars on day one, but you do need a plan for getting there.