Observability

In software engineering, more specifically in distributed computing, observability is the ability to collect data about programs’ execution, modules’ internal states, and the communication among components.[1][2]

To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it.

Observability is foundational to site reliability engineering, as it is the first step in triaging a service outage. One of the goals of observability is to minimize the amount of prior knowledge needed to debug an issue.

Philosophy and Methodology

Three Pillars of Observability

The “three pillars of observability” are the main categories of telemetry data collected to understand and troubleshoot complex distributed systems. They are:

1. Logs
  • What they are:
    Time-stamped, discrete records of events. Usually unstructured (text) or semi-structured (JSON).
  • Examples:
    • 2025-08-16T12:30:01Z INFO User 123 logged in
    • Stack traces when a service crashes
    • Audit events (API calls, config changes)
  • Strengths:
    • Detailed context (e.g., error messages, user IDs).
    • Easy to search for specific events.
  • Weaknesses:
    • Can get extremely large and noisy.
    • Hard to analyze across thousands of services without indexing & aggregation.
2. Metrics
  • What they are:
    Numerical measurements aggregated over intervals. Usually structured as time-series data.
  • Examples:
    • CPU utilization = 75%
    • Latency p95 = 120ms
    • Request rate = 10k req/sec
  • Strengths:
    • Compact and efficient to store long-term.
    • Great for dashboards, trend analysis, and alerts (e.g., “alert if CPU > 90% for 5 min”).
  • Weaknesses:
    • Limited detail/context: a metric shows that latency is 200 ms, but not why.
    • Granularity trade-offs (can miss rare anomalies if aggregated too much).
3. Traces
  • What they are:
    Records of requests flowing through a distributed system, capturing causal relationships between services.
  • Examples:
    • A single user request travels through API Gateway → Service A → Service B → Database.
    • Each step has timing + metadata, stitched into a “trace.”
  • Strengths:
    • Lets you see end-to-end request flow.
    • Perfect for debugging latency or dependency failures.
  • Weaknesses:
    • High storage/processing overhead.
    • Hard to implement consistently (needs instrumentation).
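As a concrete illustration, the three signal types above can be sketched with Python’s standard library alone: a structured (JSON) log event, a p95 latency metric aggregated from raw samples, and a minimal trace span that records timing and parent/child causality. All names here (`log_event`, `span`, the service names) are illustrative, not tied to any real observability vendor or library:

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

# --- Logs: time-stamped, semi-structured (JSON) event records ---
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event, **fields):
    logging.info(json.dumps({"event": event, **fields}))

# --- Metrics: aggregate raw latency samples into a single p95 value ---
def p95(samples_ms):
    ordered = sorted(samples_ms)
    idx = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[idx]

# --- Traces: spans stitched into one trace via shared trace_id ---
spans = []  # in a real system these would be exported, not kept in memory

@contextmanager
def span(name, trace_id, parent_id=None):
    span_id = uuid.uuid4().hex[:8]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        spans.append({
            "trace_id": trace_id, "span_id": span_id,
            "parent_id": parent_id, "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

# One request flowing through two "services", stitched into a single trace:
trace_id = uuid.uuid4().hex[:8]
with span("api_gateway", trace_id) as root:
    log_event("request_received", user_id=123, trace_id=trace_id)
    with span("service_a", trace_id, parent_id=root):
        time.sleep(0.01)  # simulated downstream work

print(p95([100, 110, 120, 200, 95]))  # -> 200
```

Note how the `trace_id` ties the spans together while `parent_id` captures the causal chain, which is exactly the instrumentation burden listed as a weakness above.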
How they work together
  • Logs = What happened? (narrative details)
  • Metrics = How much / how often? (quantitative health signals)
  • Traces = Where and why did it happen? (causal relationships, context)

Together, they give observability: the ability to infer the internal state of your system from the outside, especially in cloud-native, microservices, and distributed environments.
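The alerting use of metrics mentioned above (“alert if CPU > 90% for 5 min”) reduces to evaluating a threshold over a sliding window of time-series samples. This is a minimal sketch of that logic, assuming 1-minute samples; the function name and thresholds are illustrative:

```python
from collections import deque

def make_alert_check(threshold, window_size):
    """Fire when every sample in the last `window_size` readings
    exceeds `threshold` (e.g. CPU > 90% for 5 consecutive
    1-minute samples)."""
    window = deque(maxlen=window_size)

    def record(sample):
        window.append(sample)
        # Alert only once the window is full and uniformly above threshold.
        return len(window) == window_size and all(s > threshold for s in window)

    return record

check = make_alert_check(threshold=90.0, window_size=5)
readings = [85, 92, 95, 93, 96, 97]  # the early sub-threshold sample ages out
fired = [check(r) for r in readings]
print(fired)  # -> [False, False, False, False, False, True]
```

Requiring the whole window to breach the threshold is the usual guard against alerting on a single noisy spike.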

Technology and Tools

  • Prometheus
  • Kafka / Superset
  • Grafana Stack
  • Elastic Stack
  • TICK stack
  • Nagios
  • Zabbix
  • Icinga2
  • Ansible/Chef/Puppet
  • Git/GitLab/Bitbucket
  • Go/Python/Ruby
  • Terraform/Terragrunt
  • Docker/K8s

Companies