Prometheus is a free software application used for event monitoring and alerting.[2] It records metrics in a time-series database (allowing for high dimensionality), collects them via an HTTP pull model, and offers flexible queries and real-time alerting.[3][4]
The project is written in Go and licensed under the Apache 2 License, with source code available on GitHub.[5]
Prometheus the application has two sides (the wider ecosystem is discussed later):
Data Collection Side (Exporters / Scraping)
- Prometheus pulls metrics from targets by scraping HTTP endpoints.
- Applications expose metrics directly (via client libraries) or indirectly through exporters (adapters that translate metrics into Prometheus’ format, like Node Exporter for system metrics, Blackbox Exporter for probes, etc.).
- This side is all about gathering raw metrics.
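The scrape model is simple to demonstrate: a target only needs to serve plain text in Prometheus' text exposition format over HTTP, and Prometheus pulls it on a schedule. A minimal stdlib-only sketch (the metric name, labels, and port are illustrative, not from the original text):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # illustrative counter; a real app would use a client library


def render_metrics() -> str:
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total HTTP requests handled.\n"
        "# TYPE app_requests_total counter\n"
        f'app_requests_total{{method="get",path="/"}} {REQUEST_COUNT}\n'
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves /metrics so Prometheus can scrape this process as a target."""

    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)


# To expose it: HTTPServer(("", 8000), MetricsHandler).serve_forever()
# Prometheus would then scrape http://host:8000/metrics on its scrape_interval.
```

Exporters like Node Exporter do exactly this, just translating an existing system's metrics into the same format.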
Data Storage & Querying Side (Time-Series Database)
- Once collected, metrics are stored in Prometheus’ internal time-series database (TSDB).
- Prometheus automatically indexes by labels (key-value pairs), which makes filtering and aggregation flexible.
- The stored data can be queried using PromQL, visualized (often in Grafana), or used to trigger alerts.
- This side is all about storing, querying, and analyzing metrics.
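Conceptually, every series is identified by its label set, and queries filter and aggregate over those labels. A rough stdlib sketch of what a PromQL selector plus `sum by (path)` does (the series and values are made up for illustration):

```python
# Toy model: each time series is (labels, latest_value).
series = [
    ({"__name__": "http_requests_total", "path": "/", "method": "get"}, 100.0),
    ({"__name__": "http_requests_total", "path": "/", "method": "post"}, 7.0),
    ({"__name__": "http_requests_total", "path": "/api", "method": "get"}, 42.0),
]


def query(matchers: dict) -> list:
    """Select series whose labels match all key/value pairs (PromQL-style selector)."""
    return [(lbls, v) for lbls, v in series
            if all(lbls.get(k) == val for k, val in matchers.items())]


def sum_by(key: str, selected: list) -> dict:
    """Aggregate matched series, grouping by one label, like `sum by (path) (...)`."""
    out: dict = {}
    for lbls, v in selected:
        out[lbls[key]] = out.get(lbls[key], 0.0) + v
    return out


# Roughly equivalent to: sum by (path) (http_requests_total)
print(sum_by("path", query({"__name__": "http_requests_total"})))
# → {'/': 107.0, '/api': 42.0}
```

The real TSDB keeps an inverted index from each label pair to the series containing it, which is what makes this kind of filtering cheap.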
Each side has its own issues and solutions, but for an Observability SRE the main focus of the work is the data storage and querying side, with the following challenges:
1. Data Retention & Storage Limits
- Prometheus stores data on local disk (by default), which doesn’t scale infinitely.
- High-cardinality metrics (too many label combinations) can blow up storage quickly.
- Retention is usually limited (15 days default) unless remote storage is configured.
- Managing long-term storage is often a headache.
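A back-of-the-envelope sizing makes these limits concrete. Prometheus compresses samples to roughly 1-2 bytes each; the series count, scrape interval, and bytes-per-sample below are assumptions for illustration:

```python
def disk_bytes(active_series: int, scrape_interval_s: float,
               retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Estimate TSDB disk usage: ingest rate * retention window * bytes/sample."""
    samples_per_sec = active_series / scrape_interval_s
    return samples_per_sec * retention_days * 86_400 * bytes_per_sample


# 1M active series scraped every 15s at the default 15d retention, ~2 bytes/sample:
gib = disk_bytes(1_000_000, 15, 15) / 2**30
print(f"{gib:.0f} GiB")  # roughly 161 GiB
```

The same arithmetic shows why an unexpected jump in active series (or a longer retention flag) can fill a disk far faster than planned.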
2. High Cardinality & Label Explosion
- Too many unique label/value pairs (e.g., user_id, session_id) create millions of time series.
- This increases memory usage, slows queries, and can even crash Prometheus.
- Detecting and preventing label explosion is a constant concern.
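Series counts multiply across label cardinalities, which is why a single unbounded label can be fatal. A quick sketch (the labels and their cardinalities are hypothetical):

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Worst-case series count for one metric: the product of per-label cardinalities."""
    return prod(label_cardinalities.values())


safe = {"method": 5, "path": 50, "status": 10}   # bounded labels: 2,500 series
risky = {**safe, "user_id": 100_000}             # one unbounded label added
print(series_count(safe), series_count(risky))   # → 2500 250000000
```

In practice, SREs watch `prometheus_tsdb_head_series` and per-metric cardinality to catch this before memory runs out.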
3. Query Performance (PromQL Load)
- Complex PromQL queries (e.g., aggregations across millions of series) can consume huge CPU and memory.
- Heavy queries from dashboards (Grafana, ad-hoc queries) can degrade performance.
- Requires tuning and sometimes limiting queries.
4. Scaling Limitations
- A single Prometheus server has limits (storage, CPU, memory).
- Federating multiple Prometheus instances is possible, but adds complexity.
- For very large environments, SREs often integrate with remote storage backends (e.g., Thanos, Cortex, Mimir, VictoriaMetrics).
5. Reliability & Durability
- Prometheus’ local TSDB can be fragile if the node crashes or disk fills up.
- Data loss risk is higher without replication.
- Need for backups or HA setups (e.g., running multiple Prometheus instances scraping the same targets).
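The usual HA pattern is two identical Prometheus servers scraping the same targets, with a query layer (e.g., Thanos) deduplicating results by dropping a replica label. A toy sketch of that dedup step (the label names are assumptions):

```python
def dedup(series: list, replica_label: str = "replica") -> dict:
    """Keep one series per label set after dropping the replica label."""
    out: dict = {}
    for lbls, value in series:
        # Identity of a series = its labels, minus the replica marker.
        key = tuple(sorted((k, v) for k, v in lbls.items() if k != replica_label))
        out.setdefault(key, value)  # first replica seen wins
    return out


# Two replicas scraping the same target produce duplicate series:
pair = [
    ({"job": "node", "replica": "A"}, 1.0),
    ({"job": "node", "replica": "B"}, 1.0),
]
print(len(dedup(pair)))  # → 1
```

Because each replica scrapes independently, the values may differ slightly; real deduplicators pick one replica's samples per time window rather than merging them.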
6. Maintenance & Operations
- Disk I/O tuning (SSD vs HDD) affects query and compaction performance.
- WAL (Write-Ahead Log) and block compaction issues can cause storage bloat.
- Requires monitoring Prometheus itself (meta-monitoring).
Thanos / Cortex / Mimir Storage
Thanos, Cortex, and Mimir are the most common remote storage & global query solutions for Prometheus. Each solves the same basic problems (scaling, HA, long-term storage), but with different trade-offs. Here’s a breakdown:
Thanos
Pros
- Simple to add on top of existing Prometheus — just sidecar + object storage.
- Retains all Prometheus’ local features (no need to replace Prometheus itself).
- Global query layer across multiple Prometheus instances.
- Scales storage easily via S3, GCS, or any object store.
- Open source, widely adopted, and CNCF project.
Cons
- Query performance can degrade if object storage latency is high.
- Operationally complex: multiple components (sidecar, querier, store gateway, compactor, ruler).
- Scaling query layer requires tuning.
- Still dependent on local Prometheus for scraping.
Cortex
Pros
- Horizontal scalability: designed as a microservices architecture from the ground up.
- Supports multi-tenancy (strong isolation between tenants).
- Durable storage in object stores (blocks storage) or, in older chunk-storage deployments, DynamoDB/Bigtable/Cassandra.
- HA and long-term storage are built-in (no need for separate Prometheus TSDB retention tuning).
- Good for very large enterprise / multi-team setups.
Cons
- Complex to operate (lots of services: distributors, ingesters, queriers, rulers, etc.).
- Higher operational overhead compared to Thanos.
- Query performance depends heavily on backend storage and caching.
- Migration from plain Prometheus setup can be non-trivial.
Mimir (Grafana Labs' fork of Cortex)
Pros
- Built on Cortex, but with simplified operations.
- Better out-of-the-box defaults, reduced component complexity.
- Strong multi-tenancy and horizontal scalability.
- Integrates tightly with Grafana ecosystem.
- Advanced query optimization and caching.
- Actively maintained and production-proven at scale (Grafana Cloud).
Cons
- Still relatively new compared to Thanos, so a smaller community.
- Operates like Cortex (many moving parts, though simplified).
- Requires Kubernetes or strong orchestration for reliable deployment.
- May tie you more closely to the Grafana ecosystem than the vendor-neutral Thanos.