DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

Agentic Ops: How I Shipped My Vibe-Coded Game to Production

2 min read
Observability Telemetry and Predictive AIOps

8 min read
Building ReefWatch, a Coral-Powered Production Triage Agent

18 min read
Runbook-Driven Development: A New Way to Ship

2 min read
The 10 Commandments of Working in Production

7 min read
Why I treat API timeouts as "unknown", not failures

1 min read
The Prometheus label that blew our monitoring bill out 6x

4 min read
How to Optimize MongoDB on Bare Metal Servers: SRE Playbook

5 min read
API Rate Limiting: Patterns That Scale

2 min read
Kiln Crisis Management: Controlling Irregular Raw Meal in CCR Using Python

3 min read
Rename a Kubernetes PVC Without Losing Your Data: PersistentVolume Rebinding

4 min read
Performance Tuning: The Day the Server Got “Tired” and Started Acting Funny

3 min read
AI as an SRE Intern: What I Let Agents Touch During Incident Response

2 min read
AI Agents Mapped My Legacy Production Environment in One Hour.

4 min read
Remetric: find waste in self-hosted Prometheus, Grafana, and Loki

6 min read