Kubernetes·Mar 4, 2025·10 min read

IntelliTune's top 10 K8s remediation patterns — and how they work

Applicare Engineering Team

400,000 Kubernetes incidents. 10 patterns.

IntelliTune has auto-resolved over 400,000 Kubernetes incidents across our customer base. When we analysed the full dataset, we found that 10 patterns account for 78% of all auto-resolutions. Here they are — what triggers them, how they're detected, and what IntelliTune does to fix them.

400k+
K8s incidents resolved
78%
Covered by top 10 patterns
400ms
Median remediation time

Pattern 1: OOMKill recovery

Trigger: Pod OOMKilled, restarts > 2 in 10 minutes
Detection: IntelliSense flags the restart loop and correlates with memory utilisation trend
Action: Increase memory limit by 40% within policy gates, create Jira ticket with heap dump analysis
Success rate: 94%
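The "increase within policy gates" step can be sketched as a small pure function. This is an illustration, not IntelliTune's actual implementation; the 40% bump comes from the pattern above, and `policy_max_mib` stands in for whatever ceiling the gate configuration defines.

```python
def proposed_memory_limit(current_mib: int, policy_max_mib: int,
                          bump_factor: float = 1.4) -> tuple[int, bool]:
    """Return (new_limit_mib, needs_approval).

    Bumps the pod's memory limit by 40%, but never past the policy
    gate's ceiling; a capped bump is escalated for human approval.
    """
    target = int(current_mib * bump_factor)
    if target <= policy_max_mib:
        return target, False   # within gates: apply automatically
    return policy_max_mib, True  # capped by policy: escalate
```

A 512 MiB pod under a 2 GiB gate would be bumped automatically; a 1.8 GiB pod would hit the ceiling and go to a human.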

Pattern 2: Node pressure pod migration

Trigger: Node CPU ready > 8% or node memory pressure flagged
Detection: Correlate node resource contention with pod performance degradation on that node
Action: Trigger live migration to a less-loaded node via DRS recommendation or K8s rescheduling
Success rate: 97%

Pattern 3: Connection pool exhaustion

Trigger: DB connection pool > 90% for > 2 minutes
Detection: ArcIn identifies the service causing the exhaustion and traces it to a missing pool config
Action: Scale pool size within configured bounds, alert dev team with root cause and recommended config change
Success rate: 89%

Connection pool exhaustion is the single most common root cause of "mysterious" latency spikes we see across customer environments. It almost always traces back to a deploy that changed connection handling without updating the pool configuration.
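"Scale pool size within configured bounds" maps to a simple bounded step-up. A hedged sketch of that logic, with the 90% trigger from the pattern above and a hypothetical `step`/`max_size` drawn from the gate config:

```python
def scale_pool(current_size: int, utilisation_pct: float,
               max_size: int, step: int = 10) -> int:
    """Grow the pool one step at a time while utilisation exceeds the
    90% trigger, never past the configured ceiling. Below the trigger
    the pool is left alone."""
    if utilisation_pct <= 90.0:
        return current_size
    return min(current_size + step, max_size)
```

Stepping rather than jumping straight to the ceiling keeps the remediation reversible and avoids masking the underlying misconfiguration the dev team still has to fix.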

Pattern 4: CrashLoopBackOff — config error

Trigger: Pod in CrashLoopBackOff, exit code 1 or 137
Detection: Parse container logs for known config error signatures; correlate with recent ConfigMap changes
Action: If config error detected and previous ConfigMap version exists: rollback ConfigMap, restart pod
Success rate: 82%
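The rollback decision hinges on two checks: a known config-error signature in the log tail, and a previous ConfigMap revision to roll back to. A sketch of that gate; the regex signatures below are illustrative examples, not IntelliTune's real signature set:

```python
import re

# Illustrative signatures only; the production set is larger and tuned.
CONFIG_ERROR_SIGNATURES = [
    re.compile(r"missing required (config|env)"),
    re.compile(r"invalid configuration"),
    re.compile(r"could not parse .*\.(ya?ml|json)"),
]

def should_rollback(log_tail: str, has_previous_configmap: bool) -> bool:
    """Roll back only when a known signature matches AND an older
    ConfigMap revision exists; otherwise the pattern does not fire."""
    text = log_tail.lower()
    hit = any(p.search(text) for p in CONFIG_ERROR_SIGNATURES)
    return hit and has_previous_configmap
```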

Pattern 5: Horizontal scaling lag

Trigger: HPA at max replicas, request queue building, p99 latency increasing
Detection: IntelliSense predicts queue exhaustion 3-5 minutes before user impact
Action: Temporarily increase HPA max within policy gates; notify platform team
Success rate: 91%

Patterns 6–10 (summary)

How policy gates work

Every IntelliTune action runs through policy gates before executing. Gates define: which patterns are allowed to run automatically, which require human approval, which are blocked entirely, and what rollback looks like if the remediation makes things worse.

The default gate configuration is conservative — most actions require approval for the first 30 days, then graduate to automatic based on success rate in your environment. You can override this at any time.
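The graduation logic described above can be sketched as a small decision function. The field names and the 30-day window are taken from the text; everything else (the `Gate` shape, the success-rate bar) is a hypothetical stand-in for the real configuration:

```python
from dataclasses import dataclass

@dataclass
class Gate:
    mode: str                      # "auto" | "approve" | "block"
    min_success_rate: float = 0.9  # bar for graduating to automatic

def decide(gate: Gate, pattern_success_rate: float, days_active: int) -> str:
    """Conservative by default: everything needs approval for the first
    30 days, then a pattern graduates to automatic once its success
    rate in this environment clears the gate's bar."""
    if gate.mode == "block":
        return "blocked"
    if gate.mode == "approve" or days_active < 30:
        return "needs_approval"
    if pattern_success_rate >= gate.min_success_rate:
        return "auto"
    return "needs_approval"
```

An override simply flips `mode`, which is why you can promote or demote a pattern at any time without touching the detection side.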
