Change management matters - even in a home lab
Over the last few days, I was reminded firsthand why change management exists, and why it matters far beyond large enterprise environments.
What started as a series of well-intentioned improvements to my local network quickly turned into a complex troubleshooting exercise - not because the technologies were flawed, but because I did not manage the changes deliberately enough.
The changes
I made two significant changes to my environment at roughly the same time:
- Replaced my edge device
- Introduced Pi-hole with Unbound to improve DNS visibility, control, and privacy
Individually, both changes were reasonable. Together, they touched far more of my environment than I initially accounted for.
The problems that followed
Almost immediately, issues began to surface across multiple layers of the network:
- Pi-hole and Unbound were not resolving consistently
- DNS queries intermittently timed out
- Docker containers inherited DNS settings that had not been fully validated
- Services appeared “up” but were silently failing
- Traefik returned 404 responses even though it was healthy
- A Cloudflare Tunnel remained connected while my website was effectively offline
At first, these issues looked unrelated. In reality, they were all symptoms of a single weak point in my network that had gone unnoticed.
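For anyone untangling something similar, it helps to confirm two things early: which resolver each container actually inherited, and whether the central resolver is answering at all. The sketch below is one rough way to do that in Python, assuming the docker and dig CLIs are available; the Pi-hole address 192.168.1.53 and the test hostname are placeholders, not values from my setup.

```python
#!/usr/bin/env python3
"""Quick check: which DNS servers did each container inherit, and does the resolver answer?"""
import subprocess

PIHOLE = "192.168.1.53"      # placeholder: the Pi-hole/Unbound address
TEST_NAME = "example.com"    # any name the resolver should be able to answer

def run(cmd):
    """Run a command and return its stdout, or an error marker."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=10, check=True).stdout.strip()
    except subprocess.SubprocessError as exc:
        return f"<error: {exc}>"

# 1. List running containers and the nameservers each one inherited.
containers = run(["docker", "ps", "--format", "{{.Names}}"]).splitlines()
for name in containers:
    resolv = run(["docker", "exec", name, "cat", "/etc/resolv.conf"])
    nameservers = [line.split()[1] for line in resolv.splitlines()
                   if line.startswith("nameserver")]
    print(f"{name}: nameservers={nameservers}")

# 2. Confirm the central resolver actually answers, with a short timeout.
answer = run(["dig", f"@{PIHOLE}", TEST_NAME, "+short", "+time=2", "+tries=1"])
print(f"{TEST_NAME} via {PIHOLE}: {answer or '<no answer>'}")
```

Five minutes with a check like this would have shown which containers were still pointing at the old resolver and whether the new DNS path was answering reliably.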
The hidden weak point
The DNS instability was not caused by Pi-hole or Unbound themselves. It was caused by an existing fragility in my network path that became exposed once DNS was centralized and relied upon more heavily.
That weak point should have been identified during a proper change management process.
A basic pre-change review would have raised important questions:
- Is the network path stable enough to become a DNS dependency?
- Are there single points of failure being introduced?
- How will containers and services behave if DNS resolution degrades?
- What breaks first if this component becomes unreliable?
Because those questions were not asked early, DNS issues surfaced only after the change was live - and they cascaded outward into application behavior.
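The first question, at least, can be answered with data before cutover: probe the prospective resolver for a while and look at the failure rate and latency spread. A rough sketch of that idea in Python, assuming dig is installed; the resolver address and sample count are placeholders rather than values from my environment:

```python
#!/usr/bin/env python3
"""Pre-change probe: is this resolver stable enough to become a network-wide dependency?"""
import statistics
import subprocess
import time

RESOLVER = "192.168.1.53"   # placeholder: the candidate Pi-hole/Unbound address
TEST_NAME = "example.com"
SAMPLES = 100               # placeholder sample size

latencies, failures = [], 0
for _ in range(SAMPLES):
    start = time.monotonic()
    result = subprocess.run(
        ["dig", f"@{RESOLVER}", TEST_NAME, "+short", "+time=2", "+tries=1"],
        capture_output=True, text=True)
    elapsed = time.monotonic() - start
    if result.returncode != 0 or not result.stdout.strip():
        failures += 1          # timed out or returned nothing
    else:
        latencies.append(elapsed * 1000)  # milliseconds
    time.sleep(0.5)            # spread the samples out a little

print(f"failures: {failures}/{SAMPLES}")
if latencies:
    print(f"latency ms: median={statistics.median(latencies):.1f} "
          f"max={max(latencies):.1f}")
```

Even a handful of timeouts in a run like this would have flagged the fragile network path before every device and container depended on it.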
The actual root cause
After I worked through the environment layer by layer, the website outage ultimately came down to a simple issue:
The container hosting my website had exited.
Traefik was functioning correctly, and Cloudflare Tunnel remained connected, but without the backend service running, there was nothing for Traefik to route traffic to. The result was a clean but misleading 404 response.
A simple post-change validation checklist would have caught this almost immediately.
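For illustration, the kind of check I mean: confirm the backend container is actually running, and confirm the site answers end to end. A minimal sketch using the Python standard library plus the docker CLI; the container name and URL are placeholders for whatever the change touched.

```python
#!/usr/bin/env python3
"""Post-change validation: is the backend container up, and does the site answer?"""
import subprocess
import sys
import urllib.request

CONTAINER = "website"         # placeholder: the container the change touched
URL = "https://example.com/"  # placeholder: the public URL behind Traefik/Cloudflare

ok = True

# 1. Is the backend container actually running (not merely "started at some point")?
state = subprocess.run(
    ["docker", "inspect", "-f", "{{.State.Running}}", CONTAINER],
    capture_output=True, text=True)
if state.returncode != 0 or state.stdout.strip() != "true":
    print(f"FAIL: container {CONTAINER} is not running")
    ok = False

# 2. Does the public URL return a successful response end to end?
try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        if resp.status != 200:
            print(f"FAIL: {URL} returned {resp.status}")
            ok = False
except OSError as exc:  # includes HTTPError/URLError, so a 404 lands here
    print(f"FAIL: {URL} unreachable or erroring: {exc}")
    ok = False

sys.exit(0 if ok else 1)
```

Run immediately after the change, the first check fails on the exited container and the second fails on the 404, which is exactly the signal I was missing while Traefik and the tunnel both looked healthy.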
The real lesson
The biggest takeaway was not about Pi-hole, Unbound, Docker, or Traefik.
It was about process.
Change management is not bureaucracy - it is risk control.
Even a lightweight process dramatically improves outcomes:
- Clearly define the scope of each change
- Identify existing weaknesses before introducing new dependencies
- Understand dependencies and blast radius
- Make one change at a time
- Validate each layer before proceeding
- Always have a rollback plan
In production environments, this protects customers and uptime. In a home lab, it protects your time, focus, and sanity.
Why this still matters
I intentionally build and operate a home lab because it allows me to learn in ways documentation never can. Real systems fail in non-obvious ways, and the only way to build strong mental models is to experience those failures and work through them.
This experience reinforced that technical skill and operational discipline go hand in hand. The more complex the environment becomes, the more important process becomes - regardless of scale.
The environment is now more resilient, the weak point has been addressed, and the change process itself has improved.
Lessons learned. Process improved. Onward.