Web caching is hard

2022-08-14 3-minute read

Web caching is hard. And also, maybe I’m not that good under pressure? In any event, I made the following mistakes today while debugging a web site that uses our nginx cache and bit the dust under heavy load:

Action: I ran curl -I https://website.org/ and it hung.

Wrong assumption: Something is wrong with nginx. Why else would it just hang?

Reconsidered conclusion: The resource (the home page) is a MISS, so nginx has to retrieve it from the origin, but the origin is overloaded and timing out, so my request is also timing out. Maybe something is wrong with the nginx caching configuration since the home page really should be a HIT… but that’s another problem.
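One way to tell "nginx is broken" apart from "nginx is waiting on the origin" is to put a hard timeout on the request and look at the cache status header. The following is only a minimal sketch. It assumes the server exposes nginx's $upstream_cache_status variable in a response header named X-Cache-Status, which is a common convention but not something this post's configuration is known to do. The probe and cache_status helpers are hypothetical names.

```python
import socket
import urllib.error
import urllib.request


def cache_status(headers, name="X-Cache-Status"):
    """Return the cache status (HIT, MISS, ...) from a header mapping, or None.

    The header name is an assumption; use whatever your nginx config adds.
    """
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value.strip().upper()
    return None


def probe(url, timeout=5.0):
    """Send a HEAD request with a hard timeout instead of hanging forever."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return cache_status(response.headers) or "no cache header"
    except urllib.error.HTTPError as err:
        # nginx answered, but with an error status (for example a 504).
        return cache_status(err.headers) or f"HTTP {err.code}"
    except (urllib.error.URLError, socket.timeout, TimeoutError):
        return "no response in time (possibly a MISS waiting on the origin)"
```

Calling probe("https://website.org/") during the incident would presumably have returned the timeout message, which points at the origin rather than at nginx itself.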

Action: I changed the configuration from our normal caching set of directives to our aggressive caching set of directives, reloaded nginx and curl -I https://website.org/ still hung.

Wrong assumption: aggressive caching isn’t working and I need a different configuration.

Reconsidered conclusion: The home page still has not been loaded from the origin, so every request for it is going to be a MISS, and is going to hang, until nginx is able to fill the cache with it. The configuration change might be the right change; we just need the origin to calm down before we will know.
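The point that a cache can only be filled by a successful origin response can be sketched with a toy model. This is not how nginx is implemented; the fetch function, the OriginDown exception and the two stand-in origins are all hypothetical, and the cache is just a dictionary.

```python
class OriginDown(Exception):
    """Raised by the stand-in origin when it times out."""


def fetch(cache, key, origin):
    """Serve from the cache if present; otherwise ask the origin and store on success."""
    if key in cache:
        return "HIT"
    body = origin(key)   # raises OriginDown while the origin is overloaded
    cache[key] = body    # only reached when the origin actually answered
    return "MISS"


def overloaded(key):
    raise OriginDown(key)


def healthy(key):
    return f"<html>{key}</html>"


cache = {}
results = []
# The origin is overloaded twice, recovers once, then falls over again.
for origin in (overloaded, overloaded, healthy, overloaded):
    try:
        results.append(fetch(cache, "/", origin))
    except OriginDown:
        results.append("TIMEOUT")

print(results)  # ['TIMEOUT', 'TIMEOUT', 'MISS', 'HIT']
```

No caching configuration, however aggressive, changes the first two results. Only one successful origin response does, and after that the cache keeps answering even when the origin is down again.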

Action: I restarted PHP on the origin to free up PHP processes so my home page request could fill the cache … and curl -I https://website.org/ still hung.

Wrong assumption: WTF! The world is ending!

Reconsidered conclusion: The regular traffic which is accessing other pages (not the home page) consumed all the available PHP processes on the origin before my request for the home page could complete, so nginx is still unable to fill the cache with the home page.

Action: Once we got things under control, I changed the caching level from aggressive back down to normal. I ran curl -I https://website.org/ and it was HIT’ing. I concluded that we don’t need the aggressive cache after all. Got some coffee, came back later and ran it again and it consistently showed MISS.

Wrong assumption: What?!? Did something change on the origin to stop the cache from working??

Reconsidered conclusion: The aggressive configuration cached the home page for 5 minutes. Even after I changed back to normal caching, the home page was still in the cache, so it was served from the cache. After 5 minutes, the cached copy expired. Now the normal cache settings are in play to determine whether the response is cached or not. In other words, you have to wait for the cache to expire (or bust the cache) before you can know whether the new cache settings are working.
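The timeline can be reproduced with a toy cache in which an entry's lifetime is fixed when it is stored, so a later configuration change does not touch entries that are already cached. This is a simplified sketch, not nginx's actual behavior in every detail. The TtlCache class is hypothetical, and NORMAL_TTL = 0 encodes the assumption that the normal settings do not cache the home page at all, which is what the consistent MISS suggests.

```python
class TtlCache:
    """Toy cache: each entry's expiry is decided at the moment it is stored."""

    def __init__(self):
        self.entries = {}  # key -> expiry time in seconds

    def request(self, key, now, ttl):
        """ttl is the lifetime the *current* config gives a new entry (0 = do not cache)."""
        expiry = self.entries.get(key)
        if expiry is not None and now < expiry:
            return "HIT"  # still fresh under the settings it was stored with
        if ttl > 0:
            self.entries[key] = now + ttl
        else:
            self.entries.pop(key, None)
        return "MISS"


AGGRESSIVE_TTL = 300  # 5 minutes, as described in the post
NORMAL_TTL = 0        # assumption: normal settings do not cache the home page

cache = TtlCache()
timeline = [
    cache.request("/", now=0, ttl=AGGRESSIVE_TTL),  # filled under aggressive settings
    cache.request("/", now=60, ttl=NORMAL_TTL),     # config switched, entry still fresh
    cache.request("/", now=301, ttl=NORMAL_TTL),    # entry expired, normal settings apply
    cache.request("/", now=302, ttl=NORMAL_TTL),    # and keep applying
]
print(timeline)  # ['MISS', 'HIT', 'MISS', 'MISS']
```

The HIT at 60 seconds says nothing about the normal settings. Only the requests after the old entry expires do.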