Windows Activation Server Outage: Human Error
Remember the WGA server outage of a few days ago, the one which basically resulted in everyone trying to activate - or patch - being labeled a pirate? It was caused by the one thing it's almost impossible to guard against: human error.
According to Microsoft, as they wrote on the WGA blog:
And, some clarity on how things are supposed to work. If the servers are down, the system is designed so that all activations are defaulted to "genuine." In this case the servers returned an erroneous value, which answers my question in my earlier post about why the system couldn't simply recognize a server outage - it wasn't an outage.
How did this happen in the first place?
Nothing more than human error started it all. Pre-production code was sent to production servers. The production servers had not yet been upgraded with a recent change to enable stronger encryption/decryption of product keys during the activation and validation processes. The result of this is that the production servers declined activation and validation requests that should have passed.
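To make the difference between "down" and "wrong" concrete, here is a minimal Python sketch of a fail-open check. It is not Microsoft's code: the names (Verdict, validate, server_down, server_misconfigured) are invented for illustration, and it assumes an unreachable server surfaces as a connection error.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    GENUINE = "genuine"
    NOT_GENUINE = "not genuine"


def validate(query_server: Callable[[], Verdict]) -> Verdict:
    """Fail-open check: an unreachable server defaults to GENUINE."""
    try:
        return query_server()
    except ConnectionError:
        # Server is down: give the user the benefit of the doubt.
        return Verdict.GENUINE


def server_down() -> Verdict:
    # Hypothetical stand-in for a real outage.
    raise ConnectionError("validation server unreachable")


def server_misconfigured() -> Verdict:
    # Hypothetical stand-in for this incident: the server is up,
    # but it wrongly rejects a key that should have passed.
    return Verdict.NOT_GENUINE


print(validate(server_down).value)
print(validate(server_misconfigured).value)
```

The fallback only fires when no answer comes back. A server that answers, but answers wrongly, goes straight through, which is why the "default to genuine" safety net never kicked in.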
Why did it take so long to fix?
While the response to the activation issue was quick (less than thirty minutes) the effect on our validation service continued even after the rollback took place. We expected the rollback to fix both issues at the same time but we now realize that we didn't have the right monitoring in place to be sure the fixes had the intended effect.
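The monitoring gap described above comes down to checking each service separately after a rollback rather than assuming one fix covers both. A rough Python sketch of that idea follows; the function name, the probe callables, and the service states are all hypothetical, and nothing here reflects Microsoft's actual tooling.

```python
from typing import Callable, Dict, List


def failing_services(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run one known-good request per service; return the names that still fail."""
    failed = []
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            # A probe that blows up counts as a failure, not a pass.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical state after the rollback: activation recovered,
# validation still rejecting requests that should pass.
probes = {
    "activation": lambda: True,
    "validation": lambda: False,
}
print(failing_services(probes))  # prints ['validation']
```

A check like this, run right after the rollback, would flag that validation was still broken instead of leaving it to users to find out.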
While it's nice to get a how and why, it doesn't necessarily assuage my issues with activated software in general. However, if I want to use Windows, I'm pretty much painted into a corner.

