Interesting cause of Skype outage
In my last post I wrote about the recent Skype outage. The service was back in full swing shortly thereafter, but I found the cause of the outage noteworthy.
In their blog post explaining what happened, Skype's Villu Arak wrote:
On Thursday, 16th August 2007, the Skype peer-to-peer network became unstable and suffered a critical disruption. The disruption was triggered by a massive restart of our users’ computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update.
The high number of restarts affected Skype’s network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact.
Arak went on to explain that the network's self-healing technology can normally correct this sort of problem, but in this case a previously unseen bug prevented it from doing so, resulting in the two-day outage. Regarding Skype's reliability, Arak stated:
This disruption was unprecedented in terms of its impact and scope. We would like to point out that very few technologies or communications networks today are guaranteed to operate without interruptions.
Upon hearing this explanation, one might briefly think that Microsoft is to blame, but a closer look shows that to be an unfair accusation. In a new blog post today, Arak exonerated Microsoft from any responsibility for the Skype outage:
Some reactions to the explanation, however, have reminded us of one of the basic tenets of communication: It’s not what you say. It’s what they hear. We’d therefore like to clear a few misunderstandings that emerged in yesterday’s reactions to our explanation of what transpired last week.
...
We don’t blame anyone but ourselves. The Microsoft Update patches were merely a catalyst — a trigger — for a series of events that led to the disruption of Skype, not the root cause of it. And Microsoft has been very helpful and supportive throughout.
...
there was nothing different about this set of Microsoft patches. During a joint call soon after problems were detected, Skype and Microsoft engineers went through the list of patches that had been pushed out. We ruled each one out as a possible cause for Skype’s problems.
...
the update patches were not the cause of the disruption. In previous instances where a large number of supernodes in the P2P network were rebooted, other factors of a “perfect storm” had not been present. That is, there had not been such a combination of high usage load during supernode rebooting. As a result, P2P network resources were allocated efficiently and self-healing worked fast enough to overcome the challenge.
...
We’ve already introduced a number of improvements to our software to ensure our users will not be similarly affected – in the unlikely possibility of this combination of events recurring.
As much as I've quoted here, there's a lot more detail in the clarification post. If you're interested, check it out: Skype Heartbeat—The Microsoft connection clarified
