Jun 2013

We experienced a major network outage the evening of Saturday, June 1, which our redundant network failed to handle automatically as it should have. Service was restored at approximately 11:30 pm MST (June 2 0630 UTC), but the cause of the outage has not yet been fully determined. We have tentatively scheduled further testing for Sunday night around midnight, when traffic is lowest (Monday June 3 0700 UTC). This testing may result in a small number of interruptions of under 1 minute each around that time. We apologize for this interruption of service; more details will follow.

I am concerned about the server outage - when we had all the problems a year or so back, we were advised that the server backup had been improved / upgraded.



Were you trying to implement system upgrades on Saturday evening?



Being down for 4+ hours means significant lost sales.



This is an automated message from SiteUptime.



The system confirmed 8 failed checks at 30 minute intervals starting at June 1, 2013 19:32:12



Alert Type: Site is Available

Result: Ok

Time: June 2, 2013 01:30:08

HostName/URL: www.e-junkie.com

Monitor Name: E-Junkie

Hopefully E-Junkie does not experience another outage, but if there is another outage, I think we can all agree that better communication via email, Twitter, etc. during the outage would be greatly appreciated. My customers were trying to make purchases last night and all I could tell them is "E-Junkie is down. I will try contacting someone there." I had no explanation to give them as to why their checkout cart was not working, just that it was not working. I've been with E-Junkie for several years and am very pleased with the service. However, this outage occurred at a very critical time for my customers so I have to question E-Junkie's reliability. What is E-Junkie planning to do to ensure this does not happen again?

Still no explanation as to why you didn't or couldn't let us know during the outage! I can't afford that kind of service again.

The outage occurred after office hours, so this morning was the first I heard of it myself. From what Development has said so far, the outage last night was a "perfect storm" of multiple unlikely failures converging into an even more unlikely combination.



In brief, it appears that our main network uplink failed, and the redundant automatic fallback to a different uplink also didn't work properly, so our system remained up but became unreachable from the Web. This sort of connection outage would normally have been resolved within the hour at most had our sysadmin been aware of it. Unfortunately, they weren't until hours into the event: they hadn't received notification of the outage from our automated uptime-monitoring services, and internal staff communications about the outage in progress were incommensurate with the urgency of the situation. Once the matter finally came to our sysadmin's attention, the outage was quickly resolved.



Suffice it to say for now, the reasons for each of those failures are being investigated and addressed internally to prevent a recurrence of any one of them, let alone all of them concurrently as happened last night.

We use E-Junkie because we need to send out product instantly. E-Junkie should have staff watching the website 24/7. It is totally unacceptable that E-Junkie did not notice the website was down (even http://www.e-junkie.com itself was down) until 4+ hours later. In addition, there was no communication whatsoever during that time.



I do not believe the "perfect storm" at all. Those are simple management errors that can be easily prevented.

I suggest someone at e-junkie sign up for SiteUptime.



I was made aware within 30 minutes that your website was down



(PS - it's free)



Also, there should be a backup - I did tweet you guys, but no one is watching that either.

This may be a case of "success breeding complacency", as it's been such a long time since our own systems were responsible for any outage that our vigilance may have become unduly relaxed. Rest assured this event has shaken us up, and we are reviewing our internal outage communication and remediation policies and systems accordingly to ensure we won't drop the ball again in the future.



We do have systems monitoring our uptime and availability 24/7, which should have notified our sysadmin of this outage immediately. We have identified the reason this notification was not received in this case and corrected that issue, so that lapse should not recur. One of our staff noticed the outage within about half an hour after it started, but they merely sent an email to notify the rest of us. Of course, they should have contacted our sysadmin personally, directly and persistently by every means available until the sysadmin acknowledged that the outage was being attended to.
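For anyone curious what "contact by every means available until acknowledged" might look like if automated, here is a minimal sketch in Python. It is only an illustration, not a description of our actual tooling: the function name `escalate_until_acknowledged`, its parameters, and the idea of passing channels in as plain callables are all assumptions made for the example.

```python
from typing import Callable, Sequence


def escalate_until_acknowledged(
    channels: Sequence[Callable[[str], None]],
    is_acknowledged: Callable[[], bool],
    message: str,
    max_rounds: int = 3,
    wait: Callable[[], None] = lambda: None,
) -> bool:
    """Send `message` over every channel, round after round, until acknowledged.

    `channels` are hypothetical notifier callables (email, SMS, phone call, ...).
    `is_acknowledged` is a hypothetical check that someone has taken ownership.
    `wait` is called between rounds; real use would sleep for a few minutes.
    Returns True once acknowledged, False if every round went unanswered.
    """
    for _ in range(max_rounds):
        for notify in channels:
            notify(message)
            # Stop as soon as anyone confirms they are handling the outage.
            if is_acknowledged():
                return True
        wait()
    return False
```

The point of the design is that one unanswered email is never the end of the process: every channel is tried in turn, and a `False` result signals that a human needs to escalate further.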



As for the actual cause of the outage, tests we performed last night confirmed that all of our own internal systems and hardware were and are properly configured and fully operational, yet the failover to a fallback uplink still could not be triggered successfully. All of this indicates that a networking issue at our colocation datacenter, outside of our direct control, was likely responsible for the outage. We have filed a trouble ticket with them to get the matter investigated and resolved, as this is the second time their "redundant" networking provisions have actually caused an outage for us, which is two more times than we're aware of them ever having prevented one.