Archive for Internet

Another Surpass Hosting Outage?

It appears that Surpass Hosting has experienced yet another outage, on a weekend. Many customers, including Wright PC Consulting, noticed their servers lose connectivity just after 8AM CDT. Once users were able to reconnect to their servers, many of us noticed that they had been rebooted, some more than once.

So, we must ask again: “What happened at the Dime?” Customers at WebHostingTalk and at HostDime’s community forums reported that, once again, the voice lines were out during the network/data center outage as well. That makes sense because they use VoIP for their support lines.

While this incident was shorter than the last one, downtime is downtime. It means potential loss of profit and, even more damaging for many of us, harm to our reputation. Unlike the outage in May, where there was a pretty steady stream of information, so far all that we have heard this time is this from staff:

“Our facilities are still experiencing some intermittent connectivity issues at this time. Our engineers are on-site diligently working on this issue. We expect to give a more detailed report within the hour. We greatly apologize for the inconvenience this is causing, but please know that we are doing everything in our power to get everyone back to full operation. Thank you for your patience during this time.”

Hopefully, Surpass and DimeNOC will have more information on the outage and on what is being done to reduce the probability of it happening again. Explaining why customer servers have to be rebooted to fix a networking issue within the data center would be a good place to start.

07/14/2008 Update from Surpass/HostDime:

“Greetings to all,

I apologize for the delay in getting this up, per the power issues we ran into May 23 2008 where we mentioned that things were being done to be pro-active on ensuring no such outages occurred again, we had scheduled a planned UPS / Battery maintenance to replace / add some things to our systems. This was scheduled to be done Saturday July 12 2008 between 6AM EST and 10AM EST. There was no public announcement made about this because this maintenance was not to effect any clients or internal systems. However just in case, we planned it on a weekend during early morning hours in case something did go wrong we would be on standby and being the weekend the least possible impact it would have on clients.

Shortly before 10AM est when the maintenance was about to be completed there was what we thought at the time a DNS network wide issue which we thought effected some servers in the data center. We noticed the call volume increase and immediately began to investigate the issue. We advised staff taking calls to let everyone know it was an intermittent connection issue as we didn’t know what the real cause was just yet. Shortly after that we noticed that the EATON UPS maintenance staff had completed the maintenance and attempted to move our entire DC Floor load back to the UPS. The whole data center runs off 3 phases, when the Eaton UPS staff switched the load to the UPS one of the 3 phases failed to fully switch and caused over 33% of the servers in the data center power cycle. It was so sudden and so fast we initially thought it was a DNS issue and thus why it was said initially as a possible network issue when in reality it was one of the 3 phases failing to fully switch over to the UPS. During this time there was about 2-3 intermittent loss of power to the 3rd phase of the load thus why some of you saw your servers go on and off more then once. We immediately went to the Eaton Power UPS staff and advised what happened as this maintenance was not to suppose cause any service interruption to any clients. One of the tasks of the maintenance was to replace the capacitors in the system. The UPS system has over 50 of these capacitors which provide power via 3 electrical phases to the entire NOC, he immediately admitted that one of those capacitors was not wired correctly from the factory thus why 33% of the DC load power cycled on and off during attempts to move the load to the UPS as it shorted it out the phase.

They immediately corrected it and shortly after that the entire NOC power load was fully running on the UPS. A very very mis fortunate situation and we have scheduled a call with EATON today to get an explanation from them of how something this important was not wired correctly from factory. Eaton is a public company http://www.eaton.com/ and we are shocked how this happened and are determined to get to the bottom of it. We had an entire team of 25 staff on standby in case anything went wrong and immediately got 98% of systems online within 1-2 hours.

The situation is still unacceptable under no circumstances but out of our control. Our culture is built upon integrity and everything stated is 100% true. The only good news that can come out of this is that all the needed maintenance/changes/upgrades to our power systems is completed 100% and we are looking forward to the next 5 years ++ of uptime
We take downtime seriously and have acted pro-actively to ensure you get 100% uptime each and every single month. Services have remained 100% since then and are expected to remain 100% from here on out.

If there is anything we can do for you please just reach out to us and we will make it possible. We are determined to make your business relationship with us something you can count on with peace of mind. We put our entire soul and mind into this company with our entire 100+ staff team with a desire to give you the best level of service and support. We are at your disposal at anytime.”

__________________
Emmanuel :: Surpass Hosting Network Admin
http://www.SurpassHosting.com
Manny
Sys Admin HostDime



Why off site backups for hosts?

While most web hosting providers, including mine, have a provision in their AUP or TOS that says that you, the consumer, are responsible for backing up your data, I believe that as a service provider I owe it to my customers to do all that I can to protect their data. Backing up customer data both internally and off site is really the only way to fully mitigate potential disaster, as the unfortunate folks who run their businesses out of The Planet’s H1 data center in Houston, Texas discovered when a large explosion in the electrical room took the center offline for more than a week. While The Planet staff from the top down worked 24/7 to get the center back online, the update page tells the story of how difficult a task they had before them. During the downtime I followed the frustration of many whose data was held in powerless servers with no off site backup and who were left without a web presence, email and the other services that make any hosting service’s world turn.

So, what does Wright PC Consulting, LLC do to protect consumer data? Here is our current backup formula. While not perfect, it provides a pretty reasonable level of safety.

  1. At approx. 3AM the cPanel backup script, cpbackup, runs an incremental backup of all customers’ home directories and cPanel configuration files and saves them to a dedicated backup drive where they are archived and rotated monthly.
  2. When cpbackup has completed its backup, it starts a second script that “mirrors” the backup drive and sends it to a data center in another state through a secure encrypted tunnel using the rsync protocol.
  3. Monthly customer data is burned to DVD and stored in a third, secure location.
  4. For added security and by special arrangement, we will also mirror a customer’s email off site every hour.

These steps have proven not only to ease customer anxiety but also to make restoring data, from a single file to an entire account, easy.
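
For readers who want to try something like step 2, here is a minimal Python sketch of an off site mirror. It is not the actual script used here. It assumes rsync is run over SSH as the encrypted tunnel, and the paths, host name and key file are hypothetical placeholders. The function names build_rsync_command and mirror_backup are made up for illustration.

```python
import shlex
import subprocess


def build_rsync_command(source_dir, remote_host, remote_dir, ssh_key=None, dry_run=False):
    """Build an rsync command line that mirrors source_dir to remote_host over SSH.

    Assumption: SSH provides the encrypted tunnel. All arguments are placeholders.
    """
    ssh_cmd = ["ssh"]
    if ssh_key:
        ssh_cmd += ["-i", ssh_key]  # use a dedicated key for unattended runs

    cmd = [
        "rsync",
        "-a",        # archive mode: keep permissions, times, symlinks
        "-z",        # compress data in transit
        "--delete",  # drop remote files that no longer exist locally (a true mirror)
        "-e", shlex.join(ssh_cmd),  # run the transfer through SSH
    ]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without copying

    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{remote_host}:{remote_dir}"]
    return cmd


def mirror_backup(source_dir, remote_host, remote_dir, **options):
    """Run the mirror; raises CalledProcessError if rsync exits non-zero."""
    cmd = build_rsync_command(source_dir, remote_host, remote_dir, **options)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical values; print the command instead of running it.
    print(build_rsync_command("/backup", "backup@offsite.example.com", "/mirror/",
                              ssh_key="/root/.ssh/id_backup", dry_run=True))
```

Because --delete also removes files on the remote side, a mirror like this is not a replacement for the rotated archives in step 1. It only protects against losing the backup drive or the data center itself.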


HostDime Power Failure Explanation

Update: More downtime for Surpass/HostDime.

Early in the morning on Friday, May 23, 2008, power was lost at the HostDime data center that houses Wright PC Consulting, LLC’s leased server. That alone should not be a big deal, since there are both a UPS and a generator for such occasions, but what followed was. For some it meant 16 hours or more of downtime.

Here is what happened and what is being done to prevent such future failures.

To our clients and business partners,

There are no words to describe how deeply we apologize about the downtime which occurred on Friday, May 23, 2008. The incident has created immense discontentment to our organization mentally and emotionally because of the love and dedication our team has to our entire community. Moreover, because we realize the level of damage this incident has potentially caused you. We know there is neither money nor words which will replace the losses that may have been experienced by each one of you. Our organization is forever in debt to you all for the frustration and grief endured. It is never easy in disasters, but many of you showed your support as we worked non-stop to get things back to normal. We want to thank all of you for your patience, understanding, and support during such a difficult time. In any case, a formal incident report of our investigation is what we wish to rightfully deliver to you. Below is the detailed summary of events as they occurred. Please note some of you may have not experience any outage during this, not all clients were effected but we wanted to keep everyone updated.

What happened:

At approximately 8 A.M. EST our data center experienced a surge followed by a power outage which lasted several minutes from our electrical utility provider Progressive Energy. The surge tripped our facility’s main breaker; this main breaker is designed to have a certain level of sensitivity and to trip in the event of a severe surge in order to protect the load (servers and critical equipment) from being burned. Immediately after this occurred, our generator automatically started up within a few seconds. Meanwhile power to our load (servers and equipment) was automatically transitioned from unavailable raw power to generator power by the automatic transfer switch (ATS), our uninterruptible power system (UPS) in conjunction with our battery set supply is supposed to automatically sustain continuous power to the load. However, it appeared this did not happen. In any case, generator power was indeed immediately available within the minute of the outage.

Immediately post the outage our engineers and electricians came on site. The diagnosis conducted revealed there was a fault within a battery string which is connected to the UPS. It is this fault that disabled the UPS from being able to fully sustain continuous power to the load meanwhile the ATS transitioned the facility to the generator power lines from the raw power lines. During this time a great portion of the data center experienced a sudden power loss which caused a myriad of servers to power cycle. Unfortunately, at times when some systems experience sudden power loss some require manual administrator intervention to get full function restored. Post the outage, our team immediately started working on checking systems and all servers that may have been adversely affected by the sudden power loss these experienced.

What was done to correct the problem:

Our on call UPS maintenance technician along with our electricians and engineers immediately came together on site to conduct a thorough diagnosis and put together a plan of action to correct any and all possible issues.

While the age of the battery supply being employed was well within the manufacturer’s life span expectancy, the entire battery supply was replaced with a new set. In addition, our UPS underwent a thorough in depth inspection and all critical components were individually inspected and reconditioned as necessary. Lastly, the batteries and UPS were load tested before being re-employed to the overall power back up system to ensure 100% reliability. All this was completed within several hours of the incident.

Who was affected:

The power outage experienced was intermittent. However, once power was fully restored to the facility many servers required file system checks (FSCK), some power supply replacements, and a few others hard drive replacements due to excessive I/O errors. Unfortunately, depending on the space on the drive the system occupies a FSCK run time can range from 30 minutes to a nine hours plus (approximately 200 servers counted). Those that were worst affected are the systems that were having excessive I/O errors and needed hard drive replacements (approximately 12 servers total counted). Again, unfortunately, hard drive replacements may take 4-12 hours plus to complete depending on the space being occupied on the drive. Those that were least affected were servers that only required a power supply replacement (approximately 60 servers counted).

For those servers that experienced the greatest downtime was not due directly to power unavailability, but rather due to post sudden power loss adverse effects described above.

What preventative measures are being taken:

All critical power systems in our data center and loads were previously and are regularly inspected and maintained. This includes generator, UPS, breakers, etc. In fact, our UPS underwent an inspection and a maintenance service on the week of the 12th of May 2008. The service report came back showing the UPS was in good working condition as well as the battery supply set. The only advice made was to consider replacement of the battery set supply as these were approaching the last year of the manufacturer’s life span expectancy. Pro actively following up on the advice made by the maintenance engineer, a new battery supply set was ordered right away and scheduled to be installed this Tuesday May 27, 2008.

Unfortunately, the battery supply set is what ended up being the fault and ironically this is what was already schedule for routine replacement maintenance. It is difficult to state that more could have been done as the batteries were within their life expectancy limits but failed short during this situation. Something of this magnitude, unfortunately, could not be predicted and was already being addressed with a new battery supply set replacement as a proactive measure. Nonetheless, a new standard has now been adopted as we will be increasing the battery reliability tests schedule to be completed monthly. This will allow us to intercept any and all types of possible issues with any battery sooner and overall highly reducing the probability of a failure encounter during critical times.

Our data center employs a 500KVA UPS and a 500KW generator. This is a statement that can be further proved by the recent pictures and videos taken yesterday afternoon. If you are in any kind of doubt whatsoever with regards to this, we would like to kindly ask for the opportunity to disprove your doubt. The pictures and videos below are of our backup systems in place which have protected us from several past outages to the entire data center. We uncover what maybe some of you didn’t know was in place in our facility since day one so you can see that your services with us are secure.

We have been in the industry close to 8 years now and we have always tried our best to ensure 100% uptime to all of you. This is the first outage we experienced with this level of severity in our entire existence. It is not only our job but our passion to give you the best level of service possible. We do not want to use the misfortune of this unpredictable situation to be an excuse for the downtime experienced. Despite the nature of the situation, we accept full responsibility for the outage and we are ready to compensate you in anyway we can. We value your business relationship and the level of trust you put in us. We know many of you will have a desire to cancel with us due to the losses you have incurred and question our systems’ integrity. We ask you to please talk to someone in management before you make your decision as we do understand the level of importance this means to each one of you. We work in high a high volatile environment where anything can happen just like with any of our competitors, however, we will always, no matter what, promise to be here whenever any issue occurs with an open hand to help resolve it as fast as humanly possible. Misfortunes will always happen to the best of us, how they are handled and treated makes the difference. If there is anything at all we can do to help you minimize your losses please just ask and consider it done. Our awareness and commitment level has tripled as a company and you can ensure this has only made us stronger and more experienced as a company. It is not everyday people or companies can overcome such issues and have the support and loyalty that many of you have given us. If you wish to reach out to me personally with any concerns, recommendations, suggestions, venting, or ways we can compensate you, please email me personally at e.v @ hostdime.com. I will be happy to talk to you in person.

__________________
Emmanuel, CTO
Surpass Hosting

While the outage was annoying at the time, I truly admire a company that not only accepts responsibility for what happens but also doesn’t point the finger at suppliers, contractors and everyone else. Way to go HostDime/Surpass!



Friday, November 21, 2008