Unplanned Downtime: A Message from Our CTO

Yesterday, many of you, our valued customers, encountered issues accessing our products. I want to sincerely apologise to you on behalf of the entire team, as we take any disruption to your service more seriously than anything else.

I also want to be completely transparent about what happened and assure you that we will do everything possible to ensure this never happens again.

At 4.20pm GMT we performed what we would normally classify as a routine database table update on our primary database. We required a new field on a table that holds information about each specific Teamwork.com account. To minimize impact on the live platform while a database update is performed, we use a tool that performs the change on a copy of the table. When the update is complete, the old table is replaced by the temporary modified table. Unfortunately, a step failed while replacing the original table, a failure we had not planned for.
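For readers unfamiliar with this copy-and-swap pattern, here is a minimal sketch of the general idea. It is not the tool or schema we use: the `add_column_via_copy` helper, the `accounts` table and the `plan` column are all hypothetical, and SQLite is used only because it ships with Python and allows schema changes inside a transaction. On engines without transactional schema changes, the swap step would need a different safeguard.

```python
import sqlite3


def add_column_via_copy(conn, table, column_ddl):
    """Add a column by modifying a copy of the table, then swapping it in.

    Hypothetical helper for illustration. The connection must be in
    autocommit mode (isolation_level=None) so that the explicit
    BEGIN/COMMIT/ROLLBACK statements below control the transaction.
    """
    shadow = f"{table}_new"   # modified copy
    retired = f"{table}_old"  # original, kept after the swap
    conn.execute("BEGIN")
    try:
        # Build the copy and apply the schema change to it, not the original.
        conn.execute(f"CREATE TABLE {shadow} AS SELECT * FROM {table}")
        conn.execute(f"ALTER TABLE {shadow} ADD COLUMN {column_ddl}")

        # Verify the copy before touching the original table.
        before = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        after = conn.execute(f"SELECT COUNT(*) FROM {shadow}").fetchone()[0]
        if before != after:
            raise RuntimeError(f"row count mismatch: {before} != {after}")

        # Swap: both renames commit together or not at all.
        conn.execute(f"ALTER TABLE {table} RENAME TO {retired}")
        conn.execute(f"ALTER TABLE {shadow} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        # Any failed step leaves the original table exactly as it was.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO accounts (name) VALUES ('example')")
    add_column_via_copy(conn, "accounts", "plan TEXT DEFAULT 'free'")
    print(conn.execute("SELECT * FROM accounts").fetchall())
```

The point of the sketch is the ordering: the copy is checked before the swap, and a failure at any step rolls back rather than leaving the live table missing or half-replaced.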

This table is used by the User Login service, and the issue left users with an “account not found” message. Once we had identified the problem, we immediately set about restoring the table to get you back into your accounts.

Based on an initial analysis, a mixture of human and process error was to blame. We disrupted the majority of our customers for 45 minutes but, critically, left many customers unable to access their Teamwork.com sites for up to 3 hours while we worked on restoring the records for all accounts from our RDS backup.

We have immediately updated some of our internal processes to ensure this does not happen again. We will also carry out a full audit of events to ensure that all possible learnings are taken from the incident. The messages we got from you again brought home how critical it is that we deliver maximum uptime to support your projects, and we will increase our efforts to meet and exceed your availability expectations.

To all our customers who were affected by this incident, please accept my most sincere apology.

Daniel Mackey, CTO


26 Comments

Deb Mason

Thank you for the transparency! Though it may have slowed us down a bit, the impact was very slight in our case, and only required a minimum of patience. I wanted to call out a couple of very good things that came from this situation. Firstly, this transparent description of what went wrong, as well as the dedication to learn from this, is very refreshing. Secondly, your customer service gets an A+! I was greeted on the phone with calm, informed confidence, a sincere explanation of issues you were having and the timing to fix was spot on. Kudos to your staff for clearly communicating the issue to those who would actually greet your clients. Job well done!

Daniel Mackey

Cheers Deb! We’re very proud of our support team – not just in crisis mode but every day – I’ll make sure to pass your comments on! Thank you for your kind words and patience.

Dan.

Kelly Iriye

Thank you for your transparency, Daniel. I’m the head of the app-development department and I can relate wholeheartedly to yesterday’s unfortunate event. Please know that I am still singing your praises to my colleagues and friends and have faith in Teamwork’s skills to provide top-of-the-line services.
Technology hiccups happen and the best we can do is learn from them. Thanks again for your post and I do hope your other clients were as understanding.
Take care,
Kelly

Daniel Mackey

Thanks Kelly! The whole team take it personally when we disappoint our customers so we’ll definitely improve because of this.
Dan.

John Morales

Thank you for the post, Daniel. We were down for almost 4 hours but your support team was very good with communication to us about expectations.

I have expressed this to them, but would like to point out that the ping response on your https://uptime.teamwork.com site is not very helpful in describing the status of app availability. The best understanding of uptime would be a percentage of all application instances reporting as available and a threshold (.5%, 1%, whatever) definition of “up”. This would have been a very clear description of the event and recovery yesterday.

The threshold is necessary because constant 100% uptime of all instances is not reasonable, as I’m sure there are legitimate reasons for an instance being down at any given time.
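A minimal sketch of the metric described above, assuming each application instance reports a simple up/down flag. The function names and the 1% default threshold are hypothetical, chosen only to illustrate the idea:

```python
def availability(instance_up):
    """Return the percentage of instances currently reporting as available."""
    if not instance_up:
        raise ValueError("no instances reported")
    return 100.0 * sum(instance_up) / len(instance_up)


def platform_is_up(instance_up, max_down_percent=1.0):
    """Treat the platform as "up" while the share of down instances
    stays at or below the threshold (assumed here to be 1%)."""
    return 100.0 - availability(instance_up) <= max_down_percent
```

With 199 of 200 instances available, `availability` gives 99.5 and the platform counts as up under a 1% threshold. With 10 of 100 instances down, it does not.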

Daniel Mackey

Hi John,

Again – I’m terribly sorry it took 4 hours to get your account back up and running. The communication side was something that came out of our investigation. While our support staff worked really hard to answer everyone in a timely fashion we could have done a better job communicating the issue, the resolution and an expected recovery time if we had the correct communication tool in place.

In yesterday’s unplanned event the core platform was running, but the application could not get the data from the database to service logins. Because of this, uptime.teamwork.com didn’t show an accurate picture of the platform being unavailable.

Today we made it our priority to get a Status Page for the platform online, separate from our own infrastructure. If an event happens again in the future, this new Status Page will let our support and sys-op staff keep people informed in a timely fashion and have all incident-related communication in a single place. We’re going to extend this new Status Page with more insightful metrics on the platform’s health. The new status page can be found at http://status.teamwork.com/

Thanks again for your patience and understanding,
Dan.

Gustavo Pontin

Thank you for your candor! We were really concerned about the downtime, since we’d just transferred all our activities to Teamwork. But kudos to the well-prepared support team and this straightforward explanation.

Keep at it!

Daniel Mackey

Thanks Gustavo! We were disappointed we let this happen as we’ve worked hard in the last year to improve the availability and speed of the platform.

Dan.

Nicole Nestel

Hi Daniel,

At least you didn’t send out an emergency alert to 1.5 million people that a nuclear ballistic missile was incoming (which actually happened to us on Saturday).

But seriously, Hawaii’s Emergency Management Agency could take some serious direction from you guys on how to handle an issue, how to explain what happened, mitigate the problem and resolve it for the future.

Aloha,
Nicole

Daniel Mackey

Thanks Nicole – I can’t even begin to imagine what receiving that message was like!

Tom Creasey

I absolutely love how transparent and active your support teams are. A lot of companies, when faced with unplanned downtime, do not post anything on social media, even though for a lot of people this is the first port of call to see if there is an issue.

It was great to see that there was a message on Twitter when I looked, and it is great to read this blog post to not only know what the problem was but how you are actively learning from mistakes and instantly reacting to them.

Within days you are already implementing a better status page.

I wish all companies were like this. Keep up the great work you do.

Daniel Mackey

Thanks Tom! We’ve made the new Status Page a priority project to ensure that we can give the best possible answers for any future unplanned issues (and scheduled maintenance).

Dan.

Ali Khalil

Hey, Dan and everyone on the team!

As many have commented, thank you for your transparency in highlighting the issues at hand and thank you for the team’s humility in deciding to learn from such an incident.

Sometimes, unfortunate events like this one allow both sides, the provider, and the customers, to understand the importance of both parties in such a SaaS relationship.

My team is spread across 14 different time zones around the world, and we were surely spread thin during the outage; however, it allowed us to bond on a personal level in different ways. It actually allowed us to take a break, breathe and be thankful for all the uptime hours you guys were giving us!

On another note, I would like to emphasize the importance of the support team. I wanted to call in to make sure that ‘we didn’t miss our payment’ or anything of that sort, so I did. I was greeted with a calm voice who waited for me to finish my sentences before sharing an apology and informing me that the team is working on it. GOOD JOB! GREAT support team, makes me want to be part of this startup! 😀

Finally, allow me to suggest one ‘tiny’ thing. It would’ve been interesting for the end user to read a message a bit different than ‘your account is not found’. Knowing the error was there, it would’ve been nice to change the wording of that landing page to reflect the nature of the problem.

Once more, thank you Dan and the team for all the hard work you’ve put into it, and the preventive measures you are taking to avoid such an issue in the future.

Have a blessed day,
Ali Khalil – Lebanon

Daniel Mackey

Thanks for your kind words Ali! We’re very proud of our support team and I’ll make sure your comments are shared with them 🙂

I actually changed the message on the “account not found” screen immediately after we restored access, as it was only when people highlighted it to us that we stepped back and realized it could have been better. One finding of the audit on this incident is that we failed to follow our process, which was to get a proper message in front of customers specifying the exact issue. The team focused on restoring access as step 1 instead of first communicating that there was an issue and that we were on it. This is going to be fixed going forward.

Thanks again Ali!
Dan.

Tricia Havas

I wanted to add – in case anyone new to Teamwork is reading – downtime on Teamwork is not a frequent occurrence by any stretch of the imagination. In the 2 years we’ve been using the system I think this is the first outage they’ve had.

The support team is always excellent. Friendly, fast and knowledgeable.

Please keep up the good work!

Daniel Mackey

Cheers Tricia! I really appreciate your comment on our platform status over the last few years as it’s something we’ve really put a lot of resources towards.

Dan.

Natalie

Thank you for the explanation. We were down for about an hour. I think our greatest frustration was the constant busy signal on the U.S. support line. I eventually received a return message through the form submission, but I would have appreciated the ability to call.

Daniel Mackey

Hi Natalie,

I understand your frustration and we’ve taken steps to improve this area too. The lines got slammed and we’re looking at increasing the capacity of our support number. The other communication tools we’re putting in place will also help keep customers informed when an unplanned incident affects the ability to use our platform.

Dan.

Ken Lewis

Dan,

Thank you for explaining what happened. However, you can’t have an issue like this and not get any critical feedback from your customers. With that said, you really need to replace your Gravatar with one where you are smiling and looking like you love your job (which otherwise seems obvious).

In all seriousness, the handling of the unexpected was an amazing job by your entire team!

Having been the “human” part of an “issue” or two before, I am more than familiar with that feeling your team likely had when it was obvious that things didn’t go as planned and I know how hard it can be to work through. That feeling where you feel like your gut is being turned inside-out, followed quickly by the feeling of all the blood rushing from your head and extremities, right before your career flashes before your eyes can make it quite difficult to respond as efficiently as was done. I’m making an educated guess that your team kept their cool (as much as humanly possible) and was able to explain what happened and what they did, thereby enabling your team to keep the downtime at 45 minutes to 3 hours instead of what could have turned into a much longer recovery plagued by more errors from “reacting” instead of “thinking it through”.

As far as procedure errors are concerned, as you know, the only procedures that are infallible (if there is such a thing) are written in blood, or in our world, downtime. I bet that procedure was not just fixed with the lessons from this event, but was also reviewed from top to bottom to see what other things were missed (even if there weren’t any). It’s events like this that give even your most junior team member a notch in their belt towards being the next seasoned veteran team member. I bet they will remember this day forever and will be more cautious in the future than they ever thought was possible.

So, there’s my “silver lining” rant. Thanks for the openness! Now, about your Gravatar…

Cheers,
Ken

Daniel Mackey

Ha! Thanks Ken 🙂 That was my happy face…

Seriously, thanks for your kind words. You summed up perfectly the feeling we each had when we realised the impact of the issue. Keeping a cool head and focusing on getting the app back up was our number one priority.

Dan.

Leonid

Communication, support and overall company behavior like this are exemplary and exceptional.

Great product, great support and great team that really cares.

Keep up the good job!!

David

Great response Daniel. Open, factual and professional, and much appreciated. Sometimes these things happen and your approach is robust, which creates confidence. Unfortunate that the impact was high but, given the honest communication, I’ve no doubt the correct action will be followed through.

Praise also to the support team for coming together to work the problem and do what they could to help. Great job folks!

Anna

Thanks for helping me keep my team calm! Your customer service response time (even in a “crisis” situation) is exemplary. Great product and great support.
