Downtime on 15 April - a Saturday... we know, and we are sorry

Carlos Butler
4 minute read

It happened…

ORDR is a very stable product that, since its launch, has had far above-average uptime metrics.

What that means is that in about 3 years, we’ve only ever experienced a combined downtime of around 20 minutes. We need not bring up how annoyed we all get when WhatsApp, Instagram and Facebook aren’t working - and that can unfortunately last a few hours.

We of course do not wish to compare or point fingers, but simply to explain that keeping these large, multifaceted systems going takes a lot of work. In the case of ORDR, that means things like:

  • customer ordering
  • customer payments
  • waiter ordering
  • kitchen display system
  • printers
  • payment terminals

Numerous systems with different responsibilities.

Transparency

We have already been in communication with the majority of the restaurants affected during Saturday’s period of instability.

The next section is a more detailed, engineer-focused technical explanation of why it happened. To summarise it here: we had a huge number of users accessing the entirety of ORDR’s numerous systems in one go.

It was unprecedented for us to have so many thousands of users over such a short time window. Of course, this is one of those “good problems” to have, as it shows ORDR is growing its client base, meaning we can deliver more and more features to all restaurants and bars in the coming months.

Nonetheless, we have put in place an alert (yes, our Slack chat windows turn red!) whereby if our systems go above 70% usage for a few minutes, we automatically upscale all servers to ensure that everyone gets the snappy ORDR interface they are used to.
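For those curious what that looks like in practice, here is a minimal sketch of such an alarm using boto3 and CloudWatch. The alarm name, region, dimensions and SNS topic are illustrative stand-ins, not our exact configuration.

```python
# Hypothetical sketch: raise an alarm when average CPU stays above 70%
# for five consecutive 1-minute periods. The alarm publishes to an SNS
# topic, which in turn notifies Slack and kicks off the upscale.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

cloudwatch.put_metric_alarm(
    AlarmName="ordr-app-cpu-above-70",           # illustrative name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "ordr-app"}],
    Statistic="Average",
    Period=60,                                    # 1-minute samples
    EvaluationPeriods=5,                          # "for a few minutes"
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ordr-alerts"],
)
```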

Boiling point

As one can imagine, when a database starts to slow down, everything connected to it suffers. It is common practice to have one database instance, replicated for backup, but that instance can host multiple databases. As such, we have ORDR’s main database, plus supplementary ones such as printing and payments on the same instance.
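To make that layout concrete, here is a minimal Python (psycopg2) sketch of one Postgres instance hosting several logical databases; the hostname, database names and credentials are made up for illustration.

```python
# Hypothetical sketch: one RDS Postgres instance hosting several logical
# databases. Only the dbname differs; CPU, memory and I/O on the instance
# are shared by all of them, which is why one busy database slows the rest.
import os

import psycopg2

INSTANCE_HOST = "ordr.xxxxxxxx.eu-west-1.rds.amazonaws.com"  # illustrative

def connect(dbname: str):
    return psycopg2.connect(
        host=INSTANCE_HOST,
        dbname=dbname,
        user="ordr",
        password=os.environ["PGPASSWORD"],
    )

main = connect("ordr_main")          # customer ordering, waiters, KDS
printing = connect("ordr_printing")  # cloud printing
payments = connect("ordr_payments")  # payment processing
```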

Now, what ended up happening was that we started creeping over the usage threshold that AWS (Amazon Web Services) deems acceptable. What AWS then does, in order not to affect other companies using AWS, is reduce your computing power - by a lot.
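To our reading, this matches the burst-credit model of AWS’s burstable (t-class) instances: once the credit balance runs out, the instance is throttled back to its baseline. Assuming that is the mechanism, here is a minimal sketch of watching the relevant CloudWatch metric (the instance identifier is illustrative).

```python
# Hypothetical sketch: read the remaining CPU credit balance for a
# burstable RDS instance. When this hits zero the instance is throttled
# to its baseline performance - our assumption for the sudden, large drop
# in computing power described above. (Instance name is illustrative.)
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "ordr-production"}],
    StartTime=now - timedelta(minutes=30),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```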

So, not only did we have more users than expected, with that number increasing every minute as we were in peak hours (19:30 - 20:30), but each user was getting a gradually worse and worse response.

Thus, with the DB at over 100% capacity, all background tasks, such as the Sidekiq background jobs that connect to the DB or the Stripe webhooks we receive (and process in the background), started taking around 5000% longer (multiple seconds).
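For readers unfamiliar with this pattern: webhooks are acknowledged immediately and the real database work is deferred to a background worker, which is why a slow database shows up as an ever-growing queue rather than outright failed requests. In production this is Sidekiq backed by Redis on the Ruby side; the little Flask app and in-process queue below are only stand-ins to show the shape.

```python
# Hypothetical sketch of the "acknowledge fast, process later" shape used
# for Stripe webhooks. In production this is Sidekiq backed by Redis;
# here a plain in-process queue and worker thread stand in for it.
import queue
import threading
import time

from flask import Flask, request

app = Flask(__name__)
jobs: "queue.Queue[dict]" = queue.Queue()

@app.route("/webhooks/stripe", methods=["POST"])
def stripe_webhook():
    # Return 200 quickly so Stripe does not retry; the DB work happens later.
    jobs.put(request.get_json(force=True))
    return "", 200

def worker() -> None:
    while True:
        event = jobs.get()
        # In production this step talks to the database. With the DB at over
        # 100% capacity, this is the part that slowed down dramatically, so
        # the queue simply kept growing.
        time.sleep(0.05)  # placeholder for the real processing
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

if __name__ == "__main__":
    app.run(port=8000)
```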

Scale up!

We are proud that ORDR is, in general, one large monolithic application. We do not like micro-services for the sake of it. With the exception of printing, which sees thousands and thousands of requests per minute (that is how cloud printing works), and payments, which needs to be sectioned off for compliance, we will never break up ORDR.

Therefore, scale up. Simple as that.

And that’s the same approach for our database. As connectivity was already so flaky, we took the decision to accept a few minutes of 100% downtime and change our AWS PSQL-RDS to a much larger instance type.
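For the technically inclined, resizing an RDS instance comes down to a single API call. A minimal boto3 sketch, with an illustrative identifier and target size; applying it immediately is what causes the brief full outage rather than waiting for the maintenance window.

```python
# Hypothetical sketch: move the database to a larger instance class.
# ApplyImmediately=True triggers the short, full outage we chose to
# accept; otherwise the change would wait for the maintenance window.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.modify_db_instance(
    DBInstanceIdentifier="ordr-production",   # illustrative name
    DBInstanceClass="db.m5.2xlarge",          # illustrative target size
    ApplyImmediately=True,
)
```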

Aftermath

It still took around 20 minutes for the entire system to restabilise, which was quite interesting (at least for the engineers). This is because we rely heavily on background processing to keep the user experience fast and snappy.

The thousands of background jobs that had been futilely running and piling up started succeeding, and the DB’s CPU usage came back down.
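Here is a minimal sketch of how one can watch that backlog drain, assuming Sidekiq’s default Redis layout where every queue is a Redis list named queue:<name>; the queue names below are illustrative.

```python
# Hypothetical sketch: poll the length of Sidekiq's Redis-backed queues
# to watch the backlog drain after the database came back to health.
# Sidekiq stores each queue as a Redis list under "queue:<name>".
import time

import redis

r = redis.Redis(host="localhost", port=6379)

QUEUES = ["default", "webhooks", "printing"]  # illustrative queue names

while True:
    depths = {q: r.llen(f"queue:{q}") for q in QUEUES}
    print(depths)
    if sum(depths.values()) == 0:
        break
    time.sleep(10)
```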

Unfortunately, this is part of distributed online computing. Most modern systems end up relying on so many other external systems that when one of the ten or fifteen services any company might use goes down, it affects everything.

We learn. We grow.

Carlos, co-founder.