RM-RF#5 Laurent from Front

"4 years ago, we got locked out of production data for 6 hours"

Dec 21, 2021

👋 Welcome to this issue of RM-RF, where you get to hear about crazy dev war stories from inspiring tech leaders.

This time I interviewed Laurent Perrin, co-founder and CTO of Front, a communication hub powering extraordinary customer experiences. Over his nearly 10-year journey at the company, Laurent has been through some pretty crazy stuff.

🕰 For how long have you been working on Front?

We started working on the project back in 2013 with Mathilde and quickly moved to SF to join Y Combinator in 2014. Two years later we raised a $10M Series A and have been growing fast ever since. Our latest fundraise was in 2020, with participation from leaders of the Future of Work industry (note: Front counts Zoom’s CEO among its investors, how cool is that).

🏉 How big is your team today?

Today our team counts about 320 people. We initially took a small hit during the covid period, but things rapidly bounced back as companies adapted to new work realities, which Front supports. This period proved our resilience and enabled us to confidently grow the team afterwards.

In the R&D team, we’ve got about 90 people. Most of them (70) are engineers, and we’ve also got a product & design team as well as a pure research team. Each team has its own leader. Interestingly, none of these leaders formally reports to me (note: this setup is also that of Dharmesh Shah, HubSpot CTO). I strive to be a partner rather than a manager. My role is hence focused on driving the overall vision and being the custodian of the product and tech architectures. I’m also the voice of tech and product on the Board, to ensure we set realistic expectations for engineering (Laurent referred here to the Boeing board, which lacked engineers, something that led to the issues with the 737 Max).

🛠 Describe your tech stack in a nutshell?

Our back end is a Node.js app with a constellation of small services (150) that exchange messages using queues. It’s noteworthy that our stack has remained based on Node since the early days, without major shifts. On the front end, we use React + Redux.

Our infra runs on AWS with Kubernetes, leveraging AWS’s proprietary SQS as well as common tech such as MySQL, Redis and Elasticsearch. Each AWS region has its own independent infra stack.

We have split our main DB into 25 shards, partitioned by customer. We initially ran into a few issues linked to shard rebalancing occurring at bad times (e.g. during high user activity). This led us to create a “stop button” to prevent rebalancing in set timeframes.
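
A “stop button” of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: the directory map, the `BlackoutWindow` shape and the `tryRebalance` function are hypothetical names, and the interview does not describe how Front actually implements it.

```typescript
// Hypothetical sketch: all names and the window format are assumptions,
// not Front's actual implementation.

const SHARD_COUNT = 25; // the interview mentions 25 shards

// A daily period (in UTC hours) during which rebalancing is forbidden.
interface BlackoutWindow {
  startHourUtc: number; // inclusive, 0-23
  endHourUtc: number; // exclusive; may be smaller than start to wrap past midnight
}

// True if `now` falls inside at least one blackout window.
function isInBlackout(now: Date, windows: BlackoutWindow[]): boolean {
  const hour = now.getUTCHours();
  return windows.some((w) =>
    w.startHourUtc <= w.endHourUtc
      ? hour >= w.startHourUtc && hour < w.endHourUtc
      : hour >= w.startHourUtc || hour < w.endHourUtc
  );
}

// Moves a customer to another shard unless the "stop button" is active.
// Returns true if the move happened, false if it was blocked.
function tryRebalance(
  directory: Map<string, number>,
  customerId: string,
  targetShard: number,
  now: Date,
  windows: BlackoutWindow[]
): boolean {
  if (!Number.isInteger(targetShard) || targetShard < 0 || targetShard >= SHARD_COUNT) {
    throw new RangeError("unknown shard: " + targetShard);
  }
  if (isInBlackout(now, windows)) {
    return false;
  }
  directory.set(customerId, targetShard);
  return true;
}

// Usage: block rebalancing between 14:00 and 22:00 UTC (an assumed peak period).
const directory = new Map<string, number>([["acme", 3]]);
const peakHours: BlackoutWindow[] = [{ startHourUtc: 14, endHourUtc: 22 }];
const duringPeak = tryRebalance(directory, "acme", 7, new Date(Date.UTC(2021, 11, 21, 15)), peakHours);
// duringPeak is false and "acme" stays on shard 3
```

Keeping the check in one function means every rebalancing path has to go through it, which is the point of a stop button.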

🐛 What is the most elusive bug you and your team ever had to fix?

We ran into a seemingly trivial but very frustrating bug related to our inbox message counters (notification dots). Basically, the counters got out of sync with the actual message count, leading users to believe they had unread messages when they had none (note: this was also a hard problem for Facebook back in the day, which ultimately pushed them toward the Flux architecture and React).

This was a complex real-time events issue. Counters are influenced by a plethora of events, with up to 200,000 updates per second at our scale. Even if we get it right in 99.999% of cases, it still means we generate 1-2 errors per second.
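
The arithmetic behind that figure is easy to check. The snippet below is a back-of-the-envelope sketch; the function name is made up for illustration, and the rate is expressed as failures per million updates to keep the numbers exact.

```typescript
// Expected number of wrong counter updates per second.
function errorsPerSecond(updatesPerSecond: number, failuresPerMillion: number): number {
  return (updatesPerSecond * failuresPerMillion) / 1000000;
}

// 99.999% correct means 10 failures per million updates.
const atPeak = errorsPerSecond(200000, 10); // 2 errors per second at 200,000 updates/s
const perDayAtPeak = atPeak * 86400; // 172,800 errors per day if peak load were sustained
```

The peak rate gives 2 errors per second; at lower traffic the same failure rate lands in the 1-2 range quoted above.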

We initially tried to raise awareness within the teams of the need to pay extra attention to counter consistency. But as we grew the team, we had to resort to more automated means. So we ended up carpet-testing counters across the whole test suite with something behaving like a global after hook.
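
To give an idea of what such a hook might look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Inbox` model, the tiny `runTest` runner standing in for a real test framework, and the function names. The interview does not describe Front's actual test setup.

```typescript
// Hypothetical in-memory model of an inbox; the real data model is not described in the interview.
interface Inbox {
  messages: { id: string; unread: boolean }[];
  unreadCounter: number; // denormalized value shown as the notification dot
}

// Recompute the count from the messages themselves and compare it to the stored counter.
function assertCounterConsistent(name: string, inbox: Inbox): void {
  const actual = inbox.messages.filter((m) => m.unread).length;
  if (actual !== inbox.unreadCounter) {
    throw new Error(`${name}: counter says ${inbox.unreadCounter}, but ${actual} messages are unread`);
  }
}

// Tiny stand-in for a test framework's global "after each" hooks.
const globalAfterHooks: Array<() => void> = [];

// Runs a test body, then every global hook. Returns null on success or the failure message.
function runTest(body: () => void): string | null {
  try {
    body();
    for (const hook of globalAfterHooks) {
      hook(); // runs after every test, whatever the test was about
    }
    return null;
  } catch (err) {
    return err instanceof Error ? err.message : String(err);
  }
}

// Register the counter check once, for the whole suite.
const inboxes = new Map<string, Inbox>();
globalAfterHooks.push(() => {
  inboxes.forEach((inbox, name) => assertCounterConsistent(name, inbox));
});

// A test about something else entirely still fails if it leaves a counter out of sync.
const failure = runTest(() => {
  inboxes.set("support", { messages: [{ id: "m1", unread: false }], unreadCounter: 1 });
});
// failure is "support: counter says 1, but 0 messages are unread"
```

The appeal of this design is that nobody has to remember to write counter assertions: any test that touches messages gets the check for free.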

😱 What is the most stressful tech situation you ever faced?

In 2018, we shifted our infra to Terraform, meaning we had to start with a clean-up of the legacy “manual” infra. At some point during the deploy, Terraform identified a remaining KMS key (encryption key hosted on AWS) which had not been wiped out yet, and prompted the developer to do so.

The developer deleted the key and everything went smoothly for the next 2 hours, because AWS applies a grace period before actually destroying the key… It was only once the key was gone that we realised it was the one encrypting our production database.

This was really the worst event we ever had, as basically no one could access any data anymore. We were very lucky that we had an odd replica of the DB in another region with its KMS key untouched, purely because someone had been paranoid at some point and thought it’d be a good idea to have this redundancy.

We went through 6 hours of total downtime. Now, each shard comes with its own replica. We contacted AWS about this event and they have since improved the KMS experience so that it cannot happen anymore (note: AWS now clearly marks when a key is directly linked to an AWS-managed resource such as a DB, with no deletion possible). We all sympathised with the developer who went through this adventure because of how misleading the UI was at the time.
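
The general idea of a safety check before deleting a key can be sketched as follows. This is not how AWS or Front implement it: the inventory format, the key identifiers and the `checkKeyDeletion` function are hypothetical, and in practice the inventory would have to be built from your cloud provider's APIs or your infrastructure-as-code state.

```typescript
// Hypothetical inventory record describing a resource encrypted with a given key.
interface EncryptedResource {
  name: string;
  kmsKeyId: string;
}

interface DeletionDecision {
  allowed: boolean;
  blockedBy: string[]; // names of resources still encrypted with the key
}

// Refuse to delete a key while any known resource still depends on it.
function checkKeyDeletion(keyId: string, inventory: EncryptedResource[]): DeletionDecision {
  const blockedBy = inventory.filter((r) => r.kmsKeyId === keyId).map((r) => r.name);
  return { allowed: blockedBy.length === 0, blockedBy };
}

// Usage with made-up identifiers.
const inventory: EncryptedResource[] = [
  { name: "production-db", kmsKeyId: "key-legacy" },
  { name: "audit-logs", kmsKeyId: "key-logs" },
];
const decision = checkKeyDeletion("key-legacy", inventory);
// decision.allowed is false and decision.blockedBy is ["production-db"]
```

A check like this is only as good as the inventory behind it, which is why a replica encrypted with a separate key remains a useful last line of defence.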

✌ What's your best piece of learning on these topics?

Throughout my career, I’ve learned to shape clear expectations for the engineering teams. Before Front, I got trapped in a project with unrealistic technical expectations from the outset. Because of that I worked so hard I literally ended up in the hospital, and the project ultimately didn’t deliver meaningful value. But I came back stronger and much more assertive about tech objectives.

⭐ Finally, are you hiring any developers these days?

We’ve got more than 30 open positions at the moment. Roles are based in Paris and San Francisco, or remote.

Thanks a ton to Laurent for this interview 🙏. He’s a passionate tech leader with interesting takes on the role of the CTO in a fast-growing scale-up. If you’d like to work with Laurent, join Front’s team here. See you in our next issue of RM-RF with Jonathan Alzetta from Wooclap.
