In 2017, I was on a two person team which scaled Strava’s most popular and critical infrastructure: Segment Leaderboards.
My colleague Jeff Pollard wrote a series of four blog posts after we retired the legacy service, providing a comprehensive and well written account of the project. He included the motivations behind the work, as well as a deep dive on each component.
The goal behind this blog post, and ones to follow, are to editorialize each of the posts with my commentary. I’ll be generously using blockqoutes to bring enough relevant content here, while respecting Jeff’s contributions and unique achievements.
The original blog is posted at: https://medium.com/strava-engineering/rebuilding-the-segment-leaderboards-infrastructure-part-1-background-13d8850c2e77
One of the earliest Strava concepts was that of a Segment: member-created and edited portions of road or trail where users can compete for time. When a Strava user rides, runs, swims, or in any way traverses a segment as part of an activity, during activity upload processing we calculate the time it took for them to traverse that segment. This segment “effort” is then placed on a “segment leaderboard”, ranked against all other efforts by all other athletes who have traversed that segment.
Not everything ran smoothly, though. Over time the Scala-powered Redis-backed service began to experience chronic failures, highlighting problems with the design of the system. A few strava.com outages were due to problems with the leaderboards service. By 2015, the general consensus was that work should focus on making it more stable, scalable, and easier to manage for the future.
Besides causing cascading issues with the infrastructure, a high number of user support complaints plagued the system. The underlying issue of these was the RPC, non retry-able nature of the requests coming into the system. Any timeout meant that an athlete’s effort was forever lost!
Operationally, we still retained the ability to add more shards to the system, but it was a very delicate and manual process involving needing to rebalance data. It was a top priority for the team that there should exist a period of time where nobody was working on the segment leaderboards system in any capacity – something that hadn’t ever happened in the company’s history.
Problem #1: Redis was a problematic choice for storage
With Redis, all data has to fit into memory. However, memory is quite expensive to scale. As our data set grew, it became obvious that scaling an infrastructure to hold the entire data set in memory was untenable. If we were to upgrade our machines to the next instance size up, we’d be paying north of $20,000/month for just data storage, even with a one year prepay of EC2 reserved instances.
Problem #2: Synchronizing leaderboard writes via RPC is hard
The leaderboard system is a distributed system, made up of many server and data instances, all handling requests in parallel. Requests which require writes generally followed the pattern of first reading leaderboard data, modifying the structure in memory, then writing it back to Redis. To prevent concurrent and/or conflicting updates, the Scala service employed very aggressive locking around most write operations. This locking obviously created write latency and computational overhead, but when combined with chunky shards and hot spots, it also created large amounts of lock contention. Sometimes, during high load, many requests would time out waiting for a lock.
Summary of Technical Challenges
The growth rate and size of data is quite large. Strava is growing, and many segments have now seen nearly 100,000 athletes attempt them. Popular segments have well north of half a million efforts. Redis node memory consumption grew 43% in the past 12 months.
Data accuracy and consistency is of the utmost importance. Our users care deeply about their leaderboard results. They often work hard and challenge themselves to get a better time on a particular segment, and if we fail to update a leaderboard with that new effort, we’ve disappointed them.
New efforts must be visible on leaderboards with low latency. The normal use case for our user is to upload an activity, and then go see how they fared on leaderboards within a few seconds of their upload completing.
Leaderboards have a strong reliance of synchronizing writes. For any effort write occurring, the system needs to A) retrieve all existing best efforts on the same segment, B) determine if the incoming effort is better than any of the existing efforts, and C) if so update the data store with the incoming new effort. The leaderboard system needs to be able to do all this atomically, without concern for other threads updating data at the same time.
The variety of leaderboards presents a challenge to the data store. We want users to be able to view complex multifaceted leaderboards — i.e, see who was the fastest female this year who is 55–64 years old, or which of my male, club members had the fastest time on the climb they all just rode today. This often means needing to return ordered, ranked, and sliceable (for pagination) sequences of efforts, sometimes made up of arbitrary sets of athletes determined at query time.
With these problems in mind, we started working in earnest late July of 2016.
Besides diving deep and obtaining mastery of technologies such as Apache Cassandra and Kafka, as well as greatly improving my Scala programming, I learned important lessons in focusing on solving critical pain points before all else, and the importance of rapid empirical learning. More on that in future posts!