Hailo, like many startups, started small; small enough that our offices were below deck on a boat in central London–the HMS President.
Working on a boat as a small, focused team, we built out our original apps and APIs using tried and tested technologies, including Java, PHP, MySQL and Redis, all running on Amazon's EC2 platform. We built two PHP APIs (one for our customers, and one for our drivers) and a Java backend which did the heavy lifting–real-time position tracking and geospatial searching.
After we launched in London, and then Dublin, we expanded from one continent to two, and then three; launching first in North America, and then in Asia. This posed a number of challenges–the main one being locality of customer data.
At this point we were running our infrastructure in one AWS region; if a customer used our app in London and then flew to Ireland, no problem–their data was still close enough.
Our customers flying to Osaka, however, created a more challenging problem, as the latency from Japan to Europe is too high to support a realtime experience. To give our customers and drivers the experience we wanted, we needed to place their data closer to them, with lower latency. For our drivers this was simpler: as they usually only hold a taxi licence in one city, we could home them to the nearest Amazon region. But for customers we needed their data to be accessible from multiple locations around the world.
To accomplish this we would need to make our customer-facing data available simultaneously from our three data centres. Eric Brewer's CAP theorem states that a distributed system cannot simultaneously guarantee Consistency, Availability and Partition Tolerance; at most two can be provided. Since a system spanning multiple regions must tolerate network partitions, Partition Tolerance could not be sacrificed, and as we wanted to optimise for availability as much as possible, the only option was to move to an eventually consistent data store. Some of our team had prior experience with Cassandra, and with its masterless architecture and excellent feature set, it was a logical choice for us. But this wasn't our only challenge–our APIs in particular were monolithic and complex, so we couldn't make a straight switch.
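The trade-off at the heart of this choice can be sketched numerically. In a Dynamo-style store such as Cassandra, consistency is tunable per operation: with N replicas, a read is guaranteed to overlap an acknowledged write only when the read and write replica counts sum to more than N. This is an illustrative sketch of that rule, not Hailo's code or the Cassandra driver API:

```java
// Sketch of tunable consistency in a Dynamo-style replicated store.
// A read overlaps the latest acknowledged write iff readAcks + writeAcks > replicas.
public class QuorumMath {
    // True if a read is guaranteed to see the latest acknowledged write.
    public static boolean readSeesLatestWrite(int replicas, int writeAcks, int readAcks) {
        return readAcks + writeAcks > replicas;
    }

    public static void main(String[] args) {
        // QUORUM writes + QUORUM reads over 3 replicas: reads see the latest write.
        System.out.println(readSeesLatestWrite(3, 2, 2)); // true
        // ONE write + ONE read: fast and available, but only eventually consistent.
        System.out.println(readSeesLatestWrite(3, 1, 1)); // false
    }
}
```

Choosing the weaker setting is exactly the availability-over-consistency trade described above: any single replica can serve the request, at the cost of possibly stale reads.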
Additionally we wanted to launch quickly, so we couldn't change everything in one go. But, as our drivers were based in a single city, we could take a shortcut and leave our driver-facing API largely unchanged; cloning the infrastructure for each city we launched in, and deploying it to the region closest to that city. This allowed us to continue expanding, and to defer refactoring this section until later.
As a result we refactored our customer-facing API, moving core functionality out into a number of stateless HTTP-based services, written in either PHP or Java, backed by Cassandra, and able to run in all three regions.
This was a big step forward: we could serve requests to customers with low latency, follow them as they moved between regions, and tolerate failures better–both by benefitting from Cassandra's masterless architecture, and by being able to route traffic to alternative regions when failures occurred.
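The failover half of that story is simple to sketch: order the regions by proximity to the caller and pick the first healthy one. The names and health check here are hypothetical, not Hailo's routing layer:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Sketch of latency-aware routing with failover: try the closest healthy
// region first, then fall back down the proximity order.
public class RegionRouter {
    public static Optional<String> pickRegion(List<String> regionsByProximity,
                                              Predicate<String> isHealthy) {
        return regionsByProximity.stream().filter(isHealthy).findFirst();
    }

    public static void main(String[] args) {
        List<String> regions = List.of("eu-west-1", "us-east-1", "ap-northeast-1");
        // Suppose the nearest region is unhealthy: traffic fails over to the next.
        Optional<String> target = pickRegion(regions, r -> !r.equals("eu-west-1"));
        System.out.println(target.orElse("none")); // us-east-1
    }
}
```

Because every region holds the same Cassandra-backed data, whichever region the request lands in can serve it.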
However, this was merely the first step on our journey.
“My God! It’s full of stars!”
Having dealt with one monolith, we now faced a long journey to deal with the next. In the process we had significantly improved the reliability and scalability of our customer-facing systems, but a number of areas were still causing us problems:
- Our driver-facing infrastructure was still deployed on a per-city basis, so expansion to new cities was complex, slow, and expensive.
- The per-city architecture had some single points of failure. Individually these were very reliable, but they were slow to fail over or recover when there was a problem.
- Compounding this, we lacked automation, so infrastructure builds and failovers usually required manual intervention.
- Our services were larger than perhaps they should have been, and were often tightly coupled. Crucially, while on the surface each provided a rough area of functionality, they didn't have clearly defined responsibilities. This meant that changes to features often required modifications to several components.
A good example of the final point is amending logic around our payment flow, which often required changes to both APIs, a PHP service, and a Java service, with a correspondingly complex deployment.
These difficulties made us realise we needed to radically shift the way we worked: to support the growth of our customer base and our engineering team, and to increase the speed of our product and feature development.
Working on a small number of large codebases meant that we had a lot of features in play at once, and this made scaling up our team difficult–communication, keeping track of branches, and testing them took up progressively more time (Brooks's Law in action). Some of these problems could perhaps have been solved with alternative development strategies, such as continuous integration into trunk with features flagged on and off, but fundamentally having a small number of projects made scaling harder. Increasing team size meant more people working on the same project, with a corresponding increase in communication overhead; and increasing traffic often meant the only option was to inefficiently scale whole applications when only one small section needed more capacity.
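The "flagging features on and off" strategy mentioned above can be sketched in a few lines: unfinished code paths merge into trunk but ship dark, toggled per city at runtime. This is a minimal illustration with hypothetical names, not a real flag system:

```java
import java.util.Map;
import java.util.Set;

// Minimal sketch of per-city feature flags for trunk-based development:
// a feature is live only in the cities it has been explicitly enabled for.
public class FeatureFlags {
    private final Map<String, Set<String>> enabledCitiesByFeature;

    public FeatureFlags(Map<String, Set<String>> enabledCitiesByFeature) {
        this.enabledCitiesByFeature = enabledCitiesByFeature;
    }

    public boolean isEnabled(String feature, String city) {
        return enabledCitiesByFeature.getOrDefault(feature, Set.of()).contains(city);
    }

    public static void main(String[] args) {
        FeatureFlags flags = new FeatureFlags(
                Map.of("newPaymentFlow", Set.of("London")));
        System.out.println(flags.isEnabled("newPaymentFlow", "London")); // true
        System.out.println(flags.isEnabled("newPaymentFlow", "Dublin")); // false
    }
}
```

Even with such toggles, every team still shares one deployable artifact, which is why the next step was breaking the projects apart.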
Our first forays into a service-oriented architecture had largely been a success, and based both on these and on the experiences of other companies such as Netflix and Twitter, we wanted to continue down this path. But with most of our developers lacking experience of the JVM (which would have allowed us to use parts of the brilliant Netflix OSS), we would need to experiment.
Now continue reading part two, A Journey into Microservices: A Cloudy Beginning.
Image credit: HMS President, Roger Marks