Being Multi-Cloud with Cloudflare
This blog post details our motivations during our recent work with Cloudflare on multi-cloud infrastructure. For details on how things work on their side, see their post here.
At Billforward, we believe that your vision, rather than the limitations of your billing platform, should drive your product development. That’s why we pride ourselves on building a billing solution that does the work for you – and why one of our major goals for infrastructure is to remain flexible. Rather than relying on features that tie us down to one platform, we’ve made a concerted effort to operate in a way that allows us to move freely in the future and stay ahead of the needs of our customers.
This goal has seen us deploy infrastructure in self-hosted OpenStack environments, in Google Cloud Platform (GCP) and in Amazon Web Services (AWS). This has come with some trade-offs, such as spending time becoming familiar with multiple environments and investing engineering time in solutions that ultimately go unused. On the whole, we believe this is worth it: it increases our confidence that we have made the right choice, and should we want to make changes in the future, we have already gained the knowledge needed to make any transition as smooth as possible.
There are a number of benefits to be realised from being able to move between cloud providers at will. One of the major factors for us, and for any company running large amounts of infrastructure in the public cloud, is cost. The large cloud providers are well aware of this, as evidenced by the recent introduction of per-second billing for AWS EC2 and EBS (New – Per-Second Billing for EC2 Instances and EBS Volumes), which was followed eight days later by the exact same feature in GCP (Extending per second billing in Google Cloud).
In addition to cost, being cloud-agnostic allows us to improve reliability by engineering for failure from day one and to avoid vendor lock-in by ensuring that we find ways to run our workload without features that are unique to specific cloud providers. Of course, this still leaves us free to use a cloud-specific feature where it would be a mistake not to do so – the trade-off between AWS Redshift, GCP BigQuery and rolling our own data warehouse is a prime example of this.
Abiding by our goal of remaining flexible, our application is built around a message queue (RabbitMQ) and is divided into a number of micro-services, each of which runs nicely in a Docker container.
Using a message queue allows us to scale our consumers and producers up and down as required, and our choice to divide work into small, atomic chunks means that we can very easily pick up and move workloads without concern for data integrity or duplicated effort.
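To sketch why small, atomic chunks make workloads portable: if each message is a self-describing task with a stable ID, a consumer can be stopped in one cloud and restarted in another, because any redelivered message is detected and skipped. The worker below is a minimal, hypothetical illustration (the task names and in-memory dedup set are invented for this post; in production the "seen" set would live in shared storage, and messages would arrive from the broker rather than a local call):

```python
import json

class IdempotentWorker:
    """Processes small, atomic work items; safe to re-run on redelivery."""

    def __init__(self):
        self.seen = set()    # hypothetical: in production, a shared store
        self.results = []

    def handle(self, message: str) -> bool:
        """Return True if the task was processed, False if it was a duplicate."""
        task = json.loads(message)
        if task["id"] in self.seen:
            return False     # duplicate delivery: skip the work, ack anyway
        self.seen.add(task["id"])
        self.results.append(task["payload"])
        return True

worker = IdempotentWorker()
worker.handle('{"id": "inv-1", "payload": "rate plan A"}')
worker.handle('{"id": "inv-1", "payload": "rate plan A"}')  # redelivered copy
```

Because processing is idempotent, a message that is consumed twice (for example, after moving consumers between environments mid-flight) does its work exactly once.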
The magic of containers means that we can very easily move our workload to a server running practically anywhere and, with Docker, on an ever-growing choice of operating systems. Combined with our architectural decisions, Docker gives us the added benefit that the environment on a developer’s laptop is very close to the environment running in production, further speeding up development and our deployment pipeline.
The ability to run our core workload anywhere solves a few of our problems. It means that we can still get business-critical tasks performed during a widespread outage in one or more Availability Zones, or even across cloud providers. It means that we can take advantage of offers and incentives to run our infrastructure in one environment rather than another. It means we can use cloud providers for scalability but still run services in cheaper dedicated infrastructure where that makes sense.
The unsolved part of this puzzle was how to get these benefits on our consumer facing environments. We expose our functionality through a REST API and a web application and our business depends on keeping these available and highly responsive – both goals that can be aided by the benefits of a multi-cloud environment.
Our early attempt at a solution to this revolved around DNS traffic shaping, primarily using Route 53. This, combined with short TTLs, allowed us to balance load between cloud providers and to provide some element of redundancy between environments.
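Conceptually, a weighted DNS policy answers each resolution with one endpoint, chosen in proportion to its weight. The simulation below is a rough sketch of that behaviour (the endpoint names and the 60/40 split are invented for illustration, not our real records):

```python
import random

# Hypothetical records mirroring a Route 53-style weighted policy:
# each endpoint receives traffic proportional to weight / sum(weights).
ENDPOINTS = {"aws-us-east": 60, "gcp-us-central": 40}

def resolve(rng: random.Random) -> str:
    """Pick an endpoint the way a weighted DNS answer would."""
    names = list(ENDPOINTS)
    weights = [ENDPOINTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Simulate 1,000 independent lookups and measure the split.
rng = random.Random(42)
picks = [resolve(rng) for _ in range(1000)]
share = picks.count("aws-us-east") / len(picks)  # roughly 0.6
```

The catch, as the next paragraph describes, is that real resolvers cache answers, so the observed split can drift far from the configured weights.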
We quickly discovered that although DNS solved most of our problems, it could not give us the flexibility or redundancy we required. DNS caching, the inability to guarantee that TTLs are honoured, and the fact that our customers often sit behind a single caching DNS server all meant that there was no way for us to control our traffic to the extent that we desired.
The obvious solution for us was to run a globally distributed front-end proxy layer, allowing us to distribute traffic based on performance, availability and geographical location, and to take advantage of a single point of entry to protect against attacks or DDoS attempts.
We quickly discovered that Cloudflare had done this already. Today, we’re taking advantage of their global infrastructure to allow us to serve our core application to consumers the way that we want to, and to get the full benefit of all the multi-cloud flexibility we have developed so far.
As a result, our architecture from the user’s perspective now works as follows:
Our infrastructure sits in both GCP and AWS (and, should we choose, any other location capable of running it) and Cloudflare handles all the abstraction of presenting this to the user seamlessly.
For more detail on exactly how things work from Cloudflare’s side, take a look at their blog post here.