This is a two-part blog post describing our journey to PaaS, back to IaaS, and why we chose Couchbase as our database along the way.
Being an early adopter has its perks, and I’m proud to say that Movere, as a technology company, jumped on the cloud bandwagon on Day One. Well, maybe Day Two, since Day One was spent fixing stuff and wondering why we moved off our “closet” servers in the first place!
One of the perks of being an early adopter is getting to try new technologies with very low overhead, such as Platform-as-a-Service (PaaS); some of you may have already tried offerings like SQL Azure or Amazon RDS. If you want to run the latest version of SQL Server or Postgres, and you don’t need advanced features like SQL Service Broker, PaaS sounds like the perfect solution. The problems with PaaS (beyond its steep cost) come later, unfortunately sometimes much later: by then you’re entrenched in and dependent on the technology, and you’ve likely already fired your on-staff DBA (by the way, don’t do that!!!).
Here’s the deal: with PaaS, you give up a fundamental aspect of the technology itself: control over the details. And, as with everything, the devil is in the details. Fine-tuning, availability, and scalability are out of your hands. A promise of a 99.9% SLA is great, but you lose control over when maintenance occurs, and it always seems to occur when you want it the least (speaking from experience). In exchange for this loss of control, you get a very low barrier to entry: you can try new technologies and get a taste of their capabilities with just a few clicks, which is great for rapid development, but not so great for production.
Here are two examples to illustrate my point:
PaaS on paper != PaaS in production
In 2016 we launched ARC, which stands for Actual Resource Consumption. Other blog posts describe what it does in detail, but think of ARC as high-volume, streaming performance data used by the vast majority of our customers who want to size their environments for the cloud. We need to process ARC data and present it back to our users, but also to aggregate it and run ML on it to fine-tune our recommendations over time. This means the data must not only be displayed on the website, but also stored in our Data Warehouse.
When we designed Movere ARC, one of the technologies we decided to use was Azure Data Factory. You can read more about it here, but in essence think of it as a data mover: from on-prem to cloud, from SQL to SQL, from BLOB to SQL, and vice versa. One of its most exciting features is its support for PolyBase, a SQL Data Warehouse feature that uses parallel jobs to bulk-load data at blazing speeds. To put things in perspective, Movere captures, on average, about 150 rows of data per ARCbeat (every five minutes) per server. So, if we have 100,000 servers ARC’ing concurrently, we generate roughly 15M rows per ARCbeat, or about 1,850 MB per hour. According to the specs, PolyBase should be able to copy that sort of volume in under two seconds “on paper.” Layer in the bootstrapping and other factors, such as the cloud data movement units assigned to the copy and the DWUs of the Data Warehouse, and you would still expect the time to increase only reasonably, right? The reality, however, is that those two seconds “on paper” turn into about 30 minutes.
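The back-of-envelope math above can be sketched as follows. The 150-rows-per-beat figure, the five-minute ARCbeat interval, and the 100,000-server count all come from the text; everything else is simple arithmetic:

```python
# Back-of-envelope sizing for ARC data volume, using the figures from the post.
ROWS_PER_BEAT_PER_SERVER = 150   # rows captured per server per ARCbeat
BEAT_INTERVAL_MINUTES = 5        # an ARCbeat fires every five minutes
SERVERS = 100_000                # concurrent servers ARC'ing

rows_per_beat = ROWS_PER_BEAT_PER_SERVER * SERVERS
beats_per_hour = 60 // BEAT_INTERVAL_MINUTES

print(f"rows per ARCbeat interval: {rows_per_beat:,}")  # 15,000,000
print(f"ARCbeat intervals per hour: {beats_per_hour}")  # 12
```

Each five-minute interval therefore produces 15M rows, and a fresh batch of them lands every five minutes, around the clock.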
Yes, reality can be a beast. What Azure Data Factory is not very good at is dealing with 100,000 small files stored on BLOB storage. The advertised performance would only be achievable if all this data were stored in a single large file (or very few contiguous large files). Can you learn this information easily online – before you buy? No.
Back at the drawing board, we devised a plan to build such large files ourselves, using a new service that we called the ARC Capacitor. Since each ARCbeat generates a single file, we store this data in a caching layer and, once every 10 minutes, flush the cache as a single file. This generates six files per hour that the Data Factory can read without the massive overhead of fetching a large number of small files. We had to do it quickly, so we built the new service using Azure Redis Cache, another PaaS offering, as the temporary storage for the ARC data. Results came back positive at first glance:
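The capacitor pattern described above can be sketched in a few lines. This is a minimal in-memory illustration, not Movere’s actual code: the real ARC Capacitor used a shared cache (Redis, later Couchbase) as its buffer and wrote the combined files to BLOB storage, and all names here are hypothetical:

```python
import time
from typing import List, Optional

class ArcCapacitor:
    """Buffers many small ARC payloads and flushes them as one large file.

    Minimal sketch of the pattern: accumulate small payloads in a cache
    layer, then periodically emit a single contiguous file so downstream
    readers (here, Azure Data Factory) avoid per-file overhead.
    """

    def __init__(self, flush_interval_s: float = 600.0):
        self.flush_interval_s = flush_interval_s  # 10 minutes in production
        self._buffer: List[bytes] = []
        self._last_flush = time.monotonic()

    def ingest(self, payload: bytes) -> None:
        """Store one small ARC payload in the caching layer."""
        self._buffer.append(payload)

    def maybe_flush(self) -> Optional[bytes]:
        """When the interval has elapsed, combine the buffered payloads
        into one large file and reset the buffer; otherwise do nothing."""
        if time.monotonic() - self._last_flush < self.flush_interval_s:
            return None
        combined = b"\n".join(self._buffer)  # one contiguous file, not 100,000 tiny ones
        self._buffer.clear()
        self._last_flush = time.monotonic()
        return combined
```

With a 10-minute interval, the flush fires six times per hour, which is exactly where the “six files per hour” figure comes from.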
We reduced the bulk copy time to Data Warehouse down to one minute. Huge win!
But how long does it take to generate each file? No matter how much we optimized for writes, the best read throughput we got from Azure Redis (and we used the Premium tier) was about 30 seconds per 15,000 ARC files, which was okay when we had fewer than 100,000 ARCbeats to deal with per hour. However, the throughput was not at all consistent; sometimes it went up to 1.5 minutes for a similar number of files:
What happens when we have a 5x or 10x increase in traffic? If a read takes 1.5 minutes, and we only have 10 minutes till the next read starts, what do we do?
This is where Couchbase came to the rescue. Unlike Redis, which is purely a key-value store, Couchbase has all the advantages of a modern cache (in-memory storage, fast writes, many client connections), but it also scales much better and can be optimized for reads.
We converted our File Transfer API, which handles the intake of ARC data, and our ARC Capacitor to support Couchbase in a matter of days and started testing. What we saw was impressive:
This shows that we are now reading over 15K ARC payloads in 2.5 seconds. Add the overhead of writing them to BLOB storage and we are looking at 5-6 seconds! How is that possible? Well, we added an index combining the three fields we filter on: documentType, payloadType, and timeStamp:
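In N1QL, a composite index over those three fields might look roughly like this. The index name, bucket name, and option values are assumptions for illustration; only the three key fields (and the replicated-index setup mentioned below) come from the post:

```sql
-- Hypothetical N1QL definition for the composite index described above.
-- Bucket name "arc" and index name are illustrative.
CREATE INDEX idx_arc_payloads
    ON `arc` (documentType, payloadType, timeStamp)
    WITH {"num_replica": 1};
```

Declaring a replica (or creating the index on a second index node) is what gives the index the high availability discussed next.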
Note that the index is replicated as well, which means we achieved high availability by adding a second index node to our cluster, in addition to our data nodes, which are also replicated.
This is a 5x performance increase on average over Azure Redis Cache, and 15x at peak load (I ran the same query hundreds of times and never got it to exceed three seconds to fetch 15K documents) – incredible!
In the second part of my blog, I’ll discuss our transition from SQL Azure to Couchbase in support of Movere’s Authentication Service and some of the benefits and lessons we learned from using XDCR.