
Choosing the right partition key for cost and performance with Azure Cosmos DB

February 22, 2020


(introductory music) Hello and welcome. Please enjoy this short readiness video on Azure Cosmos DB. Hi Deborah, how are you doing today? Hi Sanjay, great to be here. Great, it's always fun to talk to you. So before we start, tell us a bit about your role at Microsoft. Sure, my name is Deborah and I am a program manager on the Cosmos DB team. We are focused on the developer experience, so this covers the SDKs, the portal, and the Notebooks experience. I see. So what are you going to cover today? Today we are going to be covering partitioning, which is one of the core topics you need to know in order to be successful when you are new to Cosmos DB, along with data modeling, which Thomas presented. So today we will cover the basics and fundamentals of partitioning: why it's so important for getting good scale, cost, and performance on Cosmos DB, and how choosing the right partition key, whether it's for a read-heavy or a write-heavy workload, has a huge impact on the success of your workload. Okay, let's get started then. Great. So there are some fundamentals, but before we get to those, I really want to talk about why we partition in the first place. Yes. There's a one-word answer, which is scale. With modern applications and modern workloads, your data cannot fit on one machine, and even if it could, scaling that is a nightmare because you have to keep buying bigger and better machines and keep upgrading. In contrast to this old scale-up model, what Cosmos DB does is scale out. We do horizontal scaling, which means we take all your data and distribute it among a number of physical machines, so each machine is responsible for a subset of the data. As an example, we have a bunch of people from product teams, Cosmos DB, Postgres, and each of us is responsible for a certain part of the product. In the same way, with partitioning, each machine is responsible for a subset of the data. I see. Okay.
So in a retail example you would have billions of rows, and you would partition that across different machines. Yeah. Okay. And one common question we get is: partitioning is important, but when does it become super critical? We find that for customers with a smaller workload, say 50 GB of data or 50,000 RU/s of scale or less, you should still follow the best practices, but you won't really see the true impact until you get to terabytes of data and hundreds of physical partitions, like some of our largest customers. I see. So then how do you think about the keys, the partition keys? I think that is the most important decision, right? Yeah. So before you can even choose a partition key, and I'll show some demos of what goes wrong when you choose a good one or a bad one, there are some fundamentals you should know. I promise there are only two definitions. The first one is the logical partition: all the data with the same partition key value is in the same logical partition. So if we took everyone who presented today and partitioned by product team, I am on Cosmos DB, Thomas is on Cosmos DB, and other folks are on Postgres. Everyone who has Cosmos DB as their partition key value for product team is in the same logical partition, so all that data can be grouped together. Same with Postgres and any other product. In contrast, a physical partition is an actual SSD-backed piece of compute, hardware, and storage that stores the data. We take all the logical partitions and distribute them among a small number of physical partitions. From the user's perspective, you define one partition key per container. All right. So if I have one million logical partitions, do I have that many, one million, physical partitions as well? That's a really common question we get from people who are new to this.
One of the best practices you will see later is to have really high cardinality among values, so having one million logical partitions is actually good. Then you might ask, are you going to provision one million machines? That sounds really expensive. The answer is no. What we do is take all the logical partitions, whether you have ten or a million, and place them on a small number of physical machines. The way we choose the number of physical machines is a heuristic based on how big your storage is. Obviously if you need more storage, you need more machines, and if you need a lot of RUs, you need more scale and more machines. But you will typically have a lot of logical partitions on a smaller number of physical partitions. I see, all right, let's keep going. Cool. So here, just to make it really visual for those who like to learn visually, is a diagram that shows what's happening behind the scenes in Cosmos DB. I told you all the data is distributed among the physical partitions, but how does Cosmos DB know which machine to put your data on? That is based on the partition key. If you look at this diagram, think of the purple box as the physical partition, the actual storage, and the blue boxes as the logical partitions. So for the logical partition key value, that's the ABC123, you might have multiple documents with that same value, and all data in the same logical partition is also in the same physical partition. Of course. Then you have another document with a different partition key value; it gets hashed, and that determines which machine it belongs to. The great thing for our customers is that as your data size increases, or as you need more scale, we automatically add more partitions for you, so you don't have to worry about manually sharding the data yourself. Wow, fantastic! Okay. So shall we start the demo? Yeah, that's great. So now let's get into the demo.
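That hash-based routing can be sketched in a few lines of Python. This is an illustrative simulation, not the actual algorithm Cosmos DB uses internally (the service maps ranges of the hash space to physical partitions), but the effect is the same: one logical partition always lands on one physical partition, and many logical partitions spread over a few physical ones.

```python
import hashlib

def physical_partition_for(partition_key_value: str, num_physical_partitions: int) -> int:
    """Hash a partition key value to pick a physical partition.
    Illustrative only -- the real service maps hash *ranges* to partitions."""
    digest = hashlib.md5(partition_key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_physical_partitions

# All documents with the same partition key value (one logical partition)
# always land on the same physical partition.
assert physical_partition_for("ABC123", 5) == physical_partition_for("ABC123", 5)

# Many logical partitions still spread over just a few physical ones.
placement = {f"user{i}": physical_partition_for(f"user{i}", 5) for i in range(1000)}
print(sorted(set(placement.values())))
```

The key point the sketch shows: the placement is a pure function of the partition key value, which is exactly why a query that doesn't supply that value has to check every physical partition.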
There are really two key kinds of workloads we see. One is the read-heavy workload, so let's take a look at how you can choose a good partition key for a read-heavy workload, and what might go wrong if we don't choose a good partition key. Here is some data; it's just product reviews that users write for a retail website. Here is a sample one: someone reviewed a fleece jacket. Right, and we have a username, and in this workload a common thing you would want to do is show all the reviews written by a user. What's the scenario here? It's a retail scenario where you buy things on a site and users are also writing product reviews. Okay, got it. So imagine you log on to this website and you see all the reviews that were written by this one person. Yeah. Which would be a very simple SQL query, so let's actually go ahead and write that query ourselves: SELECT * FROM c WHERE c.username = "Kurt28", let's pick this one. We run the query and see we get around 881 results, and in the query stats we see this query consumed around 123 RUs, which is a lot. If you wanted to run this query a hundred times a second, that's over 12,000 RUs per second. Right. 123 RUs for one query, especially for a common one, is going to be kind of expensive. So the first thing you want to do, as you can probably guess, is check your partitioning strategy. By the way, I should have mentioned that this scenario is read-heavy because, if you think about a retail website, people read a lot more product reviews than they write product reviews. Of course. Right. So you want to optimize the cost of your queries and key-value lookups. So here, let's take a look inside the settings: we partitioned by ID. But if you look at the query, we are filtering by username. Yes. So what are we actually doing here? Remember that purple-box diagram I showed you.
We are actually going through every single physical partition, spending one or two RUs on each to check the index: do you have this data for this username? Do you have this data? If you have a small workload, checking five partitions for data is not a big deal. But if you have hundreds or thousands of partitions and terabytes of data, that overhead is going to add up. So what we can do instead is use a partition key that we can use as a filter in most of our queries. Most of our queries on this retail website are going to filter by username. So here I have the same collection, the same exact data; the only difference is I have now partitioned by username. Okay. So let's run the same query here. It's going to be partitioned by username, right? That's what you are doing? Yeah. All right, and let's see how many RUs this consumed. This one only consumed 40 RUs. Instead of? 123. So the same query costs 123 on V1 and 40 on V2. That's a big difference. That's a huge difference. If I want to run this query 100 times a second, this one would only need 4,000 RU/s versus over 12,000 RU/s, which is about three times cheaper in cost, like monetary cost. So you really see that with the same data and the same workload, choosing the right partition key makes a huge difference in the end. Absolutely. Thank you for that great example. And by the way, this advice for the read-heavy workload applies to anything read-heavy: user profiles, product reviews, product catalogs, with common partition keys like product ID, user ID, et cetera. So besides that, of course, what else could go wrong if you choose a bad key? Yeah, high RU charges are definitely one symptom. Sometimes people say, "Oh, my query seems to be going kind of slow." RUs and latency are kind of correlated: if you are spending time checking each partition, that's adding more time. So those are common symptoms.
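The back-of-the-envelope math behind that comparison is worth writing down. Using the RU charges measured in the demo (123 RUs for the cross-partition query, 40 RUs for the single-partition one) and the hypothetical load of 100 queries per second:

```python
# RU charges observed in the demo for the same query against the same data.
cross_partition_ru = 123    # container partitioned by id: query fans out to every partition
single_partition_ru = 40    # container partitioned by username: query hits one partition

queries_per_second = 100

# Throughput (RU/s) you would have to provision to sustain that query rate.
cross_cost = cross_partition_ru * queries_per_second
single_cost = single_partition_ru * queries_per_second

print(cross_cost)    # 12300
print(single_cost)   # 4000
print(round(cross_cost / single_cost, 1))  # roughly 3x cheaper with the right key
```

Since provisioned RU/s translates directly into the monthly bill, the roughly 3x ratio in RUs is also roughly a 3x ratio in cost.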
You should check your partition key first and make sure you are able to use it as the filter. I see. And anything else, in terms of IoT scenarios? Yeah, that's actually the next one. When we see a customer's workload, typically it is either read-heavy, like that retail scenario, or write-heavy. Write-heavy means you've got tons of data coming into Cosmos DB at once. Examples are IoT telemetry: IoT may have many different devices pushing data once a second, even multiple times a second, so you are really pushing lots of data in at once. So imagine a scenario like a retail store: customers are coming in, and there are IoT sensors in the store sensing people in real time. That's probably what that could be, right? Yeah, it could be that, it could be vehicles on a road, it could be devices in a factory, it could even be users on a website with everything they click being tracked. The point is the data is coming in very fast. Yes. So what I have done here, switching over to my VM, is simulate a really write-heavy workload on Cosmos DB, to show how to choose a good partition key and a bad partition key. The scenario I have chosen today is vehicle telemetry, IoT data, but it could really be any of those scenarios. Of course. So we have a car with lots of sensors, so lots of information, and you can see we've got properties like event name, description, vehicle identification number (VIN), timestamp, date, and region. Any of these could potentially be a partition key. So let's choose a few, measure the performance, and see what we get. To simulate a write-heavy workload, I have created two collections here, one partitioned by date and one partitioned by VIN and date, and each one I have provisioned with 50,000 RU/s.
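The expected write ceiling for that setup is simple division. A quick sanity check, using the rough per-write charge quoted in the demo for these small documents:

```python
provisioned_ru_per_second = 50_000
ru_per_write = 5   # small documents cost roughly 5-7 RUs each to insert

# Best-case write throughput if the RUs are fully utilized across partitions.
max_writes_per_second = provisioned_ru_per_second // ru_per_write
print(max_writes_per_second)  # 10000
```

That 10,000 writes/s figure is the ceiling; whether you actually reach it depends entirely on whether the partition key spreads the writes evenly, which is what the demo tests next.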
These are pretty small documents, so we should be able to get 50,000 RU/s worth of writes per second, which is around 10,000-ish writes per second assuming 5 to 7 RUs per write. So let's actually run this. I have the bulk executor library, which is optimized for sending lots of data at once, so we are just going to simulate pushing a lot of data into Cosmos DB. Now, one common thing you might want to do is say, let's partition by date, because we'll query on it a lot, and see what happens. Yes. We are going to run it and time how long it takes to do batches of 10,000 records, and see how many RUs we are effectively able to get. If you think about date, what do you think might go wrong? Because within a day, you know, there could be an interesting pattern across dates, like a heavy sale happening on one day. Yeah, and even on just a regular day, we have a lot of cars on the road and they are all pushing data, and today's date is going to be the same for every vehicle. Yeah. So all of them are going to have the same partition key value, which means they all hash to the same machine. They are all hammering that one machine, and all the other machines you have that could serve requests are not getting any requests, which means you are not fully utilizing the RUs, since each physical partition has an equal share of the total RUs. I see. That is what we call the hot-partition problem. As you can see here, we finished, but we only got effective throughput of around 7,000 RUs per second even though we provisioned 50,000. The reason is that we had a very low cardinality partition key, to the point where there is only one distinct value for date, since I'm generating today's date. Okay. So in contrast, what you probably want to do for write-heavy workloads is choose something with high cardinality, a high number of distinct values. Okay, like VINs. Yeah, exactly, like the VIN.
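The hot-partition problem can be simulated directly. This sketch (again an illustrative hash placement, not Cosmos DB's real algorithm) contrasts the two keys from the demo: partitioning by date sends every write of the day to one bucket, while partitioning by VIN spreads the same writes across all of them.

```python
import hashlib
from collections import Counter
from datetime import date

def partition_of(key: str, num_partitions: int = 10) -> int:
    # Illustrative hash-based placement, not Cosmos DB's actual scheme.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % num_partitions

# 10,000 writes, one per vehicle, all arriving on the same day.
vins = [f"VIN{i:05d}" for i in range(10_000)]
today = date.today().isoformat()

# Bad key: every write carries the same date value -> one hot partition.
by_date = Counter(partition_of(today) for _ in vins)
print(len(by_date))  # 1 -- all 10,000 writes hammer a single partition

# Good key: 10,000 distinct VINs -> writes spread across the partitions,
# so the provisioned RUs on every machine actually get used.
by_vin = Counter(partition_of(vin) for vin in vins)
print(len(by_vin))
```

Since each physical partition gets an equal share of the provisioned RUs, the date key caps you at roughly one partition's worth of throughput, which matches the 7,000 out of 50,000 RU/s observed in the demo.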
In our case we are going to make it even a bit more distinct: we are going to use the VIN plus the current date. Okay. The reason we do that is that for IoT workloads that are extremely write-heavy, Cosmos DB has a 10 GB constraint on logical partition key values. So if you have one VIN, say 123, you might actually fill up 10 GB of data for vehicle 123 if you are writing, say, ten documents a second for an entire year. By making a synthetic key that combines the VIN with the date, we get more granularity and don't hit that 10 GB limit. Okay. So let's run this and see it in action. Yeah, let's see how fast it goes! And that finished in less than five seconds. And if I run this a few more times, let's run it again, I'm going to get closer to 50,000 RUs per second. Yeah, so we kind of like that. Of course, now we ran it very, very fast, and it's the same data being written; all we did was change the partition key. Wow, that's awesome. Thank you so much. Any closing thoughts? Yeah, let me just put this up here; these are the scenarios and best practices for write-heavy workloads. The closing thought is really that if you are starting out with Cosmos DB, you want to master data modeling, partitioning, and of course RUs. Those are the fundamental concepts for being successful with Cosmos DB. Fantastic. Thank you so much, it's always fun talking to you. All right, thanks a lot Sanjay. Thanks for watching this short Azure readiness video.
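To recap the synthetic-key idea from the write-heavy demo: the key is just the two properties concatenated, so each vehicle's data is split into one logical partition per day. A minimal sketch (the property names and the sample VIN are illustrative, not from the demo):

```python
from datetime import date

def synthetic_partition_key(vin: str, event_date: date) -> str:
    """Combine VIN and date so each vehicle's telemetry is split per day,
    keeping every logical partition well under the size limit while
    preserving high cardinality for even write distribution."""
    return f"{vin}_{event_date.isoformat()}"

doc = {
    "vin": "1HGCM82633A004352",          # illustrative VIN
    "timestamp": "2020-02-22T10:15:00Z",
    "region": "west-us",
}
doc["partitionKey"] = synthetic_partition_key(doc["vin"], date(2020, 2, 22))
print(doc["partitionKey"])  # 1HGCM82633A004352_2020-02-22
```

The computed property is stored on the document and declared as the container's partition key, so every write and every point read supplies it naturally.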
