Before We Begin
When we use cloud services, there's a familiar pattern: they don't feel expensive at first, but the bill keeps creeping up. We often work on billing optimization, and recently we optimized database storage in one of our accounts to reduce the bill. I'd like to share the story of the hole I dug for myself along the way.
Azure SQL Managed Instance is a very good service
With a few clicks of the mouse, a PaaS DB is created: it is automatically placed in a VNet and subnet, a public endpoint is created, connection configuration is set up, and PITR is configured. It is a very convenient, very good service. (Hooray for SQL Managed Instance 👍)
Once created, there is almost nothing to maintain. You still have to add logins, tune queries, and so on, but it works well. Billing works well too.
Check out this link to learn a little more about Azure SQL Managed Instance: https://docs.microsoft.com/ko-kr/azure/azure-sql/managed-instance/sql-managed-instance-paas-overview
Incident situation
One DB had 640 GiB of reserved storage and 400 GiB of used storage, and 96% of the used storage was concentrated in a single table. Let's call this table tbl_his. tbl_his contains historical data for the last 4-5 years; rows are only ever inserted and rarely selected.
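For reference, one way to confirm that a single table dominates used storage is to sum reserved pages per table. A minimal sketch (the system views are real; the query itself is just an illustration):

```sql
-- Per-table reserved size in GiB, largest first.
-- reserved_page_count is in 8 KiB pages.
SELECT
    t.name AS table_name,
    SUM(ps.reserved_page_count) * 8 / 1024.0 / 1024.0 AS reserved_gib
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables AS t
    ON t.object_id = ps.object_id
GROUP BY t.name
ORDER BY reserved_gib DESC;
```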
Solution!!
I decided we needed a data retention management policy. We could either keep only the last 3 months of data in tbl_his and move the rest to Azure Blob Storage, or create a very low-tier Azure SQL single database for archival purposes to handle the very occasional SELECT.
If we keep only the last three months of data in tbl_his, the mdf will be mostly empty space, so all we need to do afterwards is a SHRINKFILE with NOTRUNCATE followed by a SHRINKFILE with TRUNCATEONLY. A perfect plan; let's run with it.
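In T-SQL, those two shrink steps would look roughly like this (the logical file name data_0 is a hypothetical placeholder):

```sql
-- Step 1: move used pages toward the front of the file (no space released).
DBCC SHRINKFILE (N'data_0', NOTRUNCATE);
-- Step 2: release the now-empty space at the end of the file to the OS.
DBCC SHRINKFILE (N'data_0', TRUNCATEONLY);
```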
Now let's start working
★ The sequence of operations is organized as follows:
▶ Create a table tbl_his_new with the same structure as tbl_his.
▶ Partition tbl_his_new by year and month.
▶ Create N filegroups covering 1900-01 through 2099-12, with N data files (one per filegroup).
▶ Give each data file an initial size of 64 MiB and a growth increment of 16 MiB.
▶ Insert only the last 3 months of data from tbl_his into tbl_his_new.
▶ Rename tbl_his -> tbl_his_old.
▶ Rename tbl_his_new -> tbl_his.
▶ Drop the tbl_his_old table.
▶ DBCC SHRINKFILE with NOTRUNCATE.
▶ DBCC SHRINKFILE with TRUNCATEONLY.
The two rename steps above must be performed back-to-back during a window with few active sessions, to avoid errors in the application.
(The so-called table swap strategy)
We chose the table swap strategy because we realized that copying only the last three months of data from tbl_his into a separate table would take far less time than deleting everything except the last three months from tbl_his in place.
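To make the sequence concrete, here is a minimal T-SQL sketch of the partition-and-swap approach. Everything here is illustrative: the column layout of tbl_his, the names pf_his_monthly / ps_his_monthly / fg_*, and the three-boundary range (the real job covered 1900-01 through 2099-12) are all hypothetical, and the fg_* filegroups and their files are assumed to have been created beforehand with ALTER DATABASE ... ADD FILEGROUP / ADD FILE.

```sql
-- Monthly partition function; RANGE RIGHT puts each boundary date in the
-- partition to its right. Only three boundaries shown for brevity.
CREATE PARTITION FUNCTION pf_his_monthly (datetime2)
AS RANGE RIGHT FOR VALUES ('2021-01-01', '2021-02-01', '2021-03-01');

-- One filegroup per partition (3 boundaries -> 4 partitions).
CREATE PARTITION SCHEME ps_his_monthly
AS PARTITION pf_his_monthly
TO (fg_202012, fg_202101, fg_202102, fg_202103);

-- New table with the same (hypothetical) structure, placed on the scheme.
CREATE TABLE dbo.tbl_his_new
(
    his_id     bigint         NOT NULL,
    created_at datetime2      NOT NULL,
    payload    nvarchar(4000) NULL
) ON ps_his_monthly (created_at);

-- Copy only the last 3 months of history.
INSERT INTO dbo.tbl_his_new (his_id, created_at, payload)
SELECT his_id, created_at, payload
FROM dbo.tbl_his
WHERE created_at >= DATEADD(MONTH, -3, SYSUTCDATETIME());

-- The swap: both renames back-to-back during a quiet window.
BEGIN TRAN;
EXEC sp_rename 'dbo.tbl_his', 'tbl_his_old';
EXEC sp_rename 'dbo.tbl_his_new', 'tbl_his';
COMMIT;

DROP TABLE dbo.tbl_his_old;
```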
The sequence above went smoothly, and the storage utilization after the cleanup looked like this...
Used storage was cut in half, from 400 GiB to 196 GiB. With used storage reduced this much, we could also lower the reserved storage considerably in the Azure portal.
Onset of failure
As I stared at the metrics, an eerie feeling came over me. Let's look at them below.
For reference, the application environment in this account is .NET Core 2.1, running as services on Azure App Service...
The request metric for the App Service started to fluctuate regularly, as shown above.
The CPU % value of the Azure SQL Managed Instance oscillated, matching the pattern of the App Service's request metric, with CPU % values reaching up to 95%.
The App Service's response time. Even with the Avg at that level, how high must the Max have been? Many requests waited a very long time and responsiveness was poor; even as I write this post, my heart pounds when I see that metric.
Troubleshooting
Could the guaranteed IOPS differ depending on the DB's reserved storage capacity and used storage capacity? Based on this hypothesis, I increased the reserved storage capacity of the DB in the Azure portal. (The change was applied within a few minutes.)
Hmmm, still no improvement.
A quick web search gave me a new insight: the premium storage disks behind Azure SQL Managed Instance have different maximum IOPS per disk depending on the disk size tier.
Scalability and performance targets for VM disks
https://docs.microsoft.com/en-us/azure/virtual-machines/disks-scalability-targets
Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose)
https://techcommunity.microsoft.com/t5/datacat/storage-performance-best-practices-and-considerations-for-azure/ba-p/305525
Before the partitioning work, tbl_his lived in a single data file in the PRIMARY filegroup; after partitioning, each new data file started at only 64 MiB, which is estimated to have dropped the per-file IOPS from 2,300 to 500.
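To check how big each data file actually is, and therefore which premium disk tier it maps to (≤128 GiB lands on the 500 IOPS P10 tier, 129-512 GiB on the 2,300 IOPS P20 tier, per the best-practices article linked above), you can query sys.database_files. A minimal sketch:

```sql
-- Size of each data file in GiB (size is stored in 8 KiB pages).
SELECT name, physical_name,
       size * 8.0 / 1024 / 1024 AS size_gib
FROM sys.database_files
WHERE type_desc = 'ROWS'
ORDER BY size_gib DESC;
```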
I decided to increase each individual data file to a value above 128 GiB, say 136 GiB, to see whether that would resolve the failure.
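The fix itself is just a file resize. A sketch, where the database name mydb and the logical file name fg_202101_file are hypothetical, and the statement would be repeated (or generated dynamically) for each data file:

```sql
-- Grow the file past the 128 GiB boundary so it lands on the next
-- premium disk tier (2,300 IOPS) instead of the 500 IOPS tier.
ALTER DATABASE [mydb]
MODIFY FILE (NAME = N'fg_202101_file', SIZE = 136GB);
```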
Eventually, the problem was solved.
As you can see in the metrics below, as soon as we increased the data file sizes, the app's requests became a steady stream again.
As shown in the metrics below, the app's response time (avg, max) dropped back to a normal level starting at 11:30 AM.
Also, as shown in the metric below, the DB's CPU % stopped spiking around 11:30 AM, in contrast to the constant spikes during the failure.
What I learned...
IOPS in Azure SQL Managed Instance are delivered based on data file size, not reserved storage or used storage.
Don't blindly reduce data capacity to optimize billing, or you'll end up working overtime.
This post is wordy, but don't skim it; read it carefully, there are nuggets of wisdom in there.