Best practices for storing unstructured data in Azure Blobs for an app that uses SQL Server Analysis Services (SSAS) and Azure Blob storage
Unstructured data makes up much of the content that organizations store. It includes files such as spreadsheets, images, documents, presentations, audio, and video. Applications that deal with unstructured data include web applications like Office 365 or Salesforce.com, where people can view or edit information stored in Word files, Excel files, or Adobe PDFs; line-of-business (LOB) apps like Microsoft Dynamics CRM; and visualization tools like Power BI and SQL Server Reporting Services (SSRS). These apps must decide which content to index and make available in their search engines, and choosing which file types to include involves a trade-off. On the one hand, you need to index enough content for search to be useful in your application. On the other hand, if you index too much, storage requirements and processing time grow, and search efficiency and performance suffer.
For these reasons, many apps use a tiered storage architecture that stores content according to its access profile. Newer data (hotter documents) that is accessed frequently resides on faster storage media, while older data rests on slower media or tape backup. Profiles are assigned by an algorithm that considers factors such as how often an item is accessed compared to others in the same document library or folder, who owns it, and its file size.
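To make the idea of profile assignment concrete, here is a minimal sketch of such an algorithm. The field names, the one-megabyte cutoff, and the pinned-owner rule are our own illustrative assumptions, not part of any Azure API; a real system would tune them against its own access logs.

```python
from dataclasses import dataclass

SMALL_ITEM_BYTES = 1_000_000  # hypothetical cutoff: small items are cheap to keep hot


@dataclass
class ItemProfile:
    """Access profile for one stored item (all fields are illustrative)."""
    accesses_last_30_days: int   # how often this item was read recently
    folder_median_accesses: int  # median for items in the same library or folder
    size_bytes: int
    owner: str


def choose_tier(item: ItemProfile, pinned_owners: frozenset = frozenset()) -> str:
    """Return 'hot', 'cool', or 'archive' using simple, hypothetical thresholds."""
    # Content from pinned owners always stays on fast storage.
    if item.owner in pinned_owners:
        return "hot"
    # Items nobody read in the window are candidates for the slowest tier.
    if item.accesses_last_30_days == 0:
        return "archive"
    # Items read at least as often as their neighbours stay hot.
    if item.accesses_last_30_days >= item.folder_median_accesses:
        return "hot"
    # Rarely read items are demoted only when they are large enough to matter.
    return "hot" if item.size_bytes < SMALL_ITEM_BYTES else "cool"


report = ItemProfile(accesses_last_30_days=2, folder_median_accesses=10,
                     size_bytes=5_000_000, owner="carol")
print(choose_tier(report))  # cool
```

The rules are checked in order, so an owner override wins over access counts, and access counts win over file size.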
In this article, we show you how to use Azure Blobs to store your unstructured data in a tiered storage architecture. We walk you through the process of configuring a search index for an SSAS Tabular 1400 model using Azure Search. Then, we show you how to move selected content from blob storage into the search index, allowing end-users to view it based on their permissions. Along the way, we provide best practices for storing this type of information efficiently and affordably by using features such as object tagging and lifecycle management.
Before going further, let’s look at some common strategies for dealing with unstructured data. As you read these examples, keep in mind that SSAS Tabular 1400 (a tabular model at the 1400 compatibility level) is a column-store model designed to support large amounts of data with fast, efficient queries. It uses the xVelocity in-memory analytics engine, which enables very fast tabular querying but also has significant memory and storage requirements for new tables and indices.
Approximate profile of unstructured data objects
1: (all hot): Store everything on HDD (very low cost)
You can store all your blobs on hard disk drives (HDD), which gives you easy access at a lower price point than other forms of storage; however, this comes with some tradeoffs. First, a disk you manage yourself provides no redundancy unless you keep multiple copies of your data. (Standard Azure Storage accounts, which are HDD-backed, keep multiple copies for you.) Second, HDDs support only a limited number of transactions per second, so queries are slower than with other forms of storage. A third downside is that blobs stored on HDD tend to be larger than those stored on SSD or SCD (more on this later).
2: (all hot): Store everything in Azure Blob Storage – Hot tier / Archive tier
You could store all your blobs in the hot tier and add object tags to each blob so you know what type it is and where it should go when it becomes stale relative to your business needs. For example, you could “archive” older media files so they are no longer searchable but remain available to end-users for download. Keep in mind that a blob in the Archive tier is offline and must be rehydrated to the Hot or Cool tier before it can be read, which can take hours. This approach lets you use a single storage account for all your data, which simplifies moving blobs from one tier to another should you need to.
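As a rough illustration of the tagging step, the sketch below builds the tag set for one blob from its content kind and age. The tag names `kind` and `status` and the 180-day staleness threshold are assumptions made for this example; choose names and thresholds that match your own business rules.

```python
from datetime import datetime, timedelta, timezone


def build_blob_tags(content_kind: str, last_modified: datetime,
                    now: datetime, stale_after_days: int = 180) -> dict:
    """Build tags describing what a blob is and whether it has gone stale."""
    age = now - last_modified
    return {
        # Hypothetical tag names; Azure does not prescribe these.
        "kind": content_kind,
        "status": "stale" if age > timedelta(days=stale_after_days) else "active",
    }


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
old_video = build_blob_tags("media", datetime(2023, 1, 1, tzinfo=timezone.utc), now)
print(old_video)  # {'kind': 'media', 'status': 'stale'}
```

If you use the azure-storage-blob package, a dictionary like this can be passed to `BlobClient.set_blob_tags`. We leave that call out here so the sketch runs without a storage account.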
3: (all hot): Store everything in Azure Blob Storage – Hot tier / Cool tier / Archive tier
Suppose you want more granular control over which blobs are stored on HDD and which on SSD, with a lower price point for “cooler” documents that are accessed once a day or less but more often than archived items, while keeping redundancy and the ability to support a high number of transactions per second. You could store all your blobs in the hot tier with object tags so that when an item becomes stale, it’s moved automatically into the Cool tier. Then, when users access the “cooler” documents, you could move them into the hot tier again.
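One way to automate this movement is a lifecycle management policy, which is a JSON document attached to the storage account. The sketch below generates such a policy in Python. The field names follow the lifecycle policy schema as we understand it, and last-access-based rules require last access time tracking to be enabled on the account, so verify both against the current Azure documentation. The rule name, prefix, and day counts are hypothetical.

```python
import json


def build_lifecycle_policy(cool_after_days: int = 30,
                           archive_after_days: int = 180,
                           prefix: str = "documents/") -> dict:
    """Return a lifecycle policy that tiers block blobs by last access time."""
    if cool_after_days >= archive_after_days:
        raise ValueError("cool_after_days must be smaller than archive_after_days")
    base_blob_actions = {
        "tierToCool": {"daysAfterLastAccessTimeGreaterThan": cool_after_days},
        "tierToArchive": {"daysAfterLastAccessTimeGreaterThan": archive_after_days},
        # Moves a blob from Cool back to Hot when it is read again.
        "enableAutoTierToHotFromCool": True,
    }
    rule = {
        "enabled": True,
        "name": "tier-by-last-access",  # hypothetical rule name
        "type": "Lifecycle",
        "definition": {
            "actions": {"baseBlob": base_blob_actions},
            # prefixMatch begins with the container name (assumed: "documents").
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
        },
    }
    return {"rules": [rule]}


policy = build_lifecycle_policy()
print(json.dumps(policy, indent=2))
```

Note that the automatic move back to Hot applies only to blobs in the Cool tier. Archived blobs stay offline until they are explicitly rehydrated.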
4: (all warm): Store everything in Azure Table Storage – Warm tier / Archive table
Storing blobs in Table storage might be a good idea if fast queries are important to you but you don’t need high transaction throughput or redundancy across accounts. You could store your hot items directly in Azure Table Storage while archiving the older ones in another Azure Table Storage account for querying purposes only. Within an account, Azure keeps multiple copies of each item, and lookups on individual items are fast because Table storage indexes every entity by its partition key and row key. Note that Table storage limits each entity to 1 MiB, so this strategy only suits small items. This option also offers no redundancy across accounts: if your storage account became unavailable, you could lose access to all your blobs!
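To show what an item looks like in Table storage, here is a minimal sketch that shapes a small payload into an entity. `PartitionKey` and `RowKey` are the two required key properties. The `Content` and `SizeBytes` property names and the choice of the document library as partition key are our assumptions. The size check reflects the 64 KiB limit on a single binary property.

```python
MAX_PROPERTY_BYTES = 64 * 1024  # limit for one binary property in Azure Table storage


def to_table_entity(library: str, item_name: str, payload: bytes) -> dict:
    """Build an entity dict keyed for fast point lookups (names are illustrative)."""
    if len(payload) > MAX_PROPERTY_BYTES:
        raise ValueError(
            f"{item_name}: {len(payload)} bytes exceeds the 64 KiB property limit"
        )
    return {
        "PartitionKey": library,  # groups items from the same document library
        "RowKey": item_name,      # must be unique within the partition
        "Content": payload,
        "SizeBytes": len(payload),
    }


entity = to_table_entity("contracts", "notes.txt", b"short memo")
print(entity["PartitionKey"], entity["RowKey"], entity["SizeBytes"])  # contracts notes.txt 10
```

A lookup that supplies both keys reads one entity directly, which is where the fast-lookup benefit comes from. Anything larger than the limit has to live in Blob storage instead.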
With the four storage strategies discussed above, you can use Azure Storage to manage your unstructured data effectively. Each tier in Figure 1 offers something different (and more affordable) than the previous one. The Hot tier is built for frequent access and a high number of transactions per second, while the Cool and Archive tiers are better suited to data that is read rarely. Table storage would be appropriate if you want fast lookups with granular control at a low price point, but watch out for those account-level limitations!