In my attempt to cover most of the features of the Microsoft Cloud Computing Windows Azure, I’ll be covering Windows Azure storage in the next few posts.
Why use Windows Azure storage?
- Fault-tolerance: Windows Azure Blobs, Tables and Queues stored on Windows Azure are replicated three times in the same data center for resiliency against hardware failure. No matter which storage service you use, your data will be replicated across different fault domains to increase availability
- Geo-replication: Windows Azure Blobs and Tables are also geo-replicated between two data centers 100s of miles apart from each other on the same continent, to provide additional data durability in the case of a major disaster, at no additional cost.
- REST and availability: In addition to using the storage services from your applications running on Windows Azure, your data is accessible through a REST interface from virtually anywhere, anytime.
- Content Delivery Network: With one-click, the Windows Azure CDN (Content Delivery Network) dramatically boosts performance by automatically caching content near your customers or users.
- Price: It’s insanely cheap storage
The only reason you would not be interested in the Windows Azure storage platform would be if you’re called Chuck Norris …
Now if you are still reading this line it means you aren’t Chuck Norris, so let’s get on with it, as long as it is serializable.
In this post we will cover Windows Azure Blob Storage, one of the storage services provided by the Microsoft cloud computing platform. Blob storage is the simplest way to store large amounts of unstructured text or binary data such as video, audio and images, but you can store just about anything in it.
The concept behind Windows Azure Blob storage is as follows:
There are 3 things you need to know about to use Windows Azure Blob storage:
- Account: the Windows Azure storage account, which contains the blob, table and queue storage services. The blob storage of a storage account can contain multiple containers.
- Container: blob storage container, which behaves like a folder in which we store items
- Blob: Binary Large Object, which is the actual item we want to store in the blob storage
1. Creating and using the Windows Azure Storage Account
To be able to store data in the Windows Azure platform, you will need a storage account. To create a storage account, log in to the Windows Azure portal with your subscription and go to the Hosted Services, Storage Accounts & CDN service:
Select the Storage Accounts service and hit the Create button to create a new storage account:
Define a prefix for the storage account you want to create:
After the Windows Azure storage account is created, you can view the storage account properties by selecting the storage account:
The storage account can be used to store data in the blob storage, table storage or queue storage. In this post, we will only cover the blob storage. Among the properties of the storage account are the primary and secondary access key. You will need one of these 2 keys to be able to execute operations on the storage account. Both keys are valid and either can be used as the access key.
When you have an active Windows Azure storage account in your subscription, you’ll have a few possible operations:
- Delete Storage: Delete the storage account, including all data related to the storage account
- View Access Keys: Shows the primary and secondary access key
- Regenerate Access Keys: Allows you to regenerate one or both of your access keys. If one of your access keys is compromised, you can regenerate it to revoke access for the compromised access key
- Add Domain: Map a custom DNS name to the storage account blob storage. For example, map robbincremers.blob.core.windows.net to the static.robbincremers.me domain. This can be interesting for storage accounts which directly expose data to customers through the web. The mapping is only available for blob storage, since only blob storage can be publicly exposed.
Now that we created our Windows Azure storage account, we can start by getting a reference to our storage account in our code. To do so, you will need to work with the CloudStorageAccount class, which belongs to the Microsoft.WindowsAzure namespace:
We create a CloudStorageAccount by parsing a connection string. The connection string takes the account name and key, which you can find in the Windows Azure portal. You can also create a CloudStorageAccount by passing the values as parameters instead of a connection string, which could be preferable. You need to create an instance of the StorageCredentialsAccountAndKey and pass it into the CloudStorageAccount constructor:
The boolean that the CloudStorageAccount takes is to define whether you want to use HTTPS or not. In our case we chose to use HTTPS for our operations on the storage account. The storage account only has a few operations, like exposing the storage endpoints, the storage account credentials and the storage specific clients:
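A minimal sketch of both approaches, assuming the 1.x storage client library from the Windows Azure SDK (the account name and key are placeholders, not real credentials):

```csharp
using Microsoft.WindowsAzure;

// Option 1: parse a connection string (account name and key are placeholders)
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey");

// Option 2: pass the credentials directly; the boolean tells the
// CloudStorageAccount whether to use HTTPS for the storage endpoints
StorageCredentialsAccountAndKey credentials =
    new StorageCredentialsAccountAndKey("myaccount", "mykey");
CloudStorageAccount storageAccount2 = new CloudStorageAccount(credentials, true);
```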
The storage account exposes the endpoints of the blob, queue and table storage. It also exposes the storage credentials through the Credentials property. Finally, it exposes 4 important operations:
- CreateCloudBlobClient: Creates a client to work on the blob storage
- CreateCloudDrive: Creates a client to work on the drive storage
- CreateCloudQueueClient: Creates a client to work on the queue storage
- CreateCloudTableClient: Creates a client to work on the table storage
You won’t be using the CloudStorageAccount much, except for creating the service client for a specific storage type.
2. Basic operations for managing blob storage containers
A blob container is basically a folder in which we place our blobs. You can do the usual stuff like creating and deleting blob containers. There is also a possibility to set permissions and metadata on a blob container, but those will be covered in later chapters, after we have looked into the basics of the CloudBlobContainer and the CloudBlob.
Creating a blob container synchronously:
You start by creating the CloudBlobClient from the CloudStorageAccount through the CreateCloudBlobClient on the CloudStorageAccount. The CloudBlobClient exposes a bunch of operations which will be used to manage blob containers and to manage and store blobs. To create or retrieve a blob container, you use the GetContainerReference operation. This will return a reference to the blob container, even if the container does not exist yet. The reference does not execute a request over the network. It simply creates a reference with the values the container would have, and is returned as a CloudBlobContainer.
To create the blob container, you invoke the Create or CreateIfNotExists operation on the CloudBlobContainer. The Create operation will throw a StorageClientException if the container you are trying to create already exists. You can also use the CreateIfNotExists operation, which only attempts to create the container if it does not exist yet. Important to know: a blob container name can only contain numbers, lowercase letters and dashes and has to be between 3 and 63 characters long.
For almost every synchronous operation there is an asynchronous operation available as well.
Creating the blob container asynchronously by the BeginCreateIfNotExists and EndCreateIfNotExists operation:
It follows the default Begin and End pattern of the asynchronous operations. If you are a fan of lambda expressions (awesomesauce), you can avoid splitting up your operation by using one:
You could do almost anything asynchronously, which is highly recommended. If my Greek ninja master would see me writing synchronous code, he would most likely slap me, but for demo purposes I will continue using synchronous code throughout this post, since it is easier to follow and understand for some people.
Deleting a blob container is as straightforward as creating one. We simply invoke the Delete operation on the CloudBlobContainer, which will execute a request against the REST storage interface:
There is one remark about this piece of code, and that is the Exists operation. By default, there is no operation to check whether a blob container exists. I added an extension method on the CloudBlobContainer which checks whether the blob container exists. This way of validating the existence of the container was suggested by Steve Marx. We will come back to the FetchAttributes method in a later chapter.
I added an identical extension method to check if a Blob exists:
If you do not know what extension methods are, you can find an easy article about it here:
Implementing and executing C# extension methods
To explore my storage accounts, I use a free tool called Azure Storage Explorer which you can download on codeplex:
When you have created the new blob container, you can view and change its properties with the Azure Storage Explorer:
I manually uploaded an image to the blob container so we have some data to test with.
There are a few other operations exposed on the CloudBlobClient and CloudBlobContainer which you might end up using:
- ListContainers: Allows you to retrieve a list of blob containers that belong to the storage account blob storage. You can list all containers or search by a prefix.
- GetBlobReference: Allows you to retrieve a CloudBlob reference through the absolute uri of the blob.
- GetBlockBlobReference: See chapter 9
- GetPageBlobReference: See chapter 9
- SetMetadata: See chapter 4
- SetPermissions: See chapter 6
- FetchAttributes: See chapter 4
One thing that might surprise you is that it is not possible to nest one container beneath another.
3. Basic operations for storing and managing blobs
Blob stands for Binary Large Object, but you can basically store about anything in blob storage. Ideally it is built for storing images, files and so forth, but you can just as easily serialize an object and store it in blob storage. Let’s cover the basics of the CloudBlob.
There are a few possible operations on the CloudBlob to upload a new blob in a blob container:
- UploadByteArray: Uploads an array of bytes to a blob
- UploadFile: Uploads a file from the file system to a blob
- UploadFromStream: Uploads a blob from a stream
- UploadText: Uploads a string of text to a blob
There are a few possible operations on the CloudBlob to download a blob from blob storage:
- DownloadByteArray: Downloads the blob’s contents as an array of bytes
- DownloadText: Downloads the blob’s contents as a string
- DownloadToFile: Downloads the blob’s contents to a file
- DownloadToStream: Downloads the contents of a blob to a stream
There are a few possible operations on the CloudBlob to delete a blob from blob storage:
- Delete: Delete the blob. If the blob does not exist, the operation will fail
- DeleteIfExists: Delete the blob only if it exists
A few other common operations on the CloudBlob you might run into:
- OpenRead: Opens a stream for reading the blob’s content
- OpenWrite: Opens a stream for writing to the blob
- CopyFromBlob: Copy an existing blob with content, properties and metadata to a new blob
They all work identically, so as an example we will only cover uploading and downloading a file to and from blob storage.
Uploading a file to blob storage:
In the GetBlobReference we pass along the name we want the blob to have in Windows Azure blob storage. Finally we upload our local file to blob storage. Retrieving a file from blob storage:
Both operations have in common that they get a CloudBlob through the GetBlobReference operation. In the upload operation we used the GetBlobReference on the CloudBlobContainer, while in the download operation we used the GetBlobReference operation on the CloudBlobClient. They both return the same result, the only difference is they take different parameters.
- CloudBlobClient.GetBlobReference: Get a blob by providing the relative uri to the blob. The relative uri is of format “blobcontainer/blobname”
- CloudBlobContainer.GetBlobReference: Get a blob by providing the name of the blob. The blob is searched for in the blob container we are working with.
Both the operations also allow you to get the blob by specifying the absolute uri of the blob.
Deleting a blob from blob storage:
There are a few other operations exposed on the CloudBlob which you might end up using:
- SetMetadata: See chapter 4
- SetPermissions: See chapter 6
- SetProperties: See chapter 5
- CreateSnapshot: Snapshots provide read-only versions of blobs. Once a snapshot has been created, it can be read, copied, or deleted, but not modified. You can use a snapshot to restore a blob to an earlier version by copying over a base blob with its snapshot
4. Managing user-defined metadata on blob containers and on blobs
Managing user-defined metadata is identical for the blob container and the blob. Both the CloudBlobContainer and the CloudBlob expose the same operations, which allow us to manage the metadata.
Using metadata could be particularly interesting when you want to add some custom information to the blob or the blob container. Some examples to use metadata:
- A metadata property “author” that defines who created the blob container or the blob
- A metadata property “changedby” that defines which user issued the last change on the blob
- A metadata property “identifier” that defines a unique identifier for the blob, which could be needed when retrieving the blob
Adding metadata is done through the Metadata property, a NameValueCollection to which you add the metadata items you want. Retrieving the metadata is done by reading the same Metadata property. Before we read the Metadata property, we invoke the FetchAttributes operation. This operation makes sure the blob’s system properties and user-defined metadata are populated with the latest values. It is advised to always invoke the FetchAttributes operation before reading blob properties or metadata.
Working with metadata on the CloudBlobContainer:
Working with metadata on the CloudBlob is identical to working with metadata on the blob container:
Metadata allows you to easily store and retrieve custom properties with your blob or blob container.
5. Managing properties like HTTP headers on blobs
On the CloudBlob there is a property Properties exposed, which holds a list of defined blob properties. The following properties are exposed through the blob properties:
- CacheControl: Get or set the cache-control HTTP header for the blob, which allows you to instruct the browser to cache the blob item for a specified time
- ContentEncoding: Get or set the content-encoding HTTP header for the blob, which is used to define what type of encoding was used on the item. This is mainly used when using compression.
- ContentLanguage: Get or set the content-language HTTP header for the blob, which is used to define what language the content is in.
- Length: Get-only property to get the size of the blob in bytes
- ContentMD5: Get or set the content-MD5 HTTP header for the blob, which is a Base64-encoded MD5 hash of the blob content, used to verify the integrity of the response
- ContentType: Get or set the content-type HTTP header for the blob, which is used to specify the mime type of the content
- ETag: Get-only property for the Etag HTTP header for the blob, which is an identifier for a specific version of a resource. The ETag value is an identifier assigned to the blob by the Blob service. It is updated on write operations to the blob. It may be used to perform operations conditionally, providing optimistic concurrency control. We will look into this in a following chapter.
- LastModifiedUtc: Get-only property which returns the last modified date of the blob
- BlobType: Get-only property which returns the type of the blob
- LeaseStatus: Get-only property which returns the lease status of the blob. We will get to leasing blobs in a following chapter.
Now I believe it is pretty obvious why some of these properties can be of crucial use. Retrieving or changing these HTTP header properties is easily done. Retrieving a blob property:
The default content-type of our image is set to application/octet-stream:
If we check that in our browser:
Ideally this should be set to the content-type image/jpeg, or some browsers might not render the image as an image, but treat it as a file to be downloaded. So we will change the content-type to image/jpeg for this blob:
Saving the properties on the blob is done by calling the SetProperties method. If you run the client console application, the Content-Type header for the blob will change, and the next time you retrieve the blob it will be served with the new content-type header:
Setting blob properties like the Cache-Control, Content-Type and Content-Encoding HTTP headers can be very important for how the blob is sent to the client. If you upload a file to blob storage that is compressed with GZIP and you do not set the content-encoding property on the blob, clients will not be able to read the retrieved blob. If you do set the correct HTTP header, the client knows the resource is compressed with gzip and can take the necessary steps to read the compressed file.
If you are serving images directly from blob storage on your website, you might want to set the Cache-Control property so the header is added to the responses. That way the images will not be retrieved on every single request, but will be cached in the client browser for the duration you specified.
6. Managing permissions on a blob container
When you create a blob container, it is created as a private container by default, meaning the content of the container is not publicly exposed to anonymous web users. If we create a default blob container called “images”, it will look like this in the Azure Storage Explorer:
Notice the images container has a lock image on the folder, meaning it is a private container. We have an image “robbin.jpg” uploaded in the images blob container. If you attempt to view the image in your browser:
Now let’s suppose we want the image to be publicly available for our web application for example. In that case, we would have to change the permissions. To change the permissions on the blob container, you need the following code:
We create an operation to which we pass the blob container name and the BlobContainerPublicAccessType as parameters. We then update the CloudBlobContainer with the new permission through the SetPermissions operation on the CloudBlobContainer. The BlobContainerPublicAccessType is an enumeration which currently holds 3 possible values:
- Blob: Blob-level public access. Anonymous clients can read the content and metadata of blobs within this container, but cannot read container metadata or list the blobs within the container.
- Container: Container-level public access. Anonymous clients can read blob content and metadata and container metadata, and can list the blobs within the container.
- Off: No anonymous access. Only the account owner can access resources in this container.
By default the Permissions are set to Off, which means only the account owner can access the resources in the blob container. We want the images to be publicly exposed for our web application. We will change the permission of our “images” blob container to the Blob permission:
Setting the blob container permissions to Container is mainly used when you want users to be able to browse through the content of your blob container. If you don’t want people to see the full content of your blob container, you set the permissions to Blob, which means individual blobs can still be retrieved without users being able to list the content of your blob container.
If you do not want to expose the blobs to the public at all, you set the blob container access level back to private. There is also a possibility to grant access to blobs in a private blob container. That is where shared access policies and shared access signatures come in.
7. Managing shared access policies and shared access signatures
There is also a possibility to set more precise permissions on a blob or a blob container. Shared Access Policies and Shared Access Signatures (SAS) allow us to define a custom permission on a blob or blob container for specific rights, within a specific time frame.
Two concepts come forward:
- Shared Access Policy: The shared access policy, represented by a SharedAccessPolicy object, defines a start time, an expiry time, and a set of permissions for shared access signatures to a blob resource
- Shared Access Signature: A Shared Access Signature is a URL that grants access rights to containers and blobs. By specifying a Shared Access Signature, you can grant users who have the URL access to a specific blob or to any blob within a specified container for a specified period of time. You can also specify what operations can be performed on a blob that’s accessed via a Shared Access Signature
We will run through a few steps to cover the Shared Access Signature and the Shared Access Policy.
We start by adding some code in our console application to create a shared access signature for our image blob:
We get the specific blob we want to get a shared access signature for and we create a shared access signature by using the GetSharedAccessSignature operation on the CloudBlob. The operation takes a SharedAccessPolicy object, on which we can specify a few values:
- SharedAccessStartTime: Takes a datetime specifying the start time of the access signature. If you do not provide a datetime, the default value will be the moment of creation
- SharedAccessExpiryTime: Takes a datetime specifying when the access signature will expire. After expiration, the signature will no longer be valid to execute the operations on the resource
- Permissions: Takes a combination of SharedAccessPermissions. These permissions define the rights the shared access signature is granted on the resource
SharedAccessPermissions is an enumeration, which has the following possible values:
- Delete: Grant delete access
- Write: Grant write access
- List: Grant listing access
- Read: Grant read access
- None: No access granted
In this case we set the permissions on the resource to Read only, and the shared access signature is valid from this moment until 1 minute in the future. This means the resource will only be accessible with the shared access uri for 1 minute after creation. After expiration, you will no longer be able to access the resource with the shared access uri. Also note we use DateTime.UtcNow for the start date and expiration date, which is advised to avoid time zone issues.
We write the shared access signature, the blob uri and the shared access uri to the output window. The shared access uri is the combination of the blob uri and the shared access signature, which results in a complete uri. If we run the client console application:
The shared access signature is written to the output window. Finally we also write the complete shared access uri to the output window, which is of the format bloburi + signature. If we visit the shared access uri within a minute of running the client console application (the images blob container is set to the private access level, so the resources are not publicly exposed):
If you try to access the resource with the shared access uri after 1 minute, you’ll be receiving an error, because the shared access signature expired:
So even though the blob container is set to private access, we can expose certain resources with the shared access signature. We can trade the shared access signature with some users or applications for them to access the resource. Let’s have a look at the generated shared access signature for our blob:
You will see a few parameters in the generated signature:
- se: Stands for the signature expiration date
- sr: Stands for the signature resource type. The value is set to b, which stands for blob. The other possible value is c, which stands for container
- sp: Stands for the signature permissions. The value is set to r, which stands for read. If the value were rwdl, it would stand for read, write, delete, list
- sig: Stands for the signature that is being used as a validation mechanism
- ss: Stands for the signature start date. Is not added in this signature since we did not specify the signature start date
One of the issues with these shared access uris arises when you create a shared access signature that is valid for, let’s say, 1 month. You pass this shared access uri to one of your customers, so he can access the resource. However, if the customer’s shared access signature is compromised after 1 week, you will want to invalidate it. But since the signature is generated with the storage account primary key, the shared access signature stays valid as long as it has not expired and still validates against that key. So to revoke the compromised shared access signature, you have to regenerate the storage account primary key, which invalidates all shared access signatures created with it. This is obviously not an ideal situation, and regenerating your storage account keys every time something might be compromised will make you end up bald. That’s where the shared access policy comes in.
Let’s create a Shared Access Policy for our images blob container:
We create a new BlobContainerPermissions collection, which holds the permissions set on the blob container. You could retrieve the existing permissions through the Permissions property on the CloudBlobContainer. We create 2 new shared access policies. The “readonly” shared access policy has only read rights, while the “writeonly” shared access policy has only write rights. We do not define an expiration date for the shared access policies, since the expiration date will be set when we request a shared access signature against a policy. Finally we save the new shared access policies through the SetPermissions operation, which takes a BlobContainerPermissions instance.
For our specific blob image, we want to create a shared access signature for a customer. Instead of specifying a new shared access policy with all attributes, we only specify a shared access policy with an expiration date and pass the name of the container-level access policy we want to create the shared access signature on:
The GetSharedAccessSignature operation now uses the second overload, which takes the name of a container-level access policy:
If we run the console application now:
Notice our shared access signature looks different. There is a new parameter in the signature, si, which stands for signature identifier and points to the shared access policy the signature was created on. The sp parameter, which stands for the rights, is no longer present, since we already specified the rights in the shared access policy. If you specify the rights again in a signature you create on a container-level shared access policy, you’ll get an exception when trying to use the signature.
If we visit the blob image with the readonly signature we created:
If we now generate a blob signature for the “writeonly” access policy and visit the resource with the shared access uri:
Which makes sense since we did not specify read rights for the writeonly access policy.
Now let’s suppose the shared access signature we traded with a customer has been compromised. Instead of regenerating the storage account primary key, we simply change or revoke the shared access policy. You can see the shared access policies on the blob container with the Azure Storage Explorer:
You can change the policies, change the rights or simply make the policy expire if it has been compromised, which disables the compromised access signature without breaking everything else that uses the storage account. For some reason the Shared Access Policies do not show the start and expiry date … Weird stuff, but not important.
You can generate a shared access signature on blob container level or on blob level. If you generate it on blob container level, it can be used to access any blob in the container. If it is generated on blob level, it is only valid to access that one specific blob. Shared access policies are being placed on the blob container level.
Accessing resources with a shared access key can be done with the CloudBlobClient:
Instead of using the default CreateCloudBlobClient method on the CloudStorageAccount, we create a new CloudBlobClient as shown above. We pass a StorageCredentialsSharedAccessSignature with the shared access signature into the CloudBlobClient constructor. All operations you execute with this CloudBlobClient will work through the shared access signature.
8. Managing concurrency with Windows Azure blob storage
Just as with all other data services, the Windows Azure blob storage provides a concurrency control mechanism:
- Optimistic concurrency with the Etag property
- Exclusive write rights concurrency control with blob leasing
The issue is as following:
- Client 1 retrieves the blob.
- Client 1 updates some property on the blob
- Client 2 retrieves the blob.
- Client 2 updates some property on the blob
- Client 2 saves the changes on the blob to blob storage
- Client 1 saves the changes on the blob to blob storage. The changes client 2 made to the blob are overwritten and lost, since client 1 never retrieved them.
The idea behind the optimistic concurrency is as following:
- Client 1 retrieves the blob. The Etag is currently 100 in blob storage
- Client 1 updates some property on the blob
- Client 2 retrieves the blob. The etag is currently 100 in blob storage
- Client 2 updates some property on the blob
- Client 2 saves the changes on the blob to blob storage with etag 100. The Etag of the blob in blob storage is 100 and the etag provided by the client is 100, so the client had the latest version of the blob. The update is validated and the blob is updated. The Etag is changed to 101.
- Client 1 saves the changes on the blob to blob storage with etag 100. The Etag of the blob in blob storage is 101 while the etag provided by the client is 100, so the client does not have the latest version of the blob. The update fails and an exception is returned to the client.
Some dummy code in the console application to show this behavior of optimistic concurrency control through the Etag:
We get the current Etag value of the blob through the CloudBlob.Properties.ETag property and write it to the output window. Then we add a metadata item to the blob and update the metadata of the blob in blob storage. The only difference compared to before is that we pass a BlobRequestOptions instance with the operation. In the BlobRequestOptions we specify an AccessCondition of AccessCondition.IfMatch(etag), meaning the request will only succeed if the etag of the blob in blob storage matches the etag we pass along, which is the etag we got when we retrieved the blob. The BlobRequestOptions can be provided to almost every operation that interacts with Windows Azure blob storage.
If we run this console application twice and update the second blob before the first blob, we will get an error when trying to update the first blob:
We get the error “The condition specified using HTTP conditional header(s) is not met”, meaning we tried to update a blob that was already updated by someone else after we retrieved it. That way we avoid overwriting updated data in blob storage and losing the changes made by another user.
Now it’s also possible to control concurrency on blob storage through blob leases. A lease basically locks the blob so that other users cannot modify it while it is locked. If someone else tries to update the blob while it is locked, he will get an exception and will have to wait until the lease is released to be able to update it. Once the client has updated the blob, he releases the lease so that other users can modify the blob again. Leasing blobs guarantees exclusive write access.
Ideally you will go with optimistic concurrency, since leasing blobs is expensive and locking resources can hurt performance and create bottlenecks. Personally I would always go with optimistic concurrency, as it is easy to implement and does not hurt your application's performance.
If you really need to use the blob leasing concurrency control, you can find more information about leasing blobs here: http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library
9. Page blobs vs block blobs: the difference
There are two sorts of blobs. You can either use a page blob or a block blob.
MSDN information on a Block blob:
Block blobs let you upload large blobs efficiently. Block blobs are comprised of blocks, each of which is identified by a block ID. You create or modify a block blob by writing a set of blocks and committing them by their block IDs. Each block can be a different size, up to a maximum of 4 MB. The maximum size for a block blob is 200 GB, and a block blob can include no more than 50,000 blocks. If you are writing a block blob that is no more than 64 MB in size, you can upload it in its entirety with a single write operation. (Storage clients default to 32 MB, settable using the SingleBlobUploadThresholdInBytes property.)
When you upload a block to a blob in your storage account, it is associated with the specified block blob, but it does not become part of the blob until you commit a list of blocks that includes the new block’s ID. New blocks remain in an uncommitted state until they are specifically committed or discarded. Writing a block does not update the last modified time of an existing blob.
Block blobs include features that help you manage large files over networks. With a block blob, you can upload multiple blocks in parallel to decrease upload time. Each block can include an MD5 hash to verify the transfer, so you can track upload progress and re-send blocks as needed. You can upload blocks in any order, and determine their sequence in the final block list commitment step. You can also upload a new block to replace an existing uncommitted block of the same block ID. You have one week to commit blocks to a blob before they are discarded. All uncommitted blocks are also discarded when a block list commitment operation occurs but does not include them.
You can modify an existing block blob by inserting, replacing, or deleting existing blocks. After uploading the block or blocks that have changed, you can commit a new version of the blob by committing the new blocks with the existing blocks you want to keep using a single commit operation. To insert the same range of bytes in two different locations of the committed blob, you can commit the same block in two places within the same commit operation. For any commit operation, if any block is not found, the entire commitment operation fails with an error, and the blob is not modified. Any block commitment overwrites the blob’s existing properties and metadata, and discards all uncommitted blocks.
Block IDs are strings of equal length within a blob. Block client code usually uses base-64 encoding to normalize strings into equal lengths. When using base-64 encoding, the pre-encoded string must be 64 bytes or less. Block ID values can be duplicated in different blobs. A blob can have up to 100,000 uncommitted blocks, but their total size cannot exceed 400 GB.
If you write a block for a blob that does not exist, a new block blob is created, with a length of zero bytes. This blob will appear in blob lists that include uncommitted blobs. If you don’t commit any block to this blob, it and its uncommitted blocks will be discarded one week after the last successful block upload. All uncommitted blocks are also discarded when a new blob of the same name is created using a single step (rather than the two-step block upload-then-commit process).
MSDN information on a Page blob:
Page blobs are a collection of 512-byte pages optimized for random read and write operations. To create a page blob, you initialize the page blob and specify the maximum size the page blob will grow. To add or update the contents of a page blob, you write a page or pages by specifying an offset and a range that align to 512-byte page boundaries. A write to a page blob can overwrite just one page, some pages, or up to 4 MB of the page blob. Writes to page blobs happen in-place and are immediately committed to the blob. The maximum size for a page blob is 1 TB.
In most cases you will be using block blobs; by default, when you upload a blob to blob storage, it will be a block blob. One of the key scenarios for page blobs is cloud drives, which allow you to attach a VHD with data to a Windows Azure instance and have it behave like a local NTFS disk. Cloud drives are outside the scope of this post, but if you are interested in how to work with page blobs, you can find the necessary information here: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/04/11/using-windows-azure-page-blobs-and-how-to-efficiently-upload-and-download-page-blobs.aspx
When working with the CloudBlob class, the blob type is abstracted away from us. However, you can work with page or block blobs explicitly by using the CloudBlockBlob and CloudPageBlob classes:
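A small sketch of both blob types with the v1.x storage client; the container, file paths and blob names are made up for the example:

```csharp
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlobTypesDemo
{
    static void Main()
    {
        CloudBlobClient client =
            CloudStorageAccount.DevelopmentStorageAccount.CreateCloudBlobClient();
        CloudBlobContainer container = client.GetContainerReference("demo");
        container.CreateIfNotExist();

        // Block blob: uploaded as blocks, maximum size 200 GB
        CloudBlockBlob blockBlob = container.GetBlockBlobReference("video.wmv");
        blockBlob.UploadFile(@"c:\temp\video.wmv");

        // Page blob: created with a fixed maximum size (a multiple of 512 bytes),
        // written with random-access pages, maximum size 1 TB
        CloudPageBlob pageBlob = container.GetPageBlobReference("disk.vhd");
        pageBlob.Create(10 * 1024 * 1024); // reserve 10 MB of 512-byte pages

        using (var firstPage = new MemoryStream(new byte[512]))
        {
            pageBlob.WritePages(firstPage, 0); // write one page at offset 0
        }
    }
}
```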
10. Streaming large files as block blobs to blob storage
If you want to upload a large file as a block blob, the information about block blobs in chapter 9 provides most of what you need.
To upload a large file to blob storage while streaming it, we will go through a few steps:
- Open a FileStream on the file
- Calculate how many blocks of 4 MB the file will generate; 4 MB is the maximum allowed block size
- Read a 4 MB buffer from the file with the FileStream
- Create a block ID. Chapter 9 states that all block IDs within a blob are of equal length, so we convert the block ID to a Base64 string to get equal-length block IDs for all blocks
- Upload the block to blob storage with the PutBlock method on the CloudBlockBlob
- Add the block ID to a list of the block IDs we have uploaded. The block is stored in an uncommitted state.
- Keep repeating steps 3 to 6 until the FileStream is completely read
- Execute a PutBlockList with the entire list of block IDs we uploaded to blob storage. This commits all the blocks we uploaded to the CloudBlockBlob.
Some example code, just to show the concept of how the block uploading works. It does not include retry policies or exception handling:
The code that streams the file from disk into blocks and uploads the blocks to blob storage:
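The steps above could be sketched like this with the v1.x storage client; the file path and blob names are made up, and retry policies and exception handling are left out as noted:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlockUploadDemo
{
    const int BlockSize = 4 * 1024 * 1024; // 4 MB, the maximum block size

    static void Main()
    {
        CloudBlobClient client =
            CloudStorageAccount.DevelopmentStorageAccount.CreateCloudBlobClient();
        CloudBlobContainer container = client.GetContainerReference("demo");
        container.CreateIfNotExist();
        CloudBlockBlob blob = container.GetBlockBlobReference("largefile.bin");

        var blockIds = new List<string>();
        using (FileStream stream = File.OpenRead(@"c:\temp\largefile.bin"))
        {
            byte[] buffer = new byte[BlockSize];
            int blockNumber = 0;
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, BlockSize)) > 0)
            {
                // Base64-encode a zero-padded counter so every block ID
                // has the same length
                string blockId = Convert.ToBase64String(
                    Encoding.UTF8.GetBytes(blockNumber.ToString("d6")));

                // PutBlock uploads the block in an uncommitted state
                using (var blockData = new MemoryStream(buffer, 0, bytesRead))
                {
                    blob.PutBlock(blockId, blockData, null);
                }
                blockIds.Add(blockId);
                blockNumber++;
            }
        }

        // PutBlockList commits all uploaded blocks as the content of the blob
        blob.PutBlockList(blockIds);
    }
}
```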
This code can be particularly interesting when you want to stream-upload blobs through a WCF service. However, if you are using a CloudBlob, this is already handled for you by default. If you use a decompiler like Telerik JustDecompile and have a look at the CloudBlob implementation in Microsoft.WindowsAzure.StorageClient.dll, you will find the following in the upload path of a CloudBlob.
You can download the free Telerik JustDecompile here:
Internally, when you upload a CloudBlob, there is a piece of code in there that looks something like this:
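A paraphrased sketch of the dispatch logic, not the literal decompiled source; the helper method names here are hypothetical:

```csharp
// Paraphrased from the decompiled upload path of CloudBlob -- not the
// actual source. UploadBlobInBlocks / UploadBlobSingleShot are made-up names.
if (sourceStream.Length > client.SingleBlobUploadThresholdInBytes)
{
    // Too big for a single PutBlob request: split the stream into blocks,
    // upload them (in parallel), then commit the block list.
    UploadBlobInBlocks(sourceStream);
}
else
{
    // Small enough: a single PutBlob call uploads the whole blob at once.
    UploadBlobSingleShot(sourceStream);
}
```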
It will check whether the file length is larger than the SingleBlobUploadThresholdInBytes property set on the CloudBlobClient. If the blob size is larger than the maximum single-upload size, it automatically switches over to uploading the blob in blocks. So basically you do not have to bother with CloudBlockBlob to get optimized uploading, unless you have to interact with other code that, for example, hands you a stream you do not want to load entirely into memory.
If you want to use your own mechanism to stream a file into blocks and upload those blocks to blob storage, I suggest you have a look at the internal implementation of the storage client. You will find highly optimized code there for parallel and asynchronous block uploads.
If you are looking for more on performance and parallelism when downloading and uploading blobs to storage, you should have a look at this:
Any suggestions, remarks or improvements are always welcome.
If you found this information useful, make sure to support me by leaving a comment.
Cheers and have fun,