We're now going to look at how the same goal can be achieved if your ASP.NET application is authenticating in another way. We achieve this through the use of the ASP.NET Data Protection system. Andrew Lock has written an excellent walkthrough on the topic and I encourage you to read it.
We're interested in the ASP.NET data-protection system because it encrypts and decrypts sensitive data, including the authentication cookie. It's wonderful that data protection does this, but it also presents a problem. We would like to route traffic to multiple instances of our application, so any given request could be served by instance 1, instance 2 and so on.
How can we ensure the different instances of our app can read the authentication cookies regardless of the instance that produced them? How can we ensure that instance 1 can read cookies produced by instance 2 and vice versa? And for that matter, we'd like all instances to be able to read cookies whether they were produced by an instance in a production or staging slot.
We're aiming to avoid the use of "sticky sessions" and ARRAffinity cookies, which ensure that a user's traffic is always routed to the same instance. That's the opposite of what we want: it stops us from draining traffic away from an old instance and directing it to a new one.
With data protection active and multiple instances of your App Service, you immediately face the issue that different instances of the app will be unable to read cookies they did not create. This is the default behaviour of data protection. To quote the docs:
Data Protection relies upon a set of cryptographic keys stored in a key ring. When the Data Protection system is initialized, it applies default settings that store the key ring locally. Under the default configuration, a unique key ring is stored on each node of the web farm. Consequently, each web farm node can't decrypt data that's encrypted by an app on any other node.
The problem here is that the data protection keys (the key ring) are being stored locally on each instance. What are the implications of this? For example, instance 2 doesn't have access to the keys instance 1 is using, and so it can't decrypt instance 1's cookies.
Sharing is caring
What we need to do is move away from storing keys locally, and store them in a shared place instead. We're going to store the data protection keys in Azure Blob Storage and protect the keys with Azure Key Vault:
All instances of the application can then access the key ring, and consequently all instances can read one another's cookies. As the documentation attests, enabling this is fairly simple. It amounts to adding the following packages to your ASP.NET app:
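dotnet add package Azure.Extensions.AspNetCore.DataProtection.Blobs
dotnet add package Azure.Extensions.AspNetCore.DataProtection.Keys
dotnet add package Azure.Identity
These provide PersistKeysToAzureBlobStorage, ProtectKeysWithAzureKeyVault and DefaultAzureCredential respectively, all of which appear in the code below.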
And adding the following to ConfigureServices in your ASP.NET app:
services.AddDataProtection().SetApplicationName("OurWebApp")
// azure credentials require storage blob contributor role permissions
// eg https://my-storage-account.blob.core.windows.net/keys/key
.PersistKeysToAzureBlobStorage(new Uri($"https://{Configuration["StorageAccountName"]}.blob.core.windows.net/keys/key"), new DefaultAzureCredential())
// azure credentials require key vault crypto role permissions
// eg https://my-key-vault.vault.azure.net/keys/dataprotection
.ProtectKeysWithAzureKeyVault(new Uri($"https://{Configuration["KeyVaultName"]}.vault.azure.net/keys/dataprotection"), new DefaultAzureCredential());
In the above example you can see we're passing the name of our Storage account and Key Vault via configuration.
There's one more crucial piece of the puzzle here: role assignments, better known as permissions. Your App Service needs to be able to read and write to Azure Key Vault and Azure Blob Storage. The Storage Blob Data Contributor and Key Vault Crypto Officer roles are sufficient to enable this. (If you'd like to see what configuring that looks like via ARM templates, check out this post.)
With this in place we're able to route traffic to any instance of our application, secure in the knowledge that it will be able to read the cookies. Furthermore, we've enabled zero downtime releases as a direct consequence.
What about Easy Auth?
Our app uses a Linux App Service. It's worth knowing that Linux App Services run as Docker containers. As a consequence, Easy Auth works in a slightly different way: effectively as middleware. To quote the docs on Easy Auth:
This module handles several things for your app:
Authenticates users with the specified provider
Validates, stores, and refreshes tokens
Manages the authenticated session
Injects identity information into request headers
The module runs separately from your application code and is configured using app settings. No SDKs, specific languages, or changes to your application code are required.
The authentication and authorization module runs in a separate container, isolated from your application code. Using what's known as the Ambassador pattern, it interacts with the incoming traffic to perform similar functionality as on Windows.
This is really significant. You may well have "zero downtime deployment", but it doesn't amount to a hill of beans if, the moment you've deployed, your users find they're effectively logged out because the tokens Easy Auth was holding for them didn't survive the deployment. The advice from Microsoft is to use Blob Storage for the Token Cache:
you can provision an Azure Blob Storage container and configure your web app with a SAS URL (with read/write/list access) pointing to that blob container. This SAS URL can then be saved to the WEBSITE_AUTH_TOKEN_CONTAINER_SASURL app setting. When this app setting is present, all tokens will be stored in and fetched from the specified blob container.
To put that visually, what's suggested is that each instance's Easy Auth module reads and writes the token cache in a single shared Blob Storage container, rather than keeping it locally.
SaS-sy ARM Templates
I have the good fortune to work with some very talented people. One of them, John McCormick, turned his hand to putting this proposed solution into azure-pipelines.yml and ARM template-land. First of all, let's look at our azure-pipelines.yml. We add the following, prior to our deployment job:
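Here's a sketch of that job. The AzurePowerShell task and the Az cmdlets are real; the service connection, resource group, storage account and container names are placeholders:
jobs:
  - job: SASGen
    displayName: Generate SAS token URL for the token store
    steps:
      - task: AzurePowerShell@5
        name: GenerateSasUrl
        inputs:
          azureSubscription: our-service-connection
          azurePowerShellVersion: LatestVersion
          ScriptType: InlineScript
          Inline: |
            # grab a storage account key so we can mint a long-lived SAS
            $key = (Get-AzStorageAccountKey -ResourceGroupName "our-rg" -Name "ourstorageaccount")[0].Value
            $ctx = New-AzStorageContext -StorageAccountName "ourstorageaccount" -StorageAccountKey $key
            # a SAS URL with read / write / list permissions, expiring in 90 days
            $sasUrl = New-AzStorageContainerSASToken -Name "tokens" -Permission rwl `
              -ExpiryTime (Get-Date).AddDays(90) -Context $ctx -FullUri
            # surface the URL as a secret output variable for later jobs
            Write-Host "##vso[task.setvariable variable=sasUrl;issecret=true;isOutput=true]$sasUrl"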
In the SASGen job, a PowerShell script runs that generates a SAS token URL with read, write and list permissions, valid for 90 days. (Incidentally, there is a way to do this via ARM templates, and without PowerShell; but alas it didn't seem to work when we experimented with it.)
The generated (secret) token URL (sasUrl) is passed as a parameter to our App Service ARM template. The ARM template sets an appsetting for the app service:
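Inside the Microsoft.Web/sites resource, that looks something like this (the sasUrl parameter name is our own):
"siteConfig": {
  "appSettings": [
    {
      "name": "WEBSITE_AUTH_TOKEN_CONTAINER_SASURL",
      "value": "[parameters('sasUrl')]"
    }
  ]
}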
If you google WEBSITE_AUTH_TOKEN_CONTAINER_SASURL you will not find a great deal. Documentation is short. What you will find is Jeff Sanders' excellent blog on the topic. In terms of content it has some commonality with this post, except that in Jeff's example he's manually implementing the workaround in the Azure Portal.
What's actually happening?
With this in place, every time someone logs into your app a JSON token is written to the storage container.
If you take the trouble to look inside you'll find something like this tucked away:
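The exact contents depend on the identity provider. Purely as an illustration (the field names and values here are representative, not verbatim), an Azure AD entry looks along these lines:
{
  "id_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIs...",
  "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIs...",
  "refresh_token": "0.AAAA...",
  "expires_on": "2021-02-21T11:52:00.0000000Z",
  "provider_name": "aad",
  "user_id": "someone@ourdomain.com"
}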
With this in place, you can restart your App Service and / or deploy a new version, safe in the knowledge that the tokens will live on in the storage account, and that consequently you will not be logging users out.
I've been working recently on zero downtime deployments using Azure App Service. They're facilitated by a combination of Health checks and deployment slots. This post will talk about why this is important and how it works.
Why zero downtime deployments?
Historically (and for many applications, currently) deployment results in downtime: a period of time during the release when an application is not available to users whilst the new version is deployed. There are a number of downsides to releases with downtime:
Your users cannot use your application. This will frustrate them and make them sad.
Because you're a kind person and you want your users to be happy, you'll optimise to make their lives better. You'll release when the fewest users are accessing your application. It will likely mean you'll end up working late, early or at weekends.
Again, because you want to reduce impact on users, you'll release less often. This means that every release will bring with it a greater collection of changes. This in turn will often result in a large degree of focus on manually testing each release, to reduce the likelihood of bugs ending up in users' hands. This is a noble aim, but it drags the team's focus away from shipping.
Put simply: downtime in releases impacts customer happiness and leads to reduced pace for teams. It's a vicious circle.
But if we turn it around, what does it look like if releases have no downtime at all?
Your users can always use your application. This will please them.
Your team is now safe to release at any time, day or night. They will likely release more often as a consequence.
If your team has sufficient automated testing in place, they're now in a position where they can move to Continuous Deployment.
Releases become boring. This is good. They "just work™️" and so the team can focus instead on building the cool features that are going to make users' lives even better.
Manual zero downtime releases with App Services
App Services have the ability to scale out. To quote the docs:
A scale out operation is the equivalent of creating multiple copies of your web site and adding a load balancer to distribute the demand between them. When you scale out ... there is no need to configure load balancing separately since this is already provided by the platform.
As you can see, scaling out works by having multiple instances of your app. Deployment slots are exactly this, but with an extra twist. If you add a deployment slot to your App Service, then you no longer deploy to production. Instead you deploy to your staging slot. Your staging slot is accessible in the same way your production slot is accessible. So whilst your users may go to https://my-glorious-app.io, your staging slot may live at https://my-glorious-app-stage.azurewebsites.net instead. Because this is accessible, this is testable. You are in a position to test the deployed application before making it generally available.
Once you're happy that everything looks good, you can "swap slots". What this means is that the version of the app living in the staging slot gets moved into the production slot. So that which lived at https://my-glorious-app-stage.azurewebsites.net moves to https://my-glorious-app.io. For more details on what that involves, read this. The significant take home is this: there is no downtime. Traffic stops being routed to the old instance and starts being routed to the new one. It's as simple as that.
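If you're doing this by hand you can swap in the Azure Portal; it can also be scripted with the Azure CLI (resource group, app and slot names below are placeholders):
az webapp deployment slot swap \
  --resource-group our-resource-group \
  --name my-glorious-app \
  --slot stage \
  --target-slot production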
I should mention at this point that there are a number of zero downtime strategies out there, and slots can support several of them. This includes canary deployments, where a subset of traffic is routed to the new version prior to it being opened out more widely. In our case, we're looking at rolling deployments, where we replace the currently running instances of our application with new ones; but it's worth being aware that there are other strategies that slots can facilitate.
So what does it look like when slots swap? Well, to test that out, we swapped slots on our two App Service instances. We repeatedly curled our app's api/build endpoint, which exposes the build information, to get visibility of which version of our app traffic was being routed to. This is what we saw:
Thu Jan 21 11:51:51 GMT 2021
{"buildNumber":"20210121.5","buildId":"17992","commitHash":"c2122919df54bfa6a0d20bceb9f06890f822b26e"}
Thu Jan 21 11:51:54 GMT 2021
{"buildNumber":"20210121.6","buildId":"18015","commitHash":"062ac1488fcf1737fe1dbab0d05c095786218f30"}
Thu Jan 21 11:51:57 GMT 2021
{"buildNumber":"20210121.5","buildId":"17992","commitHash":"c2122919df54bfa6a0d20bceb9f06890f822b26e"}
Thu Jan 21 11:52:00 GMT 2021
{"buildNumber":"20210121.6","buildId":"18015","commitHash":"062ac1488fcf1737fe1dbab0d05c095786218f30"}
Thu Jan 21 11:52:03 GMT 2021
{"buildNumber":"20210121.6","buildId":"18015","commitHash":"062ac1488fcf1737fe1dbab0d05c095786218f30"}
Thu Jan 21 11:52:05 GMT 2021
{"buildNumber":"20210121.6","buildId":"18015","commitHash":"062ac1488fcf1737fe1dbab0d05c095786218f30"}
Thu Jan 21 11:52:08 GMT 2021
{"buildNumber":"20210121.5","buildId":"17992","commitHash":"c2122919df54bfa6a0d20bceb9f06890f822b26e"}
Thu Jan 21 11:52:10 GMT 2021
{"buildNumber":"20210121.6","buildId":"18015","commitHash":"062ac1488fcf1737fe1dbab0d05c095786218f30"}
Thu Jan 21 11:52:12 GMT 2021
{"buildNumber":"20210121.5","buildId":"17992","commitHash":"c2122919df54bfa6a0d20bceb9f06890f822b26e"}
Thu Jan 21 11:52:15 GMT 2021
{"buildNumber":"20210121.6","buildId":"18015","commitHash":"062ac1488fcf1737fe1dbab0d05c095786218f30"}
Thu Jan 21 11:52:17 GMT 2021
{"buildNumber":"20210121.6","buildId":"18015","commitHash":"062ac1488fcf1737fe1dbab0d05c095786218f30"}
The first new version of our application showed up in the production slot at 11:51:54, and the last old version showed up at 11:52:12. So it took a total of 18 seconds to complete the transition from hitting only instances of the old application to hitting only instances of the new one. During those 18 seconds, either old or new versions of the application would be serving traffic. Significantly, there was always a version of the application returning responses.
This is very exciting! We have zero downtime deployments!
Rollbacks for bonus points
We now have the new version of the app (buildNumber: 20210121.6) in the production slot, and the old version of the app (buildNumber: 20210121.5) in the staging slot.
Slots have a tremendous rollback story. If it emerges that there was some uncaught issue in your release and you'd like to revert to the previous version, you can! Just as we swapped to move buildNumber: 20210121.6 from the staging slot to the production slot (and buildNumber: 20210121.5 the other way), we can swap right back and revert our release:
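Swapping is symmetric: running exactly the same swap operation again moves the old build back into production. With the Azure CLI that would be (placeholder names as before):
# after the first swap, the old build lives in the stage slot;
# running exactly the same swap moves it back into production
az webapp deployment slot swap \
  --resource-group our-resource-group \
  --name my-glorious-app \
  --slot stage \
  --target-slot production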
This is also very exciting! We have zero downtime deployments and rollbacks!
Automated zero downtime releases with Health checks
The final piece of the puzzle here is automation. You're a sophisticated team; you've put a great deal of energy into automating your tests. You don't want your release process to be manual for this very reason; you trust your test coverage. You want to move to Continuous Deployment.
Fortunately, automating swapping slots is a breeze with azure-pipelines.yml. Consider the following:
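(A sketch: AzureWebApp@1 and AzureAppServiceManage@0 are the real Azure Pipelines tasks; the service connection, app, slot and resource group names, and the package path, are placeholders.)
jobs:
  - job: DeployToStageSlot
    displayName: Deploy web app to the stage slot
    steps:
      # fetch the web app artifact produced by the build stage
      - download: current
        artifact: webapp
      - task: AzureWebApp@1
        inputs:
          azureSubscription: our-service-connection
          appName: my-glorious-app
          deployToSlotOrASE: true
          resourceGroupName: our-resource-group
          slotName: stage
          package: $(Pipeline.Workspace)/webapp/*.zip
  - job: SwapSlots
    displayName: Swap the stage slot into production
    dependsOn: DeployToStageSlot
    steps:
      - task: AzureAppServiceManage@0
        inputs:
          azureSubscription: our-service-connection
          action: 'Swap Slots'
          webAppName: my-glorious-app
          resourceGroupName: our-resource-group
          sourceSlot: stage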
The first job here deploys our previously built webapp to the stage slot. The second job swaps the slots.
When I first considered this, the question rattling around in the back of my mind was this: how does App Service know when it's safe to swap? What if we swap before our app has fully woken up and started serving responses?
It so happens that, using Health checks, App Service caters for this beautifully. A health check endpoint is a URL in your application which, when hit, checks the dependencies of your application. "Is the database accessible?" "Are the APIs I depend upon accessible?" The diagram in the docs expresses it very well: the load balancer only routes traffic to instances whose health check endpoint reports that they're healthy.
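In ARM template terms, we add the healthCheckPath property to the site configuration (the property is real; the path itself is our choice):
"siteConfig": {
  "healthCheckPath": "/healthcheck"
}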
This tells App Service where to look to check the health. The health check endpoint itself is provided by MapHealthChecks in the Startup.cs of our .NET application:
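A minimal sketch of the wiring, using ASP.NET Core's standard health check APIs (the /healthcheck path is assumed, to match the healthCheckPath above):
public void ConfigureServices(IServiceCollection services)
{
    services.AddControllers();
    // register health checks; checks of databases, downstream APIs
    // and the like can be chained on here
    services.AddHealthChecks();
}

public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    app.UseRouting();
    app.UseEndpoints(endpoints =>
    {
        endpoints.MapControllers();
        // the endpoint App Service pings to establish instance health
        endpoints.MapHealthChecks("/healthcheck");
    });
}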
You can read a full list of all the ways App Service uses Health checks here. Pertinent for zero downtime deployments is this:
when scaling up or out, App Service pings the Health check path to ensure new instances are ready.
This is the magic sauce. App Service doesn't route traffic to an instance until it's given the thumbs up that it's ready in the form of passing health checks. This is excellent; it is this that makes automated zero downtime releases a reality.
Props to the various Azure teams that have made this possible; I'm very impressed by the way in which the Health checks and slots can be combined together to support some tremendous use cases.
If you're deploying to Azure, there's a good chance you're using ARM templates to do so. Once you've got past "Hello World", you'll probably find yourself in a situation when you're deploying multiple types of resource to make your solution. For instance, you may be deploying an App Service alongside Key Vault and Storage.
One of the hardest things when it comes to deploying software and having it work is permissions. Without adequate permissions configured, the most beautiful code can do nothing. Incidentally, this is a good thing. We're deploying to the web; many people are there, not all good. As a different kind of web-head once said:
Access management for cloud resources is critical for any organization that uses the cloud. Azure role-based access control (Azure RBAC) helps you manage who has access to Azure resources, what they can do with those resources, and what areas they have access to.
Designating groups or individual roles responsible for specific functions in Azure helps avoid confusion that can lead to human and automation errors that create security risks. Restricting access based on the need to know and least privilege security principles is imperative for organizations that want to enforce security policies for data access.
This is good advice. With that in mind, how can we ensure that the different resources we're deploying to Azure can talk to one another?
Role (up for your) assignments
The answer is roles. There are a number of roles in Azure that can be assigned to users, groups, service principals and managed identities. In our own case we're using a managed identity for our resources. What we can do is use "role assignments" to give our managed identity access to given resources. Arturo Lucatero gives a great short explanation of this.
Whilst this explanation is delightfully simple, the actual implementation when it comes to ARM templates is a little more involved. Because now it's time to talk "magic" GUIDs. Consider the following truncated ARM template, which gives our managed identity (and hence our App Service which uses this identity) access to Key Vault and Storage:
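The variables in question look like this (the GUIDs are the ids of built-in Azure roles; the variable names are our own):
"variables": {
  "storageBlobDataContributor": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'ba92f5b4-2d11-453d-a403-e96b0029c9fe')]",
  "keyVaultSecretsOfficer": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b86a8fe4-44ce-4948-aee5-eccb2c155cd7')]",
  "keyVaultCryptoOfficer": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '14b46e9e-c2b7-41b4-b07b-48a6ebf60603')]"
}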
The three variables above contain the subscription resource ids for the roles Storage Blob Data Contributor, Key Vault Secrets Officer and Key Vault Crypto Officer. The first question on your mind is likely: "what is ba92f5b4-2d11-453d-a403-e96b0029c9fe and where does it come from?" Great question! Each of these GUIDs represents a built-in role in Azure RBAC; ba92f5b4-2d11-453d-a403-e96b0029c9fe represents the Storage Blob Data Contributor role. You can confirm this with Azure PowerShell, looking a role definition up by its id:
Get-AzRoleDefinition | ? {$_.id -eq "ba92f5b4-2d11-453d-a403-e96b0029c9fe" }
Name : Storage Blob Data Contributor
Id : ba92f5b4-2d11-453d-a403-e96b0029c9fe
IsCustom : False
Description : Allows for read, write and delete access to Azure Storage blob containers and data
Actions : {Microsoft.Storage/storageAccounts/blobServices/containers/delete, Microsoft.Storage/storageAccounts/blobServices/containers/read,
Microsoft.Storage/storageAccounts/blobServices/containers/write, Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action}
NotActions : {}
DataActions : {Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete, Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read,
Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write, Microsoft.Storage/storageAccounts/blobServices/containers/blobs/move/action…}
NotDataActions : {}
AssignableScopes : {/}
Or by name like so:
Get-AzRoleDefinition | ? {$_.name -like "*Crypto Officer*" }
Name : Key Vault Crypto Officer
Id : 14b46e9e-c2b7-41b4-b07b-48a6ebf60603
IsCustom : False
Description : Perform any action on the keys of a key vault, except manage permissions. Only works for key vaults that use the 'Azure role-based access control' permission model.
Actions : {Microsoft.Authorization/*/read, Microsoft.Insights/alertRules/*, Microsoft.Resources/deployments/*, Microsoft.Resources/subscriptions/resourceGroups/read…}
NotActions : {}
DataActions : {Microsoft.KeyVault/vaults/keys/*}
NotDataActions : {}
AssignableScopes : {/}
As you can see, the Actions section of the output above (and in even more detail on the linked article) provides information about what the different roles can do. So if you're looking to enable one Azure resource to talk to another, you should be able to refer to these to identify a role that you might want to use.
Creating a role assignment
So now we understand how you identify the roles in question, let's take the final leap and look at assigning those roles to our managed identity. For each role assignment, you'll need a roleAssignments resource defined that looks like this:
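For the Key Vault case, here's a sketch shaped to match the walkthrough below (the parameter names and the guid() seed values are our own placeholders):
{
  "type": "Microsoft.KeyVault/vaults/providers/roleAssignments",
  "apiVersion": "2020-04-01-preview",
  "name": "[concat(parameters('keyVaultName'), '/Microsoft.Authorization/', guid(resourceGroup().id, 'keyVaultCryptoOfficer'))]",
  "dependsOn": [
    "[resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', parameters('managedIdentityName'))]"
  ],
  "properties": {
    "roleDefinitionId": "[variables('keyVaultCryptoOfficer')]",
    "principalId": "[reference(resourceId('Microsoft.ManagedIdentity/userAssignedIdentities', parameters('managedIdentityName')), '2018-11-30').principalId]",
    "scope": "[resourceId('Microsoft.KeyVault/vaults', parameters('keyVaultName'))]",
    "principalType": "ServicePrincipal"
  }
}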
Let's go through the above, significant property by significant property (it's also worth checking the official reference here):
type - the type of role assignment we want to create, for a key vault it's "Microsoft.KeyVault/vaults/providers/roleAssignments", for storage it's "Microsoft.Storage/storageAccounts/providers/roleAssignments". The pattern is that it's the resource type, followed by "/providers/roleAssignments".
dependsOn - before we can create a role assignment, the service principal we want to permission (in our case a managed identity) needs to exist
properties.roleDefinitionId - the role that we're assigning, provided as an id. For this example it's the keyVaultCryptoOfficer variable, which was earlier defined as [subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '14b46e9e-c2b7-41b4-b07b-48a6ebf60603')]. (Note the use of the GUID)
properties.principalId - the id of the principal we're adding permissions for. In our case this is a managed identity (a type of service principal).
properties.scope - we're modifying another resource; our key vault isn't defined in this ARM template and we want to specify the resource we're granting permissions to.
properties.principalType - the type of principal that we're creating an assignment for; in our case this is "ServicePrincipal" - our managed identity.
There is an alternate approach that you can use where the type is "Microsoft.Authorization/roleAssignments". Whilst this also works, it displayed errors in the Azure tooling for VS Code. As such, we've opted not to use that approach in our ARM templates.
Many thanks to the awesome John McCormick who wrangled permissions with me until we bent Azure RBAC to our will.