A Concise Guide to Transient Fault Handling Application Block

Transient errors are intermittent errors caused by a short-lived outage of a specific resource or service. In most cases, if you retry the operation after a few seconds, the error disappears. Transient errors are often beyond the control of the application programmer. However, every attempt should be made to make the application robust enough to handle them.

 

In my previous post, Transient Errors Are Evil – How to Handle them With Exponential Backoff in C#, we looked at simple retry logic that retried transient errors with exponential back-off. In this post, we’re going to tackle the same problem in a fancier way using the Transient Fault Handling Application Block, a.k.a. Topaz. Looking at the Topaz information page on MSDN, one might get the idea that you can only use Topaz against Azure cloud services. This is not true, and in this post we’re going to see how Topaz can be used to tackle transient faults in ANY application.

 

Components of Transient Fault Handling Block

The Transient Fault Handling Application Block makes your application more robust by providing the logic for handling transient faults. This logic is provided by specifying two things – a detection strategy and a retry strategy.

Detection Strategy: The detection strategy identifies which errors are transient and hence should be retried. This is typically done by creating a class which implements the ITransientErrorDetectionStrategy interface. For example, in the case of a WebRequest, I might want to retry all web exceptions, or only web exceptions which return a specific error code. Note that if you’re using Azure services, you should probably use one of the built-in detection strategies.
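To illustrate the "specific error code" case, here is a minimal sketch of a more selective detection strategy. The class name SelectiveWebDetectionStrategy and the particular list of statuses are my own assumptions, not something the block prescribes. The interface is redeclared locally only so the snippet compiles on its own; in a real project you would reference the one that ships with the block.

```csharp
using System;
using System.Net;

// Stand-in for the block's interface, redeclared here (assumption) so the
// sketch is self-contained. Use the library's own interface in real code.
interface ITransientErrorDetectionStrategy
{
    bool IsTransient(Exception ex);
}

// Hypothetical strategy: only some web failures are considered worth retrying.
class SelectiveWebDetectionStrategy : ITransientErrorDetectionStrategy
{
    public bool IsTransient(Exception ex)
    {
        var webEx = ex as WebException;
        if (webEx == null)
            return false; // not a web failure at all

        // Failures that never produced an HTTP response.
        if (webEx.Status == WebExceptionStatus.Timeout ||
            webEx.Status == WebExceptionStatus.ConnectFailure)
            return true;

        var response = webEx.Response as HttpWebResponse;
        if (response == null)
            return false;

        // Illustrative choice of "try again later" status codes.
        switch (response.StatusCode)
        {
            case HttpStatusCode.RequestTimeout:     // 408
            case HttpStatusCode.BadGateway:         // 502
            case HttpStatusCode.ServiceUnavailable: // 503
            case HttpStatusCode.GatewayTimeout:     // 504
                return true;
            default:
                return false;
        }
    }
}

static class DetectionDemo
{
    static void Main()
    {
        var strategy = new SelectiveWebDetectionStrategy();
        Console.WriteLine(strategy.IsTransient(new WebException("timed out", WebExceptionStatus.Timeout)));
        Console.WriteLine(strategy.IsTransient(new InvalidOperationException("bug")));
    }
}
```

Which codes count as transient depends on the service you are calling, so treat the switch above as a starting point rather than a recommendation.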

Retry Strategy: The retry strategy specifies how many times to retry the failed operation and at what intervals. The built-in retry strategies allow you to specify that retries should happen at fixed intervals, at intervals that increase by the same amount each time, or at intervals that increase exponentially but with some random variation. The following table from MSDN shows examples of all three strategies.

[Table: Retry Strategy examples, from MSDN (image not reproduced)]
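To make the three interval shapes concrete, here is a small sketch that just computes the wait times. The helper names (Fixed, Incremental, Doubling) are hypothetical and are not the block's API, and the exponential variant is plain doubling without the random variation the block adds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class IntervalShapes
{
    // Same wait before every retry.
    public static List<TimeSpan> Fixed(int retries, TimeSpan interval)
    {
        return Enumerable.Repeat(interval, retries).ToList();
    }

    // Wait grows by a constant increment each time.
    public static List<TimeSpan> Incremental(int retries, TimeSpan initial, TimeSpan increment)
    {
        return Enumerable.Range(0, retries)
            .Select(i => initial + TimeSpan.FromTicks(increment.Ticks * i))
            .ToList();
    }

    // Wait doubles each time (no jitter in this simplified sketch).
    public static List<TimeSpan> Doubling(int retries, TimeSpan initial)
    {
        return Enumerable.Range(0, retries)
            .Select(i => TimeSpan.FromTicks(initial.Ticks * (1L << i)))
            .ToList();
    }

    static void Main()
    {
        var two = TimeSpan.FromSeconds(2);
        // Each line prints the waits in seconds for four retries.
        Console.WriteLine(string.Join(", ", Fixed(4, two).Select(t => t.TotalSeconds)));            // 2, 2, 2, 2
        Console.WriteLine(string.Join(", ", Incremental(4, two, two).Select(t => t.TotalSeconds))); // 2, 4, 6, 8
        Console.WriteLine(string.Join(", ", Doubling(4, two).Select(t => t.TotalSeconds)));         // 2, 4, 8, 16
    }
}
```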

Finally, the detection strategy and retry strategy are combined into a retry policy, which in turn determines which errors should be retried and how often they should be retried.

RETRY POLICY = DETECTION STRATEGY + RETRY STRATEGY
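Conceptually, the equation above boils down to a loop like the following. This is a hand-rolled sketch to show the idea, not how the block is implemented: RetrySketch.Execute is a hypothetical helper, the detection strategy is reduced to a predicate, and the retry strategy is reduced to a list of delays.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class RetrySketch
{
    // isTransient plays the role of the detection strategy,
    // delays plays the role of the retry strategy.
    public static void Execute(Action action, Func<Exception, bool> isTransient, IEnumerable<TimeSpan> delays)
    {
        using (var schedule = delays.GetEnumerator())
        {
            while (true)
            {
                try
                {
                    action();
                    return; // success, stop retrying
                }
                // Only swallow the error if it is transient AND a retry is left;
                // otherwise the exception propagates to the caller.
                catch (Exception ex) when (isTransient(ex) && schedule.MoveNext())
                {
                    Thread.Sleep(schedule.Current);
                }
            }
        }
    }

    static void Main()
    {
        int attempts = 0;
        var delays = new[] { TimeSpan.FromMilliseconds(10), TimeSpan.FromMilliseconds(20), TimeSpan.FromMilliseconds(40) };

        // Simulated operation: fails twice with a timeout, then succeeds.
        Execute(
            () => { attempts++; if (attempts < 3) throw new TimeoutException("simulated"); },
            ex => ex is TimeoutException,
            delays);

        Console.WriteLine("Succeeded after " + attempts + " attempts");
    }
}
```

Keeping the two inputs separate is the whole point: you can swap the predicate without touching the delays, and vice versa.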

Step By Step Guide To Using The Transient Fault Handling Application Block

Step # 1: Add the Transient Fault Handling Application Block to Your Solution – follow the MSDN guidelines for obtaining and installing the NuGet package.

Step # 2: Define the Detection Strategy

In our case, we'll retry on any web exception. To do this, we define our own class which implements the ITransientErrorDetectionStrategy interface.

class WebExceptionDetectionStrategy : ITransientErrorDetectionStrategy
{
    public bool IsTransient(Exception ex)
    {
        if (ex is WebException)
            return true;
        return false;
    }
}

Step # 3: Define the Retry Strategy

We’ll define our retry strategy such that it uses a randomized exponential back-off algorithm and retries 3 times.

var retryStrategy = new ExponentialBackoff(3, TimeSpan.FromSeconds(2),
                        TimeSpan.FromSeconds(20), TimeSpan.FromSeconds(1));

The first parameter specifies the number of retries before failing the operation. The next two parameters specify the minimum and maximum back-off time respectively. Finally, the last parameter is the delta used to grow the delay between retries; it is randomized by +/- 20% to avoid numerous clients all retrying simultaneously.
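To see how these four values interact, here is a sketch of the calculation as I understand it. ComputeDelay is a hypothetical helper written for this post, and the formula is an approximation based on my reading of the block's behaviour, so treat it as an assumption rather than the library's actual source.

```csharp
using System;

static class BackoffSketch
{
    // attempt is zero-based: 0 is the first retry.
    public static TimeSpan ComputeDelay(int attempt, TimeSpan min, TimeSpan max, TimeSpan delta, Random rng)
    {
        // Pick a jittered delta between 80% and 120% of the configured value.
        double jittered = rng.Next(
            (int)(delta.TotalMilliseconds * 0.8),
            (int)(delta.TotalMilliseconds * 1.2));

        // Exponential growth factor: 0, 1, 3, 7, 15, ...
        double growth = (Math.Pow(2.0, attempt) - 1.0) * jittered;

        // Start from the minimum and never exceed the maximum.
        double ms = Math.Min(min.TotalMilliseconds + growth, max.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }

    static void Main()
    {
        var rng = new Random();
        for (int attempt = 0; attempt < 6; attempt++)
        {
            var delay = ComputeDelay(attempt, TimeSpan.FromSeconds(2),
                TimeSpan.FromSeconds(20), TimeSpan.FromSeconds(1), rng);
            Console.WriteLine("Retry {0}: wait about {1} ms", attempt + 1, delay.TotalMilliseconds);
        }
    }
}
```

With the values used in this post, the delay starts at the 2 second minimum, grows with each retry, and is capped at 20 seconds. The exact numbers differ from run to run because of the jitter.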

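Step # 4:  Combine the retry strategy and detection strategy into a retry policy

The generic RetryPolicy<T> takes the detection strategy as its type argument and creates an instance of it, so the detection strategy class needs a parameterless constructor.

var retryPolicy = new RetryPolicy<WebExceptionDetectionStrategy>(retryStrategy);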

Step # 5: Call the ExecuteAction method on the retryPolicy object with your custom operation passed in as a delegate.

retryPolicy.ExecuteAction(() => ExecuteHTTPGET("https://microsoft.sharepoint.com"));
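If you want to see what the policy is doing, you can subscribe to its Retrying event. The sketch below needs the Transient Fault Handling NuGet package to compile. The simulated failure, the RetryingEventDemo class, and the short FixedInterval delays are my own choices to keep the demo quick and offline.

```csharp
using System;
using System.Net;
using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;

class WebExceptionDetectionStrategy : ITransientErrorDetectionStrategy
{
    public bool IsTransient(Exception ex)
    {
        return ex is WebException;
    }
}

static class RetryingEventDemo
{
    public static int Run()
    {
        // Three retries, 50 ms apart.
        var strategy = new FixedInterval(3, TimeSpan.FromMilliseconds(50));
        var policy = new RetryPolicy<WebExceptionDetectionStrategy>(strategy);

        // Raised before each retry; handy for logging.
        policy.Retrying += (sender, args) =>
            Console.WriteLine("Retry {0} in {1} ms after: {2}",
                args.CurrentRetryCount, args.Delay.TotalMilliseconds, args.LastException.Message);

        int attempts = 0;

        // This overload of ExecuteAction returns whatever the delegate returns.
        return policy.ExecuteAction(() =>
        {
            attempts++;
            if (attempts < 3)
                throw new WebException("simulated outage"); // fail the first two attempts
            return attempts;
        });
    }

    static void Main()
    {
        Console.WriteLine("Succeeded on attempt " + Run());
    }
}
```

Logging from the Retrying handler is also a cheap way to notice when retries start happening more often than they should.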

Putting it all together

The complete code listing is given below for reference:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Net;
using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;

namespace TOPAZ
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                // Define the Retry Strategy
                var retryStrategy = new ExponentialBackoff(3, TimeSpan.FromSeconds(2),
                    TimeSpan.FromSeconds(20), TimeSpan.FromSeconds(1));

                // Define the Retry Policy (detection strategy + retry strategy)
                var retryPolicy = new RetryPolicy<WebExceptionDetectionStrategy>(retryStrategy);

                // Execute the Action
                retryPolicy.ExecuteAction(() => ExecuteHTTPGET("https://microsoft.sharepoint.com"));
               
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
                throw;
            }

        }

        // HTTP GET operation we want to retry multiple times
        static void ExecuteHTTPGET(string requestUri)
        {
            // Print a timestamp so the gap between attempts is visible
            Console.WriteLine(DateTime.Now);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUri);
            request.KeepAlive = false;
            request.Method = "GET";

            // GetResponse throws a WebException on failure, which
            // WebExceptionDetectionStrategy classifies as transient
            using (HttpWebResponse webResponse = (HttpWebResponse)request.GetResponse())
            {
                int requestStatus = (int)webResponse.StatusCode;
            }
        }

    }

    //The Detection Strategy
    class WebExceptionDetectionStrategy : ITransientErrorDetectionStrategy
    {
        public bool IsTransient(Exception ex)
        {
            if (ex is WebException)
                return true;
            return false;
        }
    }
}

When To Use Topaz Instead of Custom Retry Logic?

TOPAZ is preferable to custom retry logic whenever you're using Azure services or want a clear separation between your detection strategy and retry strategy. There's a little bit of overhead compared to custom retry logic, but the code clarity and the built-in detection and retry strategies make it worthwhile.


  • itellu3times

    Hi Deb, this looks like a great article. I’m a database guy just getting into Azure, and wondering why our app has service glitches when we change Azure service levels. Does that qualify as the kind of transient error that TOPAZ is meant to solve, all in current technologies? Thanks.

    • Hi there! Yes, TOPAZ is meant to solve these exact issues in the client application. However, you need to be judicious in its use because sometimes it can mask deeper issues in the service causing silent data loss.

      What we usually do is have TOPAZ or a similar tech in our client applications, but also keep monitors on the service instance to capture spikes in 401/404/502 HTTP error codes. Then we’ll root-cause and fix the bugs causing the spikes in service errors. You can probably use APP INSIGHTS or SCOM for monitoring purposes, depending on your architecture. Please let me know if you have additional questions.

      • itellu3times

        Well, again, I’m not really up to speed on the C# side, but I talked to our app guys and looked at the code, and we are using this:

        public class MyDBConfiguration : DbConfiguration
        {
            public MyDBConfiguration()
            {
                SetExecutionStrategy("System.Data.SqlClient", () => SuspendExecutionStrategy
                    ? (IDbExecutionStrategy)new DefaultExecutionStrategy()
                    : new SqlAzureExecutionStrategy(4, TimeSpan.FromSeconds(3)));
            } // Constructor

            public static bool SuspendExecutionStrategy
            {
                get { return (bool?)CallContext.LogicalGetData("SuspendExecutionStrategy") ?? false; }
                set { CallContext.LogicalSetData("SuspendExecutionStrategy", value); }
            } // Suspend
        } // class
        (forgive the formatting, Disqus isn’t meant for this!)
        The point is that our code sometimes *suspends* the retry, and I’m not clear why. I asked the app guys working on it, and they just shrugged. I’ve put an item on our backlog to look into it seriously, for *them* to look into it who might pick it up faster than me – but I see you also talk about using it judiciously. I’m only concerned about its use for the database connections, btw.
        Thanks.

        • From the code you posted, it seems the issue is that at certain times SuspendExecutionStrategy returns true and you set the execution strategy to DefaultExecutionStrategy – I’m not sure what that will do. Seems like SqlAzureExecutionStrategy won’t even kick in sometimes.

          If I were you, I’d put logging around two things:
          1. What makes SuspendExecutionStrategy return true or false?
          2. When you see retries getting suspended, which execution strategy is in effect – the SqlAzureExecutionStrategy or DefaultExecutionStrategy?

          • itellu3times

            Deb, that sounds good, thanks, I will follow up on that. Tell me again why anyone would want to suspend it (“use it judiciously”) on the SQL side? That would suspend retries. Offhand, I can’t think of why we would want or need to do that? A question I should ask my app devs, too. Thanks.

          • The primary reasoning behind backoff is to not overwhelm the server with excessive requests and give it some breathing room to recover from resource constrained situations. Without knowing too much about your app, it’s hard for me to say why it was built this way – but you’re right, it’s worth asking the app devs the reasoning behind coding it up this way – i.e., did they have a well thought out scenario they were trying to guard against or was it just following some best practices (sometimes, you need to take best practices with a grain of salt 🙂 ).

          • itellu3times

            Well, the strategies already do backoff. Nothing very special about our app.