
Who hasn't heard the term resilience? During the pandemic we have probably heard it even more often, as it has been associated with the recovery plans of various countries after this dark period.

But what do we mean by resilience? In science, this word indicates the ability of a material to withstand breaking under dynamic stress and, for some materials, also the ability to return to its original shape after deformation.

It is not surprising, then, that this term is also used in the software field. INCOSE (International Council on Systems Engineering), which promotes systems engineering through conferences, working groups and publications, provides this definition:

Resilience is the ability to provide required capability in the face of (avoiding, withstanding, recovering from, and evolving and adapting to) adversity.

In cloud microservices applications or, more generally, in distributed applications, where services continuously communicate with each other or with external resources, transient errors are very frequent. These are errors due to a loss of connection or the temporary unavailability of a service. By their nature, a subsequent request or operation may well succeed.

One way to make these systems resilient, robust, and therefore stable is the Retry pattern. This pattern simply consists of repeating an operation after a failure.

The pattern offers different strategies depending on the type of error encountered:

  • Cancel – The error is not transient, so retrying the failed operation would be useless. Stop the operation and report the error, for example by throwing an exception, or handle it directly.  
  • Retry – The error is rare or unusual. It is very likely that repeating the operation will succeed.  
  • Retry after delay – The error is a known one, due to connection problems or a busy service. It may be necessary to wait before repeating the operation.
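The three strategies above can be sketched in plain C# as follows; the exception types and the single immediate retry are illustrative assumptions, not a prescription:

```csharp
using System;
using System.IO;
using System.Threading;

class StrategyDemo
{
    static void Operation() => Console.WriteLine("Doing work...");

    static void Main()
    {
        try
        {
            Operation();
        }
        catch (ArgumentException)   // non-transient: the input will not get better
        {
            throw;                  // Cancel: stop and propagate the error
        }
        catch (TimeoutException)    // rare or unusual error
        {
            Operation();            // Retry: repeat the operation immediately
        }
        catch (IOException)         // known connection/busy error
        {
            Thread.Sleep(2000);     // Retry after delay: wait before trying again
            Operation();
        }
    }
}
```

In practice, a library like Polly encapsulates exactly this kind of dispatch so you don't have to hand-roll it.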

While working on one of our customers' systems, in which several services communicate through messages exchanged via RabbitMQ, we repeatedly had to intervene in the retry mechanism present in their software. This mechanism had been implemented in-house by the team and was not easy to maintain. We therefore decided, together with the team, to refactor the code and introduce an existing library designed to help developers with error handling and to make applications resilient through different policies. This library is called Polly, and it is officially recognized by Microsoft.


Polly offers several resilience policies, which you can find in its documentation, but today we will focus on Retry.

Diagram: how the Polly Retry policy works.
Source: https://github.com/App-vNext/Polly/wiki/Retry#how-polly-retry-works

To better show how Polly works, an example may be helpful. Let’s take the code from one of the first articles on RabbitMQ we published on our blog. 

In particular, let’s take the Sender code, a software that sends messages through a RabbitMQ Message Broker. 

To simulate a transient error we will stop the container on which the RabbitMQ broker is running and we will restart it immediately afterwards.  

Let's start with the base code, to which I made some changes. For illustrative purposes, I created an array of strings (the messages we will send), filled by the FillMessages() method. I also extracted into a separate method the code that uses the RabbitMQ client to publish messages. To avoid losing messages if the broker stops temporarily, we declare the queue as durable and publish the messages as persistent.

using System;
using System.Text;
using System.Threading;
using RabbitMQ.Client;

namespace Sender
{
    class Sender
    {
        public static void Main()
        {
            var factory = new ConnectionFactory() { HostName = "localhost" };
            using (var connection = factory.CreateConnection())
            using (var channel = connection.CreateModel())
            {
                // Durable queue: survives a broker restart.
                channel.QueueDeclare(queue: "QueueDemo",
                                     durable: true,
                                     exclusive: false,
                                     autoDelete: false,
                                     arguments: null);

                var messages = new string[100];
                FillMessages(messages);
                foreach (var message in messages)
                {
                    PublishMessage(message, channel);
                    Thread.Sleep(2000);
                }
            }

            Console.WriteLine(" Press [enter] to exit.");
            Console.ReadLine();
        }

        private static void FillMessages(string[] arrayToFill)
        {
            for (var j = 0; j < arrayToFill.Length; j++)
            {
                arrayToFill[j] = $"Message {j}";
            }
        }

        private static void PublishMessage(string message, IModel channel)
        {
            var body = Encoding.UTF8.GetBytes(message);

            // Persistent messages (DeliveryMode = 2) are written to disk
            // so they survive a broker restart.
            var properties = channel.CreateBasicProperties();
            properties.Persistent = true;

            channel.BasicPublish(exchange: "",
                                 routingKey: "QueueDemo",
                                 basicProperties: properties,
                                 body: body);
            Console.WriteLine($"Sent {message}");
        }
    }
}

To use Polly in one of our applications, there are some steps to follow. 

Step 1) Specify the exception(s) that the policy must handle 

If we launched the Sender and stopped the container on which RabbitMQ is running while sending messages, we would receive this type of exception: 

Unhandled Exception: RabbitMQ.Client.Exceptions.AlreadyClosedException: Already closed: The AMQP operation was interrupted 

We then define a policy to handle this exception using the Handle() method.

var retryPolicy = Policy.Handle<AlreadyClosedException>();

Step 2) Specify how the policy should handle the error

Suppose, therefore, that the loss of connection to the broker is a known problem or, in any case, a transient error that can be managed with a Retry strategy, as explained in the previous paragraph. Polly offers several ways to define how to retry an operation. The simplest is the Retry() method, to which we can optionally pass an integer indicating how many times to repeat the operation. Alternatively, we can keep retrying indefinitely with the RetryForever() method.
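As a minimal sketch of the two variants (reusing the AlreadyClosedException from Step 1; the retry count of 3 is arbitrary):

```csharp
// Retry the operation up to 3 times, then let the exception propagate.
var retryThreeTimes = Policy
    .Handle<AlreadyClosedException>()
    .Retry(3);

// Retry the operation indefinitely until it succeeds.
var retryForever = Policy
    .Handle<AlreadyClosedException>()
    .RetryForever();
```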

As we said earlier, when faced with a common connection error, waiting a certain period of time before trying again can be an excellent strategy. Furthermore, when failures keep occurring, we can increase the delay between attempts incrementally or exponentially, giving the service or resource time to become available again. With Polly we can implement this type of retry with the WaitAndRetry() method. Among its parameters, this method accepts the number of attempts to make and the time interval to wait between attempts; it is also possible to specify logic to run before each retry.
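For example, a sketch of WaitAndRetry() with a fixed two-second pause between attempts might look like this (the retry count, delay, and log message are illustrative):

```csharp
var waitAndRetryPolicy = Policy
    .Handle<AlreadyClosedException>()
    .WaitAndRetry(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(2),
        onRetry: (exception, delay) =>
            Console.WriteLine($"Waiting {delay.TotalSeconds}s before retrying..."));
```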

The delay between one attempt and another can be defined exponentially (exponential backoff). This definition allows you to have intervals between attempts that increase as the number of failures increases, allowing the service to have more time to get back to working properly. Using the WaitAndRetry() method provided by Polly, we can indicate a base that will be raised to a power equal to the number of attempts made at each retry.  

var retryPolicy = Policy.Handle<AlreadyClosedException>()
    .WaitAndRetry(5,
        retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)),
        (exception, timeSpan, retryCount, context) =>
        {
            Console.WriteLine($"This is retry attempt {retryCount}");
        });

Since the exponential backoff is a fixed progression, in scenarios where the throughput is really high, it is also possible to introduce a jitter (random amount of time) to avoid load peaks and thus introduce a certain randomness in the delay calculation. 
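A simple way to add jitter is to add a small random offset to each computed interval. This is a hand-rolled sketch (for production use, the Polly.Contrib.WaitAndRetry package offers more refined jitter strategies):

```csharp
var jitterer = new Random();
var retryWithJitter = Policy
    .Handle<AlreadyClosedException>()
    .WaitAndRetry(5, retryAttempt =>
        TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))        // exponential backoff
        + TimeSpan.FromMilliseconds(jitterer.Next(0, 1000)));  // plus up to 1s of jitter
```

Spreading the delays out like this prevents many clients, all failing at the same moment, from retrying in lockstep and hammering the service again simultaneously.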

Step 3) Run the code through the policy

Once the policy has been defined, it's time to execute it. The Execute() method wraps the piece of code we are interested in, that is, the code that can throw the exception we want to handle. In our case, that is the PublishMessage() method.

retryPolicy.Execute(() => PublishMessage(message, channel));

Let's run the simulation:

If we perform a Get Messages through the RabbitMQ Management UI, we can see that the messages sent after the retry have also arrived correctly.

As already mentioned, this strategy is very useful for dealing with transient errors, or in general when the errors you want to handle are temporary. However, it is not effective against long-lasting errors. Furthermore, this approach should not be seen as an alternative to scalability, because it does not solve load problems: if you have a single broker and a very high number of requests, the retry strategy can help re-forward failed messages to the broker, but it does not mean you can get by with a system with fewer resources, and the same goes for performance.

In addition to reactive approaches such as the Retry strategy or the Circuit Breaker, Polly also offers proactive strategies oriented towards stability, or preventive resilience techniques. It is also possible to use Polly together with the ASP.NET Core HttpClientFactory (since version 2.1) to apply policies to HTTP calls.
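As a hedged sketch of that integration (it requires the Microsoft.Extensions.Http.Polly package; the client name and retry parameters here are illustrative):

```csharp
// In ConfigureServices: attach a retry policy to every request
// made through the named HttpClient.
services.AddHttpClient("resilient-client")
        .AddTransientHttpErrorPolicy(builder =>
            builder.WaitAndRetryAsync(3, attempt =>
                TimeSpan.FromSeconds(Math.Pow(2, attempt))));
```

AddTransientHttpErrorPolicy pre-configures the policy to handle HttpRequestException, HTTP 5xx responses, and HTTP 408, which covers the most common transient HTTP failures.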

I leave you the link to the repository where you can find the example code. 

See you at the next article!