Measuring Non-Functional Risk

There is a chance that tomorrow, when you set out in your car for work, you will be struck by a bus, a truck or another car, or that a tree might jump out in front of you. The chance increases if you forget to put in your contacts, or decide to talk on your phone whilst driving. Take a train and the risk is reduced significantly.

The same applies to computer system stability. Implement any sort of change to a computer system and the chance that the system will become unstable, and possibly unavailable, increases. Perform some thorough testing and planning and the chance decreases. At face value, risk management sounds relatively straightforward. Unfortunately, it is most often compromised by the “Bermuda Triangle” of software development: Time, Cost and Quality. The triangle works on the principle that as more emphasis is placed on one element, less is placed on the others.

When a solution is designed, there are typically functional requirements that must be met. Non-functional requirements, however, are often overlooked or left until far too late in the process. These non-functional requirements relate to important issues such as security, performance, capacity and availability.

Consider a web page that displays a Product List. A non-functional requirement may be that “the Product List page must be displayed for 99.75% of requests”. This defines our requirement for availability, but we still need to define what “available” means in functional terms.
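To make the 99.75% figure concrete, here is a minimal Python sketch of the arithmetic behind it. The function names are hypothetical and chosen only for illustration; the one number taken from the text is the 99.75% target.

```python
from fractions import Fraction

# Availability target taken from the requirement in the text: 99.75% of requests.
AVAILABILITY_TARGET = Fraction("99.75") / 100


def allowed_failures(total_requests: int) -> int:
    """Return how many requests may fail while still meeting the target."""
    # Exact rational arithmetic avoids floating-point rounding at the boundary.
    # int() truncates, so a fractional allowance is rounded down.
    return int(total_requests * (1 - AVAILABILITY_TARGET))


def meets_target(total_requests: int, failed_requests: int) -> bool:
    """Return True if the share of successful requests is at least the target."""
    if total_requests == 0:
        return True  # no traffic, so nothing was violated
    successful = total_requests - failed_requests
    return Fraction(successful, total_requests) >= AVAILABILITY_TARGET
```

Under this reading, one million requests leave room for 2,500 failures, which is a useful way to explain the target to stakeholders who do not think in percentages.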

When we specifically consider how certain functionality should behave in relation to availability, we might say “the product details on the Product List page must be displayed for 99.75% of requests”. But what if it takes, on average, 28 seconds for them to display? At what point does slow amount to unavailable?

There is such a point for every function.

If we include performance in our requirements, we can say “the product details on the Product List page must be displayed within 3 seconds for 99.75% of requests”. But what if, 2% of the time, the advertisement engine that renders banner advertisements on the Product List page takes 10 seconds to display the banner? Do we need to rewrite the advertisement engine?

Depending on the functionality, we can consider a range of solutions that do not require rewriting the code. These include graceful functional degradation, smart defaults, live reconfiguration and bypass switches, to name a few.

In this case, the advertisements are far less important than the actual products, so if we were to use graceful functional degradation, we would say “the product details on the Product List page must be displayed within 3 seconds for 99.75% of requests, however the advertisements will be timed out, if necessary, to ensure this is achieved”.
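One way to picture "timing out the advertisements" is the Python sketch below. It is only an illustration under stated assumptions: `fetch_products` and `fetch_banner` are hypothetical stand-ins for the product lookup and the advertisement engine, and the only value taken from the text is the 3 second budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

PAGE_BUDGET_SECONDS = 3.0  # from the requirement in the text
_executor = ThreadPoolExecutor(max_workers=4)


def fetch_products():
    """Hypothetical stand-in for the product lookup."""
    return ["Product A", "Product B"]


def fetch_banner(delay_seconds):
    """Hypothetical stand-in for the advertisement engine; sleeps to simulate slowness."""
    time.sleep(delay_seconds)
    return "<banner>"


def render_product_list(ad_delay_seconds, budget=PAGE_BUDGET_SECONDS):
    """Return the page content, dropping the banner if it would exceed the budget."""
    started = time.monotonic()
    # Start the banner request in the background so it cannot block the products.
    ad_future = _executor.submit(fetch_banner, ad_delay_seconds)
    products = fetch_products()
    # Wait for the banner only for whatever is left of the time budget.
    remaining = max(0.0, budget - (time.monotonic() - started))
    try:
        banner = ad_future.result(timeout=remaining)
    except FutureTimeout:
        # Degrade gracefully: show the page without the advertisement.
        # Note that the slow call keeps running in its worker thread;
        # giving up on the result does not stop it.
        banner = None
    return {"products": products, "banner": banner}
```

The design choice here is that the products are never made to wait on the banner; the banner is the part of the page that gives way when time runs out.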

Now let's consider the scenario where the page is displaying hundreds of products. It may take longer to load and render a page that contains a long list of products, so we can use this to define the scope of the requirement and say something like “the product details on the Product List page must be displayed within 3 seconds for 99.75% of requests that return no more than 50 products, however the advertisements will be timed out, if necessary, to ensure this is achieved”.
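A requirement worded this way can be checked against measured requests. The Python sketch below assumes a hypothetical `RequestSample` record per request; the 3 second limit, the 50 product scope and the 99.75% target come from the requirement above, and everything else is illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

MAX_SECONDS = 3.0            # performance limit from the requirement
MAX_PRODUCTS_IN_SCOPE = 50   # data scope from the requirement
TARGET_PERCENT = 99.75       # availability target from the requirement


@dataclass
class RequestSample:
    product_count: int         # number of products the request returned
    seconds: Optional[float]   # time to display, or None if the request failed


def compliance_percent(samples: Iterable[RequestSample]) -> Optional[float]:
    """Percentage of in-scope requests displayed within the time limit."""
    in_scope = [s for s in samples if s.product_count <= MAX_PRODUCTS_IN_SCOPE]
    if not in_scope:
        return None  # nothing in scope was measured
    # A failed request and a too-slow request both count against the target.
    ok = sum(1 for s in in_scope
             if s.seconds is not None and s.seconds <= MAX_SECONDS)
    return 100.0 * ok / len(in_scope)


def meets_requirement(samples: Iterable[RequestSample]) -> bool:
    """True if the in-scope requests meet the target (or none were in scope)."""
    percent = compliance_percent(samples)
    return percent is None or percent >= TARGET_PERCENT
```

Note that a request taking 28 seconds is treated exactly like a failed one, which is the sense in which slow becomes unavailable, while a request returning 200 products is simply outside the scope of the requirement.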

Now, only after having considered:

  • Availability,
  • Functionality,
  • Performance,
  • Graceful Degradation, and
  • Data,

we have truly considered High Availability in terms that relate to a user.

The important message here is that for High Availability, one cannot consider availability independently from performance and function.

Let's also assume this web page uses a typical n-tier architecture and calls a web service that then hits an application tier, which makes a database call and returns the required data.

In order to ensure the non-functional requirement can be met, we must consider the critical failure points and raise risks for each of them, so that they are either addressed, mitigated or, at the very least, communicated to the stakeholders.

Continuing with the Product List page scenario, we can list the following risks:

  • The Advertisement Engine may not render in 3 seconds;
  • The connection to the database may be lost;
  • The synchronous web service may not be able to handle the volume of calls expected;
  • The webpage may take more than 3 seconds to render;
  • Queries that return more than 50 products are likely to exceed 3 seconds;
  • The Webhost IPS may experience an outage.

The chance that each of these things will actually take place varies, so we need a way to take likelihood into consideration. We also need to ask: what are the consequences?

The Risk Matrix shown below can be used to help ‘keep it real’ so that every risk raised does not come across Chicken Little style with “The sky is falling, the sky is falling”. Our opening risk of being struck by a bus may sound very serious on its own, but when we consider that the likelihood is really quite rare, we can discount the risk and turn to other serious issues, like maintaining a safe operating vehicle and wearing a seatbelt. Remember, risk analysis is a process of risk reduction and is not necessarily about risk elimination.
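As a rough sketch of how a matrix like this combines likelihood and consequence into a single rating, the Python below assumes five-point scales and simple score thresholds. The labels, the scores and the cut-offs are illustrative assumptions and are not necessarily the ones used in the matrix referred to above.

```python
# Assumed five-point scales; real matrices vary between organisations.
LIKELIHOOD = {
    "rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "almost certain": 5,
}
CONSEQUENCE = {
    "insignificant": 1, "minor": 2, "moderate": 3, "major": 4, "catastrophic": 5,
}


def risk_rating(likelihood: str, consequence: str) -> str:
    """Combine likelihood and consequence into an overall rating."""
    score = LIKELIHOOD[likelihood] * CONSEQUENCE[consequence]
    # Assumed cut-offs, chosen only for illustration.
    if score >= 15:
        return "extreme"
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"
```

With these assumed scales, being struck by a bus (rare, catastrophic) does not land in the top band, whereas something both almost certain and catastrophic does. That is the "keep it real" effect: the consequence alone never decides the rating.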

So, if we consider the risks we raised for the Product List page example, we should end up with an overall risk rating for each risk. In addition, we can suggest mitigations, that is, ways to reduce or completely remove the risk, and provide a target risk rating that might be achievable once all of the mitigations are put into effect.

If you apply the suggested mitigations listed above, each of the risks can be considered a low risk. As you can see, even with the various likelihoods and consequences, all of these example risks ended up with an equivalent risk rating. In order to meet the original non-functional requirement, we have provided a solution that mitigates these risks or prevents them from eventuating.

As well as being a valuable tool for determining risk, this table is also quite useful to testers. It provides a list of failure scenarios that they can test to validate whether the risks are likely to occur. In many instances, the very act of performing non-functional testing is itself a mitigation, because it verifies that a risk will not eventuate under production loads.

As to how to determine the best way to mitigate each risk, that is a whole other story. That comes from experience. You have to play in the traffic a while before you know how to dodge the trucks and buses or avoid them altogether.
