Introduction to observability 

Everyone has experienced it. Your application gives you an error, and it’s just an inscrutable hex code. The technicians know something is wrong, but they can’t tell you what. Even worse, maybe you’re the technician: your application knows something has gone wrong, but it can’t tell you what.

It’s a failure of “observability.”

In science, everything must be observable and measurable. The disciplines of engineering and development are the same. Without observability, you can’t tell what went wrong when something fails. By improving observability, you can reduce troubleshooting time and avoid system disruption.

What is observability?

So, what exactly is observability? It’s the ability to determine the internal state of a system, whether hardware or software, from the information it gives you.

Let’s start with a small example. Every once in a while, a system will shut down, and you’ll get an error like this:

ERROR: SYSTEM STOPPED.

This error offers poor observability: you have no idea where the error occurred or what happened. On the other hand, how about this more common error:

ERROR: PRINT SPOOLER STOPPED.

Now, you know that the error is with the printer. (And of course it is; the error is always with the printer.) You have heightened observability, but you still don’t know why the error occurred. An even better error might be:

ERROR: PRINT SPOOLER FAILED. PRINTER LOW ON INK.

Now you know everything. You know where the error occurred and why. You have high levels of observability.
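As a rough illustration, the three messages above can be modeled as one record whose optional fields answer "where" and "why." This is only a sketch: the ErrorReport class and its field names are hypothetical, invented for this example rather than taken from any real library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorReport:
    """Hypothetical error record; the field names are illustrative only."""
    status: str                      # what happened, e.g. "STOPPED"
    component: Optional[str] = None  # where it happened
    cause: Optional[str] = None      # why it happened

    def render(self) -> str:
        # Fall back to the generic "SYSTEM" when the component is unknown.
        subject = self.component or "SYSTEM"
        text = f"ERROR: {subject} {self.status}."
        if self.cause:
            text += f" {self.cause}."
        return text

    def answers(self) -> dict:
        # Which troubleshooting questions can this report answer?
        return {"where": self.component is not None,
                "why": self.cause is not None}


reports = [
    ErrorReport("STOPPED"),
    ErrorReport("STOPPED", component="PRINT SPOOLER"),
    ErrorReport("FAILED", component="PRINT SPOOLER", cause="PRINTER LOW ON INK"),
]

for report in reports:
    print(report.render(), report.answers())
```

Each additional field the record carries is one fewer question the person troubleshooting has to answer by hand.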

Naturally, in today’s technology, it’s not going to be a print spooler. It’s likely going to be the API of a third-party system malfunctioning deep within your infrastructure. And that’s what makes observability even more critical.

What does observability measure?

The error itself is not considered “observability.” Instead, observability is a measure of how much you can learn about what has gone wrong and where, to varying degrees of specificity. A system therefore has high or low levels of observability.

That means the entire system needs high observability. It’s not enough for some parts to have robust reporting; all of them need reliable reporting.

Why do you need observability?

In the old days, systems were quite simple. Though there might be reams and reams of code, you could usually narrow down where an error occurred by process of elimination. Entire infrastructures were set up on a single server with internal programs.

Today, system complexity is far greater. Companies are using public and private cloud architectures. They are using Internet of Things devices. They are using third-party applications. They are introducing endpoints: laptops, tablets, and smartphones. These endpoints may have any operating system under the sun.

Because of this, observability has become far more important. When building applications and infrastructure, observability standards now have to be met. If they aren’t, then as the system grows in complexity and breadth, it becomes impossible to track down errors reliably and efficiently.

Observability vs. Monitoring

Modern infrastructure requires both better observability and better monitoring. But it’s essential to understand the distinction, because observability isn’t monitoring, and monitoring isn’t observability. Monitoring involves reporting on the system’s performance, both in real time and historically. Observability is the specificity and accuracy of that monitoring data. The two work hand in hand, but they shouldn’t be confused.

Systems can have solid monitoring but imperfect observability. They may report precisely when a system goes down or how the system is performing, but these instances may not be traceable. Likewise, systems can have solid observability but inadequate monitoring. When an incident occurs, they may be able to tell you exactly where that incident occurred. But they may not have the robust logs required to identify how frequently these issues are happening or how bad they are.
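The difference can be sketched with a handful of events. In this minimal example, the event fields, component names, and causes are all made up for illustration. The first function gives a monitoring-style view (when and how often failures happen), while the second gives an observability-style view (where and why they happen).

```python
from collections import Counter

# Hypothetical raw events; every field name and value here is illustrative.
events = [
    {"minute": "12:00", "ok": True,  "component": "checkout",     "cause": None},
    {"minute": "12:00", "ok": False, "component": "payments-api", "cause": "timeout"},
    {"minute": "12:01", "ok": False, "component": "payments-api", "cause": "timeout"},
    {"minute": "12:01", "ok": True,  "component": "checkout",     "cause": None},
]


def failures_per_minute(events):
    """Monitoring view: when do failures happen, and how often?"""
    return Counter(e["minute"] for e in events if not e["ok"])


def failures_by_origin(events):
    """Observability view: where did failures happen, and why?"""
    return Counter((e["component"], e["cause"]) for e in events if not e["ok"])


print(failures_per_minute(events))
print(failures_by_origin(events))
```

A system that stores only the per-minute counts can say that something failed twice, but not that both failures trace back to the same component and cause.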

Observability best practices

So, we know that systems today are complex, and that makes observability complex, too. Observability has to be baked in from the ground up; if it’s not present at every level of the infrastructure, it falters.

What are some best practices when it comes to observability?

  • Use utilities that make it easier to search through logs and events.
  • Track events with unique request IDs that do not change as data moves between systems.
  • Maintain access to raw events, rather than only aggregating or consolidating them.
  • Collect high-level context data about how events occur and the environment they occur in.
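Several of these practices can be sketched with nothing but Python’s standard library. The example below is one possible approach, not a prescribed implementation: the logger name, the "context" key passed through extra=, the service names, and the handle_request and call_downstream functions are all hypothetical. Each event is written as one searchable JSON line, keeps its raw form, carries context, and shares a request ID that stays the same across calls.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the request currently being handled.
request_id = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per raw event so logs stay searchable."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": request_id.get(),
            # Context passed by the caller via extra=; the key name is hypothetical.
            "context": getattr(record, "context", {}),
        })


def build_logger(stream=None):
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("observability_sketch")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def call_downstream(logger):
    # Logged deeper in the call chain, yet tagged with the same request ID.
    logger.error("upstream API failed",
                 extra={"context": {"service": "billing", "status": 503}})


def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID when one is supplied so it never changes in transit.
    token = request_id.set(incoming_id or str(uuid.uuid4()))
    try:
        logger.info("request received",
                    extra={"context": {"service": "frontend"}})
        call_downstream(logger)
    finally:
        request_id.reset(token)


if __name__ == "__main__":
    handle_request(build_logger(), incoming_id="req-123")
```

Because every line carries the same request_id, a plain text search for that ID recovers the full path of a single request, even when the events were written by different parts of the system.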

Observability is rapidly becoming more important throughout development. Systems are only growing in complexity, not shrinking. But by following the right best practices and pairing observability with better monitoring, developers can create systems that are optimized even for failure.
