Most critical failures do not happen by surprise. They happen because no one asked the right question at the right time: is this system truly reliable, or are we just hoping it is?
As long as the machine is running, other priorities take over. The problem is that when it stops working, the cost of the failure is often wildly disproportionate to the time it would have taken to prevent it.
Software reliability has long been treated as a subject reserved for technicians. This is a mistake: when a system fails in a crucial area like finance, defense, healthcare, or public administration, engineers are usually not the first to deal with the fallout. It is the leaders who must answer to shareholders, regulators, and public opinion. They are exposed legally, financially, and morally. A bug is not a technical detail: it is a strategic risk.
Reliable does not mean "it never crashes"
This is the first misunderstanding to clear up. Trustworthy software does not mean flawless software: that doesn't exist. It means software whose limits are understood, whose points of vulnerability have been identified, tested, and documented. And software that, when something goes wrong, fails in a controlled manner rather than taking everything else down with it.
This distinction has tangible importance in critical systems: in defense, a command system that freezes during an operation doesn't issue a warning before failing; it simply goes dark. The same applies to market finance, public infrastructure, and healthcare. A system capable of detecting its own failure and switching to a degraded mode within thirty seconds is radically different from a system that silently freezes and goes unnoticed for two hours. The former is under control; the latter only reveals itself once the damage is done.
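For readers who want to see what "detecting its own failure and switching to a degraded mode" means concretely, here is a minimal sketch in Python. It is an invented illustration, not any real system: every name in it (`check_health`, `handle_degraded`, the 30-second budget) is hypothetical. The point is simply that the fallback is an explicit, bounded decision rather than a silent freeze.

```python
import time
from enum import Enum

class Mode(Enum):
    NOMINAL = "nominal"
    DEGRADED = "degraded"   # reduced but known-safe behavior

def run_with_watchdog(check_health, handle_degraded, timeout_s=30):
    """Poll a health probe for at most timeout_s seconds.

    If the probe never reports healthy within the budget, switch to an
    explicit, documented degraded mode instead of hanging silently.
    All names here are illustrative, not a real API.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_health():
            return Mode.NOMINAL
        time.sleep(1)
    # Failure detected within a bounded delay: the system announces it
    # and falls back, rather than disappearing without a trace.
    handle_degraded()
    return Mode.DEGRADED
```

The design choice that matters is the bounded deadline: the system commits in advance to how long it will wait before admitting failure, which is exactly what the two-hour silent freeze lacks.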
The good news is that this level of software reliability is not exclusive to organizations with massive budgets. It essentially depends on the methodology used and deciding that it is a priority before an incident forces the decision.
A statistic that should give us pause: according to NIST, fixing a bug in production costs between 15 and 30 times more than in the development phase. The real challenge is therefore not to spend more, but to spend at the right time.
Three questions to ask your teams, no technical skills required
You don't need to understand how a compiler works to judge the solidity of your systems. The answers to these three questions will tell you a lot.
Who performs the tests on our critical software?
If the answer is the same team that developed them, that is a risk in itself. Independent testing, automatic scenario generation, and formal verification catch what the original team never thought to check.
When were our legacy systems last audited?
Old systems, like the millions of lines of COBOL still in production in finance and administration, can contain unexpected flaws. Updates and new integrations create invisible risks if no one checks them. This is often what a silent software failure looks like.
What exactly will happen if this system stops working tomorrow?
A software risk that cannot be described precisely is a risk that is not under control. Defining the consequences allows for anticipation and mitigation planning. And a risk you don't manage is a risk that manages you.
What organizations where error is not an option do differently
We have worked with organizations operating in defense, tax administration, and market finance. What sets them apart from others is not their budget, but their relationship with proof: they don't just trust; they verify.
In practice, they don't deploy a critical system simply thinking, "the tests passed, everything should be fine." They automatically generate thousands of scenarios to account for cases that even their own engineers would never have written. They involve external teams. They document what happens at the system's limits, not just in standard situations.
Every year, the DGFiP (French Tax Authority) processes tax returns for tens of millions of taxpayers. This is precisely the type of context where Titagone intervenes. When asked how they ensure software quality at such a scale, their answer isn't that they simply trust their developers, but that they rely on independent verification processes applied systematically.
Blind trust has no place in critical systems.
It’s not about elitism: at some point in their history, these organizations paid the price for a poorly verified system, and they chose never to repeat that mistake.
Where to start to evaluate the reliability of your critical systems
Before any technical considerations, start by identifying your most critical systems. Even a rough estimate is enough; the important thing is to know for which ones a 24-hour interruption would cause real problems. For each one, ask: who developed it, who maintains it today, and when was it last independently tested?
Most organizations do not keep this list up to date. Sometimes it exists only in the head of a single person, which is itself a telling data point about your level of software risk.
Next, it’s about finding the right person to talk to. Someone capable of looking at your situation objectively, without bias, and clearly telling you what needs priority attention and what can be handled later. It's not about redoing everything in detail, but simply getting an overview of your current situation.
At Titagone, we support companies in carrying out this assessment in a clear and pragmatic way, without alarmism or unnecessary overhauls.
Frequently Asked Questions
What exactly is software reliability?
It is the ability of a program to behave as expected, even in the face of unforeseen circumstances. In critical contexts, it also covers how the program fails: gracefully, with a documented degraded mode, rather than chaotically and silently.
What is the real cost of a software failure within a company?
While it varies by field, the cost almost always ends up exceeding what prevention would have required. In finance, a few minutes of downtime on a trading platform can lead to substantial losses. In the public sector, it jeopardizes service continuity for millions of individuals. In industry, it is the production line that stops. And beyond the direct figures: brand image, regulatory scrutiny, and customers who don't come back.
How can I judge the reliability of my software without technical skills?
By asking three simple questions to your teams: are your critical systems tested by people independent of those who developed them? When was the last audit? And what exactly happens if this system stops? If the answers are vague or evasive, you already have your answer.
Is software reliability and cybersecurity the same thing?
No, although they are often confused. Cybersecurity deals with external threats: attacks, intrusions, and data theft. Software reliability concerns the internal behavior of the software, regardless of any threat. A system can be highly secure against cyberattacks while remaining completely fragile from the inside. The two subjects are complementary. Neglecting one or the other is like leaving a door unlocked.
Embedded code at Thales. Zero margin for error.
Thales set us an ambitious challenge: to guarantee the behavior of critical C code against thousands of scenarios that no one could have imagined. To meet this, we created SeaCoral, a tool capable of automatically generating these tests, which can now be integrated directly into their development chain.
Systems are verified in depth against multiple coverage criteria, not merely tested at the surface. If you work on systems where the slightest failure is unacceptable, this project will resonate with you. Discover the project.
About the Author
Titagone
Editorial Team
Expert in formal methods and software engineering at Titagone
