Satoshi Nakamoto had a vision. He wanted to create a transparent financial system that wouldn’t require the help of the financial institutions that had served as “trusted” third parties for centuries on end. Nakamoto didn’t want any middlemen at all. He wanted a system where anyone should be able to send money directly to someone else without any stops in between.
This might sound like an easy thing to do, because we exchange money directly all the time with cash, but it’s in fact incredibly hard to do this digitally. Consider person A sending $10 to person B while at the same time sending that same $10 to person C. Impossible with cash, but very possible with digital money. This problem is called double spending.
It’s an easy enough problem to solve for centralized currencies, because financial institutions manage the flow of money and can block that $10 from going to person B or C. But double spending is a lot harder to solve if you want to build a financial system without middlemen. Who will stop the money from going to both B and C when there’s no one in between?
Nakamoto proposed a decentralized system to solve this problem. A large group of people would independently look over all transactions and agree which transactions were legitimate and which were not.
The real brilliance in Nakamoto’s proposal is how those people would agree. Remember, this is all done digitally. There’s no focus group sitting somewhere in Japan looking over all transactions. The people agreeing on transactions are scattered all over the world. In order to build a decentralized system that can sustainably and securely validate transactions, Nakamoto had to create something that was what’s called Byzantine fault tolerant.
This article will explain what Byzantine fault tolerance is, dive into the Two Generals Problem and the Byzantine Generals Problem, and show how different blockchains (including Nakamoto’s Bitcoin) solve the difficult problem of finding agreement in a decentralized system.
What is a Byzantine Fault?
A Byzantine fault is any fault that causes parts of a system to receive different information while they should have received the same information. For example, node A wants to send (this) to node B and C, but a software bug causes it to send (this) to node B and (that) to node C. Or node A wants to send (this) to node B and C, but partial network failure causes it to send (this) to node B and nothing to node C (which it doesn’t see as an error, because receiving no response is also a possibility).
Byzantine faults are difficult to spot, because node A is still functional. It hasn’t shut down, it hasn’t produced any error, and the other nodes haven’t received any faulty input. So you wouldn’t know it’s not working as intended if you don’t look underneath the hood. Byzantine faults can do a lot of damage. They can cause a Byzantine failure, which happens when Byzantine faults cause the entire system - not just a particular node - to produce the wrong result.
Byzantine fault tolerance (BFT) is how well a system can tolerate these Byzantine faults. BFT is important in systems that need extreme reliability or where many different parts need to come to a consensus. Examples are the Boeing 777 flight control systems, distributed clock synchronization systems, as well as any blockchain.
So how do you make a system Byzantine fault tolerant? Well, it depends on how many parts need to agree. This is where the problem of Byzantine faults splits in two: when there are two parts in the system and when there are many.
The Two Generals Problem
Aragorn and Boromir want to conquer the tower of Saruman. But Saruman is strong. Aragorn and Boromir can only defeat Saruman if they both attack at the same time. The problem is that the armies of Aragorn and Boromir are in two separate locations. They can only communicate by sending messengers through the territory of Saruman.
This is problematic, because Saruman can intercept the messengers, kill them, change the content of their messages, send his own messengers, etc. How can either of our heroes reliably know when the other hero wants to attack? How can Aragorn and Boromir agree on when to attack Saruman? This is the so-called Two Generals Problem.
Sounds like a tough problem, doesn’t it? Computer scientists have tried solving it for decades. Akkoyunlu et al. were the first to mention this type of problem in their 1975 article Some constraints and tradeoffs in the design of network communications. As you can infer from the title of the article, they saw the problem as a constraint, a tradeoff, something you need to be aware of when designing a distributed system.
And there is indeed no perfect solution to the Two Generals Problem. Aragorn and Boromir can never come to an agreement on when to attack if they would only use a single messenger. Even if the messenger safely arrives on the other end, they’ll never know for sure if Saruman corrupted its message or not.
In the digital world, Aragorn and Boromir are computers and Saruman is the network between those computers. When the network is what’s causing Byzantine faults, it is impossible for the computers to find agreement. Fischer et al. proved this in their 1985 paper Impossibility of Distributed Consensus with One Faulty Process.
But there is a workaround. In reality, most networks aren’t fully Byzantine. They don’t corrupt all the messages all the time. They corrupt some of the messages some of the time. So there is a solution. Aragorn can send a hundred messengers to Boromir telling him to attack at first light of the fifth day if he receives the message. Even if Saruman is 99% Byzantine, Aragorn and Boromir can agree, because one messenger will have slipped through Saruman’s nets.
It’s not perfect, because it relies on the assumption that Saruman isn’t fully Byzantine, but these types of practical workarounds are how most distributed systems solve the Two Generals Problem today.
The Byzantine Generals Problem
Robert Shostak was the computer scientist who conceptualized and formalized the problem of coming to a consensus in spite of Byzantine faults. In 1982, together with Leslie Lamport and Marshall Pease, he published a seminal paper in computer science, The Byzantine Generals Problem, that gave the problem its name (it was called the interactive consistency problem before). Their paper showed when and how you could solve this tough problem.
The Two Generals Problem is impossible to solve perfectly because the network is Byzantine. But what if it’s not the network that’s Byzantine, but one of the nodes in the system. What if more than one node is corrupt and sends out different signals to the other nodes? How many nodes can get corrupted before the system produces a Byzantine failure? That’s the Byzantine Generals Problem.
Let’s return to our Lord of the Rings example, but add one character: Faramir. Aragorn, Boromir, and Faramir want to attack Saruman’s tower, but they can only win the fight if all three attack at the same time. Saruman won’t intervene with their messages this time, but unfortunately Boromir has been converted. He is a traitor.
Boromir intends to stop Aragorn and Faramir from agreeing. Conversely, Aragorn and Faramir want to come to a consensus. As a general rule, to solve the Byzantine Generals Problem, all loyal nodes need to agree on what one particular node says, regardless of whether that node is a traitor or not.
In our example, consider Aragorn and Faramir are waiting for the orders of Boromir. The vile traitor he is, Boromir sends (attack) to Faramir and (stay) to Aragorn. Faramir wants to make sure he’s not being tricked and asks Aragorn what order Boromir sent him. Aragorn says he was ordered to stay. At this point, Faramir has two differing orders: (attack) from Boromir and (stay) from Aragorn. He cannot know who the traitor is and doesn’t know which order to follow.
Now imagine Boromir doesn’t give the initial command, but Aragorn does. Aragorn sends (attack) to both other heroes. When Faramir checks in with Boromir, Boromir lies and says he’s been ordered to stay. Once again, Faramir is stuck. He cannot know who the traitor is and doesn’t know which order to follow.
The conclusion is that you cannot solve the Byzantine Generals Problem if there are three generals and one of them is a traitor. Shostak, Lamport, and Pease concluded that the Byzantine Generals Problem is solvable if and only if more than two- thirds of all nodes are loyal (with non-encrypted messages, at least).
If Gandalf joined the fight and Boromir was still a traitor, the loyal heroes could agree regardless of what Boromir said. He could say (attack) to Faramir and Aragorn, and (stay) to Gandalf, but any of the heroes could simply ask the others what they’ve been sent and follow the majority order (attack, in this case).
How Does a Blockchain Become Byzantine Fault Tolerant?
All this is relevant for blockchain technology, because a blockchain is a decentralized system where all nodes need to agree on what one particular node said (the one that found the key to the cryptographic puzzle). A blockchain is Byzantine fault tolerant because of the consensus protocol it uses.
A consensus protocol is a set of rules that incentives the right behavior and either doesn’t incentivize or punishes the wrong behavior. The two most popular consensus protocols in blockchain today are proof of work (PoW) and proof of stake (PoS). PoW is used in Bitcoin and Ethereum v1, while PoS is used in more modern blockchains such as NEAR and Ethereum v2. PoW requires nodes to solve a cryptographic puzzle by throwing a lot of computing power at it, while PoS requires nodes to stake a percentage of the blockchain’s coin in exchange for a chance of becoming a block creator.
Both PoW and PoS are Byzantine fault tolerant because nodes will always be able to find agreement as long as more than two-thirds of the nodes are loyal. Combine this with other rules in the protocol that make it extremely hard to actually send false information to other nodes and you have extremely safe distributed technology.
When nodes in a decentralized system need to come to an agreement, they must be protected against Byzantine faults. If the network between the nodes is what causes Byzantine faults, you cannot create a perfectly Byzantine fault tolerant system, although there are practical workarounds. If some of the nodes are causing Byzantine faults, there is a solution if and only if more than two-thirds of all nodes are working correctly.
Blockchains are Byzantine fault tolerant through their consensus protocols. These protocols make it very hard to send wrong or confusing information to other nodes in the network. As long as more than two-thirds of all nodes work, these blockchains cannot be corrupted.