A Hitchhiker's Guide To Blockchain
Blockchain ... it's almost too much to take on in one rant.
First and foremost, there is no single precise definition of "blockchain" like there is with, for
example, the derivative of x^2+1. The term blockchain is now used broadly to cover a soup of approaches involving
immutability, transaction management, distribution of data, payload, and consensus. Here is a sample from both ends of the spectrum:
Concept | Bitcoin | Hyperledger (incl. many variations within) |
Participants | Completely anonymous, known only by public key | All participants well-known and identity is vetted |
Transactions | Accumulate in an uncommitted block (effectively a bucket). Miners attempt to find the right data (a "nonce") to add to the block that will produce a properly constructed hash or fingerprint of the block, after which the block can be committed to the chain and the result broadcast to the mining network | No mining, no nonces, and essentially no blocks. Each transaction (e.g. modification of a loan agreement, updating a current assessed value, etc.) yields a new version, which is hashed and committed to the chain |
Consensus | Statistically driven, relying on a large number of participants. Side chains can emerge but eventually, participants add more and more blocks to one particular chain, leading to a longest-chain-wins model. Statistics suggest that after 6 blocks have been committed to a chain, the transactions within are nearly (but not 100%) guaranteed to be correct and without double-spending. | Workflow entitlements driven, relying on specific actions by specifically named participants. No consensus required although the workflow might demand two or more participants to do something before state change can take place. But this is not the same thing as law-of-large-numbers statistical consensus. |
Distribution Model | Fully distributed data and processing running on any infrastructure from the cloud to a PC on a desktop. Many nodes in the network, each with a copy of the blockchain. Nodes broadcast changes and listen for others and each applies the same algorithms to achieve global consensus. | (Most extreme variation) Single copy of a single workflow running on infrastructure in the cloud hosted by a major company. No other nodes, no other copies. APIs typically exist for participants to "listen" for new activity on the workflow, upon which they can manually "copy down" the latest versions into their own technology (which may have nothing to do with blockchain) for local processing and querying. Note these local copies are NOT part of any consensus / data integrity model. |
Payload | Completely objective and context-free, the value and bookkeeping data about the bitcoin. Any party examining the payload can understand it | The digital asset is an arbitrary payload such as a loan that may have a great deal of subjective and context-sensitive data. Consistent relevance/importance and interpretation of all the data to every party involved in the workflow is highly questionable, e.g. the building inspector does not care about the LTV and toggle rate parameters of the loan -- and by extension does not want to in any way be responsible for assuring their integrity |
So... which one is correct? Both.
Wikipedia summarizes blockchain thusly
and I believe it is not only a fair
description, but one that could be applied to both scenarios above:
A blockchain, originally block chain, is a continuously growing list of records, called blocks, which are linked and secured using cryptography. Each block typically contains a hash pointer as a link to a previous block, a timestamp and transaction data. By design, blockchains are inherently resistant to modification of the data. The Harvard Business Review describes it as "an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way." For use as a distributed ledger, a blockchain is typically managed by a peer-to-peer network collectively adhering to a protocol for validating new blocks. Once recorded, the data in any given block cannot be altered retroactively without the alteration of all subsequent blocks, which requires collusion of the network majority.
The landscape is filled with terms like "distributed ledger" and "smart
contract" and different definitions have been applied to each one depending
on the particular product at hand. In other words, there are many different
(and useful and interesting) products and solutions performing very
different workloads at different scale and performance -- and each
trying to tag the solution with as many blockchain terms as possible.
Instead of adding to the mess by proclaiming another top-down definition of the
blockchain as it revolutionizes yet another business use case, let's instead start fresh from the bottom up: a chain of transactions.
Chain Immutability: The Foundation
A critical feature of a blockchain -- arguably, the most important feature -- is to provide cryptographically enforced
immutability of data.
It is about how
a series of versions of a thing (i.e. a "chain") can be "fingerprinted" in a
way that guarantees both the integrity of each version and the exact sequence of the versions. The truth is that some products and innovative
internal development efforts have been doing this for 20 years but without the fanfare.
It's pretty simple and efficient. Let's start with an implementation that readily and
clearly exposes and ensures both individual version integrity and lineage (note: this
is not the implementation used in most blockchains; we are trying to highlight some concepts
without simplifications):
- Construct version 1 of a thing. The thing can be anything: a single string, a record of data, an Excel spreadsheet, an MP3 music file, a smart
contract (more on that later!). Anything. In bitcoin, the
thing is a set of candidate transactions enclosed in a block. In another
system, it might be a big JSON object with metadata (createDate, updateDate,
updateBy, etc. etc.) and rich shapes of domain-specific data like loan
parameters. All these things are simply a sequence of bytes.
The blockchain is agnostic to the payload!
- Get the fingerprint of the thing using a hash function such as SHA-256. In the old days (the
1990s) we used MD5 but significantly better / more secure options are now
widely used. There is an enormous
amount of material available on hashing so I won't go into detail here, but
two concepts are important:
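- Any sequence of bytes no matter the length (i.e. big Excel spreadsheet, tiny email, etc.) always turns into a 32 byte fingerprint (for SHA-256) that is, for all practical purposes, unique for that sequence of bytes. If two MS Word documents differ by a single space,
the fingerprints will be different. Collisions must exist in principle, because there are more possible inputs than 32 byte outputs, but it is computationally infeasible to find two different inputs that yield the
same output fingerprint.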
- It is computationally infeasible to "reverse" the fingerprint, i.e. figure
out what sequence of bytes to put in to yield a desired fingerprint. This
is why hash functions are also called one-way functions. Given bytes, it is
easy to get the fingerprint -- but given a fingerprint, it is effectively
impossible to get the bytes. This means you cannot "cheat" and invent
target fingerprints.
- Make version 2 of a thing and get the fingerprint in the same way as version 1.
- Here's where the chain integrity part kicks in: We now take the
fingerprint from version 1 and the fingerprint from version 2, combine them,
and fingerprint that result. In pseudocode:
fingerprint1 = hash(version 1)
fingerprint2 = hash(version 2)
merged = concatenate(fingerprint1, fingerprint2)
chain_fingerprint_from_1_to_2 = hash(merged)
In practice, some more material is added in the concatenation step so that
newly created versions that have no changes are "forced" to have a
different chain fingerprint, but that's not important here.
- Make version 3 of a thing and get its fingerprint.
- Perform the chaining exercise again, but this time to the
fingerprint from version 2:
fingerprint2 = hash(version 2)
fingerprint3 = hash(version 3)
merged = concatenate(fingerprint2, fingerprint3)
chain_fingerprint_from_2_to_3 = hash(merged)
- One more time:
fingerprint3 = hash(version 3)
fingerprint4 = hash(version 4)
merged = concatenate(fingerprint3, fingerprint4)
chain_fingerprint_from_3_to_4 = hash(merged)
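The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the implementation of any particular product: the payloads are made up, and joining the raw 32 byte digests (rather than, say, their hex strings) is an assumption about what "concatenate" means.

```python
import hashlib

def fingerprint(data: bytes) -> bytes:
    """Return the 32 byte SHA-256 fingerprint of an arbitrary byte sequence."""
    return hashlib.sha256(data).digest()

def chain_fingerprint(prev_version: bytes, next_version: bytes) -> bytes:
    """Fingerprint the concatenation of two adjacent version fingerprints."""
    merged = fingerprint(prev_version) + fingerprint(next_version)
    return fingerprint(merged)

# Hypothetical payloads; any byte sequences would do.
versions = [b"loan amount=100", b"loan amount=120", b"loan amount=120 rate=4.5"]

# Concept 1: the fingerprint is always 32 bytes, whatever the input length.
lengths = {len(fingerprint(v)) for v in versions + [b"x" * 1_000_000]}

# Concept 2 in action: a single extra space yields a different fingerprint.
differs = fingerprint(b"hello world") != fingerprint(b"hello  world")

# One chain fingerprint per adjacent pair of versions.
links = [chain_fingerprint(a, b) for a, b in zip(versions, versions[1:])]
for i, link in enumerate(links, start=1):
    print(f"chain_fingerprint_from_{i}_to_{i + 1} = {link.hex()}")
```

Note that nothing here needs a key or a secret; anyone holding the same bytes gets the same fingerprints.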
At this point you can probably see what is happening. The chain_fingerprint at any particular version, e.g. 3, can only be constructed from
version 3 of the thing PLUS the fingerprint of version 2 -- and the chain_fingerprint before it can only be constructed from version 2 PLUS the fingerprint of version 1.
As a result, given a list of versions of a transaction, it is possible
for anyone to "walk" the list and recalculate all the fingerprints and ensure that the
recalculated data matches whatever was originally stored. Not a single byte
of any of the versions can change nor can the order of the list. No
secret keys are required; in fact, no keys are required at all and the process
is patently transparent. It
almost does not matter if a thing in fact has a "version number" as part of
its data payload. It is the creation order and fingerprint chaining that
is the ultimate guarantor of integrity and transaction activity over time.
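Here is a rough sketch of what "walking" the list looks like, assuming the stored record holds one fingerprint per version plus one chain fingerprint per adjacent pair. The record layout and the function names are invented for illustration.

```python
import hashlib

def fingerprint(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_record(versions):
    """Compute what gets stored: per-version fingerprints and pairwise chain fingerprints."""
    prints = [fingerprint(v) for v in versions]
    links = [fingerprint(a + b) for a, b in zip(prints, prints[1:])]
    return {"fingerprints": prints, "links": links}

def walk(versions, stored_record) -> bool:
    """Recalculate everything from the versions and compare with what was originally stored."""
    return build_record(versions) == stored_record

versions = [b"version 1", b"version 2", b"version 3", b"version 4"]
stored = build_record(versions)

# Untouched data walks cleanly.
intact = walk(versions, stored)

# Change a single byte in version 3.
tampered = list(versions)
tampered[2] = b"version 3!"
tamper_detected = not walk(tampered, stored)

# Swap versions 2 and 3 without changing any bytes.
reordered = [versions[0], versions[2], versions[1], versions[3]]
reorder_detected = not walk(reordered, stored)
```

The walk needs no privileges and no keys: it is just rehashing and comparing.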
Most popular blockchain implementations, however, assume (rightly so) that
individual version integrity and lineage cannot be separated. Thus, instead of tracking
both the individual version fingerprint and the chain_fingerprint, each
individual version fingerprint includes the fingerprint from the prior
version as well:
merged = concatenate(null, version 1)   // 1st is special; no prior fingerprint!
fingerprint1 = hash(merged)
merged = concatenate(fingerprint1, version 2)
fingerprint2 = hash(merged)
merged = concatenate(fingerprint2, version 3)
fingerprint3 = hash(merged)
merged = concatenate(fingerprint3, version 4)
fingerprint4 = hash(merged)
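The same scheme as a short Python sketch. Using an empty byte string to stand in for the "null" prior fingerprint is an assumption made here for simplicity; real systems define their own convention for the first entry.

```python
import hashlib

def link_fingerprints(versions):
    """Each fingerprint covers the prior fingerprint plus the current version."""
    prints = []
    prev = b""  # assumption: the first version has no prior fingerprint
    for version in versions:
        prev = hashlib.sha256(prev + version).digest()
        prints.append(prev)
    return prints

def first_mismatch(versions, stored):
    """Return the 1-based number of the first version whose fingerprint fails, or None."""
    fresh = link_fingerprints(versions)
    for number, (new, old) in enumerate(zip(fresh, stored), start=1):
        if new != old:
            return number
    return None

versions = [b"version 1", b"version 2", b"version 3", b"version 4"]
stored = link_fingerprints(versions)

# Edit version 2 only, then see which fingerprints no longer match.
tampered = list(versions)
tampered[1] = b"version 2 (edited)"
changed = [new != old for new, old in zip(link_fingerprints(tampered), stored)]
```

Because every fingerprint folds in the one before it, an edit to version 2 invalidates fingerprints 2, 3, and 4, while fingerprint 1 is untouched.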
As valuable and important as chain immutability is, there are several
important points that should be made here:
- Chain immutability is independent of whether the chain is distributed, copied,
or centralized. When we say "originally stored" above, we assume that the transactions are being
stored somewhere -- but we can have one cloud-enabled instance of this chain
upon which all participants interact or 1000s
of copies around the world. Varying objectives for ownership/control, performance, and
other factors will determine how many duplicated chains exist and give rise
to soft terms like "local ledger", "distributed ledger", "centrally shared ledger", "shared ledger", etc.,
but distribution and/or duplication of the chain is separate from the
fundamental immutability of the chain. The chain immutability guarantees that if
copies need to be reconciled, the process is both precise and efficient.
In a multiple local copy implementation, there is no point in tampering with my local copy because the tampering will generate a different set of fingerprints and my peers who also have a copy will not agree with my new fingerprints (see
signatures below for even stronger integrity).
Note: Even in a single shared central ledger
design like the IBM Cloud Hyperledger, it is very likely that participants
will have to make an out-of-process
copy of the ledger in order to integrate the data with other systems. Long
story short, there is no practical way you are going to issue a SELECT
statement to join the blockchain persistor to your local database.
- Chain immutability is not the same thing as database immutability.
The steps outlined above describe the cryptographic math and process around
immutability. In reality, all those versions and fingerprints have to go
somewhere, like a database. It is perfectly reasonable to assume someone
could, intentionally or otherwise, change the data in the database,
setting aside standard permissions design and issues for the moment. The
database doesn't natively know about hashing content and the formula for
computing the chain_fingerprint. It is only when the chain is
walked by software and the material rehashed and compared against
stored values that the integrity is assured. Unless the database is
periodically walked in this fashion, versions of things could be changed and
consumed by business processes despite the existence of a now-invalid
fingerprint. Turning off the update privilege and permitting only new versions
of things (and their fingerprints) to be added seems attractive but this could
be defeated by a database admin. Walking
each thing's version chain is a privilege-independent operation that cannot
be suborned by root or dbo.
- Chain immutability is independent of physical robustness of storage.
There's
nothing about blockchain that prevents you from deleting the database holding
the transactions and fingerprints. There's no spec about backup and recovery.
Of course, in a multicopy distributed model, statistically at least one or two
copies of the chain should survive destruction of backing storage, so from a pure
consumer (not chain provider) point of view the storage is robust.
But each copy
still needs individual backup; otherwise its owner faces the tedious task of downloading the entire
transaction chain again.
Immutability refers only to the cryptographic integrity of a chain, not the
infrastructure to manage the data. Robust solutions still require HA and DR
for the chain. Cloud-based solutions may vastly simplify HA/DR but make no
mistake -- something in the stack has to be making copies of the chain
to defend against non-availability of the storage media. Don't confuse
assertions like "distributed ledgers are immune to single point-of-failure
problems" with basic HA/DR capability. Chances are very high that if you are
using a solution that features a full local copy under your management, you
will want to have an HA/DR strategy in place.
Speaking of databases, there's also startlingly little in the way of specification
of how you can
richly query for transactions (e.g. "find all transactions between these 2 dates where
owner is Bob, amount > 100, and product is X or Y. And do that in 20 milliseconds")
but that's a whole different rant.
- Chain immutability has nothing to do with entitlements on the data itself
such as protection of personally identifiable information. It is perfectly
fine to encrypt sensitive fields (any number, in any way) in the data; the
hash function does not care if it is hashing plaintext or ciphertext. The
fingerprinting process examines a sequence of bytes; that's all. In fact,
basic use of a blockchain requires no encryption whatsoever. A robust solution
using a blockchain will require a separate data entitlement model on top of
the chain machinery.
- Chain immutability has nothing to do with signatures. Signatures form
the basis for nonrepudiation which is a different concept than
immutability.
Signatures involve the use of public/private keypairs, not one-way hash
functions. Just because a chain
has immutable data does not mean that I as a participant agree with the
values; it simply means it cannot be changed. For me to "stamp my approval" on a
new version (or, by extension, declare that it is truly me creating the new
version), I sign the chain_fingerprint with my private key and
the result is stored "alongside" the main data in the blockchain database.
At this point, I have committed myself to this version. Anyone with my public
key can check the signature against the original chain_fingerprint, which guarantees that I and only I could have
created the signature. Furthermore, signatures stop others from tampering
with their copies of the ledger and creating "alternate universes." If I sign
a chain_fingerprint this introduces crypto material into the chain
that could ONLY have come from me. A nefarious business partner creating
an alternate universe and recalculating all the fingerprints and saving them
and claiming that to be the real chain cannot reproduce my signatures
without getting me involved.
Signing transactions is a vital part of blockchain security/integrity but
it also introduces risk because unlike hashing which is completely
identity-independent, signing requires a private key and something in the
blockchain process must exist to deliver unsigned data to you so you can
securely sign it and pass the result back. You must be very careful to physically guard your private keys.
It is much, much easier to steal a private key than to computationally attack
encrypted material. This challenge has been present for more than 20 years.
But...
In the emerging world of smart contracts, this
could have devastating consequences as contracts signed by you (but not really you) automatically transfer ownership of your car to an unintended third
party, which quickly sells the car for bitcoins, remaining anonymous and leaving
you to deal with the new owner who can present cryptographically secure proof
that he owns the asset. Because people are fallible -- much more so than
strong cryptography -- clearly legal counsel will continue to be a needed
profession.
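To make the signing idea from the signatures discussion above concrete, here is a deliberately TOY sketch using textbook RSA with two small, well-known Mersenne primes. It is insecure and for illustration only; everything about it (key size, lack of padding, function names) is an assumption made to keep the example self-contained. Real systems use vetted crypto libraries and much stronger keys, and Bitcoin in particular uses elliptic curve signatures rather than RSA.

```python
import hashlib

# TOY key material, insecure by design: two small Mersenne primes.
P = 2**127 - 1
Q = 2**89 - 1
N = P * Q                              # public modulus
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (needs Python 3.8+)

def fingerprint_as_int(data: bytes) -> int:
    """SHA-256 fingerprint, reduced modulo N so the toy key can sign it."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N

def sign(data: bytes, private_exponent: int) -> int:
    """Only the holder of the private exponent can produce this value."""
    return pow(fingerprint_as_int(data), private_exponent, N)

def verify(data: bytes, signature: int) -> bool:
    """Anyone can check a signature using only the public values E and N."""
    return pow(signature, E, N) == fingerprint_as_int(data)

original = b"version 3 of the loan"
signature = sign(original, D)
genuine = verify(original, signature)

# An "alternate universe": the data is altered but the old signature is reused.
altered = b"version 3 of the loan, ownership changed"
reused_signature_accepted = verify(altered, signature)

# A forger without D guesses some other exponent (12345 is arbitrary).
forged = sign(altered, 12345)
forgery_accepted = verify(altered, forged)
```

Hashing alone would let the forger recompute every fingerprint; it is the private exponent D, which never leaves the signer, that the forger cannot reproduce.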
What is Mining?
Recall we said above that any "thing" can be a versioned entry on a chain.
Mining is the process by which new versions of blocks of transactions -- these
blocks being the "things" -- are committed to a chain that is of the
distributed, multicopy form.
The process is designed to be deliberately difficult to
perform and, unfortunately, requires a lot of time and electricity, and in ever
increasing amounts. So much so that it is not practical to try to commit
individual transactions on the blockchain; instead, many of them
(anywhere from 1000 to perhaps 2500) are bundled into a block and that becomes
the unit of work for mining. Mining is an essential part of validating
the integrity of the transactions on a multiple copy distributed chain and
is also important to prevent double spending, which is too complex to
cover here. However, the basic process of transaction accumulation and time
to mine is a very important factor in understanding the dynamics of data in
a blockchain. Long story short: do not assume the blockchain is
a high performance, queryable database like MySQL or Oracle or DB2 or MongoDB.
The basic idea in mining a new block is to get the
fingerprint to look like a special target sequence with a certain number
of leading zeros. For example, instead of the fingerprint looking like this:
8e12fd1980258264f694cf2fa788388af9172c1ce9fc994aea3f6067e50414d5
it needs to look like this:
0000000000000000057fcc708cf0130d95e27c5819203e9f967ac56e4df598ee
As described above, it is infeasible to "back into" this value of fingerprint.
Instead, you must try to create it over and over again using a nonce
which is an extra "ingredient" in the hash. The pseudocode actually is
pretty straightforward:
fingerprint = null;
while(fingerprint does not contain the required number of leading zeros) {
    nonce = 4 bytes of random material (32 bits, or about 4 billion possibilities);
    fingerprint = hash(block of transaction data + nonce);
}
When the right nonce is discovered, the block of transactions is considered
mined and the new block along with the nonce that created it is published to
the distributed network. Miners are rewarded for their effort and expense
by receiving bitcoins.
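A runnable miniature of the mining loop, with the difficulty turned way down so it finishes quickly. It is a sketch under simplifying assumptions: it counts nonces upward instead of picking them at random, uses an 8 byte nonce, and hashes a made-up block once with SHA-256. Bitcoin itself hashes a fixed-format block header twice with SHA-256 and compares the result against a numeric target.

```python
import hashlib
from itertools import count

def mine(block: bytes, leading_zero_hex_digits: int):
    """Try nonces 0, 1, 2, ... until the fingerprint starts with enough zeros."""
    target_prefix = "0" * leading_zero_hex_digits
    for nonce in count():
        candidate = block + nonce.to_bytes(8, "big")
        digest = hashlib.sha256(candidate).hexdigest()
        if digest.startswith(target_prefix):
            return nonce, digest

def check(block: bytes, nonce: int, leading_zero_hex_digits: int) -> bool:
    """Verifying a mined block takes exactly one hash."""
    digest = hashlib.sha256(block + nonce.to_bytes(8, "big")).hexdigest()
    return digest.startswith("0" * leading_zero_hex_digits)

# Hypothetical block contents.
block = b"alice pays bob 5; bob pays carol 2"

# Four leading zero hex digits: on average about 16**4 = 65,536 attempts.
nonce, digest = mine(block, 4)
print(nonce, digest)
```

Each additional required zero hex digit multiplies the expected work by 16, while checking a claimed nonce stays at a single hash. That asymmetry is the whole point.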
What's On a Chain?
As we mentioned above, anything can be on a chain. The chain immutability
and integrity machinery is agnostic to the payload. So what is a good use
case for this kind of capability?
- Need to have very strong proof of immutability
- Need to have precise and efficient reconciliation of physical copies
- Need to have good story around nonrepudiation particularly in multiparty
transactions
Smart Contracts
Again, the truth is implementations capturing the essence of smart contracts
have existed for over two decades in the form of domain
specific languages or embedding interpreted languages inside more compile-time
oriented languages.
In the mid 1990s, we stored smallish perl programs in a database
as a BLOB. These programs exploited the compactness and "quickness" of
perl to perform if/then/else logic and array and hashmap manipulation
without getting buried in the rigid and unterse syntax of C++.
Every night, a C++ program linked with the perl interpreter
would iteratively fetch these programs based on various criteria, determine the data needs, make market and other
information available to it, let it run its perl logic (which could also make use of
the parent C++ program's high performance functions and, indeed, the
distributed computing environment), and then save results back to the
database.
Sound familiar? It is also important to note that even today most smart
contract implementations have some sort of a runtime context around them.
In other words, the contract software as a unit of release just "sits
there"; something has to run it and bring it to life. In the example
above, the parent C++ program was the execution engine that took care of this.
Today, smart contracts require a similar engine that is live and sitting
on top of the blockchain. The code for the smart contract is part of
the data payload managed on the blockchain and enjoys the same benefits
of immutability as regular "simple" data like fields of numbers and text.
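A tiny sketch of that arrangement: contract source stored as payload next to its fingerprint, and a host engine that checks the fingerprint before bringing the code to life. The contract, the record layout, and the context fields are all invented for illustration, and the bare exec call stands in for what would be a properly sandboxed runtime in any real engine.

```python
import hashlib

# Hypothetical contract source, stored as ordinary payload data.
CONTRACT_SOURCE = '''
def run(context):
    # Release funds only once the inspection has passed and escrow is funded.
    if context["inspection_passed"] and context["escrow_balance"] >= context["price"]:
        return {"action": "release_funds", "amount": context["price"]}
    return {"action": "wait"}
'''

stored = {
    "source": CONTRACT_SOURCE,
    "fingerprint": hashlib.sha256(CONTRACT_SOURCE.encode()).hexdigest(),
}

def execute(record, context):
    """Host engine: confirm the code is unaltered, load it, run it with supplied data."""
    actual = hashlib.sha256(record["source"].encode()).hexdigest()
    if actual != record["fingerprint"]:
        raise ValueError("contract source does not match its stored fingerprint")
    namespace = {}
    exec(record["source"], namespace)  # illustration only; real engines sandbox this
    return namespace["run"](context)

context = {"inspection_passed": True, "escrow_balance": 500, "price": 400}
result = execute(stored, context)

# Someone edits the stored source but cannot make the old fingerprint match.
tampered = dict(stored, source=stored["source"].replace(">=", "<"))
```

The contract text does nothing on its own; `execute` plays the role the parent C++ program played in the perl example.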
Note that you actually don't need a blockchain to make a smart contract
run, but there are 2 important features that the blockchain brings to the table:
- Versioning, chain immutability, and signature integration
capabilities
- Tight/deep integration with cryptocurrency and wallets, right down to primitives
in the smart contract language. This is very useful in
that decisions made by the smart contract can actually result in actual value being
transferred amongst participants.
To be fair, smart contracts have a little more work to do than our 1995 version:
- Very dynamic contracts may wish to respond to events in realtime. This
means they must be able to "tell" their runtime environment what things they
are looking for -- and in a reasonably platform independent AND an expressive
way. For example,
a contract may want to listen to the 10 min delayed NASDAQ stock ticker but
only for stocks ABC and DEF.
- SLA becomes a very important component of the realtime environment, in
terms of data delivery uptime, latency, and sequencing/replay. This is a
broad and difficult problem and designs and solutions are only beginning to
emerge now. Note that precise SLAs are not new; it's just that the
cryptographically sealed "autonomy" of the logic
and frankly, the greater expressivity of the logic in the smart contract landscape appears to have raised the bar on this issue.
- The source(s) of the data needs to be agreed upon by parties participating
in the contract and the provenance / integrity of the source(s) needs to be
cryptographically ensured. The term for these sources is ... brace yourself ...
oracles. Will Oracle Corp. provide oracles? Who knows.
This article has great additional information on oracles and some of the
challenges facing "good" smart contract construction.
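Going back to the first item in the list above, one way a contract might "tell" its runtime what it is looking for is a small declarative subscription that the runtime uses to filter events. This is only a sketch; the class names, the feed name, and the event shape are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Subscription:
    source: str            # name of a feed the parties have agreed on
    symbols: frozenset     # which symbols on that feed the contract cares about

@dataclass
class Contract:
    name: str
    subscriptions: list
    received: list = field(default_factory=list)

    def on_event(self, event: dict) -> None:
        self.received.append(event)

def dispatch(event: dict, contracts: list) -> None:
    """Runtime side: deliver an event only to contracts that asked for it."""
    for contract in contracts:
        wanted = any(
            sub.source == event["source"] and event["symbol"] in sub.symbols
            for sub in contract.subscriptions
        )
        if wanted:
            contract.on_event(event)

# A contract listening to the 10 min delayed NASDAQ ticker for ABC and DEF only.
contract = Contract(
    name="hypothetical-hedge",
    subscriptions=[Subscription("nasdaq-delayed-10m", frozenset({"ABC", "DEF"}))],
)

events = [
    {"source": "nasdaq-delayed-10m", "symbol": "ABC", "price": 10.0},
    {"source": "nasdaq-delayed-10m", "symbol": "XYZ", "price": 99.0},
    {"source": "some-other-feed", "symbol": "ABC", "price": 10.1},
    {"source": "nasdaq-delayed-10m", "symbol": "DEF", "price": 20.0},
]
for event in events:
    dispatch(event, [contract])
```

The sketch says nothing about SLAs or about proving where the events came from, which are the harder problems raised in the other two items.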
Consensus, Responsibility, Risk, and Incentivization
There is a lot of material already published on consensus and although it
is generally accepted that several fundamentally different types of consensus
models exist (e.g. practical byzantine fault tolerance (PBFT) and proof-of-work (PoW)),
a more important set of considerations is responsibility and incentivization.
In the Bitcoin system, consensus is achieved through proof-of-work by a
statistically important large number of participants performing extremely
objective and clear (but time/cost expensive) operations for which there is
specific incentivization. Perfect.
But the concept of consensus gets murkier when the problem is not statistically
based and involves data much more complex than a bitcoin value. For
example, consider a real estate processing blockchain involving 6 parties:
the buyer, the seller, the broker, the buyer's bank, the housing inspector,
and an escrow bank. There is no bitcoin-style consensus here. There are
not 1000s of miners performing the same task in parallel, each trying to
win the next block. Instead, there is only one of each type of participant,
each with a different set of responsibilities and incentivizations. This
gives rise to the following:
Moschetti's Conjecture
No participant will provide input to consensus or voting regarding
authenticity and/or accuracy on data or process for
which they are not:
- Incentivized (typically, and almost always, through monetary compensation)
- Protected by legal precedent based on an in-economy set of risk mitigation
procedures.
This does not defeat the usefulness of the blockchain, of course, but developers
of solutions must be careful when using the term "consensus." Consensus is
only appropriate when 2 or more participants work in parallel and a mathematical
model is employed to determine if conditions are sufficient for workflow to move forward.
It is worth noting that consensus does not have to be a PhD-complex algorithm
or one that demands large numbers of participants -- and in fact, many consensus
models in longer cycle business-transaction workflows look very much like
standard workflow approval, e.g. if a simple majority of participants at stage n
say all is well, proceed to stage n+1. Or even simpler (and very common):
when all participants say all is well, move to stage n+1.
As such, in most
business workflows, it will be necessary to clearly define the fields for which a
participant has "vouching/review" responsibility. This is the next step beyond
basic read/write entitlements.
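The two workflow-style rules described above, plus the idea of per-field vouching responsibility, fit in a few lines. This is a sketch only; the participant names, the field names, and the responsibility map are invented for illustration.

```python
def simple_majority(votes: dict) -> bool:
    """True when more than half of the participants say all is well."""
    return sum(votes.values()) > len(votes) / 2

def unanimous(votes: dict) -> bool:
    """True when every participant says all is well."""
    return len(votes) > 0 and all(votes.values())

def advance(stage: int, votes: dict, rule) -> int:
    """Move from stage n to stage n+1 only if the chosen rule is satisfied."""
    return stage + 1 if rule(votes) else stage

# Hypothetical map of which participant vouches for which fields.
VOUCHES_FOR = {
    "inspector": {"inspection_report"},
    "buyer_bank": {"ltv", "rate"},
    "escrow_bank": {"escrow_balance"},
}

def eligible_votes(votes: dict, field_name: str) -> dict:
    """Keep only the votes of participants responsible for the given field."""
    return {
        participant: ok
        for participant, ok in votes.items()
        if field_name in VOUCHES_FOR.get(participant, set())
    }

votes = {"inspector": True, "buyer_bank": False, "escrow_bank": True}

after_majority = advance(3, votes, simple_majority)
after_unanimous = advance(3, votes, unanimous)
ltv_votes = eligible_votes(votes, "ltv")
```

With these votes the majority rule moves the workflow from stage 3 to stage 4 while the unanimous rule holds it at stage 3, and only the buyer's bank gets a say on the LTV field.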
An exciting opportunity exists to hybridize single-actor workflow together
with consensus via crowdsourced incentivized participation. Relatively simple
but somewhat more subjective steps in a workflow could be tackled by dozens
or more participants, making their responses (mean and standard deviation)
more statistically relevant.
Site copyright © 2013-2025 Buzz Moschetti. All rights reserved