A Hitchhiker's Guide To Blockchain

6-Jan-2018

Blockchain ... it's almost too much to take on in one rant.

First and foremost, there is no single precise definition of "blockchain" like there is with, for example, the derivative of x²+1. The term blockchain is now used broadly to cover a soup of approaches involving immutability, transaction management, distribution of data, payload, and consensus. Here is a sample from both ends of the spectrum:

Concept	Bitcoin	Hyperledger (incl. many variations within)
Participants	Completely anonymous, known only by public key	All participants well-known and identity is vetted
Transactions	Accumulate in an uncommitted block (effectively a bucket). Miners attempt to find the right data (a "nonce") to add to the block that will produce a properly constructed hash or fingerprint of the block, after which the block can be committed to the chain and the result broadcast to the mining network	No mining, no nonces, and essentially no blocks. Each transaction (e.g. modification of a loan agreement, updating a current assessed value, etc.) yields a new version, which is hashed and committed to the chain
Consensus	Statistically driven, relying on large number of participants. Side chains can emerge but eventually, participants add more and more blocks to one particular chain, leading to longest chain wins model. Statistics suggest that after 6 blocks have been committed to a chain, the transactions within are nearly (but not 100%) guaranteed to be correct and without double-spending.	Workflow entitlements driven, relying on specific actions by specifically named participants. No consensus required although the workflow might demand two or more participants to do something before state change can take place. But this is not the same thing as law-of-large-numbers statistical consensus.
Distribution Model	Fully distributed data and processing running on any infrastructure from the cloud to a PC on a desktop. Many nodes in the network, each with a copy of the blockchain. Nodes broadcast changes and listen for others and each applies the same algorithms to achieve global consensus.	(Most extreme variation) Single copy of a single workflow running on infrastructure in the cloud hosted by a major company. No other nodes, no other copies. APIs typically exist for participants to "listen" for new activity on the workflow, upon which they can manually "copy down" the latest versions into their own technology (which may have nothing to do with blockchain) for local processing and querying. Note these local copies are NOT part of any consensus / data integrity model.
Payload	Completely objective and context-free, the value and bookkeeping data about the bitcoin. Any party examining the payload can understand it	The digital asset is an arbitrary payload such as a loan that may have a great deal of subjective and context-sensitive data. Consistent relevance/importance and interpretation of all the data to every party involved in the workflow is highly questionable, e.g. the building inspector does not care about the LTV and toggle rate parameters of the loan -- and by extension does not want to in any way be responsible for assuring their integrity

So... which one is correct? Both. Wikipedia summarizes blockchain thusly and I believe it is not only a fair description, but one that could be applied to both scenarios above:

A blockchain, originally block chain, is a continuously growing list of records, called blocks, which are linked and secured using cryptography. Each block typically contains a hash pointer as a link to a previous block, a timestamp and transaction data. By design, blockchains are inherently resistant to modification of the data. The Harvard Business Review describes it as "an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way." For use as a distributed ledger, a blockchain is typically managed by a peer-to-peer network collectively adhering to a protocol for validating new blocks. Once recorded, the data in any given block cannot be altered retroactively without the alteration of all subsequent blocks, which requires collusion of the network majority.

The landscape is filled with terms like "distributed ledger" and "smart contract" and different definitions have been applied to each one depending on the particular product at hand. In other words, there are many different (and useful and interesting) products and solutions performing very different workloads at different scale and performance -- and each trying to tag the solution with as many blockchain terms as possible.

Instead of adding to the mess by proclaiming another top-down definition of the blockchain as it revolutionizes yet another business use case, let's instead start fresh from the bottom up: a chain of transactions.

Chain Immutability: The Foundation

A critical feature of a blockchain -- arguably, the most important feature -- is to provide cryptographically enforced immutability of data. It is about how a series of versions of a thing (i.e. a "chain") can be "fingerprinted" in a way that guarantees both the integrity of each version and the exact sequence of the versions. The truth is some products and innovative internal development efforts have been doing this for 20 years but without the fanfare. It's pretty simple and efficient. Let's start with an implementation that readily and clearly exposes and ensures both individual version integrity and lineage (note: this is not the implementation used in most blockchains; we are trying to highlight the concepts separately, without the shortcuts that real implementations take):

Construct version 1 of a thing. The thing can be anything: a single string, a record of data, an Excel spreadsheet, an MP3 music file, a smart contract (more on that later!). Anything. In bitcoin, the thing is a set of candidate transactions enclosed in a block. In another system, it might be a big JSON object with metadata (createDate, updateDate, updateBy, etc. etc.) and rich shapes of domain-specific data like loan parameters. All these things are simply a sequence of bytes. The blockchain is agnostic to the payload!
Get the fingerprint of the thing using a hash function such as SHA-256. In the old days (the 1990s) we used MD5 but significantly better / more secure options are now widely used. There is an enormous amount of material available on hashing so I won't go into detail here, but two concepts are important:
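1. Any sequence of bytes no matter the length (i.e. big Excel spreadsheet, tiny email, etc.) always turns into a fixed-size fingerprint (32 bytes for SHA-256) that is, for all practical purposes, unique to that sequence of bytes. If two MS Word documents differ by a single space, the fingerprints will be different. Collisions must exist in theory, because there are more possible inputs than 32-byte outputs, but it is computationally infeasible to find two different inputs that yield the same fingerprint.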
2. It is computationally infeasible to "reverse" the fingerprint, i.e. figure out what sequence of bytes to put in to yield a desired fingerprint. This is why hash functions are also called one-way functions. Given bytes, it is easy to get the fingerprint -- but given a fingerprint, it is effectively impossible to get the bytes. This means you cannot "cheat" and invent target fingerprints.
Make version 2 of a thing and get the fingerprint in the same way as version 1.
Here's where the chain integrity part kicks in: We now take the fingerprint from version 1 and the fingerprint from version 2, combine them, and fingerprint that result. In pseudocode:
```
    fingerprint1 = hash(version 1)
    fingerprint2 = hash(version 2)
    merged = concatenate(fingerprint1, fingerprint2)
    chain_fingerprint_from_1_to_2 = hash(merged)
```
In practice, some more material is added in the concatenation step so that newly created versions that have no changes are "forced" to have a different chain fingerprint, but that's not important here.
Make version 3 of a thing and get its fingerprint.

Perform the chaining exercise again, but this time combine the chain fingerprint just computed with the fingerprint from version 3:

    fingerprint3 = hash(version 3)
    merged = concatenate(chain_fingerprint_from_1_to_2, fingerprint3)
    chain_fingerprint_from_1_to_3 = hash(merged)

One more time:

    fingerprint4 = hash(version 4)
    merged = concatenate(chain_fingerprint_from_1_to_3, fingerprint4)
    chain_fingerprint_from_1_to_4 = hash(merged)

At this point you can probably see what is happening. The chain_fingerprint at any particular version, e.g. 3, can only be constructed from version 3 of the thing PLUS the chain_fingerprint at version 2 -- and that can only be constructed from version 2 PLUS the fingerprint of version 1.

As a result, given a list of versions of a transaction, it is possible for anyone to "walk" the list and recalculate all the fingerprints and ensure that the recalculated data matches whatever was originally stored. Not a single byte of any of the versions can change nor can the order of the list. No secret keys are required; in fact, no keys are required at all and the process is patently transparent. It almost does not matter if a thing in fact has a "version number" as part of its data payload. It is the creation order and fingerprint chaining that is the ultimate guarantor of integrity and transaction activity over time.
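Here is a minimal Python sketch of the build-then-walk idea, using SHA-256 from the standard library. The function names (`fingerprint`, `build_chain`, `verify_chain`) and the sample payloads are made up for illustration. One assumption to call out: each step feeds the previous chain fingerprint (not just the previous version's own fingerprint) into the hash, so that every chain fingerprint depends on all earlier versions.

```python
import hashlib


def fingerprint(data: bytes) -> bytes:
    """SHA-256 fingerprint: always 32 bytes, whatever the input size."""
    return hashlib.sha256(data).digest()


def build_chain(versions):
    """Return one chain fingerprint per version, in creation order.

    Assumption: the first entry is simply the fingerprint of version 1;
    every later entry hashes (previous chain fingerprint + fingerprint
    of the new version).
    """
    chain = []
    for version in versions:
        fp = fingerprint(version)
        chain.append(fp if not chain else fingerprint(chain[-1] + fp))
    return chain


def verify_chain(versions, stored_chain) -> bool:
    """Walk the versions, recompute everything, compare to what was stored."""
    return build_chain(versions) == stored_chain


versions = [b"loan v1", b"loan v2", b"loan v3", b"loan v4"]
stored = build_chain(versions)

print(verify_chain(versions, stored))                      # untouched data
tampered = [b"loan v1", b"loan v2 ", b"loan v3", b"loan v4"]
print(verify_chain(tampered, stored))                      # one extra space
reordered = [versions[1], versions[0]] + versions[2:]
print(verify_chain(reordered, stored))                     # same data, wrong order
```

Note that no keys appear anywhere: anyone holding the versions and the stored fingerprints can run the walk.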

Most popular blockchain implementations, however, assume (rightly so) that individual version integrity and lineage cannot be separated. Thus, instead of tracking both the individual version fingerprint and the chain_fingerprint, each individual version fingerprint includes the fingerprint from the prior version as well:

    merged = concatenate(null, version 1); // 1st is special; no prior fingerprint!
    fingerprint 1 = hash(merged)

    merged = concatenate(fingerprint 1, version 2);
    fingerprint 2 = hash(merged)

    merged = concatenate(fingerprint 2, version 3);
    fingerprint 3 = hash(merged)

    merged = concatenate(fingerprint 3, version 4);
    fingerprint 4 = hash(merged)
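The same idea in runnable form, as a rough sketch. `GENESIS` is my stand-in for the "null" prior fingerprint (32 zero bytes is an arbitrary choice), and `link_fingerprints` and `first_bad_version` are hypothetical helper names. The second helper shows a useful side effect of this layout: walking the chain tells you not only that something changed but where the damage starts.

```python
import hashlib

# Assumption: 32 zero bytes play the role of the "null" prior fingerprint.
GENESIS = b"\x00" * 32


def link_fingerprints(versions):
    """Each fingerprint covers the prior fingerprint plus the new version."""
    prior = GENESIS
    fingerprints = []
    for version in versions:
        prior = hashlib.sha256(prior + version).digest()
        fingerprints.append(prior)
    return fingerprints


def first_bad_version(versions, stored):
    """Return the 1-based number of the first version that fails, or None."""
    if len(versions) != len(stored):
        raise ValueError("versions and stored fingerprints differ in length")
    prior = GENESIS
    for number, (version, expected) in enumerate(zip(versions, stored), start=1):
        prior = hashlib.sha256(prior + version).digest()
        if prior != expected:
            return number
    return None


versions = [b"v1", b"v2", b"v3", b"v4"]
stored = link_fingerprints(versions)

print(first_bad_version(versions, stored))                      # None
print(first_bad_version([b"v1", b"v2", b"vX", b"v4"], stored))  # 3
```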

As valuable and important as chain immutability is, there are several important points that should be made here:

Chain immutability is independent of whether the chain is distributed, copied, or centralized. When we say "originally stored" above, we assume that the transactions are being stored somewhere -- but we can have one cloud-enabled instance of this chain upon which all participants interact or 1000s of copies around the world. Varying objectives for ownership/control, performance, and other factors will determine how many duplicated chains exist and give rise to soft terms like "local ledger", "distributed ledger", "centrally shared ledger", "shared ledger", etc., but distribution and/or duplication of the chain is separate from the fundamental immutability of the chain. The chain immutability guarantees that if copies need to be reconciled, the process is both precise and efficient. In a multiple local copy implementation, there is no point in tampering with my local copy because the tampering will generate a different set of fingerprints and my peers who also have a copy will not agree with my new fingerprints (see signatures below for even stronger integrity).
Note: Even in a single shared central ledger design like the IBM Cloud Hyperledger, it is very likely that participants will have to make an out-of-process copy of the ledger in order to integrate the data with other systems. Long story short, there is no practical way you are going to issue a SELECT statement to join the blockchain persistor to your local database.
Chain immutability is not the same thing as database immutability. The steps outlined above describe the cryptographic math and process around immutability. In reality, all those versions and fingerprints have to go somewhere, like a database. It is perfectly reasonable to assume someone could, intentionally or otherwise, change the data in the database, setting aside standard permissions design and issues for the moment. The database doesn't natively know about hashing content and the formula for computing the chain_fingerprint. It is only when the chain is walked by software and the material rehashed and compared against stored values that the integrity is assured. Unless the database is periodically walked in this fashion, versions of things could be changed and consumed by business processes despite the existence of a now-invalid fingerprint. Turning off the update privilege and permitting only new versions of things (and their fingerprints) to be added seems attractive but this could be defeated by a database admin. Walking each thing's version chain is a privilege-independent operation that cannot be suborned by root or dbo.
Chain immutability is independent of physical robustness of storage. There's nothing about blockchain that prevents you from deleting the database holding the transactions and fingerprints. There's no spec about backup and recovery. Of course, in a multicopy distributed model, statistically at least one or two copies of the chain should survive destruction of backing storage, so from a pure consumer (not chain provider) point of view the storage is robust. But each copy still needs individual backup, or its owner faces the tedious task of downloading the entire transaction chain again. Immutability refers only to the cryptographic integrity of a chain, not the infrastructure to manage the data. Robust solutions still require HA and DR for the chain. Cloud-based solutions may vastly simplify HA/DR but make no mistake -- something in the stack has to be making copies of the chain to defend against non-availability of the storage media. Don't confuse assertions like "distributed ledgers are immune to single point-of-failure problems" with basic HA/DR capability. Chances are very high that if you are using a solution that features a full local copy under your management, you will want to have an HA/DR strategy in place. Speaking of databases, there's also startlingly little in the way of specification of how you can richly query for transactions (e.g. "find all transactions between these 2 dates where owner is Bob, amount > 100, and product is X or Y. And do that in 20 milliseconds") but that's a whole different rant.
Chain immutability has nothing to do with entitlements on the data itself such as protection of personally identifiable information. It is perfectly fine to encrypt sensitive fields (any number, in any way) in the data; the hash function does not care if it is hashing plaintext or ciphertext. The fingerprinting process examines a sequence of bytes; that's all. In fact, basic use of a blockchain requires no encryption whatsoever. A robust solution using a blockchain will require a separate data entitlement model on top of the chain machinery.
Chain immutability has nothing to do with signatures. Signatures form the basis for nonrepudiation which is a different concept than immutability.
Signatures involve the use of public/private keypairs, not one-way hash functions. Just because a chain has immutable data does not mean that I as a participant agree with the values; it simply means it cannot be changed. For me to "stamp my approval" on a new version (or, by extension, declare that it is truly me creating the new version), I sign the chain_fingerprint with my private key and the result is stored "alongside" the main data in the blockchain database. At this point, I have committed myself to this version. Anyone with my public key can pass the signature through it to reverse it to the original chain_fingerprint, thereby guaranteeing that it was me and only me who could have created the signature. Furthermore, signatures stop others from tampering with their copies of the ledger and creating "alternate universes." If I sign a chain_fingerprint this introduces crypto material into the chain that could ONLY have come from me. A nefarious business partner creating an alternate universe and recalculating all the fingerprints and saving them and claiming that to be the real chain cannot reproduce my signatures without getting me involved.
Signing transactions is a vital part of blockchain security/integrity but it also introduces risk. Unlike hashing, which is completely identity-independent, signing requires a private key, and something in the blockchain process must exist to deliver unsigned data to you so you can securely sign it and pass the result back. You must be very careful to physically guard your private keys. It is much, much easier to steal a private key than to computationally attack encrypted material. This challenge has been present for more than 20 years.
But...
In the emerging world of smart contracts, this could have devastating consequences as contracts signed by you (but not really you) automatically transfer ownership of your car to an unintended third party, which quickly sells the car for bitcoins, remaining anonymous and leaving you to deal with the new owner who can present cryptographically secure proof that he owns the asset. Because people are fallible -- much more so than strong cryptography -- clearly legal counsel will continue to be a needed profession.
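To make the sign-and-verify step described above concrete, here is a deliberately toy sketch using textbook RSA with tiny primes and nothing but the Python standard library. It is NOT secure and not how any real product does it (real systems use vetted crypto libraries, padding schemes, and much larger keys or elliptic curves). All the names (`P`, `Q`, `sign`, `verify`, `as_int`) are invented for the illustration. The point is only the shape of the operation: the private exponent creates the signature, and the public exponent "reverses" it back to the fingerprint.

```python
import hashlib

# Toy keypair from tiny primes -- illustration only, trivially breakable.
P, Q = 61, 53
N = P * Q                           # public modulus (3233)
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent; needs Python 3.8+


def as_int(chain_fingerprint: bytes) -> int:
    """Shrink the 32-byte fingerprint so it fits under the toy modulus."""
    return int.from_bytes(chain_fingerprint, "big") % N


def sign(chain_fingerprint: bytes) -> int:
    """Only the holder of the private exponent D can produce this value."""
    return pow(as_int(chain_fingerprint), D, N)


def verify(chain_fingerprint: bytes, signature: int) -> bool:
    """Anyone with the public pair (E, N) can check the signature."""
    return pow(signature, E, N) == as_int(chain_fingerprint)


chain_fingerprint = hashlib.sha256(b"loan v2").digest()
signature = sign(chain_fingerprint)

print(verify(chain_fingerprint, signature))             # the genuine signature
print(verify(chain_fingerprint, (signature + 1) % N))   # a forged signature
```

Notice that the hashing half of this needed no identity at all, while the signing half is only as trustworthy as the secrecy of `D`, which is exactly the risk described above.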

What is Mining?

Recall we said above that any "thing" can be a versioned entry on a chain. Mining is the process by which new versions of blocks of transactions -- these blocks being the "things" -- are committed to a chain that is of the distributed, multicopy form. The process is designed to be deliberately difficult to perform and, unfortunately, requires a lot of time and electricity, and in ever increasing amounts. So much so that it is not practical to try to commit individual transactions on the blockchain; instead, many of them (anywhere from 1000 to perhaps 2500) are bundled into a block and that becomes the unit of work for mining. Mining is an essential part of validating the integrity of the transactions on a multiple copy distributed chain and is also important to prevent double spending, which is too complex to cover here. However, the basic process of transaction accumulation and time to mine is a very important factor in understanding the dynamics of data in a blockchain. Long story short: do not assume the blockchain is a high performance, queryable database like MySQL or Oracle or DB2 or MongoDB.

The basic idea in mining a new block is to get the fingerprint to look like a special target sequence with a certain number of leading zeros. For example, instead of the fingerprint looking like this:

    8e12fd1980258264f694cf2fa788388af9172c1ce9fc994aea3f6067e50414d5

it needs to look like this:

    0000000000000000057fcc708cf0130d95e27c5819203e9f967ac56e4df598ee

As described above, it is infeasible to "back into" this value of fingerprint. Instead, you must try to create it over and over again using a nonce which is an extra "ingredient" in the hash. The pseudocode actually is pretty straightforward:

    fingerprint = null;
    while(fingerprint does not contain the required number of leading zeros) {
        nonce = 4 bytes of random material (32 bits, or 4 billion possibilities);
        fingerprint = hash(block of transaction data + nonce);
    }

When the right nonce is discovered, the block of transactions is considered mined and the new block along with the nonce that created it is published to the distributed network. Miners are rewarded for their effort and expense by receiving bitcoins.
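A runnable version of that loop, as a sketch only. The function name `mine`, the parameters `zero_hex_digits` and `max_tries`, and the sample block bytes are my own inventions, and the difficulty is set absurdly low so it finishes in a blink. Bitcoin's real scheme differs in detail (it double-hashes a block header and compares against a numeric target), but the try-a-nonce-and-rehash dynamic is the same.

```python
import hashlib
import os


def mine(block_data: bytes, zero_hex_digits: int = 3, max_tries: int = 10_000_000):
    """Try random 4-byte nonces until the fingerprint starts with enough zeros.

    Each extra leading hex zero multiplies the expected work by 16:
    3 digits needs about 16**3 = 4096 tries on average.
    """
    target_prefix = "0" * zero_hex_digits
    for _ in range(max_tries):
        nonce = os.urandom(4)                      # 32 bits of random material
        fp = hashlib.sha256(block_data + nonce).hexdigest()
        if fp.startswith(target_prefix):
            return nonce, fp
    raise RuntimeError("no suitable nonce found within max_tries")


block = b"alice->bob:5;bob->carol:2;carol->dave:1"
nonce, fp = mine(block)
print(nonce.hex(), fp)

# Checking someone else's work takes a single hash, not thousands.
print(hashlib.sha256(block + nonce).hexdigest() == fp)
```

The asymmetry is the whole trick: finding the nonce is expensive, while verifying a published nonce is nearly free for every other node.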

What's On a Chain?

As we mentioned above, anything can be on a chain. The chain immutability and integrity machinery is agnostic to the payload. So what is a good use case for this kind of capability?

Need to have very strong proof of immutability
Need to have precise and efficient reconciliation of physical copies
Need to have a good story around nonrepudiation, particularly in multiparty transactions

Smart Contracts

Again, the truth is implementations capturing the essence of smart contracts have existed for over two decades in the form of domain specific languages or embedding interpreted languages inside more compile-time oriented languages.

In the mid 1990s, we stored smallish perl programs in a database as a BLOB. These programs exploited the compactness and "quickness" of perl to perform if/then/else logic and array and hashmap manipulation without getting buried in the rigid and unterse syntax of C++. Every night, a C++ program linked with the perl interpreter would iteratively fetch these programs based on various criteria, determine the data needs, make market and other information available to it, let it run its perl logic (which could also make use of the parent C++ program's high performance functions and, indeed, the distributed computing environment), and then save results back to the database.

Sound familiar? It is also important to note that even today most smart contract implementations have some sort of a runtime context around them. In other words, the contract software as a unit of release just "sits there"; something has to run it and bring it to life. In the example above, the parent C++ program was the execution engine that took care of this. Today, smart contracts require a similar engine that is live and sitting on top of the blockchain. The code for the smart contract is part of the data payload managed on the blockchain and enjoys the same benefits of immutability as regular "simple" data like fields of numbers and text.

Note that you actually don't need a blockchain to make a smart contract run, but there are 2 important features that the blockchain brings to the table:

Versioning, chain immutability, and signature integration capabilities
Tight/deep integration with cryptocurrency and wallets, right down to primitives in the smart contract language. This is very useful in that decisions made by the smart contract can result in actual value being transferred amongst participants.

To be fair, smart contracts have a little more work to do than our 1995 version:

Very dynamic contracts may wish to respond to events in realtime. This means they must be able to "tell" their runtime environment what things they are looking for -- and in a reasonably platform independent AND an expressive way. For example, a contract may want to listen to the 10 min delayed NASDAQ stock ticker but only for stocks ABC and DEF.
SLA becomes a very important component of the realtime environment, in terms of data delivery uptime, latency, and sequencing/replay. This is a broad and difficult problem and designs and solutions are only beginning to emerge now. Note that precise SLAs are not new; it's just that the cryptographically sealed "autonomy" of the logic and frankly, the greater expressivity of the logic in the smart contract landscape appears to have raised the bar on this issue.
The source(s) of the data need to be agreed upon by parties participating in the contract and the provenance / integrity of the source(s) needs to be cryptographically ensured. The term for these sources is ... brace yourself ... oracles. Will Oracle Corp. provide oracles? Who knows.

This article has great additional information on oracles and some of the challenges facing "good" smart contract construction.

Consensus, Responsibility, Risk, and Incentivization

There is a lot of material already published on consensus and although it is generally accepted that several fundamentally different types of consensus models exist (e.g. practical byzantine fault tolerance (PBFT) and proof-of-work (PoW)), a more important set of considerations is responsibility and incentivization.

In the Bitcoin system, consensus is achieved through proof-of-work by a statistically important large number of participants performing extremely objective and clear (but time/cost expensive) operations for which there is specific incentivization. Perfect.
But the concept of consensus gets murkier when it is not a statistically based problem involving data much more complex than a bitcoin value. For example, consider a real estate processing blockchain involving 6 parties: the buyer, the seller, the broker, the buyer's bank, the housing inspector, and an escrow bank. There is no bitcoin-style consensus here. There are not 1000s of miners performing the same task in parallel, each trying to win the next block. Instead, there is only one of each type of participant, each with a different set of responsibilities and incentivizations. This gives rise to the following:

Moschetti's Conjecture

No participant will provide input to consensus or voting regarding authenticity and/or accuracy on data or process for which they are not:

Incentivized (typically, and almost always, through monetary compensation)
Protected by legal precedent based on an in-economy set of risk mitigation procedures.

This does not defeat the usefulness of the blockchain, of course, but developers of solutions must be careful when using the term "consensus." Consensus is only appropriate when 2 or more participants work in parallel and a mathematical model is employed to determine if conditions are sufficient for workflow to move forward. It is worth noting that consensus does not have to be a PhD-complex algorithm or one that demands large numbers of participants -- and in fact, many consensus models in longer cycle business-transaction workflows look very much like standard workflow approval, e.g. if a simple majority of participants at stage n say all is well, proceed to stage n+1. Or even simpler (and very common): when all participants say all is well, move to stage n+1. As such, in most business workflows, it will be necessary to clearly define the fields for which a participant has "vouching/review" responsibility. This is the next step beyond basic read/write entitlements.
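The two workflow-style rules just mentioned are small enough to sketch in a few lines. The function `may_proceed`, the rule names, and the vote data are hypothetical; the participants are borrowed from the real estate example above.

```python
def may_proceed(votes: dict, rule: str = "unanimous") -> bool:
    """Decide whether the workflow can move from stage n to stage n+1.

    votes maps participant name -> True ("all is well") or False.
    """
    if not votes:
        return False                          # nobody vouched for anything
    approvals = sum(1 for ok in votes.values() if ok)
    if rule == "unanimous":
        return approvals == len(votes)
    if rule == "majority":
        return approvals * 2 > len(votes)     # strictly more than half
    raise ValueError(f"unknown rule: {rule}")


stage_votes = {
    "buyer": True,
    "seller": True,
    "broker": True,
    "buyer's bank": True,
    "housing inspector": False,
    "escrow bank": True,
}

print(may_proceed(stage_votes, "unanimous"))   # False: the inspector objected
print(may_proceed(stage_votes, "majority"))    # True: 5 of 6 approve
```

What this sketch leaves out is the hard part: which fields each participant is actually vouching for when they vote.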

An exciting opportunity exists to hybridize single-actor workflow together with consensus via crowdsourced incentivized participation. Relatively simple but somewhat more subjective steps in a workflow could be tackled by dozens or more participants, making their responses (mean and standard deviation) more statistically relevant.

Like this? Dislike this? Let me know