A Hitchhiker's Guide To Blockchain

6-Jan-2018 Like this? Dislike this? Let me know

Blockchain ... it's almost too much to take on in one rant.

First and foremost, there is no single precise definition of "blockchain" like there is with, for example, the derivative of x^2+1. The term blockchain is now used broadly to cover a soup of approaches involving immutability, transaction management, distribution of data, payload, and consensus. Here is a sample from both ends of the spectrum:

| Concept | Bitcoin | Hyperledger (incl. many variations within) |
| --- | --- | --- |
| Participants | Pseudonymous: no real-world identity is attached and participants are known only by public key | All participants well-known and identity is vetted |
| Transactions | Accumulate in an uncommitted block (effectively a bucket). Miners attempt to find the right data (a "nonce") to add to the block that will produce a properly constructed hash or fingerprint of the block, after which the block can be committed to the chain and the result broadcast to the mining network | No mining, no nonces, and essentially no blocks. Each transaction (e.g. modification of a loan agreement, updating a current assessed value, etc.) yields a new version, which is hashed and committed to the chain |
| Consensus | Statistically driven, relying on a large number of participants. Side chains can emerge but eventually participants add more and more blocks to one particular chain, leading to a longest-chain-wins model. Statistics suggest that after 6 blocks have been committed to a chain, the transactions within are nearly (but not 100%) guaranteed to be correct and without double-spending. | Workflow entitlements driven, relying on specific actions by specifically named participants. No consensus required, although the workflow might demand two or more participants to do something before a state change can take place. But this is not the same thing as law-of-large-numbers statistical consensus. |
| Distribution Model | Fully distributed data and processing running on any infrastructure from the cloud to a PC on a desktop. Many nodes in the network, each with a copy of the blockchain. Nodes broadcast changes and listen for others and each applies the same algorithms to achieve global consensus. | (Most extreme variation) Single copy of a single workflow running on infrastructure in the cloud hosted by a major company. No other nodes, no other copies. APIs typically exist for participants to "listen" for new activity on the workflow, upon which they can manually "copy down" the latest versions into their own technology (which may have nothing to do with blockchain) for local processing and querying. Note these local copies are NOT part of any consensus / data integrity model. |
| Payload | Completely objective and context-free: the value and bookkeeping data about the bitcoin. Any party examining the payload can understand it | The digital asset is an arbitrary payload such as a loan that may have a great deal of subjective and context-sensitive data. Consistent relevance/importance and interpretation of all the data to every party involved in the workflow is highly questionable, e.g. the building inspector does not care about the LTV and toggle rate parameters of the loan -- and by extension does not want to in any way be responsible for assuring their integrity |

So... which one is correct? Both. Wikipedia summarizes blockchain thusly and I believe it is not only a fair description, but one that could be applied to both scenarios above:

A blockchain, originally block chain, is a continuously growing list of records, called blocks, which are linked and secured using cryptography. Each block typically contains a hash pointer as a link to a previous block, a timestamp and transaction data. By design, blockchains are inherently resistant to modification of the data. The Harvard Business Review describes it as "an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way." For use as a distributed ledger, a blockchain is typically managed by a peer-to-peer network collectively adhering to a protocol for validating new blocks. Once recorded, the data in any given block cannot be altered retroactively without the alteration of all subsequent blocks, which requires collusion of the network majority.
The landscape is filled with terms like "distributed ledger" and "smart contract" and different definitions have been applied to each one depending on the particular product at hand. In other words, there are many different (and useful and interesting) products and solutions performing very different workloads at different scale and performance -- and each trying to tag the solution with as many blockchain terms as possible.

Instead of adding to the mess by proclaiming another top-down definition of the blockchain as it revolutionizes yet another business use case, let's instead start fresh from the bottom up: a chain of transactions.

Chain Immutability: The Foundation

A critical feature of a blockchain -- arguably, the most important feature -- is to provide cryptographically enforced immutability of data. It is about how a series of versions of a thing (i.e. a "chain") can be "fingerprinted" in a way that guarantees both the integrity of each version and the exact sequence of the versions. The truth is some products and innovative internal developments have been doing this for 20 years but without the fanfare. It's pretty simple and efficient and works more or less as follows:
  1. Construct version 1 of a thing. The thing can be anything: a single string, a record of data, an Excel spreadsheet, an MP3 music file, a smart contract (more on that later!). Anything. In bitcoin, the thing is a set of candidate transactions enclosed in a block. In another system, it might be a big JSON object with metadata (createDate, updateDate, updateBy, etc. etc.) and rich shapes of domain-specific data like loan parameters. All these things are simply a sequence of bytes. The blockchain is agnostic to the payload!

  2. Get the fingerprint of the thing using a hash function such as SHA-256. In the old days (the 1990s) we used MD5 but significantly better / more secure options are now widely used. There is an enormous amount of material available on hashing so I won't go into detail here, but two concepts are important:
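    1. Any sequence of bytes no matter the length (i.e. big Excel spreadsheet, tiny email, etc.) always turns into a fixed-size fingerprint (32 bytes for SHA-256) that is, for all practical purposes, unique for that sequence of bytes. If two MS Word documents differ by a single space, the fingerprints will be different. Collisions must exist in theory, because there are more possible inputs than 32-byte outputs, but with a good hash function it is computationally infeasible to find two different inputs that yield the same fingerprint.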
    2. It is computationally infeasible to "reverse" the fingerprint, i.e. figure out what sequence of bytes to put in to yield a desired fingerprint. This is why hash functions are also called one-way functions. Given bytes, it is easy to get the fingerprint -- but given a fingerprint, it is effectively impossible to get the bytes. This means you cannot "cheat" and invent target fingerprints.

  3. Make version 2 of a thing and get the fingerprint in the same way as version 1.
  4. Here's where the chain integrity part kicks in: We now take the fingerprint from version 1 and the fingerprint from version 2, combine them, and fingerprint that result. In pseudocode:
        fingerprint1 = hash(version 1)
        fingerprint2 = hash(version 2)
        merged = concatenate(fingerprint1, fingerprint2)
        chain_fingerprint_from_1_to_2 = hash(merged)
    In practice, some more material is added in the concatenation step so that newly created versions that have no changes are "forced" to have a different chain fingerprint, but that's not important here.

  5. Make version 3 of a thing and get its fingerprint.
  6. Perform the chaining exercise again, but this time chain onto the chain fingerprint that ended at version 2, so that version 1 stays bound into the result:
        fingerprint3 = hash(version 3)
        merged = concatenate(chain_fingerprint_from_1_to_2, fingerprint3)
        chain_fingerprint_from_2_to_3 = hash(merged)

  7. One more time:
        fingerprint4 = hash(version 4)
        merged = concatenate(chain_fingerprint_from_2_to_3, fingerprint4)
        chain_fingerprint_from_3_to_4 = hash(merged)
At this point you can probably see what is happening. The chain_fingerprint at any particular version, e.g. 3, can only be constructed from version 3 of the thing PLUS the fingerprint in version 2 -- and that can only be constructed from the fingerprint in version 1. Note: the example above has variations in practice. Some chain implementations forgo storing the discrete fingerprint for a version, instead capturing only a fingerprint that includes the fingerprint from the prior version.
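The chaining steps above can be sketched in a few lines of Python. This is only an illustration, not any particular product's implementation: the function names fingerprint and chain_fingerprints are made up for this example, and it assumes that each chain fingerprint is the hash of the previous chain fingerprint concatenated with the new version's own fingerprint.

```python
import hashlib


def fingerprint(data: bytes) -> bytes:
    """Return the 32-byte SHA-256 fingerprint of an arbitrary byte sequence."""
    return hashlib.sha256(data).digest()


def chain_fingerprints(versions: list[bytes]) -> list[bytes]:
    """Return one chain fingerprint per version, in creation order.

    Assumption for this sketch: the first entry is just the fingerprint of
    version 1; every later entry is
    hash(previous chain fingerprint + fingerprint of this version).
    """
    chain: list[bytes] = []
    for version in versions:
        fp = fingerprint(version)
        if chain:
            fp = fingerprint(chain[-1] + fp)
        chain.append(fp)
    return chain


versions = [b"loan v1", b"loan v2", b"loan v3"]
chain = chain_fingerprints(versions)

# Changing a single byte of version 1 changes every chain fingerprint after it.
tampered = chain_fingerprints([b"loan v1!", b"loan v2", b"loan v3"])
```

Comparing chain and tampered entry by entry shows that an edit to the earliest version ripples through to the latest chain fingerprint, which is the property the rest of this section relies on.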

As a result, given a list of versions of a transaction, it is possible for anyone to "walk" the list and recalculate all the fingerprints and ensure that the recalculated data matches whatever was originally stored. Not a single byte of any of the versions can change nor can the order of the list. No secret keys are required; in fact, no keys are required at all and the process is patently transparent. It almost does not matter if a thing in fact has a "version number" as part of its data payload. It is the creation order and fingerprint chaining that is the ultimate guarantor of integrity and transaction activity over time.
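Here is a rough sketch of that "walk." The record layout (a dict with payload and chain_fingerprint keys) and the function names are invented for illustration, and the sketch assumes the first version is chained onto an empty byte string. Note that no keys of any kind appear anywhere.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """32-byte SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()


def build_chain(versions):
    """Store each version together with its chain fingerprint."""
    entries, previous = [], b""
    for payload in versions:
        previous = sha256(previous + sha256(payload))
        entries.append({"payload": payload, "chain_fingerprint": previous})
    return entries


def first_bad_entry(entries):
    """Walk the list, recompute every fingerprint, and return the index of
    the first entry that does not match what was stored (None if all match)."""
    previous = b""
    for index, entry in enumerate(entries):
        expected = sha256(previous + sha256(entry["payload"]))
        if expected != entry["chain_fingerprint"]:
            return index
        previous = expected
    return None


entries = build_chain([b"v1", b"v2", b"v3"])
intact = first_bad_entry(entries)          # nothing has been touched yet

entries[1]["payload"] = b"v2 (edited)"     # change one version after the fact
broken_at = first_bad_entry(entries)       # the walk stops at the edited entry
```

Reordering entries is caught the same way, because each stored fingerprint only matches when it is recomputed in the original creation order.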

As valuable and important as chain immutability is, there are several important points that should be made here:

What is Mining?

Recall we said above that any "thing" can be a versioned entry on a chain. Mining is the process by which new versions of blocks of transactions -- these blocks being the "things" -- are committed to a chain that is of the distributed, multicopy form. The process is designed to be deliberately difficult to perform and, unfortunately, requires a lot of time and electricity, and in ever increasing amounts. So much so that it is not practical to try to commit individual transactions on the blockchain; instead, many of them (anywhere from 1000 to perhaps 2500) are bundled into a block and that becomes the unit of work for mining. Mining is an essential part of validating the integrity of the transactions on a multiple copy distributed chain and is also important to prevent double spending, which is too complex to cover here. However, the basic process of transaction accumulation and the time it takes to mine a block are very important factors in understanding the dynamics of data in a blockchain. Long story short: do not assume the blockchain is a high performance, queryable database like MySQL or Oracle or DB2 or MongoDB.

The basic idea in mining a new block is to get the fingerprint to look like a special target sequence with a certain number of leading zeros. For example, instead of the fingerprint looking like this:

it needs to look like this:
As described above, it is infeasible to "back into" this value of fingerprint. Instead, you must try to create it over and over again using a nonce which is an extra "ingredient" in the hash. The pseudocode actually is pretty straightforward:
    fingerprint = null;
    while(fingerprint does not contain the required number of leading zeros) {
        nonce = 4 bytes of random material (32 bits, or about 4 billion possibilities);
        fingerprint = hash(block of transaction data + nonce);
    }
When the right nonce is discovered, the block of transactions is considered mined and the new block along with the nonce that created it is published to the distributed network. Miners are rewarded for their effort and expense by receiving bitcoins.
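For the curious, here is a toy version of that loop in Python. It is a sketch under several simplifying assumptions: the function name mine and the sample block bytes are made up, the nonce is tried sequentially rather than randomly so the result is repeatable, and the difficulty is expressed as a number of leading zero hex digits. Real Bitcoin mining hashes a structured block header and compares the result against a numeric target, so treat this purely as an illustration of the search.

```python
import hashlib


def mine(block_data: bytes, leading_zero_hex_digits: int) -> tuple[int, str]:
    """Search the 4-byte nonce space for a fingerprint with enough leading zeros.

    Returns the winning nonce and the hex fingerprint it produces.
    """
    target_prefix = "0" * leading_zero_hex_digits
    for nonce in range(2**32):  # 4 bytes = about 4 billion possibilities
        candidate = hashlib.sha256(
            block_data + nonce.to_bytes(4, "big")
        ).hexdigest()
        if candidate.startswith(target_prefix):
            return nonce, candidate
    raise RuntimeError("no nonce in the 4-byte range satisfies the target")


# Three leading zero hex digits takes a few thousand attempts on average.
nonce, digest = mine(b"example block of transactions", 3)
```

Each extra leading zero hex digit multiplies the expected number of attempts by 16, which is why the difficulty can be tuned so precisely. Checking a claimed nonce, by contrast, takes a single hash.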

What's On a Chain?

As we mentioned above, anything can be on a chain. The chain immutability and integrity machinery is agnostic to the payload. So what is a good use case for this kind of capability?

Smart Contracts

Again, the truth is implementations capturing the essence of smart contracts have existed for over two decades in the form of domain specific languages or embedding interpreted languages inside more compile-time oriented languages.
In the mid 1990s, we stored smallish Perl programs in a database as a BLOB. These programs exploited the compactness and "quickness" of Perl to perform if/then/else logic and array and hashmap manipulation without getting buried in the rigid and verbose syntax of C++. Every night, a C++ program linked with the Perl interpreter would iteratively fetch these programs based on various criteria, determine each program's data needs, make market and other information available to it, let it run its Perl logic (which could also make use of the parent C++ program's high performance functions and, indeed, the distributed computing environment), and then save results back to the database.
Sound familiar? It is also important to note that even today most smart contract implementations have some sort of a runtime context around them. In other words, the contract software as a unit of release just "sits there"; something has to run it and bring it to life. In the example above, the parent C++ program was the execution engine that took care of this. Today, smart contracts require a similar engine that is live and sitting on top of the blockchain. The code for the smart contract is part of the data payload managed on the blockchain and enjoys the same benefits of immutability as regular "simple" data like fields of numbers and text.
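A tiny Python sketch makes the "code as payload plus an engine" idea concrete. Everything here is hypothetical: the contract text, the settle function it defines, and the run_contract engine are invented for illustration, and a real engine would sandbox the code rather than call exec on it directly.

```python
import hashlib

# Hypothetical contract source, stored as payload bytes just like any other data.
CONTRACT_SOURCE = b"""
def settle(state):
    # Release escrow only once the inspection has passed.
    if state["inspection_passed"]:
        state["escrow_released"] = True
    return state
"""

# What would sit on the chain: the payload plus its fingerprint.
stored = {
    "payload": CONTRACT_SOURCE,
    "fingerprint": hashlib.sha256(CONTRACT_SOURCE).hexdigest(),
}


def run_contract(entry: dict, state: dict) -> dict:
    """Toy execution engine: verify the stored code, then run its settle()."""
    if hashlib.sha256(entry["payload"]).hexdigest() != entry["fingerprint"]:
        raise ValueError("contract code does not match its fingerprint")
    namespace: dict = {}
    exec(entry["payload"].decode("utf-8"), namespace)  # NOT sandboxed; sketch only
    return namespace["settle"](dict(state))


result = run_contract(
    stored, {"inspection_passed": True, "escrow_released": False}
)
```

The contract itself does nothing until run_contract brings it to life, and the engine refuses to run code whose bytes no longer match the stored fingerprint.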

Note that you actually don't need a blockchain to make a smart contract run, but the versioning, chain immutability, and signature integration capabilities of a blockchain stack tremendously improve the robustness and integrity of the actions autoexecuted by the smart contract.

To be fair, smart contracts have a little more work to do than our 1995 version:

This article has great additional information on oracles and some of the challenges facing "good" smart contract construction.

Consensus, Responsibility, Risk, and Incentivization

There is a lot of material already published on consensus, and although it is generally accepted that several fundamentally different types of consensus models exist (e.g. practical Byzantine fault tolerance (PBFT) and proof-of-work (PoW)), a more important set of considerations is responsibility and incentivization.

In the Bitcoin system, consensus is achieved through proof-of-work by a statistically important large number of participants performing extremely objective and clear (but time/cost expensive) operations for which there is specific incentivization. Perfect.
But the concept of consensus gets murkier when the problem is not statistically based and involves data much more complex than a bitcoin value. For example, consider a real estate processing blockchain involving 6 parties: the buyer, the seller, the broker, the buyer's bank, the housing inspector, and an escrow bank. There is no bitcoin-style consensus here. There are not 1000s of miners performing the same task in parallel, each trying to win the next block. Instead, there is only one of each type of participant, each with a different set of responsibilities and incentivizations. This gives rise to the following:

Moschetti's Conjecture

No participant will provide input to consensus or voting regarding authenticity and/or accuracy on data or process for which they are not:
  1. Incentivized (typically, and almost always, through monetary compensation)
  2. Protected by legal precedent based on an in-economy set of risk mitigation procedures.
This does not defeat the usefulness of the blockchain, of course, but developers of solutions must be careful when using the term "consensus." Consensus is only appropriate when 2 or more participants work in parallel and a mathematical model is employed to determine if conditions are sufficient for workflow to move forward. It is worth noting that consensus does not have to be a PhD-complex algorithm or one that demands large numbers of participants -- and in fact, many consensus models in longer cycle business-transaction workflows look very much like standard workflow approval, e.g. if a simple majority of participants at stage n say all is well, proceed to stage n+1. Or even simpler (and very common): when all participants say all is well, move to stage n+1. As such, in most business workflows, it will be necessary to clearly define the fields for which a participant has "vouching/review" responsibility. This is the next step beyond basic read/write entitlements.
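To show how modest such a consensus model can be, here is a minimal sketch of the two approval rules just described. The function name may_advance, the rule names, and the participant labels are assumptions made for this example only.

```python
def may_advance(approvals: dict[str, bool], rule: str = "unanimous") -> bool:
    """Decide whether a workflow can move from stage n to stage n+1.

    approvals maps each participant at the current stage to True ("all is
    well") or False. An empty stage never advances.
    """
    votes = list(approvals.values())
    if not votes:
        return False
    yes = sum(votes)
    if rule == "unanimous":
        return yes == len(votes)
    if rule == "majority":
        return yes * 2 > len(votes)  # strictly more than half
    raise ValueError(f"unknown rule: {rule}")


stage = {"buyer": True, "seller": True, "inspector": False}
by_majority = may_advance(stage, "majority")    # 2 of 3 approve
by_unanimity = may_advance(stage, "unanimous")  # the inspector has not approved
```

Note that the function only counts votes. Deciding which fields each participant is actually vouching for is a separate question that the workflow design has to answer.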

An exciting opportunity exists to hybridize single-actor workflow together with consensus via crowdsourced incentivized participation. Relatively simple but somewhat more subjective steps in a workflow could be tackled by dozens or more participants, making their responses (mean and standard deviation) more statistically relevant.


Site copyright © 2014-2018 Buzz Moschetti. All rights reserved