Theory of Blockchain

CIS1902 Python Programming

Agenda

  1. Bitcoin Motivation
  2. One-Way Functions
  3. Consensus
  4. Public-Private Key Encryption
  5. Validation
  6. Ethereum
  7. Data Engineering at Scale

Bitcoin

The motivation for Bitcoin came from the paranoia of centralized entities. That is, say one day your bank says you have $0. What can you do? Technically nothing, since you trust your bank to keep your money securely. If they wanted to act maliciously, they could do so.

In the real world, we have regulators and laws, so this isn't as easy, but you can imagine how this concept could apply to something like the Chinese Firewall.

Bitcoin aimed to solve this by decentralizing trust, specifically for a financial system.

Consensus

The idea is that Bitcoin builds consensus on a ledger of financial transactions. That is, simply an ordered list of transfers of Bitcoin. But how is consensus achieved?

Problem Statement: We need a network of computers to agree on some state . We assume that there is some public list of pending transactions submitted by users to be executed on resulting in . How do we ensure that (1) the network eventually sees and (2) the network knows is correct?

One-Way Functions

A one-way function is a function that is "easy" to compute, but is "difficult" to compute. Typically, this means that if one wants to find such that , one just has to try all inputs until they find it.

Some examples used in practice are multiplication/factoring and hash functions.

Consensus

Bitcoin uses Proof-of-Work (PoW) as its consensus mechanism. This is effectively a repeating lottery. Nodes compete to win the right to publish a block of transactions that will be executed. We'll come back to what this block is, but for now, we'll show that we can build consensus on this list of blocks.

Consensus

Each block needs to have a nonce and a pointer to its previous block. The nonce is the "lottery ticket". In Bitcoin, it is the input to a hash function that results in an output with some number of leading zeros, currently 19. Thus, nodes are constantly working to determine the nonce for the next block. Once one is found, it is broadcast to the network so everyone can update their state. This is known as mining.

The pointer to the previous block is the hash of the previous block. This way, it ensures that no one can hoard nonces.

Public-Private Key Encryption

Let's say there are two parties, Alice and Bob. Alice wants to send Bob a message , but wants to make sure that only Bob can read it, even if we assume the in-flight message is public.

How do we achieve this? There are symmetric key encryption schemes where both Alice and Bob have the same secret key (e.g. Caesar cipher), but they would have to meet up in person first.

However, it would be ideal if we could not require them to meet in person.

Public-Private Key Encryption

Public-private key encryption solves this by requiring the receiving party to generate a keypair. One part is to be published publically (public key), and one part is meant to always stay secret.

At a high level, the public key is used for encryption, and the secret key is used for decryption. Additionally, one can use their keys to sign messages.

Public-Private Key Encryption

A public-private key encryption scheme is defined by a pair of functions

where is the message, is the public key, is the ciphertext, and is the secret key. Notably, the following must always be true

Public-Private Key Encryption

Further, Alice can additionally sign messages. These act as an endorsement of sorts, anyone can verify that Alice signed the message.

Specifically, sign and verify are two functions that as defined as follows

Validation

So we know how we can build consensus on a list of blocks, but how do we know for sure that the transactions are valid? That is, if a transaction is Alice sending Bob 1 BTC, how do we know that it is indeed Alice sending it?

Accounts on Bitcoin are effectively keypairs generated for a public-private key scheme. When Alice wants to send Bob some amount of Bitcoin, she signs a message stating who she's sending it to and the amount.

Recap

  1. Miners compute to determine the next nonce
  2. When it is found, the miner broadcasts the next block with the new nonce, hash of the previous block, and transactions to be included in the new block
  3. The network validates the newly broadcasted block by checking the nonce and transactions within the block.

Ethereum

Ethereum follows the same overall structure, but instead of a ledger, the state is now the memory of a virtual machine. This means that Ethereum is essentially a shared computer. Think of it as a PC at the public library. Transactions can now either be transfers of ETH or executions of programs.

Data Engineering

In a system like OpenAI, essentially there are two pipelines: the training of the model and the serving of the model.

For the training, a separate pipeline will exist offline for researchers to test new iterations of the model. This could involve scraping new training data, formatting and parsing training data, or more techincal updates to the underlying algorithm.

Serving the model is a more scalability and distributed systems problem. The model itself is huge, on the order of a terrabyte. How do we answer millions of queries per day?

Data Engineering

The model is typically stored in memory on a huge machine. This allows for quick retrieval of the parameters. Essentially, each query is an iteration of matrix multiplications. Thus, serving the model can be thought of as a natural web scaling problem.

Data Engineering

At a company like Meta or Google, the problem is quite different. These companies generate tons of data everyday on their users. Their data scientists are not training a single large model, but instead need to be able to train many models to answer business questions. Eventually, they may need to serve these models as well.

Typically, these companies will aggregate copies of the data that they need for data science into a single location, sometimes called a datalake. This allows for fast iteration when developing smaller models to answer business questions. If they need to serve the models, they do so in a similar way to LLMs.