Blockchain Analytics with Cayley DB

Posted in Code, Data Analytics, Inside TFG

Bitcoin (and consequently the Blockchain) has been making waves in the media over the past few years.

In this blog post I will be covering the process of building relationships between blocks, transactions and addresses using Google’s Cayley DB. With this information I may be able to pinpoint important transfers of value and also build an ownership database on top of it to track high value individuals.

I’m going to call this project Bayley, an amalgamation of “Bitcoin” and “Cayley”. I never was that creative.

The Blockchain

So what are the advantages of the Blockchain? What’s so special about it? For starters, for the first time in history we have a database that:

  • Can’t easily be rewritten
  • Eliminates the need for trust
  • Resists censorship
  • Is widely distributed.

A lot of people like to call it a distributed ledger. This is true if you use Bitcoin as a currency. The Blockchain however, has the capability for much more. As a new and possibly disruptive technology I figured it would be a good idea to learn more about it. In the process we also might glean enough of its processes for building unique services on top of the Blockchain.

The Database

I originally tried building this project on MongoDB, but ended up shelving that idea as MongoDB is not suitable for the task: the schema is consistent across blocks, and I need to be able to easily find relationships between datapoints, both of which point towards a graph database.

I had a look at LevelGraph and Neo4j but in the end decided to go with Cayley. Cayley has been explored previously by The Frontier Group, and since it's a very new technology I wanted to learn how to use it.

Setup Considerations

The first step will be to synchronise a copy of the blockchain locally for your use. I used the testnet instead of mainnet for testing purposes.

Originally I used BTCD as I wanted a server-side, headless daemon. Bitcoin Core can do this, but not on OS X. I constantly ran into bugs and inconsistencies such as:

  • RPC setup using different variable names, making existing libraries that hook into Bitcoin Core useless
  • JSON batch requests not being supported

In the end I just opted to run an instance (with GUI and all) of Bitcoin Core on my machine. Get it here!

Before starting to synchronise the Blockchain it might be useful to note that, to conserve disk space, a full transaction index is not built by default. Transaction indexing can be turned on with the command-line switch -txindex or by adding the line txindex=1 to your bitcoin.conf.

RPC needs to be enabled. Using RPC calls to the Bitcoin daemon will allow you to pull out the block data.

Spinup instructions

Overview of Process

From a high level, the process will look like this:

  • Get block hashes from height
    • Get blocks from block hashes
  • Send an HTTP POST request to Cayley DB of the above data

This does not take transaction data into account; that will be a topic for a future blog post. So let's get started!

Setting up Bitcoin Core

The Bitcoin Core standard conf file has a lot of stuff in there, but in general you'll need to make sure it contains the following lines:

txindex=1
testnet=1
server=1
rpcuser=bitcoinrpc
rpcpassword=BHaVEDoMkVr1xKudcLpVbGi2ctNJsseYrsuDufZxwEXb
rpcport=8332

The rpcpassword is autogenerated by Bitcoin Core. You can use an environment variable if you're concerned about security and such. Since this project is just for testing purposes and the password is randomised, I'm not too bothered that it's sitting there in plaintext.
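If you do want to keep the password out of the config file, one option is to read it from the environment at run time. A minimal sketch (the variable name BITCOIN_RPC_PASSWORD and the helper rpc_url_from_env are my own inventions, not anything Bitcoin Core looks for):

```python
import os

def rpc_url_from_env(user="bitcoinrpc", host="127.0.0.1", port=8332):
    # Read the RPC password from the environment rather than plaintext config
    password = os.environ.get("BITCOIN_RPC_PASSWORD")
    if password is None:
        raise RuntimeError("Set BITCOIN_RPC_PASSWORD before running")
    # A service URL in this form can be handed to python-bitcoinlib's RawProxy
    return "http://%s:%s@%s:%d" % (user, password, host, port)
```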

Block Extraction

We’ll be using Peter Todd’s python-bitcoinlib library. The pprint library is also used for printing to the console, for quick and dirty debugging purposes. Install these using PyCharm, then add to the top of your bayley.py file:

import bitcoin
import bitcoin.rpc
from pprint import pprint as pp

The next step will be to write some simple code to extract some blocks.

def main():
    # This will create a batch of commands that requests the first 100 blocks of the blockchain
    commands = [{"method": "getblockhash", "params": [height]} for height in range(0, 100)]
    # Connect to the RPC server, send the commands and assign to the results variable
    conn = bitcoin.rpc.RawProxy()
    results = conn._batch(commands)
    # Extract the hashes out of the result
    block_hashes = [res['result'] for res in results]
    # Prepare to extract specific block data
    blocks = []
    for block_hash in block_hashes:
        blocks.append(conn.getblock(block_hash))
    # Call the function to make the triples to prepare for importing to CayleyDB
    block_triples = make_triples_for_block(blocks)
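One thing worth noting: the loop above makes a separate RPC round trip for every block hash. Since we're already using RawProxy's _batch for getblockhash, the getblock calls can be batched the same way. A sketch (get_blocks_batched is a name I'm introducing, not part of the library):

```python
def get_blocks_batched(conn, block_hashes):
    # Build one batch of getblock commands rather than one call per hash
    commands = [{"method": "getblock", "params": [h]} for h in block_hashes]
    # A single round trip to the RPC server returns every block's data
    results = conn._batch(commands)
    return [res["result"] for res in results]
```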

Block Structure

Here is an example of a single block’s data:

{'bits': '1d00ffff',
 'chainwork': '0000000000000000000000000000000000000000000000041720ccb238ec2d24',
 'confirmations': 1,
 'difficulty': Decimal('1.00000000'),
 'hash': '0000000084ee00066214772c973896dcb65946d390f64e5d14a1d38dfa2e4d90',
 'height': 445610,
 'merkleroot': 'eaf042fa845ea92aba661632bc6b8e78e8e64c2917a92f1a7da0800ed793b819',
 'nonce': 1413010373,
 'previousblockhash': '0000000087a272f48c3785de422e232c0771e2120c8fdd741a19ea98d122132b',
 'size': 315,
 'time': 1432705094,
 'tx': ['eaf042fa845ea92aba661632bc6b8e78e8e64c2917a92f1a7da0800ed793b819'],
 'version': 3}

With this in mind we can begin working on pulling the data from the blockchain and parsing the specific blocks.

Making Triples

Cayley uses the subject, predicate, object model, known as a triplestore. We need to parse the block data from the previous section into this format.

One of the limitations of a triplestore is that you cannot attach much metadata to each node; array indexing and the like are a problem in this regard. In this case we will use the block hash as the subject for all block data, the key name as the predicate, and the value (excluding the block hash) as the object.

Let's create a function that does this.
At the top of my bayley.py file I will create a global variable which specifies the key-value pairs I want to turn into triples.

DESIRED_BLOCK_KEYS = ("height", "nextblockhash", "previousblockhash", "size", "time", "difficulty")

Next I wish to declare the function:

def make_triples_for_block(blocks):
    triples = []

We will next need to iterate through the blocks and their respective keys to start pulling the relevant data. The first thing to do is to ignore the blockhash key:

def make_triples_for_block(blocks):
    triples = []
    for block in blocks:
        for key in block:
            # Ignore self reference
            if (key == "hash"):
                continue

The transactions value is an array, so it's best to iterate through these separately.

def make_triples_for_block(blocks):
    triples = []
    for block in blocks:
        for key in block:
            # Ignore self reference
            if (key == "hash"):
                continue
            # Iterate through transactions
            if (key == "tx"):
                for t in block[key]:
                    triples.append({
                        "subject": block['hash'],
                        "predicate": key,
                        "object": t
                    })

And finally we can append our block data to the triples array we declared at the beginning. Note how I cast the values to strings; this prevents an issue later on when importing into CayleyDB. Cayley is happiest when you give her JSON files that are all strings.

def make_triples_for_block(blocks):
    triples = []
    for block in blocks:
        for key in block:
            # Ignore self reference
            if (key == "hash"):
                continue
            # Iterate through transactions
            if (key == "tx"):
                for t in block[key]:
                    triples.append({
                        "subject": block['hash'],
                        "predicate": key,
                        "object": t
                    })
            # Iterate through first level block data
            if (key in DESIRED_BLOCK_KEYS):
                triples.append({
                    "subject": str(block['hash']),
                    "predicate": key,
                    "object": str(block[key])
                })
    return triples

So now we have a triples variable returned which contains all of our triples ready for importing!
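As a quick sanity check, the function can be run over a hand-built block. The snippet below repeats the function and constant from above so it runs standalone; the hashes are made up purely for illustration:

```python
DESIRED_BLOCK_KEYS = ("height", "nextblockhash", "previousblockhash", "size", "time", "difficulty")

def make_triples_for_block(blocks):
    triples = []
    for block in blocks:
        for key in block:
            # Ignore self reference
            if (key == "hash"):
                continue
            # Iterate through transactions
            if (key == "tx"):
                for t in block[key]:
                    triples.append({
                        "subject": block['hash'],
                        "predicate": key,
                        "object": t
                    })
            # Iterate through first level block data
            if (key in DESIRED_BLOCK_KEYS):
                triples.append({
                    "subject": str(block['hash']),
                    "predicate": key,
                    "object": str(block[key])
                })
    return triples

# A minimal fake block: made-up hashes, one transaction
sample_block = {
    "hash": "00" * 32,
    "height": 1,
    "size": 190,
    "time": 1296688928,
    "difficulty": "1.00000000",
    "previousblockhash": "11" * 32,
    "nextblockhash": "22" * 32,
    "tx": ["aa" * 32],
}

triples = make_triples_for_block([sample_block])
# Expect 6 block-data triples plus 1 transaction triple
print(len(triples))  # → 7
```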

Here is an example of the triples for a single block for your reference:

[{'object': '1',
  'predicate': 'height',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '190',
  'predicate': 'size',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': 'f0315ffc38709d70ad5647e22048358dd3745f3ce3874223c80a7c92fab0c8ba',
  'predicate': 'tx',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '000000006c02c8ea6e4ff69651f7fcde348fb9d557a06e6957b65552002a7820',
  'predicate': 'nextblockhash',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '1.00000000',
  'predicate': 'difficulty',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943',
  'predicate': 'previousblockhash',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'},
 {'object': '1296688928',
  'predicate': 'time',
  'subject': '00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206'}]

Setting up Cayley

Cayley is written in Go. A packaged binary is available from here (so you shouldn't need to set up Go separately).

This is my Cayley config file:

{
"database": "bolt",
"db_path": "./blockchain",
"read_only": false,
"replication_options": {
  "ignore_duplicate": true,
  "ignore_missing": true
}
}

I'm using Bolt over LevelDB because Bolt performs slightly better for read-heavy workloads. You can read more here.

After making the cayley.cfg file, initialise the database by running the init command like so (from the Cayley folder):

./cayley init -config cayley.cfg

This will create a blockchain file and prep the backend database for Cayley goodness. The next step will be to run the HTTP server:

./cayley http -config cayley.cfg

Now we’re ready to send all the data in!

Sending to Cayley

Cayley’s HTTP documentation will help with this section. It receives JSON triples in the form of the following:

[{
    "subject": "Subject Node",
    "predicate": "Predicate Node",
    "object": "Object node",
    "label": "Label node"  // Optional
}]   // More than one quad allowed.

We’ll need to POST this data to our Cayley server’s write API via http://localhost:64210/api/v1/write.

Now we need to make use of the excellent requests Python library. Install it in PyCharm, then add the following to the top of the bayley.py file. Cayley expects JSON, so we'll also import the json module (it ships with Python's standard library, so there's nothing extra to install).

You'll also want to add a global variable for Cayley's URL, and a headers constant that tells Cayley we're sending JSON.

import requests
import json
DB_WRITE_URL = "http://127.0.0.1:64210/api/v1/write"
DB_WRITE_HEADERS = {'Content-type': 'application/json'}

We’re going to create a function to send the data over to Cayley. Note how the data is converted to json in the data= argument.

def send_data(data):
    r = requests.post(DB_WRITE_URL, data=json.dumps(data), headers=DB_WRITE_HEADERS)
    pp(r)
    pp(r.text)

If the pp(r) prints out a response of 200 then we're good! If not, we'll need to look at what went wrong, which is usually explained well in r.text. This is the result I got:

<Response [200]>
'{"result": "Successfully wrote 693 quads."}'
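In a script you'll probably want failures to raise rather than relying on eyeballing the printed response. Here's a variant (send_data_checked is my own name; the injectable post argument is only there so the function can be exercised without a running Cayley server):

```python
import json

DB_WRITE_URL = "http://127.0.0.1:64210/api/v1/write"
DB_WRITE_HEADERS = {'Content-type': 'application/json'}

def send_data_checked(data, post=None):
    # `post` defaults to requests.post; pass a stub in tests
    if post is None:
        import requests  # as used elsewhere in this post
        post = requests.post
    r = post(DB_WRITE_URL, data=json.dumps(data), headers=DB_WRITE_HEADERS)
    r.raise_for_status()  # raises an HTTPError on any 4xx/5xx response
    return r.text
```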

Go back to your main function and call the send_data function:

def main():
    ...
    send_data(block_triples)
    ...

And that should do it.

Graphing the result

By now we should have 100 blocks in Cayley! Head over to http://localhost:64210 and let's start graphing!

In the query page we can test out our queries. I wrote a simple one that loops through the first 5 blocks, gets all objects that are one edge away (Out()), and returns the results:

for(var i=0; i<5; i++){
    g.V().Has("height", String(i)).Tag("source").Out().Tag("target").All();
}
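Once you move past the web UI, the same query can be generated from Python and sent to Cayley's HTTP query API. A small sketch of the string-building half (height_query is a name I'm introducing):

```python
def height_query(n):
    # Recreate the Gremlin loop above as a single query string,
    # one Has("height", ...) traversal per block height
    lines = []
    for i in range(n):
        lines.append(
            'g.V().Has("height", "%d").Tag("source").Out().Tag("target").All();' % i
        )
    return "\n".join(lines)
```

Heights were stored as strings when we built the triples, so the query compares against "0", "1", and so on, just like String(i) does in the Gremlin version.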

Here is the result of the first block:

{
 "result": [
  {
   "id": "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b",
   "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
   "target": "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b"
  },
  {
   "id": "1296688602",
   "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
   "target": "1296688602"
  },
  {
   "id": "0",
   "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
   "target": "0"
  },
  {
   "id": "285",
   "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
   "target": "285"
  },
  {
   "id": "1.00000000",
   "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
   "target": "1.00000000"
  },
  {
   "id": "00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206",
   "source": "000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f4943",
   "target": "00000000b873e79784647a6c82962c70d228557d24a747ea4d1b8bbe878e1206"
  }
 ]
}

The query shape looks like this:

[Image: query shape]

The visualisation itself looks like the following images.

Single block:
[Image: single block]

Five blocks:
[Image: five blocks]

As you can see there are shared nodes; this is because those nodes have the same predicate and object, but a different subject (the block hash). This is a good example of how Cayley helps in visualising relationships.

One hundred blocks:

[Image: one hundred blocks]

The shared nodes here are due to the common block size and block difficulty (the latter only adjusts every 2016 blocks, roughly every two weeks). You can see a close-up below:

[Image: close-up of shared nodes]

Conclusion

This is just the early stage. The next step will be to parse the transactions for Bitcoin addresses and start drawing all the relationships between them. Once a strong system is in place for parsing the blockchain, you might want to begin parsing the other 400,000 blocks or so, and also switch to the mainnet. Web-scraping usernames for addresses, and estimating relationships based on round-number transfers of value, are also in the pipeline.
