Recently I took the time to check out Cayley, a graph database written in Go that’s been getting some good attention.

https://github.com/google/cayley

From the Github:

Cayley is an open-source graph inspired by the graph database behind Freebase and Google’s Knowledge Graph.

Also, to get the project owner’s disclaimer out of the way:

Not a Google project, but created and maintained by a Googler, with permission from and assignment to Google, under the Apache License, version 2.0.

As a personal disclaimer, I’m not a trained mathematician and my interest comes from a love of exploring data. Feel free to correct me if something should be better said.

I’ve seen Neo4j… I know graph DBs

Many people exploring graph databases start with Neo4j. Conceptually Cayley is similar, but in usage terms there is a bit of a gap.

Neo4j has the Cypher query language, which I find very expressive but also more like SQL in how it works. Cayley uses a Gremlin-inspired query language wrapped in JavaScript. The more you use it, the more it feels like writing code-based interactions with chained method calls. The docs for this interface take some rereading, and it was only through experimentation that I started to see how it all worked. They can be accessed via the GitHub docs folder. I worked my way through the test cases for some further ideas.

Another major difference is that Neo4j offers a gentler transition from relational databases. With Neo4j you can group properties on nodes and edges, so that as you pull back nodes it feels a little more like hitting a row in a table. Cayley, however, is a triple / quad store based system, so everything is treated as a node or vertex. You store only single pieces of related data (only strings, in fact), and a collection of properties that would traditionally make up a row or object is built through relationships. This feels extreme at first, as getting one row-like object requires multiple traversals, but over time it changed how I looked at data.

As an example (ignoring the major power of graph databases for starters), we might have the question “What is user 123’s height?”. In Neo4j we can find a person with id 123, pulling back a node with that person’s name and height, and then extract the height value. In Cayley you would find the person’s id node and then move via the height relationship to the value 184. So in the first case we are plucking property data from a returned node; in the second we collect the information we want to return. This is more a conceptual difference than a pro or a con, but it becomes a very clear difference when you start to import data via quad files.

What is an n-quad?

As mentioned, Cayley works on quads / triples, which are a simple line of content describing a start, a relationship and a finish. This can be imagined as two nodes joined by an edge. What those nodes and relationships are can be many things. Some people have schemas or conventions for how things are named. Some people use URLs to link web-based data. There is a standard that can be read about at www.w3.org:

http://www.w3.org/TR/n-quads/

A simple example, following on from the user above, might be:

"/user/123" "named" "john" .
"/user/124" "named" "kelly" .
"/user/124" "follows" "/user/123" .

When is a database many databases?

One of the tricky parts of a graph database is how to store things. Many of the graph DBs out there don’t actually store the data themselves but rather sit on existing database infrastructure and work with the information in memory. Cayley is no different, as you can layer it upon a few different storage backends – LevelDB, Bolt, MongoDB and an in-memory version.

An interesting part of this is the vague promise of scaling. Most graph databases start the conversation with node traversal, performance and syntax, but they almost all end with scaling. I think Cayley is now entering this territory. As it moves from a proof of concept to something that gets used more heavily, it’s acquiring backends that can scale, along with the concept of layering more than one Cayley instance in front of that storage layer.

One thing to keep in mind is that performance is a combination of how the information is stored and how it is accessed, so put a fast graph DB in front of a slow database and you’ll average out a little in speed. For my testing I used the LevelDB store as it is built in and easy to get started with.
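
For reference, getting a LevelDB-backed instance running looked roughly like this for me (the exact flag names are from memory and may vary between Cayley versions, so treat this as a sketch rather than gospel):

cayley init --db=leveldb --dbpath=/tmp/cayley_test
cayley load --db=leveldb --dbpath=/tmp/cayley_test --quads=testdata.nq
cayley http --db=leveldb --dbpath=/tmp/cayley_test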

Show me the graph!

One of the first issues I had with Cayley was not knowing exactly how to get a graph onto the page. Neo4j’s spin-up was a little clearer and its error handling is quite visual. With Cayley you have to get syntax and capitalisation just right for things to play nicely.

Let’s assume you have the following graph:

Node A is connected out to B, C and D. This can be described in an n-quads file as:

"a" "follows" "b" .
"a" "follows" "c" .
"a" "follows" "d" .

If we bring up the web view using a file with that content we can query:

g.V('a').As('source').Out('follows').As('target').All()

Running it as a query should give you some JSON:

{
  "result": [
    {
      "id": "b",
      "source": "a",
      "target": "b"
    },
    {
      "id": "c",
      "source": "a",
      "target": "c"
    },
    {
      "id": "d",
      "source": "a",
      "target": "d"
    }
  ]
}

Swap to the graph view, run it again and you should see a graph. Not all that pretty but it’s a start.

So what’s happening here? Starting at ‘A’ and calling it “source”, we traverse edges named “follows” that go out from A, taking note of the end node and calling it “target”. Be aware that the source / target tags are case sensitive, and if you get them wrong you won’t see anything. When I say “calling”, what I mean is that as the nodes are traversed, the value found is “emitted” under the name provided as the key. This builds up the JSON objects, with each traversal becoming a new object in the returned list.
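
The same tagging works in the other direction too. As a quick sketch against the same data (the output described here is what I’d expect, not a captured response):

g.V('b').As('target').In('follows').As('source').All()

should give a single result with "source": "a" and "target": "b", since the only “follows” edge arriving at B comes from A.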

Doing more

So now we have the basics, and that’s as far as a lot of the examples go. Let’s take things a little further.

I recently read an article, 56 Experts reveal 3 beloved front-end development tools, and in doing so I came across entry after entry of tools and experts. My first reflex was to wonder where the intersections are and which tools are the outliers. So I decided to use this as a data source. I pulled each entry into a spreadsheet and then ran a little script over it to produce the quads file with:

"<person>" "website" "<url>" .
"<person>" "uses" "<tool name>" .

and for each first mention of a tool:

"<tool>" "website" "<url>" .

The result was a 272-line quads file with people, the software they used and the URLs for the software.
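
To give an idea of the conversion, here is a rough sketch in Node.js – the column layout, file names and lower-casing are assumptions for illustration, not the exact script I ran:

// Turn a hypothetical tab-separated export (person, website, comma-separated tools) into n-quads.
var fs = require('fs');

var rows = fs.readFileSync('experts.tsv', 'utf8').trim().split('\n');
var quads = [];

rows.forEach(function (row) {
  var cols = row.split('\t');
  var person = cols[0].trim().toLowerCase();
  var site = cols[1].trim();
  var tools = cols[2].split(',');

  // One quad for the person's website, one per tool they mention.
  quads.push('"' + person + '" "website" "' + site + '" .');
  tools.forEach(function (tool) {
    tool = tool.trim().toLowerCase();
    quads.push('"' + person + '" "uses" "' + tool + '" .');
  });
});

fs.writeFileSync('userreviews.nq', quads.join('\n') + '\n');

The tool website quads can be generated the same way, keeping only the first mention of each tool.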

From there I started Cayley with the usual command:

cayley http --dbpath=userreviews.nq

So what next? We can find a product and see who is using it:

g.Emit(g.V('sublime text').In('uses').ToArray())

Which results in:

{
 "result": [
  [
   "stevan Živadinovic",
   "bradley neuberg",
   "sindre sorus",
   "matthew lein",
   "jeff geerling",
   "nathan smith",
   "adham dannaway",
   "cody lindley",
   "josh emerson",
   "remy sharp",
   "daniel howells",
   "wes bos",
   "christian heilmann",
   "rey bango",
   "joe casabona",
   "jenna gengler",
   "ryan olson",
   "rachel nabors",
   "rembrand le compte"
  ]
 ]
}

Note that I emitted the array of values specifically to avoid a lengthy hash output.
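
For comparison, the plain traversal g.V('sublime text').In('uses').All() would return each person wrapped in its own {"id": ...} object – the lengthy hash output I was avoiding.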

Sure that’s interesting but how about we build a recommendation engine?

Say you are a user who is a fan of SASS and Sublime Text. What other tools do the experts who use these tools like?

// Paths that lead to users of each tool
var a = g.V('sass').In('uses');
var b = g.V('sublime text').In('uses');

// Who uses both tools?
var c = a.Intersect(b).ToArray();

// What tools are used by all of those people?
var software = g.V.apply(this, c).Out('uses').ToArray();

// Convert the array to a hash with counts
var results = {};
_.each(software, function(s){
  if(results[s] == null){ results[s] = 0; }
  results[s] += 1;
});

// Remove the search terms themselves
delete results['sass'];
delete results['sublime text'];

// Emit the results
g.Emit({tools: results, users: c});

Here we are:

  1. finding the people that use sass and sublime text
  2. finding all the tools they use
  3. counting the number of times a tool appears
  4. removing our search tools
  5. emitting the results as the response

This gives us:

{
 "result": [
  {
   "tools": {
    "angularjs": 1,
    "chrome dev tools": 5,
    "jekyll": 1,
    "jquery": 1
   },
   "users": [
    "bradley neuberg",
    "nathan smith",
    "adham dannaway",
    "wes bos",
    "joe casabona",
    "jenna gengler",
    "ryan olson",
    "rachel nabors"
   ]
  }
 ]
}

Note how Cayley is pretty happy for us to move in and out of JavaScript, and that underscore.js is available by default. Handy. Also note that I returned a custom result construction with both the results hash and the users it was derived from.

So this isn’t necessarily the most efficient way of doing things but it’s pretty easy to follow.
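
If you only care about the raw hits and are happy to do the counting client-side, the traversal part could probably be collapsed into a single chain, something like:

g.V('sass').In('uses')
  .Intersect(g.V('sublime text').In('uses'))
  .Out('uses').All()

which pulls back the tools used by people who use both, though you would still need a bit of JavaScript to turn the results into counts.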

I think for many, the fact that Cayley uses a JavaScript based environment will make it quite accessible compared to the other platforms. I hope to keep exploring Cayley in future articles.