Open source Microsoft Graph Engine takes on Neo4j

The project allows graph data to be kept in a distributed, in-memory key-value store and crunched at scale

Sometimes the relationships between the data you've gathered are more important than the data itself. (See: Facebook monetizing your list of friends.) That's when a graph processing system comes in handy. It's an important but often poorly understood method for exploring how items in a data set are interrelated.

Microsoft's been exploring this area since at least 2013, when it published a paper describing the Trinity project, a cloud-based, in-memory graph engine. The fruits of the effort, known as the Microsoft Graph Engine, are now available as an MIT-licensed open source project, offering an alternative to the likes of Neo4j or the Linux Foundation's recently announced JanusGraph.

Everything is connected

Microsoft describes Graph Engine (GE) as "both a RAM store and a computation engine." Data can be inserted into GE and retrieved at high speed since it's kept in-memory and only written back to disk as needed. It can work as a simple key-value store like Memcached, but Redis may be the better comparison, since GE stores data in strongly typed schemas (string, integer, and so on).
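To make the "strongly typed key-value store" idea concrete, here is a minimal single-process sketch in Python. It is not GE's API (GE is programmed in C#), and every name in it (TypedMemoryStore, save, load, neighbors, the "friends" field) is hypothetical. It only illustrates how typed records keyed by ID can double as graph nodes when one field holds the keys of neighboring records.

```python
class TypedMemoryStore:
    """Illustrative in-memory store: each record must match a declared schema.

    Hypothetical sketch only; this is not how Graph Engine is implemented.
    """

    def __init__(self, schema):
        # schema maps field name -> required Python type, e.g. {"name": str}
        self.schema = dict(schema)
        self.cells = {}

    def save(self, cell_id, **values):
        # Reject records whose fields or field types differ from the schema.
        if set(values) != set(self.schema):
            raise ValueError("fields do not match schema")
        for name, expected in self.schema.items():
            if not isinstance(values[name], expected):
                raise TypeError(f"{name} must be {expected.__name__}")
        self.cells[cell_id] = values

    def load(self, cell_id):
        return self.cells[cell_id]

    def neighbors(self, cell_id, edge_field="friends"):
        # An edge is just a stored key; following it is another lookup.
        return [self.cells[n] for n in self.cells[cell_id][edge_field]
                if n in self.cells]


store = TypedMemoryStore({"name": str, "age": int, "friends": list})
store.save(1, name="Ada", age=36, friends=[2])
store.save(2, name="Grace", age=45, friends=[1])
friend_names = [cell["name"] for cell in store.neighbors(1)]
```

In this toy version, walking a relationship is nothing more than a second key lookup, which is why keeping everything in memory matters for graph workloads.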

The "computation engine" part of the equation means GE implements distributed algorithms across nodes, written in C#. It's not optimized out of the box for a specific kind of graph algorithm, so it'll likely appeal to those who want to write their own graph-exploration algorithms from the ground up -- or simply write their own distributed algorithms.

"Instead of trying to provide an exhaustive set of built-in computation modules," states Microsoft's documentation, "GE tries to provide generic building blocks to allow us to easily build such modules." Those blocks include a system for synchronous and asynchronous message passing, as well as the LIKQ graph query language that's already used by the Academic Graph Search API in Microsoft Cognitive Services.
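As a rough illustration of what building a graph algorithm out of message passing can look like, the Python sketch below simulates a breadth-first search across several "servers" in a single process. It is an assumption-laden toy, not GE code: the function names, the modulo partitioning rule, and the round-by-round (synchronous) message exchange are all choices made for this example.

```python
from collections import defaultdict


def partition_of(node_id, num_servers):
    # Hypothetical placement rule: a node lives on server (id mod N).
    return node_id % num_servers


def distributed_bfs(edges, source, num_servers=2):
    """Hop counts from source, computed by simulated servers exchanging messages."""
    # Each simulated server stores outgoing edges only for the nodes it owns.
    adjacency = [defaultdict(list) for _ in range(num_servers)]
    for src, dst in edges:
        adjacency[partition_of(src, num_servers)][src].append(dst)

    # Each server also keeps its own table of discovered distances.
    distances = [dict() for _ in range(num_servers)]
    inboxes = [[] for _ in range(num_servers)]
    inboxes[partition_of(source, num_servers)].append((source, 0))

    # One loop iteration is one synchronous round of message delivery.
    while any(inboxes):
        outboxes = [[] for _ in range(num_servers)]
        for server_id, inbox in enumerate(inboxes):
            for node, hops in inbox:
                if node in distances[server_id]:
                    continue  # already reached in this or an earlier round
                distances[server_id][node] = hops
                for neighbor in adjacency[server_id][node]:
                    target = partition_of(neighbor, num_servers)
                    outboxes[target].append((neighbor, hops + 1))
        inboxes = outboxes

    # Merge the per-server tables into a single result for the caller.
    merged = {}
    for table in distances:
        merged.update(table)
    return merged


edges = [(1, 2), (2, 3), (1, 4), (4, 3), (3, 5)]
hops_from_1 = distributed_bfs(edges, source=1)
```

No server ever reads another server's edges here; it only sends a message to whichever server owns the next node. An asynchronous variant would drop the round structure and let servers process messages as they arrive.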

Different ways through the maze

How does all this shape up against the leading open source graph database, Neo4j? For one, Neo4j has been in the market longer and has an existing user base. It's also available in both an open source community edition and a commercial product, whereas GE is only an open source project right now.

That said, only the commercial, enterprise-oriented edition of Neo4j supports sharding and replication. GE, by contrast, is clustered in its default open source incarnation, although clustering on both Neo4j and GE requires manual setup. In GE's case, the roles for each node in the cluster (servers and, optionally, query-aggregating proxies) need to be configured manually depending on the use case.

Another distributed graph database worth comparing to GE is JanusGraph, a new project under the sponsorship of the Linux Foundation with contributions by Google, Hortonworks, and IBM. It's been built to work closely with and leverage the Hadoop ecosystem. Elasticsearch and Lucene can be used as indexing engines, and Cassandra and HBase can be used as data stores. With GE, by contrast, data has to be imported into the engine first.

What Microsoft appears to be aiming for with GE isn't head-on competition with those projects. Instead, GE is a piece of distributed data-storage infrastructure that receives new data and provides graph computation as one of its multiple benefits. Its liberal licensing also makes it easily refittable into other products or readily repurposed for hosting at scale. It isn't clear if Microsoft has used GE as part of any of its own systems (although it has used LIKQ, as noted above).

If those building on non-Microsoft platforms are interested in trying out Graph Engine, cross-platform support for Linux/BSD is coming shortly, according to one of the developers.

Serdar Yegulalp is a senior writer at InfoWorld, focused on machine learning, containerization, devops, the Python ecosystem, and periodic reviews.