Successful Adoption of a Graph Database

Ram Tyagi
6 min read · Apr 26, 2021

Before I dive into the article, I would like to mention that, at a high level, the world of databases is divided into two types:

  • Traditional relational databases, which hold structured, table-like data
  • NoSQL or non-relational databases, a catch-all category for anything that is not relational. In this category we have key-value databases, document databases and, of course, graph databases.

A graph database, in particular, stores a collection of vertices (a vertex represents a data entity and its properties) and edges (an edge represents a relationship and that relationship's properties; in other words, relationships between vertices are expressed through edges).
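To make the vertex/edge terminology concrete, here is a minimal sketch in Python. This is an in-memory toy, not any particular database's API; the names and data are illustrative only:

```python
# A toy property graph: vertices carry properties, and edges connect
# vertices while carrying a label plus optional properties of their own.
vertices = {
    "p1": {"label": "product", "name": "Laptop"},
    "c1": {"label": "category", "name": "Electronics"},
}

edges = [
    # (from_vertex, edge_label, to_vertex, edge_properties)
    ("p1", "belongs_to", "c1", {"since": 2021}),
]

# The relationship "Laptop belongs_to Electronics" lives entirely in the
# edge; neither vertex needs a foreign-key column to express it.
relationships = [
    f"{vertices[src]['name']} -{label}-> {vertices[dst]['name']}"
    for src, label, dst, props in edges
]
print(relationships)
```

The key point is that the edge is a first-class record: it can be created, inspected and deleted independently of the two vertices it connects.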

Journey Before

Recently, I led an initiative to assess the current state and create a multiyear data & tech strategy roadmap for a client. We studied the current state of their data landscape and technology architecture and discovered a massive opportunity to overhaul engineering tools and processes. We were getting ready for an interesting journey of Digital Business Transformation (DBT). Some of the patterns we discovered from our scan:

  1. Heavy manual intervention in data preparation
  2. Multiple data sources, with mostly manual data loads
  3. Complex data relationships maintained manually
  4. Highly manual processes to handle fitment and roll-over
  5. Nested and complex SQL procedures and functions, and an absence of APIs
  6. Absence of master data and a canonical data model
  7. Fragmented data storage
  8. Key-person dependencies for specialized data transformation and acquisition work… and so on

After a marathon of interviews and a series of collaborative Mural workshops, we decided to build a foundational data intelligence layer as part of the MVP offering, one that could serve data needs in a smart and intelligent way.

In order to design a next-generation, scalable and sustainable data architecture, we adopted Azure as the platform for the data integration work. We also wanted to bring intelligence and agility into the data while building the dynamic, complex relationships among data entities. We brought in NLP and machine-learning models to detect data loopholes and to predict recommendations and fitments.

Use Cases

We had all the pieces of the architecture in place to handle raw data acquisition, transformation, enrichment, cleansing, intelligence and so forth. We also had a clear view of the APIs and of how our architecture would look on Azure. We adopted Azure Data Lake for raw and refined data. However, one important piece was missing from our puzzle board: where to store the transformed data? We researched and experimented with multiple options to address the following use cases:

  1. Complex nested relationships and massive data hierarchies with multiple hops
  2. The need to tag products dynamically and as frequently as possible, and to enrich data with knowledge
  3. Agile data construction: we didn't want to commit to a final data structure at the beginning, but rather to evolve the data foundation as the project progressed
  4. Speedy resolution of complex relationships and queries
  5. Handling dynamically and frequently changing relationships

Imagine relationship formation like below:

Image from https://neo4j.com/blog/neo4j-rdf-graph-database-reasoning-engine/
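Use case 1 above (nested relationships resolved across multiple hops) boils down to graph traversal. Here is a rough in-memory sketch in plain Python, using a hypothetical product hierarchy rather than a real graph engine:

```python
from collections import deque

# Adjacency list: vertex -> outgoing neighbours (a toy product hierarchy).
graph = {
    "sku-1": ["family-a"],
    "family-a": ["line-x"],
    "line-x": ["division-1"],
    "division-1": [],
}

def reachable_within(graph, start, max_hops):
    """Return every vertex reachable from `start` in at most `max_hops` edges,
    in breadth-first (hop-by-hop) order."""
    seen, frontier = {start}, deque([(start, 0)])
    result = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                result.append(nxt)
                frontier.append((nxt, depth + 1))
    return result

print(reachable_within(graph, "sku-1", 2))  # two hops up the hierarchy
```

A graph database does essentially this kind of hop-limited expansion natively and index-assisted, which is why multi-hop queries that would need several self-joins in SQL stay simple.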

Selection Day

Eventually, we decided to use a graph database, as it met all our use cases, particularly use cases 1 and 2: handling complex nested relationships with multiple hops. I loved that a graph database gives the user control to define relationships on the fly, change them as needed, and point them in any direction. In fact, relationships are driven by code at data-creation time, based on business logic. Organizations no longer need to worry about key-person dependencies on database specialists to write complex joins, queries, stored procedures or functions.
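The "relationships driven by code" idea can be sketched as business logic that decides which edges to create at the moment data arrives. This is an illustrative sketch only; the rule and names such as `link_order` are hypothetical, not our client's actual logic:

```python
# Edges are created programmatically, based on business rules evaluated
# at data-creation time -- no schema migration, no new join table.
edges = []

def link_order(order, customer, edges):
    """Hypothetical business rule: attach an order to a customer, and
    additionally flag high-value orders with a 'priority_for' edge."""
    edges.append((order["id"], "placed_by", customer["id"]))
    if order["amount"] > 1000:
        edges.append((order["id"], "priority_for", customer["id"]))

link_order({"id": "o1", "amount": 1500}, {"id": "cust-9"}, edges)
print(edges)
```

Because the rule lives in application code, a new kind of relationship is just a new edge label; nothing about the existing data has to be restructured.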

We shortlisted a graph database based on our use cases. Cosmos Graph, Azure's fully managed graph database, was selected: our entire architecture was on Azure, so a native, fully managed database was the natural choice. In addition, Azure provides multi-region replication, horizontal auto-scaling, automatic provisioning and failover.

“As the graph database scale grows, the data will be automatically distributed using graph partitioning” — Microsoft

A couple of other features that caught my attention were its compatibility with Apache TinkerPop and its automatic indexing: by default, Azure Cosmos DB automatically indexes all the properties within vertices and edges.
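To see what automatic property indexing buys you, here is a toy sketch: an index mapping every property value to the vertices that carry it, built with no per-property configuration. This is for intuition only; Cosmos DB's real indexing engine is far more sophisticated:

```python
from collections import defaultdict

vertices = {
    "v1": {"label": "product", "color": "red"},
    "v2": {"label": "product", "color": "blue"},
    "v3": {"label": "category", "color": "red"},
}

# Index every (property, value) pair -> set of vertex ids, the way an
# automatic indexer would: every property, no opt-in required.
index = defaultdict(set)
for vid, props in vertices.items():
    for key, value in props.items():
        index[(key, value)].add(vid)

print(sorted(index[("color", "red")]))  # direct lookup instead of a full scan
```

The practical consequence is that a filter like "all red vertices" becomes an index lookup rather than a scan of every vertex, without you having declared an index on `color` up front.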

Adoption

It wasn't a smooth journey, partly because the technology was new to us. Not many people have experience with graphs, and the implementation requires totally new thinking: we needed to take our RDBMS lens off and look from a different angle. The good news was that our team was truly agile, so the mindset was there to think in smaller scopes and from a business point of view.

Some key takeaways from our adoption:

Edges and Vertices (image from http://www.mathcs.emory.edu)
  1. Team members should start thinking about data as small business entities
  2. Don’t try to design all edges and vertices in a waterfall model. Design based on what you know at that point; a graph gives you the flexibility to add vertices and vertex attributes as you learn about them, so you don't need to pin down every attribute at the beginning. Evolve your model as your understanding grows, and the same goes for edges: create them as you learn.
  3. Practice Gremlin queries. Gremlin is a query language that provides a set of traversal steps to explore and query a graph database.
  4. Become familiar with the Cosmos DB Data Explorer to learn how to form traversal queries, and with paths to inspect how a traversal was resolved.
  5. Be aware of read and write throughput, as that is something you will need to tweak to balance your budget constraints against auto-scaling.
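To give a flavour of point 3, here is a Gremlin traversal alongside a rough pure-Python emulation of what it does. The emulation is a toy for building intuition about traversal steps, not how a Gremlin engine actually executes:

```python
# Gremlin:  g.V().hasLabel('product').out('belongs_to').values('name')
# i.e. start from all vertices, keep the products, follow outgoing
# 'belongs_to' edges, and return the names of the vertices reached.

vertices = {
    "p1": {"label": "product", "name": "Laptop"},
    "p2": {"label": "product", "name": "Phone"},
    "c1": {"label": "category", "name": "Electronics"},
}
edges = [("p1", "belongs_to", "c1"), ("p2", "belongs_to", "c1")]

def out(vertex_ids, edge_label):
    """Follow outgoing edges with the given label (like Gremlin's out() step),
    yielding one result per edge traversed."""
    return [dst for src, lbl, dst in edges if src in vertex_ids and lbl == edge_label]

products = [v for v, p in vertices.items() if p["label"] == "product"]
names = [vertices[v]["name"] for v in out(products, "belongs_to")]
print(names)
```

Note that the result contains "Electronics" twice: like Gremlin, the traversal produces one result per path taken, and both products reach the same category.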

Challenges

Every cutting-edge technology adoption comes with its own challenges and set of issues. Cosmos Graph is no different: while it provides several benefits, it also comes with nuisances.

  1. Performance issues with immature queries, where request units (RUs) constantly run over the threshold. It took a lot of practice and time to master Gremlin queries; initially, many of our queries caused performance problems, and I found an interesting article on the same issue. It takes time to reach the maturity to write optimal queries with the fewest hops.
  2. Limited outside support, due to the technology's relatively low adoption and novelty
  3. Maintenance of edges can be troublesome if they are not designed properly
  4. Limited options for restoring the database
  5. Limited options and tools for CRUD operations (whereas you will find numerous options for relational and NoSQL databases, I have seen very few UI tools for operating on a graph database)

Recommended Documentation

Microsoft Documentation

Quick start with Gremlin

Performance and Gremlin

Apache TinkerPop

Conclusion

While there are multiple options available, I feel that a graph database is the best choice when you have to store complex multi-hop data sets such as IoT data, geospatial data, recommendation engines, and data related to social networks or Customer 360. There are various options to choose from, such as Azure Cosmos DB, Neo4j, Amazon Neptune, ArangoDB, etc. Our selection was based on our use cases and our Azure-based data intelligence architecture. We faced challenges in adoption due to the limited support and knowledge pool, but with an agile mindset and a learning attitude, we were able to successfully leverage the best features of the graph database for the success of our platform. Graph databases have great potential to grow from here, as long as the user community collaborates and contributes to the knowledge pool.
