Knowledge Graph Feed API

API Family: Knowledge Graph

Graph Feed and Amazon Neptune

How to use Amazon Neptune as an alternative graph database for exploring the Refinitiv Graph Feed.

Table Of Contents

  1. Export the content set from Graph Feed
  2. Set up Neptune
  3. Create an S3 import bucket
  4. Create an endpoint
  5. Launch an EC2 instance in one of the subnets inside the Neptune VPC
  6. SSH into the EC2 instance and initiate the load command
  7. Validate

At the end of November 2017, Randall Hunt, an author and tech evangelist at Amazon, announced Amazon's foray into graph databases with Neptune. Graph databases are a dime a dozen, but Neptune stands out as a fully managed service that can store billions of relationships as Property Graph or RDF data.

Neptune in a Nutshell

In this tutorial on Neptune, we will explore how to export data from the Refinitiv Knowledge Graph Feed and import it into Neptune. The Refinitiv Knowledge Graph Feed is an API that delivers the Refinitiv Knowledge Graph. The raw size of a Graph Feed export calls for a graph analytics application to consume and query it, and we'll see how to use Neptune for that role.

By the end of this tutorial, you'll have exported a content set from Refinitiv Graph Feed, spun up your own Neptune instance, loaded the data into Neptune, and queried it.

1. Export the content set from Graph Feed

Let's go to Data Fusion Graph Feed (DFGF) to download a content set that you would like to import into Neptune. You can do that in a few ways (the Swagger docs, the command line with cURL, or a client application calling the API), but in the end it all boils down to a REST call to the Graph Feed APIs.

Here we’ll be using the docs, so let’s go to the Graph Feed site. First, use your client_id and client_secret to generate a short-lived token. If you do not have a client_id and secret, please reach out to Brian Rohan.
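For reference, the same token request can be made from the command line. This is only a sketch: the auth URL and the response field names are placeholders (a generic OAuth2 client-credentials flow), so substitute the values from the Graph Feed documentation.

```bash
# Hypothetical token request -- $GRAPH_FEED_AUTH_URL and the response field
# names are placeholders; use the exact endpoint from the Graph Feed docs.
# Requires jq for JSON parsing.
TOKEN=$(curl -s -X POST "$GRAPH_FEED_AUTH_URL" \
  -d "grant_type=client_credentials" \
  -d "client_id=$CLIENT_ID" \
  -d "client_secret=$CLIENT_SECRET" | jq -r '.access_token')
```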

Use /contentSet to find the list of content sets that you’re entitled to.
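From the command line, this is a simple GET with the bearer token; the base URL below is a placeholder for the Graph Feed API host.

```bash
# List the content sets you are entitled to (token from the previous step).
curl -s -H "Authorization: Bearer $TOKEN" "$GRAPH_FEED_BASE_URL/contentSet"
```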

Grab the ID of the content set you want to export and supply it to the /contentSet/{id}/download endpoint. The response is a short-lived URL for downloading the content set. Go ahead and start the download while we set up Neptune.
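As a rough cURL equivalent, the flow looks like the sketch below. The base URL is a placeholder, and the response field holding the download link is assumed to be called `url`, so adjust to what the API actually returns.

```bash
# Request a short-lived download URL for a content set, then fetch the export.
CONTENT_SET_ID="<content-set-id>"
DOWNLOAD_URL=$(curl -s -H "Authorization: Bearer $TOKEN" \
  "$GRAPH_FEED_BASE_URL/contentSet/$CONTENT_SET_ID/download" | jq -r '.url')  # field name assumed
curl -o contentset.nq.gz "$DOWNLOAD_URL"
```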

2. Set up Neptune

At the time of this writing, Neptune is only available in the us-east-1 region and requires a preview request, which you can submit here: https://pages.awscloud.com/NeptunePreview.html. Once you have access to Neptune, you can set up the Neptune service in AWS (https://yukon.aws.amazon.com/rds/gdb?region=us-east-1). Notice that this is not the regular AWS console URL. Once you log in to the yukon.aws console, setting up Neptune is a quick two-step process. First, specify the database details.

Second, in Configure Advanced Settings, make your choices for VPC, backups, security, and so on. I went with creating a new VPC and chose the defaults for everything else.

Once you configure everything and hit Submit, AWS will take a few minutes to instantiate the cluster. If everything is successful, you will see the VPC and all of the subnets the Neptune cluster created.

Now we need to do a few things in AWS to set up a pipeline for bulk importing data into the Neptune instance. Although this tutorial focuses on importing data from Graph Feed, the same setup is required for bulk loading from any data source into Neptune.

3. Create an S3 import bucket

Create a new S3 bucket. Make sure you pick the US East (N. Virginia) region. Once you have the bucket set up, upload the file you downloaded from Graph Feed.
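If you prefer the AWS CLI over the console, the same step looks like this; the bucket name is a placeholder and must be globally unique.

```bash
# Create the import bucket in us-east-1 and upload the Graph Feed export.
aws s3 mb s3://my-graph-feed-import --region us-east-1
aws s3 cp contentset.nq.gz s3://my-graph-feed-import/
```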

4. Create an endpoint

Creating an endpoint simplifies access to S3 resources from within a VPC and provides a secure connection to S3 that does not require a gateway or NAT instances. You can find the endpoint configuration under the VPC service. Make sure you choose com.amazonaws.us-east-1.s3 as the service and select the Neptune VPC you created earlier.
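The same endpoint can be created with the AWS CLI; the VPC and route table IDs below are placeholders for the ones the Neptune setup created.

```bash
# Create a gateway VPC endpoint for S3 in the Neptune VPC.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-xxxxxxxx \
  --region us-east-1
```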

5. Launch an EC2 instance in one of the subnets inside the Neptune VPC

Create an EC2 instance in one of the subnets inside the Neptune VPC, as sketched below. The only thing to make sure of here is that you pick the VPC that Neptune created.
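Here is a CLI sketch of the launch, with placeholder IDs; any recent Amazon Linux AMI and a modest instance type will do.

```bash
# Launch an EC2 instance into one of the subnets of the Neptune VPC.
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t2.xlarge \
  --key-name my-keypair \
  --subnet-id subnet-xxxxxxxx \
  --count 1 \
  --region us-east-1
```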

6. SSH into the EC2 instance and initiate the load command

Now it is just a matter of running a cURL POST command on the EC2 instance to load data into Neptune. Make sure to substitute your own Neptune endpoint, source, format, and credentials. If all goes well, you'll see an HTTP 200 response and a loadId in the payload.
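A sketch of the load request is below. The endpoint, bucket, and role ARN are placeholders, and the exact credential fields (an IAM role ARN here versus access keys in the preview) depend on your Neptune version, so check the bulk loader documentation.

```bash
# Kick off a bulk load from S3 into Neptune.
# Format is "nquads" here; use ntriples, turtle, or rdfxml to match your export.
curl -s -X POST -H 'Content-Type: application/json' \
  "http://$NEPTUNE_ENDPOINT:8182/loader" -d '{
    "source": "s3://my-graph-feed-import/contentset.nq.gz",
    "format": "nquads",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
    "failOnError": "FALSE"
  }'
```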

7. Validate

Once you see the response with the loadId, you can use it to check the status of the import with a GET call to the Neptune endpoint. As you can see in the screenshot below, there were 7,225,094 total records in this content set.
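For example, with the loadId captured from the loader response (placeholder values again):

```bash
# Poll the loader for status; details=true returns per-file progress and errors.
curl -s "http://$NEPTUNE_ENDPOINT:8182/loader/$LOAD_ID?details=true"
```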

 

You can now use SPARQL to query your data with an HTTP POST.
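As a quick smoke test against the cluster's SPARQL endpoint (endpoint name is a placeholder):

```bash
# Run a simple SPARQL query over HTTP against the Neptune SPARQL endpoint.
curl -s -X POST "http://$NEPTUNE_ENDPOINT:8182/sparql" \
  --data-urlencode 'query=SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10'
```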

 

Now we’ll take a deeper dive into the internals of Neptune, look at what can be changed, understand its limitations, and finally SSH into the EC2 instance to query the data we’ve moved over.

Neptune Internals

Updates

Before loading all of the data into Neptune, I went ahead and modified the DB instance class from db.r4.4xlarge to db.r4.8xlarge, which bumps up the virtual CPUs and memory for all of the instances in the Neptune cluster.

You can modify the instance type on a live production cluster, but bear in mind that there will be a performance hit for a few minutes while the instances are being updated.
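In the preview this is done through the console; if you would rather script it, later versions of the AWS CLI expose an equivalent operation, roughly as follows (the instance identifier is a placeholder).

```bash
# Change the instance class for a Neptune DB instance, applying the change immediately.
aws neptune modify-db-instance \
  --db-instance-identifier my-neptune-instance-1 \
  --db-instance-class db.r4.8xlarge \
  --apply-immediately
```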

Beyond the instance specification, you can modify the storage from the starting base of 10 GB all the way up to 64 TB. You can, however, leave it alone and let it auto-scale as you add more data.

Internally, a Neptune cluster keeps six copies of the data across multiple Availability Zones (AZs) in the same region, and there can be up to 15 read replicas. In case of a primary node failure, Neptune can automatically fail over to a read replica, making it the new primary node.

Here is a good cheat sheet from Jerry Hargrove that details the internals and provides insight into Neptune’s inner layout.

ETL

As part of this tutorial, we created a new EC2 instance (t2.xlarge) with an additional data drive to extract and load data from Graph Feed into Neptune. I was also able to make the ETL process a little more efficient than before: instead of downloading the reference data to a local drive, I opted to download it onto the EC2 instance and use the S3 CLI to upload it to the S3 bucket before initiating the load into Neptune. This, along with an S3 endpoint in the Neptune VPC, cut the ETL time significantly.
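On the EC2 instance, that flow boils down to two commands; the path, bucket name, and download URL are placeholders.

```bash
# Pull the Graph Feed export straight onto the EC2 data drive...
curl -o /data/contentset.nq.gz "$DOWNLOAD_URL"
# ...then stage it in S3 so the Neptune bulk loader can read it through the VPC endpoint.
aws s3 cp /data/contentset.nq.gz s3://my-graph-feed-import/
```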

Neptune Limitations

Let’s quickly touch on a few limitations that I’ve come across.

Loading Data

A Neptune graph database can only operate in a single mode: either RDF or Property Graph. The type of file you load into Neptune first sets the database into that mode. Any attempt to load Property Graph data into an RDF-mode Neptune instance will be rejected with an error message.



Another loading limitation is that you can only load one data set at a time; the maximum concurrent load limit is 1 in the beta version. I have an email out to the AWS folks to see whether this will change in the GA version.



Scaling

Neptune scales up vertically to a certain point instead of scaling out without limit. The storage for the primary node and the six backup copies in the cluster auto-scales from 10 GB up to 64 TB.

There is a performance advantage to not scaling out: if the data were fragmented across different nodes, there would be a network-hop cost when querying it. This is an architectural decision with pros and cons, and it only qualifies as a limitation if your total data size exceeds 64 TB.

Single Region

All the instances of a Neptune cluster live in the same region, albeit in different AZs. Again, single-region could be an architectural decision made with read performance in mind. The side effect is that, in the worst case, Neptune will not recover from a region-wide failure, and in a not-so-worst case your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) will both increase because you’ll have to implement a multi-region Disaster Recovery (DR) strategy yourself. You can read more about DR strategy here.

Single Modal

There could be some limitations from Neptune being single-modal in terms of what it stores (RDF vs. Property Graph). This forces the application architecture to deal with multiple data sources. That is already the norm for most application stacks anyway, so it is not a huge limitation, unless I’m missing something.

Soft Limitations

There are other soft limitations: for example, you can only have three Neptune instances per account, and authentication during the beta phase is handled by limiting access to Neptune to EC2 instances on the same subnet as the cluster instances.

You can read about the remaining soft limitations here. Also, take a look at Dan’s blog post comparing Neptune with CM-Well to get an idea of other limitations, and check out the comment section there for a deeper understanding of some of them.

SSH

Now comes the fun part: let’s SSH into the box. Please reach out to me if you would like to try querying Neptune on this box.

Once you are on the box, we’ll look at a simple query based on this CM-Well tutorial and modify the following snippet to run it against Neptune.

As you can see, you won’t be able to provide the path information, but you can use the same SPARQL query to get a similar result.

Another thing to note is that $NEPTUNE_URL in the above command is an environment variable that echoes out to http://neptune.cluster-cubddbuiu6ju.us-east-1-beta.rds.amazonaws.com:8182.
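Since the original CM-Well snippet isn't reproduced here, a minimal stand-in for the same kind of call looks like this, reusing the $NEPTUNE_URL variable described above; the query itself is only illustrative.

```bash
# Export the cluster endpoint once, then query it with SPARQL over HTTP.
export NEPTUNE_URL="http://neptune.cluster-cubddbuiu6ju.us-east-1-beta.rds.amazonaws.com:8182"
curl -s -X POST "$NEPTUNE_URL/sparql" \
  --data-urlencode 'query=SELECT ?p ?o WHERE { ?s ?p ?o } LIMIT 25'
```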

---

By Pratik Pandey