Knowledge Graph Feed API

API Family: Knowledge Graph

Graph Feed and TigerGraph

This tutorial was originally published here.

Sample data

We downloaded data from Thomson Reuters | PermID, which is distributed in RDF format (N-Triples or Turtle). The available entities are:

  • Person (12GB): 82,920,763 triples
  • Organization (5.1GB): 35,932,311 triples
  • Quote (1.4GB): 10,385,788 triples
  • Instrument (222MB): 1,702,020 triples
  • Industry (981KB): 6,108 triples
  • Asset Class (610KB): 4,356 triples
  • Currency (472KB): 3,826 triples

Sample Organization triple lines:

Design scheme

  • Read through the n-triple files and collect every vertex and its type, based on the Predicate name:

       "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>".

  • Read through all n-triple files again to collect Vertex attributes and Edges.

  • Generate the GSQL schema file programmatically from the CSV schema file produced in the previous step.

  • Schema visualization on TigerGraph Studio
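
The first pass of the design scheme can be sketched roughly as follows. This is a minimal illustration, not the original script: the function names (collect_vertex_types, local_name) and the choice to use the last IRI segment as the type name are assumptions, and blank-node subjects are simply skipped.

```python
import re

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Matches "<subject> <predicate> <object-IRI> ." lines only.
# Triples whose object is a literal do not match and are skipped in this pass.
TRIPLE_IRI = re.compile(r'^(<[^>]*>)\s+(<[^>]*>)\s+(<[^>]*>)\s*\.\s*$')


def local_name(iri):
    """Return the last path or fragment segment of an IRI such as <http://x/y#Z>."""
    return re.split(r'[/#]', iri.strip('<>'))[-1]


def collect_vertex_types(lines):
    """Map each subject IRI to the type named by its rdf:type triple.

    `lines` can be any iterable of strings, e.g. an open file handle,
    so a large file is streamed rather than loaded into memory.
    """
    vertex_to_type = {}
    for line in lines:
        match = TRIPLE_IRI.match(line.strip())
        if match and match.group(2) == RDF_TYPE:
            # Assumption: the type name is the last segment of the class IRI.
            vertex_to_type[match.group(1)] = local_name(match.group(3))
    return vertex_to_type
```

Passing an open file (for example, open(path, encoding="utf-8")) keeps memory use bounded by the size of the mapping rather than the size of the input file.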

Loading data

TigerGraph doesn't support importing the RDF file format, so we have to convert the n-triple files to a column-based file format (e.g. CSV).

  • Load the Vertex2Type mapping file (from the design scheme step).
  • Read the n-triple files line by line (the files are too large to load into memory at once) and parse each triple line to extract the subject, predicate, and object:
    • Subject: is a Vertex.
    • Predicate: could be a Vertex Attribute or an Edge.
    • Object: could be an Attribute Value or an Edge TO (FROM Vertex TO Vertex).
  • Write to the CSV file line by line, one row for each line read from the n-triple file, with 4 columns per triple:
    • Vertex type: based on the Vertex2Type mapping from the first step; the Vertex type is "Edge" if the Attribute name is an Edge name.
    • Vertex id: extracted from the Subject of the triple.
    • Attribute name: extracted from the Predicate of the triple; the Attribute name could also be an Edge name.
    • Attribute value: extracted from the Object of the triple; the Attribute value could also be an Edge TO if the Attribute name is an Edge name.

  • Write the GSQL loader file programmatically.

  • Data statistics on TigerGraph Studio.
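
The line-by-line conversion described above might look like the sketch below. It is an illustration under stated assumptions, not the original script: the names triple_to_row and convert are hypothetical, a predicate is treated as an Edge when its object is a key of the Vertex2Type mapping, ids and names are taken from the last IRI segment, and subjects missing from the mapping get the placeholder type "Unknown". Escape sequences inside literals are not decoded.

```python
import csv
import io
import re

# Subject is an IRI or blank node, predicate is an IRI,
# object is whatever precedes the terminating " ." (IRI, blank node, or literal).
TRIPLE = re.compile(r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')
# A quoted literal with an optional datatype (^^<...>) or language tag (@en).
LITERAL = re.compile(r'^"(.*)"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?$')


def local_name(iri):
    """Return the last path or fragment segment of an IRI such as <http://x/y#Z>."""
    return re.split(r'[/#]', iri.strip('<>'))[-1]


def triple_to_row(line, vertex_to_type):
    """Turn one n-triple line into [vertex type, vertex id, attribute name, attribute value]."""
    match = TRIPLE.match(line.strip())
    if match is None:
        return None  # blank line, comment, or syntax this sketch does not handle
    subject, predicate, obj = match.groups()
    name = local_name(predicate)
    if obj in vertex_to_type:
        # Assumption: an object that is a known vertex means the predicate is an Edge.
        return ["Edge", local_name(subject), name, local_name(obj)]
    literal = LITERAL.match(obj)
    value = literal.group(1) if literal else local_name(obj)
    # Assumption: "Unknown" marks subjects missing from the mapping.
    return [vertex_to_type.get(subject, "Unknown"), local_name(subject), name, value]


def convert(lines, out, vertex_to_type):
    """Stream triples from `lines` into CSV rows on `out`; return the number of rows written."""
    writer = csv.writer(out)
    count = 0
    for line in lines:
        row = triple_to_row(line, vertex_to_type)
        if row is not None:
            writer.writerow(row)
            count += 1
    return count
```

Both `lines` and `out` can be open file handles, so each triple is read, converted, and written before the next one is touched.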

Evaluation

Convert to CSV format

  • Time to parse and convert the n-triple files to CSV files: 51 minutes. This could be reduced with parallel processing (e.g. Spark).
    • Person: 32 minutes.
    • Organization: 14 minutes.
    • Others: 5 minutes.

Loading data to TigerGraph

  • Time to load the CSV files (including all Vertexes and Edges): 12 minutes.
    • Person: 7.6 minutes.
    • Organization: 3.2 minutes.
    • Others: 1.2 minutes.
  • TigerGraph loaded roughly 154,000 triples/second.

TigerGraph Studio

http://18.233.30.9:14240/#/graph-explorer

Source Code

Python scripts to prepare data and GSQL files to work with TigerGraph