Graph Feed and TigerGraph
This tutorial was originally published here.
- Sample data
- Design scheme
- Loading data
- TigerGraph Studio
- Source Code
We downloaded data from Thomson Reuters | PermID, which is distributed in RDF format (N-Triples or Turtle). The available entities are:
- Person (12GB): 82,920,763 triples
- Organization (5.1GB): 35,932,311 triples
- Quote (1.4GB): 10,385,788 triples
- Instrument (222MB): 1,702,020 triples
- Industry (981KB): 6,108 triples
- Asset Class (610KB): 4,356 triples
- Currency (472KB): 3,826 triples
Sample Organization triple lines:
- Read through the n-triple files and collect every vertex and its type, based on the predicate name.
- Read through all n-triple files again to collect Vertex attributes and Edges.
- Generate the GSQL schema file programmatically, based on the CSV schema file from the previous step.
- Schema visualization on TigerGraph Studio
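The first pass above (collecting each vertex and its type) can be sketched roughly as follows. This is a minimal illustration, not the original source code: it assumes that a vertex's type is declared with the standard `rdf:type` predicate, and the helper names (`parse_triple`, `local_name`, `collect_vertex_types`) are hypothetical.

```python
import re

# Assumption: vertex types are declared with the standard rdf:type predicate.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Matches "<subject> <predicate> object ." where the object is an IRI or a literal.
# Blank-node subjects (e.g. "_:b0") are not matched and are skipped by this sketch.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def parse_triple(line):
    """Return (subject, predicate, object), or None for blank, comment, or unmatched lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = TRIPLE_RE.match(line)
    return match.groups() if match else None


def local_name(iri):
    """Return the last path or fragment segment of an IRI."""
    return re.split(r"[/#]", iri.rstrip("/#"))[-1]


def collect_vertex_types(lines):
    """Build a subject-IRI -> type-name mapping from an iterable of N-Triples lines."""
    vertex2type = {}
    for line in lines:  # iterating a file object streams it line by line
        triple = parse_triple(line)
        if triple is None:
            continue
        subject, predicate, obj = triple
        # Only IRI objects (wrapped in <...>) can name a type.
        if predicate == RDF_TYPE and obj.startswith("<") and obj.endswith(">"):
            vertex2type[subject] = local_name(obj[1:-1])
    return vertex2type
```

Passing an open file handle as `lines` keeps memory use bounded by the size of the mapping rather than the size of the input file.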
TigerGraph doesn't support importing RDF files directly, so we have to convert the n-triple files to a column-based file format (e.g., CSV).
- Loading Vertex2Type mapping file (from design schema step).
- Read the n-triple files line by line (the files are too large to load into memory at once) and parse each triple line to extract the subject, predicate, and object:
- Subject: is a Vertex.
- Predicate: could be a Vertex Attribute or an Edge.
- Object: could be an Attribute Value or an Edge TO (FROM Vertex TO Vertex).
- Write one CSV line for each line read from the n-triple file, with 4 columns per triple:
- Vertex type: taken from the Vertex2Type mapping loaded in the first step; the Vertex type is "Edge" if the Attribute name is an Edge name.
- Vertex id: extracted from the Subject of the triple.
- Attribute name: extracted from the Predicate of the triple; the attribute name could also be an Edge name.
- Attribute value: extracted from the Object of the triple; the attribute value could also be an Edge TO if the Attribute name is an Edge name.
- Write GSQL loader file programmatically.
- Data statistics on TigerGraph Studio.
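The conversion steps above can be illustrated with a short streaming sketch. It is not the original source code. It assumes that a predicate is an Edge when its object is an IRI present in the Vertex2Type mapping, that the vertex id is the last segment of the subject IRI, and that vertices missing from the mapping get a placeholder type `"Unknown"`. All function names are hypothetical, and escape sequences inside literals are left untouched.

```python
import csv
import re

# "<subject> <predicate> object ." with an IRI or literal object.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')
# A quoted literal, optionally followed by a datatype (^^<...>) or language tag (@en).
LITERAL_RE = re.compile(r'^"(.*)"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?$', re.DOTALL)


def parse_triple(line):
    """Return (subject, predicate, object), or None for blank, comment, or unmatched lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = TRIPLE_RE.match(line)
    return match.groups() if match else None


def local_name(iri):
    """Return the last path or fragment segment of an IRI."""
    return re.split(r"[/#]", iri.rstrip("/#"))[-1]


def triple_to_row(triple, vertex2type):
    """Map one triple to [vertex type, vertex id, attribute name, attribute value]."""
    subject, predicate, obj = triple
    name = local_name(predicate)
    target = obj[1:-1] if obj.startswith("<") and obj.endswith(">") else None
    # Assumption: the predicate is an Edge when the object is a known vertex.
    if target is not None and target in vertex2type:
        return ["Edge", local_name(subject), name, local_name(target)]
    if target is not None:
        value = target  # an IRI that is not a known vertex is kept as a plain value
    else:
        match = LITERAL_RE.match(obj)
        value = match.group(1) if match else obj
    return [vertex2type.get(subject, "Unknown"), local_name(subject), name, value]


def convert(lines, out, vertex2type):
    """Stream triples from `lines` into the CSV file object `out`; return rows written."""
    writer = csv.writer(out)
    count = 0
    for line in lines:
        triple = parse_triple(line)
        if triple is None:
            continue
        writer.writerow(triple_to_row(triple, vertex2type))
        count += 1
    return count
```

Because `convert` writes each row as soon as its triple is parsed, neither the input nor the output has to fit in memory. Open the output file with `newline=""`, as the `csv` module documentation recommends.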
Convert to CSV format
- Time to parse and convert the n-triple files to CSV files: 51 minutes in total. This could be reduced with parallel processing (e.g., Spark).
- Person: 32 minutes.
- Organization: 14 minutes.
- Others: 5 minutes.
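As a rough illustration of the parallel-processing idea, the sketch below converts several n-triple files concurrently with Python's standard-library `ProcessPoolExecutor` rather than Spark. It is only an assumption about how such a setup might look: the function names and the default worker count are hypothetical, and `convert_file` simply writes the raw subject, predicate, and object as three columns to keep the example short.

```python
import csv
import re
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

# Three whitespace-separated terms followed by the terminating dot.
TRIPLE_RE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')


def convert_file(src, dst):
    """Stream one N-Triples file into a CSV file; return the number of triples written."""
    count = 0
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for line in fin:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            match = TRIPLE_RE.match(line)
            if match is None:
                continue  # skip lines this simple pattern cannot parse
            writer.writerow(match.groups())
            count += 1
    return count


def convert_all(sources, out_dir, workers=4):
    """Convert each source file in its own worker process; return {source: triple count}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    targets = [str(out / (Path(s).stem + ".csv")) for s in sources]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(convert_file, sources, targets))
    return dict(zip(sources, counts))
```

On platforms that start worker processes by spawning, call `convert_all` from under an `if __name__ == "__main__":` guard. With one process per file, the wall-clock time cannot drop below that of the largest file (Person, 32 of the 51 minutes), so a larger gain would require splitting that file first. N-Triples stores one triple per line, so the file can be split on line boundaries.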
Loading data to TigerGraph
- Time to load CSV file (including all Vertexes and Edges): 12 minutes
- Person: 7.6 minutes.
- Organization: 3.2 minutes.
- Others: 1.2 minutes.
- TigerGraph could load 154,000 triples/second.