PermID Entity Bulk Download with Apache Jena

Jirapongse Phuriphanvichai
Developer Advocate

Note: The bulk files have not been published since August 8, 2021.

This article demonstrates one way to answer the following question:

“In which organizations does a person with the last name ‘Trump’ hold positions?”

 

 

Overview

In daily life, we rely on unique identifiers for many things. For example, we have identification numbers that identify us, and we use unique email addresses to send and receive email or to log in to Facebook or Amazon. We also use unique identifiers to distinguish other things, such as the International Standard Book Number (ISBN) for books, the International Securities Identification Number (ISIN) for securities, and the International Maritime Organization (IMO) number for ships.

A unique identifier is an identifier that is guaranteed to be unique among all identifiers used for those objects and for a specific purpose[1]. It can be unique within a local or a global scope. For example, an employee ID uniquely identifies each employee within a company, while an ISIN is a globally unique code that identifies a security.

This article introduces PermIDs from Refinitiv, which are open, permanent, and universal identifiers for data. It focuses on how to load the PermID entity bulk download files into Apache Jena and query them to answer specific questions.

Open PermID

PermID is short for “Permanent Identifier”, a machine-readable number assigned to entities such as securities, organizations (companies, government agencies, universities, etc.), quotes, individuals, and more. It is designed specifically for machines to reference related information programmatically. Open PermID is publicly available for free at https://permid.org/.

Open PermID also supports several APIs.

  • Open PermID - Record Matching - RESTful API

    Make a request to the Record Matching API to match a collection of Organizations, Instruments, Quotes, or Persons against Refinitiv Open PermID.

  • Open PermID - Entity Search - RESTful API

    Use the Entity Search API to look up a PermID, or to search for a PermID by name, ticker, or RIC.

  • Intelligent Tagging - RESTful API

    Intelligent Tagging is a sophisticated web service designed to let people in the financial domain extract insight from unstructured content. It also maps the metadata tags in the tagging output to Refinitiv unique IDs.

Moreover, Open PermID also provides bulk files (one per entity type) containing the complete lists of the entities via Entity Bulk Download.

Entity Bulk Download

Open PermID provides bulk files of the following entities:

  • Organization
  • Instrument
  • Quote
  • Asset Class
  • Currency
  • Instrument Code
  • Person

These files are updated on a weekly basis. The following represents relationships between Open PermID entities.

 

 

The bulk files are provided as zipped files in both the Turtle (Terse RDF Triple Language) and the N-Triples formats.

For example, the Turtle and N-Triples representations of Refinitiv UK Financial Ltd are:

Turtle format

<https://permid.org/1-8589934184>
        a                               tr-org:Organization ;
        mdaas:HeadquartersAddress       "\n\n\n\n\n\nUnited Kingdom\n"^^xsd:string ;
        mdaas:RegisteredAddress         "Canary Wharf\nThomson Reuters Building\n30 South Colonnade\nLONDON\nE14 5EP\nUnited Kingdom\n"^^xsd:string ;
        tr-common:hasPermId             "8589934184"^^xsd:string ;
        tr-org:hasActivityStatus        tr-org:statusActive ;
        tr-org:hasLatestOrganizationFoundedDate
                "1986-04-21T00:00:00Z"^^xsd:dateTime ;
        tr-org:hasPrimaryBusinessSector
                <https://permid.org/1-4294952739> ;
        tr-org:hasPrimaryEconomicSector
                <https://permid.org/1-4294952740> ;
        tr-org:hasPrimaryIndustryGroup  <https://permid.org/1-4294952738> ;
        tr-org:isIncorporatedIn         <http://sws.geonames.org/2635167/> ;
        fibo-be-le-cb:isDomiciledIn     <http://sws.geonames.org/2635167/> ;
        vcard:organization-name         "Refinitiv UK Financial Ltd"^^xsd:string .

N-Triples format

<https://permid.org/1-8589934184> <http://ont.thomsonreuters.com/mdaas/RegisteredAddress> "Canary Wharf\nThomson Reuters Building\n30 South Colonnade\nLONDON\nE14 5EP\nUnited Kingdom\n"^^<http://www.w3.org/2001/XMLSchema#string> .
<https://permid.org/1-8589934184> <http://permid.org/ontology/organization/hasPrimaryIndustryGroup> <https://permid.org/1-4294952738> .
<https://permid.org/1-8589934184> <http://permid.org/ontology/organization/hasActivityStatus> <http://permid.org/ontology/organization/statusActive> .
<https://permid.org/1-8589934184> <http://www.omg.org/spec/EDMC-FIBO/BE/LegalEntities/CorporateBodies/isDomiciledIn> <http://sws.geonames.org/2635167/> .
<https://permid.org/1-8589934184> <http://permid.org/ontology/common/hasPermId> "8589934184"^^<http://www.w3.org/2001/XMLSchema#string> .
<https://permid.org/1-8589934184> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://permid.org/ontology/organization/Organization> .
<https://permid.org/1-8589934184> <http://permid.org/ontology/organization/hasLatestOrganizationFoundedDate> "1986-04-21T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
<https://permid.org/1-8589934184> <http://www.w3.org/2006/vcard/ns#organization-name> "Refinitiv UK Financial Ltd"^^<http://www.w3.org/2001/XMLSchema#string> .

From the above information, the PermID for Refinitiv UK Financial Ltd is 8589934184.
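
As a small illustration, the PermID can also be pulled out of an entity URI programmatically. The following Python sketch assumes that entity URIs follow the https://permid.org/1-<number> pattern seen in the examples in this article. The function name permid_from_uri is hypothetical and is not part of any Open PermID API.

```python
import re
from typing import Optional

# Assumption: entity URIs look like https://permid.org/1-<digits>,
# as in the examples shown in this article.
_PERMID_URI = re.compile(r"^https?://permid\.org/1-(\d+)$")


def permid_from_uri(uri: str) -> Optional[str]:
    """Return the numeric PermID embedded in an entity URI, or None."""
    match = _PERMID_URI.match(uri.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    print(permid_from_uri("https://permid.org/1-8589934184"))
```

URIs that do not match the assumed pattern, such as the geonames URIs used for countries, yield None rather than an error.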

Both Turtle and N-Triples are file formats for storing and expressing data in the Resource Description Framework (RDF) data model, a general method for describing data by defining relationships between data objects. RDF is commonly used for building knowledge graphs.

RDF uses resources, properties, and property values to describe things. For instance, in the examples above:

  • A resource is https://permid.org/1-8589934184
  • A property is vcard:organization-name
  • A property’s value is "Refinitiv UK Financial Ltd"^^xsd:string

Together they state that the vcard:organization-name of https://permid.org/1-8589934184 is the string "Refinitiv UK Financial Ltd". The ‘^^’ marker specifies the datatype of the property value.
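
To make the resource, property, and property value structure concrete, the following Python sketch splits one N-Triples line into its parts. It is a simplified illustration, not a complete N-Triples parser: it assumes IRI subjects and predicates, and objects that are either IRIs or quoted literals with an optional datatype, which covers the lines shown in this article. The names Triple and parse_ntriple are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str             # resource IRI
    predicate: str           # property IRI
    obj: str                 # IRI or literal text (escape sequences left as-is)
    datatype: Optional[str]  # datatype IRI for typed literals, else None


# Simplified pattern: IRI subject, IRI predicate, and an object that is either
# an IRI or a quoted literal with an optional ^^<datatype>. Blank nodes and
# language tags are deliberately not handled in this sketch.
_LINE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+'
    r'(?:<([^>]*)>|"((?:[^"\\]|\\.)*)"(?:\^\^<([^>]*)>)?)\s*\.\s*$'
)


def parse_ntriple(line: str) -> Triple:
    """Split one supported N-Triples line into subject, predicate, and object."""
    match = _LINE.match(line)
    if match is None:
        raise ValueError(f"not a supported N-Triples line: {line!r}")
    subject, predicate, iri, literal, datatype = match.groups()
    if iri is not None:
        return Triple(subject, predicate, iri, None)
    return Triple(subject, predicate, literal, datatype)
```

For the organization-name line above, the result has the PermID URI as the subject, the vcard property IRI as the predicate, and the company name as the object with the XML Schema string datatype.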

The next section introduces Apache Jena which is a Java framework that supports RDF and shows how to install it on a Windows machine.

Apache Jena

Apache Jena is a free and open-source Java framework for building Semantic Web and Linked Data applications. It provides several APIs and components to process RDF data. The following picture shows the framework architecture of Apache Jena.

 

 

The core APIs of Apache Jena are the RDF API, used to process RDF data, and the SPARQL API, used to query RDF data. SPARQL is an RDF query language that can retrieve and manipulate data stored in RDF format. It looks similar to SQL for relational databases. In addition, this article uses TDB for RDF storage and Fuseki as a web interface for querying RDF data.

To install Apache Jena and Fuseki on Windows, please follow these steps.

  1. Download Apache Jena and Apache Jena Fuseki from https://jena.apache.org/download/. This article uses apache-jena-3.14.0 and apache-jena-fuseki-3.14.0.
  2. Download Java from https://www.java.com/en/download. This version of Jena requires Java 8.
  3. Install Java 8 on the machine and verify its version:

C:\>java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

  4. Decompress apache-jena-3.14.0 and apache-jena-fuseki-3.14.0 into a directory (C:\workspace).
  5. Set the JENA_HOME environment variable to the directory of Apache Jena (C:\workspace\apache-jena-3.14.0):

set JENA_HOME=C:\workspace\apache-jena-3.14.0

  6. Run bat\sparql.bat --version to verify the version:

C:\workspace\apache-jena-3.14.0>bat\sparql.bat --version
Jena:       VERSION: 3.14.0
Jena:       BUILD_DATE: 2020-01-16T14:55:05+0000
Now, Apache Jena is ready to be used. Next, I will show how to load bulk Open PermID files into Apache Jena TDB.

Loading bulk Open PermID files into Apache Jena TDB

Bulk Open PermID files are available at the Open PermID website (https://permid.org/). If you don’t have an account, you can register to create one. After logging in, select “Download Entity Data” from the menu to see all available files. This article uses the organization and person files. You can choose either the ttl or the ntriples files.

 

 

  • Loading TTL files
  1. Download and uncompress the organization and person bulk Open PermID TTL files
  2. Verify that the OpenPermID-bulk-organization-xxx.ttl file contains the following prefix at the beginning

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

You can use the type command to verify it. Note that head is not a built-in Windows command, so the command below assumes that a Unix-style head utility is available on the PATH.

C:\workspace\apache-jena-3.14.0>type OpenPermID-bulk-organization-xxx.ttl | head
@prefix tr-common: <http://permid.org/ontology/common/> .
@prefix fibo-be-le-cb: <http://www.omg.org/spec/EDMC-FIBO/BE/LegalEntities/CorporateBodies/> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
@prefix tr-org: <http://permid.org/ontology/organization/> .
@prefix mdaas: <http://ont.thomsonreuters.com/mdaas/> .
@prefix tr-fin: <http://permid.org/ontology/financial/> .

<https://permid.org/1-5037622197>
        a                               tr-org:Organization ;
        mdaas:HeadquartersAddress       "Germany\n"^^xsd:string ;

The above output shows that the OpenPermID-bulk-organization-xxx.ttl file does not declare the xsd prefix, so you need to add it at the beginning of the file. One way is to create a new file, such as Organization.ttl, that contains the above prefix followed by a newline, and then run the following command to append OpenPermID-bulk-organization-xxx.ttl to Organization.ttl. Note that cat is not a built-in Windows command; in the standard command prompt, type can be used in its place.

cat OpenPermID-bulk-organization-xxx.ttl >> Organization.ttl
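
The same step can also be scripted in a platform-independent way. The following is a minimal Python sketch that writes the missing prefix declarations to a new file and then streams the bulk file after them. The helper name prepend_prefixes is hypothetical, and the file names in the usage note are the ones assumed in this article.

```python
import shutil
from pathlib import Path
from typing import Iterable


def prepend_prefixes(source: Path, target: Path, prefixes: Iterable[str]) -> None:
    """Write the prefix declarations to target, then append source unchanged."""
    with target.open("w", encoding="utf-8", newline="\n") as out:
        for declaration in prefixes:
            # Each declaration ends with exactly one newline.
            out.write(declaration.rstrip("\n") + "\n")
    # Copy the bulk file in binary chunks so it is never held fully in memory.
    with source.open("rb") as src, target.open("ab") as dst:
        shutil.copyfileobj(src, dst)
```

For example, prepend_prefixes(Path("OpenPermID-bulk-organization-xxx.ttl"), Path("Organization.ttl"), ["@prefix xsd: <http://www.w3.org/2001/XMLSchema#> ."]) would produce an Organization.ttl file that starts with the xsd declaration.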

  3. Verify that the OpenPermID-bulk-person-xxx.ttl file contains the following prefixes at the beginning

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

You can use the type command to verify it. As before, the command assumes that a Unix-style head utility is available on the PATH.

C:\workspace\apache-jena-3.14.0>type OpenPermID-bulk-person-xxx.ttl | head
@prefix tr-vcard: <http://permid.org/ontology/tr-vcard/#> .
@prefix tr-common: <http://permid.org/ontology/common/> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
@prefix tr-person: <http://permid.org/ontology/person/> .

<https://permid.org/1-10010140>
        a                       tr-person:DirectorRole ;
        tr-common:hasPermId     "10010140"^^xsd:string ;
        tr-person:rank          "14"^^xsd:string ;
        skos:prefLabel          "Non-Executive Vice Chairman"^^xsd:string .

The above output shows that the OpenPermID-bulk-person-xxx.ttl file does not declare the xsd and skos prefixes, so you need to add them at the beginning of the file. One way is to create a new file, such as Person.ttl, that contains the above prefixes followed by a newline, and then run the following command to append OpenPermID-bulk-person-xxx.ttl to Person.ttl.

cat OpenPermID-bulk-person-xxx.ttl >> Person.ttl
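
Instead of inspecting the first lines by eye, the check for missing prefixes can be approximated with a short script. The following Python sketch scans the beginning of a Turtle file and reports prefixes that are used but not declared. It is a heuristic under stated assumptions, not a Turtle parser: it assumes @prefix declarations and string literals each sit on a single line, as in the bulk file excerpts above. The function name undeclared_prefixes and the limit parameter are hypothetical.

```python
import re
from typing import Iterable, Set

# "@prefix name: <iri> ." declarations.
_DECL = re.compile(r"^\s*@prefix\s+([A-Za-z][\w-]*):")
# IRIs and quoted strings are blanked out before looking for prefixed names.
_IRI_OR_STRING = re.compile(r'<[^>]*>|"(?:[^"\\]|\\.)*"')
# A prefixed name such as tr-org:Organization or xsd:string.
_USED = re.compile(r"(?<![\w-])([A-Za-z][\w-]*):(?=[A-Za-z_])")


def undeclared_prefixes(lines: Iterable[str], limit: int = 10000) -> Set[str]:
    """Return prefixes used but not declared within the first `limit` lines."""
    declared: Set[str] = set()
    used: Set[str] = set()
    for index, line in enumerate(lines):
        if index >= limit:
            break
        declaration = _DECL.match(line)
        if declaration:
            declared.add(declaration.group(1))
            continue
        stripped = _IRI_OR_STRING.sub(" ", line)
        used.update(_USED.findall(stripped))
    return used - declared
```

Passing an open file object, for example the result of open("OpenPermID-bulk-person-xxx.ttl", encoding="utf-8"), lets the function read only the first lines of a large file.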

  4. Next, run the following command in the Apache Jena directory to load those files into Apache Jena TDB. The database is located at c:\workspace\database.

C:\workspace\apache-jena-3.14.0>bat\tdb2_tdbloader.bat --loc c:\workspace\database Organization.ttl Person.ttl

It takes some time (about an hour) to populate the database. If you see errors regarding the xsd or skos prefixes, make sure that those prefixes are declared at the beginning of the TTL files.

  5. You can use the tdb2_tdbdump.bat command to dump the database:

C:\workspace\apache-jena-3.14.0>bat\tdb2_tdbdump.bat --loc c:\workspace\database

Press Ctrl+C to terminate the job.

 

  • Loading N-Triples files
  1. Download and uncompress the organization and person bulk Open PermID ntriples files
  2. Change the file extension from ntriples to nt
  3. Next, run the following command in the Apache Jena directory to load those files into Apache Jena TDB. The database is located at c:\workspace\database.

C:\workspace\apache-jena-3.14.0>bat\tdb2_tdbloader.bat --loc c:\workspace\database OpenPermID-bulk-organization-xxx.nt OpenPermID-bulk-person-xxx.nt

  4. You can use the tdb2_tdbdump.bat command to dump the database:

C:\workspace\apache-jena-3.14.0>bat\tdb2_tdbdump.bat --loc c:\workspace\database

Press Ctrl+C to terminate the job.

At this point, the Apache Jena TDB store is ready to be used. Next, I will send queries to the database in order to answer specific questions.

Querying the database

In this section, SPARQL is used to query the RDF store. SPARQL is a set of specifications that provide languages and protocols to query and manipulate an RDF store, such as Apache Jena TDB. The data in the RDF store is a collection of triples (three-part statements). A triple has a resource (subject), a property (predicate), and a property value (object). For example, the output of the tdb2_tdbdump.bat command lists the data as triples:

<https://permid.org/1-8589934184> <http://www.w3.org/2006/vcard/ns#organization-name> "Refinitiv UK Financial Ltd" .

Typically, a SPARQL query consists of two parts:

  • the SELECT clause identifies the variables (prefixed with ‘?’) to appear in the query results
  • the WHERE clause provides the triple pattern to match against the RDF data

SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object
}

For example, to find the subject whose <http://www.w3.org/2006/vcard/ns#organization-name> property is "Refinitiv UK Financial Ltd", the SPARQL query looks like:

SELECT ?subject
WHERE {
  ?subject <http://www.w3.org/2006/vcard/ns#organization-name> "Refinitiv UK Financial Ltd"
}

The result will be:

-------------------------------------
| subject                           |
=====================================
| <https://permid.org/1-8589934184> |
-------------------------------------
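
To illustrate what matching a triple pattern means, the following Python sketch applies one pattern to a small in-memory list of triples and yields the variable bindings. It is only a toy model of the idea, not how Apache Jena evaluates queries, and the name match_pattern is hypothetical. The two sample triples are taken from the Refinitiv UK Financial Ltd example in this article.

```python
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]

STORE: List[Triple] = [
    ("https://permid.org/1-8589934184",
     "http://www.w3.org/2006/vcard/ns#organization-name",
     "Refinitiv UK Financial Ltd"),
    ("https://permid.org/1-8589934184",
     "http://permid.org/ontology/common/hasPermId",
     "8589934184"),
]


def match_pattern(pattern: Triple, triples: List[Triple]) -> Iterator[Dict[str, str]]:
    """Yield one variable binding per triple that matches the pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for triple in triples:
        binding: Dict[str, str] = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding
```

Calling match_pattern with the pattern ("?subject", "http://www.w3.org/2006/vcard/ns#organization-name", "Refinitiv UK Financial Ltd") binds ?subject to the same PermID URI that appears in the result table above.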

The SPARQL query can be sent to Apache Jena TDB via the command line (bat\tdb2_tdbquery.bat) or the web user interface (Apache Jena Fuseki).

  • bat\tdb2_tdbquery.bat

Save the query into a text file (query.txt) and then run the following command:

C:\workspace\apache-jena-3.14.0>bat\tdb2_tdbquery.bat --loc c:\workspace\database --file query.txt

  • Apache Jena Fuseki

Change to the Apache Jena Fuseki directory and then run the following command:

C:\workspace\apache-jena-fuseki-3.14.0>fuseki-server.bat --loc=c:\workspace\database --port 3030 --tdb2 /ds

This command starts a web server listening on TCP port 3030. Then, open http://127.0.0.1:3030/ in a web browser and select the query action for the /ds dataset. After that, enter the query and run it.
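
Besides the web page, a running Fuseki server can also be queried over HTTP from a script. The following Python sketch builds a SPARQL Protocol POST request using only the standard library. It assumes that the /ds dataset exposes its query service at /ds/query, which is an assumption about the server configuration and may need adjusting. The function names build_query_request and run_query are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Fuseki runs on port 3030 with the /ds dataset, and its query
# service is reachable at /ds/query. Adjust the URL if your setup differs.
FUSEKI_QUERY_URL = "http://127.0.0.1:3030/ds/query"


def build_query_request(query: str, endpoint: str = FUSEKI_QUERY_URL) -> urllib.request.Request:
    """Build a SPARQL Protocol POST request that asks for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def run_query(query: str, endpoint: str = FUSEKI_QUERY_URL) -> dict:
    """Send the query to the server and return the decoded JSON document."""
    with urllib.request.urlopen(build_query_request(query, endpoint)) as response:
        return json.load(response)
```

Only run_query needs a running server; build_query_request can be inspected offline to check the URL and the encoded query.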

 

 

Asking Questions

To ask questions of the RDF store loaded with the bulk person and organization files, you need to understand the relationships among entities. You can refer to the ontology files available at http://permid.org/ontology/organization/ and http://permid.org/ontology/person/. The WebVOWL tool (http://vowl.visualdataweb.org/webvowl.html) can also be used to visualize the ontology files.

 

 

The following are sample questions that Open PermID can answer.

  1. In which organizations does a person with the last name “Trump” hold positions?

    To answer this question, the query uses the vcard:family-name property to find people with the last name “Trump”. Then, the tr-person:holdsPosition and tr-person:isPositionIn properties are used to find the organizations in which those people hold positions. The query is:

prefix tr-common: <http://permid.org/ontology/common/>
prefix vcard: <http://www.w3.org/2006/vcard/ns#>
prefix tr-person: <http://permid.org/ontology/person/>

SELECT DISTINCT ?PermID ?FirstName ?MiddleName ?LastName ?Position ?Organization
WHERE {
  ?PermID vcard:family-name "Trump" .
  ?PermID vcard:given-name ?FirstName .
  ?PermID vcard:family-name ?LastName .
  ?PermID tr-common:hasPublicationStatus tr-common:publicationstatuspublished .
  OPTIONAL { ?PermID vcard:additional-name ?MiddleName . }
  ?PermID tr-person:holdsPosition ?h .
  ?h tr-person:hasReportedTitle ?Position .
  ?h tr-person:isPositionIn ?c .
  ?c vcard:organization-name ?Organization .
}

The results display the matched person PermIDs, first names, middle names, last names, positions, and organizations:
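
The same query can be reused for other family names by generating the query text. The following Python sketch fills the family name into the query from this section and escapes it as a SPARQL string literal first, so that quotes or backslashes in the input cannot break the query. The function names sparql_string_literal and positions_by_family_name are hypothetical.

```python
def sparql_string_literal(text: str) -> str:
    """Quote text as a SPARQL string literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def positions_by_family_name(family_name: str) -> str:
    """Return the positions query from this section for a given family name."""
    name = sparql_string_literal(family_name)
    return f"""\
prefix tr-common: <http://permid.org/ontology/common/>
prefix vcard: <http://www.w3.org/2006/vcard/ns#>
prefix tr-person: <http://permid.org/ontology/person/>

SELECT DISTINCT ?PermID ?FirstName ?MiddleName ?LastName ?Position ?Organization
WHERE {{
  ?PermID vcard:family-name {name} .
  ?PermID vcard:given-name ?FirstName .
  ?PermID vcard:family-name ?LastName .
  ?PermID tr-common:hasPublicationStatus tr-common:publicationstatuspublished .
  OPTIONAL {{ ?PermID vcard:additional-name ?MiddleName . }}
  ?PermID tr-person:holdsPosition ?h .
  ?h tr-person:hasReportedTitle ?Position .
  ?h tr-person:isPositionIn ?c .
  ?c vcard:organization-name ?Organization .
}}
"""
```

The generated text can be saved to query.txt for bat\tdb2_tdbquery.bat or pasted into the Fuseki query page.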

 

 

  2. In which organizations in China does a person with an MBA degree from Stanford University hold positions?

    To answer this question, the query uses the tr-person:hasQualification and tr-person:withDegree properties to find people with a Master of Business Administration degree from Stanford University. Then, it selects the people who hold positions in organizations incorporated in China (http://sws.geonames.org/1814991/). The query is:

prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix tr-common: <http://permid.org/ontology/common/>
prefix vcard: <http://www.w3.org/2006/vcard/ns#>
prefix tr-person: <http://permid.org/ontology/person/>
prefix tr-org: <http://permid.org/ontology/organization/>
prefix mdaas: <http://ont.thomsonreuters.com/mdaas/>

SELECT DISTINCT ?person ?FirstName ?Position ?Organization ?addr
WHERE {
  ?p tr-person:fromInstitutionName "Stanford University" .
  ?d skos:prefLabel "Master of Business Administration" .
  ?p tr-person:withDegree ?d .
  ?person tr-person:hasQualification ?p .
  ?person vcard:given-name ?FirstName .
  ?person tr-person:holdsPosition ?h .
  ?person tr-common:hasPublicationStatus tr-common:publicationstatuspublished .
  ?h tr-person:hasReportedTitle ?Position .
  ?h tr-person:isPositionIn ?c .
  ?c vcard:organization-name ?Organization .
  ?c tr-org:isIncorporatedIn <http://sws.geonames.org/1814991/> .
  OPTIONAL { ?c mdaas:RegisteredAddress ?addr . }
}

The results display the matched person PermIDs, first names, positions, organizations, and addresses:
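
When a query is run against an endpoint that returns the standard SPARQL 1.1 Query Results JSON Format, the result can be flattened into simple rows for further processing. The following Python sketch does that and fills in None for variables that an OPTIONAL clause left unbound, such as ?addr. The function name rows_from_sparql_json is hypothetical, and the sample document contains made-up placeholder values, not real Open PermID data.

```python
import json
from typing import Dict, List, Optional


def rows_from_sparql_json(document: dict) -> List[Dict[str, Optional[str]]]:
    """Flatten a SPARQL JSON result document into one dict per result row."""
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # Variables left unbound by OPTIONAL are simply absent from a binding.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


# Hypothetical sample document with placeholder values for illustration only.
SAMPLE = json.loads("""
{
  "head": {"vars": ["person", "FirstName", "addr"]},
  "results": {"bindings": [
    {"person": {"type": "uri", "value": "https://example.org/person/1"},
     "FirstName": {"type": "literal", "value": "Alex"}}
  ]}
}
""")

if __name__ == "__main__":
    for row in rows_from_sparql_json(SAMPLE):
        print(row)
```

Keeping every selected variable in each row, even when it is unbound, makes the rows easy to write to a CSV file or load into a table.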

 

 

Summary

Open PermID is a unique permanent identifier assigned to entities such as securities, organizations, quotes, individuals, and more. It is freely available at https://permid.org/. Open PermID also provides APIs to look up a PermID by name, ticker, or RIC, and bulk files for the entities in both TTL and ntriples formats. One way to use these files is to load them into an RDF store, such as Apache Jena TDB. After that, SPARQL can be used to query the RDF store. Queries can be sent through the command line or through the web user interface provided by Apache Jena Fuseki.

References

  1. “Unique identifier”, In Wikipedia. Retrieved Feb 7, 2020, from https://en.wikipedia.org/wiki/Unique_identifier
  2. “Open PermID”, Refinitiv Developer Community, https://developers.refinitiv.com/open-permid
  3. “What is RDF?”, ontotext, https://www.ontotext.com/knowledgehub/fundamentals/what-is-rdf/
  4. “Turtle (syntax)”, In Wikipedia. Retrieved Feb 7, 2020, from https://en.wikipedia.org/wiki/Turtle_(syntax)
  5. “XML RDF”, w3schools.com, https://www.w3schools.com/xml/xml_rdf.asp
  6. “Apache Jena”, https://jena.apache.org/index.html
  7. “Apache Jena”, In Wikipedia. Retrieved Feb 7, 2020, from https://en.wikipedia.org/wiki/Apache_Jena
  8. “SPARQL”, In Wikipedia. Retrieved Feb 7, 2020, from https://en.wikipedia.org/wiki/SPARQL
  9. “SPARQL 1.1 Overview”, W3C, March 21, 2013, https://www.w3.org/TR/2013/REC-sparql11-overview-20130321/
  10. Bob DuCharme. “SPARQL in 11 minutes”, YouTube, May 3, 2015, https://www.youtube.com/watch?v=FvGndkpa4K0
  11. “Visual Data Web”, http://visualdataweb.de