
Exploring News Metadata - with Refinitiv Data Platform and Python

August 12, 2020

Author:

Zoya Farberov

Introduction

Refinitiv Data Platform provides simple web-based API access to a broad range of content. Refinitiv Data Platform (RDP) News service encompasses diverse news content sets:

News Alerts
Headline Search
Story Retrieval
Metadata Retrieval

News metadata describes the content of a story and is created and packaged with the story. It is most often retrieved and presented with the story. The hierarchy of news metadata is available for separate retrieval. Metadata enables accurate and efficient search and discovery of the relevant news content.

Next, we are going to discuss RDP News metadata retrieval capabilities, with code examples in Python.

Prerequisites

A valid machineId, the corresponding long password and a client_id
Permission for the RDP News service
Python 3 installed
Jupyter Notebook or Jupyter Lab installed (or one can run the same code via plain Python)
Python libraries installed

Simple And Effective

A simple but very powerful use case of RDP News metadata is to look up a news code or a news topic.

For example, we can send a request to look up code B:227:

    	
            https://api.refinitiv.com/data/news/v1/metadata/B:227

And examine the result

    	
            {
  "newsCode": {
    "id": "B:227",
    "description": "Manufacturers, extractors and refiners of chemicals, minerals, precious metals, steel, aluminum, forest products, and construction and other raw materials.",
    "label": "Basic Materials (TRBC level 1)",
    "group": "BusinessSectors",
    "readable": "Topic:BMAT",
    "searchable": true,
    "childrenCount": 105
  }

}
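
As a minimal sketch of this lookup, the snippet below builds the request URL for a code and pulls a few fields out of a response shaped like the one above. The helper names metadata_url and summarize_news_code are hypothetical, the sample payload is an abbreviated copy of the result shown above, and no network call is made.

```python
import json

BASE_URL = "https://api.refinitiv.com/data/news/v1/metadata"  # endpoint shown above


def metadata_url(code, children=False):
    """Build the lookup URL for a news code (hypothetical helper)."""
    url = BASE_URL + "/" + code
    return url + "/children" if children else url


def summarize_news_code(payload):
    """Return a one-line summary of a parsed lookup response."""
    nc = payload["newsCode"]
    return "%s = %s [%s], children: %d" % (
        nc["id"], nc["label"], nc["readable"], nc["childrenCount"])


# Sample response, abbreviated from the result shown above
sample = json.loads("""
{
  "newsCode": {
    "id": "B:227",
    "label": "Basic Materials (TRBC level 1)",
    "group": "BusinessSectors",
    "readable": "Topic:BMAT",
    "searchable": true,
    "childrenCount": 105
  }
}
""")

print(metadata_url("B:227"))
print(summarize_news_code(sample))
```

In a real request the URL would be sent with an Authorization header carrying the access token, as in the steps later in this article.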

Another use case is to find the narrower codes that sit under a broader code, a.k.a. its "children". For example, we request the children of code B:227:

    	
            https://api.refinitiv.com/data/news/v1/metadata/B:227/children

And the result contains the children

    	
            …
"data": [
    {
      "id": "B:27",
      "description": "Producers of wood-based products and providers of related services, as well as manufacturers of paper and non-paper products, containers and packaging.",
      "label": "Applied Resources (TRBC level 2)",
      "group": "BusinessSectors",
      "readable": "Topic:APRE",
      "searchable": true,
      "childrenCount": 27
    },
    {
      "id": "B:13",
      "description": "Producers and refiners of agricultural, commodity and specialty chemicals.",
      "label": "Chemicals (TRBC level 2)",
      "group": "BusinessSectors",
      "readable": "B:13",
      "searchable": false,
      "childrenCount": 25
    },
    {
      "id": "B:228",
      "description": "Miners and processors of steel, aluminum, precious and specialty metals and minerals, and construction related materials. Includes integrated mining companies.",
      "label": "Mineral Resources (TRBC level 2)",
      "group": "BusinessSectors",
      "readable": "Topic:MINE",
      "searchable": true,
      "childrenCount": 51
    }
  ]
...
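
A children response like this one is easy to post-process. The sketch below, which assumes only the "data" list and the fields visible in the result above, keeps the children that are flagged as searchable. The function name searchable_children is hypothetical and the sample is trimmed to the fields it uses.

```python
def searchable_children(children_payload):
    """Return (id, readable) pairs for children flagged as searchable."""
    return [(child["id"], child["readable"])
            for child in children_payload.get("data", [])
            if child.get("searchable")]


# Trimmed copy of the children shown above
sample_children = {
    "data": [
        {"id": "B:27", "readable": "Topic:APRE", "searchable": True, "childrenCount": 27},
        {"id": "B:13", "readable": "B:13", "searchable": False, "childrenCount": 25},
        {"id": "B:228", "readable": "Topic:MINE", "searchable": True, "childrenCount": 51},
    ]
}

for code, topic in searchable_children(sample_children):
    print(code, topic)
```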

Lastly, we can request the list of all currently available metadata, in portions, such as:

    	
            https://api.refinitiv.com/data/news/v1/metadata

The result ends with a "next" cursor, for example:

    	
            "next": "eyJsaW1pdCI6NTAsImZvcndhcmQiOnRydWUsInBhZ2luYXRpb25JZCI6IjAwMDAwMDAwNDkifQ==",

That will allow us to retrieve more of the results, iteratively.
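
The iteration can be sketched as a small generator that keeps requesting pages until a response carries no "next" value. This is an illustration under assumptions: fetch_page is a hypothetical caller-supplied function, the cursor is read from the "meta" section of the response (as the code later in this article does), and a dictionary of canned pages stands in for the service.

```python
def iterate_pages(fetch_page):
    """Yield each page's 'data' list, following the 'next' cursor until it is absent.

    fetch_page(cursor) is a caller-supplied function (hypothetical) that returns
    the parsed JSON of one response; cursor is None for the first request.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield page.get("data", [])
        cursor = page.get("meta", {}).get("next")
        if not cursor:
            break


# Stand-in for the real service: three canned pages keyed by cursor value
FAKE_PAGES = {
    None: {"data": [{"id": "A:1"}, {"id": "A:2"}], "meta": {"next": "c1"}},
    "c1": {"data": [{"id": "A:3"}], "meta": {"next": "c2"}},
    "c2": {"data": [{"id": "A:4"}], "meta": {}},
}

all_ids = [node["id"] for page in iterate_pages(FAKE_PAGES.get) for node in page]
print(all_ids)
```

Against the real endpoint, fetch_page would issue the HTTP request and pass the cursor back as a query parameter.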

A More Involved Use Case

Combining the last two simple use cases lets us implement a more involved use case, which we will go over next:

Retrieve all the metadata in use at the time
Correlate the metadata by the relationships between the nodes, forming multiple trees
Display the result as a simple non-graphical tree, using AnyTree

We implement it in a Jupyter Notebook that can be downloaded from Example.RDPAPI.NewsMetadata on GitHub.

A brief explanation on how the metadata is retrieved and correlated:

1. From the RDP news metadata service, we retrieve the available metadata and follow the Next link repeatedly until no more Next data remains to be retrieved. These items will become Nodes in the trees; the trees will define the relationships between the news codes that are Nodes.
2. For each news code/Node that has Children, we recursively retrieve its Children until no Children remain, and at each new level we take note of the Parent of each news code/Node.
3. Once we have completed steps 1 and 2 in full, we will have all the makings of our set of trees; all that will remain is to structure them into a visual representation.
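
The retrieval and correlation just described can be sketched on an in-memory stand-in for the service. Everything here is illustrative: collect and get_children are hypothetical names, the X:n codes are placeholders rather than real news codes, and, as a simplification, a node is filed under whichever path reaches it first.

```python
def collect(nodes, get_children, parent_id, with_parents, without_parents, seen):
    """Depth-first walk that records each node once, tagged with its parent."""
    for node in nodes:
        if node["id"] in seen:
            continue                      # already recorded via another path
        seen.add(node["id"])
        if parent_id is None:
            without_parents.append(dict(node))
        else:
            with_parents.append(dict(node, parentId=parent_id))
        if node.get("childrenCount", 0):
            collect(get_children(node["id"]), get_children, node["id"],
                    with_parents, without_parents, seen)


# Placeholder hierarchy: X:1 -> (X:2, X:3), X:3 -> X:4
CHILDREN = {
    "X:1": [{"id": "X:2", "childrenCount": 0}, {"id": "X:3", "childrenCount": 1}],
    "X:3": [{"id": "X:4", "childrenCount": 0}],
}
# The flat listing also returns X:3, which is really a child of X:1
top_level = [{"id": "X:1", "childrenCount": 2}, {"id": "X:3", "childrenCount": 1}]

with_parents, without_parents = [], []
collect(top_level, CHILDREN.get, None, with_parents, without_parents, set())

print([n["id"] for n in without_parents])
print([(n["id"], n["parentId"]) for n in with_parents])
```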

Let us step through the implementation.

1. Read Credentials

Prior to this step we must have valid credentials assigned, the password set and the client_id established. The client_id is an API Key set up with the API Key Generator with API selection "EDP".

For the quickest and surest way to get started, refer to the RDP Quickstart Guide.

    	
            import requests, json, time, getopt, sys
 
# User Variables
credFile = open(r"..\creds\credFile.txt","r")   # one per line (raw string keeps the backslashes literal)
                                                #--- RDP MACHINE ID---
                                                #--- LONG PASSWORD---
                                                #--- GENERATED CLIENT ID---
 
USERNAME = credFile.readline().rstrip('\n')
PASSWORD = credFile.readline().rstrip('\n')
CLIENT_ID = credFile.readline().rstrip('\n')
 
credFile.close()
 
#print("USERNAME="+str(USERNAME))
#print("PASSWORD="+str(PASSWORD))
#print("CLIENT_ID="+str(CLIENT_ID))

2. Define Token Endpoint

    	
            # Application Constants
RDP_version = "/v1"
base_URL = "https://api.refinitiv.com"
category_URL = "/auth/oauth2"
endpoint_URL = "/token"
CLIENT_SECRET = ""
TOKEN_FILE = "token.txt"
SCOPE = "trapi"
 
TOKEN_ENDPOINT = base_URL + category_URL + RDP_version + endpoint_URL
 
#print("TOKEN_ENDPOINT=" + TOKEN_ENDPOINT)

3. Define Token processing functions

We are going to reuse the token processing code that is also in use by the RDP Quickstart and Python examples deck.
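
The reused token code is not reproduced in this article. Purely as a hedged illustration of what such helpers might involve, and not the Quickstart implementation, the sketch below prepares the form fields of a standard OAuth2 password-grant request and tracks token expiry. The field names follow the generic OAuth2 convention and are assumptions here, as are the function names and the expiry_tm key; the scope default comes from the constants above.

```python
import time


def password_grant_payload(username, password, client_id, scope="trapi"):
    """Form fields for an OAuth2 password-grant request (field names assumed)."""
    return {
        "grant_type": "password",
        "username": username,
        "password": password,
        "scope": scope,
        "client_id": client_id,
    }


def with_expiry(token, now=None):
    """Copy a token response and stamp it with an absolute expiry time.

    'expiry_tm' is a hypothetical key; 'expires_in' is the standard OAuth2 field.
    """
    now = time.time() if now is None else now
    stamped = dict(token)
    stamped["expiry_tm"] = now + int(token["expires_in"])
    return stamped


def is_fresh(token, now=None, margin=10):
    """True if the stamped token stays valid for at least `margin` more seconds."""
    now = time.time() if now is None else now
    return token.get("expiry_tm", 0) > now + margin


# Usage with made-up values: a token issued at t=1000 that lives 300 seconds
tok = with_expiry({"access_token": "abc", "expires_in": "300"}, now=1000)
print(is_fresh(tok, now=1200), is_fresh(tok, now=1295))
```

A getToken function built on helpers like these would return the saved token while it is fresh and request a new one otherwise.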

4. Obtain Valid Token

    	
            accessToken = getToken()
print("Have token now")

The token will be saved to a file for use by the subsequent steps.

5. Request News Metadata

This step retrieves the very first portion of the available metadata. Please note the error handling implemented in this section: because a valid access token can expire, we handle a token expiry error by re-requesting a token on the spot and continuing, and treat any other error as a failure.

    	
            news_category_URL = "/data/news"
newsmeta_endpoint_URL = "/metadata"
news_param1 = "?limit=100"
NEWS_ENDPOINT = base_URL + news_category_URL + RDP_version + newsmeta_endpoint_URL 
NEWS_META_FILE = "newsMetadata.txt"
 
nodesWithParents = []
nodesWithoutParents = []
 
print("NEWS_ENDPOINT=" + NEWS_ENDPOINT)
 
dResp = requests.get(NEWS_ENDPOINT + news_param1 , headers = {"Authorization": "Bearer " + accessToken})
 
if dResp.status_code == 401:      # error token expired
    accessToken = getToken()      # token refresh on token expired
    dResp = requests.get(NEWS_ENDPOINT + news_param1 , headers = {"Authorization": "Bearer " + accessToken})
 
if dResp.status_code != 200:
    print("Unable to get data. Code %s, Message: %s" % (dResp.status_code, dResp.text))
else:
    print("Resource access successful")
    # Parse the response so that jResp is defined for the steps that follow,
    # including after a successful retry
    jResp = json.loads(dResp.text)
    # print(json.dumps(jResp, indent=2))

We parse the response as JSON and, optionally, examine the results to be sure they are what we expect.
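
Since the same refresh-and-retry pattern recurs in the later steps, it can be sketched once as a helper. This is an illustration, not the notebook's code: get_with_token_retry, http_get and get_token are hypothetical names, and a fake transport replaces the real HTTP call so that the snippet runs offline.

```python
def get_with_token_retry(url, http_get, get_token, token):
    """GET url; on a 401 refresh the token once and retry.

    http_get(url, token) and get_token() are caller-supplied (hypothetical);
    http_get returns an object with a status_code attribute.
    Returns (response, token) so the caller keeps the refreshed token.
    """
    resp = http_get(url, token)
    if resp.status_code == 401:          # token expired
        token = get_token()
        resp = http_get(url, token)
    return resp, token


class FakeResponse:
    """Minimal stand-in for an HTTP response object."""
    def __init__(self, status_code):
        self.status_code = status_code


def fake_get(url, token):
    # Pretend that only the token "fresh" is accepted
    return FakeResponse(200 if token == "fresh" else 401)


resp, token = get_with_token_retry(
    "https://example.invalid/metadata", fake_get, lambda: "fresh", "stale")
print(resp.status_code, token)
```

With the requests library, http_get could be a small function that calls requests.get with the Authorization header set to "Bearer " plus the token.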

6. Request Children and Re-Categorize With Parent Information

We define a function processWithChildren that we will call recursively, on every item that is retrieved via the metadata endpoint, exhaustively.

    	
            def processWithChildren(dResp, jResp, parentId):
    news_param2 = "/children?offset="
    step_size = 100 # 100 is max allowed at the time of this writing
    news_param3 = "&limit="+str(step_size)  
    global accessToken
    
    if dResp.status_code == 200:
        for node in jResp['data']: 
            nodeIsFirstSeen = True
            if parentId != '':
                node['parentId'] = parentId 
                if node not in nodesWithParents:
                    nodesWithParents.append(node)
 #                   print("*** id= " + str(node.get('id')) + "nodesWithParents.append" )
                else :
                    nodeIsFirstSeen = False
            else:
                if not any(nd.get('id') == node.get('id') for nd in nodesWithParents) and node not in nodesWithoutParents:
                    nodesWithoutParents.append(node)
  #                  print("*** id= " + str(node.get('id')) + "nodesWithoutParents.append")
                else :
                    nodeIsFirstSeen = False
            # keep track of the processing progress
            if nodeIsFirstSeen == True and ((len(nodesWithParents) + len(nodesWithoutParents)) % 200) == 0:
                print("***************Inserted "+ str((len(nodesWithParents) + len(nodesWithoutParents))))
            childrenOfThisNode = node.get('childrenCount')
 #           print("^^^^^^^^^^^^^^^^^^ children="+ str(childrenOfThisNode))
            if nodeIsFirstSeen == True and childrenOfThisNode != 0:
                start = 0; nextExists = True;
                while nextExists and start <= node.get('childrenCount'):
                    nextExists = True;
                    print("*in node %s with childrenCount %s at offset %s " % (node.get('id'),node.get('childrenCount'), str(start)))
                    dChildrenResp = requests.get(NEWS_ENDPOINT + "/" + str(node.get('id')) + news_param2 + str(start) + news_param3, headers = {"Authorization": "Bearer " + accessToken});
 
                    if dChildrenResp.status_code != 200:
                        print("Unable to get children data. Code %s, Message: %s, in node %s with childrenCount %s at offset %s" % (dChildrenResp.status_code, dChildrenResp.text, 
                                                                                                                       node.get('id'),node.get('childrenCount'), str(start)));
                        if dChildrenResp.status_code != 401:   # error other than token expired
                            break 
                        accessToken = getToken();     # token refresh on token expired
                        dChildrenResp = requests.get(NEWS_ENDPOINT + "/" + str(node.get('id')) + news_param2 + str(start) + news_param3, headers = {"Authorization": "Bearer " + accessToken});
                                    
                    jCResp = json.loads(dChildrenResp.text);
                    processWithChildren(dChildrenResp, jCResp, node.get('id'))
                    
                    if not "next" in jCResp["meta"]: 
#                        print("*next = False");
                        nextExists = False;
                    else:
                        print("*in node %s next is not False " % (node.get('id')))
                        start = start + step_size
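
The offset-based paging used for children can be looked at in isolation. The sketch below lists the URLs that would be requested for one node, using the same offset and limit parameters as the function above. It assumes that childrenCount is accurate, whereas processWithChildren also stops when the response carries no "next" value; children_page_urls is a hypothetical helper name.

```python
def children_page_urls(endpoint, node_id, children_count, step_size=100):
    """List the children-page URLs needed to cover children_count items."""
    urls = []
    for start in range(0, children_count, step_size):
        urls.append("%s/%s/children?offset=%d&limit=%d"
                    % (endpoint, node_id, start, step_size))
    return urls


NEWS_ENDPOINT = "https://api.refinitiv.com/data/news/v1/metadata"

# B:227 reported childrenCount 105 in the lookup example, so two pages are needed
urls = children_page_urls(NEWS_ENDPOINT, "B:227", 105)
for url in urls:
    print(url)
```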

Next, we call processWithChildren once:

    	
            processWithChildren(dResp, jResp,'')

We then continue recursively while any more data remains, that is, while the Next link in the response is non-empty. The output grows and reflects the processing that is taking place:

    	
            ...
***************Inserted 1000
***************Inserted 1200
***************Inserted 1400
***************Inserted 1600
...

The completion is signalled with

    	
            ...
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<DONE child processing >>>>>>>>>>>>>>>>>>>>>>>>>>

7. Request Next on News Metadata and (optionally) Save to File

We may wish to save the resulting content to a file so that we can search through it with a text editor for debugging purposes. By default, this code is commented out.

    	
            #DBG nf = open(NEWS_META_FILE, "w+");
#DBG nf.write(json.dumps(jResp, indent=2))
    
#print("Next= " + jResp["meta"]["next"])
 
news_param2 = "?cursor=" 
while jResp["meta"].get("next"):   # present and not empty
    print("Next= " + jResp["meta"]["next"])
    dResp = requests.get(NEWS_ENDPOINT + news_param2 + jResp["meta"]["next"] , headers = {"Authorization": "Bearer " + accessToken});
 
    if dResp.status_code != 200:   #
        print("Unable to get data. Code %s, Message: %s" % (dResp.status_code, dResp.text));
        if dResp.status_code != 401:   # error other than token expired
            break 
        accessToken = getToken();     # token refresh on token expired
        dResp = requests.get(NEWS_ENDPOINT + news_param2 + jResp["meta"]["next"] , headers = {"Authorization": "Bearer " + accessToken});
            
    print("Resource access successful")
    # Display data
    jResp = json.loads(dResp.text);
#    print(json.dumps(jResp, indent=2));
    processWithChildren(dResp, jResp,'')
        
#DBG    nf.write(json.dumps(jResp, indent=2))
#DBG nf.close()
 
print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<DONE child processing >>>>>>>>>>>>>>>>>>>>>>>>>>")

The tree of metadata that fully describes any currently published news content tends to be relatively large. During the processing, the access token is likely to expire. Whenever this happens, we refresh the token and pick up the processing where we left off.

8. Process into Tree Form

Lastly, we process all the retrieved data, which includes children and is by now categorized into two lists, nodesWithParents and nodesWithoutParents, into a tree form and render the tree:

    	
            from anytree import Node, RenderTree
 
# keeping track of the progress prior to removing a few duplicates
print("nodesWithoutParents length=" + str(len(nodesWithoutParents)) + ", nodesWithParents length=" + str(len(nodesWithParents)))
    
for node in nodesWithoutParents:
    node['treenode'] = Node(node.get('id')) 
    
for node in nodesWithParents:
    node['treenode'] = Node(node.get('id')) 
    
for node in nodesWithParents:
    found = False
    for nWithp in nodesWithParents:
        if node.get('parentId') == nWithp.get('id'):
            node['treenode'].parent = nWithp.get('treenode')  
            found = True
            break
    if not found:
        for nWithoutp in nodesWithoutParents:
            if node.get('parentId') == nWithoutp.get('id'):
                node['treenode'].parent = nWithoutp.get('treenode')  
                found = True
                break
    if not found:
        node['treenode'] = Node(node.get('id'))
        print("ORPHAN ?" + node.get('id'))
        
# check for top-levels that are not really top level, just happened to be first
# (build a new list: removing items from a list while iterating over it skips entries)
idsWithParents = {nd.get('id') for nd in nodesWithParents}
nodesWithoutParents = [node for node in nodesWithoutParents if node.get('id') not in idsWithParents]
        
for node in nodesWithoutParents:
    print(RenderTree(node.get('treenode')))

For simplicity, we render the result with AnyTree (find the documentation in the references).

However, once the full content is retrieved, and structured, it can be presented in any tree form, plain text, graphical or even interactive tree representation. Our result looks simple:

    	
            ...
│   │   ├── Node('/M:1/M:1EY/M:KP/B:255')
│   │   │   ├── Node('/M:1/M:1EY/M:KP/B:255/B:70')
│   │   │   │   └── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71')
│   │   │   │       ├── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71/B:72')
│   │   │   │       │   ├── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71/B:72/B:1292')
│   │   │   │       │   ├── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71/B:72/B:1298')
│   │   │   │       │   ├── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71/B:72/B:1294')
 
...

Once refreshed to the latest content, the output can be copied, pasted and conveniently text-searched.
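
The linking in step 8 matches every node against every other node. As an alternative sketch, a dictionary index can link children to parents in one pass. To stay free of dependencies, this illustration replaces AnyTree with plain indentation; build_forest and render are hypothetical names, and the records reuse a few codes from the output above.

```python
def build_forest(records):
    """Return (roots, children) where children maps a parent id to its child ids."""
    ids = {r["id"] for r in records}
    children = {}
    roots = []
    for r in records:
        parent = r.get("parentId")
        if parent in ids:
            children.setdefault(parent, []).append(r["id"])
        else:
            roots.append(r["id"])         # no known parent: treat as a root
    return roots, children


def render(node_id, children, depth=0):
    """Return the subtree under node_id as indented text lines."""
    lines = ["    " * depth + node_id]
    for child in children.get(node_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


# A few codes taken from the rendered output above
records = [
    {"id": "M:1"},
    {"id": "M:1EY", "parentId": "M:1"},
    {"id": "M:KP", "parentId": "M:1EY"},
    {"id": "B:255", "parentId": "M:KP"},
    {"id": "B:70", "parentId": "B:255"},
]

roots, children = build_forest(records)
for root in roots:
    print("\n".join(render(root, children)))
```

The same index could be used to set the parent of AnyTree nodes instead of building text lines.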

Conclusion

The RDP News metadata service complements other RDP News services, allowing for more efficient consumption and analysis of headlines, stories and alerts, and lends itself easily to the design and implementation of diverse use cases.