Article

Exploring News Metadata - with Refinitiv Data Platform and Python

Author:

Zoya Farberov
Developer Advocate

Introduction

Refinitiv Data Platform (RDP) provides simple web-based API access to a broad range of content.  The RDP News service encompasses diverse news content sets:

  • News Alerts
  • Headline Search
  • Story Retrieval
  • Metadata Retrieval

News metadata describes the content of a story and is created and packaged with the story. It is most often retrieved and presented with the story. The hierarchy of news metadata is available for separate retrieval.  Metadata enables accurate and efficient search and discovery of the relevant news content. 

Next, we are going to discuss RDP News metadata retrieval capabilities and use examples of code in Python.

Prerequisites

  • Having a valid machineId, corresponding long password and client_id
  • Permissioned for RDP News service
  • Python 3 installed
  • Jupyter Notebook or Jupyter Lab installed (or one can run the same code via plain Python)
  • Python libraries installed

Simple And Effective

A simple but very powerful use case of RDP News metadata is to look up a news code or a news topic.

For example, we can send a request to look up code B:227

    	
            https://api.refinitiv.com/data/news/v1/metadata/B:227
        
        
    

And examine the result

    	
            

{
  "newsCode": {
    "id": "B:227",
    "description": "Manufacturers, extractors and refiners of chemicals, minerals, precious metals, steel, aluminum, forest products, and construction and other raw materials.",
    "label": "Basic Materials (TRBC level 1)",
    "group": "BusinessSectors",
    "readable": "Topic:BMAT",
    "searchable": true,
    "childrenCount": 105
  }
}
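The same lookup can be sketched in Python. This is a minimal, offline sketch: the helper names `metadata_url` and `describe_news_code` are hypothetical, and the sample payload is an abridged copy of the response shown above. A live call would additionally need a valid access token in the `Authorization` header.

```python
from urllib.parse import quote

METADATA_ENDPOINT = "https://api.refinitiv.com/data/news/v1/metadata"


def metadata_url(news_code):
    """Build the lookup URL for a single news code, e.g. B:227."""
    # Keep ':' unescaped so the URL matches the form used in the article.
    return METADATA_ENDPOINT + "/" + quote(news_code, safe=":")


def describe_news_code(payload):
    """Turn a lookup response into a one-line summary (hypothetical helper)."""
    code = payload["newsCode"]
    return "%s = %s (%s), children: %d" % (
        code["id"], code["label"], code["readable"], code["childrenCount"])


# Abridged copy of the response shown in the article.
sample = {
    "newsCode": {
        "id": "B:227",
        "label": "Basic Materials (TRBC level 1)",
        "group": "BusinessSectors",
        "readable": "Topic:BMAT",
        "searchable": True,
        "childrenCount": 105,
    }
}

print(metadata_url("B:227"))
print(describe_news_code(sample))
```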

Another use case is to find the codes that are related to a broader code, a.k.a. its "children".  For example, we request the children of code B:227

    	
            https://api.refinitiv.com/data/news/v1/metadata/B:227/children
        
        
    

And the result contains the children

    	
            

"data": [
    {
      "id": "B:27",
      "description": "Producers of wood-based products and providers of related services, as well as manufacturers of paper and non-paper products, containers and packaging.",
      "label": "Applied Resources (TRBC level 2)",
      "group": "BusinessSectors",
      "readable": "Topic:APRE",
      "searchable": true,
      "childrenCount": 27
    },
    {
      "id": "B:13",
      "description": "Producers and refiners of agricultural, commodity and specialty chemicals.",
      "label": "Chemicals (TRBC level 2)",
      "group": "BusinessSectors",
      "readable": "B:13",
      "searchable": false,
      "childrenCount": 25
    },
    {
      "id": "B:228",
      "description": "Miners and processors of steel, aluminum, precious and specialty metals and minerals, and construction related materials. Includes integrated mining companies.",
      "label": "Mineral Resources (TRBC level 2)",
      "group": "BusinessSectors",
      "readable": "Topic:MINE",
      "searchable": true,
      "childrenCount": 51
    }
  ]
...
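Once a children response has been parsed, it is easy to pick out the entries of interest. The sketch below is an illustration only: `searchable_children` is a hypothetical helper, and the sample data is an abridged copy of the three children shown above. It keeps only the children flagged as searchable.

```python
def searchable_children(children_response):
    """Return (id, readable, label) for every child flagged as searchable."""
    return [
        (child["id"], child["readable"], child["label"])
        for child in children_response.get("data", [])
        if child.get("searchable")
    ]


# Abridged copy of the children shown in the article.
sample_children = {
    "data": [
        {"id": "B:27", "label": "Applied Resources (TRBC level 2)",
         "readable": "Topic:APRE", "searchable": True, "childrenCount": 27},
        {"id": "B:13", "label": "Chemicals (TRBC level 2)",
         "readable": "B:13", "searchable": False, "childrenCount": 25},
        {"id": "B:228", "label": "Mineral Resources (TRBC level 2)",
         "readable": "Topic:MINE", "searchable": True, "childrenCount": 51},
    ]
}

for child_id, readable, label in searchable_children(sample_children):
    print("%s  %s  %s" % (child_id, readable, label))
```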

Lastly, we can request the list of currently available metadata, in portions, such as:

    	
            https://api.refinitiv.com/data/news/v1/metadata
        
        
    

The response ends with a next cursor, for example:

    	
            "next": "eyJsaW1pdCI6NTAsImZvcndhcmQiOnRydWUsInBhZ2luYXRpb25JZCI6IjAwMDAwMDAwNDkifQ==",
        
        
    

That will allow us to retrieve more of the results, iteratively.
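The iteration can be sketched as a small generator that keeps following the next cursor until the service stops returning one. This is a sketch under stated assumptions: `iterate_metadata`, `fetch_page`, the fake pages and their ids are all made up for an offline illustration. In real use, `fetch_page` would request the metadata endpoint, appending `?cursor=` plus the cursor value when one is given.

```python
def iterate_metadata(fetch_page):
    """Yield every item across all pages.

    fetch_page(cursor) must return one parsed JSON page; cursor is None
    for the first page and the previous page's meta.next afterwards.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        for item in page.get("data", []):
            yield item
        cursor = page.get("meta", {}).get("next")
        if not cursor:          # no next cursor: this was the last page
            break


# Made-up pages standing in for the service, keyed by the cursor requested.
FAKE_PAGES = {
    None: {"data": [{"id": "X:1"}, {"id": "X:2"}], "meta": {"next": "cursor-1"}},
    "cursor-1": {"data": [{"id": "X:3"}], "meta": {"next": "cursor-2"}},
    "cursor-2": {"data": [{"id": "X:4"}], "meta": {}},
}


def fake_fetch(cursor):
    return FAKE_PAGES[cursor]


all_ids = [item["id"] for item in iterate_metadata(fake_fetch)]
print(all_ids)
```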

A More Involved Use Case

The combination of the last two simple use cases will allow us to implement a more involved use case, that we will go over next:

  • Retrieve all the metadata in use at the time
  • Correlate the metadata by the parent-child relationships between the nodes, which form multiple trees
  • Display the result as a simple non-graphical tree, using AnyTree

We implement it in a Jupyter Notebook that can be downloaded from Example.RDPAPI.NewsMetadata on GitHub

A brief explanation on how the metadata is retrieved and correlated:

  1. From the RDP News metadata service, we retrieve the available metadata and follow the Next link repeatedly, until no Next data remains to be retrieved.  These items will become Nodes in the trees; the trees will define the relationships between the news codes that are Nodes.
  2. For each news code/Node that has Children, we recursively retrieve Children until no Children remain, and each time we descend a level we take note of the Parent of each news code/Node.
  3. Once we have completed steps 1 and 2 in full, we will have all the makings of our set of trees; all that will remain is to structure them into a visual representation.
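The steps above can be sketched on a tiny in-memory example before any network code is involved. Everything in this sketch is hypothetical: the ids are made up, `CHILDREN` stands in for the children endpoint, and `TOP_LEVEL_PAGE` stands in for the paged metadata list. Note that an item can appear in the top-level list and also turn out to be somebody's child; the parent link wins.

```python
# Made-up stand-in for the children endpoint: id -> list of child ids.
CHILDREN = {"X:1": ["X:2"], "X:2": ["X:3"], "X:3": [], "X:9": []}
# Made-up stand-in for the flat, paged metadata list (step 1).
TOP_LEVEL_PAGE = ["X:1", "X:2", "X:9"]


def collect(node_ids, parent_id, parent_of, seen):
    """Record parent links and descend into children of first-seen nodes (step 2)."""
    for node_id in node_ids:
        if parent_id is not None:
            parent_of[node_id] = parent_id
        if node_id in seen:
            continue            # already expanded, do not descend again
        seen.add(node_id)
        collect(CHILDREN.get(node_id, []), node_id, parent_of, seen)


parent_of, seen = {}, set()
collect(TOP_LEVEL_PAGE, None, parent_of, seen)

# Step 3: roots are the nodes that never received a parent.
roots = sorted(n for n in seen if n not in parent_of)
print("parents:", parent_of)
print("roots:", roots)
```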

Let us step through the implementation.

1. Read Credentials

Prior to this step we must have valid credentials assigned, the password set and the client_id established.  The client_id is an API key set up with the API Key Generator, with the API selection "EDP".

For the quickest and surest way to get started, refer to the RDP Quickstart Guide

    	
            

import requests, json, time, getopt, sys

# User Variables
# The credentials file holds one value per line:
#   --- RDP MACHINE ID ---
#   --- LONG PASSWORD ---
#   --- GENERATED CLIENT ID ---
# Forward slashes avoid accidental backslash escapes and also work on Windows.
with open("../creds/credFile.txt", "r") as credFile:
    USERNAME = credFile.readline().rstrip('\n')
    PASSWORD = credFile.readline().rstrip('\n')
    CLIENT_ID = credFile.readline().rstrip('\n')

#print("USERNAME="+str(USERNAME))
#print("PASSWORD="+str(PASSWORD))
#print("CLIENT_ID="+str(CLIENT_ID))

2. Define Token Endpoint

    	
            

# Application Constants
RDP_version = "/v1"
base_URL = "https://api.refinitiv.com"
category_URL = "/auth/oauth2"
endpoint_URL = "/token"
CLIENT_SECRET = ""
TOKEN_FILE = "token.txt"
SCOPE = "trapi"

TOKEN_ENDPOINT = base_URL + category_URL + RDP_version + endpoint_URL

#print("TOKEN_ENDPOINT=" + TOKEN_ENDPOINT)

3. Define Token processing functions

We are going to reuse the token processing code that is also used by the RDP Quickstart and the Python examples deck.
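The token helpers themselves are not reproduced in this article. Purely as an illustration of the save-and-reuse idea, the sketch below caches a token record in a file and reuses it until shortly before it expires. Everything here is an assumption rather than the Quickstart code: the function names, the record layout (`access_token`, `expires_in`, plus an added `saved_at` timestamp) and the 10-second safety margin. `request_new_token` stands in for whatever function actually calls the token endpoint.

```python
import json
import os
import tempfile
import time


def load_fresh_token(path, now=None, margin=10):
    """Return the cached access token if it is still valid, else None."""
    now = time.time() if now is None else now
    try:
        with open(path, "r") as f:
            record = json.load(f)
    except (OSError, ValueError):       # no cache yet, or unreadable cache
        return None
    expires_at = record.get("saved_at", 0) + int(record.get("expires_in", 0))
    if now + margin < expires_at:
        return record.get("access_token")
    return None


def get_token_cached(path, request_new_token, now=None):
    """Reuse a fresh cached token, otherwise request and cache a new one."""
    now = time.time() if now is None else now
    token = load_fresh_token(path, now)
    if token is not None:
        return token
    record = dict(request_new_token())
    record["saved_at"] = now
    with open(path, "w") as f:
        json.dump(record, f)
    return record["access_token"]


# Offline demo with a fake token source; no network access is involved.
calls = []


def fake_request_new_token():
    calls.append(1)
    return {"access_token": "token-%d" % len(calls), "expires_in": "300"}


token_path = os.path.join(tempfile.mkdtemp(), "token.txt")
first = get_token_cached(token_path, fake_request_new_token, now=1000)
second = get_token_cached(token_path, fake_request_new_token, now=1100)  # still fresh
third = get_token_cached(token_path, fake_request_new_token, now=1400)   # expired
print(first, second, third)
```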

4. Obtain Valid Token

    	
            

accessToken = getToken()
print("Have token now")

The token will be saved to a file for use by the subsequent steps.

5. Request News Metadata

This step retrieves the very first portion of the available metadata.  Please note the error handling implemented in this section: a valid access token can expire at any time, so we handle a token expiry error by requesting a new token on the spot and continuing, while treating any other error as a failure.

    	
            

news_category_URL = "/data/news"
newsmeta_endpoint_URL = "/metadata"
news_param1 = "?limit=100"
NEWS_ENDPOINT = base_URL + news_category_URL + RDP_version + newsmeta_endpoint_URL
NEWS_META_FILE = "newsMetadata.txt"

nodesWithParents = []
nodesWithoutParents = []

print("NEWS_ENDPOINT=" + NEWS_ENDPOINT)

dResp = requests.get(NEWS_ENDPOINT + news_param1, headers = {"Authorization": "Bearer " + accessToken})

if dResp.status_code == 401:   # token expired: refresh the token and retry once
    print("Unable to get data. Code %s, Message: %s" % (dResp.status_code, dResp.text))
    accessToken = getToken()
    dResp = requests.get(NEWS_ENDPOINT + news_param1, headers = {"Authorization": "Bearer " + accessToken})

if dResp.status_code != 200:   # any other error is treated as a failure
    raise RuntimeError("Unable to get data. Code %s, Message: %s" % (dResp.status_code, dResp.text))

print("Resource access successful")
# Parse the response, including the one obtained after a token refresh
jResp = json.loads(dResp.text)
# print(json.dumps(jResp, indent=2))

The response is parsed as JSON, and we can optionally print it to confirm that the results are what we expect.
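The refresh-and-retry pattern of this step can be factored into one helper. The following is a minimal sketch, not the notebook's code: `get_json_with_refresh` is a hypothetical name, and the HTTP call and the token function are passed in as parameters so that the example runs offline with fakes. With the real library, `http_get` could be `requests.get` and `get_token` the token function from the previous step.

```python
class FakeResponse:
    """Stand-in for a requests.Response, used only by the offline demo."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self.text = str(payload)
        self._payload = payload

    def json(self):
        return self._payload


def get_json_with_refresh(url, token, http_get, get_token):
    """Return (parsed_json, token_in_use); retry once with a new token on 401."""
    resp = http_get(url, headers={"Authorization": "Bearer " + token})
    if resp.status_code == 401:         # token expired: refresh and retry once
        token = get_token()
        resp = http_get(url, headers={"Authorization": "Bearer " + token})
    if resp.status_code != 200:         # anything else is a failure
        raise RuntimeError("Unable to get data. Code %s, Message: %s"
                           % (resp.status_code, resp.text))
    return resp.json(), token


def fake_get(url, headers):
    """Fake service: accepts only the token 'fresh-token'."""
    if headers["Authorization"] == "Bearer fresh-token":
        return FakeResponse(200, {"data": [], "meta": {}})
    return FakeResponse(401, {"error": "token expired"})


payload, token = get_json_with_refresh(
    "https://api.refinitiv.com/data/news/v1/metadata?limit=100",
    "stale-token", fake_get, lambda: "fresh-token")
print(token, payload)
```

Returning the token alongside the data lets the caller keep using the refreshed token for later requests instead of relying on a global variable.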

6. Request Children and Re-Categorize With Parent Information

We define a function processWithChildren that we will call recursively, on every item that is retrieved via the metadata endpoint, exhaustively.

    	
            

def processWithChildren(dResp, jResp, parentId):
    news_param2 = "/children?offset="
    step_size = 100   # 100 is the maximum allowed at the time of this writing
    news_param3 = "&limit=" + str(step_size)
    global accessToken

    if dResp.status_code != 200:
        return

    for node in jResp['data']:
        # categorize the node and detect whether it has been seen before
        nodeIsFirstSeen = True
        if parentId != '':
            node['parentId'] = parentId
            if node not in nodesWithParents:
                nodesWithParents.append(node)
            else:
                nodeIsFirstSeen = False
        else:
            if not any(nd.get('id') == node.get('id') for nd in nodesWithParents) and node not in nodesWithoutParents:
                nodesWithoutParents.append(node)
            else:
                nodeIsFirstSeen = False

        # keep track of the processing progress
        inserted = len(nodesWithParents) + len(nodesWithoutParents)
        if nodeIsFirstSeen and inserted % 200 == 0:
            print("***************Inserted " + str(inserted))

        childrenOfThisNode = node.get('childrenCount')
        if nodeIsFirstSeen and childrenOfThisNode != 0:
            start = 0
            nextExists = True
            childrenURL = NEWS_ENDPOINT + "/" + str(node.get('id')) + news_param2
            while nextExists and start <= childrenOfThisNode:
                print("*in node %s with childrenCount %s at offset %s " % (node.get('id'), childrenOfThisNode, str(start)))
                dChildrenResp = requests.get(childrenURL + str(start) + news_param3, headers = {"Authorization": "Bearer " + accessToken})

                if dChildrenResp.status_code != 200:
                    print("Unable to get children data. Code %s, Message: %s, in node %s with childrenCount %s at offset %s" % (
                        dChildrenResp.status_code, dChildrenResp.text, node.get('id'), childrenOfThisNode, str(start)))
                    if dChildrenResp.status_code != 401:   # error other than token expired
                        break
                    accessToken = getToken()     # token refresh on token expired
                    dChildrenResp = requests.get(childrenURL + str(start) + news_param3, headers = {"Authorization": "Bearer " + accessToken})

                jCResp = json.loads(dChildrenResp.text)
                processWithChildren(dChildrenResp, jCResp, node.get('id'))

                # advance to the next portion of children, if there is one
                if "next" not in jCResp.get("meta", {}):
                    nextExists = False
                else:
                    print("*in node %s next is not False " % (node.get('id')))
                    start = start + step_size

Next, we call it once:

    	
            processWithChildren(dResp, jResp,'')
        
        
    

And continue recursively while more metadata remains, that is, while the Next link in the response is non-empty.  The output grows and reflects the processing that is taking place:

    	
            

...
***************Inserted 1000
***************Inserted 1200
***************Inserted 1400
***************Inserted 1600
...

The completion is signalled with

    	
            

...

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<DONE child processing >>>>>>>>>>>>>>>>>>>>>>>>>>

7. Request Next on News Metadata and (optionally) Save to File

We may wish to save the resulting content to a file so that we can search through it with a text editor for debugging purposes.  By default, this code is commented out.

    	
            

#DBG nf = open(NEWS_META_FILE, "w+")
#DBG nf.write(json.dumps(jResp, indent=2))

news_param2 = "?cursor="
# .get() avoids a KeyError when the last page carries no "next" cursor
nextCursor = jResp.get("meta", {}).get("next")
while nextCursor:   # not empty
    print("Next= " + nextCursor)
    dResp = requests.get(NEWS_ENDPOINT + news_param2 + nextCursor, headers = {"Authorization": "Bearer " + accessToken})

    if dResp.status_code != 200:
        print("Unable to get data. Code %s, Message: %s" % (dResp.status_code, dResp.text))
        if dResp.status_code != 401:   # error other than token expired
            break
        accessToken = getToken()     # token refresh on token expired
        dResp = requests.get(NEWS_ENDPOINT + news_param2 + nextCursor, headers = {"Authorization": "Bearer " + accessToken})
        if dResp.status_code != 200:   # the retry failed as well
            break

    print("Resource access successful")
    jResp = json.loads(dResp.text)
    # print(json.dumps(jResp, indent=2))
    processWithChildren(dResp, jResp, '')
#DBG    nf.write(json.dumps(jResp, indent=2))
    nextCursor = jResp.get("meta", {}).get("next")

#DBG nf.close()

print("<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<DONE child processing >>>>>>>>>>>>>>>>>>>>>>>>>>")

The tree of metadata that fully describes the currently published news content tends to be relatively large.  During the processing, the access token is likely to expire.  Whenever this happens, we refresh the token and pick up the processing where we left off.

8. Process into Tree Form

Lastly, we process all the retrieved data, which includes children and is by now categorized into two lists, nodesWithParents and nodesWithoutParents, into a tree form and render the trees

    	
            

from anytree import Node, RenderTree

# keeping track of the progress prior to removing a few duplicates
print("nodesWithoutParents length=" + str(len(nodesWithoutParents)) + ", nodesWithParents length=" + str(len(nodesWithParents)))

# create a tree node for every news code
for node in nodesWithoutParents:
    node['treenode'] = Node(node.get('id'))

for node in nodesWithParents:
    node['treenode'] = Node(node.get('id'))

# attach each node to its parent, looking first among the nodes that have parents themselves
for node in nodesWithParents:
    found = False
    for nWithp in nodesWithParents:
        if node.get('parentId') == nWithp.get('id'):
            node['treenode'].parent = nWithp.get('treenode')
            found = True
            break
    if not found:
        for nWithoutp in nodesWithoutParents:
            if node.get('parentId') == nWithoutp.get('id'):
                node['treenode'].parent = nWithoutp.get('treenode')
                found = True
                break
    if not found:
        node['treenode'] = Node(node.get('id'))
        print("ORPHAN ?" + node.get('id'))

# check for top-levels that are not really top level, just happened to be first;
# rebuild the list instead of removing items while iterating, which would skip entries
idsWithParents = {nd.get('id') for nd in nodesWithParents}
nodesWithoutParents[:] = [node for node in nodesWithoutParents if node.get('id') not in idsWithParents]

for node in nodesWithoutParents:
    print(RenderTree(node.get('treenode')))

For simplicity, we render the trees with AnyTree (see References for the documentation).

However, once the full content is retrieved, and structured, it can be presented in any tree form, plain text, graphical or even interactive tree representation.  Our result looks simple:

    	
            

...
│   │   ├── Node('/M:1/M:1EY/M:KP/B:255')
│   │   │   ├── Node('/M:1/M:1EY/M:KP/B:255/B:70')
│   │   │   │   └── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71')
│   │   │   │       ├── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71/B:72')
│   │   │   │       │   ├── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71/B:72/B:1292')
│   │   │   │       │   ├── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71/B:72/B:1298')
│   │   │   │       │   ├── Node('/M:1/M:1EY/M:KP/B:255/B:70/B:71/B:72/B:1294')

...

Once refreshed to the latest metadata, the result can be copied, pasted and conveniently text-searched.
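As noted, the same parent links can feed any other presentation. As one possible alternative, here is a minimal plain-text renderer that uses only the standard library. The function `render_tree` and its indentation style are assumptions made for this sketch; the parent links are taken from the fragment of output shown above, with B:255 treated as the root of the fragment only for the purposes of the example.

```python
def render_tree(parent_of, all_ids):
    """Render parent links as an indented plain-text outline."""
    children = {}
    for node_id in all_ids:
        # Nodes without a recorded parent are grouped under the key None.
        children.setdefault(parent_of.get(node_id), []).append(node_id)

    lines = []

    def walk(node_id, depth):
        lines.append("    " * depth + node_id)
        for child in sorted(children.get(node_id, [])):
            walk(child, depth + 1)

    for root in sorted(children.get(None, [])):
        walk(root, 0)
    return "\n".join(lines)


# Parent links read off the output fragment in the article.
parent_of = {
    "B:70": "B:255",
    "B:71": "B:70",
    "B:72": "B:71",
    "B:1292": "B:72",
    "B:1298": "B:72",
    "B:1294": "B:72",
}
all_ids = ["B:255"] + list(parent_of)

outline = render_tree(parent_of, all_ids)
print(outline)
```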

Conclusion

The RDP News metadata service complements other RDP News services, allowing for more efficient consumption and analysis of headlines, stories and alerts, and lending itself easily to the design and implementation of diverse use cases.

References

https://github.com/Refinitiv-API-Samples/Example.RDPAPI.Python.NewsMetadata

https://developers.refinitiv.com/en/api-catalog/refinitiv-data-platform/refinitiv-data-platform-apis/documentation#news-user-guide

https://developers.refinitiv.com/refinitiv-data-platform/refinitiv-data-platform-apis

http://apidocs.refinitiv.com

https://anytree.readthedocs.io/en/latest/