My purpose in this article is to cover a common ESG (Environmental, Social and Governance) use case: how to regularly and easily retrieve ESG bulk data from the Elektron Data Platform (EDP) hosted Client File Store (CFS).
Several steps are needed before you can use this data in your own applications: the latest file-sets must be identified, and several hundred compressed files must be downloaded and then concatenated.
To make my life easy I created a Python demo code sample that performs all these tasks; it is available for download at the end of this article. My purpose is to explain how it works, after introducing ESG data and the Client File Store in which it resides.
Some familiarity with EDP is needed to follow this article. To experiment with the API, you also need a valid EDP user account permissioned for CFS ESG data.
The ESG acronym stands for Environmental Social & Governance. This is data that highlights how companies are rated on Environmental, Social and Corporate Governance criteria.
There are two kinds of data: measures and scores.
There are more than 400 measures (the actual number varies by company). As their name indicates, they measure specific criteria and are self-reported by the companies. A measure can be a boolean, a number or, in some cases, a string, depending on the nature of the underlying data. Here are a few of the areas covered:
- Water and energy consumption and efficiency
- CO2 emissions
- Human rights
- Women employees
- Staff training
- Health and safety
Using the measures, we calculate a set of scores, i.e. analytics whose result is a number indicating how well a company performs on various criteria. Some scores cover specific topics, like the Innovation Score. Others are summaries: the Environment, Social and Governance Pillar Scores are overall ratings for each of the 3 pillars, and the ESG Score is an overall score for the company, based on those 3 pillars. Scores range from 0 to 1, 0 being the worst and 1 the best.
All these values are delivered on a yearly basis, i.e. there is one record per calendar year.
Please note that the source data for measures is self-reported by the companies. Depending on when a company publishes it, values for the preceding year might not be available yet.
How can I access ESG data?
The Elektron Data Platform offers 2 ways to access ESG data:
- By instrument, using specific queries - See the ESG data in Postman tutorial for more information
- In bulk files - This is what this article is about. Two new ESG bulk files are made available on a daily basis, one for scores, one for raw data (measures). They remain available for one month.
ESG bulk data is stored in the Elektron Data Platform (EDP) hosted Client File Store (CFS).
The CFS provides authorization and enables access to content files stored in publisher-supplied repositories, currently Amazon Web Services (AWS) S3. CFS is intended for both publishers and subscribers. Publishers call CFS APIs to post metadata about their files stored in their repositories, while subscribers call CFS APIs to discover and download those files.
ESG bulk data is published by Refinitiv every day, 7 days a week. Note that the scores files are published approximately 2 hours before the raw (measures) files. All these files remain available for 1 month.
As an ESG bulk customer you are entitled to search and download this data.
CFS data file organization
CFS provides buckets and file-sets to organize files.
A bucket is the top level of the hierarchical data organisation. It can contain any number of file-sets. ESG bulk data is stored in a specific bucket, EDS-BULK-ESG-Production; this bucket is reserved for bulk ESG data and contains nothing else.
A file-set is the second level of the data organisation. It is an indivisible set of files that are all delivered together. For ESG bulk data there are 2 daily file-sets, one for scores and one for raw data. Each file-set consists of multiple compressed files that together make up one large file. The 2 file-sets are identified by their package code (esg-esg-scores or esg-esg-raw), so you can decide to retrieve only scores, only raw (measures) data, or both.
The file is the last level of the data organisation. The scores file-set contains approximately 220 files, representing a total of nearly 4.5 million records. The raw data file-set contains approximately 600 files, representing approximately 20 million records. All these files are compressed using gzip. Most files have 20,000 lines, but there can be a few exceptions.
Once all files have been added to a file-set and it is ready for download by subscribers, the publisher updates the file-set status to READY. This allows the publisher to control the release. File-sets with a status of READY cannot be updated or modified; updates must be published as a new file-set. This means that as soon as a file-set has status READY, you can safely download its files, as there is no risk that the contents will change.
As a subscriber you can query for the files you are entitled to using several criteria, such as bucket, date range and file-set, as well as other attributes, such as package code.
Now that we know how the ESG data is organized in the CFS, let us come back to our use case, i.e. easily retrieving all ESG bulk files on a daily basis.
We need to:
- Find the most recent file-sets
- Retrieve the list of all files in each file-set
- Download the files of each file-set
- Concatenate the files of each file-set
Once this is done, the ESG bulk data can be used by our applications.
Obviously, before doing all this, we must authenticate to EDP; this step is not specific to ESG data, and is out of scope for this article. Please refer to our tutorials or this other article for more details on EDP authentication.
Let us now look at how we can implement this workflow, in more detail.
1. Find the most recent file-sets
This is done by using the API to query the ESG bulk bucket for a list of file-sets updated today. Here is the API call, using today's date as parameter in the URI:
The result will depend on the time of the day:
- The scores file-set is usually made available around 08:40 UTC
- The raw file-set is usually made available around 10:40 UTC
If you are only interested in scores data you could add a filter on the packageCode, like this:
The same applies to raw data:
The response contains a list of file-sets with some metadata about them. One important element of that metadata is the file-set Id, a unique identifier that allows retrieving the full list of files in the file-set.
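As an illustration of this first step, here is a minimal Python sketch that builds the query URI for today's file-sets, optionally filtered by package code. The base URL, bucket name and package codes come from this article. The `/file-sets` path, the `bucket`, `modifiedSince` and `attributes` parameter names, and the `packageCode:<code>` filter syntax are assumptions made for the example; please check them against the CFS API User Guide before use.

```python
from datetime import date
from urllib.parse import urlencode

# Base URL and bucket name are taken from this article.
BASE_URL = "https://api.refinitiv.com/file-store/beta1"
ESG_BUCKET = "EDS-BULK-ESG-Production"

# ASSUMPTION: the endpoint path is illustrative, not confirmed by the article.
FILE_SETS_ENDPOINT = f"{BASE_URL}/file-sets"


def build_file_sets_url(day, package_code=None):
    """Build a query URI for the file-sets updated since the start of `day` (UTC)."""
    params = {
        "bucket": ESG_BUCKET,                                  # ASSUMED parameter name
        "modifiedSince": day.strftime("%Y-%m-%dT00:00:00Z"),   # ASSUMED parameter name
    }
    if package_code is not None:
        # ASSUMED filter syntax for the package code attribute.
        params["attributes"] = f"packageCode:{package_code}"
    return f"{FILE_SETS_ENDPOINT}?{urlencode(params)}"


def todays_esg_urls():
    """Return one query URI per ESG package code for today's date."""
    today = date.today()
    return {
        code: build_file_sets_url(today, code)
        for code in ("esg-esg-scores", "esg-esg-raw")
    }
```

Keeping the URI construction in one small function makes it easy to correct the parameter names in a single place if they differ from the ones assumed here.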
2. Retrieve the list of all files in each file-set
We use the file-set Id received in the response to the previous call:
GET https://api.refinitiv.com/file-store/beta1/files?filesetId=<file-set id>
For each record in the response, we save the filename and its corresponding file id.
Note that the response will only return up to 25 objects. If the request generates more objects, which is the case for ESG bulk data, the result will include a nextLink to indicate that more data is available. We therefore use the URI specified in the nextLink to retrieve the next page of data. The nextLink continues to appear until all results have been returned.
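The paging logic can be sketched as a simple loop that keeps following the nextLink until none is returned. This is only a sketch under assumptions: the JSON key names (`value` for the list of records, `@nextLink` for the link, `filename` and `id` inside each record) are placeholders that are not confirmed by this article, and `fetch_json` is a hypothetical callable, supplied by you, that performs the authenticated GET and returns the decoded JSON.

```python
from urllib.parse import urljoin


def collect_files(first_url, fetch_json, items_key="value", next_key="@nextLink"):
    """Follow nextLink pages and return a {filename: file id} dictionary.

    `fetch_json(url)` is a caller-supplied function (hypothetical here) that
    issues the authenticated GET request and returns the parsed JSON body.
    The key names are ASSUMPTIONS and can be overridden via the arguments.
    """
    files = {}
    url = first_url
    while url:
        page = fetch_json(url)
        for record in page.get(items_key, []):
            files[record["filename"]] = record["id"]
        next_link = page.get(next_key)
        # urljoin handles both absolute and server-relative links.
        url = urljoin(url, next_link) if next_link else None
    return files
```

With approximately 600 files in the raw file-set and at most 25 objects per response, this loop would issue around 24 requests for that file-set.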
3. Download the files of each file-set
This can be achieved by using the file ids and filenames collected in the previous step.
To download a file, we use the file id as a parameter in the call:
GET https://api.refinitiv.com/file-store/beta1/files/<file id>/stream
Then we save the file using the filename received in the file metadata.
Note that this API call returns an HTTP 302 redirect, which automatically redirects the request to the underlying repository, with embedded authorization. If your firewall modifies headers or blocks redirects, you can add the doNotRedirect=true parameter to disable automatic redirection and instead return an AWS S3 signed link to the object's content:
GET https://api.refinitiv.com/file-store/beta1/files/<file id>/stream?doNotRedirect=true
You can then use the returned URI to initiate the file download.
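Here is a minimal sketch of the doNotRedirect variant, using only the Python standard library. The stream URI and the doNotRedirect parameter come from this article. The bearer-token header and the `url` key holding the signed link in the JSON response are assumptions for the example. The second request deliberately sends no Authorization header, since the signed link already carries the authorization.

```python
import json
import shutil
import urllib.request

BASE_URL = "https://api.refinitiv.com/file-store/beta1"  # from this article


def stream_url(file_id, do_not_redirect=False):
    """Return the download URI for a file id, optionally with doNotRedirect."""
    url = f"{BASE_URL}/files/{file_id}/stream"
    return f"{url}?doNotRedirect=true" if do_not_redirect else url


def download_file(file_id, filename, token,
                  urlopen=urllib.request.urlopen, url_key="url"):
    """Download one file in two steps: get the signed link, then fetch it.

    ASSUMPTIONS: the access token is sent as a bearer token, and the signed
    link is returned in a JSON field named by `url_key`.
    """
    request = urllib.request.Request(
        stream_url(file_id, do_not_redirect=True),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request) as response:
        signed_url = json.load(response)[url_key]

    # The signed link is fetched without the Authorization header.
    with urlopen(signed_url) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)
```

The `urlopen` argument exists only so that the function can be exercised with a stand-in during testing; in normal use the default is kept.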
As there are many files in ESG bulk file-sets, you can expect the download process to take some time. In my tests I observed that scores typically downloaded in approximately 10 to 20 minutes, and raw data in 30 minutes to 2 hours. Please consider these numbers as rough estimates; download performance depends on a wide range of factors, including your internet connection, its load, and the server load.
4. Concatenate the files of each file-set
Once all files have been downloaded, we can proceed to the last step.
The individual files are all compressed using gzip. The usual use case is to concatenate them into a single file that represents the entire data set for the day. The order in which the files are concatenated is not important, as there is no specific order in the ESG records.
Concatenation can be done with or without decompressing the individual files.
But we must be careful: the last record in every file is not followed by a newline character. This means that if we simply concatenate all the files, the last record of each file will end up on the same line as the first record of the following file. This could disrupt further processing of the resulting file. To circumvent this, we insert a newline character after each individual file during the concatenation process.
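One way to do this without decompressing anything is sketched below. It relies on the fact that a gzip file may consist of several members placed back to back, so a tiny gzip member containing just a newline can be written after each file. This is an illustration of the idea, not necessarily how the demo code sample does it; the function and argument names are made up for the example.

```python
import gzip
import shutil


def concatenate_gzip_files(part_paths, combined_path):
    """Concatenate gzip files, adding a newline after each one.

    The parts are copied byte for byte (no decompression). After each part,
    a small gzip member holding a single newline is appended, so that the
    last record of one file and the first record of the next end up on
    separate lines once the combined file is decompressed.
    """
    newline_member = gzip.compress(b"\n")
    with open(combined_path, "wb") as combined:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, combined)
            combined.write(newline_member)


def decompress(combined_path, output_path):
    """Decompress the combined multi-member gzip file into a plain file."""
    with gzip.open(combined_path, "rb") as source, open(output_path, "wb") as out:
        shutil.copyfileobj(source, out)
```

Because the parts are never decompressed during concatenation, this step is mostly limited by disk speed. The combined file remains a valid gzip file that standard tools can decompress in one pass.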
My Python demo code sample implements the workflow I have just described. Apart from that, it also includes a final step to decompress the full file.
It also contains a feature to optimize re-runs. A re-run is required in the following cases:
- Not all file-sets were available at the first run.
- The previous run failed or was interrupted.
File downloads are the most time-consuming task when retrieving ESG bulk files. Therefore, the code writes the filename of each file to a log file once it has been successfully downloaded. If the sample is run several times on the same day, it reads that log file and skips all files that were already downloaded, thus saving time.
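The re-run optimization can be sketched in a few lines. The function names and the log format (one filename per line) are assumptions for this example and may differ from the demo code sample.

```python
from pathlib import Path


def load_downloaded(log_path):
    """Return the set of filenames recorded in the download log, if any."""
    path = Path(log_path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def record_download(log_path, filename):
    """Append a filename to the log once its download has succeeded."""
    with open(log_path, "a") as log:
        log.write(filename + "\n")


def files_to_download(all_files, log_path):
    """Filter a {filename: file id} dictionary down to files not yet logged."""
    done = load_downloaded(log_path)
    return {name: file_id for name, file_id in all_files.items() if name not in done}
```

Writing the filename to the log only after the download has completed means that an interrupted download is simply retried on the next run.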
The code for downloading the files also attempts to recover from spurious disconnects, which can occur occasionally.
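A simple way to recover from occasional disconnects is to retry the download a few times with a growing pause between attempts. The sketch below shows the general idea; the number of attempts, the delays and the set of exceptions treated as recoverable are arbitrary choices for the example, not the settings used in the demo code sample.

```python
import time


def download_with_retry(download, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call `download()` and retry it after a connection problem.

    `download` is any zero-argument callable that performs one download.
    Only ConnectionError and TimeoutError are treated as recoverable here;
    other exceptions are raised immediately. If the last attempt fails,
    its exception is propagated to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return download()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise
            # Wait a little longer after each failed attempt.
            sleep(delay_seconds * attempt)
```

If your HTTP library raises its own exception types for dropped connections, add them to the `except` clause.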
The Python demo code sample is presented here "as is", without any warranty or conditions of any kind. It is an unproductized open-source example, intended as a learning tool. Error handling is therefore fairly basic. Please feel free to update the code as required for your intended use case.
Here is what you can expect to see when running the program:
It creates a log directory where you can find the lists of file-sets, files and successfully downloaded files, and a data directory containing the downloaded and concatenated files.
In this article I showed you how to easily implement a typical ESG bulk data retrieval scenario, for daily downloads.
For documentation, see the CFS API User Guide.