Using Tick History REST API in R language - Part 2

This is the second article in an educational series showing how to use the httr package to retrieve data from Thomson Reuters Tick History (TRTH) REST APIs in R. In Part 1, we looked at user authentication and retrieving user information. In this part, we are going to retrieve Venue by Day (VBD) files.

DSS Feeds are organized by Subscription (e.g. Tick History Venue by Day or Street Events) and Package. Packages are enabled for clients based on their purchased commercial package and provide a means to purchase subsets of all the available content for the Subscription. Each Package delivers its content at specific intervals as User Package Deliveries which provide some information about the file content such as the name, release date, frequency, and file size.

Before we start, you are encouraged to visit the REST API Reference tree at the REST API Help Home Page as the primary source of reference for all RESTful functions and supported conditions. You can also view the full list of functions in the httr package here.

The R script used in this article is available on GitHub.

List User Packages

Tick History subscriptions are permissioned by venue, and each venue is represented as a single package. Users can download Venue by Day files only from the packages they are entitled to. So first, let's get the list of all entitled packages. You can request the package list with a GET request.

GET https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackages
Authorization: Token <your_auth_token_goes_here>
Prefer: respond-async
library(httr)

# token holds the authorization string obtained in Part 1
url <- "https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackages"
r <- GET(url, add_headers(prefer = "respond-async", Authorization = token))
stop_for_status(r)
a <- content(r, "parsed", "application/json", encoding = "UTF-8")

The request will return a list of the user package Id, user package name and the corresponding subscription name available for your account.

str(a)
List of 2
 $ @odata.context: chr "https://hosted.datascopeapi.reuters.com/RestApi/v1/$metadata#UserPackages"
 $ value         :List of 2
  ..$ :List of 5
  .. ..$ UserPackageId   : chr "0x04f21a8d2fe59cb1"
  .. ..$ PackageId       : chr "0x04f21a8d2fe59cb1"
  .. ..$ PackageName     : chr "ZAG - Zagreb Stock Exchange"
  .. ..$ SubscriptionId  : chr "0x0400dc1d24a00cb4"
  .. ..$ SubscriptionName: chr "TRTH Venue by Day"
  ..$ :List of 5
  .. ..$ UserPackageId   : chr "0x04f21a8d16759cb1"
  .. ..$ PackageId       : chr "0x04f21a8d16759cb1"
  .. ..$ PackageName     : chr "BRA - Bratislava Stock Exchange"
  .. ..$ SubscriptionId  : chr "0x0400dc1d24a00cb4"
  .. ..$ SubscriptionName: chr "TRTH Venue by Day"

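As an illustrative sketch (not part of the API itself), the parsed list can be flattened into a data frame so you can look up the PackageId for a venue by name. The "ZAG" prefix is just the example venue from the output above:

```r
# 'a' is the parsed response from the UserPackages request above.
# Flatten the list of packages into a data frame.
pkgs <- do.call(rbind, lapply(a$value, function(p) {
    data.frame(PackageId        = p$PackageId,
               PackageName      = p$PackageName,
               SubscriptionName = p$SubscriptionName,
               stringsAsFactors = FALSE)
}))

# Look up the PackageId of a venue by its name prefix (e.g. "ZAG")
zagId <- pkgs$PackageId[grepl("^ZAG", pkgs$PackageName)]
```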
List User Package Deliveries

We can retrieve the list of user package deliveries (data files) either by PackageId:

GET https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackageDeliveryGetUserPackageDeliveriesByPackageId(PackageId='0x04f21a8d2fe59cb1')
Authorization: Token <your_auth_token_goes_here>
Prefer: respond-async

Or retrieve the list of user package deliveries for a specific date range.

GET https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackageDeliveryGetUserPackageDeliveriesByDateRange(SubscriptionId='0x0400dc1d24a00cb4',FromDate=2016-11-22T20:54:03.671Z,ToDate=2016-11-24T20:54:03.671Z)
Authorization: Token <your_auth_token_goes_here>
Prefer: respond-async
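For completeness, here is a sketch of the date-range variant in R, using the SubscriptionId from the package listing above; the date values are illustrative and should be adjusted to your needs:

```r
# Illustrative SubscriptionId and date range
subId <- "0x0400dc1d24a00cb4"
from  <- "2016-11-22T00:00:00.000Z"
to    <- "2016-11-24T00:00:00.000Z"

# Note: the dates are passed unquoted, as in the raw HTTP example
url <- paste0("https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/",
              "UserPackageDeliveryGetUserPackageDeliveriesByDateRange(SubscriptionId='",
              subId, "',FromDate=", from, ",ToDate=", to, ")")
r <- GET(url, add_headers(prefer = "respond-async", Authorization = token))
stop_for_status(r)
deliveries <- content(r, "parsed", "application/json", encoding = "UTF-8")
```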

For this tutorial, we shall retrieve by PackageId.

url <- "https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackageDeliveryGetUserPackageDeliveriesByPackageId(PackageId='0x04f21a8d2fe59cb1')"
r <- GET(url, add_headers(prefer = "respond-async", Authorization = token))
stop_for_status(r)
a <- content(r, "parsed", "application/json", encoding = "UTF-8")

The request will return a list containing the package delivery Id, file name, and file size.

str(a)
List of 3
 $ @odata.context : chr "https://hosted.datascopeapi.reuters.com/RestApi/v1/$metadata#UserPackageDeliveries"
 $ value          :List of 250
  ..$ :List of 8
  .. ..$ PackageDeliveryId: chr "0x05c1c083535b3026"
  .. ..$ UserPackageId    : chr "0x04f21a8d23f59cb1"
  .. ..$ SubscriptionId   : chr "0x0400dc1d24a00cb4"
  .. ..$ Name             : chr "MCE-2017-06-14-NORMALIZEDMP-Report-1-of-1.csv.gz"
  .. ..$ ReleaseDateTime  : chr "2017-06-14T20:30:00.000Z"
  .. ..$ FileSizeBytes    : int 222692
  .. ..$ Frequency        : chr "Daily"
  .. ..$ ContentMd5       : chr ""
  ..$ :List of 8
  .. ..$ PackageDeliveryId: chr "0x05c1c081490b2f96"
  .. ..$ UserPackageId    : chr "0x04f21a8d23f59cb1"
  .. ..$ SubscriptionId   : chr "0x0400dc1d24a00cb4"
  .. ..$ Name             : chr "MCE-2017-06-14-NORMALIZEDMP-Data-1-of-1.csv.gz"
  .. ..$ ReleaseDateTime  : chr "2017-06-14T20:30:00.000Z"
  .. ..$ FileSizeBytes    : int 72940814
  .. ..$ Frequency        : chr "Daily"
  .. ..$ ContentMd5       : chr "5f4e539c0fc2abc320a4af76b823ea7b"
  .. [list output truncated]
 $ @odata.nextlink: chr "https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackageDeliveryGetUserPackageDeliveriesByPackageId(P"

However, retrieving the list of package deliveries returns only the most recent 250 rows. This is because the TRTH REST API uses server-driven paging, which ensures that the quantity of data does not overwhelm the application and conforms to the OData server-driven paging specification. The page size can be customized by setting the 'odata.maxpagesize' preference. For example:

GET https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/PackageDeliveries
Authorization: Token <your_token_goes_here>
Accept: application/json; odata.metadata=minimal
Accept-Charset: UTF-8
Prefer: odata.maxpagesize=10; respond-async
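In R, the odata.maxpagesize preference goes into the same Prefer header. A sketch (the page size of 10 and the UserPackages endpoint are just for illustration):

```r
url <- "https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackages"
# Ask the server for pages of at most 10 entries; it may not honor this
r <- GET(url, add_headers(prefer = "odata.maxpagesize=10; respond-async",
                          Authorization = token))
stop_for_status(r)
a <- content(r, "parsed", "application/json", encoding = "UTF-8")
```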

The API is not required to honor the requested page size, which means you should always account for the @odata.nextlink.

So, for this article, let's use a recursive function to retrieve the entire list.

.TRTHGetAllPages <- function(token, url) {
    r <- GET(url, add_headers(prefer = "respond-async", Authorization = token))
    stop_for_status(r)
    a <- content(r, "parsed", "application/json", encoding = "UTF-8")
    # Check if there is a next link
    if (!is.null(a[["@odata.nextlink"]])) {
        # Call the function again, using the next link
        nurl <- a[["@odata.nextlink"]]
        b <- .TRTHGetAllPages(token, nurl)
        # Merge the result
        for (i in 1:length(b[["value"]])) {
            a[["value"]][[length(a[["value"]]) + 1]] <- b[["value"]][[i]]
        }
    }
    # Remove the next link to avoid confusion
    a[["@odata.nextlink"]] <- NULL
    return(a)
}

TRTHUserPackageDeliveriesByPackageId <- function(token, PackageId) {
    url <- paste0("https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackageDeliveryGetUserPackageDeliveriesByPackageId(PackageId='", PackageId, "')")
    a <- .TRTHGetAllPages(token, url)
    return(a)
}

Get User Package Deliveries

In this final step, we retrieve the files and save them to local disk, one file at a time. The PackageDeliveryId is a parameter set in the URL.

GET https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackageDeliveries('0x05c1c083535b3026')/$value
Authorization: Token <your_auth_token_goes_here>
Prefer: respond-async

However, the API deviates from the HTTP standard (RFC 2616), under which a server may only compress content using the encodings the client accepts. Standard Extractions feeds always deliver files in a pre-compressed format, even when the Accept-Encoding: gzip header is not present. This is because VBD content can be very large, and uncompressed delivery would increase the burden on the servers and bandwidth. The large file size also means that we should not buffer the response in memory; the file should be written directly to disk.

Another issue is that when the file size is too large, Tick History delivers its output as several smaller gzip files concatenated into a single large gzip file. However, httr does not support concatenated gzip files and will fail with the error Unrecognized or bad HTTP Content or Transfer-Encoding. This is explained in the advisory notice: Avoid Incomplete Output - Decompress then Download.

To avoid this issue, you have to disable httr's content decoding and decompress the output only after you have downloaded it.

So, what we have to do is:

  1. Use the httr::write_disk() option to write the file to a known path.
  2. Disable HTTP content decoding with httr::config(http_content_decoding=0).
  3. It is also advisable to add a progress bar with progress().

url <- "https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackageDeliveries('0x05c1c083535b3026')/$value"
r <- GET(url, add_headers(prefer = "respond-async", Authorization = token),
         config(http_content_decoding = 0), write_disk("output.csv.gz"), progress())
stop_for_status(r)

The file is saved to disk and can be opened with read.csv(). read.csv() can read a compressed file, so there is no need to decompress it first.
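As a quick sketch, assuming the download above produced output.csv.gz in the working directory (the actual columns depend on the report template):

```r
# read.csv() decompresses .gz files transparently
df <- read.csv("output.csv.gz", stringsAsFactors = FALSE)

# Inspect the result
dim(df)
head(df)
```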

Downloading Files via Direct Download From Amazon Web Services

Given the size of VBD files, it is faster to retrieve them directly from the Amazon Web Services (AWS) cloud in which they are hosted. This can be done by including the HTTP header field X-Direct-Download: true in the request. Tick History will respond with an HTTP 302 status code that redirects the request to AWS.

Example response

HTTP/1.1 302 Found
Set-Cookie: DSSAPI-COOKIE=R##########; path=/
Cache-Control: no-cache
Pragma: no-cache
Expires: -1
Location: https://s3-us-west-1.amazonaws.com/tickhistory.query.production.pln-ebd-dse/Instruments/DMX/2017/2017-11-08/data/DMX-2017-11-08-Instruments-1-of-1.csv.gz?AWSAccessKeyId=AFSDFFSAFSAFSAFSA&Expires=1512752471&response-content-disposition=attachment%3B%20filename%3DDMX-2017-11-08-Instruments-1-of-1.csv.gz&Signature=fjgMfHS1BhgA4WOmsuybrfml3eM%3D&x-amz-request-payer=requester
Server: Microsoft-IIS/7.5
X-Request-Execution-Correlation-Id: 02a55bd3-3b9b-437e-ba2f-88d68fca31b5
X-App-Id: Custom.RestApi
X-App-Version: 11.2.653.64
Date: Fri, 08 Dec 2017 11:01:11 GMT
Content-Length: 0

By default, httr handles the HTTP 302 redirection to Amazon automatically. However, it forwards all the HTTP header fields from the original request as part of the new (redirected) request, including the Authorization header field that contains the DataScope Select authentication token. Amazon interprets this token as an Amazon token, causing the request to fail. As a result, we have to configure the httr client not to follow the redirect when it receives a 302 status code, then read the response's Location header field and send a new request to that URL, omitting the Authorization and X-Direct-Download header fields.

So, the steps are:

  1. Add "X-Direct-Download" = "true" header.
  2. Disable redirection with httr::config(followlocation=0).
  3. Use httr::status_code(), to extract status code from a response.
  4. Send a new request to r$headers$location if the code is 302.
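The snippet below uses three variables you would set yourself; the values here are hypothetical placeholders (the delivery Id is the example one from the listing above):

```r
# Hypothetical values for the snippet below
PackageDeliveryId <- "0x05c1c083535b3026"   # from the delivery listing above
Path              <- "output.csv.gz"        # local file to write
Overwrite         <- TRUE                   # passed to write_disk()
```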
url <- paste0("https://hosted.datascopeapi.reuters.com/RestApi/v1/StandardExtractions/UserPackageDeliveries('", PackageDeliveryId, "')/$value")
r <- GET(url,
         add_headers(prefer = "respond-async", Authorization = token, "X-Direct-Download" = "true"),
         config(http_content_decoding = 0, followlocation = 0),
         write_disk(Path, Overwrite), progress())
if (status_code(r) == 302) {
    # Follow the redirect manually, without the Authorization header
    r2 <- GET(r$headers$location, add_headers(prefer = "respond-async"),
              config(http_content_decoding = 0, followlocation = 0),
              write_disk(Path, Overwrite), progress())
    stop_for_status(r2)
} else {
    stop_for_status(r)
}