
How to Optimize Tick History file downloads for Python (and other languages)

Christiaan Meihsl
Head of Developer Content, Platform Application Developer

Introduction

After you request historical data using LSEG Tick History, and once the data has been extracted and is ready, you can download the resulting compressed data file and save it to your local hard disk.

In this article, which applies to any LSEG Tick History data extraction type, I investigate how to download the compressed data files, and how to optimize the download time. Several Python code versions are compared.

Read on even if you do not use Python: the section on request optimization and part of the conclusion apply to all programming languages.

You need some basic knowledge of LSEG Tick History to follow this article. To experiment with the API, a valid LSEG Tick History user account is also required.

Disclaimer

The performance numbers given in this article are purely indicative, and do not constitute a guarantee of any sort.

Parameters influencing download performance

Download performance varies a lot due to several parameters.

Available bandwidth depends on your internet connection and on other simultaneous traffic sharing the same connection. This must be considered along the entire path between your machine and the servers, so the route through the internet and its current load are also major influencing factors. Due to their distributed and stochastic nature, it is impossible to make reliable predictions about their performance.

Our server load depends on the current number and size of requests being handled. This typically fluctuates during the day, as many users will attempt to retrieve data as soon as markets are closed and the data is made available, thus generating high traffic peaks.

The nature of your requests and the way your code is written also have an influence, and this is where you have a certain level of control. The question I tried to answer here is: how much control does a developer actually have, and what can be done to optimize performance?

Optimizing data requests

The idea is to reduce the load from the start. Trimming a request down to the strict minimum amount of data required helps shorten the processing time, reduces the resulting data size, and thus allows for faster downloads that consume less storage space.

Here are a few paths to request optimization that will apply to any programming language.

Download directly from the Amazon Web Services cloud:

The REST API allows you to download extracted data faster by retrieving it directly from the AWS (Amazon Web Services) cloud in which it is hosted. For this you must include the HTTP header field X-Direct-Download: true in the request. In C# you can add this header using: context.DefaultRequestHeaders.Add("x-direct-download", "true");

This feature became available early in 2017 for standard extractions of VBD (Venue by Day) data, and since 14 August 2017 it is also available for the following custom Tick History reports: Tick History Time and Sales, Tick History Market Depth, Tick History Intraday Summaries, and Tick History Raw reports.

When you use this header, a response with HTTP status 302 (redirect) is returned. Most HTTP clients will automatically follow the redirection. Some HTTP clients (like the curl command, or PowerShell's Invoke-WebRequest) follow the redirection but fail to connect to AWS, because they include the Authorization header in the request redirected to AWS; AWS then returns a BadRequest status (400) with the error:

Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified

If your HTTP client does not automatically follow the redirection, or if it runs into this error, then retrieve from the 302 response a header called Location, which contains the pre-signed URL that allows you to get your data directly from AWS. Then perform a GET on this URL, without using the Authorization header.
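
As an illustration, here is a minimal sketch of this manual redirection handling, using the Requests library. It assumes requestUrl and token are defined as in the code samples further down in this article:

import requests

requestHeaders = {
    "Prefer": "respond-async",
    "Content-Type": "text/plain",
    "Accept-Encoding": "gzip",
    "X-Direct-Download": "true",
    "Authorization": "token " + token
}

# Disable automatic redirects so that we control the headers ourselves
r = requests.get(requestUrl, headers=requestHeaders, allow_redirects=False, stream=True)

if r.status_code == 302:
    # The Location header contains the pre-signed AWS URL;
    # GET it WITHOUT the Authorization header
    awsUrl = r.headers["Location"]
    r = requests.get(awsUrl, stream=True)

The response r can then be saved using any of the code versions shown later in this article.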

Using AWS will give you the greatest performance gain, and ensure you receive the extraction file in the shortest possible time.

Limit the request to the data you really need:

  • Reduce the date/time range for the request to what you need. Why request 7 days if you only need 5?
  • Limit the number of fields to those that are really required. Avoid requesting static fields (e.g. Asset Type, Instrument ID Type, etc.) for requests that return many rows of data, as the static fields will always contain the same values. It is much better to request reference and other such data in a separate call. Note that even empty fields require extraction time.
  • Request only the instruments you need.
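
As an example, here is a sketch of a minimal Time and Sales extraction request that applies these points: a short date range, two fields, and a single instrument. The endpoint, report type and field names are indicative, so check the REST API Reference Tree for the exact names that apply to your case; token is assumed to have been obtained as usual:

import requests

requestUrl = "https://selectapi.datascope.refinitiv.com/RestApi/v1/Extractions/ExtractRaw"

requestBody = {
    "ExtractionRequest": {
        "@odata.type": "#DataScope.Select.Api.Extractions.ExtractionRequests.TickHistoryTimeAndSalesExtractionRequest",
        # Only the fields we really need
        "ContentFieldNames": ["Trade - Price", "Trade - Volume"],
        "IdentifierList": {
            "@odata.type": "#DataScope.Select.Api.Extractions.ExtractionRequests.InstrumentIdentifierList",
            # Only the instruments we really need
            "InstrumentIdentifiers": [
                {"Identifier": "IBM.N", "IdentifierType": "Ric"}
            ]
        },
        "Condition": {
            "MessageTimeStampIn": "GmtUtc",
            # Only the date/time range we really need
            "ReportDateRangeType": "Range",
            "QueryStartDate": "2017-08-01T00:00:00.000Z",
            "QueryEndDate": "2017-08-05T23:59:59.999Z"
        }
    }
}

r = requests.post(requestUrl, json=requestBody,
                  headers={"Prefer": "respond-async",
                           "Authorization": "token " + token})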

When using ISINs, SEDOLs or CUSIPs, remember that they usually map to several RICs:

  • A request for Time and Sales, Market Depth or Intraday Summary will return data for all corresponding RICs. This amount of data is rarely required. If you do not want data for all RICs, you are better off mapping your ISINs to the set of RICs you need, and requesting data for that subset.
  • Note: a request for Elektron Timeseries, Terms and Conditions or Corporate Actions delivers data for the RIC corresponding to the input ISIN and exchange. If no exchange is specified, the result defaults to the primary RIC. If you want results for several exchanges, you need one entry per exchange.
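
As a sketch, requesting results for two specific exchanges would then use one identifier entry per exchange. The Source attribute naming the exchange is an assumption here; check the REST API Reference Tree for the identifier attributes supported by your report type:

instrumentIdentifiers = [
    # Same ISIN, one entry per exchange (exchange codes are illustrative)
    {"Identifier": "US4592001014", "IdentifierType": "Isin", "Source": "NYS"},
    {"Identifier": "US4592001014", "IdentifierType": "Isin", "Source": "PCQ"}
]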

Beware of chain RIC expansion:

  • A request for a chain RIC will expand to all its constituents. If you only need a few of them, then refrain from requesting the entire chain.

In the context of your general workflow:

  • Reduce the overhead by making a single request for multiple instruments; this is usually faster than making many requests for a few instruments each.
  • Set a realistic polling interval when querying the servers for results. The more frequently you poll, the more system resources are consumed. A very large request does not need a very short polling interval (see the sketch after this list).
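
As mentioned in the last point, polling can follow a simple loop like this sketch. It assumes monitorUrl is the URL returned in the Location header of the initial extraction request (HTTP status 202), token as elsewhere in this article, and a JobId field in the final response, as described in the On Demand extraction workflow tutorial:

import time
import requests

headers = {"Prefer": "respond-async", "Authorization": "token " + token}
pollInterval = 30  # seconds; scale this up for very large requests

r = requests.get(monitorUrl, headers=headers)
while r.status_code == 202:  # 202 means the extraction is still running
    time.sleep(pollInterval)
    r = requests.get(monitorUrl, headers=headers)

jobId = r.json()["JobId"]  # used afterwards to download the data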

Optimizing your Python download code

Optimizing your code can enhance its performance and minimize its resource consumption.

For this article I looked at two Python libraries: Requests and urllib3. Requests is much more popular than urllib3, but as Requests is built on top of urllib3 I was curious to compare their performance. I also wanted to see if code optimization would lead to visible performance gains, by testing different coding versions.

For an On Demand data request, we first send an extraction request to the server, and then check its status. Once the extraction has completed, a job ID is returned. The following code extracts start at the actual data download request, using the job ID. If you need more information on the whole process, please refer to the LSEG Tick History Tutorials, starting with the one on the On Demand extraction workflow.

Using the Requests library

Requesting the data

Here is the basic Python code I used to request the data, using the job ID:

import requests

requestUrl = "https://selectapi.datascope.refinitiv.com/RestApi/v1/Extractions/RawExtractionResults" + "('" + jobId + "')" + "/$value"

requestHeaders = {
    "Prefer": "respond-async",
    "Content-Type": "text/plain",
    "Accept-Encoding": "gzip",
    "Authorization": "token " + token
}

r = requests.get(requestUrl, headers=requestHeaders, stream=True)

r.raw.decode_content = False

Notes on the code:

  • I set stream=True to be able to access the raw socket response from the server.
  • The last line ensures that data is not automatically decompressed on the fly.

Saving the data

The next step is to save the data received in the response stream r. Let us investigate several possible code versions.

Version 1.1

with open(fileName, 'wb') as fd:
    fd.write(r.raw.read())

This simple code is a very bad choice, listed here to illustrate what must be avoided! It first pulls all the data into RAM, and only then writes it to disk. RAM usage is therefore quite high, and performance poor. This version was promptly discarded. The next versions do not load everything into RAM; instead, they save the data on the fly, as it is received.

Version 1.2 (read)

This code reads data in chunks, and writes them to disk, thus eliminating the main drawback of the first code snippet. It also allows for tuning the chunk size.

I made 2 variants of this code.

Version 1.2a:

chunk_size = 1024
with open(fileName, 'wb') as fd:
    while True:
        data = r.raw.read(chunk_size)
        if not data:
            break
        fd.write(data)

Version 1.2b, where I add a variable rr = r.raw, to see if this helps:

chunk_size = 1024
rr = r.raw
with open(fileName, 'wb') as fd:
    while True:
        data = rr.read(chunk_size)
        if not data:
            break
        fd.write(data)

Version 1.3 (stream)

This simpler code handles the stream slightly differently. I made 4 variants of this one.

Version 1.3a is the simplest:

with open(fileName, 'wb') as fd:
    for data in r.raw.stream(decode_content=False):
        fd.write(data)

In version 1.3b, I add the additional variable (like I did in version 1.2b):

rr = r.raw
with open(fileName, 'wb') as fd:
    for data in rr.stream(decode_content=False):
        fd.write(data)

In version 1.3c, I try tuning the chunk size:

chunk_size = 1024
rr = r.raw
with open(fileName, 'wb') as fd:
    for data in rr.stream(chunk_size, decode_content=False):
        fd.write(data)

In version 1.3d, I try tuning the chunk size but do not use the intermediate variable:

chunk_size = 1024
with open(fileName, 'wb') as fd:
    for data in r.raw.stream(chunk_size, decode_content=False):
        fd.write(data)

Version 1.4 (shutil)

This is the last code version. It is very simple but requires an additional library, shutil.

I made 4 variants of this code. Version 1.4a is the simplest:

import shutil

with open(fileName, 'wb') as fd:
    shutil.copyfileobj(r.raw, fd)

Version 1.4b uses the additional variable:

rr = r.raw
with open(fileName, 'wb') as fd:
    shutil.copyfileobj(rr, fd)

Version 1.4c adds chunk size tuning:

chunk_size = 1024
rr = r.raw
with open(fileName, 'wb') as fd:
    shutil.copyfileobj(rr, fd, chunk_size)

Version 1.4d uses chunk size tuning but dispenses with the additional variable:

chunk_size = 1024
with open(fileName, 'wb') as fd:
    shutil.copyfileobj(r.raw, fd, chunk_size)

Using the urllib3 library

As Requests is built on top of urllib3, I expected to get better performance by using urllib3 directly, thus avoiding the overhead of Requests.

Certificate verification

urllib3 generates an "InsecureRequestWarning" if a request is made without certificate verification enabled:

C:\Python\Anaconda\lib\site-packages\urllib3\connectionpool.py:852:
InsecureRequestWarning: Unverified HTTPS request is being made.
Adding certificate verification is strongly advised.
See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

To avoid this warning, one might be tempted to simply disable warnings before the request:

urllib3.disable_warnings()
http = urllib3.PoolManager(timeout=urllib3.util.Timeout(connect=60.0, read=60.0))

This is not recommended: it is insecure!

Instead, follow best practice and verify the certificates, as illustrated below.

Requesting the data

We require imports for urllib3, as well as certifi for certificate verification:

import urllib3
import certifi

The code to define the URL and headers is the same as before:

requestUrl = "https://selectapi.datascope.refinitiv.com/RestApi/v1/Extractions/RawExtractionResults" + "('" + jobId + "')" + "/$value"

requestHeaders = {
    "Prefer": "respond-async",
    "Content-Type": "text/plain",
    "Accept-Encoding": "gzip",
    "Authorization": "token " + token
}

After that I define a pool manager that verifies certificates, and then execute the request:

http = urllib3.PoolManager(timeout=urllib3.util.Timeout(connect=60.0, read=60.0), cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

r = http.request("GET", requestUrl, headers=requestHeaders, preload_content=False, decode_content=False)

Note: the last parameter, decode_content=False, ensures that data is not automatically decompressed on the fly.

Saving the data

Let us investigate several possible code versions.

Version 2.1

fd = open(fileName, 'wb')
fd.write(r.data)
fd.close()
r.release_conn()

Like version 1.1, this simple code is a very bad choice, illustrating what you must avoid! It first pulls all the data into RAM, and only then writes it to disk. Due to its high memory usage and poor performance, I did not keep this version. The next version does not load everything into RAM, but saves the data on the fly.

Version 2.2 (urllib3)

This version is similar to version 1.2, and also uses chunk size tuning:

chunk_size = 1024
with open(fileName, 'wb') as fd:
    while True:
        data = r.read(chunk_size)
        if not data:
            break
        fd.write(data)
r.release_conn()

Test methodology

I ran my tests on Windows 7 with Python 3.6.0, requests 2.12.4 and urllib3 1.21.1.

To test performance I made an On Demand extraction request, and then retrieved the exact same extraction results repeatedly, measuring the time it took to download and save the file. The file was 323 MB compressed and 5.6 GB after manual decompression, and contained slightly more than 44 million lines.

Initial tests showed the download time varied quite a bit between consecutive downloads, and also varied depending on the time of the day.

To even out these effects, I built code that ran a sequence of one download per code variant, and ran the entire sequence 11 times in a row; this took 26 hours to complete. Another test run with 6 iterations ran from 11:50 till 01:55. Each download time was logged. I tested chunk sizes of 128, 1024 and 8192, and in one case even 32768.
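
The timing harness itself can be sketched as follows; download_variants is a hypothetical mapping from a label (code version plus chunk size) to a function that requests and saves the file using one of the code versions above:

import time

results = {}
for iteration in range(11):
    for label, download in download_variants.items():
        start = time.perf_counter()
        download()  # request and save the extraction file
        elapsed = time.perf_counter() - start
        results.setdefault(label, []).append(elapsed)

# Average download time per code variant
averages = {label: sum(times) / len(times) for label, times in results.items()}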

After that I analyzed the results, calculating the average download time for each code variant, looking for performance differences between code versions and for potential evidence of successful code optimization.

I started testing before AWS direct download was available. After it was released I decided not to use it, based on the assumption that AWS usage would not impact the request and code optimizations discussed here.

Test results

Here is the summary of one of my tests. It covers 459 downloads, all made during work days, using a sequence of 27 code versions (including variants with different chunk sizes) and repeating the entire sequence 17 times. It includes measurements made at all times of the day and night.

Some numbers:

  • Shortest of all download times: 2’15” (2 minutes 15 seconds)
  • Longest of all download times: 8’34” (excluding one outlier at 12’54”)
  • Average of all download times: 5’06”

Code versions

A few numbers:

  • Average download time for the slowest versions: 5’20”
  • Average download time for the fastest versions: 4’55” (a gain of ~8%)
  • Average download time variation for a single code version: +-38%

These numbers must be taken with a lot of caution!

The results varied a lot, even for a single code version; there is obviously a big influence of external factors we cannot control or measure (internet speed, server load). Even with 17 measurements per code version, the data sample was not sufficient to arrive at definitive conclusions. It was very difficult to detect trends, and impossible to clearly sort all the code variants by performance.

Nevertheless, the results seemed to show the following:

  • Contrary to what I expected, the additional variable (rr = r.raw) is not a recipe for enhanced performance. In most code versions it has no influence, and for 1.3c (stream) it slightly decreased performance compared to 1.3d.
  • Versions 1.3a and 1.3c (stream) had the worst performance.
  • Versions 2.2 (urllib3), 1.3d, 1.4c and 1.4d (shutil) delivered the best performance.
  • Using the lower level urllib3 seems to deliver a very slight advantage, but it is not really significant.
  • As expected, chunk size tuning seems to have an influence, but the best chunk size value depends on the code variant. Whereas for versions 2.2 (urllib3) and 1.4d (shutil) the best performance was with a chunk size between 128 and 1024, a chunk size of 1024 seems slightly better for versions 1.3d (stream) and 1.4c (shutil).

So, which code versions performed best?

Average download time (17 iterations) | Version       | Chunk size | Comment
4’45”                                 | 2.2 (urllib3) | 128        | In a previous run, 2.2 was slightly slower than 1.4c, and performed slightly better with chunk size 1024
4’56”                                 | 1.4c (shutil) | 1024       |
4’58”                                 | 1.3d (stream) | 1024       |
4’59”                                 | 1.4d (shutil) | 128        | Chunk size 1024: nearly the same result

Considering our sample size, I suspected there would not be a statistically significant difference between these 4 versions, as the download times were practically the same. To check this I ran another test over 24 hours, cycling through these 4 methods again and again, and logged the download times. Results after 65 iterations:

Average download time (65 iterations) | Version       | Chunk size | Comment
5’15”                                 | 2.2 (urllib3) | 128        |
5’18”                                 | 1.4c (shutil) | 1024       |
5’22”                                 | 1.3d (stream) | 1024       |
5’15”                                 | 1.4d (shutil) | 128        | This time it delivered the best results

In this last test run the average download time was 5’18”, slightly higher than in the previous test, with a variation of +- 1% depending on the code version.

In the long run I would expect these 4 versions to exhibit very similar download times, and to perform approximately 5% faster than the worst performing versions. Not a big gain, and certainly less than I had expected, but it is better than nothing.

Time of day

Using the data from the last test run (the 24-hour test with the 4 best code variants), I charted the average download time by time of day:

  • Shortest download times occur in these ranges: 00:00 - 08:00 and 14:00 - 24:00 GMT
  • Downloads take ~20% longer in this range: 08:00 - 14:00 GMT
  • Note: in the previous test run, downloads were longer between 07:00 and 15:00 GMT

This analysis seems to indicate that the load of the internet itself plays a significant role, greater than that of code optimization!

So, if you do not have stringent timing requirements for receiving your data, you can decrease the download time by requesting data during off-peak times of the day.

Conclusions

After running more than 900 downloads, I found that external factors overwhelmed the measured numbers, making it difficult to compare the performance of code versions.

It seems that through pure Python code tweaking a performance increase of a few percent can be gained. The best download times were delivered by code versions 2.2 (urllib3, chunk size 128), 1.4c (shutil, chunk size 1024), 1.4d (shutil, chunk size 128), and 1.3d (stream, chunk size 1024).

The most interesting result was that downloading data between 08:00 and 14:00 GMT carries a download time penalty of approximately 20%.

I believe the same conclusions should apply to direct AWS downloads, though such tests were not made for this article.

To conclude, to get the best download performance you should combine as many of the techniques discussed in this article as possible:

  1. Download results directly from AWS
  2. Avoid downloading between 08:00 and 14:00 GMT
  3. Optimize your data requests
  4. Optimize your download code

Here I have listed them in order of decreasing effectiveness.