Artificial Intelligence - Data Engineering


One of the deliverables of the business understanding phase is the minimum data schema that the AI project needs in order to solve the problem at hand. Once the schema is defined, the team needs to design and implement a data strategy that allows it to source the appropriate data from its various locations, explore the initial data space and apply some initial preparation to ensure a high-quality dataset. This is precisely the scope of the data engineering phase, the aspect of data science in which an AI data engineer deals with the initial collection and analysis of raw datasets. An AI data engineer builds and maintains pipeline systems that ingest and synchronise data from disparate endpoints and then clean and prepare the data for the next phases of the AI pipeline.

Data ingestion

These days companies focus on gathering both structured and unstructured data. There are many sources of data, including databases, APIs, websites, social media, IoT devices, sensors, blogs, emails and more. During this step of the data engineering phase, the main task of the AI data engineer is to gather the data from all the disparate sources and store it in a single data store. Refinitiv already provides a single source of many different industry datasets, and by using the available APIs an AI team can quickly tap into a rich ecosystem of well-structured data, a result of the complex preprocessing that the product has already applied to the various datasets. We will be exploring detailed ways of specifying, ingesting and structuring data from the platform using its various available APIs.
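The idea of gathering disparate sources into a single data store can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's API: the two in-memory payloads, the `prices` table and every function name are hypothetical, and an in-memory SQLite database stands in for the central store.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw payloads standing in for two disparate sources:
# a CSV export and a JSON API response. Names and values are illustrative only.
CSV_SOURCE = "instrument,date,close\nABC,2021-01-04,10.5\nABC,2021-01-05,10.7\n"
JSON_SOURCE = '[{"instrument": "XYZ", "date": "2021-01-04", "close": 99.1}]'


def read_csv_rows(text):
    """Yield (instrument, date, close) tuples from CSV text."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row["instrument"], row["date"], float(row["close"])


def read_json_rows(text):
    """Yield (instrument, date, close) tuples from a JSON array of objects."""
    for item in json.loads(text):
        yield item["instrument"], item["date"], float(item["close"])


def ingest(connection, sources):
    """Load every source into one table, tagging each row with its origin."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "instrument TEXT, date TEXT, close REAL, source TEXT)"
    )
    for name, rows in sources.items():
        connection.executemany(
            "INSERT INTO prices VALUES (?, ?, ?, ?)",
            [(inst, day, close, name) for inst, day, close in rows],
        )
    connection.commit()


def build_store():
    """Create an in-memory store and ingest both sample sources."""
    connection = sqlite3.connect(":memory:")
    ingest(connection, {
        "csv_export": read_csv_rows(CSV_SOURCE),
        "json_api": read_json_rows(JSON_SOURCE),
    })
    return connection


if __name__ == "__main__":
    store = build_store()
    for row in store.execute("SELECT * FROM prices ORDER BY instrument, date"):
        print(row)
```

Each reader converts its source into the same tuple shape, so adding a new source only requires a new reader. Keeping the `source` column makes it possible to trace quality problems back to their origin during exploration.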

Data exploration

Once the team has a central storage area, the AI data engineer is ready for an initial exploration of the data, which allows the team to reach conclusions about the quality and quantity of the data as well as the distributions available within the parameter spaces. Data consistency and clarity are also analysed. During this phase, the AI data engineer will also uncover any problems that the datasets might have, and the team can start devising strategies for solving or smoothing them out. For example, when working with time series the team might be interested in stationarity, so it might test for it and, if the series turns out to be non-stationary, conclude that differencing operations are required.
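The time series example can be sketched with the standard library alone. The series here is synthetic, all function names are hypothetical, and `half_mean_shift` is only a crude drift indicator assumed for illustration; in practice a formal test such as the augmented Dickey-Fuller test would normally be used to check for stationarity.

```python
import random
import statistics


def difference(series, lag=1):
    """Return the lag-differenced series: x[t] - x[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def half_mean_shift(series):
    """Crude drift indicator: gap between the means of the two halves,
    scaled by the overall standard deviation. Not a formal test."""
    mid = len(series) // 2
    spread = statistics.pstdev(series)
    if spread == 0:
        return 0.0
    gap = statistics.fmean(series[mid:]) - statistics.fmean(series[:mid])
    return abs(gap) / spread


def make_trending_series(n=200, slope=0.5, seed=0):
    """Synthetic series: linear trend plus Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    return [slope * t + rng.gauss(0, 1) for t in range(n)]


if __name__ == "__main__":
    raw = make_trending_series()
    diffed = difference(raw)
    print("shift before differencing:", round(half_mean_shift(raw), 3))
    print("shift after differencing: ", round(half_mean_shift(diffed), 3))
```

A trending series shows a large shift between the means of its two halves, while its first difference fluctuates around a constant level, which is the behaviour differencing is meant to produce.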

Data preparation

After the initial data exploration, the AI data engineer is ready to apply initial data cleansing and synchronisation methodologies to the raw data. There is a multitude of techniques that can be used to enhance the quality of the available dataset and prepare it for the next phase of the Artificial Intelligence pipeline, the feature engineering phase. Cleansing techniques during this phase target problems in the datasets that can include:

  • Typos and taxonomy problems
  • Data conversions
  • Data synchronisations
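
The three problem types listed above can be illustrated with a short sketch. Everything here is an assumption made for the example: the sector taxonomy map, the day/month/year date format, the record fields and the function names are all hypothetical.

```python
from datetime import datetime

# Hypothetical taxonomy map: known misspellings and variants -> canonical label.
CANONICAL_SECTORS = {
    "tech": "Technology",
    "technology": "Technology",
    "technolgy": "Technology",  # example typo
    "fin": "Financials",
    "financials": "Financials",
}


def normalise_sector(label):
    """Map a free-text sector label onto the canonical taxonomy (None if unknown)."""
    return CANONICAL_SECTORS.get(label.strip().lower())


def convert_record(record):
    """Convert string fields to typed values: a date object and a float price."""
    return {
        "date": datetime.strptime(record["date"], "%d/%m/%Y").date(),
        "price": float(record["price"].replace(",", "")),
        "sector": normalise_sector(record["sector"]),
    }


def synchronise(left, right):
    """Align two {key: value} series on the keys they share, in sorted order."""
    shared = sorted(left.keys() & right.keys())
    return [(key, left[key], right[key]) for key in shared]


if __name__ == "__main__":
    raw = {"date": "04/01/2021", "price": "1,234.50", "sector": " Technolgy "}
    print(convert_record(raw))
    print(synchronise({1: "a", 2: "b"}, {2: "c", 3: "d"}))
```

Returning `None` for an unknown sector label, rather than guessing, keeps unmapped values visible so that the taxonomy map can be extended deliberately.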