***********
Description
***********
A Datatool is a containerized, Python-based utility that extracts data from a given source dataset and translates both unstructured data (images, audio, video files, etc.) and semi-structured data (CSV, JSON, TXT, XML, XLS, etc.) into a common representation. It also enables loading the translated data into Python data structures through a standard API.

By performing this ETL (Extract, Transform, and Load) step, the datatool generates a standard data structure and definitions for the annotations, which data scientists and analysts then consume for purposes such as data analysis and predictive modeling.
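
As a rough illustration of the ETL idea, the sketch below extracts rows from a small CSV source, translates them into one common record type, and loads them into a Python dictionary. The ``Sample`` class, its fields, and the CSV column names are hypothetical and are not part of the datatool API.

.. code-block:: python

  import csv
  import io
  from dataclasses import dataclass, field


  @dataclass
  class Sample:
      """Hypothetical common representation of one annotated sample."""
      sample_id: str
      media_path: str
      annotations: dict = field(default_factory=dict)


  def extract(csv_text):
      """Extract: read raw rows from a CSV source."""
      return list(csv.DictReader(io.StringIO(csv_text)))


  def transform(rows):
      """Transform: map source-specific columns onto the common record."""
      return [
          Sample(
              sample_id=row["id"],
              media_path=row["file"],
              annotations={"label": row["label"]},
          )
          for row in rows
      ]


  def load(samples):
      """Load: index the translated samples by ID for later use."""
      return {sample.sample_id: sample for sample in samples}


  # Made-up source data, for illustration only.
  raw = "id,file,label\n1,images/a.png,happy\n2,images/b.png,sad\n"
  dataset = load(transform(extract(raw)))
  print(dataset["2"].annotations["label"])

Keeping the three stages as separate functions means a different source format would only need a new ``extract`` and ``transform``, while the loaded structure stays the same.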

.. image:: ../assets/Datatool-workflow.png
  :alt: Datatool workflow

The purpose of this document is to give an overview of the datatool workflow for obtaining a structured dataset from raw data.

*************
Prerequisites
*************
To be able to complete this workflow:

- Set up the local environment as explained in :doc:`/pages/setup`.
- Install Docker 19.03 in your local environment.

********
Overview
********
A Datatool workflow is as follows:

**1. Documentation**: Overall documentation including the example samples and statistics:

- Documentation (README.md) with example samples and annotations.
- Exploratory Data Analysis (EDA) report containing the annotation statistics, trends, and interactions.

**2. Main Interface**: Process the source input data and create the final standardized dataset:

- Script *datatool.py* to process source input data and generate standardized output.
- Script *datatool_patch.py* to apply any available post-processing (cleaning, geometric transformations, etc.) to the generated datatool output.
- Dependencies: the Python modules needed by the datatool are listed in the following files:

  - datatool_api/deps/requirements.txt
  - common/deps/requirements.txt
  - deps/requirements.txt
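
One way to install all three dependency lists is to pass each file to ``pip`` with its own ``-r`` option. The helper below only builds that command. The function name, and the assumption that the paths listed above are relative to the datatool root, are illustrative.

.. code-block:: python

  import sys
  from pathlib import Path

  # Requirement files listed above, assumed relative to the datatool root.
  REQUIREMENT_FILES = [
      "datatool_api/deps/requirements.txt",
      "common/deps/requirements.txt",
      "deps/requirements.txt",
  ]


  def build_pip_command(root="."):
      """Return a pip command that installs every requirements file under root."""
      command = [sys.executable, "-m", "pip", "install"]
      for relative_path in REQUIREMENT_FILES:
          command += ["-r", str(Path(root) / relative_path)]
      return command


  print(" ".join(build_pip_command()))

The returned list can be passed to ``subprocess.run(command, check=True)`` to perform the installation.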
   

**3. Exploratory Data Analysis**: Create or re-create an Exploratory Data Analysis report on the datatool output to gain statistical insights:

- Script *create_EDA_report.sh* to download the EDA report tool, run it on the datatool output, and generate the EDA report.

**4. Visualize Annotations on Datatool Output**: Draw samples at random from the datatool output and visualize their annotations. This is useful for validating and verifying the annotations:

- Script *visualize_annotations.py* to visualize annotations on randomly drawn samples from datatool output.
- Dependencies: *visualize_annotations/requirements.txt* lists the Python modules needed by the script.
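
Drawing samples at random can be sketched with the standard library as follows. This is not the code of *visualize_annotations.py*; the function name, the ``seed`` parameter, and the sample IDs are hypothetical.

.. code-block:: python

  import random


  def draw_random_samples(sample_ids, count, seed=None):
      """Pick up to ``count`` distinct sample IDs, reproducibly when seed is set."""
      rng = random.Random(seed)
      # Never ask for more samples than are available.
      count = min(count, len(sample_ids))
      return rng.sample(list(sample_ids), count)


  # Made-up sample IDs, for illustration only.
  ids = [f"sample_{i:03d}" for i in range(100)]
  picked = draw_random_samples(ids, count=5, seed=0)
  print(picked)

Fixing the seed makes a visual check repeatable, so the same samples can be inspected again after the annotations are patched.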

**5. Data Loader Example**: Example of how to easily load the datatool output for model training/validation using the datatool API:

- Script *example_dataloaders/example_dataloader_pytorch.py* provides a data loader example for PyTorch using the datatool API.
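
The general shape of such a loader is a map-style dataset, that is, an object with ``__len__`` and ``__getitem__``. The sketch below is not the contents of the example script and does not use the datatool API. The class name and record layout are hypothetical, and PyTorch is deliberately not imported so that the snippet runs on its own.

.. code-block:: python

  class AnnotatedSamples:
      """Hypothetical map-style dataset over loaded (path, label) records."""

      def __init__(self, records, label_names):
          self.records = list(records)
          # Map each label name to an integer class index.
          self.label_to_index = {name: i for i, name in enumerate(label_names)}

      def __len__(self):
          return len(self.records)

      def __getitem__(self, index):
          path, label = self.records[index]
          return path, self.label_to_index[label]


  # Made-up records, for illustration only.
  records = [("images/a.png", "happy"), ("images/b.png", "sad")]
  dataset = AnnotatedSamples(records, label_names=["happy", "sad"])
  print(len(dataset), dataset[1])

With PyTorch installed, an object of this shape should be usable with ``torch.utils.data.DataLoader``. A real loader would also decode the media file into a tensor inside ``__getitem__``.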

******************
Datatool Workflows
******************

Currently, two Datatool examples are available:

- `COCO Whole Body Data Tool <https://gitlab.com/bonseyes/artifacts/data_tools/examples/example_datatool_coco_wholebody/-/blob/master/README.md>`__ 

- `FER2013 <https://gitlab.com/bonseyes/artifacts/data_tools/examples/example_datatool_fer2013>`__