***********
Description
***********

A Datatool is a containerized, Python-based utility that extracts data from a given source dataset and translates both unstructured data (images, audio, video files, etc.) and semi-structured data (CSV, JSON, TXT, XML, XLS, etc.) into a common representation. It also enables loading the translated data into Python data structures through a standard API. By performing this ETL (Extract, Transform and Load) step, the datatool generates a standard data structure and definitions for the annotations, which data scientists and analysts then consume for purposes such as data analysis and predictive modeling.

.. image:: ../assets/Datatool-workflow.png
   :alt: Datatool workflow

The purpose of this document is to give an overview of the datatool workflow for obtaining a structured dataset from raw data.

*************
Prerequisites
*************

To complete this workflow:

- Make sure you have set up the local environment as explained in :doc:`/pages/setup`.
- Have Docker 19.03 installed in your local environment.

********
Overview
********

A Datatool workflow is as follows:

**1. Documentation**: Overall documentation, including example samples and statistics:

- Documentation (README.md) with example samples and annotations.
- Exploratory Data Analysis (EDA) report containing the annotation statistics, trends and interactions.

**2. Main Interface**: Process the source input data and create the final standardized dataset:

- Script *datatool.py* to process source input data and generate standardized output.
- Script *datatool_patch.py* to apply any available post-processing (cleaning, geometric transformations, etc.) to the generated datatool output.
- Dependencies: the following files list the Python modules needed by the datatool.

  - datatool_api/deps/requirements.txt
  - common/deps/requirements.txt
  - deps/requirements.txt

**3. Exploratory Data Analysis**: Create or re-create an Exploratory Data Analysis report on the datatool output to get statistical insights:

- Script *create_EDA_report.sh* to download and run the EDA report tool on the datatool output and generate the EDA report.

**4. Visualize Annotations on Datatool Output**: Draw samples at random from the datatool output and visualize their annotations. This is useful for validating and verifying the annotations:

- Script *visualize_annotations.py* to visualize annotations on randomly drawn samples from the datatool output.
- Dependencies: *visualize_annotations/requirements.txt* lists the Python modules needed by the script.

**5. Data Loader Example**: Example of how to load the datatool output for model training and validation using the datatool API:

- Script *example_dataloaders/example_dataloader_pytorch.py* provides a data loader example for PyTorch using the datatool API.

******************
Datatool Workflows
******************

Currently, two Datatool examples are available:

- `COCO Whole Body Data Tool `__
- `FER2013 `__
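
To make the extract-transform-load idea from the description more concrete, the following is a minimal sketch of translating two differently shaped annotation sources (CSV and JSON) into one common representation. It is not the datatool API: the ``Sample`` class, its fields, the ``from_csv`` and ``from_json`` helpers, and the inline source data are all hypothetical and chosen only for illustration.

```python
import csv
import io
import json
from dataclasses import dataclass, asdict
from typing import List


# Hypothetical common representation; the real datatool schema may differ.
@dataclass
class Sample:
    sample_id: str
    file_name: str
    label: str


def from_csv(text: str) -> List[Sample]:
    """Translate CSV rows with columns id, file, label (assumed layout)."""
    reader = csv.DictReader(io.StringIO(text))
    return [Sample(row["id"], row["file"], row["label"]) for row in reader]


def from_json(text: str) -> List[Sample]:
    """Translate a JSON list with keys id, image, category (assumed layout)."""
    return [
        Sample(str(item["id"]), item["image"], item["category"])
        for item in json.loads(text)
    ]


# Made-up inline sources standing in for raw annotation files.
csv_source = "id,file,label\n1,a.jpg,happy\n2,b.jpg,sad\n"
json_source = '[{"id": 3, "image": "c.jpg", "category": "neutral"}]'

# After translation, both sources share one structure.
dataset = from_csv(csv_source) + from_json(json_source)
for sample in dataset:
    print(asdict(sample))
```

Once every source is expressed in the same structure, downstream steps such as statistics, visualization, or a data loader only need to understand that one structure.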