5 Introduction

This week explores the cultural, ethical, and critical challenges posed by data artefacts and data-intensive scientific processes. Engaging with Critical Data Studies, we discuss issues around data capture, curation, data quality, inclusion/exclusion and representativeness. The session also discusses the different kinds of data that one can encounter across disciplines, the underlying characteristics of data and how we can analytically and practically approach data quality issues and the challenge of identifying and curating appropriate data sets.

The practical lab session, outlined in Section 5.2, covers the earlier stages of the data science process: identifying the types of data suitable for analysis and wrangling them so that they are ready for further use.

5.1 Highlights of the lecture

Some of the key concepts you should remember from this week are:

Thinking broadly on and with data – ethical considerations, potential biases
Data type taxonomies with respect to sources of data, scales of measurement, and common interpretations of data: tabular, temporal, spatial, textual, network
Data wrangling: concepts and common methods, missing values and data transformation approaches
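As a rough illustration of the wrangling concepts listed above, the following sketch shows three common ways of treating missing values (counting, dropping, imputing) followed by two simple transformations (log scaling and z-score standardisation). The data frame and its column names (`area`, `population`, `income`) are invented for illustration only and do not come from the lecture material; the choice of median and mean imputation is likewise just one option among many.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "area": ["A", "B", "C", "D"],
    "population": [800_000, np.nan, 100_000, 260_000],
    "income": [30_000.0, 28_000.0, np.nan, 25_000.0],
})

# 1. Inspect: how many values are missing in each column?
missing_per_column = df.isna().sum()

# 2. Option one: drop every row that has at least one missing value.
dropped = df.dropna()

# 3. Option two: impute, here with the column median or mean.
imputed = df.fillna({
    "population": df["population"].median(),
    "income": df["income"].mean(),
})

# 4. Transform: log-scale a skewed column and standardise another.
imputed["log_population"] = np.log10(imputed["population"])
imputed["income_z"] = (
    (imputed["income"] - imputed["income"].mean()) / imputed["income"].std()
)

print(missing_per_column)
print(imputed)
```

Dropping rows is simple but discards information, while imputation keeps every row at the cost of introducing estimated values; which is appropriate depends on why the values are missing.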

5.2 Practical Lab Session

The practical lab session walks you through the earlier stages of the data science process. We start by looking at different types of data suitable for analysis within a data science framework and move on to how to wrangle the data to make it available for further use.

At the end of the session, you should be able to:

Perform basic data manipulation and data merging tasks with NumPy and pandas
Identify and address data quality issues
Use elementary data visualisation tools to help you interrogate the data
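To give a flavour of the first two outcomes above, here is a minimal sketch of a merge combined with some basic data quality checks in pandas. The two tables, their column names (`site_id`, `reading`, `region`) and the rule that readings must be non-negative are all assumptions made for this example; they are not taken from the lab notebook.

```python
import pandas as pd

# Hypothetical tables; names and values are invented for illustration.
measurements = pd.DataFrame({
    "site_id": [1, 2, 2, 3, 5],
    "reading": [0.4, 1.2, 1.2, -9.0, 0.7],
})
sites = pd.DataFrame({
    "site_id": [1, 2, 3, 4],
    "region": ["north", "south", "south", "east"],
})

# Quality check 1: count and remove exact duplicate rows.
n_duplicates = int(measurements.duplicated().sum())
deduplicated = measurements.drop_duplicates()

# Quality check 2: filter implausible values
# (assumed rule: a reading cannot be negative).
valid = deduplicated[deduplicated["reading"] >= 0]

# Merge on the shared key. An outer join with indicator=True adds a
# "_merge" column showing which table each row came from.
merged = valid.merge(sites, on="site_id", how="outer", indicator=True)

# Quality check 3: rows without a match in the other table.
unmatched = merged[merged["_merge"] != "both"]

print(merged)
print(unmatched)
```

Inspecting the `_merge` column is one simple way to spot records that silently fail to match, which is itself a question of inclusion and exclusion in the resulting data set.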

5.3 Independent learning & Reading lists

5.3.1 Required reading

Thinking comprehensively on Data and Data Science – Data Feminism by Catherine D’Ignazio and Lauren Klein: You can read the Introduction and Chapter 1 in relation to this week's discussions: https://mitpressonpubpub.mitpress.mit.edu/data-feminism
Watch: Databite No. 131: Data Feminism by Catherine D’Ignazio and Lauren F. Klein: https://youtu.be/Su3vIF5P06M
On data wrangling: Kandel, Sean, et al. “Research directions in data wrangling: Visualizations and transformations for usable and credible data.” Information Visualization 10.4 (2011): 271-288.
On biases in data/modelling workflows: Suresh, H. and Guttag, J.V., 2019. A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002.

5.3.2 Background reading

On missing value removal (accessible through library resources): Acuna, Edgar, and Caroline Rodriguez. “The treatment of missing values and its effect on classifier accuracy.” Classification, Clustering, and Data Mining Applications. Springer Berlin Heidelberg, 2004. 639-647.
Saar-Tsechansky, Maytal, and Foster Provost. “Handling Missing Values when Applying Classification Models.” Journal of Machine Learning Research 8 (2007): 1625-1657.