Discussion questions
The “Workflow Specification” step of the audited data pipeline involves defining a set of automated checks against a specification, for example checking that values fall within a plausible range (excluding people recorded as 127 years old) or looking for NaN values. Generate a list of possible anomalies that could arise in data and describe how you would automate the detection of each (e.g., setting a range of allowable values such as 18 < participantAge < 90). Feel free to examine data you are currently working with for inspiration. Are there existing tools (e.g., Python or R libraries) for implementing this automation?
How can steps that you take in your data cleansing pipeline potentially harm the mapping between the “real world” and your data table? Provide a specific example to illustrate your answer.
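As a starting point for the first question, here is a minimal sketch of two automated checks in plain Python: a missing-value check and the range check 18 < participantAge < 90 mentioned above. The function name `check_age`, the anomaly labels, and the sample rows are hypothetical and chosen only for illustration; a real pipeline would likely cover more columns and more kinds of anomaly.

```python
import math


def check_age(value, low=18, high=90):
    """Return a list of anomaly labels for one value (an empty list means it passed).

    The default bounds mirror the example rule 18 < participantAge < 90.
    """
    # Treat both None and float NaN as missing.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ["missing"]
    # Strict inequalities: the boundary values 18 and 90 are themselves flagged.
    if not (low < value < high):
        return ["out_of_range"]
    return []


# Hypothetical sample data, invented for this sketch.
rows = [
    {"participantAge": 25},
    {"participantAge": 127},
    {"participantAge": float("nan")},
    {"participantAge": 18},
    {"participantAge": 64},
    {"participantAge": None},
]

# Map each row index to the problems found in it, keeping only rows with problems.
report = {i: check_age(row.get("participantAge")) for i, row in enumerate(rows)}
flagged = {i: problems for i, problems in report.items() if problems}
print(flagged)
```

Note that the sketch only flags rows and does not delete or alter them. That choice is relevant to the second question: a rule that silently drops every flagged row would also remove a participant who really is 18, which changes what the table says about the real world.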