Module 4 - Assignment 1

Data organization in spreadsheets

This course introduces learners to tools and workflows for data science with R. Learners are also introduced to the concept of collaborative writing and coding using git and GitHub within the context of reproducible documents (i.e. Quarto). So far we have used data that is well structured and ready to be used. However, in reality a lot of data entry and storage is still managed in spreadsheets. This is why we also touch on some (research) data management topics (Data Organization in Spreadsheet).

The reading for this assignment provides guidance for data entry and storage aspects. It offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses.

Task 1: Read and prepare examples

For this assignment, we ask you to:

  1. Read Broman and Woo (2018): “Data organization in spreadsheets”.
  2. Chose two of the recommendations and come up with real-world examples or scenarios where the recommendations could be applied in your work.
  3. Be prepared to share these examples and explain how the recommendations would improve your workflows. This will be in a class setting as part of small discussion group (max 3 people).

Task 2: Apply the recommendations to your samples data from Module 3

Have you completed the assignments for module 3?

A pre-requisite for this homework is that you worked through the “Spreadsheet” assignment of module 3. If you have not done so, please do this first. https://ds4owd-001.github.io/website/assignments/md-03/am-03-2-spreadsheet.html

  1. Open the ds4owd workspace on posit.cloud
  2. Re-open your samples-USERNAME repository.
  3. Create a new .R file and save it as data_cleaning.R.
  4. Add library(tidyverse) to the top of the file.
  5. Add library(readxl) to the top of the file.
  6. Add library(janitor) to the top of the file.
  7. Use the read_excel() function to read in your the .xlsx file and store it in an object called samples.
  8. Use the glimpse() function to inspect the data.
  9. Try to use R functions to apply the recommendations from the reading to your data.
Ask questions on our Element Chat

If you would like support at different points of your data cleaning process, please ask questions on our Element Chat, we are happy to support.

Example

The following is an example of a dataset that follows the recommendations from the reading.

id date_sample system location users ts
1 2023-11-01 pit latrine household 5 136.24
2 2023-11-01 pit latrine household 7 102.45
3 2023-11-01 pit latrine household NA 57.02
4 2023-11-01 pit latrine household 6 27.03
5 2023-11-01 pit latrine household 12 97.27
6 2023-11-02 septic tank household 7 78.21
7 2023-11-02 septic tank household 14 15.24
8 2023-11-02 septic tank household 4 29.39
9 2023-11-02 septic tank household 10 64.22
10 2023-11-02 septic tank household 12 8.01
11 2023-11-03 pit latrine public toilet 50 11.24
12 2023-11-03 pit latrine public toilet 32 84.05
13 2023-11-03 pit latrine public toilet 41 55.92
14 2023-11-03 pit latrine public toilet 160 15.32
15 2023-11-03 pit latrine public toilet 20 22.65
16 2023-11-04 septic tank public toilet 26 8.72
17 2023-11-04 septic tank public toilet 91 43.92
18 2023-11-04 septic tank public toilet 68 10.37
19 2023-11-04 septic tank public toilet 112 23.21
20 2023-11-04 septic tank public toilet 59 15.64

References

Broman, Karl W., and Kara H. Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.