Data science lifecycle & Exploratory data analysis using visualization

ds4owd - data science for openwashdata

Lars Schöbitz

2023-11-07

Q: How do I successfully complete the course?

You complete the course and receive a certificate of completion if you:

  • hand in a complete capstone project report that uses a dataset of your choice by 30 January 2024 (instructions will follow)

This is the only requirement to successfully complete the course, independent of how many classes you attended or how many homework assignments you completed.

Solving coding problems

Tips for search engines

  • Use actionable verbs that describe what you want to do
  • Be specific
  • Add R to the search query
  • Add the name of the R package to the search query
  • Scroll through the top 5 results (don’t just pick the first)

Example: “How to remove a legend from a plot in R ggplot2”
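For illustration, a search like the example above typically leads to an answer along these lines. This is a minimal sketch, assuming the ggplot2 package is installed; the data frame `df` and its values are made up for the example and are not course data.

```r
library(ggplot2)

# Small illustrative data frame (hypothetical values, not course data)
df <- data.frame(
  group = c("a", "b", "c"),
  value = c(3, 5, 2)
)

# Mapping fill to a variable adds a legend by default
p <- ggplot(data = df,
            mapping = aes(x = group, y = value, fill = group)) +
  geom_col()

# Setting legend.position to "none" in theme() removes the legend
p_no_legend <- p + theme(legend.position = "none")
p_no_legend
```

Other approaches exist (for example hiding the legend for a single layer), which is one reason to read more than the first search result.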

Stack Overflow

What is it?

  • The biggest support network for (coding) problems
  • Can be intimidating at first
  • Up-vote system

Workflow

  • First, briefly read the question that was posted
  • Then, read the answer marked as “correct”
  • Then, read one or two more answers with high votes
  • Then, check out the “Linked” posts
  • Always give credit for the solution

Tips for AI tools

  • Use actionable verbs that describe what you want to do
  • Be specific
  • Add R to the prompt
  • Add the name of the R package to the prompt

Example: “How to remove a legend from a plot in R ggplot2”

Other sources for help

Homework assignment module 1

on GitHub Organisation

Bookmark this link in your browser!

on your repository

on Posit Cloud

Bookmark this link in your browser!

on Posit Cloud

Version Control - Terminology

remember: git commit

remember: git push

collaborate: git clone

track work: git commit

update: git ???

update: git push

git ???

new: git pull

Learning Objectives (for this week)

  1. Learners can list the six elements of the data science lifecycle.
  2. Learners can describe the four main aesthetic mappings that can be used to visualise data using the ggplot2 R package.
  3. Learners can control the colour scaling applied to a plot using colour as an aesthetic mapping.
  4. Learners can compare three different geoms (bar/col, histogram, point) and their use cases.

Data Science Lifecycle

Deep End

Exploratory Data Analysis with ggplot2

R Package ggplot2

  • ggplot2 is tidyverse’s data visualization package
  • gg in ggplot2 stands for Grammar of Graphics
  • Inspired by the book Grammar of Graphics by Leland Wilkinson
  • Documentation: https://ggplot2.tidyverse.org/
  • Book: https://ggplot2-book.org

My turn: Working with R



Sit back and enjoy!

Take a break

Please get up and move! Let your emails rest in peace.

10:00

Code structure

  • ggplot() is the main function in ggplot2
  • Plots are constructed in layers
  • Structure of the code for plots can be summarized as
ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], 
                     y = [y-variable])) +
  geom_xxx() +
  other options

Code structure

ggplot()

Code structure

ggplot(data = gapminder)

Code structure

ggplot(data = gapminder,
       mapping = aes()) 

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp))  

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp)) +
  geom_boxplot() 

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp)) +
  geom_boxplot() +
  theme_minimal()
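The step-by-step build-up above can also be followed by storing the plot in an object and adding one layer at a time. This is a sketch only: it uses the `mpg` dataset that ships with ggplot2 instead of gapminder, so that it runs without extra packages, and the object name `p` is arbitrary.

```r
library(ggplot2)

# Start with data and aesthetic mappings only
p <- ggplot(data = mpg,
            mapping = aes(x = class,
                          y = hwy))

length(p$layers)  # 0 layers: nothing is drawn yet besides the axes

# Adding a geom adds one layer to the plot
p <- p + geom_boxplot()

length(p$layers)  # 1 layer: the boxplot

# Themes change the look of the plot but do not add a layer
p <- p + theme_minimal()
p
```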

Polls

Poll 1: What does the thick line inside the box of a boxplot represent?

  1. the mean of the observations
  2. the middle of the box
  3. the median of the observations
  4. none of the above

Poll 2: What percentage of observations are contained inside the box of a boxplot (interquartile range)?

  1. 25%
  2. depends on the median
  3. 50%
  4. none of the above

Poll 3: What is the median of a set of observations?

  1. The median is the most frequently occurring value in a dataset.
  2. The median is the sum of all values in a dataset divided by the number of observations.
  3. The median is the point above and below which half (50%) of the observations fall.
  4. The median is the square root of the sum of the squares of each value in a dataset.

Poll 4: If you have the values: 1, 2, 3, and 10: which statistical measure best represents the “true” value?

  1. The mean
  2. The standard deviation
  3. The median
  4. The interquartile range
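The values from Poll 4 can be checked directly in R. This is a small sketch using base R only; it shows how the single large value (10) pulls the mean upwards while the median stays close to the other three values.

```r
# The four values from Poll 4
x <- c(1, 2, 3, 10)

mean(x)    # 4: pulled upwards by the value 10
median(x)  # 2.5: halfway between the two middle values 2 and 3

sd(x)      # standard deviation, a measure of spread
IQR(x)     # 3: interquartile range, also a measure of spread
```

The standard deviation and the interquartile range describe how spread out the values are, not where their centre lies.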

Boxplot, explained

Figure 1: Diagram depicting how a boxplot is created.
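The numbers behind a boxplot can be sketched in base R. The diagram itself is not reproduced here, so this example follows the convention documented for `geom_boxplot()`: the box spans the first to the third quartile, and each whisker reaches to the most extreme observation within 1.5 times the interquartile range of the box. The vector `x` is made up for illustration.

```r
# Hypothetical observations with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Box: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q1 <- q[[1]]
med <- q[[2]]
q3 <- q[[3]]

# Interquartile range: the height of the box
iqr <- q3 - q1

# Fences: limits beyond which points are drawn individually
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whisker_low <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Observations outside the fences are shown as separate points
outliers <- x[x < lower_fence | x > upper_fence]

c(q1 = q1, median = med, q3 = q3)
c(whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```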

Our turn: md-02-exercises

  1. Open posit.cloud in your browser (use your bookmark).
  2. Open the ds4owd workspace for the course.
  3. Click Start next to md-02-exercises.
  4. In the File Manager in the bottom right window, locate the md-02b-data-visualization.qmd file and click on it to open it in the top left window.
30:00

Take a break

Please get up and move! Let your emails rest in peace.

10:00

Visualizing data

Types of variables

numerical

discrete variables

  • non-negative
  • whole numbers
  • e.g. number of students, roll of a die

continuous variables

  • infinite number of values
  • also dates and times
  • e.g. length, weight, size

non-numerical

categorical variables

  • finite number of values
  • distinct groups (e.g. EU countries, continents)
  • ordinal if levels have natural ordering (e.g. weekdays, school grades)
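As a rough illustration, the variable types above map onto R data types as follows. This sketch uses base R only; all object names and values are made up for the example.

```r
# Discrete: whole numbers, stored as integers (note the L suffix)
n_students <- c(12L, 30L, 25L)

# Continuous: stored as double-precision numbers
weight_kg <- c(61.5, 72.3, 58.0)

# Dates are treated as continuous as well
sample_date <- as.Date(c("2023-11-07", "2023-11-13"))

# Categorical: a factor with a finite set of levels
continent <- factor(c("Africa", "Asia", "Africa"))

# Ordinal: an ordered factor, where the order of the levels is meaningful
grade <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

is.integer(n_students)
is.double(weight_kg)
class(sample_date)
levels(continent)

# Comparisons only make sense for ordered factors
grade < "high"
```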

Histogram

  • for visualizing distribution of continuous (numerical) variables
ggplot(data = penguins,
       mapping = aes(x = body_mass_g)) +
  geom_histogram()
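A histogram groups a continuous variable into bins, and the chosen bin width changes how the distribution looks. The sketch below sets the bin width explicitly. It uses the `faithful` dataset that ships with R instead of penguins, so that it runs without extra packages, and the bin width of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# faithful$waiting: waiting time between eruptions, in minutes
p <- ggplot(data = faithful,
            mapping = aes(x = waiting)) +
  geom_histogram(binwidth = 5)  # each bar covers 5 minutes

# layer_data() returns the values ggplot2 computed for the bars
bins <- layer_data(p)

# Every observation lands in exactly one bin
sum(bins$count) == nrow(faithful)

p
```

Trying a few different values for `binwidth` is a common way to check whether a pattern in the histogram depends on the binning.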

Barplot

  • for visualizing distribution of categorical (non-numerical) variables
ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar()
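The difference between `geom_bar()` and `geom_col()` can be sketched as follows: `geom_bar()` counts the rows per category itself, while `geom_col()` expects the bar heights to be in the data already. The data frame `obs` is made up for illustration (a handful of rows, not the penguins dataset).

```r
library(ggplot2)

# One row per observation (hypothetical rows)
obs <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie")
)

# geom_bar() counts the rows in each category
p_bar <- ggplot(data = obs,
                mapping = aes(x = species)) +
  geom_bar()

# One row per category, with the count in the column Freq
counts <- as.data.frame(table(species = obs$species))

# geom_col() uses the y aesthetic as the bar height
p_col <- ggplot(data = counts,
                mapping = aes(x = species, y = Freq)) +
  geom_col()

# Both plots show bars of the same height
layer_data(p_bar)$count
layer_data(p_col)$y
```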

Scatterplot

  • for visualizing relationships between two continuous (numerical) variables
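# scale_color_colorblind() is provided by the ggthemes package, not ggplot2
library(ggplot2)
library(ggthemes)

# gapminder_2007 is not defined on this slide; presumably gapminder filtered to 2007
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     size = pop,
                     color = continent)) +
  geom_point() +
  scale_color_colorblind() +
  theme_minimal()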
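Colour scales can also be set by hand when a ready-made palette does not fit. The sketch below uses `scale_color_manual()` from ggplot2 with the `mpg` dataset that ships with the package; the hex codes are an arbitrary choice for illustration and are not the colours used in the course.

```r
library(ggplot2)

# Hypothetical colour choice: one hex code per level of drv
drv_colours <- c("4" = "#1b9e77", "f" = "#d95f02", "r" = "#7570b3")

p <- ggplot(data = mpg,
            mapping = aes(x = displ,
                          y = hwy,
                          color = drv)) +
  geom_point() +
  scale_color_manual(values = drv_colours) +
  theme_minimal()

# The colours drawn are the ones supplied above
unique(layer_data(p)$colour)

p
```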

Your turn: md-02-exercises

  1. Open posit.cloud in your browser (use your bookmark).
  2. Open the ds4owd workspace for the course.
  3. In the File Manager in the bottom right window, locate the md-02c-make-a-plot.qmd file and click on it to open it in the top left window.
  4. Follow the instructions in the file.
15:00

Homework assignments module 2

Module 2 documentation

Homework due date

  • Homework assignment due: Monday, November 13th
  • Correction & feedback phase up to: Thursday, November 16th

Wrap-up

Thanks! 🌻

Slides created via revealjs and Quarto: https://quarto.org/docs/presentations/revealjs/

Access slides as PDF on GitHub

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.