Determinants of borehole’ functionality in Northen Malawi

Author
Affiliation

Mabvuto Yesaya

Malawi University of Business and Applied Sciences,Malawi

Published

February 10, 2024

Project description

The data is from TA Wasambo in Karonga district (Northen Malawi). This data is for boreholes/handpumps that people mostly rely on for drinking water collection, secondary to wells and rivers. The data were collected in June 2013, and my interests are:

Objectives

  1. To know the users of boreholes per village.

  2. To check if there is a relationship between the costs that were considered before setting the tariff and the amount that is agreed to be paid as a tariff per month now

  3. To check if the actual tariff collected per month, tariff collection frequency, distance to where borehole spare parts are, presence of a service provider, actual tariff collected and conducting preventive maintenance lead to the functionality of boreholes

The goal of this document is to explore, clean, and select the final variables from the borehole Malawi dataset. The selected variables will be used in:

  1. Analysis

  2. Codebook

  3. Making R package

Import data

# library
#data wrangling, tidying,column mutations etc
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#clean column names
library (janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
#for EDA
library(explore)

#here here package
library(here)
here() starts at /cloud/project
#making simple tables
library(knitr)

library(dplyr)

#for making dummy variables
library(fastDummies)
Thank you for using fastDummies!
To acknowledge our work, please cite the package:
Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
## data
#read_csv or read_csv2 to read in a ; separated file does not work for me after consulting perplexity ai for multiple times. In the meantime, I will use read_delim. 

borehole_malawi_raw <- read_delim("/cloud/project/data/raw/Borehole_Malawi_-_all_versions_-_labels_-_2023-12-31-03-34-25.csv",
                            delim = ";")
New names:
Rows: 197 Columns: 194
── Column specification
──────────────────────────────────────────────────────── Delimiter: ";" chr
(81): Bohole name and ID, Village name, Name of VDC, Main role of the r... dbl
(99): If Functional status is one of not functional, No longer exists o... lgl
(8): HAND PUMP BOREHOLE FUNCTIONALITY SURVEY, Specify...86, Record the... dttm
(3): start, end, _submission_time date (3): today, Which year was the hand pump
borehole rehabilitated, What ...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `What is the current problem?` -> `What is the current problem?...15`
• `What is the current problem?` -> `What is the current problem?...17`
• `Specify` -> `Specify...29`
• `Specify` -> `Specify...86`
• `Specify` -> `Specify...109`
• `Specify` -> `Specify...126`
• `Specify` -> `Specify...128`
• `Is there a service provider or someone responsible for operating and/or
  maintaining this hand pump borehole or water system?` -> `Is there a service
  provider or someone responsible for operating and/or maintaining this hand
  pump borehole or water system?...138`
• `Is there a service provider or someone responsible for operating and/or
  maintaining this hand pump borehole or water system?` -> `Is there a service
  provider or someone responsible for operating and/or maintaining this hand
  pump borehole or water system?...142`
• `Specify` -> `Specify...155`

Data cleaning

#add id numbers
borehole_malawi<-borehole_malawi_raw |> 
mutate(id = seq(1:n())) |> 
  relocate(id)
#renaming relevant questions 

borehole_malawi<-borehole_malawi |> 
  rename(id=id,
         date=today)  
 #objective 1: To know the users of boreholes per Village.

borehole_malawi<-borehole_malawi |> 
  rename(village_name=`Village name`,
        bh_use_num= `How many households use (d) this hand pump borehole?`) 
#objective 2: To check if there is a relationship between the costs that were considered before setting the tariff and the amount that is agreed to be paid as a tariff per month now

borehole_malawi<-borehole_malawi |>   
  rename( tariff_costs_consider=`What costs were considered when setting the tariff or user fee?`,
          tarrif_frequency=`How often is the tariff/user fee collected?`,
         tariff_amount=`How much is the tariff/user fee (in Kwacha)?`) 
 #objective 3: To check if the actual tariff collected per month, distance to where borehole spare parts are, presence of a service provider and conducting  preventive maintenance lead to the functionality of boreholes

  borehole_malawi<-borehole_malawi |> 
  rename(tariff_hh_number=`How many households in the community paid a water fee the last time it was collected?`,
         distance_to_spareparts= `How far away are (were) the materials you use for hand pump borehole repairs (ex. spare parts, tools, etc.)?`,
        available_service_provider= `Is there a service provider or someone responsible for operating and/or maintaining this hand pump borehole or water system?...142`,
        preventive_mantainance=`Do you conduct preventive maintenance?`,
         bh_functional=`Functional status of the borehole`)
 #create a smaller dataset
borehole_malawi_small<-borehole_malawi |> 
  select(id,
         date,
        
         village_name,                   #objective 1
         bh_use_num,
         
    
         tariff_costs_consider,
         tarrif_frequency,               #objective2
         tariff_amount, 
       
        
         tariff_hh_number, 
         distance_to_spareparts,
         available_service_provider,     #objective 3
         preventive_mantainance, 
         bh_functional)
#data cleaning
borehole_malawi_small<-borehole_malawi_small |> 
  clean_names() |> 

#remove the entries where the borehole no longer exists or abandoned
  filter(bh_functional != 'No longer exists or abandoned')

#recode functionality column
borehole_malawi_small<-borehole_malawi_small |> 
    mutate(
        bh_functional = case_when(
            bh_functional == "Functional" ~ "Yes",
            bh_functional == "Not functional" ~ "No",
           bh_functional== 'Partially functional but in need of repair'~"No"
            
        )
    )
  
#village name to be factor
#bhusenum to be int not double
#tarrif costs consider to be faactor
#tarrif frequency to be factor
#tarrif amount to be int not dbl
#tariff hh number to be int not dbl
#distance to spare parts to be factor
#availability of service provider to be factor
#preventive mantainance to be factor
#bh_functional tobe factor
#convert categories that will be regressed into dummies (1/0)
#change all Y/N into 1/0
#shorten accessible within the community but leave others unchaned
borehole_malawi_small <- borehole_malawi_small %>% 
  mutate(
    distance_to_spareparts = case_when(
    distance_to_spareparts == "Accessible within the community" ~ "Within community", 
    TRUE ~ as.character(distance_to_spareparts)     
    )
  )

#remove the records where village name is missing or Nb 
borehole_malawi_small <- borehole_malawi_small %>% 
  filter(!is.na(village_name) & village_name != "Nb")
#I think I have done the known ones, know let me do EDA to check every variable

#explore(borehole_malawi_small)   #main EDA indicator

#the number of people that uses a borehole

borehole_malawi_small |>
  group_by(village_name) |> #selected
  summarise(bh_use_num) 
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'village_name'. You can override using the
`.groups` argument.
# A tibble: 108 × 2
# Groups:   village_name [32]
   village_name bh_use_num
   <chr>             <dbl>
 1 Bonje                10
 2 Bonje                27
 3 Bonje                34
 4 Bonje                50
 5 Bonje                28
 6 Bonje                21
 7 Bonje                28
 8 Bonje                29
 9 Bonje              1125
10 Bonje                18
# ℹ 98 more rows
#village names were manually typed. I need to check the names for consistency and recode them

# Summarize the borehole_malawi_small dataset by village_name
summary_table <- borehole_malawi_small %>%
  filter(!is.na(village_name) & village_name != "Nb") %>%
  group_by(village_name) %>%
  summarise(count = n()) %>%
  kable()

# Print the summary table
print(summary_table)


|village_name   | count|
|:--------------|-----:|
|Bonje          |    11|
|Chambogho      |     7|
|Chewere        |     1|
|Chilawira      |     4|
|Chimphinga     |     6|
|Chizumba       |     4|
|Goerge         |     1|
|James          |     2|
|Kachaka        |     2|
|Kachere        |     3|
|Kaluli Chisiza |     2|
|Kaputamwera    |     1|
|Kayuni         |     2|
|Malongo        |     1|
|Manganani      |     5|
|Mkakatavu      |     3|
|Mlirasaambo    |     1|
|Mlirasambo     |     1|
|Mponela        |     8|
|Mtawale        |     3|
|Mtekela 2      |     1|
|Mtowerayifwa   |     4|
|Mugabe         |     1|
|Mugabi         |     3|
|Mugoyera       |     1|
|Mulirasambo    |     2|
|Mulyabweka     |     2|
|Mungombwa      |     6|
|Mwakanyamale   |     8|
|Mwakhwawa      |     8|
|Mzangunya      |     2|
|Wavikhazga     |     2|
#recode the similar names with different spellings. I suspect data entry errors
borehole_malawi_small <- borehole_malawi_small %>% 
  mutate(
    village_name = case_when(
    village_name == "Mlirasaambo" ~ "Mlirasambo",
     village_name == "Mlirasaambo" ~ "Mlirasaambo",
    village_name == "Mulirasambo" ~ "Mlirasambo",
    village_name == "Mugabi" ~ "Mugabe",
    TRUE ~ as.character(village_name)     
    )
  )
#tariff amount
#I will write a funcion  to handle the column with comments in it

# Define the modify function
modify <- function(dataset) {
  dataset <- as.numeric(dataset)  # Convert the column to numeric to handle NAs
  
  # 1. Bring down values that are greater than 2000 to 2000
  dataset[dataset > 2000] <- 2000
  
  # 2. When it finds values that have five digits, the last digit should be deleted
  dataset <- ifelse(nchar(as.character(dataset)) == 5, as.numeric(substring(as.character(dataset), 1, 4)), dataset)
  
  # 3. When it finds a single number, it should add 2 zeros to it
  dataset <- ifelse(dataset < 10, dataset * 100, dataset)
  
  return(dataset)
}

# Apply the modify function to the borehole_malawi_small dataset
borehole_malawi_small$tariff_amount <- modify(borehole_malawi_small$tariff_amount)
#tariff_hh_number is variable that define the number of households that paid that month when the surveywas conducted in July.  If households are tagged that they drink water from a borehole but forthat time they did not pay, NAs should be Zero

borehole_malawi_small <- borehole_malawi_small |> 
  mutate(tariff_hh_number = replace_na(tariff_hh_number, 0))

#same for tarrif amount
borehole_malawi_small <- borehole_malawi_small |> 
  mutate(tariff_amount = replace_na(tariff_amount, 0))

#preventive mantainance
borehole_malawi_small <- borehole_malawi_small |> 
  mutate(preventive_mantainance = replace_na(preventive_mantainance,'No'))

#if tarrif frequency is missing it means those people do not pay tarrif beause that question as not applicable to them. Therefore>None
borehole_malawi_small <- borehole_malawi_small |> 
  mutate(tarrif_frequency = replace_na(tarrif_frequency,'None'))

#those who have NA on consideration for setting the tarrif are the ones that drink from boreholes that do not have tarrif system
borehole_malawi_small <- borehole_malawi_small |> 
  mutate(tariff_costs_consider = replace_na(tariff_costs_consider,'No tarrif system set'))

I change the values below because they will go into regression model

Binary

# Replace Yes/No with 1/0 in multiple columns
borehole_malawi_small <- borehole_malawi_small %>%
  mutate_at(vars(available_service_provider:bh_functional), ~ifelse(. == "Yes", 1, 0))

Dummies

borehole_malawi_small<- dummy_cols(borehole_malawi_small, select_columns = c("distance_to_spareparts", "tarrif_frequency")) |> 
  clean_names()

Analysis ready data

# Write the borehole_malawi_small dataset to a CSV file
write_csv(borehole_malawi_small, here("data/processed", "borehole_malawi_small.csv"))