Import

Author

Steen Harsted & Søren O’Neill

Published

April 25, 2025

1 Presentation

You can download the course slides for this section here

  

2 Data import



2.1 Import basics

By now:

  • You have set up your course project
  • You know how to use the here() function
  • You have successfully read the solders.csv and assigned it to an object called soldiers

If not, we suggest you go to “Project Discipline” and complete the Setting up your course project section.

This is all you need to know for this course.

The following exercises will introduce you to importing file types other than .csv. These exercises are optional, and not required to complete the rest of the course.





2.2 Import excercises


Download this zipped folder and unzip it in your raw_data directory

The unzipped folder contains:

  • A .csv file called “Kronisk smerte - udvikling.csv”. It has been exported by a Danish version of Excel.
    • Is this a good name for a file in a data science project?
  • A set of files related to a motion capture project:
    • A Stata file (.dta) containing the sex of the participants
    • A .csv file containing the age (in months) of the participants.
    • 19 simplified motion capture files of children performing vertical jumps
      • Frame and Time_Seconds are time variables
      • CGY gives you the height (in mm) of their center of gravity
  • A folder called challenge that you need if you want to solve the challenge assignment



Import “kronisk smerte - udvikling.csv”

Try these functions:

  • read.csv()
  • read.csv2()
  • read_csv2()
  • What is the difference in the output? Why?
Show me the code
path <- here("raw_data", "data_import_exercises", "Kronisk smerte - udvikling.csv")
read.csv(path)
read.csv2(path)
read_csv2(path)



Import the files id_age.csv and id_sex.dta, combine them (use full_join()), and assign the combined dataframe to an object

  • The tidyverse includes the haven package that can read Stata´s .dta files
  • The function to use for the .dta file is haven::read_dta()
  • Investigate the two files before you combine them. Do you need to change anything?
  • The dplyr function full_join() will help you to combine the two imported objects.
Show me the code
a <- haven::read_dta(here("raw_data", "data_import_exercises", "id_sex.dta"))

b <- read_csv2(here("raw_data", "data_import_exercises","id_age.csv")) %>%
     rename(ID = id)  # Rename the id column

df_descriptives <- full_join(a, b)
rm(a,b) # Remove the objects a and b so that your environment is less crowded



Use list.files() to generate an object called files containing the filenames of the 19 motioncapture files

Hints for the pattern =argument in list.files()
, pattern = NULL          # The default setting. List all files in our directory
, pattern = ".csv$"       # all files in our directory that ends with ".csv"
, pattern = "^desc"       # all files in our directory that starts with "desc"
, pattern = "[0-9].csv$"  # all files in our directory that ends with "[a number from 0-9].csv"
Show me the code
files <- list.files(here("raw_data", "data_import_exercises"), pattern = "[0-9].csv$")



Import the 19 motioncapture files

  • Use `read_csv()``
  • The file = should include the path to the folder (use here()), and the files object you just created. ::: {.cell}
Show me the code
df_mocap <- read_csv(here("raw_data", "data_import_exercises", files))

:::

  • Combine the motion capture files with the df_descriptives object you created before ::: {.cell}
Show me the code
df_all <- full_join(df_descriptives, df_mocap)

:::



Update the object you just created

You will learn the code to do this later in the course. * Change ID to a factor * Change sex to a factor with levels = c(1,2), labels = c("Boy", "Girl") * Change age from months to years * Retain the row with the highest value of CGY for each of the children ::: {.cell}

Show me the code
df_all <- df_all %>% 
  mutate(
    ID = factor(ID),
    sex = factor(sex, levels = c(1,2), labels = c("Boy", "Girl")),
    age = age/12) %>% 
  group_by(ID) %>% 
  filter(CGY == max(CGY, na.rm = TRUE))

:::



Save the object you just created

  • Save the object you created before as an “.RData” file, save it in your “clean_data” folder
  • Save the object you created before as an “.csv” file, save it in your “clean_data” folder
  • What are the pros and cons of the two file types? ::: {.cell}
Show me the code
save(df_all, file = here("clean_data", "my_data.RData"))
write_csv(df_all, file = here("clean_data", "my_data.csv"))

:::



2.2.1 LOOKING FOR A CHALLENGE?

Inspect the 19 motioncapture files in the folder called challenge

  • What important piece of information is missing from the data in the files?
  • Where can you find this data?
    Answer The ID can only be found in the filename. You need to find a way to piece the ID together with the data in files

Install the package vroom and load it

  • vroom is a very fast package for importing .csv files. (hence the name)
  • The main function in the vroom package is vroom()
  • vroom() has the argument delim = that allows you specify the delimter you want
  • Read the documentation for vroom()
    • What does the id argument in the vroom() function do?



Use vroom() to import all the mocap files in the challenge folder

  • Use the id = argument in the vroom() function.
Show me the code
files_chal <- list.files(here("raw_data", "data_import_exercises", "challenge"), pattern = "[0-9].csv$")
df_mocap_chal <- vroom::vroom(here("raw_data", "data_import_exercises", "challenge", files_chal),
                              id = "filename")



We need to extract the ID from the filename column now

  • Create a column called ID that only contains the ID part from the filename
  • You need to use regular expressions to solve this
  • Two functions that may help you are str_extract() and str_remove()
  • Both functions have a pattern = argument that must be a regular expression
  • Use them inside a mutate() call ::: {.cell}
Show me the code
df_mocap_chal <- df_mocap_chal %>%
  mutate(ID = str_extract(filename, pattern = "[0-9]+_3.1"),  # Capture digits that come before _3.1
         ID = str_remove(ID, pattern = "_.+$"),               # Remove the _3.1 part
         ID = as.numeric(ID))                                 # Change ID from a string to a numeric variable. You can also change it to a factor

:::