Import

Author

Steen Harsted & Søren O’Neill

Published

June 17, 2025

1 Presentation

You can download the course slides for this section here

2 Data import

2.1 Import basics

By now:

You have set up your course project
You know how to use the here() function
You have successfully read the solders.csv and assigned it to an object called soldiers

If not, we suggest you go to “Project Discipline” and complete the Setting up your course project section.

This is all you need to know for this course.

The following exercises will introduce you to importing file types other than .csv. These exercises are optional, and not required to complete the rest of the course.

2.2 Import excercises

Download this zipped folder and unzip it in your `raw_data` directory

The unzipped folder contains:

A .csv file called “Kronisk smerte - udvikling.csv”. It has been exported by a Danish version of Excel.
- Is this a good name for a file in a data science project?
A set of files related to a motion capture project:
- A Stata file (.dta) containing the sex of the participants
- A .csv file containing the age (in months) of the participants.
- 19 simplified motion capture files of children performing vertical jumps
  - Frame and Time_Seconds are time variables
  - CGY gives you the height (in mm) of their center of gravity
A folder called challenge that you need if you want to solve the challenge assignment

Import “kronisk smerte - udvikling.csv”

Try these functions:

read.csv()
read.csv2()
read_csv2()
What is the difference in the output? Why?

Show me the code

path <- here("raw_data", "data_import_exercises", "Kronisk smerte - udvikling.csv")
read.csv(path)
read.csv2(path)
read_csv2(path)

Import the files `id_age.csv` and `id_sex.dta`, combine them (use `full_join()`), and assign the combined dataframe to an object

The tidyverse includes the haven package that can read Stata´s .dta files
The function to use for the .dta file is haven::read_dta()
Investigate the two files before you combine them. Do you need to change anything?
The dplyr function full_join() will help you to combine the two imported objects.

Show me the code

a <- haven::read_dta(here("raw_data", "data_import_exercises", "id_sex.dta"))

b <- read_csv2(here("raw_data", "data_import_exercises","id_age.csv")) %>%
     rename(ID = id)  # Rename the id column

df_descriptives <- full_join(a, b)
rm(a,b) # Remove the objects a and b so that your environment is less crowded

Use `list.files()` to generate an object called `files` containing the filenames of the 19 motioncapture files

Hints for the pattern =argument in list.files()

, pattern = NULL          # The default setting. List all files in our directory
, pattern = ".csv$"       # all files in our directory that ends with ".csv"
, pattern = "^desc"       # all files in our directory that starts with "desc"
, pattern = "[0-9].csv$"  # all files in our directory that ends with "[a number from 0-9].csv"

Show me the code

files <- list.files(here("raw_data", "data_import_exercises"), pattern = "[0-9].csv$")

Import the 19 motioncapture files

Use `read_csv()``
The file = should include the path to the folder (use here()), and the files object you just created.

Show me the code

df_mocap <- read_csv(here("raw_data", "data_import_exercises", files))

Combine the motion capture files with the df_descriptives object you created before

Show me the code

df_all <- full_join(df_descriptives, df_mocap)

Update the object you just created

You will learn the code to do this later in the course. * Change ID to a factor * Change sex to a factor with levels = c(1,2), labels = c("Boy", "Girl") * Change age from months to years * Retain the row with the highest value of CGY for each of the children

Show me the code

df_all <- df_all %>% 
  mutate(
    ID = factor(ID),
    sex = factor(sex, levels = c(1,2), labels = c("Boy", "Girl")),
    age = age/12) %>% 
  group_by(ID) %>% 
  filter(CGY == max(CGY, na.rm = TRUE))

Save the object you just created

Save the object you created before as an “.RData” file, save it in your “clean_data” folder
Save the object you created before as an “.csv” file, save it in your “clean_data” folder
What are the pros and cons of the two file types?

Show me the code

save(df_all, file = here("clean_data", "my_data.RData"))
write_csv(df_all, file = here("clean_data", "my_data.csv"))

2.2.1 LOOKING FOR A CHALLENGE?

Inspect the 19 motioncapture files in the folder called `challenge`

What important piece of information is missing from the data in the files?
Where can you find this data?

Answer
The ID can only be found in the filename. You need to find a way to piece the ID together with the data in files

Install the package `vroom` and load it

vroom is a very fast package for importing .csv files. (hence the name)
The main function in the vroom package is vroom()
vroom() has the argument delim = that allows you specify the delimter you want
Read the documentation for vroom()
- What does the id argument in the vroom() function do?

Use `vroom()` to import all the mocap files in the challenge folder

Use the id = argument in the vroom() function.

Show me the code

files_chal <- list.files(here("raw_data", "data_import_exercises", "challenge"), pattern = "[0-9].csv$")
df_mocap_chal <- vroom::vroom(here("raw_data", "data_import_exercises", "challenge", files_chal),
                              id = "filename")

We need to extract the ID from the filename column now

Create a column called ID that only contains the ID part from the filename
You need to use regular expressions to solve this
Two functions that may help you are str_extract() and str_remove()
Both functions have a pattern = argument that must be a regular expression
Use them inside a mutate() call

Show me the code

df_mocap_chal <- df_mocap_chal %>%
  mutate(ID = str_extract(filename, pattern = "[0-9]+_3.1"),  # Capture digits that come before _3.1
         ID = str_remove(ID, pattern = "_.+$"),               # Remove the _3.1 part
         ID = as.numeric(ID))                                 # Change ID from a string to a numeric variable. You can also change it to a factor

1 Presentation

2 Data import

2.1 Import basics

2.2 Import excercises

Download this zipped folder and unzip it in your raw_data directory

Import “kronisk smerte - udvikling.csv”

Import the files id_age.csv and id_sex.dta, combine them (use full_join()), and assign the combined dataframe to an object

Use list.files() to generate an object called files containing the filenames of the 19 motioncapture files

Import the 19 motioncapture files

Update the object you just created

Save the object you just created

2.2.1 LOOKING FOR A CHALLENGE?

Inspect the 19 motioncapture files in the folder called challenge

Install the package vroom and load it

Use vroom() to import all the mocap files in the challenge folder

We need to extract the ID from the filename column now

Download this zipped folder and unzip it in your `raw_data` directory

Import the files `id_age.csv` and `id_sex.dta`, combine them (use `full_join()`), and assign the combined dataframe to an object

Use `list.files()` to generate an object called `files` containing the filenames of the 19 motioncapture files

Inspect the 19 motioncapture files in the folder called `challenge`

Install the package `vroom` and load it

Use `vroom()` to import all the mocap files in the challenge folder