Inspect your data

Published

January 6, 2025

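library(dplyr)    # mutate(), summarise(), across()
library(rstatix)  # get_summary_stats()
library(knitr)    # kable()

# Example data frame used throughout this page
df <- data.frame(id = 1:5, age = c(23, 43, 32, NA, 43), department = c("Management", "Marketing", "Accounting", "Public relations", "Accounting"))
# List column: one element per person; c() evaluates to NULL, i.e. no stamps
df$passport_stamps <- list(c("Norway", "Netherlands"), c(), c("Uruguay", "Mexico", "Canada"), c("India"), c("Kenya", "Botswana"))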

Many new R users (and quite a few experienced ones, too) have spent a lot of time scratching their heads because they thought they knew how their data was structured, only to find that they were wrong.

It is a very good idea, once data has been loaded into memory, to spend a little time inspecting it more carefully.

1 Data structure

1.1 Names

The function names() provides the names of the variables in an R data.frame or a list. For instance, using the simple data frame called ‘df’ defined at the top of this page:

names(df)

[1] "id"              "age"             "department"      "passport_stamps"

The output of names() tells us that ‘df’ contains four columns called ‘id’, ‘age’, etc. The names() function will not provide meaningful information about most vectors, as they (typically) have no named elements; in that case it simply returns NULL.
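
As a small illustration of the last point, the sketch below uses base R only. The vector names `v` and `named_v` are made up for this example. It shows that names() returns NULL for a vector without names, and returns the names once they have been assigned.

```r
# A plain vector has no names, so names() returns NULL
v <- c(23, 43, 32)
names(v)

# A vector can be created with names, and names() then returns them
named_v <- c(anna = 23, bo = 43, carl = 32)
names(named_v)
```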

names() can also be used to define the names of a data frame. What do you think the following code does?

names(df) <- c("Id", "Age", "Department", "Stamps")
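
If you would like to try the assignment without changing ‘df’ itself (the later examples on this page refer to the columns by their original lowercase names), one option is to work on a copy. This is only a sketch: the name `df_copy` is introduced here for illustration, and the list column is left out to keep the example short.

```r
# Shortened stand-in for the 'df' defined at the top of the page
df <- data.frame(id = 1:5,
                 age = c(23, 43, 32, NA, 43),
                 department = c("Management", "Marketing", "Accounting",
                                "Public relations", "Accounting"))

df_copy <- df                                   # work on a copy
names(df_copy) <- c("Id", "Age", "Department")  # one new name per column
names(df_copy)                                  # the copy has the new names
names(df)                                       # the original is unchanged
```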

2 Structure

R also provides the function str() which details the structure of a variable. An example output of the str() function can look like this:

str(df)

'data.frame':   5 obs. of  4 variables:
 $ id             : int  1 2 3 4 5
 $ age            : num  23 43 32 NA 43
 $ department     : chr  "Management" "Marketing" "Accounting" "Public relations" ...
 $ passport_stamps:List of 5
  ..$ : chr  "Norway" "Netherlands"
  ..$ : NULL
  ..$ : chr  "Uruguay" "Mexico" "Canada"
  ..$ : chr "India"
  ..$ : chr  "Kenya" "Botswana"

This output is somewhat more comprehensive and tells us that the variable ‘df’ is a data frame consisting of 5 observations (rows) of 4 variables (columns).

Furthermore, we can see the data type and the first actual data points for each column – e.g. ‘id’ is of type ‘int’ (integer).
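
The same facts can also be queried one at a time, which can be handy in scripts. A minimal base R sketch, re-creating a shortened version of ‘df’ (without the list column) so that it runs on its own:

```r
df <- data.frame(id = 1:5,
                 age = c(23, 43, 32, NA, 43),
                 department = c("Management", "Marketing", "Accounting",
                                "Public relations", "Accounting"),
                 stringsAsFactors = FALSE)

dim(df)            # number of rows and number of columns
sapply(df, class)  # class of each column
```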

Perhaps we expected the variable ‘department’ to be a factor rather than a character vector. Let’s fix that:

df <- df |> mutate(department = as.factor(department))
str(df)

'data.frame':   5 obs. of  4 variables:
 $ id             : int  1 2 3 4 5
 $ age            : num  23 43 32 NA 43
 $ department     : Factor w/ 4 levels "Accounting","Management",..: 2 3 1 4 1
 $ passport_stamps:List of 5
  ..$ : chr  "Norway" "Netherlands"
  ..$ : NULL
  ..$ : chr  "Uruguay" "Mexico" "Canada"
  ..$ : chr "India"
  ..$ : chr  "Kenya" "Botswana"

Notice that ‘department’ is now of type ‘Factor’ with 4 levels.
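
A factor stores each value as an integer code that points into a sorted set of levels, which is what the numbers 2 3 1 4 1 in the str() output represent. The sketch below shows this on a plain vector rather than a data frame column (the name `department` is reused purely for illustration):

```r
department <- c("Management", "Marketing", "Accounting",
                "Public relations", "Accounting")
department <- factor(department)  # convert the character vector to a factor

levels(department)      # the distinct values, sorted alphabetically
as.integer(department)  # the integer code stored for each observation
```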

Also notice the column ‘passport_stamps’ which is of type ‘list’.
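
A list column can hold a different number of values in each row. As a rough sketch, using a stand-alone list with the same content as the column, base R's lengths() counts the elements per row and double brackets extract a single row's values:

```r
passport_stamps <- list(c("Norway", "Netherlands"), c(),
                        c("Uruguay", "Mexico", "Canada"),
                        c("India"), c("Kenya", "Botswana"))

lengths(passport_stamps)  # number of stamps per person
passport_stamps[[3]]      # all stamps of the third person
```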

3 Data content

3.1 Inspect the head of the data

The function head() lists the first observations in a data frame or a vector, 6 by default. This is useful for getting a quick first impression of the data at hand. The number of rows displayed can be specified in the function call, e.g. head(data, n=10). The similar function tail() lists the last 6 values by default.
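
A short sketch of both functions on a simple integer vector (the name `x` is just an example):

```r
x <- 1:20

head(x)         # first 6 values (the default)
head(x, n = 3)  # first 3 values
tail(x, n = 2)  # last 2 values
```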

4 Look for missing values

The function is.na() tests whether a given value is NA and returns TRUE or FALSE. If a data frame is passed to is.na(), it returns a logical matrix with the same dimensions, in which each cell is TRUE or FALSE. Look at this simple example:

df2 <- data.frame(c1 = 1:5, c2 = sample(LETTERS[1:24], 5, TRUE), c3 = letters[1:5])
df2[3, 3] <- NA
df2

  c1 c2   c3
1  1  G    a
2  2  H    b
3  3  G <NA>
4  4  A    d
5  5  B    e

We can now use the is.na() function to look for NA values:

is.na(df2)

        c1    c2    c3
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE  TRUE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE

If the data frame is too large to inspect by eye, we can instead check whether any NAs are present, count them, and find their location:

any(is.na(df2)) # Are any NA's present at all?

[1] TRUE

is.na(df2) |> sum() # How many NA's in the data frame?

[1] 1

The code above works like this: when the logical values TRUE and FALSE are used where numbers are expected, R coerces them to 1 and 0. The sum() function therefore adds up all the FALSE (0) and TRUE (1) values in the matrix returned by is.na(), and the result is the number of NAs in the data frame.
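
The coercion can be seen in isolation in the following sketch (the vectors `flags` and `age` are illustrative only). The same idea also gives the share of missing values via mean():

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

sum(flags)   # counts the TRUE values
mean(flags)  # proportion of TRUE values

age <- c(23, 43, 32, NA, 43)
sum(is.na(age))   # number of missing values
mean(is.na(age))  # share of missing values
```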

If you were interested in finding the row and column of NA values, you could do it like so:

which(is.na(df2), arr.ind = TRUE)

     row col
[1,]   3   3
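
The result is a matrix with the columns "row" and "col", so it can be used to look up the names of the affected columns. A minimal sketch, using a deterministic variant of ‘df2’ (without sample()) and a hypothetical helper variable `idx`:

```r
df2 <- data.frame(c1 = 1:5, c2 = LETTERS[1:5], c3 = letters[1:5])
df2[3, 3] <- NA

idx <- which(is.na(df2), arr.ind = TRUE)  # matrix with columns "row" and "col"
names(df2)[idx[, "col"]]                  # name of each column holding an NA
```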

Or, if you want to count the NAs in each column:

df2 |> summarise(across(c1:c3, ~ sum(is.na(.x))))

  c1 c2 c3
1  0  0  1
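
If dplyr is not loaded, a base R alternative is to apply colSums() to the logical matrix returned by is.na(). The sketch below again uses a deterministic variant of ‘df2’:

```r
df2 <- data.frame(c1 = 1:5, c2 = LETTERS[1:5], c3 = letters[1:5])
df2[3, 3] <- NA

na_per_column <- colSums(is.na(df2))  # named vector: number of NAs per column
na_per_column
```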

5 Look at value ranges

Let us return to the data frame ‘df’ and look at the column ‘age’.

summary(df$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  23.00   29.75   37.50   35.25   43.00   43.00       1

The function summary() gives us some summary statistics for the variable ‘age’, including the minimum and maximum values.
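
Note that summary() reports the NA separately, whereas most individual statistics functions return NA as soon as one value is missing, unless na.rm = TRUE is set. A small sketch on a stand-alone copy of the ‘age’ values:

```r
age <- c(23, 43, 32, NA, 43)

mean(age)                 # NA, because one value is missing
mean(age, na.rm = TRUE)   # mean of the four observed values
range(age, na.rm = TRUE)  # minimum and maximum of the observed values
```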

Similarly, look at the output of summary() for the ‘department’ variable, which is now a factor:

summary(df$department)

      Accounting       Management        Marketing Public relations 
               2                1                1                1
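
Similar counts can be obtained with table(), which can also report missing values when asked to. The sketch below uses a hypothetical vector `dept` with one NA added for illustration; it is not part of ‘df’:

```r
dept <- factor(c("Management", "Marketing", "Accounting",
                 "Public relations", "Accounting", NA))

counts <- table(dept, useNA = "ifany")  # one count per level, plus NA if present
counts
```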

The rstatix package includes the function get_summary_stats(), which provides even more detailed summary data. Look at the output for the data frame df and notice that it only includes the numerical variables/columns:

get_summary_stats(df) |> kable()

variable  n  min  max  median     q1  q3    iqr    mad   mean     sd     se      ci
id        5    1    5     3.0   2.00   4   2.00  1.483   3.00  1.581  0.707   1.963
age       4   23   43    37.5  29.75  43  13.25  8.154  35.25  9.674  4.837  15.393
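
To see which columns count as numerical, and to reproduce a few of the figures without rstatix, a base R sketch could look like this. It is not how get_summary_stats() is implemented; the names `num_cols` and `stats` are made up for the example, and the list column is left out:

```r
df <- data.frame(id = 1:5,
                 age = c(23, 43, 32, NA, 43),
                 department = factor(c("Management", "Marketing", "Accounting",
                                       "Public relations", "Accounting")))

num_cols <- sapply(df, is.numeric)  # TRUE for id and age, FALSE for department

# n, mean and sd for each numeric column, ignoring missing values
stats <- sapply(df[num_cols], function(x) {
  c(n = sum(!is.na(x)), mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
})
round(stats, 3)  # one column per numeric variable
```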

6 Manually cleaning data

If you find that your data is not structured correctly (e.g. a variable is stored as a character vector but should be a factor), contains unexpected NA values, or has some other issue with its values, you should write R code to clean and restructure the data. Do not edit the raw data.

That way, the data cleaning remains transparent and reversible.
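
One possible way to organise this is to keep all cleaning steps in a single function that takes the raw data and returns a cleaned copy. The sketch below is only an illustration: the data frame `raw`, the function `clean_data()` and the use of -1 as a code for a missing age are all assumptions made for the example.

```r
# Stand-in for freshly loaded raw data; in practice this might come from read.csv()
raw <- data.frame(id = 1:4,
                  age = c(23, -1, 32, 43),  # -1 is assumed to mean "missing"
                  department = c("Management", " marketing",
                                 "Accounting", "accounting "),
                  stringsAsFactors = FALSE)

# All cleaning steps live in one function, so they are documented and repeatable
clean_data <- function(d) {
  d$age[d$age < 0] <- NA                         # recode the assumed missing-value code
  d$department <- trimws(d$department)           # remove stray whitespace
  d$department <- factor(tolower(d$department))  # harmonise case, then make a factor
  d
}

clean <- clean_data(raw)  # 'raw' itself is left untouched
str(clean)
```

Re-running the script on the untouched raw data always reproduces the same cleaned data, and any step can be changed or removed later.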
