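# Example data frame used throughout this page
df <- data.frame(id=1:5, age=c(23,43,32,NA,43), department=as.character(c("Management", "Marketing", "Accounting", "Public relations", "Accounting")))
# A list column: each row holds a character vector of countries (c() is NULL, i.e. no stamps)
df$passport_stamps <- list(c("Norway","Netherlands"), c(), c("Uruguay","Mexico","Canada"), c("India"), c("Kenya", "Botswana"))
df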
Inspect your data
Many new R users (and quite a few experienced ones, too) have spent a lot of time scratching their heads, because they thought they knew how their data was structured, only to find they were wrong.
It is a very good idea, once data has been loaded into memory, to spend a little time inspecting it more carefully.
1 Data structure
1.1 Names
The function names() provides the names of the variables in an R data.frame or a list. For instance, using the simple data frame called ‘df’ defined at the top of this page:
names(df)
[1] "id" "age" "department" "passport_stamps"
The output of names() tells us that ‘df’ contains four columns called ‘id’, ‘age’, etc. The names() function will not provide meaningful information about most vectors, as they (typically) have no named elements; for such a vector it simply returns NULL.
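As a small illustration of the point about vectors, the sketch below calls names() on a vector without names and on one with names. The vectors `plain` and `named` are made up for this example and are not part of the data used on this page.

```r
# A plain vector has no names attribute, so names() returns NULL
plain <- c(23, 43, 32)
names(plain)

# A named vector carries a name for each element
named <- c(anna = 23, bo = 43, carl = 32)
names(named)
```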
names() can also be used to define the names of a data frame. What do you think the following code does?
names(df) <- c("Id", "Age", "Department", "Stamps")
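If you would like to experiment with this kind of assignment without touching ‘df’ itself, one option is to work on a copy. The following is only a sketch; the data frame `small` and its copy `renamed` are hypothetical names used for illustration.

```r
# A small stand-in data frame (hypothetical, for illustration only)
small <- data.frame(id = 1:3, age = c(23, 43, 32))

# Work on a copy so the original column names stay intact
renamed <- small
names(renamed) <- c("Id", "Age")

names(small)    # names of the original
names(renamed)  # names of the copy
```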
2 Structure
R also provides the function str(), which details the structure of a variable. An example output of the str() function can look like this:
str(df)
'data.frame': 5 obs. of 4 variables:
$ id : int 1 2 3 4 5
$ age : num 23 43 32 NA 43
$ department : chr "Management" "Marketing" "Accounting" "Public relations" ...
$ passport_stamps:List of 5
..$ : chr "Norway" "Netherlands"
..$ : NULL
..$ : chr "Uruguay" "Mexico" "Canada"
..$ : chr "India"
..$ : chr "Kenya" "Botswana"
This output is somewhat more comprehensive and tells us that the variable ‘df’ is a data frame consisting of 5 observations (rows) of 4 variables (columns).
Furthermore, we can see the data type and the first actual data points for each column – e.g. ‘id’ is of type ‘int’ (integer).
Perhaps we expected the variable ‘department’ to be a factor rather than of type character. Let’s fix that:
library(dplyr)  # provides mutate()
df <- df |> mutate(department = as.factor(department))
str(df)
'data.frame': 5 obs. of 4 variables:
$ id : int 1 2 3 4 5
$ age : num 23 43 32 NA 43
$ department : Factor w/ 4 levels "Accounting","Management",..: 2 3 1 4 1
$ passport_stamps:List of 5
..$ : chr "Norway" "Netherlands"
..$ : NULL
..$ : chr "Uruguay" "Mexico" "Canada"
..$ : chr "India"
..$ : chr "Kenya" "Botswana"
Notice that ‘department’ is now of type ‘Factor’ with 4 levels.
Also notice the column ‘passport_stamps’, which is of type ‘list’: each row holds its own vector of values, and that vector may be empty (NULL).
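A compact way to check only the column types, without the full str() output, might look like the sketch below. It rebuilds a reduced version of the example data under the hypothetical name `demo` so that it can be run on its own, and it uses base R only.

```r
# Reduced stand-in for the example data (hypothetical name 'demo')
demo <- data.frame(id = 1:3,
                   department = c("Management", "Marketing", "Accounting"))
demo$department <- as.factor(demo$department)
demo$passport_stamps <- list(c("Norway", "Netherlands"), c(), c("India"))

# Class of every column
sapply(demo, class)

# Levels of the factor column (sorted alphabetically by default)
levels(demo$department)

# Number of stamps per row; the NULL entry counts as 0
lengths(demo$passport_stamps)
```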
3 Data content
3.1 Inspect the head of the data
The function head() lists the first 6 observations in a data frame or a vector by default. This is useful to get a quick first impression of the data at hand. The number of lines displayed can be specified in the function call, e.g. head(data, n=10). A similar function, tail(), lists the last 6 values of a variable.
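The behaviour of head() and tail() can be tried out on a simple vector. This is a minimal sketch; the vector `x` is made up for the example.

```r
x <- 1:20

head(x)         # first 6 values (the default)
head(x, n = 3)  # first 3 values
tail(x)         # last 6 values (the default)
tail(x, n = 2)  # last 2 values
```

Both functions work the same way on a data frame, where they return the first or last rows.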
4 Look for missing values
The function is.na() tests whether a given value is NA and returns TRUE or FALSE. If a data frame is passed to is.na(), it returns a logical matrix with the same dimensions as the data frame, with each cell being TRUE or FALSE. Look at this simple example:
df2 <- data.frame(c1=1:5, c2=sample(LETTERS[1:24],5,TRUE), c3=letters[1:5])  # c2 is sampled at random, so your letters will differ
df2[3,3] <- NA  # set row 3, column 3 to missing
df2
c1 c2 c3
1 1 G a
2 2 H b
3 3 G <NA>
4 4 A d
5 5 B e
We can now use the is.na() function to look for NA values:
is.na(df2)
c1 c2 c3
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE
If the data frame is too large to inspect by eye, we can instead check whether any NAs are present and how many there are:
any(is.na(df2)) # Are any NA's present at all?
[1] TRUE
is.na(df2) |> sum() # How many NA's in the data frame?
[1] 1
The code above works like this: the logical values TRUE and FALSE are treated as the integers 1 and 0 by functions that take numerical input. The sum() function therefore adds up all the FALSE (0) and TRUE (1) values in the matrix returned by is.na(), and the result is the number of NAs.
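The coercion of logical values to numbers can be seen in isolation in the following sketch. The vectors `flags` and `values` are made up for illustration.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)    # counts the TRUE values
mean(flags)   # proportion of TRUE values

values <- c(23, 43, NA, 43)
sum(is.na(values))    # number of missing values
mean(is.na(values))   # proportion of missing values
```

Using mean() instead of sum() is a convenient way to get the share of missing values rather than the count.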
If you were interested in finding the row and column of NA values, you could do it like so:
which(is.na(df2) , arr.ind=TRUE)
row col
[1,] 3 3
Or perhaps, if you want to check for NAs on a per-column basis:
df2 |> summarise(across(c1:c3, ~ sum(is.na(.x))))  # summarise() and across() are dplyr functions
c1 c2 c3
1 0 0 1
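A per-column count can also be obtained in base R, without any packages. The sketch below is an alternative to the approach above, not a replacement for it; it builds its own small data frame under the hypothetical name `df_na`, with fixed letters instead of a random sample so that it is reproducible.

```r
# Stand-in data frame with one missing value (hypothetical name 'df_na')
df_na <- data.frame(c1 = 1:5,
                    c2 = c("G", "H", "G", "A", "B"),
                    c3 = letters[1:5])
df_na[3, 3] <- NA

# is.na() gives a logical matrix; colSums() adds up the TRUEs in each column
colSums(is.na(df_na))

# rowSums() does the same per row
rowSums(is.na(df_na))
```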
5 Look at value ranges
Let us return to the data frame ‘df’ and look at the column ‘age’.
summary(df$age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
23.00 29.75 37.50 35.25 43.00 43.00 1
The function summary() gives us some summary statistics for the variable ‘age’, including the minimum and maximum values and the number of NAs.
Similarly, look at the output of summary() for the ‘department’ variable, which is now a factor. For a factor, summary() counts the observations in each level:
summary(df$department)
Accounting Management Marketing Public relations
2 1 1 1
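Note that summary() sets the NA aside and reports it separately. Individual functions such as min() and mean() do not do this by default. The sketch below shows the difference, using a standalone copy of the ‘age’ values under the hypothetical name `age`.

```r
age <- c(23, 43, 32, NA, 43)

min(age)                  # NA, because one value is missing
min(age, na.rm = TRUE)    # minimum of the non-missing values
max(age, na.rm = TRUE)    # maximum of the non-missing values
range(age, na.rm = TRUE)  # minimum and maximum in one call
mean(age, na.rm = TRUE)   # mean of the non-missing values
```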
The rstatix package includes the function get_summary_stats(), which provides even more detailed summary data. Look at the output for the data frame df and notice that it only includes the numerical variables/columns:
library(rstatix); library(knitr)  # get_summary_stats() and kable()
get_summary_stats(df) |> kable()
variable | n | min | max | median | q1 | q3 | iqr | mad | mean | sd | se | ci |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | 5 | 1 | 5 | 3.0 | 2.00 | 4 | 2.00 | 1.483 | 3.00 | 1.581 | 0.707 | 1.963 |
age | 4 | 23 | 43 | 37.5 | 29.75 | 43 | 13.25 | 8.154 | 35.25 | 9.674 | 4.837 | 15.393 |
6 Manually cleaning data
If you find that your data is not structured correctly (e.g. a variable is stored as a character but should be a factor), has unexpected NA values, or has some other issue with its values, you should write R code to clean and restructure the data. Do not edit the raw data.
That way, the data cleaning remains transparent and reversible.
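A cleaning step written as code could look something like the sketch below. It is only one possible approach: the object names `raw` and `clean` are hypothetical, and dropping the rows with a missing age is an assumption made for the example, not a general recommendation.

```r
# Raw data as loaded; this object is never modified
raw <- data.frame(id = 1:5,
                  age = c(23, 43, 32, NA, 43),
                  department = c("Management", "Marketing", "Accounting",
                                 "Public relations", "Accounting"),
                  stringsAsFactors = FALSE)

# All cleaning happens on a copy
clean <- raw
clean$department <- factor(clean$department)  # character -> factor
clean <- clean[!is.na(clean$age), ]           # drop rows with missing age

nrow(raw)    # number of rows before cleaning
nrow(clean)  # number of rows after cleaning
```

Because `raw` is left untouched, every cleaning decision can be reviewed, changed, or undone by editing the script and running it again.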