id | test_a | test_b | day_of_test_a | day_of_test_b |
---|---|---|---|---|
1 | 43 | 318 | TUE | MON |
2 | 751 | 360 | WED | TUE |
3 | 261 | 493 | TUE | TUE |
4 | 963 | 67 | WED | WED |
5 | 898 | 652 | WED | WED |
6 | 323 | 525 | WED | THU |
7 | 202 | 431 | TUE | FRI |
8 | 461 | 520 | WED | MON |
9 | 715 | 59 | WED | TUE |
10 | 268 | 152 | FRI | FRI |
Tidy data
A few words about tidy data structure…
It is often said that 80% of data analysis is spent on cleaning and preparing data.
… a quote from the tidyverse page on tidy data.
Please consider the differences between:
- Raw data
- Cleaned data
- Wrangled data
The raw data is the unadulterated version of the data as collected by what-ever-means you collect your data: questionnaires, machine sensor readings, etc. Within the limits set by GDPR rules, you should always keep a version of the raw data in its original form.
Cleaned data is the raw data after you have made the minimal changes necessary to make data useful. For example, deletion of observations which are flawed due to apparatus malfunction, or data entry mistakes, deletion of variables that were never collected, and conversion of data types if necessary. The process whereby the raw data is cleaned should be scripted (coded) to ensure, that it is reproducible and documented.
The cleaned data probably needs to be wrangled into a shape (and content) appropriate for specific analyses. For example, a wrangled data set may include only specific observations of specific variables relevant to a given analysis, in a format/shape suited for that analysis. This should also be scripted to ensure, that it is reproducible and documented.
The above link to the tidyverse page on tidy data provides a lot of information about tidy data, but the central principle is, that with tidy data:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
…this sort of assumes, that data is stored in a rectangular data frame (or tibble).
On the following three tabs, you can see three examples of the same data set, structured in different ways. Look at each of them in turn…
id | test | day | measurement |
---|---|---|---|
1 | a | TUE | 43 |
1 | b | MON | 318 |
2 | a | WED | 751 |
2 | b | TUE | 360 |
3 | a | TUE | 261 |
3 | b | TUE | 493 |
4 | a | WED | 963 |
4 | b | WED | 67 |
5 | a | WED | 898 |
5 | b | WED | 652 |
6 | a | WED | 323 |
6 | b | THU | 525 |
7 | a | TUE | 202 |
7 | b | FRI | 431 |
8 | a | WED | 461 |
8 | b | MON | 520 |
9 | a | WED | 715 |
9 | b | TUE | 59 |
10 | a | FRI | 268 |
10 | b | FRI | 152 |
id | test_a | test_b |
---|---|---|
1 | 43,TUE | 318,MON |
2 | 751,WED | 360,TUE |
3 | 261,TUE | 493,TUE |
4 | 963,WED | 67,WED |
5 | 898,WED | 652,WED |
6 | 323,WED | 525,THU |
7 | 202,TUE | 431,FRI |
8 | 461,WED | 520,MON |
9 | 715,WED | 59,TUE |
10 | 268,FRI | 152,FRI |
Consider the three different ways to structure the data in light of:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Which of the three structures/tables represent the most tidy data structure?
Ask yourself, what different units of information (i.e. data points) constitutes each observation … and how the relation between data points specifies such an observation?
It is obvious, that each of the numerical values (measurements) represent data, but so does ‘id’ and the ‘weekday’, as well as the test ‘a’ versus ‘b’.
It seems from the data, that each id was tested on two occasions (‘a’ and ‘b’) which fell on different weekdays.
In other words, ‘id’, ‘test’, ‘weekday’ and ‘measurement’ all represent units of information (data points) which together constitutes an observation, but they are related in a non-trivial manner:
For instance, the data id=1, test=a, weekday=TUE and measurement=43 are related as a single observation. Similarly, the data id=1, test=b, weekday=MON and measurement=318 are related as another unique observation.
The most tidy data structure is thus Table 2 above: Each row represents an observation and each column represents one of the variables that constitutes each observation. Note however, that there is no one-single variable that is unique per observation – instead, it is the combination of variables that constitutes a unique identifier (in this case, ‘id’ and ‘test’ in combination). This is not necessarily a problem.
Table 1 may seem more intuitive, and probably easier to set up as a data entry interface, e.g. a spreadsheet. At first impression, it also has the benefit that each line includes a unique identifier (id). In reality however, this data structure stores some information (e.g. whether the test was ‘a’ and ‘b’) as column names, rather than as actual data in cells. Table 3 is even more problematic, not only does it store data in the column names, it also stores multiple data points in each cell, and data of different types (numeric vs text) at that.
Det siges ofte, at 80% af dataanalysen bruges på at rense og forberede data.
…et citat fra tidyverse-siden om tidy data.
Overvej forskellene mellem:
- Rå data
- Rensede data
- Tilrettelagte data
Rå data er den uforarbejdede version af data, som den er indsamlet gennem spørgeskemaer, maskinsensoraflæsninger osv. Inden for GDPR-reglernes rammer bør du altid gemme en version af de rå data i deres oprindelige form.
Rensede data er rå data, efter du har foretaget de minimale ændringer, der er nødvendige for at gøre data anvendelige. For eksempel kan det være at slette observationer, som er fejlbehæftede på grund af apparatfejl, tastefejl eller manglende variabler, samt at konvertere datatyper om nødvendigt. Processen, hvorved de rå data renses, bør være scriptet (kodet) for at sikre, at den er reproducerbar og dokumenteret.
De rensede data skal sandsynligvis tilrettes til en form (og et indhold), der er passende til specifikke analyser. For eksempel kan et tilrettet datasæt inkludere kun specifikke observationer af specifikke variabler, der er relevante for en given analyse, i et format der passer til denne analyse. Dette bør også scriptes for at sikre, at det er reproducerbart og dokumenteret.
Ovenstående link til tidyverse-siden om tidy data indeholder meget information om tidy data, men det centrale princip er, at for tidy data gælder det, at:
- Hver kolonne er en variabel.
- Hver række er en observation.
- Hver celle indeholder én enkelt værdi.
… dette forudsætter, at data opbevares i et rektangulært datasæt (eller tibble).
I de følgende tre faner kan du se tre eksempler på det samme datasæt, struktureret på forskellige måder. Undersøg hver af dem:
id | test_a | test_b | day_of_test_a | day_of_test_b |
---|---|---|---|---|
1 | 47 | 529 | WED | THU |
2 | 291 | 975 | THU | WED |
3 | 834 | 817 | THU | THU |
4 | 153 | 195 | WED | MON |
5 | 29 | 502 | WED | FRI |
6 | 272 | 249 | MON | WED |
7 | 192 | 747 | TUE | MON |
8 | 717 | 118 | TUE | WED |
9 | 416 | 996 | MON | THU |
10 | 8 | 959 | MON | THU |
id | test | day | measurement |
---|---|---|---|
1 | a | WED | 47 |
1 | b | THU | 529 |
2 | a | THU | 291 |
2 | b | WED | 975 |
3 | a | THU | 834 |
3 | b | THU | 817 |
4 | a | WED | 153 |
4 | b | MON | 195 |
5 | a | WED | 29 |
5 | b | FRI | 502 |
6 | a | MON | 272 |
6 | b | WED | 249 |
7 | a | TUE | 192 |
7 | b | MON | 747 |
8 | a | TUE | 717 |
8 | b | WED | 118 |
9 | a | MON | 416 |
9 | b | THU | 996 |
10 | a | MON | 8 |
10 | b | THU | 959 |
id | test_a | test_b |
---|---|---|
1 | 47,WED | 529,THU |
2 | 291,THU | 975,WED |
3 | 834,THU | 817,THU |
4 | 153,WED | 195,MON |
5 | 29,WED | 502,FRI |
6 | 272,MON | 249,WED |
7 | 192,TUE | 747,MON |
8 | 717,TUE | 118,WED |
9 | 416,MON | 996,THU |
10 | 8,MON | 959,THU |
Overvej de tre forskellige måder at strukturere data på i lyset af:
- Hver kolonne er en variabel.
- Hver række er en observation.
- Hver celle indeholder én enkelt værdi.
Hvilken af de tre strukturer/tabeller repræsenterer den mest tidy data-struktur?
Spørg dig selv, hvilke forskellige informationsenheder (dvs. datapunkter) udgør hver observation, og hvordan relationen mellem datapunkter specificerer en sådan observation.
Det er tydeligt, at hver af de numeriske værdier (målinger) repræsenterer data, men det samme gør ‘id’ og ‘ugedag’ samt test ‘a’ versus ‘b’.
Ud fra dataene ser det ud til, at hver id blev testet to gange (‘a’ og ‘b’), som fandt sted på forskellige ugedage.
Med andre ord repræsenterer ‘id’, ‘test’, ‘ugedag’ og ‘måling’ informationsenheder (datapunkter), der tilsammen udgør en observation, men de er relateret på en ikke-triviel måde:
For eksempel er dataene id=1, test=a, weekday=WED og measurement=47 relateret som en enkelt observation. Tilsvarende er dataene id=1, test=b, weekday=THU og measurement=529 relateret som en anden unik observation.
Den mest tidy datastruktur er således Table 2 ovenfor: Hver række repræsenterer en observation, og hver kolonne repræsenterer en af de variabler, der udgør hver observation. Bemærk dog, at der ikke er én enkelt variabel, der er unik per observation – i stedet er det kombinationen af variabler, der udgør en unik identifikator (i dette tilfælde ‘id’ og ‘test’ i kombination). Dette er ikke nødvendigvis et problem.
Table 1 kan virke mere intuitiv og sandsynligvis lettere at opsætte som en dataindtastningsgrænseflade, fx et regneark. Ved første øjekast har den også fordelen af, at hver linje inkluderer en unik identifikator (id). I virkeligheden gemmer denne datastruktur dog noget information (fx om testen var ‘a’ og ‘b’) som kolonnenavne, snarere end som faktiske data i celler. Table 3 er endnu mere problematisk, da den ikke kun gemmer data i kolonnenavne, men også gemmer flere datapunkter i hver celle og data af forskellige typer (numerisk vs. tekst).