subject_id | subject_name | measurement | test_day | test | type |
---|---|---|---|---|---|
1 | John | 337,900 | TUE,TUE | a,b | Blood,Saliva |
2 | Billy | 403,429 | WED,MON | a,b | Blood,Saliva |
3 | Joan | 163,693 | WED,MON | a,b | Blood,Saliva |
4 | Clare | 202,496 | FRI,WED | a,b | Blood,Saliva |
5 | Peter | 996,957 | TUE,MON | a,b | Blood,Saliva |
6 | Maggy | 745,475 | TUE,FRI | a,b | Blood,Saliva |
7 | Jack | 507,637 | THU,WED | a,b | Blood,Saliva |
8 | Suzy | 45,186 | MON,THU | a,b | Blood,Saliva |
9 | Billy | 689,634 | FRI,MON | a,b | Blood,Saliva |
10 | Mary | 124,568 | FRI,FRI | a,b | Blood,Saliva |
Normalization of data
Levels of normalization of data structure…
Observing a tidy data format is sufficient for most purposes.
A more comprehensive way to approach data structure is called data normalization: “Normalization is a process that database designers use to eliminate data redundancy, improve data integrity, and enhance the overall efficiency of the database.” 1
The normalization of data has nothing to do with the normal (Gaussian) distribution.
Data normalization is not a case of either/or – data can be normalized to different levels, depending on your needs. In this text, we will deal only with levels 1 and 2.
1 First Normal Form – 1NF
The First Normal Form (of data normalization) is roughly similar to the Tidy data format and requires that:
- Each column represents one variable
- Each variable contains atomic values – i.e. the smallest (indivisible) unit of information
- Variables do not contain arrays of atomic values
- The data set does not contain repeating groups of similar variables
Look at the different examples of structures of the same data on the tabs below in light of the requirements for First Normal Form data listed above – and read the comment below each table:
More than one value is stored per cell in columns measurement
:type
. In other words, each cell consists of an array (or a series of) values, separated by a comma. This does not conform to 1NF.
subject_id | subject_name | test_a | test_b |
---|---|---|---|
1 | John | 337 (TUE : Blood) | 900 (TUE : Saliva) |
2 | Billy | 403 (WED : Blood) | 429 (MON : Saliva) |
3 | Joan | 163 (WED : Blood) | 693 (MON : Saliva) |
4 | Clare | 202 (FRI : Blood) | 496 (WED : Saliva) |
5 | Peter | 996 (TUE : Blood) | 957 (MON : Saliva) |
6 | Maggy | 745 (TUE : Blood) | 475 (FRI : Saliva) |
7 | Jack | 507 (THU : Blood) | 637 (WED : Saliva) |
8 | Suzy | 45 (MON : Blood) | 186 (THU : Saliva) |
9 | Billy | 689 (FRI : Blood) | 634 (MON : Saliva) |
10 | Mary | 124 (FRI : Blood) | 568 (FRI : Saliva) |
The data is not atomic, i.e. indivisible. The cells of columns test_a
and test_b
contain three different values (measurement, weekday, and type). This does not conform to 1NF.
subject_id | subject_name | test_a | test_b | day_of_test_a | day_of_test_b | test_a_type | test_b_type |
---|---|---|---|---|---|---|---|
1 | John | 337 | 900 | TUE | TUE | Blood | Saliva |
2 | Billy | 403 | 429 | WED | MON | Blood | Saliva |
3 | Joan | 163 | 693 | WED | MON | Blood | Saliva |
4 | Clare | 202 | 496 | FRI | WED | Blood | Saliva |
5 | Peter | 996 | 957 | TUE | MON | Blood | Saliva |
6 | Maggy | 745 | 475 | TUE | FRI | Blood | Saliva |
7 | Jack | 507 | 637 | THU | WED | Blood | Saliva |
8 | Suzy | 45 | 186 | MON | THU | Blood | Saliva |
9 | Billy | 689 | 634 | FRI | MON | Blood | Saliva |
10 | Mary | 124 | 568 | FRI | FRI | Blood | Saliva |
The data is atomic, and there is only one data point per cell. However, the data set contains repeating groups of similar variables (‘a’ and ‘b’ – look at the columns names). This does not conform to 1NF.
subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|
1 | John | a | Blood | TUE | 337 |
1 | John | b | Saliva | TUE | 900 |
2 | Billy | a | Blood | WED | 403 |
2 | Billy | b | Saliva | MON | 429 |
3 | Joan | a | Blood | WED | 163 |
3 | Joan | b | Saliva | MON | 693 |
4 | Clare | a | Blood | FRI | 202 |
4 | Clare | b | Saliva | WED | 496 |
5 | Peter | a | Blood | TUE | 996 |
5 | Peter | b | Saliva | MON | 957 |
6 | Maggy | a | Blood | TUE | 745 |
6 | Maggy | b | Saliva | FRI | 475 |
7 | Jack | a | Blood | THU | 507 |
7 | Jack | b | Saliva | WED | 637 |
8 | Suzy | a | Blood | MON | 45 |
8 | Suzy | b | Saliva | THU | 186 |
9 | Billy | a | Blood | FRI | 689 |
9 | Billy | b | Saliva | MON | 634 |
10 | Mary | a | Blood | FRI | 124 |
10 | Mary | b | Saliva | FRI | 568 |
This data conforms to the First Normal Form (1NF) data structure.
2 Second Normal Form – 2NF
The Second Normal Form requires that data conforms to the 1NF requirements and additionally:
- Data contains a single-column primary key.
- That all non-key variables ‘depend’ on the entire primary key.
..the meaning of this will become clearer when you work through examples A, B and C below.
subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|
1 | John | a | Blood | TUE | 337 |
1 | John | b | Saliva | TUE | 900 |
2 | Billy | a | Blood | WED | 403 |
2 | Billy | b | Saliva | MON | 429 |
3 | Joan | a | Blood | WED | 163 |
3 | Joan | b | Saliva | MON | 693 |
4 | Clare | a | Blood | FRI | 202 |
4 | Clare | b | Saliva | WED | 496 |
5 | Peter | a | Blood | TUE | 996 |
5 | Peter | b | Saliva | MON | 957 |
6 | Maggy | a | Blood | TUE | 745 |
6 | Maggy | b | Saliva | FRI | 475 |
7 | Jack | a | Blood | THU | 507 |
7 | Jack | b | Saliva | WED | 637 |
8 | Suzy | a | Blood | MON | 45 |
8 | Suzy | b | Saliva | THU | 186 |
9 | Billy | a | Blood | FRI | 689 |
9 | Billy | b | Saliva | MON | 634 |
10 | Mary | a | Blood | FRI | 124 |
10 | Mary | b | Saliva | FRI | 568 |
The data does not contain a single-column primary key. Each observation is unique and identifiable by the (compound) primary key subject_id
+ test_id
.
key | subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|---|
A | 1 | John | a | Blood | TUE | 337 |
B | 1 | John | b | Saliva | TUE | 900 |
C | 2 | Billy | a | Blood | WED | 403 |
D | 2 | Billy | b | Saliva | MON | 429 |
E | 3 | Joan | a | Blood | WED | 163 |
F | 3 | Joan | b | Saliva | MON | 693 |
G | 4 | Clare | a | Blood | FRI | 202 |
H | 4 | Clare | b | Saliva | WED | 496 |
I | 5 | Peter | a | Blood | TUE | 996 |
J | 5 | Peter | b | Saliva | MON | 957 |
K | 6 | Maggy | a | Blood | TUE | 745 |
L | 6 | Maggy | b | Saliva | FRI | 475 |
M | 7 | Jack | a | Blood | THU | 507 |
N | 7 | Jack | b | Saliva | WED | 637 |
O | 8 | Suzy | a | Blood | MON | 45 |
P | 8 | Suzy | b | Saliva | THU | 186 |
Q | 9 | Billy | a | Blood | FRI | 689 |
R | 9 | Billy | b | Saliva | MON | 634 |
S | 10 | Mary | a | Blood | FRI | 124 |
T | 10 | Mary | b | Saliva | FRI | 568 |
The data does contains a single-column primary key (key
), but all non-key variables do not ‘depend’ on the entire primary key. Specifically, the variable type
is contingent exclusively on test_id
. In this example, there is a very simple \(1:1\) relationship between test_id
and type
(test ‘a’ is always ‘Blood’, and ‘b’ always ‘Saliva’). If that is always the case, then type
is not dependent upon the unique identifier key
, but only the test_id
variable.
Data frame #1
key | subject_id | subject_name | test_id | day | measurement |
---|---|---|---|---|---|
A | 1 | John | a | TUE | 337 |
B | 1 | John | b | TUE | 900 |
C | 2 | Billy | a | WED | 403 |
D | 2 | Billy | b | MON | 429 |
E | 3 | Joan | a | WED | 163 |
F | 3 | Joan | b | MON | 693 |
G | 4 | Clare | a | FRI | 202 |
H | 4 | Clare | b | WED | 496 |
I | 5 | Peter | a | TUE | 996 |
J | 5 | Peter | b | MON | 957 |
K | 6 | Maggy | a | TUE | 745 |
L | 6 | Maggy | b | FRI | 475 |
M | 7 | Jack | a | THU | 507 |
N | 7 | Jack | b | WED | 637 |
O | 8 | Suzy | a | MON | 45 |
P | 8 | Suzy | b | THU | 186 |
Q | 9 | Billy | a | FRI | 689 |
R | 9 | Billy | b | MON | 634 |
S | 10 | Mary | a | FRI | 124 |
T | 10 | Mary | b | FRI | 568 |
Data frame #2
test_id | type |
---|---|
a | Blood |
b | Saliva |
This data structure conforms to the Second Normal Form (2NF) data structure. The data has been split into two separate tables, with no redundancy of information. The variable test_id
in the first, larger data set, is related to test_id
in the second, smaller data set. This structure makes data much easier to survey, when the amount of data grows large. Conversely, if data from different tables are needed for a given analysis, the data has to wrangled first (specifically merge with a join
command). Note that if there is not a fixed and invariable relationship between type
and test_id
, the data should not be split up into different tables, but remain in a structure similar to Example B.
3 Further Normal Forms
For most statistical analyses 2NF will suffice – in fact, you can probably get along fine with 1NF. If your data set is very large and complex, requiring a relational database system, look online for further details about data normalization - e.g. this link
At observere et tidy data-format er tilstrækkeligt til de fleste formål.
En mere omfattende måde at tilgå datastruktur på kaldes data-normalisering: “Normalisering er en proces, som databasespecialister bruger til at eliminere redundans, forbedre dataintegritet og øge databaseeffektiviteten.” 2
Normalisering af data har intet at gøre med den normale (Gaussiske) fordeling.
Data-normalisering er ikke et enten/eller-valg – data kan normaliseres til forskellige niveauer afhængigt af behov. I denne tekst vil vi kun beskæftige os med niveau 1 og 2.
4 Første Normalform – 1NF
Første normalform (1NF) i data-normalisering svarer nogenlunde til tidy data-formatet og kræver, at:
- Hver kolonne repræsenterer en variabel
- Hver variabel indeholder atomiske værdier – dvs. de mindste (uinddelbare) informationsenheder
- Variabler ikke indeholder arrays af atomiske værdier
- Datasættet ikke indeholder gentagende grupper af lignende variabler
Se på de forskellige eksempler på datastrukturer i fanerne nedenfor i forhold til kravene for data i første normalform – og læs kommentaren under hver tabel:
subject_id | subject_name | measurement | test_day | test | type |
---|---|---|---|---|---|
1 | John | 337,900 | TUE,TUE | a,b | Blood,Saliva |
2 | Billy | 403,429 | WED,MON | a,b | Blood,Saliva |
3 | Joan | 163,693 | WED,MON | a,b | Blood,Saliva |
4 | Clare | 202,496 | FRI,WED | a,b | Blood,Saliva |
5 | Peter | 996,957 | TUE,MON | a,b | Blood,Saliva |
6 | Maggy | 745,475 | TUE,FRI | a,b | Blood,Saliva |
7 | Jack | 507,637 | THU,WED | a,b | Blood,Saliva |
8 | Suzy | 45,186 | MON,THU | a,b | Blood,Saliva |
9 | Billy | 689,634 | FRI,MON | a,b | Blood,Saliva |
10 | Mary | 124,568 | FRI,FRI | a,b | Blood,Saliva |
Mere end én værdi er lagret pr. celle i kolonnerne measurement
:type
. Med andre ord, hver celle består af et array (eller en serie af) værdier, adskilt af komma. Dette er ikke i overensstemmelse med 1NF.
subject_id | subject_name | test_a | test_b |
---|---|---|---|
1 | John | 337 (TUE : Blood) | 900 (TUE : Saliva) |
2 | Billy | 403 (WED : Blood) | 429 (MON : Saliva) |
3 | Joan | 163 (WED : Blood) | 693 (MON : Saliva) |
4 | Clare | 202 (FRI : Blood) | 496 (WED : Saliva) |
5 | Peter | 996 (TUE : Blood) | 957 (MON : Saliva) |
6 | Maggy | 745 (TUE : Blood) | 475 (FRI : Saliva) |
7 | Jack | 507 (THU : Blood) | 637 (WED : Saliva) |
8 | Suzy | 45 (MON : Blood) | 186 (THU : Saliva) |
9 | Billy | 689 (FRI : Blood) | 634 (MON : Saliva) |
10 | Mary | 124 (FRI : Blood) | 568 (FRI : Saliva) |
Data er ikke atomisk, dvs. udelelig. Cellerne i kolonnerne test_a
og test_b
indeholder tre forskellige værdier (måling, ugedag og type). Dette er ikke i overensstemmelse med 1NF.
subject_id | subject_name | test_a | test_b | day_of_test_a | day_of_test_b | test_a_type | test_b_type |
---|---|---|---|---|---|---|---|
1 | John | 337 | 900 | TUE | TUE | Blood | Saliva |
2 | Billy | 403 | 429 | WED | MON | Blood | Saliva |
3 | Joan | 163 | 693 | WED | MON | Blood | Saliva |
4 | Clare | 202 | 496 | FRI | WED | Blood | Saliva |
5 | Peter | 996 | 957 | TUE | MON | Blood | Saliva |
6 | Maggy | 745 | 475 | TUE | FRI | Blood | Saliva |
7 | Jack | 507 | 637 | THU | WED | Blood | Saliva |
8 | Suzy | 45 | 186 | MON | THU | Blood | Saliva |
9 | Billy | 689 | 634 | FRI | MON | Blood | Saliva |
10 | Mary | 124 | 568 | FRI | FRI | Blood | Saliva |
Data er atomisk, og der er kun én datapunkt pr. celle. Dog indeholder datasættet gentagende grupper af lignende variabler (’a’ og ’b’ – se kolonnenavnene). Dette er ikke i overensstemmelse med 1NF.
subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|
1 | John | a | Blood | TUE | 337 |
1 | John | b | Saliva | TUE | 900 |
2 | Billy | a | Blood | WED | 403 |
2 | Billy | b | Saliva | MON | 429 |
3 | Joan | a | Blood | WED | 163 |
3 | Joan | b | Saliva | MON | 693 |
4 | Clare | a | Blood | FRI | 202 |
4 | Clare | b | Saliva | WED | 496 |
5 | Peter | a | Blood | TUE | 996 |
5 | Peter | b | Saliva | MON | 957 |
6 | Maggy | a | Blood | TUE | 745 |
6 | Maggy | b | Saliva | FRI | 475 |
7 | Jack | a | Blood | THU | 507 |
7 | Jack | b | Saliva | WED | 637 |
8 | Suzy | a | Blood | MON | 45 |
8 | Suzy | b | Saliva | THU | 186 |
9 | Billy | a | Blood | FRI | 689 |
9 | Billy | b | Saliva | MON | 634 |
10 | Mary | a | Blood | FRI | 124 |
10 | Mary | b | Saliva | FRI | 568 |
Dette data er i overensstemmelse med første normalform (1NF).
5 Anden Normalform – 2NF
Anden normalform kræver, at data opfylder 1NF-kravene og yderligere:
- Data indeholder en enkelt-kolonne primærnøgle.
- Alle ikke-nøgle-variabler afhænger af hele primærnøglen.
… betydningen af dette vil blive tydeligere, når du arbejder dig igennem eksemplerne A, B og C nedenfor.
subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|
1 | John | a | Blood | TUE | 337 |
1 | John | b | Saliva | TUE | 900 |
2 | Billy | a | Blood | WED | 403 |
2 | Billy | b | Saliva | MON | 429 |
3 | Joan | a | Blood | WED | 163 |
3 | Joan | b | Saliva | MON | 693 |
4 | Clare | a | Blood | FRI | 202 |
4 | Clare | b | Saliva | WED | 496 |
5 | Peter | a | Blood | TUE | 996 |
5 | Peter | b | Saliva | MON | 957 |
6 | Maggy | a | Blood | TUE | 745 |
6 | Maggy | b | Saliva | FRI | 475 |
7 | Jack | a | Blood | THU | 507 |
7 | Jack | b | Saliva | WED | 637 |
8 | Suzy | a | Blood | MON | 45 |
8 | Suzy | b | Saliva | THU | 186 |
9 | Billy | a | Blood | FRI | 689 |
9 | Billy | b | Saliva | MON | 634 |
10 | Mary | a | Blood | FRI | 124 |
10 | Mary | b | Saliva | FRI | 568 |
Data indeholder ikke en enkelt-kolonne primærnøgle. Hver observation er unik og identificerbar ved den (sammensatte) primærnøgle subject_id
+ test_id
.
key | subject_id | subject_name | test_id | type | day | measurement |
---|---|---|---|---|---|---|
A | 1 | John | a | Blood | TUE | 337 |
B | 1 | John | b | Saliva | TUE | 900 |
C | 2 | Billy | a | Blood | WED | 403 |
D | 2 | Billy | b | Saliva | MON | 429 |
E | 3 | Joan | a | Blood | WED | 163 |
F | 3 | Joan | b | Saliva | MON | 693 |
G | 4 | Clare | a | Blood | FRI | 202 |
H | 4 | Clare | b | Saliva | WED | 496 |
I | 5 | Peter | a | Blood | TUE | 996 |
J | 5 | Peter | b | Saliva | MON | 957 |
K | 6 | Maggy | a | Blood | TUE | 745 |
L | 6 | Maggy | b | Saliva | FRI | 475 |
M | 7 | Jack | a | Blood | THU | 507 |
N | 7 | Jack | b | Saliva | WED | 637 |
O | 8 | Suzy | a | Blood | MON | 45 |
P | 8 | Suzy | b | Saliva | THU | 186 |
Q | 9 | Billy | a | Blood | FRI | 689 |
R | 9 | Billy | b | Saliva | MON | 634 |
S | 10 | Mary | a | Blood | FRI | 124 |
T | 10 | Mary | b | Saliva | FRI | 568 |
Data indeholder en enkelt-kolonne primærnøgle (key
), men alle ikke-nøgle-variabler afhænger ikke af hele primærnøglen. Specifikt afhænger variablen type
udelukkende af test_id
. I dette eksempel er der et meget simpelt \(1:1\)-forhold mellem test_id
og type
(test ‘a’ er altid ‘Blod’, og ‘b’ altid ‘Spyt’). Hvis det altid er tilfældet, er type
ikke afhængig af primærnøglen key
men kun af test_id
.
Datasæt #1
key | subject_id | subject_name | test_id | day | measurement |
---|---|---|---|---|---|
A | 1 | John | a | TUE | 337 |
B | 1 | John | b | TUE | 900 |
C | 2 | Billy | a | WED | 403 |
D | 2 | Billy | b | MON | 429 |
E | 3 | Joan | a | WED | 163 |
F | 3 | Joan | b | MON | 693 |
G | 4 | Clare | a | FRI | 202 |
H | 4 | Clare | b | WED | 496 |
I | 5 | Peter | a | TUE | 996 |
J | 5 | Peter | b | MON | 957 |
K | 6 | Maggy | a | TUE | 745 |
L | 6 | Maggy | b | FRI | 475 |
M | 7 | Jack | a | THU | 507 |
N | 7 | Jack | b | WED | 637 |
O | 8 | Suzy | a | MON | 45 |
P | 8 | Suzy | b | THU | 186 |
Q | 9 | Billy | a | FRI | 689 |
R | 9 | Billy | b | MON | 634 |
S | 10 | Mary | a | FRI | 124 |
T | 10 | Mary | b | FRI | 568 |
Datasæt #2
test_id | type |
---|---|
a | Blood |
b | Saliva |
Denne datastruktur er i overensstemmelse med anden normalform (2NF). Data er delt op i to separate tabeller uden redundans af information. Variablen test_id
i det første, større datasæt, er relateret til test_id
i det andet, mindre datasæt. Denne struktur gør data meget lettere at overskue, når datamængden vokser sig stor. Omvendt skal data fra forskellige tabeller behandles først (specifikt ved at anvende en join
-kommando), hvis data fra flere tabeller er nødvendigt til en given analyse. Bemærk, at hvis der ikke er et fast og uforanderligt \(1:1\) forhold mellem test_id
og type
skal data ikke splittes op i to tabeller, men forblive i formen som Eksempel B.
6 Yderligere normalformer
Til de fleste statistiske analyser vil 2NF være tilstrækkeligt – faktisk kan du sandsynligvis klare dig fint med 1NF. Hvis dit datasæt er meget stort og komplekst og kræver et relationelt databasesystem, kan du finde yderligere detaljer om data-normalisering online – f.eks. denne side.