Introduction
In the previous few weeks I have been trying to learn about Mendelian Randomisation, which has resulted in me doing a whistle-stop tour around genetics, codons, and genetic-wide association studies (GWAS). This post is a collection of my notes and thoughts.1
1 Please let me know if there are any mistakes or misconceptions
Genetics for dummys
Before diving in, it’s worth remarking that the field of genetics was created before anyone had any idea about the structure of a cell, what DNA looked like, and what chromosomes were. This means the language around this topic can be somewhat confusing for a newcomer, as old terms are still floating around, and overlapping with newer more precise terms.
Let’s start by describing the basic mechanics of genetic material and cells. In humans, most cell contains DNA, which is packaged in 23 chromosomes2. DNA is a long code created from four nucleotides – adenine (A), guanine (G), cytosine (C), and thymine (T) – which are paired with each other (AT and CG) to create base pairs. Parts of the DNA do a specific job, and are known as genes. These genes either code for proteins, or help control other genes. Genes are small sections of DNA that have the ability to be copied into an RNA sequence, which can then encode for proteins (or perform other tasks).
2 by being tightly coiled around proteins known as histones
3 The full table showing these combinations can be seen here
A single-nucleotide polymorphism (SNP) is when there is a substitution in a single nucleotide. A SNP in a gene can create differences within populations, for example in the ability to metabolise alcohol. Most of the time, however, they do not create any differences, as when the RNA from genes are read, they are done in codons. A codon is a set of three base pairs, and therefore there are \(4^3 = 64\) different combinations of the four nucleotides (\(61\) of which specify amino acids). However, as there are only \(20\) amino acids, there are many codons with different base pairs that produce the same amino acids. For example, UCU, UCC, UCA, UCG, AGU, AGC all produce the Serine amino acid.3
A short history
Lets start with Gregor Mendel – the “father of modern genetics” and well known for his experiments into the inheritance of pea plants. I won’t go into detail on him, as Wikipedia can do that for me. However, two quick points to make:
- The Laws of Inheritance are important pre-requisite conditions for Mendelian Randomisation to work.
- The genius of Mendel was his ability to understand that the underlying process was random, and he was only observing a realisation of this random process.
The double-helix structure of DNA was discovered by Rosalind Franklin, James Watson and Francis Crick in 1953. This knowledge of the structure allowed the further discoveries of how proteins are translated from RNA, which is transcribed by DNA.
Biobanks
In the last twenty years or so, large biobanks have been created. These store the genetic and health information of individuals, either at a single point in time or with a temporal aspect. Two examples of large biobanks are:
- The UK Biobank has information on around half-a-million individuals aged between 40 and 69.
- The Avon Longitudinal Study of Parents and Children. Here, more than 14,000 pregnant women were recruited into the study in 1991/92, with their children, and grandchildren, being followed up in detail.
One important caveat for these biobanks when using them for research is that they are not representative of the population of the UK. For example, it is known that the individuals in the UK Biobank are more likely to be in a higher social-economic bracket, with fewer lung cancers and other health issues than the general population. This makes performing analysis and obtaining generalisable result (using GWAS4 and Mendelian Randomisation) more difficult.
4 genome-wise association studies
Genome-wise associated studies
Genome-wide association studies (GWAS) attempt to find associations between genetic variants and phenotypes. The genetic variants considered are usually SNPs, as opposed to indels5. As biobank sequencing is normally short-read (i.e. only a small section of DNA is read at a time) and the location of these short-reads are random,6 it is trickier to notice indels compared to SNPs. It is also worth noting here that when an individual’s genome is sequenced, the result isn’t the sequence of DNA in a specific cell, but instead is an average. This means that the SNPs in the sequences obtained were present in most of the cells within the individual so most likely have been there since the very early cell division stages. Therefore, if certain SNPs are associated with phenotypes, then the relationship can be thought of a “lifetime” association.
6 I am not sure if they are truly random along the whole of the DNA - it might just be that there is some uncertainty in the exact location
Mendelian Randomisation
Citation
@online{smith2025,
author = {Smith, Paul},
title = {A {First} {Foray} into {Genetics,} {GWAS,} and {Mendelian}
{Randomisation}},
date = {2025-01-11},
url = {https://pws3141.github.io/blog/posts/02-mandelian_randomisation/},
langid = {en}
}