Monday, 20 April 2020

Insights From Human Genome Sequencing

It's two decades since the human genome was sequenced. What it revealed has changed our understanding of the human genome and allowed us to construct a phylogenetic tree of how we got here.

The human genome, like most other mammalian genomes, comprises about 3.2 billion base pairs. There are around 25,000 known genes. Most mammals have a similar number of DNA base pairs. Among non-mammals, the chicken has around a billion, while the Japanese pufferfish, Takifugu rubripes, is an outlier with only 400 million base pairs.

Only 5% of human DNA is transcribed, i.e. read into mRNA, and only 1.5% is translated into protein from exons. Vast tracts of DNA therefore have no apparent function. It is in this context that the pufferfish's remarkable genomic efficiency must be viewed, as it appears to have rid itself of most of its "junk" DNA through evolution.
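
To put those percentages into base pairs, here is a rough back-of-the-envelope sketch. It assumes a human genome of about 3.2 billion base pairs and uses the approximate percentages quoted above. The names (HUMAN_GENOME_BP, share_in_bp) are made up for illustration.

```python
# Rough arithmetic only: genome sizes and percentages are the approximate
# figures quoted in the post, not measured values.
HUMAN_GENOME_BP = 3_200_000_000      # ~3.2 billion base pairs (assumed)
PUFFERFISH_GENOME_BP = 400_000_000   # ~400 million base pairs


def share_in_bp(genome_bp: int, per_mille: int) -> int:
    """Base pairs covered by a share expressed in parts per thousand."""
    return genome_bp * per_mille // 1000


transcribed_bp = share_in_bp(HUMAN_GENOME_BP, 50)  # 5% transcribed
coding_bp = share_in_bp(HUMAN_GENOME_BP, 15)       # 1.5% translated

print(f"transcribed: ~{transcribed_bp // 1_000_000} million bp")
print(f"protein-coding: ~{coding_bp // 1_000_000} million bp")
print(f"human genome is ~{HUMAN_GENOME_BP // PUFFERFISH_GENOME_BP}x the pufferfish genome")
```

On these figures the protein-coding portion is only a few tens of millions of base pairs, far smaller than even the compact pufferfish genome.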

The need to preserve essential DNA leads to the remarkable similarities seen between organisms as diverse as yeast, a worm (Caenorhabditis elegans), a model plant (Arabidopsis thaliana) and mammals.

However, the "excess (noncoding) DNA" is not without its uses. It provides a window on millions of years of evolution. By comparing the genomic sequences of chicken and Homo sapiens, for example, one can tell that these two organisms diverged from their common ancestor 300 million years ago. A similar exercise tells you that we humans diverged from our nearest anthropoid relative, the chimpanzee, a mere 7 million years ago. We are slightly more distantly related to the gorilla, and even more distantly to the orangutan.

So what accounts for the remarkable redundancy of the human (and mammalian) genome? Most such redundancy can be explained by "repeats", which account for over 50% of the >3 billion base pairs. There are several types of such repeats, namely transposons, simple sequence repeats and segmental duplications. We'll discuss each in turn.

By sheer volume alone, the most abundant of these repeats are the transposons, forming fully 44% of the human genome. Barbara McClintock was awarded the Nobel Prize in 1983 for her discovery of transposons, popularly known as "jumping genes". Transposons are parasitic DNA, which far predate the human species itself. There is evidence that their origins stretch back more than 300 million years.

There are 4 types of transposons: LINE, SINE, LTR retrotransposons, and DNA transposons. Of these, LINE is the most abundant and arguably the most successful, as it still accounts for roughly 1 in 250 mutations in the human genome. Very few SINEs are still active in the genome, and the LTR retrotransposons and DNA transposons have died out for all practical purposes, with the solitary exception of HERV-K among the former.

LINEs, or Long Interspersed Nuclear Elements, constitute 21% of the human genome. Such DNA is "autonomous", which means that it codes for the proteins it needs for its own propagation. A LINE contains 2 open reading frames (the equivalent of exons), including one for a reverse transcriptase. Unlike in retroviruses and the related LTR retrotransposons, reverse transcription takes place in the nucleus, where an endonuclease makes a single-stranded nick in the human DNA to insert the reverse-transcribed LINE DNA. This process starts at the 3' end and proceeds towards the 5' end. However, it is often incomplete, i.e. in many cases it doesn't reach the 5' end. LINE-derived RNA has a poly-A tail at the 3' untranslated region, which in eukaryotes protects the mRNA from degradation once transcribed. The tail has exactly the opposite function in prokaryotes.

Misplaced insertion of LINE elements has been associated with diverse human diseases such as dementia and cancers. Perhaps the most interesting example, described by Kazazian, was a woman in whom a LINE transposon from chromosome 22 had inserted into the middle of the Factor VIII gene on her X chromosome, a fact uncovered after her son was born with Haemophilia A despite an absence of family history.

SINEs, or Short Interspersed Nuclear Elements, are non-autonomous, unlike LINEs. They do not code for protein, and in fact depend on LINEs for the proteins needed for transposition. As such, they are more vulnerable to mutational loss. For example, when LINE2 died out roughly 50 million years ago, so did the associated SINE.

While most SINEs are now no longer functional and therefore cannot propagate, there are two, Alu and SVA, which remain functional.

Interestingly, SINEs and LINEs are located in different parts of the genome. While SINEs favour GC rich areas, LINEs are located in more AT rich areas. GC rich areas have a higher gene density, while AT rich areas are "genetic deserts", i.e. they are dominated by noncoding, apparently nonfunctional DNA. Some authorities think that SINE elements have a symbiotic relationship with the host DNA, where they work by reducing the likelihood of harmful mutations.

The LTR (Long Terminal Repeat) retrotransposons are thought to be the predecessors of ancient retroviruses. Like the latter, they have LTRs at both ends and reproduce by reverse transcription from RNA in the cytoplasm (not the nucleus, unlike LINEs). As such, they code for gag and pol proteins, just like retroviruses. LTR retrotransposons have all but died out in the human genome. The only remaining retrotransposon, HERV-K, has no known function.

Similarly, DNA transposons, which contain inverted repeats at either end, and whose description in maize led to the award of McClintock's Nobel Prize nearly 40 years later, are no longer functional in the human genome. They do remain functional in bacteria though, where they are responsible for horizontal transmission of antibiotic resistance. As DNA transposons cannot spread horizontally between human beings, they became nonfunctional in the human genome.

The human genome is remarkably repeat rich with interspersed transposons, in comparison with the yeast or invertebrates. Furthermore, the transposons in the human genome are ancient, compared with their counterparts in these other organisms. Again, a direct comparison of these repeats between human beings and mouse shows that human repeats are much older. It seems therefore that Homo sapiens has kept hold of ancient repeats in comparison with other organisms including fellow mammals despite the fact that most of these repeats serve no discernible function. Wish we were all as efficient as the puffer fish!

Not all chromosomes in the human cell are equally ancient. The Y chromosome, for example, is a relatively "young" chromosome, with rapid turnover of repeats, i.e. the repeats on the Y chromosome are phylogenetically millions of years younger than those on other chromosomes.

Simple sequence repeats are tandem repeats of a short unit, such as the 2 or 3 base units AT and ATG, and they are polymorphic. This latter property, polymorphism, particularly in (CA)n, has been useful in establishing identity, paternity tests, etc. When the repeat unit is 1-13 bases long, these repeats are called microsatellites, while those with a unit of 14 bases or more are called minisatellites. For some reason, (CA)n polymorphisms are infrequent on the X chromosome, i.e. most X chromosomes have roughly equal numbers of CA repeats.
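
As a rough illustration of how (CA)n polymorphism can tell two alleles apart, the sketch below counts the copy number n of each uninterrupted CA run in a sequence. The function name ca_repeat_lengths and the two toy alleles are invented for this example. Real genotyping works on PCR fragment lengths or sequencing reads rather than on tidy strings like these.

```python
import re
from typing import List


def ca_repeat_lengths(sequence: str, unit: str = "CA", min_copies: int = 2) -> List[int]:
    """Return the copy number n of every uninterrupted (unit)n run
    that has at least min_copies copies."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence.upper())]


# Two made-up alleles of the same locus, differing only in the number of CA copies.
allele_a = "GGT" + "CA" * 12 + "TTAG"
allele_b = "GGT" + "CA" * 17 + "TTAG"

print(ca_repeat_lengths(allele_a))  # copy numbers found in allele A
print(ca_repeat_lengths(allele_b))  # copy numbers found in allele B
```

Two people (or two chromosomes) carrying different values of n at the same locus can be distinguished by this difference, which is the basis of the identity and paternity uses mentioned above.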

Segmental duplications involve duplications of 1-250 kb. For some reason, they tend to favour pericentromeric and telomeric regions. They can be intrachromosomal or interchromosomal. When they are intrachromosomal, they are called Low Copy Repeats (LCRs). Intrachromosomal segmental duplications lead to deletion or duplication during crossover, and thus to contiguous gene syndromes such as CMT 1A (due to duplication of PMP22). Similarly, they can lead to microdeletion syndromes such as DiGeorge and velocardiofacial syndrome and Williams-Beuren syndrome.

LCRs are ubiquitous and can lead to problems with accurate genetic mapping with short reads, leading to gaps in the mapped genome.

Interchromosomal segmental duplication can lead to the spread of a disease-causing sequence to other chromosomes. The most notable example of this is the duplication of the adrenoleukodystrophy locus from Xq28 to the pericentromeric regions of chromosomes 2, 10, 16 and 22. Many inter- and intrachromosomal duplications involve the X chromosome.

During meiosis, crossover occurs. There are 2 structural observations that are relevant here. First, crossovers tend to affect the short arm of chromosomes far more than the long arm. Secondly, meiotic crossover is less common close to centromeres, and increases in the terminal 20-35 Mb section of the chromosome.

The unit for measuring "closeness" or linkage of loci is the centiMorgan, or cM. The closer the loci are, the lower the likelihood of crossover at meiosis. When two loci are separated by 1 cM, that equates to a 1% chance that they will be separated by crossover at meiosis. The rate of crossing over along a stretch of chromosome is expressed as cM/Mb. The most crossover-prone part of the human genome resides in the short arms of chromosomes X and Y. Thus, two genes located in Xp or Yp have an almost 100% chance of being separated at crossover.
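
The two quantities above (cM as a chance of separation, and cM/Mb as a rate) can be sketched in a few lines. This assumes the simple linear reading used in the text, which is only a good approximation for closely spaced loci. The function names and the example numbers are invented for illustration.

```python
def recombination_probability(distance_cm: float) -> float:
    """Approximate chance that two loci are separated at meiosis.

    Uses the linear rule 1 cM = 1%, which is an assumption that only
    holds well for short genetic distances.
    """
    return distance_cm / 100.0


def recombination_rate(genetic_cm: float, physical_bp: int) -> float:
    """Recombination rate in cM per Mb for a stretch of chromosome."""
    return genetic_cm / (physical_bp / 1_000_000)


# Hypothetical example: two loci 3 cM apart that sit 2 Mb apart physically.
print(recombination_probability(3.0))      # chance of separation per meiosis
print(recombination_rate(3.0, 2_000_000))  # cM/Mb over that stretch
```

A region with a high cM/Mb value is one where a given physical distance sees more crossovers, which is the sense in which the distal parts of chromosomes are more crossover prone than the regions near the centromere.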

Common though the repeats above are, there are certain portions of the genome which they leave alone as being almost sacred. These regions have very few repeats. In mammals, 4 such regions are the homeobox clusters A, B, C, and D, known as HoxA, HoxB, HoxC, and HoxD. The homeoboxes are responsible for embryonic development along the antero-posterior axis, and it is thought that ontogenically, mammals will not tolerate any disruption of this function by the interposition of repeats. The same however does not apply to reptiles, which have many repeats in their Hox regions and display a remarkable variety of species, perhaps due to the variation caused by these repeats during embryonic development. The remarkable speciation found in Anolis lizards is a good example of this phenomenon.

Not all parts of the human genome are equally rich in GC or AT. In fact, GC pairs only constitute 41% of the human genome, and AT pairs make up the other 59%. It is thought that over time (millions of years), there is steady mutational erosion of GC, which is gradually replaced by AT. This is of some importance, as GC pairings are remarkably over-represented in gene rich regions, i.e. they appear in areas of high gene density. This is not to be confused with the density seen on Giemsa staining, called G bands. GC rich areas correspond to lighter G bands, while AT rich areas have denser G bands, i.e. the exact opposite of gene density.

What causes GC rich areas to be more gene dense? This is almost completely attributed to much shorter intron lengths in GC rich areas. The length of exons and exon numbers are relatively invariant between GC rich and AT rich areas.

A CpG dinucleotide consists of a cytosine linked to a guanine through a phosphodiester bond in the 5'-3' direction from C to G (that is to say, CpG is not the same as GpC). If we go by the relative frequency of cytosine and guanine bases, 21% each, then the frequency of CpG dinucleotides in the human genome should be 0.21*0.21, or around 4%. In actuality, the frequency of CpG dinucleotides is only a fifth of this.
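
The expected-versus-observed arithmetic can be sketched as follows. The expected value simply multiplies the two base frequencies, as in the text. The observed value counts C-followed-by-G steps in a sequence. The function names and the toy sequence are made up for illustration, and a real analysis would run over whole chromosomes.

```python
def expected_cpg_frequency(c_freq: float, g_freq: float) -> float:
    """Expected CpG dinucleotide frequency if C and G occurred independently."""
    return c_freq * g_freq


def observed_cpg_frequency(sequence: str) -> float:
    """Fraction of adjacent base pairs (read 5'->3') that are C followed by G."""
    seq = sequence.upper()
    if len(seq) < 2:
        return 0.0
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return hits / (len(seq) - 1)


expected = expected_cpg_frequency(0.21, 0.21)
print(f"expected CpG frequency: {expected:.4f}")      # about 4%
print(f"one fifth of expected:  {expected / 5:.4f}")  # roughly what is seen

# Toy sequence only, to show the counting; not real genomic data.
print(observed_cpg_frequency("TTACGGATTACCAGT"))
```

Dividing the observed frequency by the expected one gives an observed/expected ratio, and a ratio of about 0.2 corresponds to the one-fifth figure quoted above.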

This remarkable finding is explained by the fact that a large proportion of cytosine bases in H. sapiens are methylated. Methylated cytosines spontaneously deaminate to thymine, which is a normal DNA base and so can escape repair. Unmethylated cytosine bases also deaminate spontaneously, to uracil, which, being foreign to DNA, is quickly corrected back to cytosine.

There may be an element of self-preservation in the fact that humans have methylated CpG dinucleotides. The opposite applies to bacteria and viruses, which have hypomethylated DNA. When bacteria or viruses invade the human cell, TLR9 detects them through the fact that their CpG is unmethylated and thus activates the innate immune system. For viruses, for example, this can lead to increased production of Type I interferons by plasmacytoid dendritic cells.

CpG islands, i.e. stretches where CpG dinucleotides are relatively dense, are over-represented in human beings in promoter regions at the transcription start (5') end of genes, and it is thought that they play a vital part in the function of these promoter regions. As expected, CpG islands occur in gene dense areas, just like GC base pairs.

Again, the human chromosomes differ in their content of CpG islands. The average number of CpG islands across all human chromosomes is 5-15 per Mb. The Y chromosome is relatively bereft, with only 2.9, while Chromosome 19 is an extreme outlier with 43 CpG islands per Mb.

Since there are 4 bases in RNA, the number of triplet codons on mRNA that can be made from these 4 bases is 4^3 or 64. As there are only 20 amino acids (21 if you include selenocysteine), there is redundancy here. However, redundancy is also reflected in the number of anticodons on tRNA, numbering only 46, due to the fact that the 1st RNA base on an anticodon, which corresponds with the 3rd RNA base on the codon, often shows "wobble". For 2-codon boxes (where the 3rd base on the codon could be one of 2 choices), this is seen when the 3rd base is either C or U. The corresponding 1st base on the anticodon in such cases could be either G or A. Asparagine, for example, has codons AAU and AAC. The cognate tRNA anticodon for asparagine could thus be either GUU or AUU. In reality, there are 33 genes which code for GUU and only one for AUU, thus reflecting both redundancy and codon preference.

In practice, when A is present as the first base on an anticodon, it is almost always post-transcriptionally deaminated to inosine. Thus AUU, in reality, becomes IUU.
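
The codon and anticodon bookkeeping above can be checked with a short sketch. It enumerates the 64 codons, derives the strict Watson-Crick anticodon (written 5' to 3') for each asparagine codon, and then lists which codons an anticodon can read when its first base is allowed to wobble. The wobble table here is deliberately limited to the standard unmodified bases and leaves out inosine. All names are invented for illustration.

```python
from itertools import product

RNA_BASES = "ACGU"
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Codon 3rd bases that each anticodon 1st base can pair with
# (standard wobble pairing for unmodified bases; inosine omitted).
WOBBLE = {"G": "CU", "C": "G", "U": "AG", "A": "U"}


def all_codons() -> list:
    """Every possible RNA triplet."""
    return ["".join(p) for p in product(RNA_BASES, repeat=3)]


def watson_crick_anticodon(codon: str) -> str:
    """Anticodon (5'->3') that pairs base-for-base with the codon."""
    return "".join(COMPLEMENT[b] for b in reversed(codon.upper()))


def codons_read_by(anticodon: str) -> list:
    """Codons (5'->3') an anticodon can read, allowing wobble at its 1st base."""
    first, second, third = anticodon.upper()
    stem = COMPLEMENT[third] + COMPLEMENT[second]  # codon bases 1 and 2
    return sorted(stem + wobble_base for wobble_base in WOBBLE[first])


print(len(all_codons()))              # 4^3 codons
print(watson_crick_anticodon("AAU"))  # strict partner of AAU
print(watson_crick_anticodon("AAC"))  # strict partner of AAC
print(codons_read_by("GUU"))          # both asparagine codons
```

The last line shows why a single GUU anticodon is enough for both asparagine codons, which is the redundancy the wobble position provides.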

