Introduction
This Bioinformatics Base Basics post introduces the fundamental concepts of biology for beginners. To keep it accessible, some simplifications have been made, so treat it as a jumping-off point for further reading. One good resource is ‘Molecular Biology of the Cell’ by Alberts et al., a textbook commonly used in undergraduate courses, which is available on the Internet Archive as a PDF. If you prefer popular science and want to know more about these topics, including the history of the science that underpins this knowledge, you may enjoy ‘The Song of the Cell’ by Siddhartha Mukherjee.
Life on Earth
The fundamental unit of life is the cell. Some organisms consist of a single cell (such as bacteria) and are called unicellular, whereas others consist of many cells (such as plants and animals) and are called multicellular. The cells of a multicellular organism typically differ from each other in how they look and behave, because each type is specialised for a particular function that helps the organism survive. Cells use resources from their environment to produce copies of themselves, so they are dynamic and able to respond to that environment. To produce copies of themselves, they pass through the cell cycle: a sequential series of events that ends with one cell dividing into two daughter cells. In cell biology, the word ‘growth’ typically refers to growth in population size rather than in the size of the individual organism, as you might expect. The cell is made of smaller components that each have their own function; these components work together to allow the cell to survive.
We still don’t fully understand how life arose, nor do we understand the precise relationships between different organisms. It is thought that life on Earth as we know it evolved once, meaning that all living organisms are related because they descend from a common ancestor. This conclusion rests on the fact that all living things share the genetic code and other particular quirks of biochemistry. To organise the evolutionary tree of life, we classify living organisms into distinct groups. A familiar example is the scientific name for humans, Homo sapiens, which denotes the species ‘sapiens’ within the genus Homo. The classification system includes multiple levels, but for the sake of brevity they won’t be covered in detail here. One category that everyone will be aware of is whether something can be called an ‘animal’; the scientific way to say this is that the organism belongs to the Kingdom ‘Animalia’.
Kingdom is the second-highest taxonomic rank, with Domain being the highest. At the time of writing, it is debated whether there are three domains (Archaea, Bacteria and Eukaryota) or two (Bacteria, and Archaea with Eukaryota as a subgroup). Plants, animals and fungi all belong to Eukaryota. This matters because the domains differ in large and crucial ways in the features of their cells; consider how different plants and animals are, and yet they are part of the same domain. Understanding the relationships between organisms is not a trivial matter, and it has real-world implications: understanding how organisms change and adapt over time can aid in understanding how life functions.
The most notable difference between organisms is the presence or absence of membrane-bound organelles which, as the name implies, are similar to organs in that they are specialised compartments that perform specific tasks. One example is the nucleus, which stores the genetic material (DNA), and another is the mitochondrion, often described in memes as the ‘powerhouse of the cell’. The defining feature of Eukaryotes is the presence of a nucleus, whereas the defining feature of Bacteria is its absence. Archaea, however, are not so easily classified: under the two-domain view, organisms with and without a nucleus both belong to this rather unusual group. Historically, only organisms without a nucleus were placed in Archaea, but that is no longer necessarily the case. This highlights a major point that you must always keep in mind: biology is messy. There are lots of exceptions to any particular ‘rule’, and as a result it’s very difficult to make generalisations without adding a bunch of caveats. This ‘messiness’ also contributes to why it’s difficult to work out the precise relationships between species, as things can seem related but, upon further inspection, are not.
For example, organisms were historically classified according to phenotype (observable traits such as physical appearance). When technological advances allowed scientists to investigate the genetics of organisms, they found that some organisms with similar appearances are not closely related, while others that seem totally unrelated are more closely related than you would expect. The terms ‘genetics’ and ‘genetic code’ have been used several times so far in this post, but what exactly do they mean?
The Central Dogma
At the heart of biology lies the Central Dogma, a framework that describes how genetic information flows within a cell: DNA → RNA → Protein. It is important to note that this flow is not entirely unidirectional; for instance, reverse transcription (used by retroviruses) copies RNA back into DNA, and RNA editing can alter the message after it has been transcribed. Fundamentally, it is the control of this information flow that allows your body to produce different cell types (and therefore organs) from the same genetic information. To use an analogy, consider two people who buy the same computer: one uses it to play video games and the other uses it to edit videos. Although the underlying hardware is the same, different software is run on it to produce a different output.
DNA is the molecule of heredity (what the field of genetics studies) and is often referred to as a ‘blueprint’ for life (which is not really accurate, but it is a useful metaphor for now). It stores the instructions for proteins, which are the functional molecules responsible for nearly all cellular processes, from metabolism to signalling and structural support. The name DNA derives from the chemical groups that make up the molecule (Deoxyribonucleic Acid). The molecule is often likened to a twisted ladder. However, this ladder is actually split down the middle, and the two halves of each rung are held together by forces of attraction due to their chemistry. Also imagine that the ladder is built from small repeating units (e.g. |- -|), a bit like Lego bricks. These repeating units are termed ‘nucleotides’. Crucially, the rungs of this bizarre twisted Lego ladder are where the information is stored. Each rung is made of a pair of ‘bases’, and there are four bases: A (Adenine), T (Thymine), C (Cytosine) and G (Guanine).
A genome can contain millions or even billions of these bases. The two halves of the ladder are held together by the complementary nature of the bases, i.e. A always pairs with T and C always pairs with G. These bases are what contain the genetic code of life, the specifics of which we will get to shortly. It’s important to note that although a specific region of the genome (a gene) codes for (contains the instructions for) a specific protein, not all regions code for protein products; some instead have regulatory functions. This topic will be covered in a later article, but for now just remember that not every base of the genome encodes a protein. To distinguish whether we are referring to a gene or its product, there are gene nomenclature conventions that you should follow. The overall size of the genome and the ratio of coding to non-coding regions vary between species, and neither correlates neatly with the complexity of the organism.
When a protein is required by the cell, the instructions contained within the DNA are used to create it. This process is very complex and can vary widely depending on the species. However, there are two key steps to know: transcription and translation (writing down a message in its original ‘language’, then translating it into another, the language of proteins). In the first step, the DNA is ‘transcribed’ into a specific type of RNA (Ribonucleic Acid) called messenger RNA (mRNA). RNA uses the same bases as DNA except that U (Uracil) takes the place of T. In the second step, the mRNA is ‘translated’ into a protein product, which is built by joining amino acids together. But how do cells ‘know’ which amino acid to use? That is where the ‘genetic code’ comes in. Each group of three mRNA bases, called a codon, corresponds to an amino acid (a few codons instead signal where the protein ends). This code is read by another type of RNA known as transfer RNA (tRNA), which recognises codons (e.g. ACG, AAU, AGA) through complementary base pairing.
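To make base pairing, transcription and translation more concrete, here is a minimal Python sketch. It rests on several assumptions: the DNA sequence is made up for illustration, the codon table is only a five-entry excerpt of the standard genetic code, and the function names are hypothetical rather than taken from any library. It also simplifies the biology by treating transcription as copying the coding strand with U in place of T, and by ignoring the fact that the two DNA strands run in opposite directions.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Excerpt of the standard genetic code (mRNA codon -> amino acid).
# Only the codons used below are listed; the full table has 64 entries.
CODON_TABLE = {
    "AUG": "Met",
    "GCU": "Ala",
    "AAA": "Lys",
    "UGG": "Trp",
    "UAA": "Stop",
}


def complement(strand):
    """Return the base-by-base partner strand (the other half of the ladder)."""
    return "".join(PAIRING[base] for base in strand)


def transcribe(coding_strand):
    """Simplified transcription: same letters as the coding strand, U replaces T."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases (one codon) at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


dna = "ATGGCTAAATGGTAA"  # made-up sequence, for illustration only
mrna = transcribe(dna)

print(complement(dna))  # TACCGATTTACCATT
print(mrna)             # AUGGCUAAAUGGUAA
print(translate(mrna))  # ['Met', 'Ala', 'Lys', 'Trp']
```

In practice you would use an established library rather than hand-written tables, but a lookup of this kind is the basic idea behind reading a sequence codon by codon.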
Crucially, each tRNA always carries the same amino acid, and therefore a specific sequence of DNA always codes for a specific amino acid sequence. Hence, a change in the DNA (a mutation) can change the protein product. A mutation in and of itself is not necessarily good or bad; it all depends on the context. For example, a mutation may increase the activity of a protein: this could improve the efficiency of the system, or it could lead to cancer (or both). Which proteins are expressed, and how many copies of each, is essentially what allows multicellular organisms to have the same genetics in every cell yet have distinct cell types. Once proteins are made, their activity can also be controlled by adding chemical groups (acting like a switch). These ‘post-translational modifications’ can have a variety of effects, including changing the degree to which a protein carries out a certain function, or changing which function it performs altogether.
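The link between a mutation and the protein product can be sketched in a few lines of Python. As before, the sequence is made up, the codon table is a small excerpt of the standard genetic code, and the helper names are hypothetical. One nuance the example brings out: because several codons can stand for the same amino acid, some single-base changes alter the protein while others leave it unchanged.

```python
# Excerpt of the standard genetic code (mRNA codon -> amino acid);
# only the codons needed for this example are included.
CODON_TABLE = {
    "AUG": "Met",
    "GAA": "Glu",
    "GAG": "Glu",
    "GUA": "Val",
    "UAA": "Stop",
}


def dna_to_protein(coding_dna):
    """Transcribe (T -> U), then translate codon by codon until a stop codon."""
    mrna = coding_dna.replace("T", "U")
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


def substitute(dna, position, new_base):
    """Return a copy of dna with the base at a 0-based position replaced."""
    return dna[:position] + new_base + dna[position + 1:]


original = "ATGGAATAA"                        # made-up sequence: Met, Glu, stop
changes_protein = substitute(original, 4, "T")  # codon GAA becomes GTA
same_protein = substitute(original, 5, "G")     # codon GAA becomes GAG

print(dna_to_protein(original))         # ['Met', 'Glu']
print(dna_to_protein(changes_protein))  # ['Met', 'Val']
print(dna_to_protein(same_protein))     # ['Met', 'Glu']
```

Whether a change like the second one matters biologically depends on the role that amino acid plays in the protein, which is the ‘context’ mentioned above.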
Summary
For bioinformaticians, understanding these fundamental concepts is important because they underpin many common tasks, such as identifying genes in raw DNA sequences, predicting RNA transcripts, and comparing gene expression levels. Moreover, disruptions to the processes outlined above are often the basis of disease, so these concepts are also crucial for interpreting biological data in both research and clinical contexts.

