Biological data and its diversity

Software for Biostatistics

Statistical computation has become far more practical thanks to the availability of computers and suitable software. Most statistical tests are now run on a computer because performing them by hand is tedious. Commonly used packages include MS Office Excel, GraphPad Prism, SPSS, NCSS, Instant, Dataplot, SigmaStat, GraphPad InStat, SYSTAT, GenStat, Minitab, SAS, Stata, and Sigma Graph Pad. Statistical methods are necessary to draw valid conclusions from data. Postgraduate students should be aware of the different types of data, the measures of central tendency, and the tests commonly used in biostatistics, so that they can apply these tests and analyze data themselves. This chapter provides background information and highlights the basic principles of statistical techniques and methods.
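As a small illustration of the measures of central tendency mentioned above, the following sketch computes the mean, median, and mode with Python's standard `statistics` module. The blood pressure readings and the variable names are invented for this example and do not come from any study.

```python
import statistics

# Hypothetical systolic blood pressure readings in mmHg (made-up values).
readings = [118, 122, 120, 130, 122, 125, 119]

mean_value = statistics.mean(readings)      # arithmetic average
median_value = statistics.median(readings)  # middle value of the sorted data
mode_value = statistics.mode(readings)      # most frequent value

print(f"mean = {mean_value:.2f} mmHg")
print(f"median = {median_value} mmHg")
print(f"mode = {mode_value} mmHg")
```

The same three quantities are available from the menus of most of the packages listed above; the point here is only to show how little computation each one involves.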

 

Biological data and its diversity

Biology grew intensively in the 20th century after the double-helix structure of deoxyribonucleic acid (DNA) was described in 1953. DNA is a long polymer made of repeating units called nucleotides (Saenger, 1984; Alberts, 2002). DNA was first identified and isolated by Friedrich Miescher, and its double-helix structure was first worked out by James Watson and Francis Crick using X-ray crystallographic data collected by Rosalind Franklin and Maurice Wilkins (Watson and Crick, 1953). These discoveries posed great challenges, some of which could not be solved by laboratory experiments alone. For example, the human genome contains about three billion nucleotides, of which only a small fraction actually forms genes. Parsing, searching, and organizing the three billion letters of human DNA are problems that computers are uniquely suited to handle.

The heterogeneity of biological data posed an immense challenge at the beginning of the 21st century, as biology became more data- and information-intensive. Biology in this century has become largely a matter of managing the variety and complexity of biological data types, which makes the use of computing technology and statistics inevitable. Biological information comes in many forms and types. For instance,

 

a) Sequence information: Sequence data are available as strings of letters. DNA and protein sequencing projects have generated enormous amounts of data. For example, the Human Genome Project, which was the biggest international project, was completed in 2003 and reported approximately 30,000 genes in humans (genome.gov). The genomes of other organisms, including yeast, chicken, fruit flies, mice, and several bacteria, have also been sequenced. Today, high-throughput next-generation sequencing (HT-NGS) is among the most discussed technologies in genomics research (Pareek et al., 2011). It can produce over 100 times more data than the most sophisticated earlier capillary sequencers based on the Sanger method, which amplifies the challenge of processing biological data into information.


b) Spatial information: Actual biological entities, from biomolecules and cells to organisms and ecosystems, carry spatial information. For example, the three-dimensional structure of a protein describes how its amino acids are arranged in space. Such structures are determined by experimental techniques such as X-ray crystallography or nuclear magnetic resonance (NMR).

c) Pattern information: Within the genome and proteome, biologically interesting entities are characterized by patterns, which are an important part of sequence analysis. For example, the genome contains patterns associated with genes, and patterns also exist within proteins. Sequence patterns that are widespread are likely to be significant for the organism, on the view that nature does not retain information that has no meaning or use.

 

d) Geometric information: A great deal of biological function and interaction depends on the relative shapes of the molecules interacting within the biological system. How two biomolecules interact depends directly on the geometrical shapes of the interacting partners. For example, the “docking” behavior of a ligand with a biomolecule at a potential binding site depends on the three-dimensional configuration of both molecules, which shows the significance of molecular structure data.
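The sequence, pattern, and spatial forms of information listed above can each be handled with very little code. The sketch below is a minimal illustration in Python, not a method taken from any particular tool: the DNA fragment, the motif, the atom coordinates, and the function names `base_counts`, `find_motif`, and `distance` are all hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter

# Hypothetical DNA fragment, invented for illustration only.
dna = "ATGCGTATAAAGGCTATAAACC"


def base_counts(seq):
    """Sequence information: count how often each letter occurs."""
    return Counter(seq)


def find_motif(seq, motif):
    """Pattern information: 0-based start positions of a motif.

    A lookahead is used so that overlapping occurrences are also found.
    """
    pattern = f"(?={re.escape(motif)})"
    return [m.start() for m in re.finditer(pattern, seq)]


def distance(a, b):
    """Spatial/geometric information: Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


counts = base_counts(dna)
positions = find_motif(dna, "TATAAA")

# Hypothetical coordinates (in angstroms) of two atoms in a structure.
atom_a = (0.0, 0.0, 0.0)
atom_b = (3.0, 4.0, 12.0)
separation = distance(atom_a, atom_b)

print("base counts:", dict(counts))
print("motif start positions:", positions)
print("distance between atoms:", separation)
```

Real genomes and protein structures are far larger and are usually read from dedicated file formats, but the operations stay the same in spirit: counting letters, searching for patterns, and measuring distances between coordinates.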

Apart from the forms mentioned above, biological information appears in several other forms. Some phenomena that vary periodically in space and time are described by scalar and vector data. Some information takes the form of images produced by electron and optical microscopes, including fluorescence microscopy used to identify expression. Some is high-dimensional data, such as that generated in systems biology, where the response of a biomolecule under the influence of several other similar biomolecules is treated as a data point and analyzed. Such systems ultimately exhibit cellular behaviors such as secretion, proliferation, and action potentials.

In all, the types of biological information described above call for computational and statistical treatment and inspire ways of understanding and processing biological data in order to interpret the underlying mechanisms of biology and its functions.