In many medical emergencies, such as a stroke, survival depends on fast diagnosis and treatment. But diagnosis may depend on a test that uses bulky, expensive equipment, such as the radiological imaging that serves as the “gold standard” stroke test. That test is impractical in the field, so a reliable portable test would be of great value. Data science offers a possible solution. Using the information embedded in a biological quantity known as gene expression, a model can classify whether a patient is currently undergoing a stroke. This blog post discusses the use of k-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) to isolate a small number of genes whose combined expression levels might indicate that a stroke is in progress. This can provide an alternative way to identify stroke victims, with lower equipment requirements than traditional radiological imaging.
Gene Expression and its Measurement via Microarrays
The Central Dogma of molecular biology states that DNA is transcribed into RNA, which is in turn translated into proteins. Sections of DNA that encode RNA are known as genes, and various cellular and environmental factors can turn different genes “on” or “off” (that is, cause them to start or stop coding). A gene that is actively coding is said to be expressed, and its expression level describes how much of its product (RNA and, ultimately, protein) is being made. The human body responds to an event such as cancer or a stroke by creating certain proteins. Detecting the RNA molecules used to create these proteins shows which genes were expressed during the event, and to what degree.
Data science becomes useful here. Knowing which genes a patient expresses, and at what level, generates a feature space for each patient. A predictive model can attempt to classify a patient’s condition as “stroke” or “no stroke,” with radiological imaging, as the gold-standard test, providing case labels for training and evaluation.
The DNA microarray, also known as a “DNA chip,” quantifies gene expression. DNA chips contain millions of pre-fabricated reference DNA sequences, known as probes, which represent the set of possible gene sequences to be detected. DNA strands are sequences of the bases A, T, G, and C, and each strand has an affinity to bind to a complementary strand: A always bonds with T, and G always bonds with C. DNA microarrays exploit this principle, known as DNA hybridization.
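The pairing rule behind hybridization can be illustrated with a minimal Python sketch. It assumes an idealized model in which a target fragment binds a probe only when it is the probe's exact reverse complement; real hybridization tolerates some mismatches and depends on temperature and chemistry. The function names and the probe sequence are hypothetical, chosen only for illustration.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the sequence that pairs with `strand`, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hybridizes(probe: str, target: str) -> bool:
    """Idealized check: the target binds only if it is the exact reverse complement."""
    return target == reverse_complement(probe)


probe = "ATGC"  # hypothetical probe sequence, not taken from a real chip
print(reverse_complement(probe))   # GCAT
print(hybridizes(probe, "GCAT"))   # True
print(hybridizes(probe, "GGAT"))   # False
```

The strand is reversed before complementing because the two strands of a DNA duplex run in opposite directions.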
This animation (from Sadava, et al., 2008) demonstrates the process of running a DNA chip; in the end, the process produces a heat map of genes and their expression levels. In Figure 1, each point on the chip represents a specific reference gene, and its color indicates which group of cells (control, experimental, or both) expressed that gene. The molecule of interest (DNA or RNA) is isolated from a blood sample, replicated, digested with specific enzymes, and fitted with fluorescent green and red tags (to distinguish between experimental and control groups). Through a series of mixing and washing steps, the digested pieces are exposed to the array of probes, where they bind to the probes that are complementary to them. The completed chip is then exposed to green and red laser light, and the resulting fluorescent emissions are measured. The resulting heat map depicts the expression levels of all the genes in the experimental and control samples.
Figure 1. DNA from cancerous tissue (dyed red), normal control tissue (green), or both (yellow) hybridizes with reference genes, indicating the genes expressed in abnormal cells. (Sadava, et al., 2008)
KNN Analysis and PCA with Microarray Data
KNN analysis classifies a sample according to the labeled samples that lie closest to it. Plotting the red (experimental) and green (control) points on a graph, where each dimension indicates the expression level of one gene, allows identification of clusters of points that belong to the experimental group and not the control group. To classify a new point, select the k labeled points that are nearest to it, where k is usually an odd integer so that a two-class vote cannot tie, and assign the class held by the majority of those neighbors. The KNN algorithm thereby creates an implicit decision boundary within the space, which assigns a given class to points found on one side of it, and a different class to points found on the other side.
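As a rough illustration of the majority-vote idea, here is a minimal KNN classifier in plain Python. The two-gene expression values and the function name `knn_predict` are made up for this sketch; it is not the implementation used in any of the cited work.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up expression levels for two hypothetical genes.
train_points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
                (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_labels = ["control", "control", "control",
                "stroke", "stroke", "stroke"]

print(knn_predict(train_points, train_labels, (2.9, 3.1)))  # stroke
print(knn_predict(train_points, train_labels, (1.0, 1.0)))  # control
```

Euclidean distance is used here for simplicity; other distance measures are possible, and in practice the features are often rescaled first so that no single gene dominates the distance.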
KNN creates decision boundaries across multiple dimensions, using all the genes (dimensions) in the training data. By experimenting with smaller subsets of genes, one can often find even more accurate classifications. (Note that this is unlike conventional methods like linear regression, where a full set of variables will always fit the training data at least as well as any of its subsets – though not necessarily the evaluation data.) The practical goal, then, is to find the smallest set of genes that classifies most accurately, because the fewer genes that need measurement, the faster a potential test could be. So, KNN is employed as a tool to search for the optimal experimental design.
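One way to picture this search is the sketch below, which scores every small subset of genes by leave-one-out KNN accuracy and keeps the smallest subset with the best score. This is an illustrative assumption about how such a search could be organized, not the procedure from any cited study; the data are invented, with gene 0 built to be informative and genes 1 and 2 built to be noise. Exhaustive enumeration like this is only feasible for a handful of genes.

```python
import math
from collections import Counter
from itertools import combinations


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k nearest training points."""
    nearest = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(samples, labels, genes, k=3):
    """Leave-one-out accuracy of KNN using only the gene columns in `genes`."""
    projected = [tuple(s[g] for g in genes) for s in samples]
    correct = 0
    for i, query in enumerate(projected):
        rest_points = projected[:i] + projected[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        correct += knn_predict(rest_points, rest_labels, query, k) == labels[i]
    return correct / len(samples)


def best_subset(samples, labels, max_size, k=3):
    """Try subsets from smallest to largest; ties keep the earlier (smaller) one."""
    best = (0.0, ())
    for size in range(1, max_size + 1):
        for genes in combinations(range(len(samples[0])), size):
            accuracy = loo_accuracy(samples, labels, genes, k)
            if accuracy > best[0]:
                best = (accuracy, genes)
    return best


# Made-up data: gene 0 separates the classes, genes 1 and 2 are noise.
samples = [(1.0, 5.0, 9.0), (1.2, 2.0, 1.0), (0.9, 8.0, 5.0), (1.1, 4.0, 2.0),
           (3.0, 5.5, 8.5), (3.2, 2.5, 1.5), (2.9, 7.5, 5.5), (3.1, 4.2, 2.4)]
labels = ["control"] * 4 + ["stroke"] * 4

print(best_subset(samples, labels, max_size=3))            # (1.0, (0,))
print(loo_accuracy(samples, labels, (0, 1, 2)) < 1.0)      # True
```

In this toy data the single informative gene classifies every held-out sample correctly, while using all three genes misclassifies at least one, because the noisy genes distort the distances.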
PCA re-maps the original space into new orthogonal dimensions (the principal components), each of which captures a decreasing share of the variance in the data. One can therefore keep only the first few principal components and drop the rest while retaining as much variance as possible; that is, compress the data to lower dimensions with as little loss of information as possible.
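To make the idea concrete, the following sketch computes only the first principal component in plain Python, using power iteration on the sample covariance matrix. The function name, the iteration count, and the two-gene data are assumptions made for illustration; a production analysis would normally use a numerical library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio, scores) for the top component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeated multiplication by the covariance matrix converges
    # to the eigenvector with the largest eigenvalue (assuming the start vector
    # is not orthogonal to it and the data are not constant).
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    total_variance = sum(cov[j][j] for j in range(d))
    # Each sample's score is its projection onto the component direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, eigenvalue / total_variance, scores


# Made-up expression levels for two strongly correlated hypothetical genes.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
direction, ratio, scores = first_principal_component(rows)
print(direction)  # roughly equal weights on both genes
print(ratio)      # close to 1: one dimension keeps almost all the variance
print(scores)     # one composite value per sample
```

Because the two toy genes rise and fall together, a single composite score per sample preserves nearly all of the variation, which is the sense in which PCA compresses the data.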
Gene Expression and Strokes
A technique proposed by O’Connell, et al. (2016) measures the levels of expression of ten specific genes to determine if a stroke could be happening. O’Connell’s group used KNN to evaluate which genes were being expressed together most frequently in both the stroke and non-stroke groups (Figure 2). Using PCA and keeping just the single dimension with the most variance (70%), they mapped the original data into a single expression metric to compare stroke and non-stroke groups (Figure 3).
Figure 2. KNN classifier helps determine the combined predictive power of a subset of genes. (O’Connell, et al., 2016)
Figure 3. PCA gives a composite expression score for each sample, allowing for a clear separation in predicted classes. (O’Connell, et al., 2016)
Using this method, the research team was able to identify, with a high degree of confidence, a group of ten genes whose combined expression enabled detection of patients undergoing a stroke. In fact, this group of ten showed almost mutually exclusive expression levels between the stroke and non-stroke groups. These results persisted in an independent validation cohort (95.6% accuracy, 92.3% sensitivity, and 100% specificity) (O’Connell, et al., 2016).
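For readers unfamiliar with these three measures, the sketch below shows how they are computed from true and predicted labels. The labels are a small invented example and do not reproduce the validation cohort from the study; the function name is hypothetical.

```python
def diagnostic_metrics(true_labels, predicted_labels, positive="stroke"):
    """Compute accuracy, sensitivity, and specificity for a two-class test."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),   # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),        # share of real strokes that were detected
        "specificity": tn / (tn + fp),        # share of non-strokes correctly cleared
    }


# Invented labels for eight hypothetical patients (not data from the study).
true_labels = ["stroke", "stroke", "stroke", "stroke",
               "control", "control", "control", "control"]
predicted = ["stroke", "stroke", "stroke", "control",
             "control", "control", "control", "control"]

print(diagnostic_metrics(true_labels, predicted))
# {'accuracy': 0.875, 'sensitivity': 0.75, 'specificity': 1.0}
```

A specificity of 100% means no non-stroke patient was flagged as a stroke, while a sensitivity below 100% means some real strokes were missed.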
Microarray technology has its limitations. For example, it takes time to reverse-transcribe RNA into the DNA from which it came, make a large number of copies of the sequence, digest and tag the sequence fragments, and so on. Therefore, methods like the one described here would require a significant amount of engineering to become a practical alternative to the bulkier radiological test.
Still, machine learning facilitates the rapid classification of a group of genes expressed during a stroke, possibly allowing for faster diagnosis and saving lives during critical moments. In this case, machine learning also served to limit the number of candidate genes to ten – a much more manageable number in the essential diagnosis phase.
This method provides an excellent example of how big data analytics serve to accelerate the diagnosis and treatment of life-threatening events, thereby improving life expectancy and quality.
- O’Connell, G., Petrone, A., Treadway, M., Tennant, C., Lucke-Wold, N., Chantler, P., & Barr, T. (2016). Machine-learning approach identifies a pattern of gene expression in peripheral blood that can accurately detect ischaemic stroke. npj Genomic Medicine, 1, 16038. doi:10.1038/npjgenmed.2016.38. Retrieved from: https://www.nature.com/articles/npjgenmed201638
- Sadava, et al. (2008). Life: The Science of Biology, Eighth Edition, Sinauer Associates, W. H. Freeman & Co., and Sumanas, Inc.