22. June 2020
A team of bioinformaticians from CEITEC Masaryk University, led by Panagiotis Alexiou, has recently designed a novel analytical tool called MuStARD. It uses a family of algorithms called Convolutional Neural Networks to learn patterns associated with user-defined sets of genomic regions, and can then scan large genomic areas for novel regions exhibiting similar characteristics. The tool identifies regions that produce small RNAs with high precision, even against an imbalanced and variable background. MuStARD's main advantage is its ability to identify functional elements across various species, which makes it easy to deploy and extend to a variety of genomic classification questions. This feature could be extremely helpful for identifying novel genes that have not been annotated before. The network architecture and training scheme, including free access to code and trained models, were published in June 2020 in the Nature Research journal Scientific Reports.
Genomic regions that encode small RNA genes exhibit characteristic patterns in their sequence, secondary structure, and evolutionary conservation. The researchers devised a new way to search for the specific small locations in the genome that are responsible for producing small-RNA molecules, such as microRNAs. Several types of small RNAs are encoded in cells, but not all of their members are known. For example, thousands of new human microRNAs have been discovered in the past few years, implying the existence of a trove of unknown small-RNA-producing sequences in the genome. Small RNAs play various roles in the regulation of development and disease.
The two main authors of this study, Brno-based bioinformaticians of Greek origin Georgios Georgakilas and Panagiotis Alexiou, managed to teach machines to learn from previously identified examples of specific locations and then scan large areas of the genome to find more similar locations with extreme precision. They used a family of algorithms called Convolutional Neural Networks, which are known for their ability to classify data based on learned patterns. The main advantage of this particular training philosophy is that it completes training tasks successfully even with an imbalanced and variable background.
Iterative Background Selection: How to Teach a Machine to Find a Needle in a Haystack
The key method in this biomedical application is a machine learning architecture called a Convolutional Neural Network. This technique takes in sequences of data, in this case genomic sequence, structure, and evolutionary conservation, and passes them through a series of layers that create increasingly abstract representations of the data. Imagine the machine learning agent looking for a specific type of needle, not in a haystack, but in a heap of recycled metal. It can ‘see’ the shape, the similarity to other needles, and so on. However, the heap contains not only needles and similar-looking objects such as screws, but also other pieces of scrap metal, like scrapped cars or refrigerators, which do not look like needles at all.
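To make the idea of a convolutional layer more concrete, here is a minimal Python sketch of a single filter sliding along a DNA sequence. It is a toy illustration only, not MuStARD's architecture: the filter is written by hand rather than learned, it looks at sequence alone (no structure or conservation), and the names `one_hot`, `conv1d`, `motif_kernel`, and `motif_score` are hypothetical.

```python
# Toy illustration: one hand-written filter, no training involved.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv1d(rows, kernel):
    """Slide a (width x 4) kernel along the sequence; one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(rows) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += rows[start + offset][channel] * kernel[offset][channel]
        scores.append(total)
    return scores

def relu(values):
    """Keep positive responses, set negative ones to zero."""
    return [max(0.0, v) for v in values]

def global_max_pool(values):
    """Summarise the whole sequence by its strongest response."""
    return max(values)

# Hypothetical filter for the motif "GAT": +1 per matching base, -1 per mismatch.
motif_kernel = [[1.0 if b == m else -1.0 for b in BASES] for m in "GAT"]

def motif_score(seq):
    """Encode, convolve, rectify, and pool: a one-filter 'network'."""
    return global_max_pool(relu(conv1d(one_hot(seq), motif_kernel)))

print(motif_score("ACGATTC"))  # contains GAT
print(motif_score("CCCCCC"))   # does not contain GAT
```

A real network stacks many such filters and learns their weights from examples, so that later layers respond to combinations of the simpler patterns found by earlier ones.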
A naïve approach would be to take a random sample of materials and learn their characteristics, which is costly and not as efficient as learning on small samples with ever-increasing difficulty. Maybe in the first round we would pick a random selection of background and train our machine by using that. However, for the second round, we will exclude the items that are too easy to identify, such as old cars and refrigerators, thus increasing the difficulty and ‘focusing’ our machine on more minute differences. This approach will naturally lead to several rounds of refinement until the machine is able to learn to recognise very small differences between very similar items.
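The round-by-round refinement described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published MuStARD training code: each item is reduced to a single made-up 'needle-likeness' number, the 'model' is just a threshold halfway between the class means, and an evenly spaced sample stands in for the random first-round draw so that the result is reproducible. All names and values are hypothetical.

```python
def train_threshold(positives, negatives):
    """Toy 'model': the midpoint between the two class means."""
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2.0

def false_positives(threshold, background):
    """Background items the model would wrongly call needles."""
    return [x for x in background if x > threshold]

def iterative_background_selection(positives, background, sample_size, rounds):
    # Round 1: an evenly spaced sample stands in for a random draw.
    step = len(background) // sample_size
    negatives = background[::step][:sample_size]
    history = []
    for _ in range(rounds):
        threshold = train_threshold(positives, negatives)
        mistakes = false_positives(threshold, background)
        history.append((threshold, len(mistakes)))
        if not mistakes:
            break
        # Next round: drop the easy background and retrain on the items
        # the model still confuses with needles, hardest first.
        negatives = sorted(mistakes, reverse=True)[:sample_size]
    return history

# Made-up data: needles near 1.0, cars and fridges near -4, screws near 0.6.
needles = [1.0 + 0.01 * i for i in range(-10, 11)]
easy_background = [-5.0 + 0.01 * i for i in range(200)]
hard_background = [0.5 + 0.001 * i for i in range(200)]
background = easy_background + hard_background

history = iterative_background_selection(needles, background, sample_size=40, rounds=3)
for round_number, (threshold, n_false) in enumerate(history, start=1):
    print(f"round {round_number}: threshold={threshold:.3f}, false positives={n_false}")
```

In this toy setting the first round places the threshold far too low, because the easy background dominates the sample and every screw is mistaken for a needle. Retraining on the screws alone moves the threshold between screws and needles.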
If we continue the 'junk pile' analogy, we could say that if we use the best state-of-the-art methods and want to retrieve half of the 'needles,' we would also falsely retrieve something else about once in every 8,000 items we evaluate. That does not sound like much, but if you consider a pile of billions of items, these false positives can 'pile up' quickly. With this method, by contrast, a false positive is retrieved less than once in every 200,000 items. This allows a larger part of the genome to be scanned without being overwhelmed by false hits.
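A quick back-of-the-envelope calculation shows how these rates play out at scale. The pile size of one billion items is an assumption chosen for illustration, and because the article gives the new rate as 'less than' 1 in 200,000, the second figure is an upper bound.

```python
def expected_false_positives(items_scanned, one_in):
    """Expected false hits at a rate of 1 false positive per `one_in` items."""
    return items_scanned // one_in

ITEMS = 1_000_000_000  # assumed pile size, for illustration only

baseline = expected_false_positives(ITEMS, 8_000)    # previous state of the art
improved = expected_false_positives(ITEMS, 200_000)  # upper bound for the new method

print(baseline)              # 125000
print(improved)              # 5000
print(baseline // improved)  # 25
```

Under these assumptions, the same scan yields at most a few thousand false hits instead of more than a hundred thousand, a reduction of at least 25-fold.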
“We used this training method, which we termed Iterative Background Selection, when developing our machine learning model, and, as our results confirmed, it improved the accuracy of our model beyond what was possible before,” explained Georgios Georgakilas, first author of the study. “The direct result is the development of a generic method for the identification of small RNA genomic locations based on examples, within the same species and even across species. This feature will be helpful for the genomic annotation of newly sequenced, but not previously annotated, genomes. Once a genome is sequenced, it needs to be annotated to make sense of it, and MuStARD is trained to do so,” added Panagiotis Alexiou, head of the bioinformatics group and corresponding author of this study.
Georgios Georgakilas and Panagiotis Alexiou were assisted by Andrea Grioni and Eliska Chalupova, PhD students from Masaryk University, as well as Konstantinos Liakos, a PhD student from the School of Engineering at the University of Thessaly in Greece. This research was supported by the Postdoc@MUNI grant, GACR grant, Brno PhD Talent grant, and Italian Cancer Association grant. The full publication is available in Scientific Reports, and the code and trained models are freely available online.
Author: Ester Jarour