I promised a while ago to provide more information about the miRBase website features associated with the new “high confidence” microRNA set, as described in our 2014 NAR paper.
First, some background. Here is a figure showing the growth of the number of sequences deposited in miRBase, and the number of papers in PubMed that contain the word “microRNA” in the title, keywords, or abstract (slightly out-of-date, from release 20, June 2013):
Since around 2007, the overwhelming majority of microRNAs deposited in miRBase have been predicted from small RNA deep sequencing experiments. Deep sequencing has become more and more available to more and more labs, and has been applied in more and more contexts. The upshot is that different groups have annotated microRNAs with different criteria and therefore different levels of stringency. miRBase has always been a community resource with a somewhat inclusive policy — acceptance of a manuscript describing novel microRNAs being the primary requirement for deposition of sequences in the database. (We have always pushed back on obvious problems — tRNA fragments, poor hairpin structures, etc.) However, because a single small RNA deep sequencing experiment can predict hundreds of novel microRNAs, there is a real danger that a small number of poorly performed analyses can swamp the bona fide microRNA gene set with dubious annotations. Indeed, some recent reports have claimed exactly this (Wang and Liu 2011; Hansen et al. 2011; Meng et al. 2012; Brown et al. 2013).
We know that, with sufficient read depth, the pattern of reads mapping to a putative microRNA locus can provide powerful evidence for the annotation of that locus as a microRNA. Drosha and Dicer (and DCL1) leave well-known and characteristic clues: 2 nt 3′ overhangs in the mature miR duplex, with more exact processing at the 5′ ends of the duplex sequences. To see these clues, we need enough sequencing depth to be sure that we will see reads representing both sequences in the mature microRNA duplex, remembering that one (the passenger strand or miR* sequence) is likely to be at much lower abundance than the other. Some groups have required this level of evidence to annotate novel loci, and some haven’t.
Since 2010, we’ve been collecting RNA deep sequencing experiments in miRBase, and hopefully you have already seen the views of these read data. The NAR paper describes how we are using the aggregated read data to automatically categorize a subset of microRNAs in miRBase as “high confidence”. The criteria we are using are described in detail in the paper. The exact details might change with time, but essentially, we’re looking for reads mapping to both sequences in the putative mature microRNA duplex, with approximately 2 nt 3′ overhangs, relatively consistent 5′ ends, and well-folded hairpin precursor sequences — no surprises I think.
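As a rough illustration of how criteria of this kind combine, here is a minimal Python sketch. The `LocusEvidence` structure, the function names and all of the threshold values are assumptions made for this example, not the miRBase implementation; the criteria actually applied are the ones described in the paper, and they may change.

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Per-locus summary of aggregated read data (hypothetical structure)."""
    reads_5p: int             # reads mapping to the 5' arm mature sequence
    reads_3p: int             # reads mapping to the 3' arm mature sequence
    overhang_5p: int          # 3' overhang (nt) of the 5' arm sequence in the duplex
    overhang_3p: int          # 3' overhang (nt) of the 3' arm sequence in the duplex
    start_fraction_5p: float  # fraction of 5' arm reads sharing the commonest 5' end
    start_fraction_3p: float  # same, for the 3' arm
    energy_per_nt: float      # predicted hairpin folding energy, kcal/mol per nt


# Illustrative thresholds: assumptions for this sketch, not miRBase's settings.
MIN_READS = 10
OVERHANG_RANGE = (0, 4)
MIN_START_FRACTION = 0.5
MAX_ENERGY_PER_NT = -0.2


def failed_criteria(ev):
    """Return the names of the criteria that this locus does not meet."""
    low, high = OVERHANG_RANGE
    checks = {
        "reads on both arms": min(ev.reads_5p, ev.reads_3p) >= MIN_READS,
        "3' overhangs": all(low <= o <= high
                            for o in (ev.overhang_5p, ev.overhang_3p)),
        "consistent 5' ends": min(ev.start_fraction_5p,
                                  ev.start_fraction_3p) >= MIN_START_FRACTION,
        "hairpin folding": ev.energy_per_nt <= MAX_ENERGY_PER_NT,
    }
    return [name for name, passed in checks.items() if not passed]


def is_high_confidence(ev):
    """A locus qualifies only if every criterion is met."""
    return not failed_criteria(ev)


# Made-up numbers: one well-supported locus, one with a barely seen passenger strand.
well_supported = LocusEvidence(1200, 45, 2, 2, 0.93, 0.71, -0.41)
one_armed = LocusEvidence(1200, 3, 2, 2, 0.93, 1.0, -0.41)

print(is_high_confidence(well_supported))  # True
print(failed_criteria(one_armed))          # ['reads on both arms']
```

Returning the list of failed criteria, rather than a bare yes or no, makes it easy to see why a given sequence misses out.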
Using these criteria, the first version of this analysis annotated 1761 “high confidence” microRNAs, across 38 organisms for which we have collected small RNAseq data in miRBase. We’ve already updated these data once, adding some more C. elegans and mouse datasets, and we now annotate 1828 high confidence sequences. Different species have very different proportions of their microRNA complement annotated as high confidence:
D. melanogaster is the leader, with around 60% of its sequences meeting all the criteria. There are several reasons for this, including that we’ve done a better job of incorporating fly deep seq datasets into miRBase (partly because we do lots of work in my lab on flies, and so we know about these datasets), and that the fly community seems to have been more conservative overall in their annotation of novel microRNAs. At the other end of the scale, less than 20% of rice and human microRNAs currently meet all the requirements for high confidence. That’s perhaps not surprising for rice, where we have only 6 RNAseq datasets incorporated (vs 50 for D. melanogaster). However, we have 81 datasets for human, so what’s the excuse here?
By far the most common reason that a sequence misses out on being called high confidence is too few reads (<10) mapping to one or both arms of the hairpin precursor. Many sequences look something like this:
Now mir-184 is pretty likely to be a real microRNA — it is conserved in other mammals, is seen in Ago pull-down datasets, and has been studied fairly extensively. But in neither mouse nor human does it currently have reads mapping to the passenger strand. The read data is consistent with it being a microRNA, but not, in itself, confirmatory. In human in particular, many groups have annotated low abundance microRNAs based on relatively few reads. Furthermore, these datasets are sometimes from very specific tissues or cell lines. In some cases, we haven’t yet pulled in the raw read data that was used in the original publications to annotate those sequences as microRNAs. Many sequences will be lifted into the high confidence set as we incorporate more RNAseq read data into miRBase. If you know of datasets (from GEO or SRA) that we are missing, then please let us know.
It is important to be clear that we are not claiming that 85% of human microRNAs in miRBase are not real — we simply don’t yet have enough evidence to label them as high confidence. That’s why you’ll see high confidence labels on the relevant entry pages, but you’ll notice there is no corresponding “low confidence” label. The main intention is to provide a subset of miRBase entries for purposes where you really only want to include sequences that you can be positive are real. Of course the read patterns sometimes do provide evidence that a given annotation is unlikely to be a bona fide microRNA. For example, the pattern of reads mapping to the annotated mmu-mir-1940 locus looks like this:
The reads from the 3′ arm of the hairpin do not have the usual expected consistency of the 5′ end, and the 5′ arm and 3′ arm reads do not pair together in the predicted hairpin structure. We have already been removing entries from the database where the pattern of reads is clearly not consistent with annotation as a microRNA, and we will continue to do so.
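One simple way to put a number on 5′ end consistency is the fraction of reads on an arm that share the most common start position. The sketch below assumes reads have already been collapsed into (start position, read count) pairs for a single arm; the function name, the input format and the example numbers are all invented for illustration and do not describe the miRBase pipeline.

```python
from collections import Counter


def five_prime_consistency(reads):
    """Return (most common 5' start, fraction of reads that begin there).

    `reads` is an iterable of (start_position, read_count) pairs for one arm.
    """
    totals = Counter()
    for start, count in reads:
        totals[start] += count
    total = sum(totals.values())
    if total == 0:
        return None, 0.0
    start, n = max(totals.items(), key=lambda item: item[1])
    return start, n / total


# Made-up examples: a tightly processed arm and a ragged one.
tight = [(12, 950), (13, 30), (11, 20)]
ragged = [(50, 10), (51, 12), (52, 9), (54, 11), (55, 8)]

print(five_prime_consistency(tight))   # (12, 0.95)
print(five_prime_consistency(ragged))  # (51, 0.24)
```

A tightly processed arm gives a fraction close to 1, while reads scattered across many start positions, as on the 3′ arm described above, give a low value.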
So, we can use deep sequencing data to address the challenge of rapid growth and variable annotation criteria. But the graph above highlights a second problem — the enormous rate of growth of the microRNA literature. We have previously written about trying to harness the knowledge and power of the community to provide textual information about specific microRNA families, using Wikipedia. That effort is going well, and again, hopefully you’ve seen some of the results (click the “Show Wikipedia entry” button on that page for more). The NAR paper also says something about this Wikipedia annotation, but we can also harness your expertise in making judgements about the confidence or otherwise of microRNA annotations themselves. We have therefore added a very simple interface to each entry page, where you can quickly and easily vote for whether you believe a particular microRNA is real or not — pick your favourite or least favourite microRNA and vote!
Here’s one a few people like so far:
This interface has only been active since the start of January, and we’ve already had more than 1100 votes. In response to these votes, we will in the future be able to manually promote and demote annotations into and out of the “high confidence” microRNA set, and to remove from the database completely any sequences for which there is sufficient evidence that they are clearly not microRNAs. If you have a little more time than that required to click a yes or no button, you can also leave us a short comment about why you like/dislike a microRNA annotation. (These comments are for miRBase curators, not for public consumption right now.)
We’re pretty excited about more engagement of the microRNA community to improve the database, and we hope you find these new features useful and informative. As usual, comments, abuse, suggestions welcome here or by email.