High confidence microRNAs

I promised a while ago to provide more information about the miRBase website features associated with the new “high confidence” microRNA set, as described in our 2014 NAR paper.

First, some background. Here is a figure showing the growth of the number of sequences deposited in miRBase, and the number of papers in PubMed that contain the word “microRNA” in the title, keywords, or abstract (slightly out-of-date, from release 20, June 2013):

Since around 2007, the overwhelming majority of microRNAs deposited in miRBase have been predicted from small RNA deep sequencing experiments. Deep sequencing has become more and more available to more and more labs, and has been applied in more and more contexts. The upshot is that different groups have annotated microRNAs with different criteria and therefore different levels of stringency. miRBase has always been a community resource with a somewhat inclusive policy — acceptance of a manuscript describing novel microRNAs being the primary requirement for deposition of sequences in the database. (We have always pushed back on obvious problems — tRNA fragments, poor hairpin structures, etc.) However, because a single small RNA deep sequencing experiment can predict hundreds of novel microRNAs, there is a real danger that a small number of poorly performed analyses can swamp the bona fide microRNA gene set with dubious annotations. Indeed, some recent reports have claimed exactly this (Wang and Liu 2011; Hansen et al. 2011; Meng et al. 2012; Brown et al. 2013).

We know that, with sufficient read depth, the pattern of reads mapping to a putative microRNA locus can provide powerful evidence for the annotation of that locus as a microRNA. Drosha and Dicer (and DCL1) leave well-known and characteristic clues: 2 nt 3′ overhangs in the mature miR duplex, with more exact processing at the 5′ ends of the duplex sequences. To see these clues, we need enough sequencing depth to be sure that we will see reads representing both sequences in the mature microRNA duplex, remembering that one (the passenger strand or miR* sequence) is likely to be at much lower abundance than the other. Some groups have required this level of evidence to annotate novel loci, and some haven’t.

Since 2010, we’ve been collecting RNA deep sequencing experiments in miRBase, and hopefully you have already seen the views of these read data. The NAR paper describes how we are using the aggregated read data to automatically categorize a subset of microRNAs in miRBase as “high confidence”. The criteria we are using are described in detail in the paper. The exact details might change with time, but essentially, we’re looking for reads mapping to both sequences in the putative mature microRNA duplex, with approximately 2 nt 3′ overhangs, relatively consistent 5′ ends, and well-folded hairpin precursor sequences — no surprises I think.
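To make two of those criteria concrete, here is a minimal Python sketch of how 5′ end consistency and the 3′ overhangs of a candidate duplex could be measured. It is an illustration under stated assumptions, not the miRBase pipeline: the function names, the 0-based inclusive coordinates, and the use of a dot-bracket string for the predicted hairpin are all hypothetical choices made for this example, and no thresholds from the paper are encoded here.

```python
from collections import Counter


def five_prime_consistency(read_starts):
    """Fraction of reads that share the most common 5' start position."""
    if not read_starts:
        return 0.0
    most_common_count = Counter(read_starts).most_common(1)[0][1]
    return most_common_count / len(read_starts)


def pair_table(dot_bracket):
    """Map each paired position to its partner in a dot-bracket structure."""
    stack, pairs = [], {}
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs


def three_prime_overhangs(dot_bracket, arm_5p, arm_3p):
    """Return (5p overhang, 3p overhang) in nt, or None if an arm's 5' end is unpaired.

    Arms are (start, end) tuples, 0-based and inclusive, on the hairpin.
    """
    pairs = pair_table(dot_bracket)
    (start_5p, end_5p), (start_3p, end_3p) = arm_5p, arm_3p
    if start_5p not in pairs or start_3p not in pairs:
        return None
    # The 3' end of each strand overhangs the partner of the other strand's 5' end.
    return end_5p - pairs[start_3p], end_3p - pairs[start_5p]


# Toy hairpin (an assumption for illustration): 24 paired positions
# on each side of a 4 nt terminal loop.
structure = "(" * 24 + "." * 4 + ")" * 24
print(five_prime_consistency([2, 2, 2, 3]))                 # 0.75
print(three_prime_overhangs(structure, (2, 23), (30, 51)))  # (2, 2)
```

In this toy case three of the four reads start at the same position, and both strands of the duplex end 2 nt beyond the partner of the other strand's 5′ end, which is the Drosha/Dicer signature described above. Real hairpins have bulges and mismatches, so a usable version would need to handle unpaired end positions more gracefully than returning None.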

Using these criteria, the first version of this analysis annotated 1761 “high confidence” microRNAs, across 38 organisms for which we have collected small RNAseq data in miRBase. We’ve already updated these data once, adding some more C. elegans and mouse datasets, and we now annotate 1828 high confidence sequences. Different species have very different proportions of their microRNA complement annotated as high confidence:

D. melanogaster is the leader, with around 60% of its sequences meeting all the criteria. There are several reasons for this, including that we’ve done a better job of incorporating fly deep seq datasets into miRBase (partly because we do lots of work in my lab on flies, and so we know about these datasets), and that the fly community seems to have been more conservative overall in their annotation of novel microRNAs. At the other end of the scale, less than 20% of rice and human microRNAs currently meet all the requirements for high confidence. That’s perhaps not surprising for rice, where we have only 6 RNAseq datasets incorporated (vs 50 for D. melanogaster). However, we have 81 datasets for human, so what’s the excuse here?

By far the most common reason that a sequence misses out on being called high confidence is insufficient reads (<10) mapping to both arms of the hairpin precursor. Many sequences look something like this:

Now mir-184 is pretty likely to be a real microRNA — it is conserved in other mammals, is seen in Ago pull-down datasets, and has been studied fairly extensively. But in neither mouse nor human does it currently have reads mapping to the passenger strand. The read data is consistent with it being a microRNA, but not, in itself, confirmatory. In human in particular, many groups have annotated low abundance microRNAs based on relatively few reads. Furthermore, these datasets are sometimes from very specific tissues or cell lines. In some cases, we haven’t yet pulled in the raw read data that was used in the original publications to annotate those sequences as microRNAs. Many sequences will be lifted into the high confidence set as we incorporate more RNAseq read data into miRBase. If you know of datasets (from GEO or SRA) that we are missing, then please let us know.
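The read-depth requirement is easy to sketch in code. The snippet below is only an illustration: the 10-read minimum comes from the text above, but the function names, the coordinate convention, and the simplification of assigning a read to an arm when it lies entirely on one side of the terminal loop are assumptions made for this example.

```python
def reads_per_arm(reads, loop_start, loop_end):
    """Count reads lying entirely 5' or entirely 3' of the terminal loop.

    reads: list of (start, end) tuples, 0-based inclusive, on the hairpin.
    """
    n_5p = sum(1 for start, end in reads if end < loop_start)
    n_3p = sum(1 for start, end in reads if start > loop_end)
    return n_5p, n_3p


def both_arms_supported(reads, loop_start, loop_end, min_reads=10):
    """True only if each arm has at least min_reads reads mapped to it."""
    n_5p, n_3p = reads_per_arm(reads, loop_start, loop_end)
    return n_5p >= min_reads and n_3p >= min_reads


# Made-up counts: an abundant 5' arm with little or no passenger strand evidence.
one_arm_only = [(2, 23)] * 500
print(both_arms_supported(one_arm_only, 24, 27))                       # False
print(both_arms_supported(one_arm_only + [(30, 51)] * 12, 24, 27))     # True
```

The first call mimics the mir-184 situation: hundreds of reads on one arm are not enough on their own, but the same locus passes once a modest number of reads for the other arm turns up in additional datasets.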

It is important to be clear that we are not claiming that 85% of human microRNAs in miRBase are not real — we simply don’t yet have enough evidence to label them as high confidence. That’s why you’ll see high confidence labels on the relevant entry pages, but you’ll notice there is no corresponding “low confidence” label. The main intention is to provide a subset of miRBase entries for purposes where you really only want to include sequences that you can be positive are real. Of course the read patterns sometimes do provide evidence that a given annotation is unlikely to be a bona fide microRNA. For example, the pattern of reads mapping to the annotated mmu-mir-1940 locus looks like this:

The reads from the 3′ arm of the hairpin do not have the usual expected consistency of the 5′ end, and the 5′ arm and 3′ arm reads do not pair together in the predicted hairpin structure. We have already been removing entries from the database where the pattern of reads is therefore not consistent with annotation as a microRNA, and we will continue to do so.

So, we can use deep sequencing data to address the challenge of rapid growth and variable annotation criteria. But the graph above highlights a second problem — the enormous rate of growth of the microRNA literature. We have previously written about trying to harness the knowledge and power of the community to provide textual information about specific microRNA families, using Wikipedia. That effort is going well, and again, hopefully you’ve seen some of the results (click the “Show Wikipedia entry” button on that page for more). The NAR paper also says something about this Wikipedia annotation, but we can also harness your expertise in making judgements about the confidence or otherwise of microRNA annotations themselves. We have therefore added a very simple interface to each entry page, where you can quickly and easily vote for whether you believe a particular microRNA is real or not — pick your favourite or least favourite microRNA and vote!

Here’s one a few people like so far:

This interface has only been active since the start of January, and we’ve already had more than 1100 votes. In response to these votes, we will in the future be able to manually promote and demote annotations into and out of the “high confidence” microRNA set, and to remove from the database completely any sequences for which there is sufficient evidence that they are clearly not microRNAs. If you have a little more time than that required to click a yes or no button, you can also leave us a short comment about why you like/dislike a microRNA annotation. (These comments are for miRBase curators, not for public consumption right now.)

We’re pretty excited about more engagement of the microRNA community to improve the database, and we hope you find these new features useful and informative. As usual, comments, abuse, suggestions welcome here or by email.

Posted in community annotation, new features, papers.


6 Responses

  1. Ravi Shankar says

    Hi,

    First of all congrats for finding an innovative way to present your research activities in more interactive manner, which will be certainly useful to gather opinions of the audience to improve the understanding of miRNAs and their classification process. (This should be the approach which I expect from journals also, publishing the work and leave it to the audience judgments, calling more intellectually transparent and convincing judgments instead of surrendering the fate of a good piece of work to a few backdoor academic Godmans!).

    This article has addressed a very relevant question: Whom to call microRNAs? What makes a real miRNAs? Despite of so many years of works, this questions remains a big. I see a clear divide across the community to accept the image of miRNAs as entities which “ideally” form a hairpin structure and “sincerely” produce duplex, following a canonical approach. There are several flaws in accepting predicted hairpin structures which is mainly based on incomplete sequence modeling and derivation of their ab-initio hairpin structures, completely ignoring the long range structural interactions, change in structure of the same sequence if precursor length is changed. Also, if you globally scan for the secondary structure of the genome and transcriptomes, you will find the “ideal” hairpin loop structures almost everywhere. Several studies in the past have already pointed this out that MFE and hairpin structures don’t emerge as a strong discriminator for miRNAs, and thus you see a huge number of miRNA predictions tools which still struggle to put us closer to the concept of real miRNAs. A big missing thing is that most of the miRNA identification approaches have mostly ignored the importance of miRNA biogenesis and this group hardly bothered to take points from those groups which worked on RBP-RNA interactions. Several non canonical forms of miRNAs have been reported, including miRNAs skipping Drosha processing, Dicer processing, or their processing highly dependent upon presence of some RBP. There are good evidences that there are several mature miRNA evidences which don’t obey the duplex theory and emerge without that ideal duplexes. Approaches which seek presence of both strands being present may work mostly with the cases where both the strands are some how active, while mostly one strand is degraded quickly, and thus approaches searching for the ideal duplex may not succeed. 
    A careful analysis shows that some tools which consider the duplex ratio fail terribly to predict even genuine miRNAs. The ratio between the 5p and 3p strand is a highly variable feature and majority of cases have almost negligible strand coming from another strand. Similarly, I am not sure how good it would be to look into the 5′ consistency factor, given the fact that we see 5′ editing in the form of Iso-miRs.

    It would be better to think and rethink and evolve a better approach to call a miRNAs. System and nature does not evolve to fit our scientific theories and intellectual conveniences but we need to fit into them. MiRNAs may not be that obsessed with the concept of things like an ideal hairpin, an ideal duplex, an ideal MFE, all arising from our convenience to have some folding tool with limitations. More so when these fields themselves are prone to imperfections at slightest of level. Why can’t we call all small regulatory RNAs capable to regulate as miRNAs instead? And we can characterize them further on the basis their appearances and proof of existence in different experiments? Why can’t I call a small RNA as a miRNA when I see it repeated coming out of the system from a region including some hairpins, but don’t follow the standard duplex. There could be more participation of cross linking and interactions studies and their data to improve miRNA biology. A reasonable and honest participation of community is required to shape up the concept of miRNAs and take the needful approaches to better their identification process. It is possible that our preference for being traditional and safer has in fact taken us away from the more acceptable truths. MiRNA biology needs introspection, really.

    • sam says

      Thanks for your detailed thoughts.

      There are several flaws in accepting predicted hairpin structures [....] A big missing thing is that most of the miRNA identification approaches have mostly ignored the importance of miRNA biogenesis.

      A predicted hairpin is certainly not enough evidence for a novel miRNA annotation, and you won’t find many people who think that it is. miRBase doesn’t accept hairpin predictions without experimental evidence of the mature miRNA expression. I don’t think the community does ignore biogenesis though. As discussed here, the patterns of deep sequencing reads mapped to a locus provide strong confirmatory evidence for a subset of annotated miRNAs, precisely because those patterns support what is known about biogenesis.

      There are good evidences that there are several mature miRNA evidences which don’t obey the duplex theory and emerge without that ideal duplexes.

      I’m not sure what you’re referring to here — I don’t know of bona fide miRNAs that don’t have a strong duplex between guide and passenger strand. Again, that is defined by the biogenesis mechanism.

      [...] majority of cases have almost negligible strand coming from another strand. Similarly, I am not sure how good it would be to look into the 5′ consistency factor, given the fact that we see 5′ editing in the form of Iso-miRs.

      It is true that one strand is likely to be seen at much lower abundance than the other in most cases. Nevertheless, if we sequence deeply enough, I would argue we should see some evidence for the passenger strand in *all* cases. “Editing” might not be the correct word for what’s going on with 5′ isomiRs, and the available evidence suggests that we expect to see one 5′ end dominate.

      Why can’t we call all small regulatory RNAs capable to regulate as miRNAs instead?

      The term “microRNA” is accepted to mean a specific class of RNAs, processed by Drosha/Dicer or DCL1, and loaded into the RISC to post-transcriptionally silence targets. There are outlier examples, which skip Drosha processing for example, but the class is pretty well-defined. (Although, a good biological definition of a class doesn’t necessarily mean that it is straightforward to tell whether or not a given sequence is a member of the class.) Of course there are other regulatory small RNAs, but it doesn’t make sense to widen the working definition of “microRNA” to make it less meaningful — we can use other terms (“small regulatory RNA” or something more memorable) for the more general classes.

  2. Ravi Shankar says

    Hi Sam,
    Thanks for sharing your views and replying to my inputs. If I go traditionally, I agree with your views on most of the points including standard duplex form for miRNAs or hairpin loop structures ( which after scanning data, I find as a point which requires correction and further investigations). When we look into miRBase, we find several cases of miRNAs which appear to not fully fit to the traditional assumptions or the points you raised above. I just quote a few examples here: pti-miR5480 has extremely long terminal loop which is also variable across the species, osa-miR2093 (overlapping stands; no sufficient reads for 3′ end, not clear if the miRNA itself belongs to the hairpin terminal loop), almost similar situation with miRNA like osa-miR2099, almost non-existent terminal loop(sbi-miR6219), or say good portions falling within the loop itself with additional examples like mmu-miR468, cel-miR5548. There are several such instances. Also to notice there, for several of them the standard 3′ overhang theory does not appear working. Considering the argument which you put here that there needs to be reads from both the ends of miRNA duplexes, as already mention above, we don’t find this happening perfectly and most of the time even in several experiments, we see support for only one strand from the read mapping data. There are several cases in miRBase with read mapping data which suggest that you need to be lucky enough to get into one such experimental condition where both the strands have read mapping. Such approach may lead to rejection of several genuine miRNAs, till one accidently does not fall into some state of experiment where the lady luck shines with the expression for both the strands. Also, the degradation process of the passenger strand is so fast that usually we don’t get them being reported till they are really saved by some binding proteins. 
    Finally, on your last section, I feel that most of the regulatory small RNAs arise through some RNase cleavage, mainly by Dicer, and join AGO to impact the process of gene expression. Meanwhile, several reported small regulatory RNAs, including mirtrons or those dodging the typical Drosha/Dicer processing pathways, are being called miRNAs. So this way the definition of miRNA is really vague and requires evaluation.
    Thanks,
    Ravi

  3. mmt says

    This might be redundant but I will say it just to be sure.

    there are a lot of miRNA libraries:
    http://www.ncbi.nlm.nih.gov/sra/?term=%28miRNA%29+AND+%22Homo+sapiens%22orgn%3A__txid9606
    >> they have called it miRNA seq instead of smRNA seq, the heretics.

    Waiting to be mined, although this might give biases.

    Logically it might be a good idea to also include PCR error rates in the sm/miRNA libraries. This might largely reduce the complexity of miRBase. Example, assuming a PCR error rate of 1/1000:
    if a certain mature sequence has an AAAAAAAAAA sequence and is observed 1m times then the sequence TAAAAAAAAA is likely to be a false positive if it is observed 1k times or less. What do you guys think about this (did not read all the papers)?
    It could be done with a lot of samples. Adapter trimming can be done automatically by analysing oligomer representation.

  4. Aimin says

    Hi, all. New version annotated high confidence subset of miRNAs. Are they stored separately? Where can we get those subset? What are those 38 organisms? Thanks.

    • sam says

      Sorry for the confusion. In miRBase 21, we discovered a bug with the high confidence miRNA calculations at the last minute. Rather than hold up the release, we pushed everything else out, and will be adding the high confidence data files as soon as we can (within a couple of days, hopefully). I’ll announce on the blog when we do.


