
MicroRNA Gene Ontology annotations

You might have noticed some additional information on the mature miRNA pages in the last few weeks. See for example:

The new section “QuickGO function” contains a set of high-quality manual annotations of Gene Ontology terms for mature miRNAs, the vast majority of which come from the work of Rachael Huntley et al. at the UCL Functional Gene Annotation group. The annotation has been a Herculean biocuration task: more than 4000 GO terms assigned to nearly 400 miRNAs, all from expert reading of primary literature. Human miR-21 is the star, with 244 GO terms:

We’re pulling these data from the EBI’s QuickGO database. Their webservices make this straightforward (thanks Tony!). It’s also worth noting that the GO terms are actually assigned to RNAcentral IDs. RNAcentral maintains mappings of IDs between RNA sequence databases, including miRBase. Again, this legwork makes the task of providing these annotations much easier than it would otherwise be.
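For anyone wanting to pull the same annotations themselves, here is a minimal sketch of building a QuickGO annotation query for an RNAcentral ID. It is not the code miRBase runs. The endpoint path and the `geneProductId` and `limit` parameter names are assumptions based on the public QuickGO REST API, and the ID shown is a made-up placeholder, so check both against the QuickGO and RNAcentral documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed QuickGO annotation search endpoint; verify against the QuickGO API docs.
QUICKGO_ANNOTATION_SEARCH = "https://www.ebi.ac.uk/QuickGO/services/annotation/search"


def quickgo_annotation_url(rnacentral_id, limit=25):
    """Build a query URL for GO annotations attached to one RNAcentral ID.

    rnacentral_id is expected to carry its database prefix, for example
    "RNAcentral:URS..._9606" (the exact format is an assumption).
    """
    params = {"geneProductId": rnacentral_id, "limit": limit}
    return f"{QUICKGO_ANNOTATION_SEARCH}?{urlencode(params)}"


# Hypothetical placeholder ID, not a real RNAcentral accession.
print(quickgo_annotation_url("RNAcentral:URS_EXAMPLE_9606"))
```

The sketch only builds the URL, so it runs without network access. Fetching and parsing the JSON response is left out because the response layout is not described here.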

Functional information has been generally lacking in miRBase. These GO data have already made a significant difference to this, and we’re planning more. Look out for functional statements from text mining of the primary literature, coming to a web browser near you soon.

Rachael et al.’s latest paper on GO annotation of miRNAs is in the preprint section at RNA:

Expanding the horizons of microRNA bioinformatics.
Rachael P Huntley, Barbara Kramarz, Tony Sawford, Zara Umrao, Anastasia Z Kalea, Vanessa Acquaah, Maria-Jesus Martin, Manuel Mayr and Ruth C Lovering.
RNA 2018

See also:

Guidelines for the functional annotation of microRNAs using the Gene Ontology.
Huntley RP, Sitnikov D, Orlic-Milacic M, Balakrishnan R, D’Eustachio P, Gillespie ME, Howe D, Kalea AZ, Maegdefessel L, Osumi-Sutherland D, Petri V, Smith JR, Van Auken K, Wood V, Zampetaki A, Mayr M, Lovering RC.
RNA 2016 22:667-676.

The GOA database: Gene Ontology annotation updates for 2015.
Huntley RP, Sawford T, Mutowo-Muellenet P, Shypitsyna A, Bonilla C, Martin MJ, O’Donovan C.
Nucleic Acids Research 2014 43:D1057-D1063.

QuickGO: a web-based tool for Gene Ontology searching.
Binns D, Dimmer E, Huntley R, Barrell D, O’Donovan C, Apweiler R.
Bioinformatics 2009 25:3045-3046.

Posted in new features.

miRBase 22 release

After repeated and unreasonable delay, miRBase 22 is finally released. As you might expect with such a long gap, the number of sequences in the database has jumped significantly, by over a third. The vast majority of the increase comes from new microRNA annotations in species not previously represented in the database. Indeed, there are sequences for 48 new species in this release. Still, we know we are missing microRNA annotations that have been published. We apologise for that, and will be working hard to catch up and get back to more timely data releases. Please let us know if we are missing your data.

Other new things:

  • We’ve changed how we collect and manage the deep sequencing datasets that you can see in the miRBase read views. The number of deep sequencing datasets that we have mapped has jumped in this release, to 831. We have around 1000 more datasets mapped and ready to go, but we’ve hit a technical issue with database size and speed for the website, and we didn’t want to hold up the release any further to fix it. As soon as we’ve fixed that problem, the deep sequencing data views in miRBase will expand dramatically. With that update, we expect the number of microRNA annotations classified as “high confidence” to also jump significantly.
  • We’re developing interfaces to keep track of the changes in miRBase over time. The first view of that is available in miRBase 22 — click the “change log” links on the microRNA entry pages to see.
  • We’re also developing views of functional data, incorporating both literature mining, and the excellent work of Huntley et al. (RNA 2016 22:667-676). The first views of that will appear on the microRNA entry pages shortly.
  • Look out for a programmatic webservice to retrieve sequences, also coming shortly.
  • As always, please let us know if you have comments, questions, suggestions.

    Sam and Ana

Posted in data update, new features, releases.

High confidence miRNA set available for miRBase 21

As mentioned previously, we briefly held off from releasing the set of “high confidence” miRNAs for miRBase 21, because of a last-gasp bug. Those data are now available, tagged with the label “high confidence” on the entry pages, and for download on the FTP site. The total number of miRNAs labelled “high confidence” has increased by 168, to 1996. That increase is partly due to our incorporation of more deep sequencing datasets, and also because we’ve relaxed one criterion:

In miRBase 20, a high confidence sequence must have at least 10 reads that map to each of the two mature sequences (-5p and -3p). In miRBase 21, high confidence sequences must either (a) have at least 10 reads mapping to each arm, as before, *or* (b) have at least 5 reads mapping to each arm *and* at least 100 reads mapping in total. The latter case helps us to catch some of the well-established, highly expressed miRNAs that have very high arm expression bias, that is, a large number of reads mapping to one arm and a small number to the other.
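The read-count rule can be written down in a few lines. This is a sketch of the rule as described in this post, not the miRBase pipeline itself. The function name and arguments are hypothetical, and it covers only the read-count criterion, not the other high confidence requirements. If no total is supplied, it assumes “reads mapping in total” is the sum of the two arm counts.

```python
def passes_read_count_criterion(reads_5p, reads_3p, total_reads=None, release=21):
    """Apply the read-count part of the high confidence rule described above.

    reads_5p / reads_3p: reads mapping to the -5p and -3p mature sequences.
    total_reads: total reads for the entry; assumed to be the sum of the
                 two arm counts when not given.
    release: 20 uses the original rule, 21 and later add the relaxed option.
    """
    if total_reads is None:
        total_reads = reads_5p + reads_3p

    # Option (a): at least 10 reads on each arm (the miRBase 20 rule).
    strict = reads_5p >= 10 and reads_3p >= 10
    if release <= 20:
        return strict

    # Option (b): at least 5 reads on each arm and at least 100 in total.
    relaxed = reads_5p >= 5 and reads_3p >= 5 and total_reads >= 100
    return strict or relaxed


# A highly expressed miRNA with strong arm bias: 500 reads on one arm, 6 on the other.
print(passes_read_count_criterion(500, 6, release=20))
print(passes_read_count_criterion(500, 6, release=21))
```

A sequence with 500 reads on one arm and 6 on the other fails the miRBase 20 rule but passes the relaxed miRBase 21 rule, which is exactly the arm-bias case described above.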

A few sequences labelled as high confidence in miRBase 20 have disappeared in the miRBase 21 set, some because of the aforementioned bug.

Facilities remain in place for you to vote for whether or not you agree with our high confidence assertions — see individual entry pages, and sequencing read views.

Posted in Uncategorized.

miRBase 21 finally arrives

Apologies for the longer-than-usual wait.

miRBase 21 is now available on the website, and all data are available for download on the FTP site. As usual, the release notes describe the major changes. Of particular note this time, the Genome Reference Consortium have released a new human genome assembly, GRCh38. We have therefore remapped the human microRNA dataset to this assembly, which includes the removal of a handful of duplicate entries that now map to a single locus. For example, GRCh37 had 6 loci representing miR-3118, whereas GRCh38 has only 4. In total, there is a small increase in the number of annotated human microRNA loci, to 1881. Elsewhere in the database, the increases have been larger: we have hundreds of new sequences in each of bat, horse, goat, cobra and salmon, amongst others. In total, 4196 new hairpin sequences and 5441 new mature products have been added. The work to clean up dubious and misannotated sequences also goes on, with another 72 entries in total removed from this release.

Unfortunately, at the last moment, we’ve found an issue with the update of the “high confidence” microRNA dataset. Rather than delay the release further, we’ve decided to go ahead without the “high confidence” set for now. That will follow in the next few days, with an announcement here.

As usual, please let us know (using the comments box below, or by email) if you have any questions or comments.

Posted in releases.

miRBase 21 is coming ….

The release of miRBase 21 has taken much longer than we would have liked, but it’s nearly there now. We anticipate making the data available within the next week. Because of the time since the last release, it’s another hefty update, with over 4000 new hairpin sequences, and over 5000 new mature sequences. The new sequences mostly represent organisms that previously had few or no microRNA annotations. More soon ….

Posted in releases.

High confidence microRNAs

I promised a while ago to provide more information about the miRBase website features associated with the new “high confidence” microRNA set, as described in our 2014 NAR paper.

First, some background. Here is a figure showing the growth of the number of sequences deposited in miRBase, and the number of papers in Pubmed that contain the word “microRNA” in title, keywords, abstract (slightly out-of-date, from release 20, June 2013):

Since around 2007, the overwhelming majority of microRNAs deposited in miRBase have been predicted from small RNA deep sequencing experiments. Deep sequencing has become more and more available to more and more labs, and has been applied in more and more contexts. The upshot is that different groups have annotated microRNAs with different criteria and therefore different levels of stringency. miRBase has always been a community resource with a somewhat inclusive policy — acceptance of a manuscript describing novel microRNAs being the primary requirement for deposition of sequences in the database. (We have always pushed back on obvious problems — tRNA fragments, poor hairpin structures, etc.) However, because a single small RNA deep sequencing experiment can predict hundreds of novel microRNAs, there is a real danger that a small number of poorly performed analyses can swamp the bona fide microRNA gene set with dubious annotations. Indeed, some recent reports have claimed exactly this (Wang and Liu 2011; Hansen et al. 2011; Meng et al. 2012; Brown et al. 2013).

We know that, with sufficient read depth, the pattern of reads mapping to a putative microRNA locus can provide powerful evidence for the annotation of that locus as a microRNA. Drosha and Dicer (and DCL1) leave well-known and characteristic clues: 2 nt 3′ overhangs in the mature miR duplex, with more exact processing at the 5′ ends of the duplex sequences. To see these clues, we need enough sequencing depth to be sure that we will see reads representing both sequences in the mature microRNA duplex, remembering that one (the passenger strand or miR* sequence) is likely to be at much lower abundance than the other. Some groups have required this level of evidence to annotate novel loci, and some haven’t.

Since 2010, we’ve been collecting RNA deep sequencing experiments in miRBase, and hopefully you have already seen the views of these read data. The NAR paper describes how we are using the aggregated read data to automatically categorize a subset of microRNAs in miRBase as “high confidence”. The criteria we are using are described in detail in the paper. The exact details might change with time, but essentially, we’re looking for reads mapping to both sequences in the putative mature microRNA duplex, with approximately 2 nt 3′ overhangs, relatively consistent 5′ ends, and well-folded hairpin precursor sequences — no surprises I think.
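To make two of those signals concrete, here is a toy sketch of how 5′ end consistency and 3′ overhangs could be measured from mapped reads and a predicted hairpin structure. It is an illustration under stated assumptions, not the analysis from the paper. The function names are hypothetical, coordinates are 0-based and inclusive, the structure is assumed to be in dot-bracket notation, and no thresholds are applied because the exact cut-offs are not given in this post.

```python
from collections import Counter


def five_prime_consistency(read_starts):
    """Fraction of reads that share the most common 5' start position."""
    if not read_starts:
        return 0.0
    modal_count = Counter(read_starts).most_common(1)[0][1]
    return modal_count / len(read_starts)


def pair_table(dot_bracket):
    """Map each paired position to its partner in a dot-bracket structure."""
    stack, pairs = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i] = j
            pairs[j] = i
    return pairs


def three_prime_overhangs(dot_bracket, five_p, three_p):
    """Return (5p overhang, 3p overhang) in nt for a putative mature duplex.

    five_p and three_p are (start, end) positions on the hairpin.
    Returns None if either 5' end falls on an unpaired base, a case this
    toy version does not try to handle.
    """
    pairs = pair_table(dot_bracket)
    s5, e5 = five_p
    s3, e3 = three_p
    if s5 not in pairs or s3 not in pairs:
        return None
    # Each arm's 3' end is compared with the partner of the other arm's 5' end.
    return (e5 - pairs[s3], e3 - pairs[s5])


# Toy hairpin: 10 base pairs, a 4 nt loop, and 2 unpaired bases at each end.
structure = ".." + "(" * 10 + "...." + ")" * 10 + ".."
print(five_prime_consistency([5, 5, 5, 5, 6]))
print(three_prime_overhangs(structure, (2, 9), (20, 27)))
```

In the toy hairpin, both arms extend 2 nt past the partner of the other arm’s 5′ end, which is the pattern expected from Drosha and Dicer processing. Real hairpins contain bulges and mismatches, so a production version would need to handle unpaired 5′ ends rather than returning `None`.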

Using these criteria, the first version of this analysis annotated 1761 “high confidence” microRNAs, across 38 organisms for which we have collected small RNAseq data in miRBase. We’ve already updated these data once, adding some more C. elegans and mouse datasets, and we now annotate 1828 high confidence sequences. Different species have very different proportions of their microRNA complement annotated as high confidence:

D. melanogaster is the leader, with around 60% of its sequences meeting all the criteria. There are several reasons for this, including that we’ve done a better job of incorporating fly deep seq datasets into miRBase (partly because we do lots of work in my lab on flies, and so we know about these datasets), and that the fly community seems to have been more conservative overall in their annotation of novel microRNAs. At the other end of the scale, less than 20% of rice and human microRNAs currently meet all the requirements for high confidence. That’s perhaps not surprising for rice, where we have only 6 RNAseq datasets incorporated (vs 50 for D. melanogaster). However, we have 81 datasets for human, so what’s the excuse here?

By far the most common reason that a sequence misses out on being called high confidence is insufficient reads (<10) mapping to both arms of the hairpin precursor. Many sequences look something like this:

Now mir-184 is pretty likely to be a real microRNA — it is conserved in other mammals, is seen in Ago pull-down datasets, and has been studied fairly extensively. But in neither mouse nor human does it currently have reads mapping to the passenger strand. The read data is consistent with it being a microRNA, but not, in itself, confirmatory. In human in particular, many groups have annotated low abundance microRNAs based on relatively few reads. Furthermore, these datasets are sometimes from very specific tissues or cell lines. In some cases, we haven’t yet pulled in the raw read data that was used in the original publications to annotate those sequences as microRNAs. Many sequences will be lifted into the high confidence set as we incorporate more RNAseq read data into miRBase. If you know of datasets (from GEO or SRA) that we are missing, then please let us know.

It is important to be clear that we are not claiming that 85% of human microRNAs in miRBase are not real — we simply don’t yet have enough evidence to label them as high confidence. That’s why you’ll see high confidence labels on the relevant entry pages, but you’ll notice there is no corresponding “low confidence” label. The main intention is to provide a subset of miRBase entries for purposes where you really only want to include sequences that you can be positive are real. Of course the read patterns sometimes do provide evidence that a given annotation is unlikely to be a bona fide microRNA. For example, the pattern of reads mapping to the annotated mmu-mir-1940 locus looks like this:

The reads from the 3′ arm of the hairpin do not have the usual expected consistency of the 5′ end, and the 5′ arm and 3′ arm reads do not pair together in the predicted hairpin structure. We have already been removing entries from the database where the pattern of reads is therefore not consistent with annotation as a microRNA, and we will continue to do so.

So, we can use deep sequencing data to address the challenge of rapid growth and variable annotation criteria. But the graph above highlights a second problem — the enormous rate of growth of the microRNA literature. We have previously written about trying to harness the knowledge and power of the community to provide textual information about specific microRNA families, using Wikipedia. That effort is going well, and again, hopefully you’ve seen some of the results (click the “Show Wikipedia entry” button on that page for more). The NAR paper also says something about this Wikipedia annotation, but we can also harness your expertise in making judgements about the confidence or otherwise of microRNA annotations themselves. We have therefore added a very simple interface to each entry page, where you can quickly and easily vote for whether you believe a particular microRNA is real or not — pick your favourite or least favourite microRNA and vote!

Here’s one a few people like so far:

This interface has only been active since the start of January, and we’ve already had more than 1100 votes. In response to these votes, we will in the future be able to manually promote and demote annotations into and out of the “high confidence” microRNA set, and remove from the database completely sequences that have sufficient evidence to say that they are clearly not microRNAs. If you have a little more time than that required to click a yes or no button, you can also leave us a short comment about why you like/dislike a microRNA annotation. (These comments are for miRBase curators, not for public consumption right now.)

We’re pretty excited about more engagement of the microRNA community to improve the database, and we hope you find these new features useful and informative. As usual, comments, abuse, suggestions welcome here or by email.

Posted in community annotation, new features, papers.

Website down time, Feb 4th, 8-10am GMT

Due to some essential network maintenance, the miRBase website is at risk of short periods of down time between 8 and 10am GMT on Tuesday 4th Feb. We apologise for any inconvenience.

Posted in down time.

miRBase paper out in NAR

The 2014 Database Issue of Nucleic Acids Research includes an update paper about miRBase. In particular, we describe how we are using publicly available deep sequencing data to classify a subset of miRBase microRNA entries as “high confidence”. A post with more details about the associated changes to the website is coming shortly …..

Posted in papers.

Bug fixes to release 20 MySQL database dumps

Read no further unless you care about the MySQL database dumps in the database_files directory on the FTP site.

A couple of people (many thanks Jeff and Jakob) found errors in the release 20 MySQL database dumps: a small number of new mature sequences were not linked to their hairpin precursors, and the ends of a smaller number of old mature sequences were off by 1. The table affected was mirna_pre_mature. If you’re using these dumps you will probably want to grab the fixed version from the FTP site (timestamp 17/7/2013). You might notice other files in that directory with the same new timestamp. Feel free to grab those too, but you are much less likely to care about those (the changes are either cosmetic or updated links to other resources). The FASTA format sequence files, the EMBL format data file, and all other data dumps were unaffected by these bugs.

Apologies for any inconvenience.

Posted in bugs, data update, releases.

miRBase 20 released

Phew. After considerably more pain and tears than usual, miRBase 20 is finally available on the website and for download on the FTP site (see also the README file). The gap between releases has also been longer than usual, which means that the increase in data is greater than usual (probably explaining the increase in pain). In all, we have 3355 new hairpin sequences and 5393 new mature microRNAs from around 40 new publications, increasing the totals to 24521 hairpin sequences and 30424 mature sequences. As always, the full list of additions, deletions and name changes is available in the miRNA.diff file on the FTP site, along with all other miRBase data in various file formats. There are minor changes to the structure of the MySQL database underlying the website, and therefore to the database dumps. As we still don’t have sensible documentation for those dumps, you should ask if you care about this.

Ana has also spent a fair bit of time adding datasets to the deep sequencing section of the site: we have now mapped reads from 306 small RNA deep sequencing experiments to miRBase hairpins, increasing the coverage to 37 species. In all, approximately 25% of all mature microRNAs have at least 10 reads mapping to them across all datasets. As we’ve said before, these data can be used for expression analysis, and for judging the validity of microRNA annotations. We’ve been working on a system to use these aggregated data to assess the confidence in a given microRNA annotation, and allow users to filter the data by this confidence measure. We aim to have something to show on that in the next release or two. Feel free to point us in the direction of publicly available datasets that we don’t already capture, preferably in the form of a GEO or SRA accession.

Comments, criticism, suggestions, abuse to the usual address.

Posted in data update, releases.