bacterial genome annotation system


Frequently asked questions. If your question is not answered here, file an issue and we’ll try to answer as soon as possible.


What software do I need for running BG7?

We strive to keep dependencies to a minimum. To run BG7 you just need:

  • Java: a fairly recent x64 JVM; anything from 1.6 upwards is fine. You can get installers of the Oracle/Sun JDK (6u27) for a plethora of platforms from the Oracle JDK download website.

  • BLAST: we recommend blast+ 2.2.25; you can get precompiled binaries from the NCBI BLAST FTP site.

what about hardware requirements?

With the current version, you just need a reasonable amount of RAM: we regularly run it on c1.xlarge EC2 instances, which have 7 GB of RAM. This means you’ll need a 64-bit OS.
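
As a rough aid, the software and hardware requirements above can be checked from inside the JVM before starting a run. The sketch below is not part of BG7: the class name `EnvCheckSketch` and its methods are hypothetical, and the 64-bit test is only a heuristic on the `os.arch` system property.

```java
// Hypothetical helper for illustration only; it is not part of BG7.
final class EnvCheckSketch {

    /** Parses a java.specification.version value such as "1.6", "1.8" or "17". */
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        if (parts.length > 1 && parts[0].equals("1")) {
            return Integer.parseInt(parts[1]); // legacy "1.x" numbering
        }
        return Integer.parseInt(parts[0]);
    }

    /** Heuristic: common 64-bit os.arch values ("amd64", "x86_64", "aarch64") contain "64". */
    static boolean looks64Bit(String osArch) {
        return osArch.contains("64");
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        boolean is64 = looks64Bit(System.getProperty("os.arch"));
        // Maximum heap the JVM will try to use, in megabytes.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024L * 1024L);

        System.out.println("Java major version: " + major + (major >= 6 ? " (ok)" : " (too old)"));
        System.out.println("64-bit JVM (heuristic): " + is64);
        System.out.println("Max heap available to the JVM: " + maxHeapMb + " MB");
    }
}
```

If the reported maximum heap is small, it can be raised with the standard `-Xmx` JVM option when launching Java.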


what is BG7’s license?

BG7 is licensed under AGPLv3: GNU Affero General Public License, version 3.

AGPL-what?

This is GPLv3 plus further copyleft restrictions, notably for software that users interact with over a network. You can read more about this in licensing.

can I include/import/use BG7 in my closed-source application x?

Absolutely not: any application using BG7 code must be AGPLv3-licensed.

can I sell something based on / using BG7?

Yes! But remember that you need to provide your users with the BG7 source code plus the source code of any modifications, software using BG7, etc. See Selling Free Software - GNU Project.

BG7 input/output

what do I need for annotating my genome?

You’ll need:

  • your genome sequence - FASTA file
  • genetic code - plain text file (like this one)
  • reference proteins - FASTA file
  • reference RNAs - FASTA file
  • pipeline execution template - XML file (here’s a template)
  • only for gbk and/or embl output: additional info on the name of the source, type of genome, etc. - XML file

does BG7 output annotation data in format xyz?

BG7 output is available in the following formats:


what programming language is BG7 written in?

pure Java.

where’s the code?

The code is hosted on GitHub, under the BG7 organization; we also do all BG7 development there.


do I need a reference genome?

Not at all! This system is not based on an ORF prediction step that depends heavily on having a close reference genome. Because it is based on protein similarity, you’ll need a set of reference proteins, but these proteins don’t need to be very close ones: good results have been achieved on genomes with no close proteins available, and the quality of the annotations doesn’t seem to depend on whether the reference proteins are close or not.

why reference proteins vs contigs?

It’s important to point out that we use BLAST to detect all the putative coding regions along the contigs. Since we assume that each contig contains multiple coding regions, we need to run BLAST in a way that allows multiple hits (each supporting sequence similarity between a protein and the contig) in the same contig.

A good way to achieve this is to use the contigs as the database and the proteins as queries, and to set BLAST to report only the best hit. This way, each reference protein has either no hit or exactly one hit, to a particular region of a single contig, while each contig in the database can collect multiple hits: one for every reference protein whose best hit falls on it.
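
The effect of this setup can be sketched in a few lines of Java. This is only an illustration on made-up, in-memory hits, not BG7 code: the class `BestHitSketch`, the `Hit` record, the protein and contig names, and the use of the bit score to pick the best hit are all assumptions for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; it is not part of BG7.
final class BestHitSketch {

    /** One BLAST hit: a reference protein (query) matching a region of a contig (database entry). */
    record Hit(String protein, String contig, int start, int end, double bitScore) {}

    /** Keeps only the highest-scoring hit for each reference protein. */
    static Map<String, Hit> bestHitPerProtein(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            best.merge(h.protein(), h,
                    (current, candidate) -> candidate.bitScore() > current.bitScore() ? candidate : current);
        }
        return best;
    }

    /** Groups the retained hits by contig: one contig can collect hits from many proteins. */
    static Map<String, List<Hit>> hitsByContig(Map<String, Hit> bestHits) {
        Map<String, List<Hit>> byContig = new LinkedHashMap<>();
        for (Hit h : bestHits.values()) {
            byContig.computeIfAbsent(h.contig(), k -> new ArrayList<>()).add(h);
        }
        return byContig;
    }

    public static void main(String[] args) {
        // Made-up hits: protA matches two contigs, protB and protC match one each.
        List<Hit> hits = List.of(
                new Hit("protA", "contig1", 100, 700, 250.0),
                new Hit("protA", "contig2", 50, 600, 90.0),
                new Hit("protB", "contig1", 1200, 2100, 310.0),
                new Hit("protC", "contig1", 2500, 3000, 120.0));

        hitsByContig(bestHitPerProtein(hits)).forEach((contig, list) ->
                System.out.println(contig + ": " + list.size() + " putative coding regions"));
    }
}
```

In this toy data set, protA keeps only its stronger hit on contig1, so every protein ends up with a single hit while contig1 accumulates three.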

what do you mean by bacterial?

We mean prokaryotes: bacteria and archaea.

can I annotate say a fungal genome with BG7?

BG7 was not originally designed to deal with eukaryotic genomes: exons, introns and all that.

However, you can get something useful by playing with the 4th argument of the PredictGenes module. This argument sets the maximum difference allowed (400 by default) between the distance separating two adjacent BLAST HSPs in the query and the distance separating them in the hit. Allowing larger differences could make the system tolerant to introns in the hit.
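
To make the role of this threshold concrete, here is a minimal sketch of such a check. It is not the PredictGenes implementation: the class `HspGapSketch`, the `Hsp` record and the method names are hypothetical, and the sketch assumes that query and hit coordinates are expressed in the same unit.

```java
// Hypothetical illustration only; the actual PredictGenes logic may differ.
final class HspGapSketch {

    /** An HSP with its coordinates in the query and in the hit (assumed to share one unit). */
    record Hsp(int queryStart, int queryEnd, int hitStart, int hitEnd) {}

    /** Default maximum difference mentioned in the text. */
    static final int DEFAULT_MAX_DIFFERENCE = 400;

    /** Difference between the distance separating two adjacent HSPs in the hit and in the query. */
    static int distanceDifference(Hsp first, Hsp second) {
        int queryDistance = second.queryStart() - first.queryEnd();
        int hitDistance = second.hitStart() - first.hitEnd();
        return Math.abs(hitDistance - queryDistance);
    }

    /** Two adjacent HSPs are treated as compatible if the difference stays within the threshold. */
    static boolean compatible(Hsp first, Hsp second, int maxDifference) {
        return distanceDifference(first, second) <= maxDifference;
    }

    public static void main(String[] args) {
        // Adjacent in the query (10 apart) but 1010 apart in the hit, as an intron might cause.
        Hsp first = new Hsp(1, 300, 1000, 1299);
        Hsp second = new Hsp(310, 600, 2309, 2599);

        System.out.println("difference: " + distanceDifference(first, second));
        System.out.println("default threshold: " + compatible(first, second, DEFAULT_MAX_DIFFERENCE));
        System.out.println("threshold 1200: " + compatible(first, second, 1200));
    }
}
```

With these made-up coordinates the difference is 1000, so the pair is rejected at the default threshold of 400 and accepted once the threshold is raised to 1200.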

what about annotating a human genome?

We haven’t tried BG7 on a higher eukaryote genome yet; maybe it yields something useful.