Frequently asked questions. If your question is not answered here, file an issue and we’ll try to answer as soon as possible.
table of contents
- BG7 input/output
We strive to keep dependencies to a minimum. To run BG7 you just need:
- Java: a fairly recent x64 JVM; anything above 1.6 is OK. You can get installers of the Oracle/Sun JDK 6u27 for a plethora of platforms from the Oracle JDK download website.
- BLAST: we recommend blast+ 2.2.25; you can get precompiled binaries from the NCBI BLAST FTP site.
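As a quick way to check an installed JVM against the Java requirement, here is a minimal Python sketch. It is not part of BG7; the function names are hypothetical. It assumes that 1.6 itself counts as acceptable (the recommended installer is JDK 6u27) and that the version string follows either the legacy `1.x` scheme or the newer scheme where the first number is the major version.

```python
def java_major_version(version: str) -> int:
    """Return the major Java version from a version string.

    Legacy strings such as "1.6.0_27" mean Java 6; newer strings such
    as "11.0.2" or "17" carry the major version in the first component.
    """
    parts = version.strip().split(".")
    if parts[0] == "1" and len(parts) > 1:
        return int(parts[1])
    # Keep only the leading digits, dropping suffixes such as "-ea".
    digits = ""
    for ch in parts[0]:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits)


def meets_requirement(version: str, minimum_major: int = 6) -> bool:
    """Assumption: Java 1.6 (major version 6) and newer are acceptable."""
    return java_major_version(version) >= minimum_major
```

You would pass it the version string reported by `java -version`, for example `meets_requirement("1.6.0_27")`.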
With the current version, you just need a reasonable amount of RAM: we run it regularly on c1.xlarge EC2 instances, which have 7 GB of RAM. Obviously, this means you’ll need a
BG7 is licensed under AGPLv3: GNU Affero General Public License, version 3.
GPLv3 + further copyleft restrictions. You can read more about this in licensing.
Absolutely not; any application using BG7 code must be AGPLv3-licensed, and
Yes! But remember that you need to provide your users with the BG7 source code plus the source code of any modifications, software using BG7, etc. See Selling Free Software - GNU Project.
- your genome sequence - FASTA
- genetic code - plain text: a text file like this one
- reference proteins - FASTA file
- reference RNAs - FASTA file
- pipeline execution template - XML: here’s a template
- only for gbk and/or embl output: additional info on name of the source, type of genome, etc - XML file
BG7 output is available in the following formats:
The code is hosted on GitHub, under the BG7 organization; we also do all BG7 development there.
Not at all! This system is not based on an ORF prediction step that is highly dependent on having a close reference genome. Because it is based on protein similarity, you’ll need a set of reference proteins, but these proteins don’t need to be very close ones: good results have been achieved in genomes with no close proteins available, and the quality of the annotations doesn’t seem to depend on whether the reference proteins are close or not.
It’s important to point out that we use BLAST to detect all the putative coding regions along the contigs. Since we assume that each contig contains multiple coding regions, we need to run BLAST in a way that allows multiple BLAST hits (each supporting sequence similarity between a protein and a contig) in the same contig.
A good way to achieve this is to use the contigs as the database and the proteins as queries, and to set BLAST to report only the best hit. This way, each reference protein will have either no BLAST hit or one hit to a particular region in a particular contig (and only one contig), while each contig in the database can have multiple hits: one for every reference protein whose best hit falls on it.
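The effect of this setup can be illustrated with a small Python sketch. It is not BG7 code: the tuple layout `(protein_id, contig_id, start, end, bitscore)`, the function names, and the choice of the highest bit score as the definition of "best hit" are all assumptions made for the example.

```python
from collections import defaultdict


def best_hit_per_protein(hits):
    """Keep one hit per protein: the one with the highest bit score.

    `hits` is an iterable of (protein_id, contig_id, start, end, bitscore)
    tuples; this layout is hypothetical.
    """
    best = {}
    for hit in hits:
        protein_id, bitscore = hit[0], hit[4]
        if protein_id not in best or bitscore > best[protein_id][4]:
            best[protein_id] = hit
    return best


def hits_by_contig(best):
    """Group the per-protein best hits by the contig they fall on."""
    grouped = defaultdict(list)
    for protein_id, contig_id, start, end, bitscore in best.values():
        grouped[contig_id].append((protein_id, start, end, bitscore))
    return dict(grouped)


# Made-up example data: protA hits two contigs, only its best hit survives.
example_hits = [
    ("protA", "contig1", 100, 400, 250.0),
    ("protA", "contig2", 50, 340, 120.0),
    ("protB", "contig1", 900, 1500, 310.0),
    ("protC", "contig2", 10, 200, 95.0),
]
best = best_hit_per_protein(example_hits)
grouped = hits_by_contig(best)
```

Each protein ends up on at most one contig, while a contig such as `contig1` collects hits from several proteins, which is the property described above.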
We mean prokaryotes: bacteria and archaea.
BG7 was not initially designed to deal with eukaryotic genomes: exons, introns and all that.
However, you can get something useful by playing with the 4th argument of the PredictGenes module. This argument sets the maximum difference (400 by default) allowed between the distance separating two adjacent BLAST HSPs in the query and the distance separating them in the hit. Allowing larger differences could make the system tolerant to introns in the hit.
We haven’t tried BG7 on a higher eukaryote genome yet; maybe it yields something useful.
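To make the role of that maximum-difference threshold more concrete, here is a rough Python sketch of the kind of check described above. It is not the PredictGenes implementation: the `Hsp` class, the function names, the use of `<=` for the comparison, and the assumption that query and hit coordinates are expressed in the same units are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """A BLAST HSP with coordinates on the query and on the hit (assumed layout)."""
    query_start: int
    query_end: int
    hit_start: int
    hit_end: int


def distance_difference(first: Hsp, second: Hsp) -> int:
    """Difference between the gap separating two adjacent HSPs in the
    query and the gap separating them in the hit."""
    query_gap = second.query_start - first.query_end
    hit_gap = second.hit_start - first.hit_end
    return abs(hit_gap - query_gap)


def can_join(first: Hsp, second: Hsp, max_difference: int = 400) -> bool:
    """Assumption: HSPs are compatible when the difference does not
    exceed the threshold (400 mirrors the default mentioned in the text)."""
    return distance_difference(first, second) <= max_difference


# Made-up HSPs: the gap is 30 in the query but 600 in the hit,
# as could happen if an intron sits between them in the hit.
a = Hsp(query_start=0, query_end=300, hit_start=1000, hit_end=1300)
b = Hsp(query_start=330, query_end=600, hit_start=1900, hit_end=2170)
```

With these values the difference is 570, so the pair is rejected at the default threshold and accepted once the threshold is raised, for instance `can_join(a, b, max_difference=1000)`.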