Murasaki Project: Backup of Murasaki(No. 10)

What's Murasaki?

Murasaki is anchor alignment software with the following features:

extremely fast (17 CPU hours for a whole-genome Human x Mouse alignment (35 wall minutes on 40 nodes), or 8 mammals in 21 CPU hours (42 wall minutes))
scalable (arbitrarily parallelizable across multiple nodes using MPI)
memory efficient (even a single node with 16GB of RAM can handle over 1Gbp of sequence)
unlimited pattern length
repeat tolerant
intelligent noise reduction

Compatibility

Murasaki targets 32/64-bit Linux and other POSIX-compatible operating systems.

Tested on:

Debian >=4
FreeBSD >=7
MacOS X >=10.4
Ubuntu 9.10
Fedora 12

With some luck, Murasaki sometimes works on Win32 with MinGW, but there are no guarantees for Windows.

License

Murasaki is distributed under the GNU General Public License.

Download

Murasaki download packages are available in the Murasaki download area.
Or, keep up with the latest release using Mercurial:

hg clone http://murasaki.hg.sourceforge.net:8000/hgroot/murasaki/murasaki

Subversion support is deprecated, but technically still exists. There will be no further releases to the Subversion tree, so you are advised to migrate to Mercurial when convenient.

Requirements

Boost, to build and run the core Murasaki algorithm.
CryptoPP (optional, but enabled by default), which provides CPU-specific enhancements.
- If you don't want to use CryptoPP, you can disable it in any of the following ways:
  - commenting out (i.e. putting a # at the beginning of the line) the "WITH_LIBCRYPTOPP ?= YES" line in src/Makefile
  - compiling via a command like "make WITH_LIBCRYPTOPP=NO"
  - setting WITH_LIBCRYPTOPP=NO as an environment variable before running make
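
The three ways of disabling CryptoPP listed above can be sketched in shell as follows. The helper name comment_out_cryptopp is hypothetical (it is not part of Murasaki), and the sketch assumes the line in src/Makefile starts with WITH_LIBCRYPTOPP as quoted above. The second and third options work because make's "?=" only assigns a value when the variable is not already defined on the command line or in the environment.

```shell
# Hypothetical helper: comment out the "WITH_LIBCRYPTOPP ?= YES" line in the
# given Makefile (default: src/Makefile). It writes to a temporary file and
# moves it into place, so it behaves the same with GNU and BSD sed.
comment_out_cryptopp() {
    makefile=${1:-src/Makefile}
    sed 's/^\(WITH_LIBCRYPTOPP[[:space:]]*?=.*\)$/#\1/' "$makefile" > "$makefile.tmp" &&
        mv "$makefile.tmp" "$makefile"
}

# Option 1: edit the Makefile once, then build as usual.
#   comment_out_cryptopp src/Makefile && make
#
# Option 2: override the variable on the make command line for one build.
#   make WITH_LIBCRYPTOPP=NO
#
# Option 3: export the variable so every later make in this shell sees it.
#   export WITH_LIBCRYPTOPP=NO
#   make
```

Option 1 changes the source tree permanently, while options 2 and 3 leave src/Makefile untouched, which is usually more convenient when tracking the Mercurial repository.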

Optional requirements

To use Murasaki in a cluster, you'll need some implementation of MPI. While Murasaki should be implementation agnostic, we've done most of our testing and tuning on OpenMPI. MPICH and MPICH-MX are also tested and known to work.

Murasaki interfaces with a lot of other free software to generate graphs and statistical information. To use all the features of Murasaki, you should also have:

Perl, to run the supporting Perl programs (for filtering, visualizing results, etc).
- BioPerl is required by the annotation reading parts of the perl scripts.
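
A quick way to see which of these optional pieces are present is a small check script. This is only a sketch: the function names are hypothetical, mpirun is assumed to be the launcher name installed by your MPI implementation, and Bio::SeqIO is used as a representative BioPerl module.

```shell
# Report whether a command is available on PATH.
check_command() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Report whether a Perl module can be loaded.
check_perl_module() {
    if perl -M"$1" -e 1 >/dev/null 2>&1; then
        echo "found:   Perl module $1"
    else
        echo "missing: Perl module $1"
        return 1
    fi
}

# Check the optional requirements named in this section.
check_optional_requirements() {
    check_command mpirun            # assumption: launcher name for your MPI
    check_command perl
    check_perl_module Bio::SeqIO    # a core BioPerl module
}
```

Running check_optional_requirements prints one "found" or "missing" line per item. Anything missing only disables the corresponding optional feature.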

Build instructions

In the future, I'm planning on updating the Murasaki build process to be automated using Boost Jam, but for now it's very manual.

If your system is already set up perfectly, once you've downloaded one of the above packages, the following should work:

cd murasaki
make

You may need to tweak src/Makefile to fit your system, in particular CXX and LIBRARIES (some distributions require that you specify boost_regex-st instead of boost_regex). If you're running bash, you can also specify CXX easily on the command line by running "CXX=g++ make".
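
To find out which Boost regex name your distribution uses, something like the following sketch may help. The function name list_boost_regex_names is hypothetical, and the directories you pass to it (for example /usr/lib or /usr/local/lib) are assumptions about where your system installs libraries.

```shell
# Print the Boost regex library names found in the given directories, in the
# form used for linking (for example boost_regex or boost_regex-st).
list_boost_regex_names() {
    for dir in "$@"; do
        for lib in "$dir"/libboost_regex*; do
            [ -e "$lib" ] || continue   # skip when the glob matched nothing
            name=${lib##*/}             # strip the directory
            name=${name#lib}            # strip the "lib" prefix
            name=${name%%.*}            # strip ".so...", ".a", ".dylib"
            printf '%s\n' "$name"
        done
    done | sort -u
}

# Example (directories are assumptions, adjust for your system):
#   list_boost_regex_names /usr/lib /usr/lib64 /usr/local/lib
```

Whatever name it prints is the one to use in the LIBRARIES setting of src/Makefile. For the compiler, "make CXX=g++" is equivalent to "CXX=g++ make" for a Makefile that assigns CXX with "?=", and it also works in shells other than bash.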

Getting started

Most of the documentation for Murasaki currently exists inside the various programs. You can find out what any command does by running it with the "--help" option. For example "./murasaki --help" lists how to run Murasaki. It's long, so you might want to use "./murasaki --help | less".

An example Murasaki run might go like this:

./murasaki seq/MtC.gbk seq/Mle.gbk -p[28:36] -H2 -b24 --name myalignment
Runs the core alignment program. "seq/MtC.gbk seq/Mle.gbk" specifies the input sequences. "-p[28:36]" uses a random pattern string of length 36 consisting of 28 1's and 8 0's. "-H2" includes anchor component information (used by filter.pl to calculate tf-idf scores). "-b24" uses only 24-bit hash keys (as opposed to the default 26), which is desirable (possibly necessary) on machines with limited RAM. "--name" sets the output file prefix.

./simplegraph.pl output/myalignment.anchors
Generates graphs of the anchors produced (in this case, one). For multiple alignments, this outputs all pairings of the component sequences.

./filter.pl --kogfile COG output/myalignment.anchors --rocr --dumpstats tfidf
"--kogfile" specifies where to find reference COG data for calculating sensitivity and specificity. For this case (comparing MtC and Mle), the data can be downloaded from NCBI's COGs website. "--rocr" generates ROC plots using R and ROCR. "--dumpstats tfidf" dumps the generated tf-idf scores to a separate file which can be read by GMV.
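
To make the "-p[28:36]" and "-b24" options more concrete, here is a small illustration. It is not Murasaki's own code: random_pattern and hash_key_count are hypothetical helper names, and Murasaki's actual pattern generator may place the 1's differently. The sketch only shows what "28 1's in a string of length 36" and "24-bit hash keys" mean.

```shell
# Print a random string of the given length containing exactly `weight` 1's
# (the rest are 0's), using a Fisher-Yates shuffle in awk.
# Usage: random_pattern WEIGHT LENGTH
random_pattern() {
    awk -v w="$1" -v n="$2" 'BEGIN {
        srand()
        for (i = 1; i <= n; i++) p[i] = (i <= w) ? 1 : 0
        for (i = n; i > 1; i--) {          # shuffle positions n down to 2
            j = int(rand() * i) + 1        # random index in 1..i
            t = p[i]; p[i] = p[j]; p[j] = t
        }
        s = ""
        for (i = 1; i <= n; i++) s = s p[i]
        print s
    }'
}

# Print the number of distinct hash keys available with the given key width.
# Usage: hash_key_count BITS
hash_key_count() {
    echo $((1 << $1))
}
```

For example, "random_pattern 28 36" prints a 36-character string with 28 1's and 8 0's, "hash_key_count 24" prints 16777216, and "hash_key_count 26" prints 67108864. With -b24 there are four times fewer possible hash keys than with the default of 26 bits, which is consistent with the lower memory use described above.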

Obviously this is just a sample run. You're strongly encouraged to read the documentation for each command. Murasaki includes a great deal of functionality without the need to write any custom scripts.
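
The sample run above can be collected into a small script. This is a sketch under the same assumptions as the example (input files in seq/, results in output/); the function names run and sample_run are hypothetical. The pattern argument is quoted because the square brackets in -p[28:36] are glob characters: bash passes them through unchanged only when no file name happens to match, and some shells (zsh, for instance) refuse the unquoted form.

```shell
# Print a command, then execute it unless DRY_RUN=1 is set.
run() {
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

# Run the alignment, graph the anchors, then filter and dump tf-idf scores.
# Usage: sample_run [NAME]   (NAME defaults to "myalignment")
sample_run() {
    name=${1:-myalignment}
    run ./murasaki seq/MtC.gbk seq/Mle.gbk '-p[28:36]' -H2 -b24 --name "$name" || return 1
    run ./simplegraph.pl "output/$name.anchors" || return 1
    run ./filter.pl --kogfile COG "output/$name.anchors" --rocr --dumpstats tfidf
}
```

Calling "DRY_RUN=1 sample_run myalignment" prints the three commands without running them, which is a cheap way to check the arguments before starting a long alignment.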

Sample alignments

As an example of the huge alignments Murasaki is capable of, you can download the complete set of our whole genome mammalian alignments here. Be aware, however, that these alignments can be huge (for example, murasaki-mammals.tar.gz contains the Human-Mouse-Rat, Human-Chimp-Rhesus, and Human-Mouse alignments, and is a 340MB download which decompresses into about 1GB of files), and you may have to edit the .seq files to point to the correct data files (and download them from Ensembl or the UCSC Genome Browser).

Documentation

Documentation is still a work in progress. For now please email questions to the author. I'll start building a FAQ.