Murasaki Project: Backup diff of FAQ vs current (No. 3)

List of Backups
- 1 (2009-04-16 (Thu) 15:06:10)
- 2 (2009-04-16 (Thu) 15:06:31)
- 3 (2009-08-28 (Fri) 10:05:42)


#contents

* I get an error like "Encountered exception: Murasaki: Error creating System V IPC shared memory segment (size: 36.93 mb) for dna/human/chrX.fa.gz: Invalid argument" [#zd78b873]

On many Linux systems the kernel's default limit on the size of a single System V shared memory segment (kernel.shmmax) is only 32 MB. If you use System V IPC shared memory with Murasaki and large sequences, you will quickly run into this limit. You can raise it (to 6 GB, say) by running "sysctl -w kernel.shmmax=6442450944" as root; to keep the setting across reboots, add "kernel.shmmax=6442450944" to /etc/sysctl.conf.
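As a rough illustration of the arithmetic above, here is a minimal Python sketch that converts a limit in GB to the byte value sysctl expects and checks whether a segment of a given size fits under the current limit. The helper names (gib_to_bytes, read_shmmax, segment_fits) are made up for this example and are not part of Murasaki. It also assumes the "mb" in the error message means 2^20 bytes.

```python
from pathlib import Path

# Linux exposes the current limit here; other systems may not have this file.
SHMMAX_PATH = Path("/proc/sys/kernel/shmmax")


def gib_to_bytes(gib):
    """Convert a size in GB (taken as 2**30 bytes) to bytes."""
    return gib * 1024 ** 3


def read_shmmax(path=SHMMAX_PATH):
    """Return the current kernel.shmmax in bytes, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def segment_fits(segment_bytes, shmmax_bytes):
    """True if a single shared memory segment of this size is within the limit."""
    return segment_bytes <= shmmax_bytes


if __name__ == "__main__":
    # Size from the error message in the question (assumption: mb = 2**20 bytes).
    wanted = int(36.93 * 1024 ** 2)
    limit = read_shmmax()
    if limit is None:
        print("could not read kernel.shmmax on this system")
    elif segment_fits(wanted, limit):
        print("segment fits under the current limit of", limit, "bytes")
    else:
        print("too big; try: sysctl -w kernel.shmmax=%d" % gib_to_bytes(6))
```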

* What's the -p (--pattern) argument? [#t3db91c0]

The -p parameter specifies the pattern used when creating seeds. There are a number of posters for Murasaki, but the journal paper is still in preparation, so it's rather hard to point to a full explanation. Spaced seeds are now a common feature of homology search programs, though, and if you want more information on them you should check out the [[PatternHunter paper>http://www.bioinformaticssolutions.com/functions_db_download.php?id=159]], which introduced the idea. Basically, a ''pattern'' is a sequence of 1s and 0s that says which bases in a seed must match (1) and which may mismatch (0). For example, under the pattern 101, "ATA" matches "AAA" but not "AAT". The -p argument can take a specific pattern like -p101, but we find that random patterns are generally acceptable, so we usually ask Murasaki for a random pattern in the form -p[weight:length] (the [ ] characters can be omitted), where "weight" is the number of 1s in the pattern and "length" is its total length. The longer the pattern and the more 1s in it, the more specific but less sensitive it becomes.
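To make the pattern idea concrete, here is a small Python sketch of spaced-seed matching and of drawing a random pattern with a given weight and length. It is only an illustration: the function names are invented for this example, and how Murasaki actually chooses its random patterns is not described here, so random_pattern simply places the 1s uniformly at random.

```python
import random


def seeds_match(pattern, a, b):
    """True if a and b agree at every position where the pattern has a '1'."""
    if not (len(pattern) == len(a) == len(b)):
        raise ValueError("pattern and both seeds must have the same length")
    return all(x == y for p, x, y in zip(pattern, a, b) if p == "1")


def random_pattern(weight, length, rng=None):
    """Build a pattern of the given length containing exactly `weight` 1s.

    Assumption: positions of the 1s are chosen uniformly at random. Murasaki
    may apply extra rules that this sketch does not model.
    """
    if not 0 < weight <= length:
        raise ValueError("need 0 < weight <= length")
    rng = rng or random.Random()
    ones = set(rng.sample(range(length), weight))
    return "".join("1" if i in ones else "0" for i in range(length))


if __name__ == "__main__":
    print(seeds_match("101", "ATA", "AAA"))  # middle base is ignored
    print(seeds_match("101", "ATA", "AAT"))  # last base must match
    print(random_pattern(3, 5))
```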

* I'm comparing some complex genomes and it gets to "Extracting anchors from hash-space" and shows the % like 70 or 80 then it won't move. What's up with that? [#b96f15e0]

Without knowing the input sequences or the other options you're supplying to Murasaki, it's hard to guess, but I suspect that you're running into a lot of repeats, which can cause Murasaki to take exponential time (depending on the number of input sequences). You can use "--mergefilter X", where X is the maximum number of anchors to generate from a given seed. Any seed that would generate more than X anchors is classed as a "repeat" and is stored to <name>.repeats. A value of around 100 is usually safe here.
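The following Python sketch shows why repeats blow up with the number of input sequences and what a cap like --mergefilter does about it. It rests on a simplifying assumption that is not confirmed above: that a seed yields one candidate anchor for every combination of its occurrences across the input sequences. The function names and data layout are hypothetical and are not Murasaki's internals.

```python
from math import prod


def anchors_for_seed(hits_per_sequence):
    """Candidate anchors for one seed, given its hit count in each sequence.

    Assumption: one anchor per combination of hits, so the count is a product.
    """
    return prod(hits_per_sequence)


def split_by_mergefilter(seed_hits, max_anchors):
    """Separate seeds into those kept and those classed as repeats.

    seed_hits maps a seed to a list of hit counts, one per input sequence.
    """
    kept, repeats = {}, {}
    for seed, hits in seed_hits.items():
        if anchors_for_seed(hits) > max_anchors:
            repeats[seed] = hits
        else:
            kept[seed] = hits
    return kept, repeats


if __name__ == "__main__":
    # A seed seen 10 times in each of 3 sequences gives 10**3 combinations.
    example = {"seedA": [2, 3, 4], "seedB": [10, 10, 10]}
    kept, repeats = split_by_mergefilter(example, max_anchors=100)
    print("kept:", kept)
    print("repeats:", repeats)
```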

An alternate hypothesis is that your pattern is too short, so you're generating apparent repeats that (with a little more context) are not in fact repeats. Using a longer pattern might also help.
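For a rough feel of how pattern weight affects chance matches, here is a back-of-the-envelope Python estimate. It assumes the four bases are uniformly distributed and independent, which real genomes are not, so treat the numbers as an order-of-magnitude guide only. The function name is made up for this example.

```python
def expected_chance_hits(weight, len_a, len_b):
    """Expected number of purely random seed matches between two sequences.

    Assumption: uniform, independent bases, so two positions agree on all
    `weight` required bases with probability 1 / 4**weight.
    """
    return len_a * len_b / 4 ** weight


if __name__ == "__main__":
    # Compare a few weights for two sequences of one million bases each.
    for weight in (8, 12, 16, 20):
        print(weight, expected_chance_hits(weight, 10 ** 6, 10 ** 6))
```

Each extra 1 in the pattern cuts the expected number of chance matches by a factor of four under this model, which is why a heavier pattern tends to produce fewer spurious repeats.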