一个用来处理BAM (http://samtools.sourceforge.net) 格式的高通量测序数据的 (Java) 工具箱。
python src/scripts/explain_sam_flags.py [flag]
A: Picard programs that sort their output (e.g. SortSam, MergeSamFiles if the inputs are not all sorted in the same order as the output) will run faster when given more RAM, and told to store more alignment records in memory before writing to a temporary file. You can give the program more memory by increasing the size of the Java heap with the -Xmx argument. If your operating system imposes a hard memory limit on a process, a rule of thumb is to set the -Xmx value no higher than 2GB less than the hard memory limit. Even if your system does not set a hard memory limit, if you set the -Xmx value much greater than the available RAM on the computer, excessive swapping will hurt performance. The Picard MAX_RECORDS_IN_RAM option controls the number of records to store in RAM before writing to a temporary file when producing sorted output. The optimal setting for this option depends on a number of factors, including:
A rule of thumb for reads of ~100bp is to set MAX_RECORDS_IN_RAM to be 250,000 reads per each GB given to the -Xmx parameter for SortSam. Thanks to Keiran Raine for performing the experiments to arrive at these numbers.
A: Picard validation errors may be turned into warnings by passing the command line argument VALIDATION_STRINGENCY=LENIENT. Picard validation messages may be suppressed completely with VALIDATION_STRINGENCY=SILENT. Another option is to use CleanSam to soft-clip these reads so they don't map off the end of the reference.
A: Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15.
If your reads have been divided into separate BAMs by chromosome, inter-chromosomal pairs will not be identified, but MarkDuplicates will not fail due to inability to find the mate pair for a read.
A: Not necessarily. MarkDuplicates is very memory intensive. This is required in order to detect interchromosomal duplication. At Broad we run MarkDuplicates with 2GB Java heap (java -Xmx2g) and 10GB hard memory limit. Some example times:
Increasing the amount of RAM you give MarkDuplicates (the -Xmx value) will improve the speed somewhat. Note that if you have a hard memory limit, this will ned to be increased also. Marking duplicates for different libraries independently, and then merging the marked files, rather than merging followed by marking duplicates, will be faster.
A: samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates.
A: It estimates the return on investment for sequencing a library to a higher coverage than the observed coverage. The first column is the coverage multiple, and the second column is the multiple of additional actual coverage for the given coverage multiple. The first row (1x, i.e. the actual amount of sequencing done) should have ROI of approximately 1. The next row estimates the ROI for twice as much sequencing of the library. As one increases the amount of sequencing for a library, the ROI for additional sequencing diminishes because more and more of the reads are duplicates. The algorithm is described here. The code can be found here.
A: Many aligners produce output that is not up to Picard standards. E.g. an aligner might not output unmapped reads, might drop some important tags, might not set mate information correctly, might produce multiple primary alignments, or other problems. To address this problem, MergeBamAlignment is designed to merge an aligned BAM produced directly by an aligner with an unmapped BAM. The input aligned BAM will not necessarily meet Picard validation standards but the output of MergeBamAlignment should work properly with other Picard programs. If you do not have an unmapped BAM because the input to your aligner is in FASTQ format, you can create one with FastqToSam.
A: See this thread for a discussion of the algorithm.
A: This can be caused by the GC method of Java when used on 64 bit Java. By default the JVM switches to 'server' settings when on 64 bit, this automatically implements parallel GC and will use as many cores as it can get it's hands on. The approach we decided on to get round this was to define the number of threads we would allow Java for GC.
-XX:ParallelGCThreads=<number of threads>
An alternative approach is to turn off Parallel Gc (boolean option so note the '-' to indicate it is turned off):
-XX:+UseSerialGC
. We found this to be sub-optimal as the process has to stop completely when GC occurs and takes much longer as (from what I can tell) a full GC sweep is the only type performed which in many cases is not required (parallel GC employs ~7 different types of GC). See here for further details of the tuneable parameters.
A: samtools sort does not change the SAM file header to indicate that the file is sorted. Some Picard programs (CollectAlignmentSummaryMetrics, MarkDuplicates, MergeSamFiles) have an ASSUME_SORTED option. If you put ASSUME_SORTED=true on the command line, the program will assume that the input SAM or BAM is sorted appropriately. Alternately, you can use Picard's SortSam instead of samtools sort. Finally, you can capture the text header using Picard's ViewSam, edit the text header in a text editor, and then create a new BAM with Picard's ReplaceSamHeader. The downside of this last alternative is that currently replacing the header requires writing a copy of the input BAM, rather than editing in place.
A: Most Picard programs can do this. To read from stdin, specify /dev/stdin as the input file. To write to stdout, specify /dev/stdout as the output file, and also add the option QUIET=true to suppress other messages that otherwise would be written to stdout. Not that Picard determines whether to write in SAM or BAM format by examining the file extension of the output file. Since /dev/stdout ends in neither .sam nor .bam, Picard defaults to writing in BAM format. Some Picard programs, e.g. MarkDuplicates, cannot read its input from stdin, because it makes multiple passes over the input file. When writing a BAM to stdout so that it can be read by another program, passing the argument COMPRESSION_LEVEL=0 to the program writing the BAM to stdout can reduce unnecessary computation.
A: BWA can produce SAM records that are marked as unmapped but have non-zero MAPQ and/or non-"*" CIGAR. Typically this is because BWA found an alignment for the read that hangs off the end of the reference sequence. Picard considers such input to be invalid. In general, this error can be suppressed in Picard programs by passing VALIDATION_STRINGENCY=LENIENT or VALIDATION_STRINGENCY=SILENT. For ValidateSamFile, you can pass the arguments IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR.
A: This warning message can be ignored. R is trying to write error bars with length 0, because there are no reads with a particular GC content.
A: Currently there is no Picard paper. You can cite Picard by referring to the website, http://broadinstitute.github.io/picard.
A: Description of this format can be found in the IntervalList javadoc
A: If you want to invoke a Picard program from within your java program, but not have it exit, do this:
int returnValue = new PicardProgram().instanceMain(argv);Make sure to check the return value from this call. If it returns non-zero, an error has occurred.
A: There are several things you can do: 1) On Unix systems, the allowed number of open files can be increased. ulimit -n will show you how many open files a process is allowed to have. You can ask your system administrator to increase that number. 2) Many Picard programs use temporary files to sort. You can increase the value of the MAX_RECORDS_IN_RAM command-line parameter in order to tell Picard to store more records in fewer files in order to reduce the number of open files. This will, however, increase memory consumption. 3) MarkDuplicates has the command-line parameter MAX_FILE_HANDLES_FOR_READ_ENDS_MAP. By reducing this number, you reduce the number of concurrently open files (trading off execution speed).
A: Snappy is a compression library that some Picard programs can optionally. See [Using_Snappy_in_Picard] for details.
A: First, this file is not required. If not provided, CollectRnaSeqMetrics will not be able to identify bases as ribosomal, so they will fall into one of the other categories in the metrics output. There is no public distribution of ribosomal RNA interval lists. However, you may create your own. You will need to obtain ribosomal sequence for the organism in question, and then determine where reads containing these sequences will align to the genome. You will then need to put the intervals to which the ribosomal sequences align into an interval list (format describe here. We do this by taking a FASTA containing ribosomal sequences, creating pseudo-reads the are close to the length of real reads, aligning these reads using TopHat, and then creating an interval list using the alignments produced by TopHat. It may be useful to use IntervalListTools during this process.
A: This may be due to a long-standing JVM bug. Try increasing the value of the JVM option -XX:MaxDirectMemorySize. Although this is documented as having a default value of unbounded, others have reported that the default is 64M. Try, e.g. -XX:MaxDirectMemorySize=4G .
A: The denominator is 1. E.g. if PERCENT_DUPLICATION is 0.5, it means that half the reads are marked as duplicates.
A There are several areas in which Picard diverges from the SAM spec. See [Differences_between_Picard_and_SAM] for details.
A libIntelDeflater.so is a library that speeds compression on some Intel processors. See [IntelDeflater] for details.
A The BAM index format imposes a limit of 512MB for a single reference sequence. If this limit is exceeded, various errors may occur depending on what steps have been taken, including the following:
bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned Value too large to be written as ushort. java.lang.ArrayIndexOutOfBoundsException
If you want to create a BAM index, you will need to ensure that each reference sequence is less than 512MB, perhaps by splitting long reference sequences.