Picard Chinese User Manual

A set of (Java) tools for working with high-throughput sequencing data in the BAM (http://samtools.sourceforge.net) format.


Picard Metrics Definitions

Each metric class is listed below, followed by a description of its fields.

AlignmentSummaryMetrics: High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".
BaseDistributionByCycleMetrics:
CollectHiSeqXPfFailMetrics.PFFailDetailedMetric: A metric class for describing PF-failing reads from an Illumina HiSeqX lane.
CollectHiSeqXPfFailMetrics.PFFailSummaryMetric: Metrics produced by the CollectHiSeqXPfFailMetrics program.
CollectOxoGMetrics.CpcgMetrics: Metrics class for outputs.
CollectQualityYieldMetrics.QualityYieldMetrics: A set of metrics used to describe the general quality of a BAM file
CollectRawWgsMetrics.RawWgsMetrics:
CollectVariantCallingMetrics.VariantCallingDetailMetrics: A collection of metrics relating to SNPs and indels within a variant-calling file (VCF) for a given sample.
CollectVariantCallingMetrics.VariantCallingSummaryMetrics: A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).
CollectWgsMetrics.WgsMetrics: Metrics for evaluating the performance of whole genome sequencing experiments.
CollectWgsMetricsFromQuerySorted.QuerySortedSeqMetrics: Metrics for evaluating the performance of whole genome sequencing experiments.
CollectWgsMetricsFromSampledSites.SampledWgsMetrics:
DuplicationMetrics: Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.
ExtractIlluminaBarcodes.BarcodeMetric: Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.
GcBiasDetailMetrics: Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
GcBiasMetrics:
GcBiasSummaryMetrics: High level metrics that capture how biased the coverage in a certain lane is.
GenotypeConcordanceContingencyMetrics: Class that holds metrics about the Genotype Concordance contingency tables.
GenotypeConcordanceDetailMetrics: Class that holds detail metrics about Genotype Concordance
GenotypeConcordanceSummaryMetrics: Class that holds summary metrics about Genotype Concordance
HsMetrics: The set of metrics captured that are specific to a hybrid selection analysis.
IlluminaBasecallingMetrics: Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis.
IlluminaLaneMetrics: Embodies characteristics that describe a lane.
IlluminaPhasingMetrics: Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis.
InsertSizeMetrics: Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics".
JumpingLibraryMetrics: High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".
MultilevelMetrics:
RnaSeqMetrics: Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".
RrbsCpgDetailMetrics: Holds information about CpG sites encountered for RRBS processing QC
RrbsSummaryMetrics: Holds summary statistics from RRBS processing QC
SamFileValidator.ValidationMetrics:
SequencingArtifactMetrics.BaitBiasDetailMetrics: Bait bias artifacts broken down by context.
SequencingArtifactMetrics.BaitBiasSummaryMetrics: Summary analysis of a single bait bias artifact, also known as a reference bias artifact.
SequencingArtifactMetrics.PreAdapterDetailMetrics: Pre-adapter artifacts broken down by context.
SequencingArtifactMetrics.PreAdapterSummaryMetrics: Summary analysis of a single pre-adapter artifact.
TargetedPcrMetrics: Metrics class for targeted pcr runs such as TSCA runs

AlignmentSummaryMetrics

High level metrics about the alignment of reads within a SAM file, produced by the CollectAlignmentSummaryMetrics program and usually stored in a file with the extension ".alignment_summary_metrics".

Field	Description
CATEGORY	One of either UNPAIRED (for a fragment run), FIRST_OF_PAIR when metrics are for only the first read in a paired run, SECOND_OF_PAIR when the metrics are for only the second read in a paired run or PAIR when the metrics are aggregated for both first and second reads in a pair.
TOTAL_READS	The total number of reads including all PF and non-PF reads. When CATEGORY equals PAIR this value will be 2x the number of clusters.
PF_READS	The number of PF reads where PF is defined as passing Illumina's filter.
PCT_PF_READS	The percentage of reads that are PF (PF_READS / TOTAL_READS)
PF_NOISE_READS	The number of PF reads that are marked as noise reads. A noise read is one which is composed entirely of A bases and/or N bases. These reads are marked as they are usually artifactual and are of no use in downstream analysis.
PF_READS_ALIGNED	The number of PF reads that were aligned to the reference sequence. This includes reads that aligned with low quality (i.e. their alignments are ambiguous).
PCT_PF_READS_ALIGNED	The percentage of PF reads that aligned to the reference sequence. PF_READS_ALIGNED / PF_READS
PF_ALIGNED_BASES	The total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence.
PF_HQ_ALIGNED_READS	The number of PF reads that were aligned to the reference sequence with a mapping quality of Q20 or higher signifying that the aligner estimates a 1/100 (or smaller) chance that the alignment is wrong.
PF_HQ_ALIGNED_BASES	The number of bases aligned to the reference sequence in reads that were mapped at high quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when either mixed read lengths are present or many reads are aligned with gaps.
PF_HQ_ALIGNED_Q20_BASES	The subset of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.
PF_HQ_MEDIAN_MISMATCHES	The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED_READS).
PF_MISMATCH_RATE	The rate of bases mismatching the reference for all bases aligned to the reference sequence.
PF_HQ_ERROR_RATE	The percentage of bases that mismatch the reference in PF HQ aligned reads.
PF_INDEL_RATE	The number of insertion and deletion events per 100 aligned bases. Uses the number of events as the numerator, not the number of inserted or deleted bases.
MEAN_READ_LENGTH	The mean read length of the set of reads examined. When looking at the data for a single lane with equal length reads this number is just the read length. When looking at data for merged lanes with differing read lengths this is the mean read length of all reads.
READS_ALIGNED_IN_PAIRS	The number of aligned reads whose mate pair was also aligned to the reference.
PCT_READS_ALIGNED_IN_PAIRS	The percentage of reads whose mate pair was also aligned to the reference. READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED
BAD_CYCLES	The number of instrument cycles in which 80% or more of base calls were no-calls.
STRAND_BALANCE	The number of PF reads aligned to the positive strand of the genome divided by the number of PF reads aligned to the genome.
PCT_CHIMERAS	The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes.
PCT_ADAPTER	The percentage of PF reads that are unaligned and match to a known adapter sequence right from the start of the read.
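
The ratio fields above follow directly from the count fields, and the Q20 threshold corresponds to a 1/100 error probability on the Phred scale. A minimal Python sketch of both relationships follows; the function names are illustrative and are not part of Picard, and the values are kept as fractions rather than multiplied by 100.

```python
def phred_to_error_probability(q):
    """Convert a Phred-scaled quality (e.g. a mapping quality) to an error probability."""
    return 10 ** (-q / 10)


def alignment_ratios(total_reads, pf_reads, pf_reads_aligned, reads_aligned_in_pairs):
    """Derive the PCT_* fields from the raw counts, as fractions in [0, 1]."""
    def ratio(numerator, denominator):
        # Return 0.0 for an empty denominator instead of raising ZeroDivisionError.
        return numerator / denominator if denominator else 0.0

    return {
        "PCT_PF_READS": ratio(pf_reads, total_reads),
        "PCT_PF_READS_ALIGNED": ratio(pf_reads_aligned, pf_reads),
        "PCT_READS_ALIGNED_IN_PAIRS": ratio(reads_aligned_in_pairs, pf_reads_aligned),
    }


# Made-up counts, for illustration only.
example_ratios = alignment_ratios(
    total_reads=1000, pf_reads=900, pf_reads_aligned=810, reads_aligned_in_pairs=729
)
```

With these made-up counts every ratio is 0.9, and phred_to_error_probability(20) gives the 1/100 chance mentioned for PF_HQ_ALIGNED_READS.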

BaseDistributionByCycleMetrics

Field	Description
READ_END
CYCLE
PCT_A
PCT_C
PCT_G
PCT_T
PCT_N

CollectHiSeqXPfFailMetrics.PFFailDetailedMetric

A metric class for describing PF-failing reads from an Illumina HiSeqX lane.

Field	Description
TILE
X
Y
NUM_N
NUM_Q_GT_TWO
CLASSIFICATION	The classification of this read: {EMPTY, POLYCLONAL, MISALIGNED, UNKNOWN} (See PFFailSummaryMetric for explanation regarding the possible classification.)

CollectHiSeqXPfFailMetrics.PFFailSummaryMetric

Metrics produced by the CollectHiSeqXPfFailMetrics program. Used to diagnose lanes from HiSeqX sequencing, providing the number and fraction of each of the reasons that reads could have failed PF. Possible reasons are EMPTY (reads from empty wells with no template strand), POLYCLONAL (reads from wells that had more than one strand cloned in them), MISALIGNED (reads from wells that are near the edge of the tile), and UNKNOWN (reads that didn't pass PF but couldn't be diagnosed).

Field	Description
TILE	The tile that is described by this metric. Can be a string (like "All") to mean some marginal over tiles.
READS	The total number of reads examined
PF_FAIL_READS	The number of non-PF reads in this tile.
PCT_PF_FAIL_READS	The fraction of reads in this tile that are non-PF (PF_FAIL_READS / READS).
PF_FAIL_EMPTY	The number of non-PF reads in this tile that are deemed empty.
PCT_PF_FAIL_EMPTY	The fraction of non-PF reads in this tile that are deemed empty (as fraction of all non-PF reads).
PF_FAIL_POLYCLONAL	The number of non-PF reads in this tile that are deemed polyclonal.
PCT_PF_FAIL_POLYCLONAL	The fraction of non-PF reads in this tile that are deemed polyclonal (as fraction of all non-PF reads).
PF_FAIL_MISALIGNED	The number of non-PF reads in this tile that are deemed "misaligned".
PCT_PF_FAIL_MISALIGNED	The fraction of non-PF reads in this tile that are deemed "misaligned" (as fraction of all non-PF reads).
PF_FAIL_UNKNOWN	The number of non-PF reads in this tile that have not been classified.
PCT_PF_FAIL_UNKNOWN	The fraction of non-PF reads in this tile that have not been classified (as fraction of all non-PF reads).

CollectOxoGMetrics.CpcgMetrics

Metrics class for outputs.

Field	Description
SAMPLE_ALIAS	The name of the sample being assayed.
LIBRARY	The name of the library being assayed.
CONTEXT	The sequence context being reported on.
TOTAL_SITES	The total number of sites that had at least one base covering them.
TOTAL_BASES	The total number of basecalls observed at all sites.
REF_NONOXO_BASES	The number of reference alleles observed as C in read 1 and G in read 2.
REF_OXO_BASES	The number of reference alleles observed as G in read 1 and C in read 2.
REF_TOTAL_BASES	The total number of reference alleles observed
ALT_NONOXO_BASES	The count of observed A basecalls at C reference positions and T basecalls at G reference bases that are correlated to instrument read number in a way that rules out oxidation as the cause
ALT_OXO_BASES	The count of observed A basecalls at C reference positions and T basecalls at G reference bases that are correlated to instrument read number in a way that is consistent with oxidative damage.
OXIDATION_ERROR_RATE	The oxo error rate, calculated as max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES
OXIDATION_Q	-10 * log10(OXIDATION_ERROR_RATE)
C_REF_REF_BASES	The number of ref basecalls observed at sites where the genome reference == C.
G_REF_REF_BASES	The number of ref basecalls observed at sites where the genome reference == G.
C_REF_ALT_BASES	The number of alt (A/T) basecalls observed at sites where the genome reference == C.
G_REF_ALT_BASES	The number of alt (A/T) basecalls observed at sites where the genome reference == G.
C_REF_OXO_ERROR_RATE	The rate at which C>A and G>T substitutions are observed at C reference sites above the expected rate if there were no bias between sites with a C reference base vs. a G reference base.
C_REF_OXO_Q	C_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
G_REF_OXO_ERROR_RATE	The rate at which C>A and G>T substitutions are observed at G reference sites above the expected rate if there were no bias between sites with a C reference base vs. a G reference base.
G_REF_OXO_Q	G_REF_OXO_ERROR_RATE expressed as a phred-scaled quality score.
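
The two formulas given for OXIDATION_ERROR_RATE and OXIDATION_Q can be written out as a short Python sketch. The function names and the example counts are illustrative assumptions; only the arithmetic comes from the field descriptions above.

```python
import math


def oxidation_error_rate(alt_oxo_bases, alt_nonoxo_bases, total_bases):
    """max(ALT_OXO_BASES - ALT_NONOXO_BASES, 1) / TOTAL_BASES."""
    # The floor of 1 keeps the rate strictly positive, so the log below is defined.
    return max(alt_oxo_bases - alt_nonoxo_bases, 1) / total_bases


def oxidation_q(error_rate):
    """Phred-scale the error rate: -10 * log10(OXIDATION_ERROR_RATE)."""
    return -10 * math.log10(error_rate)


# Made-up counts: 150 oxo-consistent and 50 non-oxo alt bases in 1,000,000 basecalls.
example_rate = oxidation_error_rate(150, 50, 1_000_000)
example_q = oxidation_q(example_rate)
```

With these counts the rate is 1e-4, which corresponds to a Q score of about 40. When the non-oxo count is at least as large as the oxo count, the numerator falls back to 1.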

CollectQualityYieldMetrics.QualityYieldMetrics

A set of metrics used to describe the general quality of a BAM file

Field	Description
TOTAL_READS	The total number of reads in the input file
PF_READS	The number of reads that are PF - pass filter
READ_LENGTH	The average read length of all the reads (will be fixed for a lane)
TOTAL_BASES	The total number of bases in all reads
PF_BASES	The total number of bases in all PF reads
Q20_BASES	The number of bases in all reads that achieve quality score 20 or higher
PF_Q20_BASES	The number of bases in PF reads that achieve quality score 20 or higher
Q30_BASES	The number of bases in all reads that achieve quality score 30 or higher
PF_Q30_BASES	The number of bases in PF reads that achieve quality score 30 or higher
Q20_EQUIVALENT_YIELD	The sum of quality scores of all bases divided by 20
PF_Q20_EQUIVALENT_YIELD	The sum of quality scores of all bases in PF reads divided by 20
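
As a rough illustration of how these counters relate, the following Python sketch accumulates them from a list of reads. The input shape (a list of base-quality lists paired with a PF flag) is a hypothetical stand-in for real BAM records, Q30 is taken to mean a quality score of 30 or higher, and whether the real tool truncates the equivalent yield to an integer is not stated here, so plain division is used.

```python
def quality_yield(reads):
    """reads: iterable of (base_qualities, is_pf) tuples; returns a dict of metrics."""
    m = {
        "TOTAL_READS": 0, "PF_READS": 0, "TOTAL_BASES": 0, "PF_BASES": 0,
        "Q20_BASES": 0, "PF_Q20_BASES": 0, "Q30_BASES": 0, "PF_Q30_BASES": 0,
    }
    qual_sum = pf_qual_sum = 0
    for quals, is_pf in reads:
        q20 = sum(1 for q in quals if q >= 20)
        q30 = sum(1 for q in quals if q >= 30)
        m["TOTAL_READS"] += 1
        m["TOTAL_BASES"] += len(quals)
        m["Q20_BASES"] += q20
        m["Q30_BASES"] += q30
        qual_sum += sum(quals)
        if is_pf:
            # PF counters only see reads that passed the filter.
            m["PF_READS"] += 1
            m["PF_BASES"] += len(quals)
            m["PF_Q20_BASES"] += q20
            m["PF_Q30_BASES"] += q30
            pf_qual_sum += sum(quals)
    m["Q20_EQUIVALENT_YIELD"] = qual_sum / 20
    m["PF_Q20_EQUIVALENT_YIELD"] = pf_qual_sum / 20
    return m


# Two made-up reads: one PF read of four bases, one non-PF read of two bases.
example_yield = quality_yield([([30, 30, 20, 10], True), ([40, 15], False)])
```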

CollectRawWgsMetrics.RawWgsMetrics

Field	Description

CollectVariantCallingMetrics.VariantCallingDetailMetrics

A collection of metrics relating to SNPs and indels within a variant-calling file (VCF) for a given sample.

Field	Description
SAMPLE_ALIAS	The name of the sample being assayed
HET_HOMVAR_RATIO	(count of hets)/(count of homozygous non-ref) for this sample

CollectVariantCallingMetrics.VariantCallingSummaryMetrics

A collection of metrics relating to SNPs and indels within a variant-calling file (VCF).

Field	Description
TOTAL_SNPS	The number of high confidence SNP calls (i.e. non-reference genotypes) that were examined
NUM_IN_DB_SNP	The number of high confidence SNPs found in dbSNP
NOVEL_SNPS	The number of high confidence SNPs called that were not found in dbSNP
FILTERED_SNPS	The number of SNPs that are also filtered
PCT_DBSNP	The percentage of high confidence SNPs in dbSNP
DBSNP_TITV	The Transition/Transversion ratio of the SNP calls made at dbSNP sites
NOVEL_TITV	The Transition/Transversion ratio of the SNP calls made at non-dbSNP sites
TOTAL_INDELS	The number of high confidence Indel calls that were examined
NOVEL_INDELS	The number of high confidence Indels called that were not found in dbSNP
FILTERED_INDELS	The number of indels that are also filtered
PCT_DBSNP_INDELS	The percentage of high confidence Indels in dbSNP
NUM_IN_DB_SNP_INDELS	The number of high confidence Indels found in dbSNP
DBSNP_INS_DEL_RATIO	The Insertion/Deletion ratio of the Indel calls made at dbSNP sites
NOVEL_INS_DEL_RATIO	The Insertion/Deletion ratio of the Indel calls made at non-dbSNP sites
TOTAL_MULTIALLELIC_SNPS	The number of high confidence multiallelic SNP calls that were examined
NUM_IN_DB_SNP_MULTIALLELIC	The number of high confidence multiallelic SNPs found in dbSNP
TOTAL_COMPLEX_INDELS	The number of high confidence complex Indel calls that were examined
NUM_IN_DB_SNP_COMPLEX_INDELS	The number of high confidence complex Indels found in dbSNP
SNP_REFERENCE_BIAS	The rate at which reference bases are observed at ref/alt heterozygous SNP sites.
NUM_SINGLETONS	For summary metrics, the number of variants that appear in only one sample. For detail metrics, the number of variants that appear only in the current sample.

CollectWgsMetrics.WgsMetrics

Metrics for evaluating the performance of whole genome sequencing experiments.

Field	Description
GENOME_TERRITORY	The number of non-N bases in the genome reference over which coverage will be evaluated.
MEAN_COVERAGE	The mean coverage in bases of the genome territory, after all filters are applied.
SD_COVERAGE	The standard deviation of coverage of the genome after all filters are applied.
MEDIAN_COVERAGE	The median coverage in bases of the genome territory, after all filters are applied.
MAD_COVERAGE	The median absolute deviation of coverage of the genome after all filters are applied.
PCT_EXC_MAPQ	The fraction of aligned bases that were filtered out because they were in reads with low mapping quality (default is < 20).
PCT_EXC_DUPE	The fraction of aligned bases that were filtered out because they were in reads marked as duplicates.
PCT_EXC_UNPAIRED	The fraction of aligned bases that were filtered out because they were in reads without a mapped mate pair.
PCT_EXC_BASEQ	The fraction of aligned bases that were filtered out because they were of low base quality (default is < 20).
PCT_EXC_OVERLAP	The fraction of aligned bases that were filtered out because they were the second observation from an insert with overlapping reads.
PCT_EXC_CAPPED	The fraction of aligned bases that were filtered out because they would have raised coverage above the capped value (default cap = 250x).
PCT_EXC_TOTAL	The total fraction of aligned bases excluded due to all filters.
PCT_5X	The fraction of bases that attained at least 5X sequence coverage in post-filtering bases.
PCT_10X	The fraction of bases that attained at least 10X sequence coverage in post-filtering bases.
PCT_15X	The fraction of bases that attained at least 15X sequence coverage in post-filtering bases.
PCT_20X	The fraction of bases that attained at least 20X sequence coverage in post-filtering bases.
PCT_25X	The fraction of bases that attained at least 25X sequence coverage in post-filtering bases.
PCT_30X	The fraction of bases that attained at least 30X sequence coverage in post-filtering bases.
PCT_40X	The fraction of bases that attained at least 40X sequence coverage in post-filtering bases.
PCT_50X	The fraction of bases that attained at least 50X sequence coverage in post-filtering bases.
PCT_60X	The fraction of bases that attained at least 60X sequence coverage in post-filtering bases.
PCT_70X	The fraction of bases that attained at least 70X sequence coverage in post-filtering bases.
PCT_80X	The fraction of bases that attained at least 80X sequence coverage in post-filtering bases.
PCT_90X	The fraction of bases that attained at least 90X sequence coverage in post-filtering bases.
PCT_100X	The fraction of bases that attained at least 100X sequence coverage in post-filtering bases.
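
The coverage fields above can all be derived from a histogram of post-filtering depth over the genome territory. The Python sketch below assumes such a histogram is available as a dict mapping depth to number of bases; the lower-median convention and the function names are simplifying assumptions and may differ from what Picard does internally.

```python
def weighted_lower_median(pairs):
    """Lower median of values given as (value, count) pairs."""
    total = sum(count for _, count in pairs)
    running = 0
    for value, count in sorted(pairs):
        running += count
        if 2 * running >= total:
            return value
    raise ValueError("empty histogram")


def coverage_summary(depth_hist, thresholds=(5, 10, 15, 20)):
    """depth_hist maps coverage depth -> number of territory bases at that depth."""
    territory = sum(depth_hist.values())
    mean = sum(depth * n for depth, n in depth_hist.items()) / territory
    median = weighted_lower_median(list(depth_hist.items()))
    # MAD: median of the absolute deviations from the median, weighted by base count.
    mad = weighted_lower_median([(abs(depth - median), n) for depth, n in depth_hist.items()])
    summary = {"MEAN_COVERAGE": mean, "MEDIAN_COVERAGE": median, "MAD_COVERAGE": mad}
    for x in thresholds:
        covered = sum(n for depth, n in depth_hist.items() if depth >= x)
        summary[f"PCT_{x}X"] = covered / territory
    return summary


# Made-up histogram over a 100-base territory.
example_cov = coverage_summary({0: 10, 5: 20, 10: 40, 30: 30})
```

For this toy histogram the mean is 14.0, the median is 10, the MAD is 5, and 90% of bases reach at least 5X.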

CollectWgsMetricsFromQuerySorted.QuerySortedSeqMetrics

Metrics for evaluating the performance of whole genome sequencing experiments.

Field	Description
TOTAL_BASES	The total number of bases, before any filters are applied.
TOTAL_USABLE_BASES	The number of usable bases, after all filters are applied.
TOTAL_READ_PAIRS	The number of read pairs, before all filters are applied.
TOTAL_DUPE_PAIRS	The number of duplicate read pairs, before all filters are applied.
TOTAL_ORIENTED_PAIRS	The number of read pairs with standard orientations from which to calculate mean insert size, after filters are applied.
MEAN_INSERT_SIZE	The mean insert size, after filters are applied.

CollectWgsMetricsFromSampledSites.SampledWgsMetrics

Field	Description

DuplicationMetrics

Metrics that are calculated during the process of marking duplicates within a stream of SAMRecords.

Field	Description
LIBRARY	The library on which the duplicate marking was performed.
UNPAIRED_READS_EXAMINED	The number of mapped reads examined which did not have a mapped mate pair, either because the read is unpaired, or the read is paired to an unmapped mate.
READ_PAIRS_EXAMINED	The number of mapped read pairs examined.
UNMAPPED_READS	The total number of unmapped reads examined.
UNPAIRED_READ_DUPLICATES	The number of fragments that were marked as duplicates.
READ_PAIR_DUPLICATES	The number of read pairs that were marked as duplicates.
READ_PAIR_OPTICAL_DUPLICATES	The number of read pair duplicates that were caused by optical duplication. Value is always < READ_PAIR_DUPLICATES, which counts all duplicates regardless of source.
PERCENT_DUPLICATION	The percentage of mapped sequence that is marked as duplicate.
ESTIMATED_LIBRARY_SIZE	The estimated number of unique molecules in the library based on PE duplication.

ExtractIlluminaBarcodes.BarcodeMetric

Metrics produced by the ExtractIlluminaBarcodes program that is used to parse data in the basecalls directory and determine to which barcode each read should be assigned.

Field	Description
BARCODE	The barcode (from the set of expected barcodes) for which the following metrics apply. Note that the "symbolic" barcode of NNNNNN is used to report metrics for all reads that do not match a barcode.
BARCODE_NAME
LIBRARY_NAME
READS	The total number of reads matching the barcode.
PF_READS	The number of PF reads matching this barcode (always less than or equal to READS).
PERFECT_MATCHES	The number of all reads matching this barcode that matched with 0 errors or no-calls.
PF_PERFECT_MATCHES	The number of PF reads matching this barcode that matched with 0 errors or no-calls.
ONE_MISMATCH_MATCHES	The number of all reads matching this barcode that matched with 1 error or no-call.
PF_ONE_MISMATCH_MATCHES	The number of PF reads matching this barcode that matched with 1 error or no-call.
PCT_MATCHES	The percentage of all reads in the lane that matched to this barcode.
RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT	The ratio of all reads matching this barcode to all reads matching the most prevalent barcode. For the most prevalent barcode this will be 1; for all others it will be less than 1 (except when there are more orphan reads than reads for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation between barcodes.
PF_PCT_MATCHES	The percentage of PF reads in the lane that matched to this barcode.
PF_RATIO_THIS_BARCODE_TO_BEST_BARCODE_PCT	The ratio of PF reads matching this barcode to PF reads matching the most prevalent barcode. For the most prevalent barcode this will be 1; for all others it will be less than 1 (except when there are more orphan reads than reads for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation of PF reads between barcodes.
PF_NORMALIZED_MATCHES	The "normalized" matches to each barcode. This is calculated as the number of PF reads matching this barcode over the sum of all PF reads matching any barcode (excluding orphans). If all barcodes are represented equally this will be 1.
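
The percentage, ratio-to-best, and fold-difference values described above can be sketched in Python as follows. The input dict, the function name, and the choice to compute the fold-difference over expected barcodes only (leaving out the NNNNNN orphan row) are assumptions made for illustration.

```python
def barcode_ratios(reads_by_barcode, orphan_key="NNNNNN"):
    """Return per-barcode PCT_MATCHES and ratio-to-best, plus the fold-difference."""
    total = sum(reads_by_barcode.values())
    expected = {b: n for b, n in reads_by_barcode.items() if b != orphan_key}
    # "Best" is the most prevalent expected barcode; orphans are not candidates.
    best = max(expected.values())
    rows = {
        barcode: {"PCT_MATCHES": n / total, "RATIO_TO_BEST": n / best}
        for barcode, n in reads_by_barcode.items()
    }
    lowest_ratio = min(rows[b]["RATIO_TO_BEST"] for b in expected)
    fold_difference = 1 / lowest_ratio if lowest_ratio else float("inf")
    return rows, fold_difference


# Made-up lane: three expected barcodes plus orphan reads.
example_rows, example_fold = barcode_ratios(
    {"AAAAAA": 400, "CCCCCC": 200, "GGGGGG": 100, "NNNNNN": 300}
)
```

In this toy lane the best barcode has ratio 1, the weakest has 0.25, and the fold-difference in representation is therefore 4.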

GcBiasDetailMetrics

Class that holds detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.

Field	Description
ACCUMULATION_LEVEL
GC	The G+C content of the reference sequence represented by this bin. Values are from 0% to 100%
WINDOWS	The number of windows on the reference genome that have this G+C content.
READ_STARTS	The number of reads whose start position is at the start of a window of this GC.
MEAN_BASE_QUALITY	The mean quality (determined via the error rate) of all bases of all reads that are assigned to windows of this GC.
NORMALIZED_COVERAGE	The ratio of "coverage" in this GC bin vs. the mean coverage of all GC bins. A number of 1 represents mean coverage, a number less than one represents lower than mean coverage (e.g. 0.5 means half as much coverage as average) while a number greater than one represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average).
ERROR_BAR_WIDTH	The radius of error bars in this bin based on the number of observations made. For example if the normalized coverage is 0.75 and the error bar width is 0.1 then the error bars would be drawn from 0.65 to 0.85.
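
A minimal Python sketch of NORMALIZED_COVERAGE and the error-bar bounds follows. It assumes "coverage" means read starts per window and that the mean is taken over all windows (total read starts divided by total windows); both the interpretation and the function names are assumptions.

```python
def normalized_coverage(windows, read_starts):
    """windows, read_starts: dicts keyed by GC percentage. Returns GC -> normalized coverage."""
    mean_reads_per_window = sum(read_starts.values()) / sum(windows.values())
    return {
        gc: (read_starts.get(gc, 0) / n_windows) / mean_reads_per_window
        for gc, n_windows in windows.items()
        if n_windows  # skip GC bins with no windows on the reference
    }


def error_bar_bounds(normalized, error_bar_width):
    """ERROR_BAR_WIDTH is a radius: bars run from value - width to value + width."""
    return normalized - error_bar_width, normalized + error_bar_width


# Made-up bins: the 50% GC bin is over-covered, the extremes under-covered.
example_norm = normalized_coverage(
    windows={30: 100, 50: 200, 70: 100},
    read_starts={30: 50, 50: 300, 70: 50},
)
example_bounds = error_bar_bounds(0.75, 0.1)
```

The last call reproduces the example from the table: a normalized coverage of 0.75 with width 0.1 gives bars from 0.65 to 0.85.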

GcBiasMetrics

Field	Description
DETAILS
SUMMARY

GcBiasSummaryMetrics

High level metrics that capture how biased the coverage in a certain lane is.

Field	Description
ACCUMULATION_LEVEL
WINDOW_SIZE	The window size on the genome used to calculate the GC of the sequence.
TOTAL_CLUSTERS	The total number of clusters that were seen in the gc bias calculation.
ALIGNED_READS	The total number of aligned reads used to compute the gc bias metrics.
AT_DROPOUT	Illumina-style AT dropout metric. Calculated by taking each GC bin independently and calculating (%ref_at_gc - %reads_at_gc) and summing all positive values for GC=[0..50].
GC_DROPOUT	Illumina-style GC dropout metric. Calculated by taking each GC bin independently and calculating (%ref_at_gc - %reads_at_gc) and summing all positive values for GC=[50..100].
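
The dropout calculation described above (sum the positive values of %ref_at_gc - %reads_at_gc over a GC range) can be sketched in Python. The dict inputs keyed by integer GC percentage and the function name are illustrative assumptions.

```python
def dropout(ref_pct_at_gc, reads_pct_at_gc, gc_bins):
    """Sum the positive (%ref_at_gc - %reads_at_gc) differences over the given GC bins."""
    return sum(
        max(ref_pct_at_gc.get(gc, 0.0) - reads_pct_at_gc.get(gc, 0.0), 0.0)
        for gc in gc_bins
    )


# Made-up percentages of reference windows and of reads per GC bin.
ref_pct = {20: 10.0, 40: 40.0, 60: 40.0, 80: 10.0}
reads_pct = {20: 5.0, 40: 45.0, 60: 44.0, 80: 6.0}

at_dropout = dropout(ref_pct, reads_pct, range(0, 51))    # GC = [0..50]
gc_dropout = dropout(ref_pct, reads_pct, range(50, 101))  # GC = [50..100]
```

Here only the 20% bin is under-covered on the AT-rich side (5 points) and only the 80% bin on the GC-rich side (4 points); over-covered bins contribute nothing.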

GenotypeConcordanceContingencyMetrics

Class that holds metrics about the Genotype Concordance contingency tables.

Field	Description
VARIANT_TYPE	The type of the event (i.e. either SNP or INDEL)
TRUTH_SAMPLE	The name of the 'truth' sample
CALL_SAMPLE	The name of the 'call' sample
TP_COUNT	The TP (true positive) count across all variants
TN_COUNT	The TN (true negative) count across all variants
FP_COUNT	The FP (false positive) count across all variants
FN_COUNT	The FN (false negative) count across all variants
EMPTY_COUNT	The empty (no contingency info) count across all variants

GenotypeConcordanceDetailMetrics

Class that holds detail metrics about Genotype Concordance

Field	Description
VARIANT_TYPE	The type of the event (i.e. either SNP or INDEL)
TRUTH_SAMPLE	The name of the 'truth' sample
CALL_SAMPLE	The name of the 'call' sample
TRUTH_STATE	The state of the 'truth' sample (i.e. HOM_REF, HET_REF_VAR1, HET_VAR1_VAR2...)
CALL_STATE	The state of the 'call' sample (i.e. HOM_REF, HET_REF_VAR1...)
COUNT	The number of events of type TRUTH_STATE and CALL_STATE for the VARIANT_TYPE and samples
CONTINGENCY_VALUES	The list of contingency table values (TP, TN, FP, FN) that are deduced from the truth/call state comparison, given the reference. In general, we are comparing two sets of alleles. Therefore, we can have zero or more contingency table values represented in one comparison. For example, if the truthset is a heterozygous call with both alleles non-reference (HET_VAR1_VAR2), and the callset is a heterozygous call with both alleles non-reference with one of the alternate alleles matching an alternate allele in the truthset, we would have a true positive, false positive, and false negative. The true positive is from the matching alternate alleles, the false positive is the alternate allele found in the callset but not found in the truthset, and the false negative is the alternate in the truthset not found in the callset. We also include a true negative in cases where the reference allele is found in both the truthset and callset.
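
The allele-set comparison described for CONTINGENCY_VALUES can be approximated with plain set operations. This Python sketch is a simplified reading of that description, not the scheme the tool actually uses; the allele labels ("REF", "VAR1", ...) and the function name are hypothetical.

```python
def contingency_values(truth_alleles, call_alleles, ref="REF"):
    """Return the contingency labels implied by comparing two genotype allele sets."""
    truth, call = set(truth_alleles), set(call_alleles)
    truth_alt, call_alt = truth - {ref}, call - {ref}
    values = []
    if truth_alt & call_alt:
        values.append("TP")  # an alternate allele present in both sets
    if ref in truth and ref in call:
        values.append("TN")  # the reference allele present in both sets
    if call_alt - truth_alt:
        values.append("FP")  # an alternate allele called but absent from the truth
    if truth_alt - call_alt:
        values.append("FN")  # an alternate allele in the truth that was not called
    return values


# The example from the description: both genotypes are non-reference hets sharing one allele.
example_values = contingency_values(("VAR1", "VAR2"), ("VAR1", "VAR3"))
```

For the example in the description this yields a true positive, a false positive, and a false negative; two matching HET_REF_VAR1 genotypes would yield a true positive and a true negative.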

GenotypeConcordanceSummaryMetrics

Class that holds summary metrics about Genotype Concordance

Field	Description
VARIANT_TYPE	The type of the event (i.e. either SNP or INDEL)
TRUTH_SAMPLE	The name of the 'truth' sample
CALL_SAMPLE	The name of the 'call' sample
HET_SENSITIVITY	The sensitivity for all heterozygous variants (Sensitivity is TP / (TP + FN))
HET_PPV	The ppv (positive predictive value) for all heterozygous variants (PPV is the TP / (TP + FP))
HET_SPECIFICITY	The specificity for all heterozygous variants cannot be calculated
HOMVAR_SENSITIVITY	The sensitivity for all homozygous variants (Sensitivity is TP / (TP + FN))
HOMVAR_PPV	The ppv (positive predictive value) for all homozygous variants (PPV is the TP / (TP + FP))
HOMVAR_SPECIFICITY	The specificity for all homozygous variants cannot be calculated.
VAR_SENSITIVITY	The sensitivity for all (heterozygous and homozygous) variants (Sensitivity is TP / (TP + FN))
VAR_PPV	The ppv (positive predictive value) for all (heterozygous and homozygous) variants (PPV is the TP / (TP + FP))
VAR_SPECIFICITY	The specificity for all (heterozygous and homozygous) variants (Specificity is TN / (FP + TN))
GENOTYPE_CONCORDANCE	The genotype concordance for all possible states. Genotype Concordance is the number of times the truth and call states match exactly / all truth and call combinations made
NON_REF_GENOTYPE_CONCORDANCE	The non-ref genotype concordance, i.e. for all var states only. Non Ref Genotype Concordance is the number of times the truth and call states match exactly for vars only / all truth and call var combinations made
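
The sensitivity, PPV, and specificity formulas quoted in the table translate directly into code. A minimal Python sketch, with illustrative function names and made-up counts:

```python
def sensitivity(tp, fn):
    """TP / (TP + FN)."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """TN / (FP + TN)."""
    return tn / (fp + tn)


# Made-up contingency counts.
tp, fn, fp, tn = 90, 10, 5, 895
example_sens = sensitivity(tp, fn)
example_ppv = ppv(tp, fp)
example_spec = specificity(tn, fp)
```

These functions raise ZeroDivisionError when a denominator is zero, so callers would need to decide how to report a category with no observations.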

HsMetrics

The set of metrics captured that are specific to a hybrid selection analysis.

Field	Description
BAIT_SET	The name of the bait set used in the hybrid selection.
GENOME_SIZE	The number of bases in the reference genome used for alignment.
BAIT_TERRITORY	The number of bases which have one or more baits on top of them.
TARGET_TERRITORY	The unique number of target bases in the experiment where target is usually exons etc.
BAIT_DESIGN_EFFICIENCY	Target territory / bait territory. 1 == perfectly efficient, 0.5 == half of baited bases are not target.
TOTAL_READS	The total number of reads in the SAM or BAM file examined.
PF_READS	The number of reads that pass the vendor's filter.
PF_UNIQUE_READS	The number of PF reads that are not marked as duplicates.
PCT_PF_READS	PF reads / total reads. The percent of reads passing filter.
PCT_PF_UQ_READS	PF Unique Reads / Total Reads.
PF_UQ_READS_ALIGNED	The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.
PCT_PF_UQ_READS_ALIGNED	PF Unique Reads Aligned / PF Unique Reads.
PF_UQ_BASES_ALIGNED	The number of bases in the PF aligned reads that are mapped to a reference base. Accounts for clipping and gaps.
ON_BAIT_BASES	The number of PF aligned bases that mapped to a baited region of the genome.
NEAR_BAIT_BASES	The number of PF aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.
OFF_BAIT_BASES	The number of PF aligned bases that mapped neither on nor near a bait.
ON_TARGET_BASES	The number of PF aligned bases that mapped to a targeted region of the genome.
PCT_SELECTED_BASES	On+Near Bait Bases / PF Bases Aligned.
PCT_OFF_BAIT	The percentage of aligned PF bases that mapped neither on nor near a bait.
ON_BAIT_VS_SELECTED	The percentage of on+near bait bases that are on as opposed to near.
MEAN_BAIT_COVERAGE	The mean coverage of all baits in the experiment.
MEAN_TARGET_COVERAGE	The mean coverage of targets that received at least coverage depth = 2 at one base.
PCT_USABLE_BASES_ON_BAIT	The number of aligned, de-duped, on-bait bases out of the PF bases available.
PCT_USABLE_BASES_ON_TARGET	The number of aligned, de-duped, on-target bases out of the PF bases available.
FOLD_ENRICHMENT	The fold by which the baited region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT	The fraction of targets that did not reach coverage=2 over any base.
FOLD_80_BASE_PENALTY	The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.
PCT_TARGET_BASES_2X	The percentage of ALL target bases achieving 2X or greater coverage.
PCT_TARGET_BASES_10X	The percentage of ALL target bases achieving 10X or greater coverage.
PCT_TARGET_BASES_20X	The percentage of ALL target bases achieving 20X or greater coverage.
PCT_TARGET_BASES_30X	The percentage of ALL target bases achieving 30X or greater coverage.
PCT_TARGET_BASES_40X	The percentage of ALL target bases achieving 40X or greater coverage.
PCT_TARGET_BASES_50X	The percentage of ALL target bases achieving 50X or greater coverage.
PCT_TARGET_BASES_100X	The percentage of ALL target bases achieving 100X or greater coverage.
HS_LIBRARY_SIZE	The estimated number of unique molecules in the selected part of the library.
HS_PENALTY_10X	The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 10X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.
HS_PENALTY_20X	The "hybrid selection penalty" incurred to get 80% of target bases to 20X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 20X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 20 * HS_PENALTY_20X.
HS_PENALTY_30X	The "hybrid selection penalty" incurred to get 80% of target bases to 30X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 30X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 30 * HS_PENALTY_30X.
HS_PENALTY_40X	The "hybrid selection penalty" incurred to get 80% of target bases to 40X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 40X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 40 * HS_PENALTY_40X.
HS_PENALTY_50X	The "hybrid selection penalty" incurred to get 80% of target bases to 50X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 50X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 50 * HS_PENALTY_50X.
HS_PENALTY_100X	The "hybrid selection penalty" incurred to get 80% of target bases to 100X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 100X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 100 * HS_PENALTY_100X.
AT_DROPOUT	A measure of how undercovered <= 50% GC regions are relative to the mean. For each GC bin [0..50] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. AT DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC<=50% regions mapped elsewhere.
GC_DROPOUT	A measure of how undercovered >= 50% GC regions are relative to the mean. For each GC bin [50..100] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. GC DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC>=50% regions mapped elsewhere.
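
The interpretation given for the HS_PENALTY_* fields is plain arithmetic. A minimal sketch in Python, where the function name and the example penalty value are hypothetical and not taken from Picard:

```python
def required_aligned_bases(target_territory: int, coverage: int, hs_penalty: float) -> float:
    """Return the PF_ALIGNED_BASES needed for 80% of target bases to reach `coverage`.

    Implements target_territory * coverage * HS_PENALTY_<coverage>X as described above.
    """
    return target_territory * coverage * hs_penalty


# A 10-megabase design, a 20X goal, and a made-up HS_PENALTY_20X of 4.5.
needed = required_aligned_bases(10**7, 20, 4.5)
print(f"Sequence until PF_ALIGNED_BASES is about {needed:.3g}")
```

A larger penalty means more sequencing is needed for the same target coverage, so the value can be used to compare the efficiency of two designs or libraries.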

IlluminaBasecallingMetrics

Metric for Illumina Basecalling that stores means and standard deviations on a per-barcode per-lane basis. Means and standard deviations are taken over all tiles.

Field	Description
LANE	The lane for which the metrics were calculated.
MOLECULAR_BARCODE_SEQUENCE_1	The barcode sequence for which the metrics were calculated.
MOLECULAR_BARCODE_NAME	The barcode name for which the metrics were calculated.
TOTAL_BASES	The total number of bases assigned to the index.
PF_BASES	The total number of passing-filter bases assigned to the index.
TOTAL_READS	The total number of reads assigned to the index.
PF_READS	The total number of passing-filter reads assigned to the index.
TOTAL_CLUSTERS	The total number of clusters assigned to the index.
PF_CLUSTERS	The total number of PF clusters assigned to the index.
MEAN_CLUSTERS_PER_TILE	The mean number of clusters per tile.
SD_CLUSTERS_PER_TILE	The standard deviation of clusters per tile.
MEAN_PCT_PF_CLUSTERS_PER_TILE	The mean percentage of pf clusters per tile.
SD_PCT_PF_CLUSTERS_PER_TILE	The standard deviation in percentage of pf clusters per tile.
MEAN_PF_CLUSTERS_PER_TILE	The mean number of pf clusters per tile.
SD_PF_CLUSTERS_PER_TILE	The standard deviation in number of pf clusters per tile.
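
As an illustration of the per-tile fields, the sketch below summarises a list of tile counts for one lane/barcode combination. The function name and the input layout are hypothetical, and using the sample standard deviation is an assumption; this page does not say which estimator Picard uses.

```python
from statistics import mean, stdev


def per_tile_summary(clusters, pf_clusters):
    """Summarise per-tile cluster counts for one lane/barcode combination.

    `clusters` and `pf_clusters` are parallel lists with one entry per tile.
    Needs at least two tiles, and every tile must have a non-zero cluster count.
    """
    pct_pf = [100.0 * pf / total for pf, total in zip(pf_clusters, clusters)]
    return {
        "MEAN_CLUSTERS_PER_TILE": mean(clusters),
        "SD_CLUSTERS_PER_TILE": stdev(clusters),  # sample SD: an assumption
        "MEAN_PCT_PF_CLUSTERS_PER_TILE": mean(pct_pf),
        "SD_PCT_PF_CLUSTERS_PER_TILE": stdev(pct_pf),
        "MEAN_PF_CLUSTERS_PER_TILE": mean(pf_clusters),
        "SD_PF_CLUSTERS_PER_TILE": stdev(pf_clusters),
    }


# Three tiles with made-up counts.
summary = per_tile_summary([1000, 1200, 1100], [900, 1020, 990])
for field, value in summary.items():
    print(field, round(value, 2))
```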

IlluminaLaneMetrics

Embodies characteristics that describe a lane.

Field	Description
CLUSTER_DENSITY	The number of clusters per unit area on this lane, expressed in units of [cluster / mm^2].
LANE	This lane's number.

IlluminaPhasingMetrics

Metrics for Illumina Basecalling that stores median phasing and prephasing percentages on a per-template-read, per-lane basis. For each lane/template read # (i.e. FIRST, SECOND) combination we will store the median values of both the phasing and prephasing values for every tile in that lane/template read pair.

Field	Description
LANE
TYPE_NAME
PHASING_APPLIED
PREPHASING_APPLIED

InsertSizeMetrics

Metrics about the insert size distribution of a paired-end library, created by the CollectInsertSizeMetrics program and usually written to a file with the extension ".insert_size_metrics". In addition the insert size distribution is plotted to a file with the extension ".insert_size_Histogram.pdf".

Field	Description
MEDIAN_INSERT_SIZE	The MEDIAN insert size of all paired end reads where both ends mapped to the same chromosome.
MEDIAN_ABSOLUTE_DEVIATION	The median absolute deviation of the distribution. If the distribution is essentially normal then the standard deviation can be estimated as ~1.4826 * MAD.
MIN_INSERT_SIZE	The minimum measured insert size. This is usually 1 and not very useful as it is likely artifactual.
MAX_INSERT_SIZE	The maximum measured insert size by alignment. This is usually very high, representing either an artifact or possibly the presence of a structural re-arrangement.
MEAN_INSERT_SIZE	The mean insert size of the "core" of the distribution. Artifactual outliers in the distribution often cause calculation of nonsensical mean and stdev values. To avoid this, the distribution is first trimmed to a "core" distribution of +/- N median absolute deviations around the median insert size. By default N=10, but this is configurable.
STANDARD_DEVIATION	Standard deviation of insert sizes over the "core" of the distribution.
READ_PAIRS	The total number of read pairs that were examined in the entire distribution.
PAIR_ORIENTATION	The pair orientation of the reads in this data category.
WIDTH_OF_10_PERCENT	The "width" of the bins, centered around the median, that encompass 10% of all read pairs.
WIDTH_OF_20_PERCENT	The "width" of the bins, centered around the median, that encompass 20% of all read pairs.
WIDTH_OF_30_PERCENT	The "width" of the bins, centered around the median, that encompass 30% of all read pairs.
WIDTH_OF_40_PERCENT	The "width" of the bins, centered around the median, that encompass 40% of all read pairs.
WIDTH_OF_50_PERCENT	The "width" of the bins, centered around the median, that encompass 50% of all read pairs.
WIDTH_OF_60_PERCENT	The "width" of the bins, centered around the median, that encompass 60% of all read pairs.
WIDTH_OF_70_PERCENT	The "width" of the bins, centered around the median, that encompass 70% of all read pairs. This metric divided by 2 should approximate the standard deviation when the insert size distribution is a normal distribution.
WIDTH_OF_80_PERCENT	The "width" of the bins, centered around the median, that encompass 80% of all read pairs.
WIDTH_OF_90_PERCENT	The "width" of the bins, centered around the median, that encompass 90% of all read pairs.
WIDTH_OF_99_PERCENT	The "width" of the bins, centered around the median, that encompass 99% of all read pairs.
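
The trimming described for MEAN_INSERT_SIZE and STANDARD_DEVIATION can be sketched as follows. This is only an illustration working on a raw list of insert sizes; the function name is hypothetical, the inclusive trimming boundary is an assumption, and Picard's own implementation may differ in detail.

```python
from statistics import mean, median, stdev


def core_insert_size_stats(insert_sizes, n_mads=10):
    """Median, MAD, and mean/SD over the "core" within n_mads MADs of the median."""
    med = median(insert_sizes)
    mad = median([abs(size - med) for size in insert_sizes])
    # Keep only pairs within +/- n_mads median absolute deviations of the median.
    core = [size for size in insert_sizes if abs(size - med) <= n_mads * mad]
    return {
        "MEDIAN_INSERT_SIZE": med,
        "MEDIAN_ABSOLUTE_DEVIATION": mad,
        "MEAN_INSERT_SIZE": mean(core),
        "STANDARD_DEVIATION": stdev(core),
        "SD_ESTIMATE_FROM_MAD": 1.4826 * mad,  # helper value, not a Picard field
    }


# One artifactual 50,000 bp pair among otherwise tight insert sizes.
stats = core_insert_size_stats([190, 195, 200, 200, 205, 210, 50000])
for field, value in stats.items():
    print(field, round(value, 2))
```

Without the trimming step, the single outlier would pull the mean into the thousands, which is the nonsensical result the "core" distribution is meant to avoid.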

JumpingLibraryMetrics

High level metrics about the presence of outward- and inward-facing pairs within a SAM file generated with a jumping library, produced by the CollectJumpingLibraryMetrics program and usually stored in a file with the extension ".jump_metrics".

Field	Description
JUMP_PAIRS	The number of outward-facing pairs in the SAM file
JUMP_DUPLICATE_PAIRS	The number of outward-facing pairs that are duplicates
JUMP_DUPLICATE_PCT	The percentage of outward-facing pairs that are marked as duplicates
JUMP_LIBRARY_SIZE	The estimated library size for outward-facing pairs
JUMP_MEAN_INSERT_SIZE	The mean insert size for outward-facing pairs
JUMP_STDEV_INSERT_SIZE	The standard deviation on the insert size for outward-facing pairs
NONJUMP_PAIRS	The number of inward-facing pairs in the SAM file
NONJUMP_DUPLICATE_PAIRS	The number of inward-facing pairs that are duplicates
NONJUMP_DUPLICATE_PCT	The percentage of inward-facing pairs that are marked as duplicates
NONJUMP_LIBRARY_SIZE	The estimated library size for inward-facing pairs
NONJUMP_MEAN_INSERT_SIZE	The mean insert size for inward-facing pairs
NONJUMP_STDEV_INSERT_SIZE	The standard deviation on the insert size for inward-facing pairs
CHIMERIC_PAIRS	The number of pairs where either (a) the ends fall on different chromosomes or (b) the insert size is greater than the maximum of 100000 or 2 times the mode of the insert size for outward-facing pairs.
FRAGMENTS	The number of fragments in the SAM file
PCT_JUMPS	The number of outward-facing pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.
PCT_NONJUMPS	The number of inward-facing pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.
PCT_CHIMERAS	The number of chimeric pairs expressed as a percentage of the total of all outward-facing pairs, inward-facing pairs, and chimeric pairs.
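
The three PCT_ fields share one denominator: the sum of outward-facing, inward-facing, and chimeric pairs. A minimal sketch, with a hypothetical function name and values returned as fractions between 0 and 1 (that scaling is an assumption; multiply by 100 for percentages):

```python
def jump_fractions(jump_pairs: int, nonjump_pairs: int, chimeric_pairs: int) -> dict:
    """Share of each pair class among all jump, non-jump, and chimeric pairs."""
    total = jump_pairs + nonjump_pairs + chimeric_pairs
    if total == 0:
        # No pairs at all: report zeros rather than dividing by zero.
        return {"PCT_JUMPS": 0.0, "PCT_NONJUMPS": 0.0, "PCT_CHIMERAS": 0.0}
    return {
        "PCT_JUMPS": jump_pairs / total,
        "PCT_NONJUMPS": nonjump_pairs / total,
        "PCT_CHIMERAS": chimeric_pairs / total,
    }


print(jump_fractions(600, 300, 100))
```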

MultilevelMetrics

Field	Description
SAMPLE	The sample to which these metrics apply. If null, it means they apply to all reads in the file.
LIBRARY	The library to which these metrics apply. If null, it means that the metrics were accumulated at the sample level.
READ_GROUP	The read group to which these metrics apply. If null, it means that the metrics were accumulated at the library or sample level.

RnaSeqMetrics

Metrics about the alignment of RNA-seq reads within a SAM file to genes, produced by the CollectRnaSeqMetrics program and usually stored in a file with the extension ".rna_metrics".

Field	Description
PF_BASES	The total number of PF bases including non-aligned reads.
PF_ALIGNED_BASES	The total number of aligned PF bases. Non-primary alignments are not counted. Bases in aligned reads that do not correspond to reference (e.g. soft clips, insertions) are not counted.
RIBOSOMAL_BASES	Number of bases in primary alignments that align to ribosomal sequence.
CODING_BASES	Number of bases in primary alignments that align to a non-UTR coding base for some gene, and not ribosomal sequence.
UTR_BASES	Number of bases in primary alignments that align to a UTR base for some gene, and not a coding base.
INTRONIC_BASES	Number of bases in primary alignments that align to an intronic base for some gene, and not a coding or UTR base.
INTERGENIC_BASES	Number of bases in primary alignments that do not align to any gene.
IGNORED_READS	Number of primary alignments that map to a sequence specified on command-line as IGNORED_SEQUENCE. These are not counted in PF_ALIGNED_BASES, CORRECT_STRAND_READS, INCORRECT_STRAND_READS, or any of the base-counting metrics. These reads are counted in PF_BASES.
CORRECT_STRAND_READS	Number of aligned reads that map to the correct strand. 0 if library is not strand-specific.
INCORRECT_STRAND_READS	Number of aligned reads that map to the incorrect strand. 0 if library is not strand-specific.
PCT_RIBOSOMAL_BASES	RIBOSOMAL_BASES / PF_ALIGNED_BASES
PCT_CODING_BASES	CODING_BASES / PF_ALIGNED_BASES
PCT_UTR_BASES	UTR_BASES / PF_ALIGNED_BASES
PCT_INTRONIC_BASES	INTRONIC_BASES / PF_ALIGNED_BASES
PCT_INTERGENIC_BASES	INTERGENIC_BASES / PF_ALIGNED_BASES
PCT_MRNA_BASES	PCT_UTR_BASES + PCT_CODING_BASES
PCT_USABLE_BASES	The number of bases mapping to mRNA (UTR_BASES + CODING_BASES) divided by the total number of PF bases.
PCT_CORRECT_STRAND_READS	CORRECT_STRAND_READS/(CORRECT_STRAND_READS + INCORRECT_STRAND_READS). 0 if library is not strand-specific.
MEDIAN_CV_COVERAGE	The median CV of coverage of the 1000 most highly expressed transcripts. Ideal value = 0.
MEDIAN_5PRIME_BIAS	The median 5 prime bias of the 1000 most highly expressed transcripts, where 5 prime bias is calculated per transcript as: mean coverage of the 5' most 100 bases divided by the mean coverage of the whole transcript.
MEDIAN_3PRIME_BIAS	The median 3 prime bias of the 1000 most highly expressed transcripts, where 3 prime bias is calculated per transcript as: mean coverage of the 3' most 100 bases divided by the mean coverage of the whole transcript.
MEDIAN_5PRIME_TO_3PRIME_BIAS	The ratio of coverage at the 5' end to the 3' end based on the 1000 most highly expressed transcripts.
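
The per-transcript bias definitions can be sketched as below. The function names are hypothetical, and each transcript is assumed to be given as a list of per-base coverage values ordered from the 5' end to the 3' end; selecting the 1000 most highly expressed transcripts is left out.

```python
from statistics import mean, median


def five_prime_bias(coverage, window=100):
    """Mean coverage of the 5'-most `window` bases over mean coverage of the transcript."""
    return mean(coverage[:window]) / mean(coverage)


def three_prime_bias(coverage, window=100):
    """Mean coverage of the 3'-most `window` bases over mean coverage of the transcript."""
    return mean(coverage[-window:]) / mean(coverage)


def median_bias(transcripts, bias_fn):
    """Median of a per-transcript bias over a collection of transcripts."""
    return median([bias_fn(coverage) for coverage in transcripts])


# A made-up 500-base transcript whose coverage rises toward the 3' end.
transcript = [10] * 100 + [20] * 300 + [30] * 100
print(five_prime_bias(transcript), three_prime_bias(transcript))
```

A value of 1 means the end of the transcript is covered as well as the transcript as a whole; values below 1 at the 5' end together with values above 1 at the 3' end suggest a 3' bias.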

RrbsCpgDetailMetrics

Holds information about CpG sites encountered for RRBS processing QC

Field	Description
SEQUENCE_NAME	Sequence the CpG is seen in
POSITION	Position within the sequence of the CpG site
TOTAL_SITES	Number of times this CpG site was encountered
CONVERTED_SITES	Number of times this CpG site was converted (TG for + strand, CA for - strand)
PCT_CONVERTED	CONVERTED_SITES / TOTAL_SITES

RrbsSummaryMetrics

Holds summary statistics from RRBS processing QC

Field	Description
READS_ALIGNED	Number of mapped reads processed
NON_CPG_BASES	Number of times a non-CpG cytosine was encountered
NON_CPG_CONVERTED_BASES	Number of times a non-CpG cytosine was converted (C->T for +, G->A for -)
PCT_NON_CPG_BASES_CONVERTED	NON_CPG_CONVERTED_BASES / NON_CPG_BASES
CPG_BASES_SEEN	Number of CpG sites encountered
CPG_BASES_CONVERTED	Number of CpG sites that were converted (TG for +, CA for -)
PCT_CPG_BASES_CONVERTED	CPG_BASES_CONVERTED / CPG_BASES_SEEN
MEAN_CPG_COVERAGE	Mean coverage of CpG sites
MEDIAN_CPG_COVERAGE	Median coverage of CpG sites
READS_WITH_NO_CPG	Number of reads discarded for having no CpG sites
READS_IGNORED_SHORT	Number of reads discarded due to being too short
READS_IGNORED_MISMATCHES	Number of reads discarded for exceeding the mismatch threshold
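
A conversion rate is the converted count divided by the total count, so it always falls between 0 and 1. A minimal sketch with a hypothetical helper name and made-up counts:

```python
def conversion_fraction(converted: int, total: int) -> float:
    """Fraction of observed cytosines (CpG or non-CpG) that were converted."""
    if total == 0:
        return 0.0  # nothing observed; returning 0 is an assumption of this sketch
    return converted / total


# Made-up counts: non-CpG cytosines are expected to convert almost completely.
print(conversion_fraction(9950, 10000))  # non-CpG conversion
print(conversion_fraction(450, 1000))    # CpG conversion
```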

SamFileValidator.ValidationMetrics

Field	Description

SequencingArtifactMetrics.BaitBiasDetailMetrics

Bait bias artifacts broken down by context.

Field	Description
SAMPLE_ALIAS	The name of the sample being assayed.
LIBRARY	The name of the library being assayed.
REF_BASE	The (upper-case) original base on the reference strand.
ALT_BASE	The (upper-case) alternative base that is called as a result of DNA damage.
CONTEXT	The sequence context to which the analysis is constrained.
FWD_CXT_REF_BASES	The number of REF_BASE:REF_BASE alignments at sites with the given reference context.
FWD_CXT_ALT_BASES	The number of REF_BASE:ALT_BASE alignments at sites with the given reference context.
REV_CXT_REF_BASES	The number of ~REF_BASE:~REF_BASE alignments at sites complementary to the given reference context.
REV_CXT_ALT_BASES	The number of ~REF_BASE:~ALT_BASE alignments at sites complementary to the given reference context.
FWD_ERROR_RATE	The substitution rate of REF_BASE:ALT_BASE, calculated as max(1e-10, FWD_CXT_ALT_BASES / (FWD_CXT_ALT_BASES + FWD_CXT_REF_BASES)).
REV_ERROR_RATE	The substitution rate of ~REF_BASE:~ALT_BASE, calculated as max(1e-10, REV_CXT_ALT_BASES / (REV_CXT_ALT_BASES + REV_CXT_REF_BASES)).
ERROR_RATE	The bait bias error rate, calculated as max(1e-10, FWD_ERROR_RATE - REV_ERROR_RATE).
QSCORE	The Phred-scaled quality score of the artifact, calculated as -10 * log10(ERROR_RATE).
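
The error-rate and Q-score formulas in the table can be followed step by step in the sketch below. The function names are hypothetical, and the handling of a context with no observations is an assumption of this sketch.

```python
import math

MIN_ERROR_RATE = 1e-10  # floor used in the formulas above


def substitution_rate(ref_bases: int, alt_bases: int) -> float:
    """max(1e-10, ALT / (ALT + REF)) for one orientation of the context."""
    total = ref_bases + alt_bases
    if total == 0:
        return MIN_ERROR_RATE  # no observations: an assumption of this sketch
    return max(MIN_ERROR_RATE, alt_bases / total)


def bait_bias(fwd_ref: int, fwd_alt: int, rev_ref: int, rev_alt: int):
    """Return (ERROR_RATE, QSCORE) from forward and reverse context counts."""
    fwd_error_rate = substitution_rate(fwd_ref, fwd_alt)
    rev_error_rate = substitution_rate(rev_ref, rev_alt)
    error_rate = max(MIN_ERROR_RATE, fwd_error_rate - rev_error_rate)
    return error_rate, -10 * math.log10(error_rate)


# Made-up counts: 10 in 10,000 forward vs 1 in 10,000 reverse.
rate, q = bait_bias(9990, 10, 9999, 1)
print(f"ERROR_RATE={rate:.2e} QSCORE={q:.1f}")
```

Because of the 1e-10 floor, a context with no excess forward-strand substitutions gets the maximum Q-score of 100 instead of an undefined logarithm.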

SequencingArtifactMetrics.BaitBiasSummaryMetrics

Summary analysis of a single bait bias artifact, also known as a reference bias artifact. These artifacts occur during or after the target selection step, and correlate with substitution rates that are "biased", or higher for sites having one base on the reference/positive strand relative to sites having the complementary base on that strand. For example, a G>T artifact during the target selection step might result in a higher G>T / C>A substitution rate at sites with a G on the positive strand (and C on the negative), relative to sites with the flip (C positive / G negative). This is known as the "G-Ref" artifact.

Field	Description
SAMPLE_ALIAS	The name of the sample being assayed.
LIBRARY	The name of the library being assayed.
REF_BASE	The (upper-case) original base on the reference strand.
ALT_BASE	The (upper-case) alternative base that is called as a result of DNA damage.
TOTAL_QSCORE	The total Phred-scaled Q-score for this artifact. A lower Q-score means a higher probability that a REF_BASE:ALT_BASE observation randomly picked from the data will be due to this artifact, rather than a true variant.
WORST_CXT	The sequence context (reference bases surrounding the locus of interest) having the lowest Q-score among all contexts for this artifact.
WORST_CXT_QSCORE	The Q-score for the worst context.
WORST_PRE_CXT	The pre-context (reference bases leading up to the locus of interest) with the lowest Q-score.
WORST_PRE_CXT_QSCORE	The Q-score for the worst pre-context.
WORST_POST_CXT	The post-context (reference bases trailing after the locus of interest) with the lowest Q-score.
WORST_POST_CXT_QSCORE	The Q-score for the worst post-context.
ARTIFACT_NAME	A "nickname" of this artifact, if it is a known error mode.

SequencingArtifactMetrics.PreAdapterDetailMetrics

Pre-adapter artifacts broken down by context.

Field	Description
SAMPLE_ALIAS	The name of the sample being assayed.
LIBRARY	The name of the library being assayed.
REF_BASE	The (upper-case) original base on the reference strand.
ALT_BASE	The (upper-case) alternative base that is called as a result of DNA damage.
CONTEXT	The sequence context to which the analysis is constrained.
PRO_REF_BASES	The number of REF_BASE:REF_BASE alignments having a read number and orientation that supports the presence of this artifact.
PRO_ALT_BASES	The number of REF_BASE:ALT_BASE alignments having a read number and orientation that supports the presence of this artifact.
CON_REF_BASES	The number of REF_BASE:REF_BASE alignments having a read number and orientation that refutes the presence of this artifact.
CON_ALT_BASES	The number of REF_BASE:ALT_BASE alignments having a read number and orientation that refutes the presence of this artifact.
ERROR_RATE	The estimated error rate due to this artifact. Calculated as max(1e-10, (PRO_ALT_BASES - CON_ALT_BASES) / (PRO_ALT_BASES + PRO_REF_BASES + CON_ALT_BASES + CON_REF_BASES)).
QSCORE	The Phred-scaled quality score of the artifact, calculated as -10 * log10(ERROR_RATE).
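
The ERROR_RATE formula subtracts the observations that refute the artifact from those that support it, so signals without a read number / orientation bias cancel out. A minimal sketch with a hypothetical function name and made-up counts:

```python
import math


def pre_adapter_error(pro_ref: int, pro_alt: int, con_ref: int, con_alt: int):
    """Return (ERROR_RATE, QSCORE) from supporting (PRO) and refuting (CON) counts."""
    total = pro_alt + pro_ref + con_alt + con_ref
    if total == 0:
        error_rate = 1e-10  # no observations: an assumption of this sketch
    else:
        error_rate = max(1e-10, (pro_alt - con_alt) / total)
    return error_rate, -10 * math.log10(error_rate)


# 30 supporting vs 10 refuting ALT observations among 10,000 alignments.
rate, q = pre_adapter_error(pro_ref=4980, pro_alt=30, con_ref=4980, con_alt=10)
print(f"ERROR_RATE={rate:.2e} QSCORE={q:.1f}")
```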

SequencingArtifactMetrics.PreAdapterSummaryMetrics

Summary analysis of a single pre-adapter artifact. These artifacts occur on the original template strand, before the addition of adapters, so they correlate with read number / orientation in a specific way. For example, the well-known "Oxo-G" artifact occurs when a G on the template strand is oxidized, giving it an affinity for binding to A rather than the usual C. Thus PCR will introduce apparent G>T substitutions in read 1 and C>A in read 2. In the resulting alignments, a given G>T or C>A observation could be any of: (1) a true mutation, (2) an Oxo-G artifact, or (3) some other kind of artifact. On average, we assume that (1) and (3) will not display this read number / orientation bias, so their contributions will cancel out in the calculation.

Field	Description
SAMPLE_ALIAS	The name of the sample being assayed.
LIBRARY	The name of the library being assayed.
REF_BASE	The (upper-case) original base on the reference strand.
ALT_BASE	The (upper-case) alternative base that is called as a result of DNA damage.
TOTAL_QSCORE	The total Phred-scaled Q-score for this artifact. A lower Q-score means a higher probability that a REF_BASE:ALT_BASE observation randomly picked from the data will be due to this artifact, rather than a true variant.
WORST_CXT	The sequence context (reference bases surrounding the locus of interest) having the lowest Q-score among all contexts for this artifact.
WORST_CXT_QSCORE	The Q-score for the worst context.
WORST_PRE_CXT	The pre-context (reference bases leading up to the locus of interest) with the lowest Q-score.
WORST_PRE_CXT_QSCORE	The Q-score for the worst pre-context.
WORST_POST_CXT	The post-context (reference bases trailing after the locus of interest) with the lowest Q-score.
WORST_POST_CXT_QSCORE	The Q-score for the worst post-context.
ARTIFACT_NAME	A "nickname" of this artifact, if it is a known error mode.

TargetedPcrMetrics

Metrics class for targeted PCR runs, such as TSCA runs.

Field	Description
CUSTOM_AMPLICON_SET	The name of the amplicon set used in this metrics collection run
GENOME_SIZE	The number of bases in the reference genome used for alignment.
AMPLICON_TERRITORY	The number of unique bases covered by the intervals of all amplicons in the amplicon set
TARGET_TERRITORY	The number of unique bases covered by the intervals of all targets that should be covered
TOTAL_READS	The total number of reads in the SAM or BAM file examined.
PF_READS	The number of reads that pass the vendor's filter.
PF_BASES	The number of bases in the SAM or BAM file to be examined
PF_UNIQUE_READS	The number of PF reads that are not marked as duplicates.
PCT_PF_READS	PF reads / total reads. The percent of reads passing filter.
PCT_PF_UQ_READS	PF Unique Reads / Total Reads.
PF_UQ_READS_ALIGNED	The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.
PF_SELECTED_PAIRS	Tracks the number of read pairs that we see that are PF (used to calculate library size)
PF_SELECTED_UNIQUE_PAIRS	Tracks the number of unique PF read pairs we see (used to calculate library size)
PCT_PF_UQ_READS_ALIGNED	PF Reads Aligned / PF Reads.
PF_UQ_BASES_ALIGNED	The number of PF unique bases that are aligned with mapping score > 0 to the reference genome.
ON_AMPLICON_BASES	The number of PF aligned bases that mapped to an amplified region of the genome.
NEAR_AMPLICON_BASES	The number of PF aligned bases that mapped to within a fixed interval of an amplified region, but not on an amplified region.
OFF_AMPLICON_BASES	The number of PF aligned bases that mapped neither on nor near an amplicon.
ON_TARGET_BASES	The number of PF aligned bases that mapped to a targeted region of the genome.
ON_TARGET_FROM_PAIR_BASES	The number of PF aligned bases that are mapped in pair to a targeted region of the genome.
PCT_AMPLIFIED_BASES	On+Near Amplicon Bases / PF Bases Aligned.
PCT_OFF_AMPLICON	The percentage of aligned PF bases that mapped neither on nor near an amplicon.
ON_AMPLICON_VS_SELECTED	The percentage of on+near amplicon bases that are on as opposed to near.
MEAN_AMPLICON_COVERAGE	The mean coverage of all amplicons in the experiment.
MEAN_TARGET_COVERAGE	The mean coverage of targets that received at least coverage depth = 2 at one base.
FOLD_ENRICHMENT	The fold by which the amplicon region has been amplified above genomic background.
ZERO_CVG_TARGETS_PCT	The fraction of targets that did not reach coverage=2 over any base.
FOLD_80_BASE_PENALTY	The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.
PCT_TARGET_BASES_2X	The percentage of ALL target bases achieving 2X or greater coverage.
PCT_TARGET_BASES_10X	The percentage of ALL target bases achieving 10X or greater coverage.
PCT_TARGET_BASES_20X	The percentage of ALL target bases achieving 20X or greater coverage.
PCT_TARGET_BASES_30X	The percentage of ALL target bases achieving 30X or greater coverage.
AT_DROPOUT	A measure of how undercovered <= 50% GC regions are relative to the mean. For each GC bin [0..50] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. AT DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC<=50% regions mapped elsewhere.
GC_DROPOUT	A measure of how undercovered >= 50% GC regions are relative to the mean. For each GC bin [50..100] we calculate a = % of target territory, and b = % of aligned reads aligned to these targets. GC DROPOUT is then abs(sum(a-b when a-b < 0)). E.g. if the value is 5% this implies that 5% of total reads that should have mapped to GC>=50% regions mapped elsewhere.
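
The on/near/off amplicon fields in the table above relate to one another as simple ratios. The sketch below uses a hypothetical function name and assumes that the PF aligned bases are exactly the sum of the on-, near-, and off-amplicon bases; values are returned as fractions between 0 and 1.

```python
def amplicon_fractions(on_amplicon: int, near_amplicon: int, off_amplicon: int) -> dict:
    """Ratios between on-, near-, and off-amplicon base counts."""
    selected = on_amplicon + near_amplicon  # on + near
    aligned = selected + off_amplicon       # assumed total of PF aligned bases
    return {
        "PCT_AMPLIFIED_BASES": selected / aligned,
        "PCT_OFF_AMPLICON": off_amplicon / aligned,
        "ON_AMPLICON_VS_SELECTED": on_amplicon / selected,
    }


# Made-up base counts for illustration.
print(amplicon_fractions(on_amplicon=700, near_amplicon=200, off_amplicon=100))
```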