Roslin Bioinformatics - VIPER

Using VIPER: suggested workflow

  1. Start application (see Running VIPER)

  2. Load pedigree file (see Loading Data)
    • Displays pedigree structure in sandwich view
    • Fails if invalid data (e.g. pedigree loops, sex contradictions): lists fatal errors which may be saved to file

  3. Load genotype file (see Loading Data)
    • Parses markers and alleles, performs inheritance algorithm to infer missing genotypes and report inheritance inconsistencies (populates Pedigree view, Marker and Individual Tables)
    • Fails if invalid data (e.g. duplicated entries): lists fatal errors which may be saved to file
    • Warns of any markers ignored

  4. Set GUI options (see Options)

  5. Explore Data (see the VIPER Interface)

  6. Mask suspect data (genotypes, individuals or markers) and recheck the inheritance (see Inheritance Checking). This is an iterative process; a typical workflow might involve:

    1. Identify markers exhibiting unrecognized sex-linkage.
    2. Identify unreliable marker assays.
    3. Identify inconsistent pedigree relationships.
    4. Investigate/remove sporadic genotype errors.

  7. Export final clean (masked) dataset and log file (see Saving Data)

Identify markers exhibiting unrecognized sex-linkage

The first step of the data cleaning analysis is to identify markers with systematically high error rates. Any unrecognized sex-linked markers among these may report a very high rate of inheritance problems, in a distinctive and grossly systematic error pattern.

In mammalian pedigrees this is caused by males which have been mis-genotyped as homozygous instead of hemizygous/Y-null. This typically manifests with many male offspring reporting 'nil-from-sire' inheritance inconsistencies. This pattern is easy to distinguish in the 'Marker Table' by sorting the markers on the 'Sire' error count, and identifying those with a high 'Sire':'Dam' error ratio. When such a marker is selected for focus view, most of the error 'triangles' will point from male offspring to sire (Figure 1).
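The same sort-and-compare step can be sketched outside the GUI. The following is a minimal illustration only, assuming per-marker 'Sire' and 'Dam' error counts have been copied out of the 'Marker Table' by hand; the `MarkerErrors` class, the function names and both thresholds are hypothetical and are not part of VIPER.

```python
from dataclasses import dataclass


@dataclass
class MarkerErrors:
    name: str
    sire_errors: int  # count of 'nil-from-sire' inconsistencies for this marker
    dam_errors: int   # count of 'nil-from-dam' inconsistencies for this marker


def sire_dam_ratio(m: MarkerErrors) -> float:
    # Add 1 to the denominator so a marker with no dam errors does not divide by zero.
    return m.sire_errors / (m.dam_errors + 1)


def flag_sex_linked_candidates(markers, min_sire_errors=10, min_ratio=5.0):
    """Sort by sire error count (descending) and keep markers with a high Sire:Dam ratio.

    Both thresholds are arbitrary illustrative values, not VIPER defaults.
    """
    ranked = sorted(markers, key=lambda m: m.sire_errors, reverse=True)
    return [
        m for m in ranked
        if m.sire_errors >= min_sire_errors and sire_dam_ratio(m) >= min_ratio
    ]


# Made-up counts for four markers.
markers = [
    MarkerErrors("M1", 42, 1),
    MarkerErrors("M2", 3, 2),
    MarkerErrors("M3", 15, 14),
    MarkerErrors("M4", 30, 0),
]
candidates = flag_sex_linked_candidates(markers)
candidate_names = [m.name for m in candidates]
print(candidate_names)
```

A marker such as M3, with many errors split evenly between sire and dam, is deliberately not flagged here; it is more likely to be an unreliable assay than a sex-linked marker.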

Figure 1 shows a typical error pattern for an unrecognized sex-linked marker. If this pattern were limited to a single family, for a single marker, the sire would probably have been mistyped for this marker; if the pattern occurred in a single family for many markers, the paternity relationship would be suspect; but this pattern occurring in many families for a given marker indicates a sex-linkage problem.

Markers with apparent sex-linkage issues should be removed from the analysis by masking (using the check-box selection in the 'Marker Table'), but their genotype data can typically be rescued in the data file by converting all the male genotypes to hemizygous.
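As a rough sketch of that rescue step, the conversion could be scripted before the data file is reloaded. The genotype notation ("G/G"), the sex codes ("M"/"F") and the function name are assumptions made for illustration; the 'Y-null' label is taken from the Figure 1 legend, and the actual VIPER file format should be checked before adapting this.

```python
def to_hemizygous(genotype: str, sex: str, null_allele: str = "Y-null") -> str:
    """Rewrite a homozygous male call as hemizygous, e.g. 'G/G' -> 'G/Y-null'.

    Assumes genotypes are written as 'allele/allele' and '?' marks an unknown allele.
    Female calls, heterozygous calls and unknown calls are returned unchanged.
    """
    if sex != "M":
        return genotype
    alleles = genotype.split("/")
    if len(alleles) == 2 and alleles[0] == alleles[1] and alleles[0] != "?":
        return f"{alleles[0]}/{null_allele}"
    return genotype


# Hypothetical records for one sex-linked marker: (individual, sex, genotype).
records = [("ind1", "M", "G/G"), ("ind2", "F", "G/G"), ("ind3", "M", "A/G")]
converted = [(ind, sex, to_hemizygous(gt, sex)) for ind, sex, gt in records]
print(converted)
```

Heterozygous male calls are left untouched on purpose: for a truly sex-linked marker they are suspect in their own right and are better reviewed by hand than rewritten automatically.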

Other sex determination systems cause similar problems. In birds, where females are the heterogametic sex, the error pattern is reversed and W-null 'alleles' must be introduced instead.

Figure 1. Identification of a sex-linked marker.

Single marker focus view on a marker selected in the Marker Table as exhibiting a high nil-from-sire:nil-from-dam error ratio. A Detail View window shows one selected family (in yellow) with actual and inferred genotypes (the latter with a blue border) for an unrecognized sex-linked marker. Note that the incomplete genotypes have only been partially inferred (?/G), as inference will not create inheritance inconsistencies. This pattern of inconsistency in a large number of separate families indicates that the homozygote males should in fact be genotyped as G/Y-null or A/Y-null. The incomplete genotypes could then be inferred correctly and completely.

Identify unreliable marker assays

Once potential sex-linkage issues have been identified, any remaining markers with high error rates are generally the result of unreliable marker assays and should be discarded. Again, these markers can be identified by sorting the 'Marker Table', but the 'Master Marker Errorgram' control can also be used to filter out markers whose error reporting exceeds an acceptable threshold. The coordinated 'Filtered Marker Errorgram' can be used to restrict the window of markers selected for the aggregate pedigree view, to perform analyses on markers with higher than background error rates (see more detail).
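The thresholding idea can be illustrated with a short script. This is a sketch under stated assumptions: the error counts and numbers of genotyped individuals are made up, the function name is hypothetical, and the 2% cut-off is an arbitrary example rather than a recommended or default value.

```python
def split_by_error_rate(error_counts, n_genotyped, max_rate=0.02):
    """Split markers into (keep, discard) by error rate = errors / genotyped individuals.

    max_rate is an illustrative threshold; choose it relative to the background
    error rate of the dataset at hand.
    """
    keep, discard = [], []
    for marker, errors in error_counts.items():
        total = n_genotyped[marker]
        rate = errors / total if total else 0.0
        (discard if rate > max_rate else keep).append(marker)
    return keep, discard


# Made-up counts for three markers.
error_counts = {"M1": 1, "M2": 40, "M3": 0}
n_genotyped = {"M1": 500, "M2": 500, "M3": 480}
keep, discard = split_by_error_rate(error_counts, n_genotyped)
print(keep, discard)
```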

Identify inconsistent pedigree relationships

Individuals with very high reported error rates can be sorted and highlighted using the 'Individual Table', or can be emphasized in the pedigree display by using the 'Filtered Individual Errorgram' to alter the colour thresholds for error reporting (see more detail).

Several types of data error can lead to the concentration of errors reported on a particular individual: erroneous pedigree information (e.g. wrongly asserted sires), sample misidentification or contamination, poor sample quality, data record errors etc. At this stage it may be appropriate for the user to check whether there is any other information regarding problems with these samples. There are two approaches to cleaning such inconsistencies: either identifying and removing any suspect pedigree relationships, or removing all genotype data for this individual from the dataset (masking all marker genotypes for the individual but leaving the individual in place in the pedigree).

For example, an offspring (i.e. sample) reporting many 'nil-from-sire' inconsistencies across multiple markers may have a wrongly assigned sire. This hypothesis can be tested by using the right-mouse-click context menu to (reversibly) mask this paternity relationship, then recalculating errors and checking whether the inheritance pattern has improved. Alternatively, the pedigree information may be correct, but all the genotype data for that individual sample may be suspect (due to sample mix-up etc.). In this case, orphaning the individual, even by removing both parent links, will not remove inheritance inconsistencies with children of the individual, so the best approach is to remove all genotype data for this individual (again using the right-mouse-click context menu).
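The mask-and-recheck test of a suspect sire can be illustrated with a deliberately simplified trio check. This is not VIPER's inheritance algorithm: it performs no genotype inference, only reports 'nil-from-sire' and 'nil-from-dam' (an offspring sharing no allele with a known parent), and all names and data below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Genotype = Tuple[str, str]


def check_trio(child: Genotype, sire: Optional[Genotype], dam: Optional[Genotype]) -> List[str]:
    """Return inconsistency labels for one marker.

    A parent passed as None is treated as unknown (e.g. a masked link) and is not checked.
    Simplification: other inconsistency types, such as novel alleles, are not detected.
    """
    errors = []
    if sire is not None and not set(child) & set(sire):
        errors.append("nil-from-sire")
    if dam is not None and not set(child) & set(dam):
        errors.append("nil-from-dam")
    return errors


def count_errors(child_gts: Dict[str, Genotype], sire_gts: Dict[str, Genotype],
                 dam_gts: Dict[str, Genotype], mask_sire: bool = False) -> Dict[str, int]:
    """Count inconsistencies across all markers, optionally with the paternity link masked."""
    counts = {"nil-from-sire": 0, "nil-from-dam": 0}
    for marker, child in child_gts.items():
        sire = None if mask_sire else sire_gts.get(marker)
        dam = dam_gts.get(marker)
        for label in check_trio(child, sire, dam):
            counts[label] += 1
    return counts


# Made-up genotypes for one trio at three markers.
child = {"M1": ("A", "A"), "M2": ("C", "C"), "M3": ("G", "T")}
sire = {"M1": ("G", "G"), "M2": ("T", "T"), "M3": ("G", "G")}
dam = {"M1": ("A", "G"), "M2": ("C", "T"), "M3": ("T", "T")}

before = count_errors(child, sire, dam)
after = count_errors(child, sire, dam, mask_sire=True)
print(before, after)
```

If masking the paternity link clears the errors while the maternal checks stay clean, as in this toy trio, a wrongly assigned sire is the more plausible explanation; if errors persist, the genotype data for the individual are the more likely problem.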

Individuals with masked genotypes are 'hatched' blue by default (i.e. the preferred colour for 'unknown' genotypes), whilst broken pedigree links are indicated by colouring the triangular 'from-dam' or 'from-sire' glyphs blue. Orphaning an offspring from both parents will remove it from one generation sandwich; where it appears as a parent lower down, a broken 'chain-link' icon indicates that parentage has been masked.

The example shown in Figure 2 demonstrates how VIPER can be used to identify wrongly assigned parentage data in simple sire/dam/offspring data. In one highlighted family the offspring can be cleaned by removing the maternity link, because all errors are attributable to a wrongly assigned dam; in the other family, although the majority of errors come from the sire, removing this link does not clean the dataset and this trio must be discarded (completely masked).

Figure 2. Masking Pedigree Relationships.

Real data analysed for 946 markers for a set of sire/dam/offspring trios. A. Families are ordered by error rate, and the Individual Errorgram adjusted to highlight 2 individuals with exceptionally high reported errors (562_004 has 19% nil from dam, 1% novel alleles; 562_070 has 2% nil from dam, 9% novel alleles, 7% nil from sire). B. Maternity and paternity maskings are selected. C. After error recalculation one offspring has been completely cleaned by removing the erroneous maternity assertion, but the other still reports 2% nil from dam errors and must be discarded.

Investigate/remove sporadic genotype errors

After all systematic error patterns in the data have been resolved and cleaned, typically a low level of sporadic error remains distributed across many markers. Rigorous consideration of these errors involves focusing on each marker in turn, and confirming that the errors reflect sporadic mistypings rather than a systematic error. Identifying the causal error can be confounded by propagation of the error reporting by the algorithm, particularly where genotype information is missing from some individuals.

Masking suspect genotypes (again via the right-mouse-click context menu) allows prospective bad data points to be verified. The 'Mask Remaining Errors' button provides a shortcut for iterative manual masking, but it simply masks wherever errors are reported rather than attempting to identify the underlying bad data point. Consequently, it is typically necessary to repeat this action to obtain a totally clean dataset.
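The repeat-until-clean behaviour can be pictured as a simple loop. The sketch below is illustrative only: `recheck` stands in for the inheritance check, `toy_recheck` is a stub that uses a lookup table to simulate errors surfacing only after an earlier round has been masked, and none of these names correspond to VIPER functions.

```python
def mask_until_clean(recheck, max_rounds=20):
    """Mask every reported error and recheck, until no errors are reported.

    recheck(masked) must return the set of data points still reporting errors.
    Returns (masked data points, number of masking rounds performed).
    """
    masked = set()
    rounds = 0
    while True:
        errors = recheck(masked)
        if not errors:
            return masked, rounds
        if rounds >= max_rounds:
            raise RuntimeError("dataset still reports errors after max_rounds")
        masked |= errors
        rounds += 1


# Stub checker with made-up 'individual:marker' labels. The second batch of
# errors is only reported once the first batch has been masked.
BATCHES = [{"ind3:M7", "ind9:M2"}, {"ind4:M7"}]


def toy_recheck(masked):
    for batch in BATCHES:
        remaining = batch - masked
        if remaining:
            return remaining
    return set()


masked, rounds = mask_until_clean(toy_recheck)
print(sorted(masked), rounds)
```

Because this kind of loop masks wherever an error is reported, it can remove more data than strictly necessary; inspecting each marker before masking remains the more rigorous route.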