Roslin Bioinformatics - Genotype Checker

Handling Large Datasets

Depending on the nature of your experiment, a Pedigree may include thousands of Individuals, Genotyped for tens, hundreds or thousands of Markers. The prototype GenotypeChecker is able to parse large datasets (albeit slowly), but an interactive display of a complete dataset is not feasible: not only would it be memory intensive and unresponsive, but the user would not be able to browse very large data tables easily. The use of colour coding, resizable columns and an 'Overview' table assists usability, but for large datasets it is necessary to display only portions of the data at once. Two methods are available to the user:

   1. Segment the Marker data sequentially into chunks of, say, 20 or 50 Markers at a time, and analyse inheritance problems in each chunk independently.

   2. Pre-parse and inheritance-check all Marker Genotypes, and then view only those Markers with reported inheritance problems.
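
As a rough illustration of the first method, sequential segmentation might be sketched as below. This is not GenotypeChecker's own code: the function name `segment_markers`, the default chunk size and the marker names are all hypothetical, chosen only to show the idea of walking through the Marker list a fixed number at a time.

```python
from typing import Iterator, List, Sequence


def segment_markers(markers: Sequence[str], chunk_size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most chunk_size markers, preserving order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(markers), chunk_size):
        yield list(markers[start:start + chunk_size])


# 120 hypothetical marker names: M1 .. M120
markers = [f"M{i}" for i in range(1, 121)]
chunks = list(segment_markers(markers, chunk_size=50))
print([len(c) for c in chunks])  # prints [50, 50, 20]
```

The last chunk is simply shorter when the number of Markers is not a multiple of the chunk size.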

Smaller datasets, with fewer than 50 Markers, are probably best viewed in their entirety. As the data size increases, the user may wish to segment the data, looking at subsets of Markers independently. To find problematic Markers efficiently, it is advisable to pre-check the data. This generates a report on the whole dataset and stores the list of 'Bad' Markers, allowing the user to view only these data (which can again be segmented if there are a large number of Markers to display). Pre-checking the dataset can take several minutes or more, depending on the number of Genotypes (for example, 800000 Genotypes for a 1700 Individual Pedigree with 1400 Markers can take up to 10 minutes to pre-check on a 2 GHz processor), but is more efficient than attempting to analyse the data in successive 50 Marker chunks.
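
The pre-check idea can be sketched in a few lines. The example below is an assumption-laden illustration, not the check GenotypeChecker actually performs: it assumes a simple trio test (a child must carry one allele from its sire and one from its dam), skips any trio with a missing Genotype, and uses hypothetical names (`consistent`, `find_bad_markers`) and made-up data.

```python
from typing import Dict, List, Optional, Tuple

Genotype = Tuple[str, str]                      # two alleles, e.g. ("A", "B")
Parents = Tuple[Optional[str], Optional[str]]   # (sire, dam); None if unknown


def consistent(child: Genotype, sire: Genotype, dam: Genotype) -> bool:
    """True if the child could have one allele from the sire and one from the dam."""
    a, b = child
    return (a in sire and b in dam) or (b in sire and a in dam)


def find_bad_markers(pedigree: Dict[str, Parents],
                     genotypes: Dict[str, Dict[str, Genotype]]) -> List[str]:
    """Return markers with at least one inconsistent, fully genotyped trio."""
    bad = []
    for marker, calls in genotypes.items():
        for child, (sire, dam) in pedigree.items():
            # Assumption: trios with any missing genotype are skipped, not flagged.
            if child in calls and sire in calls and dam in calls:
                if not consistent(calls[child], calls[sire], calls[dam]):
                    bad.append(marker)
                    break  # one problem is enough to mark this marker as 'Bad'
    return bad


# Hypothetical three-individual pedigree and three markers
pedigree = {"sire1": (None, None), "dam1": (None, None), "kid1": ("sire1", "dam1")}
genotypes = {
    "M1": {"sire1": ("A", "A"), "dam1": ("B", "B"), "kid1": ("A", "B")},
    "M2": {"sire1": ("A", "A"), "dam1": ("A", "A"), "kid1": ("A", "B")},
    "M3": {"sire1": ("A", "B"), "kid1": ("B", "B")},  # dam not genotyped
}
bad_markers = find_bad_markers(pedigree, genotypes)
print(bad_markers)  # prints ['M2']
```

Only the stored list of 'Bad' Markers then needs to be displayed, and it can itself be viewed in smaller chunks if it is long.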

See Running GenotypeChecker for notes about memory usage.