Roslin Bioinformatics - VIPER

Handling Large Datasets

The ResSpecies algorithm has been optimized for fast inheritance checking and inference as well as a small memory footprint. Nevertheless, visualisation of extremely large datasets will become unresponsive, largely because of the time taken by the ResSpecies algorithm rather than by metric calculation or GUI repainting.

The recalculation of error rates when applying data masks and filters is still acceptably responsive on low-spec machines with up to 20 million genotypes (10 000 markers × 2 000 individuals takes 2-4 s on 32-bit Windows). For datasets above this size the user should consider using more RAM on a 64-bit machine (see Running Viper), or the genotype data could be analysed in separate segments, each containing data for a discrete set of markers. This is appropriate because the ResSpecies algorithm is applied separately to each individual marker.
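
As a rough illustration of the segmentation idea, the following Python sketch splits a list of marker IDs into consecutive segments so that each segment stays within a genotype budget. It is not part of VIPER: the function name segment_markers, the marker names and the example dataset size are hypothetical, and the default budget of 20 million genotypes is simply the figure quoted above for low-spec machines.

```python
def segment_markers(markers, n_individuals, max_genotypes=20_000_000):
    """Split a list of marker IDs into consecutive segments.

    Each segment holds at most max_genotypes // n_individuals markers, so
    (markers in segment) * n_individuals never exceeds the genotype budget.
    The default budget is an assumption taken from the text, not a VIPER limit.
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    per_segment = max_genotypes // n_individuals
    if per_segment == 0:
        raise ValueError("budget is too small to hold even one marker")
    return [markers[i:i + per_segment]
            for i in range(0, len(markers), per_segment)]


# Hypothetical dataset: 25 000 markers typed on 2 000 individuals
# (50 million genotypes in total).
markers = [f"M{i}" for i in range(25_000)]
segments = segment_markers(markers, n_individuals=2_000)
print([len(s) for s in segments])  # prints [10000, 10000, 5000]
```

Splitting by marker rather than by individual keeps every individual's genotypes for a given marker in the same segment, which matters because each marker is checked on its own.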

The initial parsing of the complete genotype data file can be slow once more than a few million records are input. This issue might be addressed by preprocessing large datasets to extract only the problematic markers for analysis, and perhaps by automating the filtration of unreliable markers and the conversion of previously unrecognized sex-linked markers. We are interested in providing greater support for the preprocessing and segmentation of large datasets (see the Future of VIPER).
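
One way such preprocessing might look is sketched below in Python. It streams a genotype file one record at a time and keeps only the records for a chosen set of markers, so the full file never has to be held in memory. Everything about the file layout is an assumption made for illustration (tab-delimited, one genotype per line, marker ID in the second column); VIPER's actual input format may differ, and the function name extract_markers is hypothetical.

```python
import csv
import io


def extract_markers(src, dst, wanted, marker_column=1, delimiter="\t"):
    """Copy only the records whose marker ID is in `wanted`.

    Records are read and written one at a time. Returns the number of
    records kept. The column position and delimiter are assumptions.
    """
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    kept = 0
    for row in reader:
        if len(row) > marker_column and row[marker_column] in wanted:
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical records: individual, marker, allele 1, allele 2.
source = io.StringIO(
    "ind1\tM1\tA\tA\n"
    "ind1\tM2\tA\tB\n"
    "ind2\tM1\tA\tB\n"
    "ind2\tM3\tB\tB\n"
)
subset = io.StringIO()
kept = extract_markers(source, subset, wanted={"M1", "M3"})
print(kept)  # prints 3
print(subset.getvalue(), end="")
```

With real files, the two StringIO objects would be replaced by files opened with open(path, newline=""), as the csv module documentation recommends.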