Short method summary

Disease- and trait-associated variants represent a tiny minority of all known genetic variation, so there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation. This is especially critical in non-coding regulatory regions of the human genome. Machine learning (ML) methods for predicting disease-associated non-coding variants face a chicken-and-egg problem: such variants cannot easily be found without ML, but ML cannot be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific techniques for dealing with imbalanced data, which significantly reduces the sensitivity and precision of the learned models. The Regulatory Mendelian Mutation (ReMM) score aims to close this gap by adopting imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach that outperforms state-of-the-art methods for predicting non-coding variants associated with Mendelian disorders.

Details

Dataset

ReMM scores are based on an imbalance-aware machine learning algorithm, hyperSMURF, trained on known pathogenic non-coding variants of Mendelian disorders and a set of putatively benign variants (Schubach et al., 2017). As the pathogenic set, we use 406 hand-curated variants. The proxy-benign set includes around 13.8 million human-lineage-derived sequence alterations (Rentzsch et al., 2019), filtered to non-coding sequence using Jannovar and RefSeq. These changes can be assumed to have undergone many generations of purifying selection and may therefore be used as a good proxy for benign variants.

Sampling

We apply a special sampling technique that is essential for the highly imbalanced data of human pathogenic variants. The minority (positive) class is oversampled with SMOTE, the Synthetic Minority Over-sampling Technique, which creates synthetic examples using k-nearest neighbors rather than oversampling the data with replacement. The majority (negative) class is divided into 100 non-overlapping partitions, which are then subsampled so that the ratio between positive and negative examples in the training set is 2:3.
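The sampling scheme described above can be illustrated with a minimal pure-Python sketch. It is not the hyperSMURF implementation: the function names, the oversampling factor of 2, the neighbor count k=5, and the random partitioning are assumptions chosen for illustration. Only the 100 partitions and the 2:3 positive-to-negative ratio come from the text. The sketch assumes at least two positive examples.

```python
import math
import random


def smote(minority, n_synthetic, k=5, rng=None):
    """Create synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbors (no copying with replacement)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position on the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def partition(majority, n_parts, rng):
    """Shuffle the majority class and split it into non-overlapping partitions."""
    shuffled = majority[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def build_training_sets(positives, negatives, n_parts=100,
                        oversample_factor=2, neg_per_pos=1.5, k=5, seed=0):
    """Return one (positives, negatives) training set per partition.

    oversample_factor is a hypothetical setting; neg_per_pos=1.5 encodes
    the 2:3 positive-to-negative ratio.
    """
    rng = random.Random(seed)
    training_sets = []
    for part in partition(negatives, n_parts, rng):
        synth = smote(positives, oversample_factor * len(positives), k, rng)
        pos = positives + synth
        n_neg = min(len(part), round(neg_per_pos * len(pos)))
        neg = rng.sample(part, n_neg)  # subsample within this partition only
        training_sets.append((pos, neg))
    return training_sets
```

Each of the resulting training sets would feed one base learner of the ensemble; because the negatives of different sets come from different partitions, no negative example is shared between learners.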

Cytogenetic band-aware cross-validation

Besides being highly imbalanced, genome variant data has another pitfall: DNA variants are not distributed evenly over the genome. Some regions of DNA exhibit almost no variation, while others contain entire clusters of variants. Due to various ascertainment effects for pathogenic variants, this is even more pronounced for the positive set. Cross-validation results would be distorted if positive variants that lie close to each other and have similar features fell into different folds, so hyperSMURF places them in the same fold. The folds are created according to cytobands: chromosomal bands with at least one positive data point are assigned to one of the ten folds such that the folds contain a similar number of positive mutations. Negative variants are then put into the folds of their associated bands. Since they are genomically proximal to the positive variants of the same cytoband, it is more challenging for the learner to discriminate between the two groups. Training on nine folds and validating on the tenth then gives a more accurate and less biased assessment of the learner.
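The fold construction can be sketched as follows. This is an illustration under stated assumptions, not hyperSMURF's actual procedure: the function names are hypothetical, and the greedy balancing rule (largest band goes to the currently lightest fold) is one simple way to obtain folds with similar numbers of positives. The text does not say how negatives in bands without any positive variant are handled, so the sketch leaves them unassigned.

```python
from collections import defaultdict


def assign_cytoband_folds(positive_bands, n_folds=10):
    """Map every cytoband holding at least one positive variant to a fold.

    positive_bands: one cytoband label per positive variant, e.g. "1p36.33".
    All positives of a band share a fold, so nearby variants never end up
    on both the training and the validation side.
    """
    counts = defaultdict(int)
    for band in positive_bands:
        counts[band] += 1

    fold_load = [0] * n_folds  # number of positives per fold so far
    band_to_fold = {}
    # Assumed greedy rule: place bands with the most positives first,
    # each into the fold that currently has the fewest positives.
    for band, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = min(range(n_folds), key=lambda f: fold_load[f])
        band_to_fold[band] = fold
        fold_load[fold] += n
    return band_to_fold


def fold_of(variant_band, band_to_fold):
    """Negatives follow the fold of their band; None if the band has no positives."""
    return band_to_fold.get(variant_band)
```

For example, with positives in bands a (5 variants), b (3), c (3) and d (1) and two folds, bands a and d share one fold and bands b and c the other, giving six positives per fold.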

How to cite ReMM?

Smedley D, Schubach M, Jacobsen JOB, Köhler S, Zemojtel T, Spielmann M, Jäger M, Hochheiser H, Washington NL, McMurry JA, Haendel MA, Mungall CJ, Lewis SE, Groza T, Valentini G, Robinson PN.
A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease.
Am J Hum Genet. 2016 Sep 1;99(3):595-606. doi: 10.1016/j.ajhg.2016.07.005. PMID: 27569544