The Challenges of Whole-Genome Approaches to Common Diseases
Jason H. Moore, PhD;
Marylyn D. Ritchie, PhD
Vanderbilt University Medical School, Nashville, Tenn
JAMA. 2004;291:1642-1643.
Recent technological advances may soon enable the study of hundreds of thousands of human single-nucleotide polymorphisms (SNPs) at the population level.1 Because strategies for analyzing these data have not kept pace with the laboratory methods that generate the data, however, it is unlikely that these advances will immediately lead to an improved understanding of the genetic contribution to common human diseases. In addition, the underlying genetics of common diseases such as sporadic breast cancer or essential hypertension is far more complex than that of rare mendelian diseases such as cystic fibrosis and sickle cell anemia. As a result, several important technical challenges will need to be overcome to identify susceptibility genes that can be used to improve the prevention, diagnosis, and treatment of common diseases. These challenges include developing statistical methods to analyze genetic data, selecting appropriate genetic variables, and interpreting interactions between individual genes.
Although specific DNA sequence variations have been linked to a variety of rare diseases, they have not been as informative for predicting the onset of more common conditions. This difference is illustrated when comparing familial (rare) and sporadic (common) forms of breast cancer. Women with a strong family history of breast cancer, for example, can be tested for specific mutations in the BRCA1 and BRCA2 genes, which result in a 50% chance of developing the disease.2 The risk for sporadic breast cancer, however, cannot adequately be predicted by DNA sequence variations alone. Similar to other common diseases, the underlying genetic etiology of sporadic breast cancer probably involves many genes, each of which influences susceptibility primarily through nonadditive interactions with other genes (termed "epistasis") and with environmental factors.3 It is possible that interactions between genes are ubiquitous in the underlying etiology of most common diseases, given the complex molecular interactions that occur during biological processes such as transcription, translation, and signal transduction.4 Knowledge about DNA sequence variations from many different genes, in the context of environmental exposure, may thus be necessary to apply genetic information to human health. This problem suggests several challenges in identifying susceptibility genes from the entire human genome.
First, powerful statistical and computational methods will need to be developed to model the relationship between combinations of SNPs and disease susceptibility. Due to the large number of potential genotypes, analyzing SNP combinations is a far more difficult task than assessing each SNP individually. The difficulty increases exponentially with the number of SNPs under consideration. For example, while a single SNP with 3 genotypes has only 3 categories, 2 SNPs with 3 genotypes will have 9 possible 2-locus genotype combinations. With 3 SNPs, the number of combinations increases to 27. As the number of possible combinations increases, it may become impossible to recruit enough subjects into epidemiological studies to represent every possible genotypic combination. This problem has been referred to as the "curse of dimensionality."5
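The growth in genotype combinations, and how thinly it spreads a fixed sample, can be illustrated with a short sketch. The study size of 1000 subjects and the function names are assumptions chosen for illustration only, and the calculation makes the idealized assumption that subjects are spread evenly across genotype combinations, which real allele frequencies would not allow.

```python
def genotype_combinations(n_snps: int, genotypes_per_snp: int = 3) -> int:
    """Number of multilocus genotype combinations for n_snps SNPs."""
    return genotypes_per_snp ** n_snps


def mean_subjects_per_combination(n_subjects: int, n_snps: int) -> float:
    """Average subjects per combination under an idealized even spread."""
    return n_subjects / genotype_combinations(n_snps)


N_SUBJECTS = 1000  # hypothetical study size, not taken from any real study

for k in range(1, 9):
    combos = genotype_combinations(k)
    per_cell = mean_subjects_per_combination(N_SUBJECTS, k)
    print(f"{k} SNPs: {combos:>5} combinations, "
          f"{per_cell:.2f} subjects per combination on average")
```

Once the average falls below 1 subject per combination, some genotype combinations must be unobserved regardless of how the subjects are actually distributed.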
This limitation may be partially addressed with statistical and modeling approaches. Traditional parametric statistical approaches (ie, methods that compute estimates of population parameters) such as logistic regression do not deal with the dimensionality problem very effectively and are thus not well suited to detecting and characterizing gene-gene interactions.6 This is due to the inaccuracy of parameter estimates when there are too many variables in relation to the amount of data. By contrast, nonparametric "data-mining" methods usually do not require a prespecified statistical hypothesis and thus are better suited to search for trends or patterns in high-dimensional data sets. Although data-mining techniques such as multifactor dimensionality reduction (MDR)7-10 and neural networks11-14 may be more powerful than parametric statistical approaches, they have their own limitations. Neural network models, for example, can be very difficult to interpret and their results may not be intuitive.15 Furthermore, data-mining approaches may be influenced by chance patterns in data, which can result in false-positive results.16
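As a rough illustration of how a data-mining method can reduce dimensionality, the sketch below pools multilocus genotype cells into "high-risk" and "low-risk" groups according to their ratio of cases to controls, which is the central idea generally described for MDR. It is a simplified sketch under stated assumptions, not the published MDR implementation: it omits cross-validation and permutation testing, the function name and threshold default are hypothetical, and the toy data are invented so that neither SNP has an effect on its own.

```python
from collections import Counter
from itertools import combinations


def mdr_classification_error(genotypes, status, snp_indices, threshold=1.0):
    """Pool genotype cells into high/low risk and return the training error.

    genotypes: list of per-subject genotype tuples (one entry per SNP)
    status: list of booleans, True for case and False for control
    snp_indices: which SNPs define the multilocus genotype cells
    threshold: case:control ratio at or above which a cell is high risk
    """
    cases, controls = Counter(), Counter()
    for row, is_case in zip(genotypes, status):
        cell = tuple(row[i] for i in snp_indices)
        (cases if is_case else controls)[cell] += 1

    errors = 0
    for cell in set(cases) | set(controls):
        n_case, n_ctrl = cases[cell], controls[cell]
        high_risk = n_case >= threshold * n_ctrl  # avoids division by zero
        # Controls in high-risk cells and cases in low-risk cells are errors.
        errors += n_ctrl if high_risk else n_case
    return errors / len(status)


# Invented toy data: disease status depends only on SNP 0 and SNP 1 jointly.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
status = [a != b for a, b, _ in genotypes]

best_pair = min(combinations(range(3), 2),
                key=lambda pair: mdr_classification_error(genotypes, status, pair))
print("single SNP 0 error:", mdr_classification_error(genotypes, status, (0,)))
print("best pair:", best_pair)
```

Because the error here is measured on the same data used to label the cells, it is optimistic; guarding against the chance patterns mentioned above would require held-out data or a permutation test.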
A second challenge is the selection of genetic variables that should be included for analysis. If complex interactions between genes explain most of the heritability of common diseases, then combinations of SNPs will need to be evaluated from a list of hundreds of thousands of candidates. When single, functional polymorphisms each have a statistically detectable independent effect, each polymorphism can be evaluated individually for an association with disease, followed by an analysis of gene-gene interactions that considers only those polymorphisms. This greatly reduces the number of potential combinations of variables that must be examined. When SNPs do not have independent effects, however, it is impossible for most current computer technologies to analyze the resulting astronomical number of possible combinations. For instance, if 300 000 SNPs have been measured at a density of 1 SNP every 10 kilobases (kb), and if 10 statistical evaluations can be computed each second, then evaluation of each individual SNP would require 30 000 seconds (ie, 8.3 hours) of computer time. Exhaustive evaluation of the approximately 4.5 × 10^10 pairwise combinations of SNPs would require roughly 143 years. Although it might be possible for a large supercomputer to complete these computations in a reasonable amount of time, an exhaustive search of all combinations of 3 or 4 SNPs would not be possible even if every computer in the world were simultaneously working on the problem.
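The arithmetic behind these estimates can be checked with a short calculation. The sketch assumes the figures given in the text (300 000 SNPs, 10 evaluations per second) and a year of 365.25 days; the function names are illustrative.

```python
import math

SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed length of a year


def exhaustive_search_seconds(n_snps: int, order: int,
                              evals_per_second: float = 10.0) -> float:
    """Seconds needed to evaluate every combination of `order` SNPs."""
    return math.comb(n_snps, order) / evals_per_second


N_SNPS = 300_000

single = exhaustive_search_seconds(N_SNPS, 1)
print(f"1 SNP at a time: {single / SECONDS_PER_HOUR:.1f} hours")

for order in (2, 3, 4):
    years = exhaustive_search_seconds(N_SNPS, order) / SECONDS_PER_YEAR
    print(f"{order}-SNP combinations: {math.comb(N_SNPS, order):.2e} "
          f"evaluations, {years:.3g} years")
```

Each additional SNP in the combination multiplies the workload by a factor on the order of the number of SNPs, which is why faster hardware alone does not rescue the exhaustive search.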
One alternative approach would be to prioritize polymorphism evaluation based on knowledge of biological function. Estrogen metabolism genes, for example, could be given higher priority in analyses of sporadic breast cancer because estradiol is a known risk factor. This approach would likely become more effective as each gene becomes better annotated with functional information. There are currently no gene-gene interaction models that incorporate this type of information.
A third challenge is the interpretation of gene-gene interaction models. Although a statistical model can be used to identify genetic variants that confer risk for disease, this approach cannot be translated into specific prevention and treatment strategies without interpreting the results in the context of human biology. For example, a model of 4 SNPs (each with 3 genotypes) would have 81 possible genotype combinations. Most biochemical analyses, however, cannot evaluate more than 2 factors at once.17 While only a few experiments may be necessary to evaluate the effect of a single polymorphism on enzyme activity in a pathway, at least 81 experiments would be needed to evaluate the effects of 4 polymorphisms with 81 four-locus genotypes. This may be prohibitively time-consuming and expensive. Biological interpretation of a 4-locus model, for example, might require creating and characterizing 81 different transgenic mouse lines, instead of the 3 lines (one for each genotype) needed for a single locus.
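The size of the experimental design implied by a multilocus model can be seen by enumerating every genotype combination. In this sketch the genotype labels (AA, Aa, aa) and the function name are hypothetical placeholders; each enumerated tuple stands for one experimental condition that would need to be created and characterized.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")  # placeholder labels for the 3 genotypes at a SNP


def multilocus_genotypes(n_loci: int) -> list:
    """Every combination of single-locus genotypes across n_loci SNPs."""
    return list(product(GENOTYPES, repeat=n_loci))


for n_loci in (1, 2, 4):
    conditions = multilocus_genotypes(n_loci)
    print(f"{n_loci} loci: {len(conditions)} conditions, "
          f"first = {conditions[0]}, last = {conditions[-1]}")
```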
There are 3 general strategies to attribute biological meaning to gene-gene interaction models. The first is to assess the biological plausibility of the statistical model in light of current knowledge about the underlying biochemistry of the system. Such interpretations, of course, may be influenced by a current lack of knowledge about certain biological processes. A second strategy involves perturbing an experimental system in an effort to reproduce the results of the genetic model.18 It is not uncommon, however, for the results of studies of animal models to disagree with those obtained from human studies.19 A third strategy involves computer simulations to generate models of hypothetical biochemical systems that are consistent with a given gene-gene interaction model.20
The ultimate utility of whole-genome association studies for improving the prevention, diagnosis, and treatment of common human diseases will depend largely on the development of innovative strategies for overcoming the modeling, variable selection, and interpretation challenges outlined above. Although MDR and neural network strategies represent a first step in the development of methods for detecting gene-gene interactions, it is unlikely that any single approach will be uniformly most powerful. Instead, an optimal modeling strategy will most likely involve the search for consistent trends across several different methods. The next step would then be to address challenges relating to variable selection and biological interpretation. Such research will likely involve the multidisciplinary expertise of statisticians, computer scientists, engineers, geneticists, biochemists, and clinicians.
Funding/Support: This work was supported by National Institutes of Health grants HL65234, HL65962, GM31304, AG19085, AG20135, and LM007450.
REFERENCES
1. The International HapMap Consortium. The International HapMap Project. Nature. 2003;426:789-796.
2. Easton DF, Ford D, Bishop DT. Breast and ovarian cancer incidence in BRCA-1 mutation carriers. Am J Hum Genet. 1995;56:265-271.
3. Sing CF, Stengard JH, Kardia SL. Genes, environment, and cardiovascular disease. Arterioscler Thromb Vasc Biol. 2003;23:1190-1196.
4. Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56:73-82.
5. Bellman RE. Adaptive Control Processes. Princeton, NJ: Princeton University Press; 1961.
6. Moore JH, Williams SM. New strategies for identifying gene-gene interactions in hypertension. Ann Med. 2002;34:88-95.
7. Ritchie MD, Hahn LW, Roodi N, et al. Multifactor dimensionality reduction reveals high-order interactions among estrogen metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69:138-147.
8. Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003;24:150-157.
9. Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003;19:376-382.
10. Hahn LW, Moore JH. Ideal discrimination of discrete clinical endpoints using multilocus genotypes. In Silico Biol. 2004;4:16.
11. Lucek PR, Ott J. Neural network analysis of complex traits. Genet Epidemiol. 1997;14:1101-1106.
12. Marinov M, Weeks DE. The complexity of linkage analysis with neural networks. Hum Hered. 2001;51:169-176.
13. North BV, Curtis D, Cassell PG, Hitman GA, Sham PC. Assessing optimal neural network architecture for identifying disease-associated multi-marker genotypes using a permutation test, and application to calpain 10 polymorphisms associated with diabetes. Ann Hum Genet. 2003;67:348-356.
14. Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH. Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics. 2003;4:28.
15. Wu CH, McLarty JW. Neural Networks and Genome Informatics. New York, NY: Elsevier Science; 2000.
16. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning. New York, NY: Springer Verlag; 2001.
17. Strohman R. Maneuvering in the complex path from genotype to phenotype. Science. 2002;296:701-703.
18. Jansen RC. Studying complex biological systems using multifactorial perturbation. Nat Rev Genet. 2003;4:145-151.
19. Williams SM, Haines JL, Moore JH. The use of animal models in the study of complex disease: all else is never equal or why do so many human studies fail to replicate animal findings? Bioessays. 2004;26:170-179.
20. Moore JH, Hahn LW. Petri net modeling of high-order genetic systems using grammatical evolution. Biosystems. 2003;72:177-186.