NCI Releases New Human Gene Variation Resource

National Cancer Institute
Tuesday, 8 June 1999

One of the major challenges in cancer research is to understand the molecular differences that control the growth of tumors. Scientists believe that one source of this detailed information is the slight variations in our genes that make each person genetically unique and that can influence how fast a cancer grows and how well it responds to treatment.

However, finding these gene variations, known as SNPs, or single nucleotide polymorphisms, has been technically challenging. Current strategies to locate variations do not discriminate between changes in the so-called junk DNA and those in genes, meaning many known SNPs have minimal biological importance.

Today, a team of scientists from the National Cancer Institute (NCI) reported that they have taken a step forward in solving the problem. NCI scientists announced they have discovered a total of 10,435 possible new variations in human genes, using a special computer software package designed to "mine" hidden gems of information stored in gene databases.

According to Kenneth Buetow, Ph.D., co-presenter of the data and a member of NCI's Genetic Annotation Initiative (GAI), the gene variations still must be validated. However, each of the candidate SNPs met a statistical confidence level of .99, meaning there is a 1 percent chance or less of error in identifying it as a SNP.

"These variations represent the cleanest, most reliable of our data," said Buetow of his group's study, which involved an analysis of nearly 22,000 genes and required three specialized computer software programs to sort out the sequence.

Buetow said NCI has made information on all of the candidate SNPs available to researchers free of charge on the Cancer Genome Anatomy Project (CGAP) web page. In addition, the web page provides scientists with free access to the special software package, allowing them to look for SNPs in sequence generated from their own studies.

"What is nice about this approach is that mining existing data is relatively low cost and extremely high yield," said Richard Klausner, M.D., NCI director. "Ken's group has created a tremendous tool for the entire scientific community to identify genetic variations in the human genome, one of the great challenges now facing science."

The release of today's data is linked to Buetow's efforts with the Genetic Annotation Initiative, part of CGAP. Launched last year, the GAI aims to compile a comprehensive catalogue of SNPs that occur in genes involved in cancer. As part of the initiative, the GAI is also exploring the feasibility of applying various technologies and approaches to locating SNPs. "Right now, there is no so-called best way to identify and characterize SNPs," said Buetow, noting that each person has about 1 million SNPs, which occur once about every 1,000 to 2,000 bases, or units, of DNA.

This led Buetow and colleagues last year to consider developing the data-mining strategy that resulted in today's findings. The strategy builds on the idea that the majority of sequence information in public databases is redundant. For example, according to the National Center for Biotechnology Information, there currently are banked transcripts from a total of 70,871 human genes, but there are over 1.1 million gene transcripts in all logged into the databases, with over half of these gene transcripts originating from CGAP's Human Tumor Gene Index initiative.
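The degree of redundancy can be estimated from the two figures quoted above. The short calculation below is only a back-of-the-envelope sketch: it treats "over 1.1 million" as exactly 1.1 million, so the result is a lower bound, and the variable names are illustrative.

```python
# Figures quoted in the text; "over 1.1 million" is treated as a lower bound.
total_transcripts = 1_100_000
distinct_genes = 70_871

# Average number of banked transcripts per gene.
mean_copies_per_gene = total_transcripts / distinct_genes
print(f"{mean_copies_per_gene:.1f} transcripts per gene on average")  # 15.5
```

An average of roughly 15 copies per gene is what makes it feasible to compare independent reads of the same sequence.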

Buetow said he and his colleagues theorized that they could go back into the databases and attempt to align the redundant copies of raw, unedited sequence from any given gene, then look for signs of possible SNPs. "The main technical challenge for us was the data that goes in through the EST [gene transcript] sequencing work is dirty, with errors arising in an estimated 2 percent of the bases," said Buetow. "With relatively frequent mistakes in the sequence, it is a significant challenge to distinguish signal, or true variation, from the noise."
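To see why a 2 percent error rate matters, consider how often sequencing errors alone would produce mismatches at a single aligned position. The sketch below is an illustration, not the group's method: it assumes errors are independent between reads and uses a plain binomial model, and the function name is hypothetical.

```python
from math import comb

def prob_at_least_k_errors(n_reads, k, per_base_error=0.02):
    """Probability that at least k of n_reads carry an error at one position.

    Assumes independent errors at the rate quoted in the text (2 percent).
    """
    p = per_base_error
    return sum(
        comb(n_reads, i) * p**i * (1 - p) ** (n_reads - i)
        for i in range(k, n_reads + 1)
    )

# With five aligned reads, one stray mismatch is common (about 0.096),
# while two or more mismatches from error alone are rare (about 0.004).
print(prob_at_least_k_errors(5, 1))
print(prob_at_least_k_errors(5, 2))
```

Under these assumptions a single disagreeing read is weak evidence of a variant, while repeated disagreement is much harder to explain as noise.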

To get around this problem, Buetow and colleagues developed the SNPpipeline. It utilizes two standard semi-automated software tools for editing sequence--PHRED, which scores the probability that a nucleotide call from a sequencing reaction is correct, and PHRAP, which takes multiple snippets of sequence and aligns them in their correct sequential order. Once aligned, the sequence can be assessed by a third software program called DEMIGLACE, which was developed by Michael Edmonson, of Buetow's laboratory.
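The article does not give PHRED's scoring formula. By the conventional Phred quality scale, a score Q corresponds to an estimated error probability of 10 to the power of -Q/10. The helper functions below are a minimal sketch of that conversion; their names are illustrative and are not part of PHRED itself.

```python
import math

def phred_to_error_prob(quality):
    """Convert a Phred-scale quality score to an estimated error probability."""
    return 10 ** (-quality / 10)

def error_prob_to_phred(error_prob):
    """Convert an error probability to a Phred-scale quality score."""
    return -10 * math.log10(error_prob)

# A score of 20 means about a 1 in 100 chance the base call is wrong.
print(phred_to_error_prob(20))
# The 2 percent error rate quoted for EST data corresponds to a score near 17.
print(round(error_prob_to_phred(0.02), 1))
```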

"DEMIGLACE tests for the statistical probability that a base either is an error or an actual variant," said Edmonson. "Part of the way it does this is by throwing out data elements that it recognizes as being suspect. So, by a process of elimination, the program ends up leaving behind only the variants that have the true essence of a SNP."
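The article does not describe DEMIGLACE's statistical model, so the following is only a toy illustration of the two ideas Edmonson mentions: discarding suspect data, then scoring what remains. It assumes Phred-scale quality scores and independent errors, and the function name, the quality cutoff of 20, and the scoring rule are all assumptions made for this example.

```python
from collections import Counter

def candidate_snp_score(calls, min_quality=20):
    """Score one aligned position given (base, phred_quality) calls.

    Returns (minor_base, score), or None if only one base remains after
    low-quality calls are discarded. The score is 1 minus the probability
    that every call supporting the minor base is a sequencing error.
    """
    # Step 1: throw out base calls considered suspect.
    kept = [(base, q) for base, q in calls if q >= min_quality]
    counts = Counter(base for base, _ in kept)
    if len(counts) < 2:
        return None

    # Step 2: take the second most common base as the candidate variant.
    minor_base = counts.most_common(2)[1][0]

    # Step 3: probability that all supporting calls are errors.
    p_all_errors = 1.0
    for base, q in kept:
        if base == minor_base:
            p_all_errors *= 10 ** (-q / 10)
    return minor_base, 1.0 - p_all_errors

# Four good "A" calls, two good "G" calls, and one poor "G" call.
column = [("A", 30)] * 4 + [("G", 30), ("G", 30), ("G", 10)]
print(candidate_snp_score(column))
```

In this toy model, a position supported only by poor-quality calls never becomes a candidate, which mirrors the process of elimination described in the quote.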

Last March, Buetow and colleagues published in Nature Genetics the successful results of the pilot study of the SNPpipeline. Knowing their approach worked, the group followed up with the more comprehensive investigation that is the source of today's data release.

The researchers started with 21,993 genes that had five or more sequence transcripts logged into gene databases and then entered them into the SNPpipeline. As Buetow reported today, the software program identified 10,435 candidate SNPs at a .99 confidence level. He noted that when the confidence level was reduced to .95, still a reliable level, the program identified 13,902 possible SNPs.
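The selection steps described above can be sketched in a few lines. This is a simplified illustration with made-up inputs, not the SNPpipeline code: the gene names, transcript counts, and confidence scores are hypothetical, as are the function names.

```python
def eligible_genes(transcript_counts, min_transcripts=5):
    """Keep genes with at least min_transcripts banked transcripts."""
    return [gene for gene, n in transcript_counts.items() if n >= min_transcripts]

def count_candidates(scores, threshold):
    """Count candidate SNPs whose confidence meets the threshold."""
    return sum(1 for score in scores if score >= threshold)

# Hypothetical inputs for illustration only.
transcript_counts = {"geneA": 12, "geneB": 3, "geneC": 5}
scores = [0.999, 0.97, 0.96, 0.90]

print(eligible_genes(transcript_counts))   # ['geneA', 'geneC']
print(count_candidates(scores, 0.99))      # 1
print(count_candidates(scores, 0.95))      # 3
```

Lowering the threshold can only add candidates, which is consistent with the larger count reported at the .95 level.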

"It took us approximately a week's worth of computing time to build this collection of data," said Buetow, who said his group ultimately hopes to be able to perform the analysis in real time. "But technically, we can redo the analysis at any time as new data becomes available."

Buetow said this data release marks the first of what will be many future releases as the GAI moves ahead in its efforts to identify new SNPs. He added that the GAI is just one of the many integrated resources for genetic analysis being developed through CGAP, which also is building comprehensive indexes of genes that are expressed in cancer or are inactivated by chromosomal breakages.

"GAI helps to broaden the types of investigation that can be leveraged out of CGAP," he said. "Scientists will be able to access CGAP's Tumor Gene Index, find a gene of interest, then in a matter of seconds look for known variations in the gene's structure."

The SNPs can be found at http://lpg.nci.nih.gov/GAI.

For more information, or to contact the National Cancer Institute, see its website at www.cancer.gov.
