Internet for Pharmaceutical and Biotech Communities
Title: Least-square deconvolution (LSD): a method to resolve DNA mixtures
United States Patent: 7,162,372
Issued: January 9, 2007

Inventors: Wang; Tse-Wei (Oak Ridge, TN), Xue; Ning (Knoxville, TN), Birdwell; John D. (Oak Ridge, TN), Rader; Mark (Knoxville, TN), Flaherty; John (Knoxville, TN)
Appl. No.: 10/265,908
Filed: October 8, 2002


 



Abstract

Least Square Deconvolution (LSD) uses quantitative allele peak data obtained from a sample containing the DNA of more than one contributor to resolve the best-fit genotype profile of each contributor. The resolution is based on finding the least square fit of the mass ratio coefficients at each locus to come closest to the quantitative allele peak data. Consistent top-ranked mass ratio combinations from each locus can be pooled to form at least one composite DNA profile at a subset of the available loci. The top-ranked DNA profiles can be used to check against the profile of a suspect or be used to search for a matching profile in a DNA database.

BRIEF SUMMARY OF THE INVENTION

The invention encompasses a method of resolving a mixture comprising DNA of more than one individual into genotype profiles for individuals in the mixture. When the method of the present invention is implemented in application software or otherwise, it will be referred to herein as LSD. LSD is an acronym for a mathematical process, least square deconvolution, which we have picked as the name of the present method, for example, when embodied in software. The use of the acronym LSD is not intended to limit the present method itself or particular steps of the method, for which other steps and known mathematical processes may be substituted by one of skill in the art to equivalent advantage. A step of the method is obtaining quantitative allele peak data at a first locus. A best fit mass ratio coefficient vector is solved using the quantitative allele peak data for allele combinations that can be contributed by the individuals. Residuals are calculated for the allele combinations. An allele combination is selected for the individuals at the first locus having the smallest residual. The smallest residual does not cluster with the second smallest residual. The allele combination selected comprises the genotype profiles of the individuals.

The invention also encompasses a method of analyzing quantitative allele peak data from a sample comprising DNA of more than one individual into a genotype profile for individuals in the sample. A step of the method is solving for a best fit mass ratio coefficient vector using allele peak data for allele combinations at a first locus that can be contributed by the individuals. Residuals are calculated for the allele combinations. An allele combination is selected for the individuals at the first locus having the smallest residual. The smallest residual does not cluster with the second smallest residual. The allele combination selected comprises the genotype profiles of the individuals.

The invention further encompasses a method of remotely accessing a software application in a secure manner for resolving a mixture of DNA. The software application is hosted on a secure server. The software application is accessed from a client remotely via a network. The secure server and the client are protected via a firewall. The DNA mixture is transmitted to the secure server. The analysis results are received from the secure server at the client.

The invention further encompasses a method of generating genotype profiles for individuals who contribute DNA to a sample comprising DNA of more than one individual. A step of the method is obtaining quantitative allele peak data for a set of more than one locus in the sample. The quantitative allele peak data for each locus of the set of loci are separately assigned to allele combinations that can comprise the genotype profiles of the individuals at each locus of the set of loci. A residual error and a mass ratio are separately computed for the allele combinations that can comprise the genotype profiles of the individuals at each locus of the set of loci. The allele combinations for each locus of the set of loci are selected. The mass ratio for the allele combinations selected is consistent. The residual error for the allele combinations selected is the smallest or the second smallest residual error and the allele combinations selected comprise the genotype profiles of the individuals who contribute DNA to the sample.
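The pooling step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented procedure: the function name `composite_profile`, the use of the median of the top-ranked mass ratios as the reference value, the 30% relative tolerance, and all of the numbers in the example are invented here for demonstration.

```python
from statistics import median

def composite_profile(per_locus, rel_tol=0.3):
    """Pool loci whose best or second-best fit has a consistent mass ratio.

    per_locus: dict mapping locus name -> list of
               (residual, genotype_pair, mass_ratio), sorted by residual.
    Returns (reference_ratio, {locus: genotype_pair}).
    """
    # Assumption: the median of the top-ranked ratios serves as the reference.
    reference = median(cands[0][2] for cands in per_locus.values() if cands)
    profile = {}
    for locus, cands in per_locus.items():
        # Only the smallest and second smallest residuals are considered.
        for _residual, pair, ratio in cands[:2]:
            if abs(ratio - reference) <= rel_tol * reference:
                profile[locus] = pair
                break  # loci with no consistent candidate are left out
    return reference, profile

# Invented example data: (residual, (contributor 1, contributor 2), mass ratio)
per_locus = {
    "TH01": [(30.0, ("6,9.3", "7,9.3"), 3.1), (250.0, ("6,7", "9.3,9.3"), 1.2)],
    "VWA": [(25.0, ("16,17", "18,18"), 8.0), (28.0, ("16,18", "17,18"), 2.9)],
    "FGA": [(40.0, ("21,24", "22,24"), 3.0)],
    "D8S1179": [(50.0, ("12,13", "14,14"), 9.0), (60.0, ("12,14", "13,14"), 0.5)],
    "D21S11": [(35.0, ("28,30", "29,30"), 3.2)],
}
reference, profile = composite_profile(per_locus)
```

In this made-up data set the second-ranked combination is chosen at VWA because its mass ratio agrees with the other loci, and D8S1179 is omitted from the composite profile because neither of its two best fits is consistent.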

The invention also encompasses a method of analyzing least square deconvolution output data wherein the data include a mass ratio and residual for allele combinations at a first locus in a set of loci in a sample comprising DNA of two individuals. A step of the method is preliminarily selecting either a genotype combination for the two individuals having a residual that is smallest if the smallest residual does not cluster with the second smallest residual or preliminarily selecting more than one genotype combination for the two individuals if the more than one genotype combination comprises residuals that are the smallest and that cluster. The genotype combination for the two individuals is determined from the preliminarily selected combinations, where the genotype combination has a mass ratio consistent with that of a second locus determined for the sample.
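A minimal sketch of this two-stage selection follows. This excerpt does not define when residuals "cluster", so the criterion used here (a residual within `cluster_ratio` times the smallest one) is purely a hypothetical stand-in, as are the function names, the 30% tolerance, and the example values.

```python
def preliminary_selection(ranked, cluster_ratio=2.0):
    """Keep the best fit, plus any fits whose residuals cluster with it.

    ranked: list of (residual, genotype_pair, mass_ratio), sorted by residual.
    Hypothetical criterion: residual <= cluster_ratio * smallest residual.
    """
    if not ranked:
        return []
    smallest = ranked[0][0]
    return [c for c in ranked if c[0] <= cluster_ratio * smallest]

def pick_consistent(candidates, reference_ratio, rel_tol=0.3):
    """Choose the candidate whose mass ratio best agrees with a second locus."""
    consistent = [
        c for c in candidates
        if abs(c[2] - reference_ratio) <= rel_tol * reference_ratio
    ]
    if not consistent:
        return None  # nothing at this locus agrees with the reference
    return min(consistent, key=lambda c: abs(c[2] - reference_ratio))

# Invented LSD output for one locus: two residuals cluster, one is far larger.
ranked = [
    (30.0, ("A,B", "B,C"), 3.1),
    (40.0, ("A,B", "C,C"), 7.0),
    (600.0, ("A,A", "B,C"), 0.6),
]
shortlist = preliminary_selection(ranked)
# Suppose a second locus was resolved with a mass ratio of about 3.0.
chosen = pick_consistent(shortlist, reference_ratio=3.0)
```

Here the two smallest residuals are both retained, and the mass ratio from the second locus breaks the tie in favor of the first combination.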

DETAILED DESCRIPTION OF THE INVENTION

To date, no direct, systematic, analytic, and quantitative method exists to resolve DNA mixture samples. The instant invention, based on quantitative allele peak data, provides the art with a method to resolve DNA mixture samples contributed by two individuals.

Least Square Deconvolution (LSD) is a novel method applying the least-square modeling approach to find the best-fit genotype combination to resolve mixed DNA samples comprising DNA of more than one person. Quantitative allele peak data are used in LSD because allele peak areas are theoretically proportional to the mass of the corresponding DNA alleles in a mixture and because the proportional relationship of allele peak areas is approximately preserved during PCR amplification. Other types of measurements with a known theoretical relationship to DNA allele mass in a sample and that approximately preserve this relationship in practice are equivalent. Examples of such equivalent types of measurement include allele peak height and optical density. The theoretical relationship does not have to be linear as long as it is known. The objective of LSD is to first find the best-fit genotype combination for the two contributors at loci where peak data are available, using least-square techniques and the measured allele peak data. Then, using the best fit mass ratio information for all loci processed, a composite genotype profile for each of the individual contributors can be formed that is compatible with the results of the least square analysis.

The advantage of LSD compared to other quantitative approaches is its direct calculation of the best-fit genotype combination and the approximate mass ratio, without iterative searching for the optimal mass ratio. See (5) and (18). Some matrix calculations are involved in this method. FIG. 1 is a flow diagram of an embodiment of LSD. It is apparent from the diagram that LSD is applied at each locus independently of other loci, thus allowing each locus to be fitted with its own best-fit mass ratio, which can differ from the best-fit mass ratio arrived at other loci.
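The direct, non-iterative calculation at a single locus can be illustrated as follows for two contributors. This is a sketch under assumptions rather than the patent's implementation: it assumes each candidate genotype pair is encoded as a two-column matrix of allele copy numbers (0, 1, or 2), solves the 2x2 normal equations in closed form, and applies no non-negativity constraint to the fitted coefficients. The function name `fit_locus`, the allele labels, and the peak areas are invented for the example.

```python
from itertools import combinations, combinations_with_replacement
from math import sqrt

def fit_locus(peaks):
    """Rank two-contributor genotype pairs at one locus by residual.

    peaks: dict mapping allele label -> measured peak area (or height).
    Returns a list of (residual, (genotype_1, genotype_2), (x1, x2)),
    sorted by residual, with the larger mass coefficient listed first.
    """
    alleles = sorted(peaks)
    b = [peaks[a] for a in alleles]
    genotypes = list(combinations_with_replacement(alleles, 2))
    results = []
    for g1, g2 in combinations(genotypes, 2):
        # Together the two genotypes must account for every observed allele.
        if set(g1) | set(g2) != set(alleles):
            continue
        # Copy number of each allele carried by each contributor.
        c1 = [g1.count(a) for a in alleles]
        c2 = [g2.count(a) for a in alleles]
        # Normal equations (A^T A) x = A^T b for A = [c1 c2].
        s11 = sum(u * u for u in c1)
        s12 = sum(u * v for u, v in zip(c1, c2))
        s22 = sum(v * v for v in c2)
        t1 = sum(u * y for u, y in zip(c1, b))
        t2 = sum(v * y for v, y in zip(c2, b))
        det = s11 * s22 - s12 * s12
        if det == 0:
            continue  # columns not independent; contributors inseparable
        x1 = (s22 * t1 - s12 * t2) / det
        x2 = (s11 * t2 - s12 * t1) / det
        residual = sqrt(sum((x1 * u + x2 * v - y) ** 2
                            for u, v, y in zip(c1, c2, b)))
        if x1 < x2:  # report the major contributor first
            g1, g2, x1, x2 = g2, g1, x2, x1
        results.append((residual, (g1, g2), (x1, x2)))
    return sorted(results, key=lambda r: r[0])

# Invented peak areas, roughly a 3:1 mixture of genotypes (A,B) and (B,C).
ranked = fit_locus({"A": 3050.0, "B": 3980.0, "C": 990.0})
best_residual, best_pair, best_x = ranked[0]
```

Each candidate is fitted with one closed-form solve, so no search over mass ratios is needed, and calling `fit_locus` once per locus gives each locus its own best-fit mass ratio.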

Compared with other approaches published in the literature, LSD is more efficient, simpler, more comprehensive, and gives true genotypes when the quantitative allele peak data and its theoretical relationship to allele mass approximately preserve the relative DNA mass proportionality. The advantage of LSD is its direct calculation of the best-fit genotype combination and the approximate mass ratio without iteration (5). Furthermore, LSD is applied to each locus independently of other loci, thus allowing each locus to be fitted independently with its own best-fit mass ratio, and the degree of confidence of fit to be independently assessed for each locus. From examining the relative errors of each fit at a locus, the degree of confidence can be separately assigned to the resulting best-fit genotype, locus by locus, allowing a composite profile to be assembled with a high degree of confidence containing only those loci whose LSD results are clear cut.

A locus refers to the position occupied by a segment of a specific sequence of base pairs along a gene sequence of DNA (2). Genes are differentiated by their specific sequences of base pairs at each locus. An allele refers to the specific gene sequence at a locus. At most two possible alleles can be present at one locus of a chromosome pair for each individual: one contributed by the paternal and the other contributed by the maternal source (8). If these two alleles are the same, the DNA profile is homozygous at that locus. If these two copies are different, the DNA profile is heterozygous at the locus (8). There are multiple alleles that can be contributed by either parent at each locus.

A genotype or DNA profile is the set of alleles that an individual has at a given locus. A genotype or DNA profile may also comprise the sets of alleles that an individual has at more than one locus. For example, a genotype or DNA profile may comprise the set of alleles at each of at least 2 loci, 3 loci, 4 loci, 5 loci, 7 loci, 9 loci, 11 loci, 13 loci, or 20 loci.

A DNA or genotype profile is developed from a nucleic acid sample, usually a DNA sample. Sources of nucleic acid include tissue, blood, semen, vaginal smears, sputum, nail scrapings, or saliva.

The DNA of interest can be prepared for analysis by the LSD method by amplification and subsequent separation. Amplification may be performed by any suitable procedures and by using any suitable apparatus available in the art. For example, enzymes can be used to perform an amplification reaction, such as Taq, Pfu, Klenow, Vent, Tth, or Deep Vent. Amplification may be performed under modified conditions that include "hot-start" conditions to prevent nonspecific priming. "Hot-start" amplification may be performed with a polymerase that has an antibody or other peptide tightly bound to it. The polymerase does not become available for amplification until a sufficiently high temperature is reached in the reaction. "Hot start" amplification may also be performed using a physical barrier that separates the primers from the DNA template in the amplification reaction until a temperature sufficiently high to break down the barrier has been reached. Barriers include wax, which does not melt until the temperature of the reaction exceeds the temperature at which the primers will not anneal nonspecifically to DNA.

The products of the amplification reaction are detected as different alleles present at a locus or loci. The alleles of at least one locus are amplified and detected after the amplification reaction. If desired, however, the alleles of multiple loci, e.g., two, three, four, five, six, ten, fifteen, twenty, twenty-five, or thirty, or more different loci may be detected after amplification. Sets of loci may include at least two, three, five, ten, fifteen, twenty, thirty, or fifty loci. Amplification of all of the alleles may be performed in a single amplification reaction or in a multiplex amplification reaction. Alternatively, the sample may be divided into several portions, each of which is amplified with primers that yield product for the alleles present at a single locus. Multiplex amplification is preferred.

The different alleles at a locus typically are detected because they differ in size. Alleles can differ in size due to the presence of repeated DNA units within loci. A repeated unit of DNA can be, e.g., a dinucleotide, trinucleotide, tetranucleotide, or pentanucleotide repeat. Short Tandem Repeats (STR) are DNA segments with repeat units of 2-6 bp in length (10). The repeated unit can be of a longer length that ranges from ten to one hundred base pairs. These are medium-length repeats and may be referred to as a Variable Number of Tandem Repeats (VNTR) (10). Repeat units of several hundred to several thousand base pairs may also be present in a locus. These are the long repeat units.

The number of repeated units at a locus also varies. The number of repeated units may be, for example, at least five, at least ten, at least fifteen, at least twenty, at least twenty-five, or at least fifty units. The effect of these repeated units of DNA is the presence of multiple types of alleles that an individual can possess at any given locus that can be detected by size (10).

Preferably, alleles that harbor different numbers of STR repeat units are detected. More than 8000 STRs (loci) scattered across the 23 pairs of human chromosomes have been collected in the Marshfield Medical Research Foundation in Marshfield, Wis. (10). Preferably, alleles at the 13 core loci used by the FBI Combined DNA Index System (CODIS): CSF1PO, FGA, TH01, TPOX, VWA, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, and D21S11 (11), are detected.

It is also contemplated that amplification may be performed to detect an allele by amplifying microsatellite DNA repeats, DNA flanking Alu repeat sequences, or any other known polymorphic region of DNA that can be distinguished based on the size of different alleles.

Any method that separates amplification products based on size and any method that quantitates the amount of the allele present in the sample can be used to prepare the data required for analysis of genotype profiles in the method. The amplification products may be separated by electrophoresis in a gel or capillary, or by mass spectrometry. The amount of each allele present may be determined fluorometrically in a fluorometer, or via ultraviolet spectrometry. For example, a Beckman Biomek® 2000 Liquid Handling System can be used to detect and quantitate alleles present for a locus in a sample. Optical density or optical signal can be used to detect the presence of an allele after gel or capillary electrophoresis.

Preferably, alleles are detected using an ABI Prism 310 Genetic Analyzer, or a HITACHI FMBIO II Fluorescence Imaging System (10). The ABI 310 Genetic Analyzer identifies alleles present at a locus, as outlined in FIG. 2 (see Original Patent), and provides a data output result, as shown in FIG. 3 (see Original Patent). One advantage of this instrument is that, in addition to sizing the detected allele signals, the related software can also display their peak heights and automatically calculate the area under each peak (10).

The HITACHI FMBIO II Fluorescence Imaging System uses gel electrophoresis instead of capillary electrophoresis to separate the alleles of a DNA sample (10). This system requires much more sample and a longer time to complete a separation. In this genetic analyzer, each allele corresponds to a specific band in a gel lane. The band size for each allele is compared with a well-calibrated allelic ladder to identify the corresponding allele (10).

If the amplification products are input into an apparatus that both separates and quantitates alleles for a locus in a sample, four different types of peaks can be obtained from these raw data: true or allele peaks, stutter peaks, artifact peaks, and pull-up peaks.

True or allele peaks are peaks that indicate the presence of an allele at a locus. The most important characteristic of an allele peak is that the measured peak area or height is roughly proportional to the mass of the corresponding allele in the DNA sample (10). Preferably peak area is used.

Stutter peaks are peaks generated by the enzyme's slippage during the amplification process (12). In most cases, stutter peaks are located on the left side of the associated alleles, and the gene distance between the stutter peak and the associated allele peak is usually less than 4 bp (12). The height of the stutter peak is usually less than 15% of the height of the corresponding true allele peak (12).
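The stutter characteristics above translate naturally into a simple screening rule. The sketch below is only an illustration: the function name `flag_stutter`, the representation of a peak as a `(size_bp, height)` pair, and the example values are assumptions, and the two thresholds are taken from the "usually" figures in the text, so they are exposed as tunable parameters rather than fixed limits.

```python
def flag_stutter(peaks, max_gap_bp=4.0, max_height_fraction=0.15):
    """Flag peaks that look like stutter of a larger neighbouring peak.

    peaks: list of (size_bp, height) tuples for one locus.
    A peak is flagged when some other peak lies to its right (larger size)
    by less than max_gap_bp and the flagged peak's height is below
    max_height_fraction of that neighbour's height.
    Returns a list of booleans in the same order as peaks.
    """
    flags = []
    for size, height in peaks:
        is_stutter = any(
            0 < parent_size - size < max_gap_bp
            and height < max_height_fraction * parent_height
            for parent_size, parent_height in peaks
        )
        flags.append(is_stutter)
    return flags

# Invented example: a small peak 3.9 bp to the left of a tall allele peak.
peaks = [(156.1, 90.0), (160.0, 1000.0), (164.0, 800.0)]
flags = flag_stutter(peaks)
```

In a mixture, a true allele from a minor contributor can satisfy the same rule, so a flag of this kind is better treated as a prompt for review than as an automatic deletion.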

Artifact peaks are peaks due to impurities in the DNA samples. Generally, the artifact peaks have one or more of the following three characteristics: (1) about 53% of them are less than 5% of the nearest allele peak's height (12), (2) some artifact peaks consist of multiple peaks, and the distances among them are always less than 1 bp (12), and (3) some artifact peaks are within 0.5 bp of an allelic ladder marker (12). If a peak satisfies any of the above three rules, the peak can be defined as an artifact peak, and the peak's effect can be eliminated.

A pull-up peak is a minor peak directly to the right of a `true` allele peak. Usually, a pull-up peak is located on the right side of a `true` allele peak with a distance less than 3/8 bp, and its height is less than 50% of the major peak.

Quantitative peak data of `true` alleles are determined at a locus. These measurements may be the peak height or peak area of a signal detected by an instrument or procedure designed to quantify the presence of each allele. The peak height, peak area, and any other measurement that is related to the relative masses of each allele present in the original stain or sample are equivalent. Quantitative allele peak data will be referred to as "peak height," "peak area," or "quantitative allele peak data." Each of these terms is interchangeable.

The allele peak heights or areas are calculated and analyzed using LSD. LSD then returns the "best-fit" genotype profiles for the two individuals that contribute to the sample. "Best-fit" means that, under the assumption that the allele peak area/height is proportional to the relative mass proportion of the corresponding DNA allele in the mixture, the returned genotypes at the specified mass proportions would yield a set of allele peak areas/heights that is `closest` to the measured set of allele areas/heights in the least square sense (as measured by the Euclidean distance metric).
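One common way to write this criterion in symbols is given below. The notation is introduced here for illustration and is not taken from the patent text: b is the vector of measured peak areas (or heights) for the n alleles observed at a locus, A is an n x 2 matrix whose column j holds the number of copies (0, 1, or 2) of each allele in the candidate genotype of contributor j, x holds the two mass coefficients, and r is the residual used to rank candidates.

```latex
\hat{x} \;=\; \arg\min_{x}\,\lVert A x - b \rVert_2^{2}
        \;=\; \left(A^{\mathsf T} A\right)^{-1} A^{\mathsf T} b,
\qquad
r \;=\; \lVert A \hat{x} - b \rVert_2
```

The closed form requires the two columns of A to be linearly independent, which holds whenever the two candidate genotypes differ. The candidate genotype pair with the smallest r is the "closest" in the Euclidean sense described above.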

The genotype profile assigned to each individual by the LSD method can be verified by comparing the known genotype profile of one individual that contributed to the sample to that of one person developed by the LSD method. The known genotype profile may be obtained from an individual that is the victim of a crime.

A genotype profile obtained from the LSD method may also be matched to an individual to identify the individual as potentially having contributed to the sample. The genotype profile may be matched to the individual after obtaining a sample from the individual. The genotype profile may also be matched to an individual by comparing it to other genotype profiles in a database. The database may be any public or proprietary database that stores and/or matches genotype profiles. The database may be CODIS, which may be used to store genotype profiles in a national, state, or regional collection, and which may separate these profiles into disjoint parts, such as a convicted offenders database, a forensic DNA database, or a missing persons database.

A preferred embodiment of the invention is shown in FIG. 4 (see Original Patent). In this embodiment, LSD is implemented using software running under a secure web server 1 on a protected network 2 that is isolated from a public or private network 3 by a firewall 4. A remote user located at LSD/Database Client station 8 may access the LSD software at the web server 1 via the public or private network. The communication may be via the public switched telephone network (PSTN), preferably using known encryption algorithms for confidential data, but is preferably via a private network and encrypted. The firewall 4 allows communications with the secure web server 1 using an encrypted communications protocol such as the Hypertext Transfer Protocol (HTTP) over a Secure Sockets Layer (SSL). The firewall 4 connects the protected network 2 to the public or private network 3 using either an Internet service provider (ISP), leased, or owned telecommunications equipment/circuits 5 having appropriate bandwidth capability (although the data may be suitably compressed via known compression algorithms and transmitted over lower bandwidth facilities). The connection to the firewall 4 and all connections and equipment collocated with the protected network 2 are housed in a secure server facility 6 that provides LSD services to a community of clients located at forensic laboratories 7 or other organizations. Locations 7, 8, and 9 are shown by way of example only and are in no way intended to be limited to forensic laboratory locations.

A client 8 located at a forensic laboratory or other organization may use the public or private network 3 to gain access to LSD services offered by the secure server facility 6. Preferably, the client 8 is connected to a protected network 9 which connects to the public or private network 3 through a firewall 10, and the firewall 10, the protected network 9, and all equipment connected to the protected network 9, such as the LSD/Database Client 8, are housed in a secure client facility such as a forensic laboratory 7 (or other secure facility). The firewall 10 located at the forensic laboratory 7 connects the protected network 9 to the public or private network 3 using either an ISP, leased, or owned telecommunications equipment/circuits 11 having similar bandwidth considerations as described above for equipment/circuits 5.

The client 8 may make requests to analyze data derived from DNA mixtures on the secure LSD web server 1 by accessing the secure web server 1, transmitting DNA mixture data to the secure web server, and receiving analysis results. These results may then be interpreted using mixture interpretation guidelines to obtain one or more DNA profiles that may be associated with a suspect to a crime.
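On the client side, the request step might look like the following sketch. Everything specific in it is hypothetical: the host `lsd.example.org`, the `/analyze` path, the JSON payload layout, and the bearer-token header are invented for illustration, and the source does not describe the server's actual interface. The request is only constructed here, not sent.

```python
import json
import urllib.request

def build_analysis_request(base_url, locus_peaks, token):
    """Build (but do not send) an HTTPS POST carrying DNA mixture peak data.

    base_url, the "/analyze" path, the payload schema and the token header
    are all hypothetical placeholders, not a documented LSD server API.
    """
    if not base_url.startswith("https://"):
        raise ValueError("DNA mixture data must be sent over an encrypted channel")
    payload = json.dumps({"loci": locus_peaks}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/analyze",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )

request = build_analysis_request(
    "https://lsd.example.org",
    {"TH01": {"6": 3050.0, "7": 3980.0, "9.3": 990.0}},
    "demo-token",
)
# Against a real server one would then call, for example:
#   with urllib.request.urlopen(request, timeout=30) as response:
#       results = json.load(response)
```

Refusing any non-HTTPS URL mirrors the text's requirement that the firewall accept only encrypted communications with the secure web server.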

Optionally, the LSD/Database Client 8 may access a local laboratory, state, or national DNA database 12 to search for matches to the one or more DNA profiles formed using the results of the LSD analysis. The DNA database 12 may be located in a separate secure facility at the state, local, or national level and is preferentially protected by a firewall 13. The firewall 13 is connected to the public or private network using either an ISP, leased, or owned telecommunications equipment/circuits 14, and preferentially allows communications with a DNA database server 12 using only an encrypted communications protocol such as HTTP over SSL. The firewall 13 and DNA database server 12 are connected to a protected network 15. The connections to the firewall 13 and all connections and equipment collocated with the protected network 15 are housed in a secure server facility 16 that provides DNA database services to a community of clients located at forensic laboratories 7 or other organizations.

Nothing shown in FIG. 4 (see Original Patent) or described above should be taken to restrict the domain of the invention. For example, the DNA database server and the secure LSD server may be connected through firewalls to two separate and isolated public or private networks, requiring a separate client and protected network located at a forensic laboratory in order to communicate with each server. This is the case at present with the FBI's National DNA Index System (NDIS), which is connected to state and local facilities through the FBI-owned and operated Criminal Justice Information System's Wide Area Network (CJIS-WAN), and with the current implementation of the secure LSD server. This server is located on a protected network within The University of Tennessee's Laboratory for Information Technologies (LIT) and is connected through a firewall owned and operated by LIT to the university's campus network and thence to the public Internet. In this case, the functionality remains the same, except that an investigator or analyst transfers results obtained by a client from the secure LSD server to a client computer of the FBI's NDIS facilities in order to perform a search on the national DNA database.

The invention is not restricted to operation on protected computers and networks, nor is it restricted to require security of communications using encryption and secure authentication protocols. However, these measures are usually necessitated by the privacy laws of the United States and other countries. In a similar manner, it is not required that the LSD software, LSD/Database Client, and DNA database software operate on separate and communicating computers. They may in fact all be installed and operated on a single computer in some applications, or on two computers. There may also be multiple instances of the DNA database software running on several computers. The realities of multiple jurisdictions and multiple ownership of and responsibility for controlled access to data that are considered sensitive usually necessitate the use of multiple computers under the control of independent but cooperating agencies.
 


Claim 1 of 34 Claims

1. A method of resolving a mixture comprising DNA of more than one individual into genotype profiles for individuals in the mixture comprising: (a) obtaining quantitative allele peak data for alleles present at a first locus in a DNA mixture comprising DNA of more than one individual; (b) solving a best fit mass ratio coefficient vector using data consisting of the quantitative allele peak data obtained in step (a) for possible allele combinations that can be contributed by the individuals of the more than one individual at the first locus; (c) calculating residuals for the possible allele combinations of step (b); (d) selecting an allele combination from the possible allele combinations for the individuals at the first locus having the smallest residual, wherein the smallest residual does not cluster with the second smallest residual, wherein the allele combination selected resolves the DNA in the mixture at the first locus into respective genotype profiles for the individuals; and (e) repeating the steps of obtaining, solving, calculating and selecting for a second locus.

 

____________________________________________
If you want to learn more about this patent, please go directly to the U.S. Patent and Trademark Office Web site to access the full patent.

 

 

     