Title: Least-square deconvolution (LSD): a method to resolve DNA mixtures
United States Patent: 7,162,372
Issued: January 9, 2007
Inventors: Wang; Tse-Wei
(Oak Ridge, TN), Xue; Ning (Knoxville, TN), Birdwell; John D. (Oak Ridge,
TN), Rader; Mark (Knoxville, TN), Flaherty; John (Knoxville, TN)
Appl. No.: 10/265,908
Filed: October 8, 2002
Abstract
Least Square Deconvolution (LSD) uses
quantitative allele peak data obtained from a sample containing
the DNA of more than one contributor to resolve the best-fit genotype
profile of each contributor. The resolution is based on finding the least
square fit of the mass ratio coefficients at each locus to come closest to
the quantitative allele peak data. Consistent top-ranked mass ratio
combinations from each locus can be pooled to form at least one composite
DNA profile at a subset of the available loci. The top-ranked DNA profiles
can be used to check against the profile of a suspect or be used to search
for a matching profile in a DNA database.
BRIEF SUMMARY OF THE INVENTION
The invention encompasses a method of
resolving a mixture comprising DNA of more than one individual into
genotype profiles for individuals in the mixture. When the method of the
present invention is implemented in application software or otherwise, it
will be referred to herein as LSD. LSD is an acronym for a mathematical
process, in particular, least square deconvolution, which we have picked
as the name of the present method, for example, when embodied in software.
The use of the acronym LSD is not intended to be limited in describing the
present method itself or particular steps of the method for which other
steps and known mathematical processes may be substituted by one of skill
in the art to equivalent advantage. A step of the method is obtaining
quantitative allele peak data at a first locus. A best fit mass ratio
coefficient vector is solved using the quantitative allele peak data for
allele combinations that can be contributed by the individuals. Residuals
are calculated for the allele combinations. An allele combination is
selected for the individuals at the first locus having the smallest
residual. The smallest residual does not cluster with the second smallest
residual. The allele combination selected comprises the genotype profiles
of the individuals.
The invention also encompasses a method of analyzing quantitative allele
peak data from a sample comprising DNA of more than one individual into a
genotype profile for individuals in the sample. A step of the method is
solving for a best fit mass ratio coefficient vector using allele peak
data for allele combinations at a first locus that can be contributed by
the individuals. Residuals are calculated for the allele combinations. An
allele combination is selected for the individuals at the first locus having
the smallest residual. The smallest residual does not cluster with the second
smallest residual. The allele combination selected comprises the genotype
profiles of the individuals.
The invention further encompasses a method of remotely accessing a
software application in a secure manner for resolving a mixture of DNA.
The software application is hosted on a secure server. The software
application is accessed from a client remotely via a network. The secure
server and the client are protected via a firewall. The DNA mixture is
transmitted to the secure server. The analysis results are received from
the secure server at the client.
The invention further encompasses a method of generating genotype profiles
for individuals who contribute DNA to a sample comprising DNA of more than
one individual. A step of the method is obtaining quantitative allele peak
data for a set of more than one locus in the sample. The quantitative
allele peak data for each locus of the set of loci are separately assigned
to allele combinations that can comprise the genotype profiles of the
individuals at each locus of the set of loci. A residual error and a mass
ratio are separately computed for the allele combinations that can comprise
the genotype profiles of the individuals at each locus of the set of loci.
The allele combinations for each locus of the set of loci are selected.
The mass ratio for the allele combinations selected is consistent. The
residual error for the allele combinations selected is the smallest or the
second smallest residual error and the allele combinations selected
comprise the genotype profiles of the individuals who contribute DNA to
the sample.
The invention also encompasses a method of analyzing least square
deconvolution output data wherein the data include a mass ratio and
residual for allele combinations at a first locus in a set of loci in a
sample comprising DNA of two individuals. A step of the method is
preliminarily selecting either a genotype combination for the two
individuals having a residual that is smallest if the smallest residual
does not cluster with the second smallest residual or preliminarily
selecting more than one genotype combination for the two individuals if
the more than one genotype combination comprises residuals that are the
smallest and that cluster. The genotype combination for the two
individuals is then determined from the preliminarily selected combinations,
where the genotype combination has a mass ratio consistent with that of a
second locus determined for the sample.
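The two-stage selection just described can be sketched as follows. This is only an illustration under stated assumptions: the source does not define numerically when two residuals "cluster" or when two mass ratios are "consistent", so the `cluster_factor` and `tolerance` parameters, the function names, and the tuple layout are all hypothetical.

```python
def preliminary_selection(ranked, cluster_factor=10.0):
    """Keep the smallest-residual combination plus any whose residuals cluster with it.

    `ranked` is a list of (residual, minor_fraction, genotype_combination)
    tuples sorted by residual. Residuals are treated as clustering when they
    are within `cluster_factor` times the smallest one (an assumed rule).
    """
    smallest = ranked[0][0]
    return [r for r in ranked if r[0] <= cluster_factor * smallest]


def resolve_with_reference(ranked, reference_minor_fraction,
                           tolerance=0.05, cluster_factor=10.0):
    """Choose the preliminary candidate whose mass ratio agrees with a second locus.

    `reference_minor_fraction` is the minor contributor's mass fraction found
    at the second locus; `tolerance` is an assumed consistency window.
    Returns None when no candidate is consistent.
    """
    candidates = preliminary_selection(ranked, cluster_factor)
    consistent = [c for c in candidates
                  if abs(c[1] - reference_minor_fraction) <= tolerance]
    return consistent[0] if consistent else None


# Two residuals cluster; the second locus (minor fraction 0.25) breaks the tie.
ranked = [(0.0010, 0.45, "A"), (0.0012, 0.26, "B"), (0.0500, 0.25, "C")]
choice = resolve_with_reference(ranked, reference_minor_fraction=0.25)
```

Here combination "C" is never considered because its residual does not cluster with the smallest, and "B" is chosen over "A" because only its mass ratio agrees with the reference locus.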
DETAILED DESCRIPTION OF THE INVENTION
To date, no direct, systematic, analytic,
and quantitative method exists to resolve DNA mixture samples. The instant
invention, based on quantitative allele peak data, provides the art with a
method to resolve DNA mixture samples contributed by two individuals.
Least Square Deconvolution (LSD) is a novel method applying the
least-square modeling approach to find the best-fit genotype combination
to resolve mixed DNA samples comprising DNA of more than one person.
Quantitative allele peak data are used in LSD because allele peak areas
are theoretically proportional to the mass of the corresponding DNA
alleles in a mixture and because the proportional relationship of allele
peak areas is approximately preserved during PCR amplification. Other
types of measurements with a known theoretical relationship to DNA allele
mass in a sample and that approximately preserve this relationship in
practice are equivalent. Examples of such equivalent types of measurement
include allele peak height and optical density. The theoretical
relationship does not have to be linear as long as it is known. The
objective of LSD is to first find the best-fit genotype combination for
the two contributors at loci where peak data are available, using
least-square techniques and the measured allele peak data. Then, using the
best fit mass ratio information for all loci processed, a composite
genotype profile for each of the individual contributors can be formed
that is compatible with the results of the least square analysis.
The advantage of LSD compared to other quantitative approaches is its
direct calculation of the best-fit genotype combination and the
approximate mass ratio, without iterative searching for the optimal mass
ratio. See (5) and (18). Some matrix calculations are involved in this
method. FIG. 1 is a flow diagram of an embodiment of LSD. It is apparent
from the diagram that LSD is applied at each locus independently of other
loci, thus allowing each locus to be fitted with its own best-fit mass
ratio, which can differ from the best-fit mass ratio arrived at for other
loci.
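A minimal sketch of the per-locus fit for a two-person mixture is given below. It is not the patented implementation. It assumes a genotype matrix whose entry is the number of copies (0, 1, or 2) of each observed allele carried by each contributor, normalizes peak areas to fractions, and discards fits with a negative mass coefficient. The function names are hypothetical, and NumPy is used for the least-square solve.

```python
from itertools import combinations_with_replacement

import numpy as np


def candidate_genotypes(alleles):
    """All allele pairs (homozygous and heterozygous) from the observed alleles."""
    return list(combinations_with_replacement(alleles, 2))


def fit_locus(alleles, peak_areas):
    """Rank two-person genotype combinations at one locus by least-square residual.

    Returns a list of (residual, mass_coefficients, (genotype_1, genotype_2)),
    smallest residual first.
    """
    b = np.asarray(peak_areas, dtype=float)
    b = b / b.sum()  # relative peak areas
    results = []
    for g1, g2 in combinations_with_replacement(candidate_genotypes(alleles), 2):
        # Every observed allele must come from at least one contributor.
        if set(g1) | set(g2) != set(alleles):
            continue
        # Column j holds contributor j's copy number of each allele.
        A = np.array([[g1.count(a), g2.count(a)] for a in alleles], dtype=float)
        x, *_ = np.linalg.lstsq(A, b, rcond=None)  # direct solve, no iteration
        if np.any(x < -1e-9):
            continue  # a negative DNA mass is not physically meaningful
        residual = float(np.sum((A @ x - b) ** 2))
        results.append((residual, x, (g1, g2)))
    results.sort(key=lambda r: r[0])
    return results


# Three alleles with peak areas 3:2:3, as produced by a 3:1 mixture of a
# (10, 12) heterozygote and an (11, 11) homozygote.
ranked = fit_locus([10, 11, 12], [3.0, 2.0, 3.0])
best_residual, best_masses, best_genotypes = ranked[0]
```

Because each candidate combination is solved in closed form, the ranking at one locus needs no search over mass ratios, and repeating the call for each locus gives every locus its own best-fit ratio.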
Compared with other approaches published in the literature, LSD is more
efficient, simpler, more comprehensive, and gives true genotypes when the
quantitative allele peak data and its theoretical relationship to allele
mass approximately preserve the relative DNA mass proportionality. The
advantage of LSD is its direct calculation of the best-fit genotype
combination and the approximate mass ratio without iteration (5).
Furthermore, LSD is applied to each locus independently of other loci,
thus allowing each locus to be fitted independently with its own best-fit
mass ratio, and the degree of confidence of fit to be independently
assessed for each locus. From examining the relative errors of each fit at
a locus, the degree of confidence can be separately assigned to the
resulting best-fit genotype, locus by locus, allowing a composite profile
to be assembled with a high degree of confidence containing only those
loci whose LSD results are clear cut.
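One way to picture the locus-by-locus assembly is the sketch below, which keeps a locus only when its smallest residual stands clearly apart from the second smallest. The threshold `cluster_factor`, the function name, and the data layout are assumptions made for illustration, since the text above does not quantify "clear cut".

```python
def composite_profile(locus_results, cluster_factor=10.0):
    """Assemble a composite profile from loci with an unambiguous best fit.

    `locus_results` maps a locus name to a list of (residual, genotypes)
    tuples sorted by residual. A locus is skipped when its second-smallest
    residual is within `cluster_factor` times the smallest (assumed rule).
    """
    profile = {}
    for locus, ranked in locus_results.items():
        if len(ranked) > 1 and ranked[1][0] <= cluster_factor * ranked[0][0]:
            continue  # ambiguous locus: leave it out of the composite
        profile[locus] = ranked[0][1]
    return profile


results = {
    "D3S1358": [(0.001, "A"), (0.050, "B")],  # clear cut
    "VWA": [(0.002, "C"), (0.003, "D")],      # residuals cluster
    "FGA": [(0.000, "E")],                    # only one candidate
}
profile = composite_profile(results)
```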
A locus refers to the position occupied by a segment of a specific
sequence of base pairs along a gene sequence of DNA (2). Genes are
differentiated by their specific sequences of base pairs at each locus. An
allele refers to the specific gene sequence at a locus. At most two
possible alleles can be present at one locus of a chromosome pair for each
individual: one contributed by the paternal and the other contributed by
the maternal source (8). If these two alleles are the same, the DNA
profile is homozygous at that locus. If these two copies are different,
the DNA profile is heterozygous at the locus (8). There are multiple
alleles that can be contributed by either parent at each locus.
A genotype or DNA profile is the set of alleles that an individual has at
a given locus. A genotype or DNA profile may also comprise the sets of
alleles that an individual has at more than one locus. For example, a
genotype or DNA profile may comprise the set of alleles at each of at
least 2 loci, 3 loci, 4 loci, 5 loci, 7 loci, 9 loci, 11 loci, 13 loci, or
20 loci.
A DNA or genotype profile is developed from a nucleic acid sample, usually
a DNA sample. Sources of nucleic acid include tissue, blood, semen,
vaginal smears, sputum, nail scrapings, or saliva.
The DNA of interest can be prepared for analysis by the LSD method by
amplification and subsequent separation. Amplification may be performed by
any suitable procedures and by using any suitable apparatus available in
the art. For example, enzymes can be used to perform an amplification
reaction, such as Taq, Pfu, Klenow, Vent, Tth, or Deep Vent. Amplification
may be performed under modified conditions that include "hot-start"
conditions to prevent nonspecific priming. "Hot-start" amplification may
be performed with a polymerase that has an antibody or other peptide
tightly bound to it. The polymerase does not become available for
amplification until a sufficiently high temperature is reached in the
reaction. "Hot start" amplification may also be performed using a physical
barrier that separates the primers from the DNA template in the
amplification reaction until a temperature sufficiently high to break down
the barrier has been reached. Barriers include wax, which does not melt
until the temperature of the reaction exceeds the temperature at which the
primers will not anneal nonspecifically to DNA.
The products of the amplification reaction are detected as different
alleles present at a locus or loci. The alleles of at least one locus are
amplified and detected after the amplification reaction. If desired,
however, the alleles of multiple loci, e.g., two, three, four, five, six,
ten, fifteen, twenty, twenty-five, or thirty, or more different loci may
be detected after amplification. Sets of loci may include at least two,
three, five, ten, fifteen, twenty, thirty, or fifty loci. Amplification of
all of the alleles may be performed in a single amplification reaction or
in a multiplex amplification reaction. Alternatively, the sample may be
divided into several portions, each of which is amplified with primers
that yield product for the alleles present at a single locus. Multiplex
amplification is preferred.
The different alleles at a locus typically are detected because they
differ in size. Alleles can differ in size due to the presence of repeated
DNA units within loci. A repeated unit of DNA can be, e.g., a dinucleotide,
trinucleotide, tetranucleotide, or pentanucleotide repeat. Short Tandem
Repeats (STR) are DNA segments with repeat units of 2-6 bp in length (10).
The repeated unit can be of a longer length that ranges from ten to one
hundred base pairs. These are medium-length repeats and may be referred to
as Variable Number of Tandem Repeats (VNTR) (10). Repeat units of several
hundred to several thousand base pairs may also be present in a locus.
These are the long repeat units.
The number of repeated units at a locus also varies. The number of
repeated units may be, for example, at least five, at least ten, at least
fifteen, at least twenty, at least twenty-five, or at least fifty units.
The effect of these repeated units of DNA is the presence of multiple
types of alleles that an individual can possess at any given locus that
can be detected by size (10).
Preferably, alleles that harbor different numbers of STR repeat units are
detected. More than 8000 STRs (loci) scattered across the 23 pairs of
human chromosomes have been collected in the Marshfield Medical Research
Foundation in Marshfield, Wis. (10). Preferably, alleles at the 13 core
loci used by the FBI Combined DNA Index System (CODIS): CSF1PO, FGA, TH01,
TPOX, VWA, D3S1358, D5S818, D7S820, D8S1179, D13S317, D16S539, D18S51, and
D21S11 (11), are detected.
It is also contemplated that amplification may be performed to detect an
allele by amplifying microsatellite DNA repeats, DNA flanking Alu repeat
sequences, or any other known polymorphic region of DNA that can be
distinguished based on the size of different alleles.
Any method that separates amplification products based on size and any
method that quantitates the amount of the allele present in the sample can
be used to prepare the data required for analysis of genotype profiles in
the method. The amplification products may be separated by electrophoresis
in a gel or capillary, or by mass spectrometry. The amount of each allele
present may be determined fluorometrically in a fluorometer, or via
ultraviolet spectrometry. For example, a Beckman Biomek® 2000 Liquid
Handling System can be used to detect and quantitate alleles present for a
locus in a sample. Optical density or optical signal can be used to detect
the presence of an allele after gel or capillary electrophoresis.
Preferably, alleles are detected using an ABI Prism 310 Genetic Analyzer,
or a HITACHI FMBIO II Fluorescence Imaging System (10). The ABI 310
Genetic Analyzer identifies alleles present at a locus, as outlined in
FIG. 2 (see Original Patent), and provides a data output result, as shown
in FIG. 3 (see Original Patent). One advantage of this instrument is that,
in addition to sizing the detected allele signals, the related software
can also display their peak heights and automatically calculate the area
under each peak (10).
The HITACHI FMBIO II Fluorescence Imaging System uses gel electrophoresis
instead of capillary electrophoresis to separate the alleles of a DNA
sample (10). This system requires much more sample and a longer time to
complete a separation. In this genetic analyzer, each allele corresponds
to a specific band in a gel lane. The band size for each allele is
compared with a well-calibrated allelic ladder to identify the
corresponding allele (10).
If the amplification products are input into an apparatus that both
separates and quantitates alleles for a locus in a sample, four different
types of peaks can be obtained from these raw data: true or allele peaks,
stutter peaks, artifact peaks, and pull up peaks.
True or allele peaks are peaks that indicate the presence of an allele at
a locus. The most important characteristic of an allele peak is that the
measured peak area or height is roughly proportional to the mass of the
corresponding allele in the DNA sample (10). Preferably peak area is used.
Stutter peaks are peaks generated by the enzyme's slippage during the
amplification process (12). In most cases, stutter peaks are located on
the left side of the associated alleles, and the gene distance between the
stutter peak and the associated allele peak is usually less than 4 bp
(12). The height of the stutter peak is usually less than 15% of the
height of the corresponding true allele peak (12).
Artifact peaks are peaks due to impurities in the DNA samples. Generally,
the artifact peaks have one or more of the following three
characteristics: (1) about 53% of them are less than 5% of the nearest
allele peak's height (12), (2) some artifact peaks consist of multiple
peaks, and the distances among them are always less than 1 bp (12), and
(3) some artifact peaks are within 0.5 bp of an allelic ladder marker
(12). If a peak satisfies any of the above three rules, the peak can be
defined as an artifact peak, and the peak's effect can be eliminated.
A pull-up peak is a minor peak located directly to the right of a `true`
allele peak, usually at a distance of less than 3/8 bp, and its height is
less than 50% of the major peak.
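The stutter and pull-up descriptions above can be turned into a simple screening rule, sketched here. The thresholds are the "usual" figures quoted in the text (taking the 3/8 bp distance as written) and are exposed as parameters. The `Peak` type and function name are hypothetical, and the artifact-peak rules are not covered.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    size_bp: float  # fragment size in base pairs
    height: float   # signal height in instrument units


def classify_minor_peak(minor, allele,
                        stutter_max_bp=4.0, stutter_max_ratio=0.15,
                        pullup_max_bp=0.375, pullup_max_ratio=0.50):
    """Label a minor peak relative to a neighbouring true allele peak."""
    offset = minor.size_bp - allele.size_bp  # negative: minor peak is to the left
    ratio = minor.height / allele.height
    if -stutter_max_bp < offset < 0 and ratio < stutter_max_ratio:
        return "stutter"
    if 0 < offset < pullup_max_bp and ratio < pullup_max_ratio:
        return "pull-up"
    return "unclassified"


allele = Peak(size_bp=200.0, height=1000.0)
label_left = classify_minor_peak(Peak(196.5, 100.0), allele)
label_right = classify_minor_peak(Peak(200.25, 300.0), allele)
```

A peak that comes back "unclassified" is not necessarily a true allele. It simply does not match either heuristic and would need the other checks described in this section.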
Quantitative peak data of `true` alleles are determined at a locus. These
measurements may be the peak height or peak area of a signal detected by
an instrument or procedure designed to quantify the presence of each
allele. The peak height, peak area, and any other measurement that is
related to the relative masses of each allele present in the original
stain or sample are equivalent. Quantitative allele peak data will be
referred to as "peak height," "peak area," or "quantitative allele peak
data." Each of these terms is interchangeable.
The allele peaks or areas are calculated and analyzed using LSD. LSD then
returns the "best-fit" genotype profiles for the two individuals that
contribute to the sample. "Best-fit" means that, under the assumption that the
allele peak area/height is proportional to the relative mass proportion of
the corresponding DNA allele in the mixture, the returned genotypes at the
specified mass proportions would yield a set of allele peak areas/heights
that is `closest` to the measured set of allele areas/heights, in the
least square sense (as measured by the Euclidean distance metric).
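In symbols, the `closest` criterion can be written as below. The notation is introduced here only for illustration and is not taken from the patent text: b is the vector of measured allele peak areas at a locus, A is an assumed genotype matrix whose entry a_ij is the number of copies of allele i carried by contributor j in a candidate combination, and x is the mass ratio coefficient vector.

```latex
\hat{x} \;=\; \arg\min_{x}\, \lVert A x - b \rVert_2^{2},
\qquad
r \;=\; \lVert A \hat{x} - b \rVert_2 ,
\qquad
\hat{x} \;=\; (A^{\mathsf T} A)^{-1} A^{\mathsf T} b
\quad \text{when } A \text{ has full column rank.}
```

The residual r is the Euclidean distance between the predicted and measured peak areas. The candidate combination with the smallest r is the best fit at that locus.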
The genotype profile assigned to each individual by the LSD method can be
verified by comparing the known genotype profile of one individual that
contributed to the sample to that of one person developed by the LSD
method. The known genotype profile may be obtained from an individual that
is the victim of a crime.
A genotype profile obtained from the LSD method may also be matched to an
individual to identify the individual as potentially having contributed to
the sample. The genotype profile may be matched to the individual after
obtaining a sample from the individual. The genotype profile may also be
matched to an individual by comparing it to other genotype profiles in a
database. The database may be any public or proprietary database that
stores and/or matches genotype profiles. The database may be CODIS, which
may be used to store genotype profiles in a national, state, or regional
collection, and which may separate these profiles into disjoint parts,
such as a convicted offenders database, a forensic DNA database, or a
missing persons database.
A preferred embodiment of the invention is shown in FIG. 4 (see Original Patent).
In this embodiment, LSD is implemented using software running under a
secure web server 1 on a protected network 2 that is isolated from a
public or private network 3 by a firewall 4. A remote user located at
LSD/Database Client station 8 may access the LSD software at the web
server 1 via the public or private network. The communication may be via
the public switched telephone network (PSTN), using known
encryption algorithms for confidential data, but is preferably via a
private network and encrypted. The firewall 4 allows communications with
the secure web server 1 using an encrypted communications protocol such as
the Hypertext Transfer Protocol (HTTP) over a Secure Sockets Layer (SSL).
The firewall 4 connects the protected network 2 to the public or private
network 3 using either an Internet service provider (ISP), leased, or
owned telecommunications equipment/circuits 5 having appropriate bandwidth
capability (although the data may be suitably compressed via known
compression algorithms and transmitted over lower bandwidth facilities).
The connection to the firewall 4 and all connections and equipment
collocated with the protected network 2 are housed in a secure server
facility 6 that provides LSD services to a community of clients located at
forensic laboratories 7 or other organizations. Locations 7, 8, and 9 are shown
by way of example only and are in no way intended to be limited to forensic
laboratory locations.
A client 8 located at a forensic laboratory or other organization may use
the public or private network 3 to gain access to LSD services offered by
the secure server facility 6. Preferably, the client 8 is connected to a
protected network 9 which connects to the public or private network 3
through a firewall 10, and the firewall 10, the protected network 9, and
all equipment connected to the protected network 9, such as the
LSD/Database Client 8, are housed in a secure client facility such as a
forensic laboratory 7 (or other secure facility). The firewall 10 located
at the forensic laboratory 7 connects the protected network 9 to the
public or private network 3 using either an ISP, leased, or owned
telecommunications equipment/circuits 11 having similar bandwidth
considerations as described above for equipment/circuits 5.
The client 8 may make requests to analyze data derived from DNA mixtures
on the secure LSD web server 1 by accessing the secure web server 1,
transmitting DNA mixture data to the secure web server, and receiving
analysis results. These results may then be interpreted using mixture
interpretation guidelines to obtain one or more DNA profiles that may be
associated with a suspect to a crime.
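The client side of this exchange might look like the following sketch, which prepares a request carrying peak data for transmission over an encrypted connection. The server URL, the JSON field names, and the function name are hypothetical, since the text does not describe the server's actual interface. The request is built but not sent.

```python
import json
from urllib.parse import urlparse
from urllib.request import Request


def build_analysis_request(server_url, locus, peak_areas):
    """Build (but do not send) a POST request carrying allele peak data.

    `server_url` and the JSON layout are illustrative assumptions.
    """
    if urlparse(server_url).scheme != "https":
        raise ValueError("peak data should only be sent over an encrypted (https) connection")
    payload = json.dumps({"locus": locus, "peak_areas": peak_areas}).encode("utf-8")
    return Request(
        server_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint; allele names map to measured peak areas.
request = build_analysis_request(
    "https://lsd.example.org/analyze", "D3S1358", {"15": 1200.0, "16": 400.0}
)
```

Passing the resulting object to `urllib.request.urlopen` would transmit the data and return the server's response. Refusing non-https URLs up front keeps confidential peak data from leaving the client unencrypted.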
Optionally, the LSD/Database Client 8 may access a local laboratory,
state, or national DNA database 12 to search for matches to the one or
more DNA profiles formed using the results of the LSD analysis. The DNA
database 12 may be located in a separate secure facility at the state,
local, or national level and is preferentially protected by a firewall 13.
The firewall 13 is connected to the public or private network using either
an ISP, leased, or owned telecommunications equipment/circuits 14, and
preferentially allows communications with a DNA database server 12 using
only an encrypted communications protocol such as HTTP over SSL. The
firewall 13 and DNA database server 12 are connected to a protected
network 15. The connections to the firewall 13 and all connections and
equipment collocated with the protected network 15 are housed in a secure
server facility 16 that provides DNA database services to a community of
clients located at forensic laboratories 7 or other organizations.
Nothing shown in FIG. 4 (see Original Patent) or described above should be
taken to restrict the domain of the invention. For example, the DNA
database server and the secure LSD server may be connected through
firewalls to two separate and isolated public or private networks,
requiring a separate client and protected network located at a forensic
laboratory in order to communicate with each server. This is the case at
present with the FBI's National DNA Index System (NDIS), which is
connected to state and local facilities through the FBI-owned and operated
Criminal Justice Information System's Wide Area Network (CJIS-WAN), and
with the current implementation of the secure LSD server. This server is
located on a protected network within The University of Tennessee's
Laboratory for Information Technologies (LIT) and is connected through a
firewall owned and operated by LIT to the university's campus network and
thence to the public Internet. In this case, the functionality remains the
same, except that an investigator or analyst transfers results obtained by
a client from the secure LSD server to a client computer of the FBI's NDIS
facilities in order to perform a search on the national DNA database.
The invention is not restricted to operation on protected computers and
networks, nor is it restricted to require security of communications using
encryption and secure authentication protocols. However, these measures
are usually necessitated by the privacy laws of the United States and
other countries. In a similar manner, it is not required that the LSD
software, LSD/Database Client, and DNA database software operate on
separate and communicating computers. They may in fact all be installed
and operated on a single computer in some applications, or on two
computers. There may also be multiple instances of the DNA database
software running on several computers. The realities of multiple
jurisdictions and multiple ownership of and responsibility for controlled
access to data that are considered sensitive usually necessitate the use
of multiple computers under the control of independent but cooperating
agencies.
Claim 1 of 34 Claims
1. A method of resolving a mixture
comprising DNA of more than one individual into genotype profiles for
individuals in the mixture comprising: (a) obtaining quantitative allele
peak data for alleles present at a first locus in a DNA mixture comprising
DNA of more than one individual; (b) solving a best fit mass ratio
coefficient vector using data consisting of the quantitative allele peak
data obtained in step (a) for possible allele combinations that can be
contributed by the individuals of the more than one individual at the
first locus; (c) calculating residuals for the possible allele
combinations of step (b); (d) selecting an allele combination from the
possible allele combinations for the individuals at the first locus having
the smallest residual, wherein the smallest residual does not cluster with
the second smallest residual, wherein the allele combination selected
resolves the DNA in the mixture at the first locus into respective
genotype profiles for the individuals; and (e) repeating the steps of
obtaining, solving, calculating and selecting for a second locus.