Mapping techniques
In 1920, the German botanist Hans Winkler first used the term genome, reputedly a fusion of GENe and chromosOME, to describe the entire set of chromosomes and all of the genes contained within an organism. A great deal of progress has since been made in elucidating the complex molecular interactions that underlie cellular function and the syntenic relationships between organisms at the nucleotide level. The basis for these advances was the characterization of the structure of DNA by Watson and Crick in 1953 (1) and the realization that DNA could be decoded to provide a guide to genetic inheritance. This underpinned the concept of genetics and gave scientists the opportunity to explore and quantify the nature and extent of the biological information passed from one generation to the next. The characterization of biological inheritance permitted the elucidation of what was being encoded and how it could determine biochemical function. Finally, extending from the elucidation of the mechanisms behind the inheritance of monogenic diseases, scientists are beginning to grasp how sequence is also involved in complex interactions, occasionally under the influence of environmental factors, that contribute to many (but still not all) diseases. The speed at which the vast amount of human sequence data were generated can be attributed to the evolution of strategies and techniques developed to map and sequence organisms such as bacteria (2), yeast (3), and the nematode worm (4). The availability of such an evolutionarily diverse collection of species, with the addition of the mouse (5) and other complex multicellular organisms, has also enabled comparisons to be made at the nucleotide level.
The first genomes to be characterized were relatively small by current standards (bacteriophage φX174, 5 kb (6,7), and bacteriophage λ, 48 kb (8)), but they provided the underlying techniques and strategies that are being used for the more complex organisms currently being studied. Chain termination sequencing, developed by Sanger (8), is a synthetic method in which nested sets of labeled fragments are generated in vitro by a DNA polymerase reaction. Because the method is highly sensitive and robust, it has been amenable to biochemical optimization, producing long, accurate sequence reads, and also to automation, which was necessary for large-scale application of the technique. In these respects, it differed from the method of Maxam and Gilbert (9), which necessitated production of all of the labeled material prior to chemical degradation to form the sequence ladders of nested fragments. As a result, the Sanger method has remained the technique by which the majority of genomic sequence from a variety of complex organisms is presently being generated (see Fig. 20.1). However, neither method is capable of generating single reads of greater than 200-300 nucleotides, limited in part by the sequence production itself and in part by the ability to separate the sequence by gel electrophoresis at single-base resolution (even today, sequencing read lengths approaching 1 kb are rare). The assembly of larger tracts of DNA therefore required the development of methods to reassemble a consensus sequence from multiple individual reads.
Two approaches were adopted for this: first, the construction of physical maps of restriction fragments, using sequence-specific restriction enzymes to order and orient large segments of DNA from which individual units were selected for sequencing; second, the use of the information gained from each individual sequence read to order and orient each segment relative to its overlapping neighbors, which required the development of advanced computer programs to make the task possible on all but the smallest scale. A further modification of the latter was made by Anderson (10), who developed the random shotgun strategy to elucidate the mitochondrial genome, using a random fragmentation process based on partial DNase I digestion (11). This removed the dependence on sequence-specific restriction enzymes while still relying on sequence-based assembly of contiguous tracts of overlapping reads. The random shotgun approach, in which genomic DNA is randomly sequenced in similarly sized segments and then assembled simultaneously to provide a representation of the genomic template, provided the basis of the strategies used to assemble sequences of large inserts cloned in plasmids, lambda phage, and cosmid vectors (12), and also the later bacterial artificial chromosome (BAC) and P1-derived artificial chromosome (PAC) clones (13). The same random shotgun strategy was adopted to sequence the 1.8-Mb genome of the bacterium Haemophilus influenzae (14). Although the whole-genome shotgun sequencing approach has proven itself to be a successful strategy for the rapid assembly of smaller genomes, there are doubts as to whether this strategy is suitable for assembling the sequence of complex organisms. The generation of a physical map, in which the genome is divided into bacterial clone units of 40-200 kb and assembled into contiguous stretches (contigs) of overlapping clones, is a process analogous to the sequence contig assembly process.
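The sequence-based assembly of overlapping shotgun reads can be illustrated with a minimal sketch. It is not the software used in any of the projects cited here: the function names (`overlap`, `greedy_assemble`), the minimum overlap length, and the toy reads are all assumptions made for illustration, and the reads are taken to be error-free, from the same strand, and free of repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge the pair of sequences with the largest overlap.

    Hypothetical simplification: no sequencing errors, no reverse
    complements, no repeats. Returns the remaining contigs.
    """
    # Remove duplicate reads, then reads fully contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs
               if not any(r != other and r in other for other in contigs)]

    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy example: three overlapping 8-base reads from a 15-base template.
reads = ["ACGTTAGC", "ATGGCGTA", "CGTACGTT"]
print(greedy_assemble(reads))
```

Greedy merging is only one of several possible strategies and is easily misled by repeated sequence, which is one reason the suitability of whole-genome shotgun assembly for complex genomes has been questioned.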
In contrast to sequence assembly, however, the information used to compare individual clones and identify overlaps in a physical map (e.g., in the Caenorhabditis elegans (4) and Saccharomyces cerevisiae (3) genome projects) is a one-dimensional fingerprint prepared by separating restriction fragments from a limit digest of each cloned DNA by electrophoresis. Overlaps between clones were detected on the basis of partially (or completely) shared fingerprint patterns. An alternative approach to identifying overlapping relationships between clones was to test clones for the presence of characterized markers. Overlaps between clones could be identified on the basis that they shared a single-copy sequence, the presence of which was detected using a specific hybridization probe or polymerase chain reaction (PCR) assay. Given a physical map of overlapping clones, individual clones can then be selected from the map to provide maximum genomic coverage with minimal redundancy. These clones permit specific regions to be targeted for further investigation and, in particular, for the determination of the complete DNA sequence separately from other clones within the physical map. Because the source of the genomic sequence is limited to an individual clone, problems encountered with sequence assemblies are greatly reduced compared to the corresponding whole-genome assemblies. At the time of their inception, the physical maps of the C. elegans (4) and S. cerevisiae (3) genomes were constructed to enhance the molecular genetics of the respective organisms by facilitating the cloning of known genes and to serve as an archive for genomic information. However, the data associated with the construction of the clonal physical maps, even with good alignment to the genetic map, carried only a tiny proportion of the information present within the genome (16). Consequently, a minimum tile path of the 30-kb cosmid and 15-kb lambda clones, used to build the physical maps of the C. elegans and S. 
cerevisiae, respectively, was subcloned into M13 phage vectors (1.3-2 kb insert size) and sequenced on a per-clone basis. The physical maps of the two genomes (2,15), and subsequently those of Drosophila melanogaster (17) and human (12), used restriction enzyme fragments in various ways to overlap clonal units for the construction of genome-wide physical maps. © 2008 Humana Press.
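The two clone-level steps described above, detecting overlaps from shared restriction fingerprints and selecting a minimum tile path from a finished map, can be sketched as follows. This is an illustrative simplification rather than the method of any cited project: the function names, the 5-bp band-matching tolerance, the fragment sizes, and the clone coordinates are all hypothetical, and real mapping software scores shared bands statistically rather than simply counting them.

```python
def shared_bands(fp_a: list[int], fp_b: list[int], tol: int = 5) -> int:
    """Count fragments in `fp_a` that match a distinct fragment in `fp_b`.

    Fingerprints are lists of restriction fragment sizes (bp). Two bands
    match if their sizes differ by at most `tol` bp (assumed tolerance).
    """
    remaining = sorted(fp_b)
    shared = 0
    for size in sorted(fp_a):
        for idx, other in enumerate(remaining):
            if abs(size - other) <= tol:
                shared += 1
                del remaining[idx]  # each band may be matched only once
                break
    return shared


def minimal_tiling_path(clones: dict[str, tuple[int, int]]) -> list[str]:
    """Greedy interval cover over clones placed on a map as (start, end).

    At each step, choose the clone that begins within the region already
    covered and extends furthest to the right.
    """
    covered_end = min(start for start, _ in clones.values())
    target = max(end for _, end in clones.values())
    path = []
    while covered_end < target:
        candidates = [(end, name) for name, (start, end) in clones.items()
                      if start <= covered_end < end]
        if not candidates:
            raise ValueError(f"gap in the map at position {covered_end}")
        covered_end, name = max(candidates)
        path.append(name)
    return path


# Hypothetical fingerprints (fragment sizes in bp) for two clones.
print(shared_bands([1200, 850, 430, 300], [1203, 848, 610, 300, 150]))

# Hypothetical clone positions (kb) within a single contig.
clones = {"cA": (0, 40), "cB": (10, 45), "cC": (35, 80),
          "cD": (42, 70), "cE": (75, 110)}
print(minimal_tiling_path(clones))
```

In this toy map, clones cB and cD are redundant: the path through cA, cC, and cE covers the same region, so only those three would be passed on for subcloning and sequencing.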