New insights into the role of intra-tumor genetic heterogeneity in carcinogenesis: identification of complex single gene variance within tumors

Aim: Present cancer hypotheses are almost all based on the concept that the accumulation of specific driver gene mutations causes carcinogenesis. The discovery of intra-tumor genetic heterogeneity (ITGH) has resulted in this hypothesis being modified by assuming that most of these ITGH mutations are in passenger genes. In addition, accumulating ITGH data on driver gene mutations have revealed considerable genotype/phenotype disconnects. This study proposes to investigate this disconnect by examining the nature and degree of ITGH in breast tumors. Methods: ITGH was examined in tumors using next generation sequencing of up to 68,000 reads and analysis tools that allowed for the identification of distinct minority variants within single genes, i.e., complex single gene variance (CSGV). Results: CSGV was identified in the androgen receptor gene in all breast tumors examined. Conclusion: Evidence of CSGV suggests that a selection-centric, as opposed to a mutation-centric, hypothesis could better explain carcinogenesis. Our hypothesis proposes that tumors develop by the selection of preexisting de novo mutations rather than just the accumulation of de novo mutations. Thus, the role of selection pressures, such as changes in tissue microenvironments, will likely be critical to our understanding of tumor resistance as well as to the development of more effective treatment protocols.


Current carcinogenesis hypotheses
The traditional understanding of carcinogenesis, that cancer cells accumulate somatic driver mutations that give them a growth advantage [1], is beginning to be questioned as data reveal the presence of driver gene mutations involved in carcinogenesis in normal tissues [2]. Further, a critical issue still to be elucidated is how these mutations create a gain-of-function in cells that results in them acquiring new oncogenic properties, rather than just the loss-of-function of factors that control cell growth and division. One indication as to why these properties might be more complicated than a simple case of excessive or distorted growth is that cancer genes are generally not over-expressed in the tissues from which the cancer develops [3]. For example, out of 130 highly specific cancer genes, only four are most highly expressed in the tissue from which the cancer originates [3]. Thus, other factors besides protein accumulation are likely to be involved. Compounding this conundrum is the observation that there are often different mutations in different cancer-associated genes in different cancer tissues [1], raising the question as to how these differences are related to the tissue specificity of certain cancer mutations.
Further, in a recent study looking for associations between specific cancer genes and specific cancer tissues, some genes did not behave as expected [1]. The analyses suggested that both cell-intrinsic (i.e., genomic and epigenetic) and cell-extrinsic (i.e., environmental, both internal and external) factors could explain the differences in the cell type-specificity of cancer genes. For example, in breast cancer, specific external environmental factors have included estrogen receptor alpha (ER) activation by estradiol [4] and conversion of estrogen into genotoxic metabolites that can cause DNA double-strand breaks [5]. However, in most cases it has not been possible to associate any specific intrinsic or extrinsic factor with cancer tissue specificity. Underlying these fundamental questions is a growing awareness of substantial amounts of genetic heterogeneity not only within different types of cancer tissues [6], but within single tumor cancer tissues as well. These latter observations have been labelled as intra-tumor genetic heterogeneity (ITGH) [7].

Intra-tumor genetic heterogeneity
ITGH identified within breast tumors has revealed numerous alterations in different genes, with the assumption that most mutations are in "passenger" genes [8], including in studies using single cell sequencing techniques [9]. However, such studies have not drawn many definitive conclusions as to the precise roles of many of the "driver" genes in carcinogenesis. Genes are identified as drivers: (1) if they are either oncogenes or tumor suppressor genes; (2) if they function in some aspect of cell growth; or (3) if they are located close to any of these types of genes [10]. Further, a recent paper noted that passenger genes can also have damaging effects on cancer progression [11].
We believe this confusion is partly due to a failure to investigate the nature and degree of genetic heterogeneity within single genes, a condition that we have labelled complex single gene variance (CSGV), as opposed to just identifying mutations in different cancer-associated genes. This is important because, as natural selection is increasingly being identified as a critical process in cancer biology [12], there needs to be a better understanding of the nature of the genetic variation that is being subjected to selection.

Identification of single gene genetic heterogeneity
Genetic heterogeneity within individual genes has not been studied before partly because the usual approach to identifying gene variants relies on sequence analysis algorithms and tools that make it inherently difficult to identify CSGV. Essentially, they are designed to ignore or minimize the possibility that different mutations of an individual gene can exist in a single person's tissues. The assumption is that finding multiple variants of a single gene within an individual's tissues is highly unlikely, and therefore that any such variants identified are likely the result of either PCR or sequencing errors. Indeed, almost all NGS analyses rely on the use of filters and other techniques, such as sequence alignment tools, to remove such variants [13]. These techniques further reduce the possibility of finding multiple mutations within an individual gene, as some are likely to be at very low frequencies and will be present in only a small minority of cells within an individual tumor, as noted in a recent review of post-zygotic somatic mosaicism [14]. Therefore, one of the challenges of the study was to develop a sequencing analysis approach that allows for the identification of CSGV. Further, an important practical consideration for identifying CSGV is that it is increasingly becoming apparent that not every driver gene mutation produces a cancer phenotype, with some driver mutations even being present in non-cancer tissues [15,16]. In the present study, we have used a sequencing approach that makes it easier to detect multiple mutations of the androgen receptor gene (AR) within individual breast tumors.

Androgen receptor and breast cancer
In the case of breast cancer (BC), the AR is more widely expressed than either the estrogen receptor (ER) alpha or progesterone receptor (PR) genes, and so it is not surprising that the AR has become a significant marker in defining BC subtypes [17]. The AR has therefore started to be singled out as a possible therapeutic target, particularly in triple-negative [ER-/PR-/human epidermal growth factor receptor 2 (HER2)-] BC (TNBC) [18,19]. Indeed, a large cohort study reported AR expression in 32% of TNBC cases [20]. In another study examining cases of ER-positive breast carcinoma, tumor cells changed after treatment from ER-dependent to AR-dependent, possibly explaining why such cells become resistant to aromatase inhibitor treatment [21]. At present, most studies have focused on AR expression during different BC stages, and, indeed, AR expression has been identified as a possible critical marker in predicting BC survival [22]. While androgen-based therapeutics have been used for over 50 years to treat BC [23], the authors believe that to truly exploit potential AR-related mechanisms to provide clinical therapeutic benefits, a more detailed understanding of AR variant distribution and frequency in BC tissues, i.e., AR CSGV, both before and throughout carcinogenesis, will be required.
Further, examining CSGV occurrence in other critical driver genes may help resolve the genotype-phenotype disconnects between the mutational status of putative cancer-associated genes and the occurrence and progression of cancer. If it is assumed that somatic clonal evolution is the mechanism driving carcinogenesis, then tissue microenvironments need to be able to select from different variants of individual genes, as the presence of a single variant would not allow cells and tissues sufficient flexibility to adapt to the different selection pressures produced by different tissue microenvironments. Further, the ability to collect such data about all potential driver genes may well provide new insights into resistance to treatment as well as into treatment failures.

Laser capture microdissection and DNA extraction
Frozen tumors were obtained from a breast cancer tissue bank [Table 1] that had been set up with all the required experimental permissions and vetted by the Jewish General Hospital's ethics board. Histological sections 5-7 µm thick were prepared and stained using a standard hematoxylin/eosin protocol. To ensure the maximum purity of the cancer samples, following histopathological characterization by an expert pathologist, cells from tumor areas were dissected by laser capture microdissection (LCM) using an AutoPix 100 (Molecular Devices, Sunnyvale, CA). An average of 2500 cells was dissected from each section. Genomic DNA was extracted from the cells using a QIAamp DNA Micro kit (QIAGEN, Germantown, MD) following the manufacturer's directions.

PCR amplification
Amplification of AR exons was carried out using the Fast Start High Fidelity PCR kit (Roche, Indianapolis, IN). PCR products were generated using 36 different pairs of fused primers designed to flank the AR sequences of exons 4-8, the region of the AR that has been shown to contain a high proportion of mutations, including those associated with cancer [24]. The primers also included the sequence of introns 3-8 [Table 2]. Each primer consisted of a 5' overhang of 19 bp, a 3 bp patient-specific barcode, and a 20-27 bp AR-specific sequence. The 5' overhang was used to facilitate emulsion PCR (em-PCR) and sequencing. The 3 bp barcode facilitated sample identification post sequencing, by allowing the pooling of different DNA samples for em-PCR. To ensure consistency, three separate PCR preparations were made for each of the samples.
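The fused primer layout described above (19 bp overhang, 3 bp barcode, AR-specific sequence) implies that pooled reads can be assigned back to samples by reading the barcode at a fixed offset. The following Python sketch illustrates that idea only; it is not the software used in the study. The function names, the example barcodes and the assumption that the 19 bp overhang is still present at the start of each read are all hypothetical (set `overhang_len=0` if the overhang has already been trimmed by the sequencing software).

```python
from collections import defaultdict

OVERHANG_LEN = 19  # 5' overhang length stated in the text
BARCODE_LEN = 3    # patient-specific barcode length stated in the text


def split_read(read, overhang_len=OVERHANG_LEN, barcode_len=BARCODE_LEN):
    """Return (barcode, remainder) for a read that starts with the fused primer.

    Assumption: the read begins with the overhang, followed by the barcode.
    """
    start = overhang_len
    end = overhang_len + barcode_len
    return read[start:end], read[end:]


def demultiplex(reads, barcode_to_sample, overhang_len=OVERHANG_LEN):
    """Group reads by sample; reads with unknown barcodes are discarded."""
    by_sample = defaultdict(list)
    for read in reads:
        if len(read) < overhang_len + BARCODE_LEN:
            continue  # too short to contain a barcode
        barcode, remainder = split_read(read, overhang_len)
        sample = barcode_to_sample.get(barcode)
        if sample is not None:
            by_sample[sample].append(remainder)
    return dict(by_sample)


# Hypothetical barcodes and reads, for illustration only.
example_barcodes = {"ACG": "T1", "TGA": "T2"}
example_reads = [
    "A" * 19 + "ACG" + "TTGCA",
    "C" * 19 + "TGA" + "GGGG",
    "A" * 19 + "NNN" + "TT",  # unknown barcode, dropped
]
demultiplexed = demultiplex(example_reads, example_barcodes)
```

With only 3 bp there are at most 64 distinct barcodes, and a single sequencing error can turn one valid barcode into another, so a real pipeline would probably want barcodes that differ from each other at more than one position.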

Ultra-deep pyrosequencing (next generation sequencing)
After conventional PCR amplification, the DNA from each sample was quantified by PicoGreen® dsDNA Assay (Invitrogen, Carlsbad, CA) and pooled equimolarly. For optimal em-PCR, the theoretical distribution ratio of beads to ssDNA is 1:1 for the clonal amplification. Based on this ratio, the initial eight em-PCR reactions were performed to determine the optimal ratio for em-PCR, based on bead recovery percentage (which was between 10% and 15%). After the em-PCR reaction, the micro-reactors were broken and the beads captured by filtration. The biotin-labeled amplicon-positive beads were enriched using Streptavidin magnetic beads and the DNA was then rendered single-stranded. The DNA beads were pre-incubated with DNA polymerase, sequencing primer and single strand binding protein (SSB), and then distributed into the wells of a PicoTiterPlate™ optical faceplate (454, Branford, CT) that contained 1.6 million wells. After adding the DNA beads and enzymatic beads (ATP sulfurylase and luciferase), the packing beads were layered onto the wells and the plate centrifuged for bead deposition. Signal processing and base-calling were performed using the software package from 454 (Branford, CT) [25].
The sequence reads that passed quality control were aligned to the AR reference sequence (NM_000044.2; Homo sapiens androgen receptor, transcript variant 1, mRNA) using a BLAST-based approach to determine the direction of each read; exons 4-8 were examined. Because 454 sequencing technology is known to generate PCR and sequencing errors [26], special care was taken with homopolymeric regions, which can generate spontaneous insertions/deletions. However, as the study only sequenced exons 4-8 of the AR, which do not contain any homopolymeric regions, such errors were unlikely to be a problem.

Sequence analysis
The sequencing data were aligned using MAFFT version 7.050, a multiple sequence alignment program. The data were then filtered by read length: only reads of the expected length were retained, with the mode of the lengths of all reads taken as the expected length. Since sequencing errors are known to depend on position within the read, with more errors occurring near the end of each read, we further filtered the data by retaining only the sequence between the 5th and 150th bp. All variants in the data sets were then identified.
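The two filtering steps described above (keeping only reads of the modal length, then retaining positions 5 to 150) can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, the sketch works on plain strings rather than MAFFT output, and reading "between the 5th and 150th bp" as 1-based and inclusive is an assumption.

```python
from collections import Counter


def modal_length(reads):
    """Most common read length; ties resolve to the length seen first."""
    return Counter(len(r) for r in reads).most_common(1)[0][0]


def filter_and_trim(reads, start=5, end=150):
    """Keep reads of the modal length, then keep positions start..end.

    Positions are 1-based and inclusive (an assumption about the text).
    """
    if not reads:
        return []
    expected = modal_length(reads)
    kept = [r for r in reads if len(r) == expected]
    return [r[start - 1:end] for r in kept]


# Toy example with short reads and a correspondingly small window.
toy_reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "ACGT"]
trimmed = filter_and_trim(toy_reads, start=2, end=5)
```

Filtering on the modal length removes every read that carries an insertion or deletion, so this design choice restricts the downstream variant list to substitutions.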

RESULTS
The samples were analyzed by ultra-deep sequencing at a depth of up to 68,000 reads for each sample [Table 3].

Do CSGVs really exist?
Before discussing the results, it seems reasonable to address the controversy as to whether intra-tissue genetic heterogeneity really exists, particularly as it has been identified not just within tumors, but within normal tissues as well [27,28]. Indeed, questions have been raised as to the possible role of methodological errors in generating genetic heterogeneity in both tumors [29] and tissues in general [30]. To address these questions, it is important to discuss the sequence analysis tools used in our NGS protocols. In traditional sequencing approaches, coverage is based on genome mapping approaches, which use a theoretical redundancy in coverage based on the expression LN/G, where L is the read length, N is the number of reads and G is the haploid genome length [31]. Unfortunately, many factors can result in unequal coverage that produces gaps or much lower coverage than expected [32]. Further, problems such as the choice of alignment algorithm mean that even the best mapping algorithms cannot align all reads to a reference genome [33]. As the cost of sequencing has come down, the depth of sequencing has increased, and this has had a profound effect on the sensitivity of sequencing and the ability to detect rare mutations accurately [34]. Experimental data have confirmed that the major factors that influence detection sensitivity are read depth and experimental precision [34]. Indeed, it would appear possible to accurately detect mutations at a frequency as low as 0.1%, provided there is sufficient read depth [34]. Somewhat surprisingly, the use of filters to eliminate false reads, etc., does not necessarily prevent low frequency mutations from being detected [35]. Indeed, if used correctly, filters can in fact enhance the ability to detect low frequency mutations, and in cases of tumor genetic heterogeneity such an ability is likely to be extremely important [35]. In the present study, we believe we have adopted a sufficiently precise sequencing technique to allow a 0.1% cutoff value to be used to identify the mutations present in our breast tumor samples.
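To make the numbers in this section concrete, the Python sketch below evaluates the coverage expression LN/G and applies a 0.1% frequency cutoff to per-variant read counts. It is a minimal illustration under stated assumptions, not the analysis pipeline used in the study: the function names and the example counts are hypothetical, and treating the cutoff as inclusive (a variant at exactly 0.1% is retained) is an assumption.

```python
import math
from fractions import Fraction

CUTOFF = Fraction(1, 1000)  # the 0.1% cutoff quoted in the text


def expected_coverage(read_length, n_reads, genome_length):
    """Theoretical coverage LN/G (L read length, N reads, G target length)."""
    return read_length * n_reads / genome_length


def min_supporting_reads(depth, cutoff=CUTOFF):
    """Smallest read count whose frequency reaches the cutoff at this depth."""
    return math.ceil(depth * cutoff)


def variants_above_cutoff(counts, depth, cutoff=CUTOFF):
    """Return {variant: frequency} for variants at or above the cutoff."""
    return {
        variant: count / depth
        for variant, count in counts.items()
        if Fraction(count, depth) >= cutoff
    }


# At the maximum depth reported in the study (68,000 reads), a variant
# needs at least this many supporting reads to reach 0.1%.
reads_needed = min_supporting_reads(68_000)

# Hypothetical variant counts, for illustration only.
toy_counts = {"variant_a": 68, "variant_b": 67, "variant_c": 6800}
retained = variants_above_cutoff(toy_counts, depth=68_000)
```

The cutoff is held as an exact fraction so that the comparison at the 0.1% boundary does not depend on floating-point rounding. At lower depths the same cutoff corresponds to far fewer supporting reads (10 reads at a depth of 10,000), which is why read depth limits how low a frequency can be called with confidence.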

Importance of identifying changing frequencies of driver gene variants during carcinogenesis
At present, identification of ITGH has been based solely on whether specific driver gene variants are present within cancer tissues, and their frequencies have generally not been assessed. This is because it has been assumed that such variants are present in most tumor cells and are therefore responsible for the cancer phenotype, so that ITGH just reflects the complex genetic makeup of individual tumors while the basic mutation-centric paradigm still applies. However, evidence that driver gene mutations can also be present in normal tissues has considerably confused the role of these driver genes in carcinogenesis. We believe that identifying cases of CSGV is likely to be helpful in resolving the phenotype/genotype disconnect, because the data will reveal the actual frequency of the variants and put them in context within a tumor. In a previous study examining an AR CAG repeat length polymorphism in breast tumors, changes in the frequency of these polymorphisms in normal and cancer tissues from individual tumors, as well as in matching blood samples, were investigated. This revealed the distribution frequencies of different length AR CAG repeat variants associated with carcinogenesis [6]. A similar approach applied to analyzing driver gene CSGV is likely to give further information to help elucidate the significant genetic events of carcinogenesis. Clearly, the presence of CSGV within cancer tissues clashes with our present understanding that carcinogenesis is the result of "purifying" selection pressure on single gene variants in a tumor that will eventually lead to the removal of all the non-selected variants of that gene [36]. This argument in turn has been used to justify being satisfied with the identification of a single variant per gene, and therefore ignoring any other low frequency variants within the same gene, on the assumption that they must be artifacts, possibly due to PCR or sequencing errors. The recognition that a selection of different single gene variants can remain in individual tumors is clearly not in line with our present understanding of the occurrence and distribution of cancer mutations. However, our present results question the validity of this understanding, as CSGV was identified in the AR within all 6 breast tumors examined, and they suggest that the role of mutations in carcinogenesis is more complex than previously thought.

How can identifying CSGV help in understanding treatment resistance?
First, identifying CSGV suggests a mechanism to explain how some tumors can rapidly become resistant to treatment, by proposing the existence of genetic variants that can be selected for in genes that have been targeted by chemotherapy. Indeed, the selection of such variants could be a response to ensure the survival of cells that contain the targeted gene, as postulated by the atavistic model [37], which considers resistance of cancer cells to treatment as one of their major characteristics. Second, it places much more emphasis on understanding the role of selection pressures generated by different tissue microenvironments on carcinogenesis [38,39]. It also suggests that analyzing the makeup of tissue microenvironments may facilitate the recognition of specific factors involved in the selection of cancer-associated variants.

A different paradigm to explain carcinogenesis
The principle of "parsimony" has underwritten our understanding of science since the middle of the 19th century by telling us to choose the simplest scientific explanation that fits (all) the observed evidence. In studying the genetics of cancer, this has been reflected in our belief that identifying common gene mutations present in tumor tissues is one of the keys to understanding the ontology of solid tumors. However, the validity of this concept is being challenged by accumulating evidence of genetic diversity within individual tumors, which this study has further expanded by revealing evidence of AR CSGV in breast tumors. As noted previously, current cancer hypotheses are almost all based on the concept that accumulation of specific de novo individual driver mutations within specific tissues can result in carcinogenesis. However, the lack of a consistent relationship between driver mutations and cancer types, and the discovery of many different driver mutant genes within the same types of cancer tissues, have resulted in complex genetic profiles. These have effectively meant that many of these driver gene mutations have been reduced to risk factors, albeit with significant clinical implications, rather than gene mutations that are directly responsible for carcinogenesis.
Interestingly, such a lack of phenotype/genotype precision has been found not just in multifactorial diseases such as cancer, but in locus-specific genetic disorders as well. For example, in certain locus-specific diseases a significant number of individuals that exhibit the disease phenotype do not have a mutation in the putative disease-causing gene, such as in the case of androgen insensitivity syndrome [24] and phenylketonuria (PKU) [40]. Further, a review of genotype-phenotype relationships in a wide range of genetic diseases has revealed many cases of reduced or even zero penetrance [41]. In addition, whole genome sequencing studies have found individuals that carry well known disease-causing gene mutations but do not exhibit the disease phenotypes [42], including mutations in cancer-associated genes in healthy individuals [43].
Other recent evidence has further complicated the genetics of cancer by revealing the effect on cancer phenotypes of processes such as epigenetic regulation, DNA and RNA editing, cellular differentiation hierarchies, gene expression stochasticity and protein-protein interactions [44]. However, their roles are not well defined at present, as in many cases these factors are analyzed as separate events, rather than by studying their integrated effect on the selection pressures of the complete tissue microenvironment [45].
One possible hypothesis we have previously proposed is that, while intra-tissue genetic heterogeneity may provide the genetic underpinnings for carcinogenesis, it is tumor microenvironment selection pressure on preexisting de novo mutations that is the carcinogenic trigger, rather than just the accumulation of de novo mutations [46]. We have further postulated that these mutations occur early in human embryogenesis [45], as has now been suggested in another recent study [47].
We believe that this hypothesis is supported by the presence of genetic heterogeneity in both cancer and normal tissues, as well as by the evidence that non-genomic, often environmental, factors are risk factors for cancer. Indeed, the complexity of post-zygotic variation [14] has only added to the importance of variant selection due to environmental factors within tissue microenvironments in determining cancer phenotypes [48]. A detailed examination of the arguments favoring a selection-centric paradigm has been given in a recent paper [49], and these arguments are further strengthened by the identification of AR CSGV in breast tumors.

How the identification of CSGV could affect approaches to cancer treatment
Based on the many cases of individual-gene genetic heterogeneity that have recently been identified in normal as well as cancer tissue, it seems reasonable to believe that CSGV is also likely to occur in normal tissue. The presence of multiple variants within single genes at low frequencies in normal tissue and cells, prior to the tissue becoming cancerous, would further strengthen the selection-centric paradigm of carcinogenesis. This paradigm could also better explain many observations in which environmental factors that are clearly non-mutagenic, e.g., diet and exercise, can somehow direct mutations in specific "driver" genes [50]. Thus, "healthy" lifestyle factors can result in the selection of environments that are "cancer resistant", while other environments identified as "cancer causing", which are often man-made, can lead to cancer [51]. CSGV could then simply explain a "cancer resistant" environment as one that selects for pre-existing wild-type gene variants and a "cancer causing" environment as one that selects for pre-existing oncogenic gene variants.
Based partially on the principle of parsimony discussed previously, the success of a species, tissue or cell type has always been considered to eventually result in that species, tissue or cell type eliminating the competition. However, in the case of CSGV this clearly does not seem to be the case: while some gene variants may not be selected, they are not eliminated entirely either. Thus, in the case of cancer, just destroying the cancer cells without changing the conditions that allow them to be preferentially selected is likely to allow other cancer cells with different gene variants to eventually be selected, as the environmental conditions that selected cells with oncogenic properties have not been altered. Our present approach to cancer treatment, that of removing cancer cells, does not of course preclude the possibility of cancer recurring. However, the presence of CSGV would suggest an approach to cancer treatment that, in addition to removing the cancer, would also seek to select for the normal tissue and cells that are always present within cancer tissues, although normally only as a very small minority of cells. This new treatment approach would therefore require that cancer tissue microenvironments be returned to conditions that would once again select for normal cells, although this is clearly not a simple task.
Recently, more attention has started to be given to the carcinogenic role of the tumor microenvironment in both tumorigenesis [52] and differential tissue responses to therapy [53]. These studies have begun to analyze and reveal some of the tumor micro-environmental factors that may play a critical role in carcinogenesis. Naturally, these data are also likely to help reveal the tissue micro-environmental properties within normal, non-cancer tissues. However, our understanding of what constitutes tissue-specific microenvironment conditions is still very incomplete. Also, it is highly likely that individuals will have their own set of micro-environmental, chemical and biological conditions, so it will be necessary to analyze their tissue microenvironments in considerable detail. Clearly, cells and tissues exist in complex three-dimensional environments, which include both extra- and intracellular environments. To analyze these microenvironments, new technologies are being developed, including atomic force microscopy [54], quantitative extracellular matrix proteomics [55] and single cell multiomics [56]. These are being used to create complex databases of tissue micro-environmental factors that will hopefully facilitate the identification of those significant factors that allow for the selection of normal as opposed to cancer cells.
However, at first glance there appears to be the same underlying problem with this approach as the one that has characterized attempts to analyze the genomic and post-genomic events that cause cells to become oncogenic, namely, the inability to identify the critical oncogenic events involved because we can only measure conditions before and after a cell becomes cancerous. However, the tissue micro-environmental conditions that result in normal cells being selected do not suffer from this drawback, as normal cells remain dominant in tissue over relatively long periods of time, presumably because they are subject to relatively consistent tissue micro-environmental conditions. Nevertheless, it is important to note that tissue microenvironments are likely to be highly individualized, so that even within an individual, different tissue microenvironments might exist around different tissues.

Conclusion
Before the discovery of ITGH and now CSGV, the novel approach to cancer treatment that we are suggesting would never have been considered. However, if it is proven that cancer-associated genes within tumors as well as normal tissue consistently exhibit CSGV, then a treatment approach that includes the goal of reselecting normal tissues by adjusting the tissue microenvironment would seem to be the logical way to ensure that cancer treatments finally result in the permanent elimination of cancer.

DECLARATIONS
The primers sequencing coverage includes introns from 3 to 8

Figure 1. AR exonic mutations present in each of the tumor samples. T- refers to individual tumor samples. AR refers to the codon within which mutations were found.

The analyses revealed 53 exonic mutations [Table 4]. These included 20 mutations in exon 4, 11 mutations in exon 5, 10 mutations in exon 6, 4 mutations in exon 7, and 8 mutations in exon 8. It was noted that a significant number of the mutations (18 out of 53) had previously been identified as associated with either androgen insensitivity syndrome (AIS) (11 mutations) or prostate cancer (7 mutations). Twenty-one mutations occurred in several of the tumor samples, with 4 of the mutations occurring in at least 4 of the tumor samples. The distribution of the mutations in each tumor was unique, resulting in a different set of AR variants being present in each of the tumors [Figure 1].