Canadian medical essay writing: Epidemiology of Squamous Cell Carcinomas of the Head and Neck

Published: 2022-05-31 13:15:42 Editor: zeqian1013

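This article is a sample Canadian medical essay titled "Epidemiology of Squamous Cell Carcinomas of the Head and Neck". The sub-sections below present current knowledge on the epidemiology of squamous cell carcinomas of the head and neck, with special reference to Canada and India; the role of risk factors such as tobacco and alcohol consumption and the specific genetic polymorphisms involved in their metabolism; human papillomaviruses (HPV); and socioeconomic position (SEP). They are followed by a brief description of life-course epidemiology, case-control study design, the counterfactual causal framework and directed acyclic graphs.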

1 Introduction (to be written)

2 Literature review

The following sub-sections present current knowledge regarding the epidemiology of squamous cell carcinomas of the head and neck (SCCHN) with special reference to Canada and India, the role of risk factors such as tobacco, alcohol consumption and specific genetic polymorphisms involved in their metabolism, human papillomaviruses (HPV) and socioeconomic position (SEP), followed by a brief description of life-course epidemiology, case-control study design, counterfactual causal framework and directed acyclic graphs.

2.1 Squamous cell carcinomas of the head and neck (SCCHN) – Definition

Malignant tumours arising from the squamous cells that line the mucosal surface of the oral cavity, pharynx and larynx [C00-C14 and C32 under the International Classification of Diseases (ICD) 10 classification] are commonly referred to as squamous cell carcinomas of the head and neck (1). Histologically, more than 90% of cancers of the oral cavity, pharynx and larynx are of squamous cell origin (2).

2.2 Epidemiology of SCCHN

SCCHN are a heterogeneous group of cancers that differ in distribution, predisposing factors, diagnostic workup and management strategies. According to Globocan 2012 statistics, SCCHN accounted for approximately 599,500 incident cases worldwide, making them the 7th most common cancers in incidence (3.8% of cases) (3). Most of these cancers affect males (70.8%) and are diagnosed above 60 years of age (4). The sub-site with the highest cancer incidence is the oral cavity (300,373), followed by the larynx (156,877) and pharynx (142,387) [Age standardized incidence rates (ASIR) per 100,000 population: oral cavity=4, pharynx=1.9, larynx=2.1]. Globally, these cancers were the 8th most common causes of cancer mortality (3.6% of cases), and were responsible for 300,000 deaths in 2012 (3).

There is wide variation in the geographic distribution of SCCHN incidence across the globe (4, 5). Approximately two-thirds of the burden of incident SCCHN cases is borne by developing countries, with India accounting for 25% of new cases and 35% of deaths occurring worldwide (3). In 2012, approximately 142,000 new SCCHN cases were reported in India, accounting for 30% of all incident cancer cases in this country (6). There has been a rapid increase in the incidence of these cancers, specifically oral cancers, in India. A comparison of Globocan 2008 and 2012 reveals that oral cancer surpassed lung cancer in a span of four years to become the 3rd most common cancer in this country after breast and cervical cancers (3, 7).


In developed countries such as Canada, SCCHN accounts for 3% of incident cancer cases (3). An increase in the incidence of SCCHN from 3,000 new cases in 1990 to an estimated 5,650 new cases in 2016 has been reported, accounting for 1,650 deaths in this country in 2016 (8). According to Canadian Cancer Statistics 2016, a significant decrease in the incidence rate of oral cavity cancers was noted in males between 1992 and 2003, after which the rates became relatively stable (8). Rates among females did not change significantly between 1992 and 2012. In contrast, the incidence rate of pharyngeal cancers has increased significantly in both males and females since the mid-1990s. In males, the incidence of pharyngeal cancers surpassed that of oral cavity cancers in 2001 while in females, the incidence of oral cavity cancers continues to be higher than that of pharyngeal cancers (8).

A comparison of SCCHN incidence between India and Canada (Table 1) based on Globocan 2012 estimates shows that the age standardised incidence rates (ASIR) for SCCHN overall and nearly all subsites for both males and females are higher in India than in Canada  (9).

Table 1: Comparison of SCCHN incidence in Canada and India (Globocan 2012)

Type of cancer                      Canada                India
                                    Males     Females     Males      Females
SCCHN incidence (total numbers)     3,394     1,347       108,477    32,663
ASIR per 100,000 population
    SCCHN                           11.8      4.2         20.9       6.1
    Oral                            5.5       2.9         10.1       4.3
    Larynx                          3         0.6         4.6        0.5
    Pharynx                         3.2       0.8         6.3        1.3

ASIR: age standardised incidence rates. Age standardization was performed using the direct method and the World standard population as proposed by Segi (10) and modified by Doll et al (11).
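The direct method mentioned in the table note can be sketched in a few lines of Python. This is a minimal illustration, not the Globocan procedure itself: the three age bands, the case counts, the person-years and the standard-population weights below are all hypothetical values chosen for easy arithmetic, and the function name `direct_asir` is an assumption of this sketch.

```python
def direct_asir(cases, person_years, std_weights):
    """Directly age-standardised incidence rate per 100,000.

    Each age-specific rate (cases / person-years) is weighted by the share
    of that age band in a standard population, then the weighted rates
    are summed.
    """
    if not (len(cases) == len(person_years) == len(std_weights)):
        raise ValueError("all inputs must have one entry per age band")
    total_weight = sum(std_weights)
    weighted_sum = sum(
        (c / py) * w for c, py, w in zip(cases, person_years, std_weights)
    )
    return 100_000 * weighted_sum / total_weight


# Hypothetical data for three age bands (young, middle-aged, old).
cases = [10, 120, 300]
person_years = [1_000_000, 800_000, 400_000]
std_weights = [60, 30, 10]  # hypothetical standard population, NOT Segi's

asir = direct_asir(cases, person_years, std_weights)
crude = 100_000 * sum(cases) / sum(person_years)
print(f"crude rate: {crude:.1f}, standardised rate: {asir:.1f} per 100,000")
```

In this made-up example the standardised rate is lower than the crude rate because the hypothetical standard population is younger than the study population. Removing this kind of age-structure effect is what makes rates from countries with different age structures comparable.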


SCCHN have a significant impact on the quality of life and psychosocial health of patients and impose a considerable economic burden on their families (12, 13). In the US, the incidence of suicide among patients with SCCHN is more than three times that of the general population (14). Most of these suicides have been reported to occur within the first 5 years after diagnosis and have been attributed to adverse effects on patients' quality of life and the resulting psychological distress, which may last for decades after successful treatment. The overall 5-year survival rates for SCCHN are low and vary by cancer sub-site, from 35% for oral to 65% for laryngeal cancers (6, 15). Multiple primary tumours developing at the cancer site and a high rate of secondary tumours compared to other malignancies contribute to this poor prognosis, which has not changed over the past 30 years (16-18). Although the majority of SCCHN can readily be accessed for visual and tactile examination (e.g., oral cavity cancers), 60% of patients in North America are diagnosed at stage III or IV (19). In India, up to 80% of patients present with advanced disease (6). This situation may be attributed to diagnostic delay (failure by patients and/or professionals to recognize early signs and symptoms of cancer, delay in accessing professional care) and to the lack of diagnostic tools with high sensitivity and specificity for the early detection of clinical disease (19, 20). Severe functional and esthetic sequelae, especially for cases diagnosed at late stages, have been reported following treatment for SCCHN. According to a 2007 study, the mean per-patient cost of managing oral cancers in the UK in the first year following diagnosis is USD 3,500 for pre-cancer and USD 25,000 for stage IV cancer patients (21). In North America, SCCHN are responsible for approximately USD 2.8 billion per year in productivity loss (21). For these reasons, SCCHN have been recognised as a major public health problem in both developed and developing countries.

2.3 Risk factors for SCCHN

SCCHN are complex diseases with multi-factorial aetiology. The discrepancy in the geographic distribution of their incidence has been attributed to variations in the risk factors involved in different locations (5). In developed countries, approximately two-thirds of SCCHN cases are attributed to tobacco smoking and alcohol consumption (4, 22-24) and about 17%-56% of cases may be due to high risk HPV infection (4, 25-27). In developing countries, such as India and most parts of South Asia, paan chewing is the strongest risk factor (4, 5, 28). Other risk factors include social (e.g., SEP) and psychosocial variables (e.g., acute life events, work stress, depression) (12, 29-33), familial associations (34-41), diet, sexual behaviour, infection and oral/periodontal health related factors (4, 5, 42-45). The sections below describe in detail the risk factors for SCCHN; special emphasis is given to tobacco smoking, alcohol consumption, genetic variations (polymorphisms and copy number variations) and SEP because they are central to this dissertation.

2.3.1 Tobacco use and alcohol consumption

2.3.1.1 Tobacco use

Tobacco use is the strongest risk factor for SCCHN. Among the various forms of tobacco consumption (e.g., smoking, chewing and snuffing), smoking (e.g., cigarettes, pipes, cigars, bidi, hookah, chutta, chillam) is the most common (46-48). In its smoked form, tobacco was first used in pipes and cigars, and later as bidis (especially in South Asia), followed by cigarettes in the latter half of the nineteenth century (49). Selected characteristics of cigarettes, cigars, pipes and bidis, including nicotine content, are provided in Table 2 (49-51).

Table 2: Selected physical characteristics of cigarettes, cigars, pipes and bidi

About 50% of men and 9% of women in developing countries, and 35% of men and 22% of women in developed countries smoke tobacco in the form of cigarettes (47). In 2013, the average daily cigarette consumption was 15.2 and 12.5 for male and female smokers respectively in Canada (52). Among the provinces, Quebec reported the highest daily cigarette consumption, at 15.6 overall (males=16.5, females=14.5) (52).

In India, approximately 35% of adults use tobacco in some form (47). Chewing paan/betel quid (a combination of tobacco, areca nut and slaked lime wrapped in a betel leaf) is one of the most common forms of tobacco use in India among both males and females (53-56). The prevalence of tobacco smoking in India is around 14% and is much higher in males than in females (24% vs 3%) (47). Bidi is the most commonly used smoking product (used by 9% of adults), followed by cigarettes (6%) (47). One bidi produces more nicotine, carbon dioxide, tar, alkaloids and potential carcinogens than a regular cigarette (57-60).


2.3.1.2 Tobacco use and risk for SCCHN

The International Agency for Research on Cancer (IARC) first reported the positive association of tobacco use and alcohol consumption with SCCHN risk in 1985 and 1988, respectively (61, 62). Approximately 69 chemicals identified in tobacco smoke contribute to tumourigenesis, including 10 that are identified as Group 1 human carcinogens by the IARC (63). The most important of these carcinogens, which have also been causally linked to SCCHN, are volatile nitrosamines [e.g., NDMA (nitrosodimethylamine), NEMA (nitrosoethylmethylamine)], nitrosodiethanolamine (NDELA), tobacco specific nitrosamines (TSNA) [e.g., 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK) and N-nitrosonornicotine (NNN)], polycyclic aromatic hydrocarbons (PAH) (e.g., benzo[a]pyrene, benz[a]anthracene), aromatic amines, benzene and volatile aldehydes (e.g., acetaldehyde, formaldehyde) (28, 63, 64).

The oral cavity, pharynx and larynx are directly exposed to tobacco smoke, compared to other sites such as the lungs (49, 51, 57). In the West, approximately 45% of SCCHN cases in men and 75% of cases in women have been attributed to tobacco smoking (24). It is independently responsible for a quarter of SCCHN cases in non‐alcohol users (24, 65) worldwide, and 60%-90% of deaths from SCCHN in North America (66). The cigarette is the most common form of smoking and thus it is the main route of delivery of tobacco related carcinogens in most countries. An IARC review documented magnitudes of average relative risks ranging between 4 and 10 for SCCHN risk for ever smokers relative to never smokers (49). However, cigar and pipe smoking may deliver equivalent or higher doses of carcinogens compared to cigarette smoking. Indeed, the largest pooled analysis thus far of 19 studies on SCCHN reported that risk estimates for individuals who smoked only cigarettes, only cigars and only pipes were 3.93, 3.49 and 3.71, respectively, compared to non-smokers (67). Individuals who smoked various combinations of these products were also at approximately 2.5 to 3.5 times the risk for these diseases (67).

The majority of SCCHN cases in India and many Asian countries are attributable to paan/betel quid chewing (4, 55, 56, 68-70). The carcinogenic effect of paan chewing is complex, as it results from an interaction between carcinogens in tobacco, arecoline (the main alkaloid in areca nut) and the increased alkalinity of the oral mucosa due to slaked lime (61, 68, 71-73). In Southern India, approximately 50% of cases in men and 90% in women are attributable to frequent and long-term paan chewing (54). A recent meta-analysis in India reported a 5-7 times higher risk for oral cancers among chewers compared to non-chewers (70). Bidi smoking, reported to deliver approximately 1.5 times the carcinogens of commercial cigarettes, also significantly increases the risk of SCCHN (57). Studies, including meta-analytical reviews, report a 2-7 times increased risk among bidi smokers compared to non-smokers (55, 56, 74, 75). However, evidence on the association between filtered cigarette smoking and SCCHN in India is mixed. Both case-control and longitudinal studies report little or no association between this exposure and outcome (45, 55, 76, 77).


Multiple measures of tobacco use (e.g., frequency, duration, cumulative consumption) have been associated with SCCHN risk, with studies reporting linear or non-linear dose-response relationships with these exposures (24, 55, 56, 69, 75, 78-82). A large pooled analysis in a European male population reported a monotonic increase in SCCHN risk with increasing frequency of cigarette smoking relative to non-smokers, starting from as few as 2 cigarettes per day (81). A similar dose-response relationship was demonstrated with a cumulative measure of paan chewing in studies from South India (56) and Taiwan (83). Multiple studies also report an inverse relationship between years since cessation of the habit and SCCHN risk (78, 80, 84-87).
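One widely used cumulative measure of cigarette smoking is pack-years, which combines frequency and duration into a single number. The sketch below is illustrative only: it assumes the conventional pack size of 20 cigarettes, the function names are hypothetical, and the studies cited above may have defined their cumulative measures differently.

```python
CIGARETTES_PER_PACK = 20  # conventional pack size; an assumption of this sketch


def pack_years(cigarettes_per_day, years_smoked):
    """Cumulative exposure: packs smoked per day times years of smoking."""
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    return cigarettes_per_day / CIGARETTES_PER_PACK * years_smoked


def lifetime_pack_years(episodes):
    """Sum pack-years over (cigarettes_per_day, years) smoking episodes."""
    return sum(pack_years(cigs, years) for cigs, years in episodes)


# A hypothetical smoker: 10 cigarettes/day for 10 years, then 30/day for 20 years.
history = [(10, 10), (30, 20)]
print(lifetime_pack_years(history))
```

Because two very different histories (few cigarettes over many years, or many over few) can give the same pack-year total, frequency and duration are often also analysed separately, as the text notes.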

2.3.1.3 Alcohol consumption

The World Health Organization (WHO) estimates that there are approximately two billion alcohol consumers worldwide (28). More than half of men (55%) and one third of women (34.4%) consume some form of alcoholic beverage (88), and their drinking patterns vary from occasional to habitual drinking, to alcohol abuse (28). There is wide variation in the type, quality and quantity of alcohol consumed across countries. In Canada, about three-quarters of the population (78%) drink alcohol in the form of wine (10% ethanol), beer (5% ethanol), hard liquors (50% ethanol) and various combinations of spirits (89). In 2011, Quebec reported the highest rate of consumption (82%) in the country (89), and a higher percentage of males than females consumed alcohol (83% vs 74.5%) (89).

By comparison, the prevalence of alcohol consumption in India is much lower, with only 21% of men and 2% of women reporting this habit (90). The state of Kerala, located in the southwest of India, reports the highest rates of alcohol consumption in the country (91). In addition to other forms of alcohol, people in India consume high quantities of "toddy", a beverage produced locally from the fermented and distilled sap of palm and coconut trees (approximately 8-10% ethanol), and a locally brewed liquor known as "arrack", traditionally produced from fermented palm sap and fruit, grain, or sugarcane (approximately 40-60% ethanol) (69, 72).

2.3.1.4 Alcohol consumption and risk for SCCHN

There is a general consensus that alcohol plays the role of a promoter/co-carcinogen in carcinogenesis (36, 62, 92-94). Local exposure to ethanol, the principal type of alcohol found in most alcoholic beverages, is considered to increase the permeability of the oral, pharyngeal and laryngeal mucosa, facilitating the penetration of other carcinogens (24, 93, 95). Nutritional deficiencies induced by heavy drinking and a direct toxic effect on the epithelium of alcoholic beverages with high concentrations of ethanol may also contribute to alcohol-associated carcinogenesis (94). In addition, certain alcoholic beverages contain low levels of carcinogenic substances (e.g., nitrosamines, urethane, polycyclic hydrocarbons) (62, 86). Furthermore, acetaldehyde, the primary metabolite of ethanol in the body, is a Group 1 human carcinogen that exerts multiple mutagenic and carcinogenic effects, qualifying alcohol as an initiator of the cancer pathway (24, 96-100). Other mechanisms are detailed in sub-section 2.3.2.15.

Alcohol consumption accounts for approximately 30% of all SCCHN cases worldwide (101). The greater risk of disease in men is attributed to their higher average alcohol consumption relative to women (98, 101). An increase in the risk of SCCHN with different levels of ethanol consumption, duration, frequency and alcohol types has been documented among never users of tobacco (24, 79, 101-103). In a large pooled analysis of case-control studies, Hashibe et al. documented that among never users of tobacco, approximately 7% of SCCHN cases were attributable to alcohol drinking alone. A meta-analysis of studies on several cancers reported between 1956 and 2012 documented risk estimates for SCCHN ranging from 1.44 to 1.83 for moderate drinkers and from 2.65 to 5.13 for heavy drinkers among European and North American populations (104). In South India, approximately 26% of the risk for oral cancer is attributable to alcohol consumption, with the risk being 1.2 to 2.8 times higher among moderate to heavy alcohol consumers relative to non-consumers (54, 69, 80).
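Statements that a share of cases is "attributable" to an exposure are usually based on a population attributable fraction. The sketch below uses Levin's formula, which is one common way to compute it; it is an assumption that this matches the method of any study cited above, and the prevalence and relative risk in the example are hypothetical.

```python
def attributable_fraction(prevalence, relative_risk):
    """Population attributable fraction by Levin's formula.

    PAF = p * (RR - 1) / (1 + p * (RR - 1)), where p is the proportion of
    the population exposed and RR is the relative risk in the exposed.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    if relative_risk <= 0:
        raise ValueError("relative risk must be positive")
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical inputs: half the population exposed, RR of 3 in the exposed.
paf = attributable_fraction(prevalence=0.5, relative_risk=3.0)
print(f"{paf:.0%} of cases attributable to the exposure")
```

The formula shows why attributable fractions differ between regions even when relative risks are similar: a weaker but very common exposure can account for more cases than a stronger but rare one.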

Similar to tobacco products, a dose-response relationship, either linear or non-linear, has been documented for alcohol consumption and SCCHN association (101, 105-108). In a prospective study, Freedman et al. reported an increased risk of SCCHN (1.5 times for males, 2.5 times for females) for 3 drinks per day or more (65). However, a recent meta-analysis reported elevated risks at even lower levels, with risk ratios of 1.29, 3.24, 8.61, 13.2 for 10g (12ml), 50g (64ml), 100g (127ml), and 125g (160ml) of ethanol per day, respectively (101). Polesel et al. reported a non-linear dose-response relationship in a pooled European study and documented a threshold effect at 50g of ethanol consumption per day for pharyngeal and laryngeal cancers, and at 150g (191ml) for oral cancers (107). A recent meta-analysis also reported a non-linear dose-response relationship between frequency of ethanol consumed and SCCHN. However, they did not report a threshold effect for any SCCHN site (109).
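The gram and millilitre figures quoted above are linked by the density of ethanol. A rough conversion, assuming a density of 0.789 g/mL (pure ethanol at about 20 °C), is sketched below; the function name is illustrative. The results land within about 2 mL of the volumes given in the text, so the source appears to have rounded or used a slightly different density.

```python
ETHANOL_DENSITY_G_PER_ML = 0.789  # pure ethanol at ~20 degrees C (assumed here)


def ethanol_grams_to_ml(grams):
    """Convert a mass of pure ethanol (g) to its volume (mL)."""
    if grams < 0:
        raise ValueError("grams must be non-negative")
    return grams / ETHANOL_DENSITY_G_PER_ML


# Daily doses (g of ethanol) mentioned in the paragraph above.
doses_ml = {g: ethanol_grams_to_ml(g) for g in (10, 50, 100, 125, 150)}
for grams, ml in doses_ml.items():
    print(f"{grams} g of ethanol is about {ml:.1f} mL")
```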

2.3.1.5 Combined effect of tobacco and alcohol on risk for SCCHN

The interaction between tobacco and alcohol use in elevating the risk for SCCHN has been well demonstrated. Together they account for approximately 75-80% of SCCHN cases in North America and Europe (22-24) and 50% of oral cancer cases among males in Kerala, India (110). Several studies have considered the nature of the joint effects of smoking and alcohol on SCCHN (79, 101, 104, 105, 108, 111). Positive interactions on both additive and multiplicative scales have been reported between these exposures (98, 101, 105). A non-linear dose-response relationship has been documented for the combined effects of daily alcohol and cigarette consumption (108). For example, a 35-fold increase in risk of SCCHN was observed among those consuming 89g of ethanol and 10 cigarettes daily (108). The risk curve was steeper for increasing daily cigarette consumption among drinkers as compared to increasing alcohol consumption among smokers.
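Interaction "on the additive and multiplicative scales" can be made concrete with two standard summary measures: the relative excess risk due to interaction (RERI) and the ratio of the joint relative risk to the product of the single-exposure relative risks. The sketch below uses hypothetical relative risks, not estimates from the cited studies, and the function name is illustrative.

```python
def interaction_measures(rr_tobacco_only, rr_alcohol_only, rr_both):
    """Return (RERI, multiplicative ratio) for two binary exposures.

    All relative risks are relative to people with neither exposure.
    RERI > 0 indicates positive interaction on the additive scale;
    a ratio > 1 indicates positive interaction on the multiplicative scale.
    """
    reri = rr_both - rr_tobacco_only - rr_alcohol_only + 1.0
    ratio = rr_both / (rr_tobacco_only * rr_alcohol_only)
    return reri, ratio


# Hypothetical relative risks for illustration only.
reri, ratio = interaction_measures(
    rr_tobacco_only=2.0, rr_alcohol_only=1.5, rr_both=6.0
)
print(f"RERI = {reri:.2f}, multiplicative ratio = {ratio:.2f}")
```

The two scales can disagree: a joint relative risk of 3.0 with the same single-exposure values would give a RERI of 0.5 (positive additive interaction) but a ratio of exactly 1 (no multiplicative interaction), which is why studies report the scale they used.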


To summarise, it has been consistently demonstrated that tobacco use and alcohol consumption in various forms are strong risk factors for SCCHN, and several correlated measures of these exposures (frequency, duration, cumulative measures and time since cessation) are associated with SCCHN risk.

2.3.2 Genetic polymorphisms and copy number variations

Although tobacco and alcohol are strong risk factors for various cancers (e.g., SCCHN and lung cancer), only a very small proportion of tobacco users and alcohol consumers develop these diseases (33, 112, 113). For example, approximately 10%-15% of smokers develop lung cancer, and an even smaller proportion develop SCCHN (112, 113), suggesting inter-individual variation in host susceptibility to these diseases (33, 114, 115). Investigations of individual genetic makeup have shown that variations in the expression of carcinogen metabolizing enzymes due to variants of the genes encoding these enzymes, structural variations in DNA segments, mutagen sensitivity, chromosomal aberrations, DNA repair and apoptosis contribute, alone or in combination, to inter-individual variation in susceptibility to cancers including SCCHN (116-120). Single nucleotide polymorphisms (SNPs) are the most common form of variation in the human genome, and SNPs in key genes encoding enzymes involved in the metabolism of specific carcinogens found abundantly in tobacco smoke and alcohol have been the subject of research interest over the past two decades. These SNPs, along with risk behaviours, are the focus of manuscripts II and III of this thesis. Hence, I describe below the enzymatic pathways underlying the metabolism of tobacco and alcohol carcinogens, and the specific genes and related SNPs that could alter these pathways and contribute to individual differences in SCCHN susceptibility.

2.3.2.1 Enzymatic pathways in carcinogen metabolism

About 90% of chemical carcinogens from a variety of environmental exposures including tobacco smoke enter the human body as non‐carcinogenic pro‐carcinogens (121). They require bio‐activation into reactive molecules for further conjugation, which facilitates their elimination from the body (122, 123). The scenario is similar with constituents of alcoholic beverages. It is hypothesized that part of the susceptibility to tobacco and alcohol related cancers may be determined by inter-individual differences in the bio-activation of pro-carcinogens and detoxification of carcinogens derived from these exposures.

The bio-activation and detoxification processes are catalysed by enzymes generally known as phase I and phase II xenobiotic metabolizing enzymes (XMEs), respectively (115, 124, 125). Expressed mainly in the liver, these enzymes are also found in the mucosal lining of various organs, including the upper aero-digestive tract. The phase I XMEs that activate pro-carcinogens from environmental sources (including tobacco smoke) into intermediate reactive, electrophilic metabolites belong mainly to the superfamily of cytochrome P450 (CYP) enzymes. Phase I XMEs belonging to the alcohol dehydrogenase (ADH) family oxidise ethanol to acetaldehyde. These reactive moieties (e.g., diol epoxides, arene oxides, acetaldehyde from ethanol) are genotoxic and can form DNA adducts that may cause mutations in the DNA and result in cell transformation and in transcription and translation errors (Figure 1) (122). If DNA repair or cell death does not occur, these molecular changes persist and mark the earliest events in the pathway leading to tobacco and alcohol related cancers such as SCCHN. The detoxification and elimination of these reactive moieties are facilitated by the phase II XMEs glutathione S-transferase (GST) (via conjugation with nucleophilic glutathione) and acetaldehyde dehydrogenase (ALDH). This conjugation reaction increases the water solubility of the substrates from phase I biotransformation and ultimately leads to their elimination through urine and sweat (122).

2.3.2.2 Phase I and phase II enzymes, associated genes and SNPs

With respect to tobacco and alcohol related cancers, several enzymes belonging to the families of CYP, GST, ADH and ALDH enzymes have been studied. Some of the most widely studied are CYP1A1, CYP2E1, CYP2A6, CYP2D6, GSTM1, GSTP1, GSTT1, ADH1B, and ALDH2 (126). The specific pro-carcinogenic and carcinogenic substrates of these enzymes are provided in Table 3 (121, 127-129).

The catalytic activity of each of these enzymes is determined by the gene (DNA sequence) encoding it. For example, the CYP1A1 enzyme is encoded by the CYP1A1 gene. Alternative forms of a given gene (gene variants) that differ in function, resulting from variations in the nucleotide sequence of DNA at a given gene locus, are termed alleles. DNA sequence variations resulting in alleles that are common in the population (i.e., the least frequent/rare/minor allele occurs in more than 1% of the population due to natural selection or genetic drift) are known as genetic polymorphisms.

Table 3: XMEs and their substrates present in tobacco and alcohol

Enzyme    Substrates
CYP1A1    Polycyclic aromatic hydrocarbons (PAH), heterocyclic aromatic amines (HAA)
CYP2A6    NNK, N-Nitroso-N-Diethylamine (NDEA), nicotine, cotinine, ether
CYP2D6    Amines, nicotine
CYP2E1    Benzene, acrylonitrile, N-Nitroso Diethylamine (NDEA), TSNA (NNK, NNN), ether, ethanol
GSTM1     Arene oxide, diolepoxide
GSTP1     Arene oxide, diolepoxide
ADH1B     Ethanol
ALDH2     Acetaldehyde (from both alcohol and cigarette smoke)

When polymorphic DNA sequences arise from an alteration at a single nucleotide base, they are termed SNPs. Based on the combination of alleles on the maternal and paternal chromosomes, three genotypes can be identified in the population. Genotypes with the same allele on both chromosomes are referred to as homozygous, whereas those with a wild type allele on one chromosome and a variant allele on the other (maternal or paternal) are termed heterozygous. The homozygous wild type genotype (wild/wild) is usually associated with a functionally normal enzyme, whereas the homozygous mutant (variant/variant) or heterozygous (wild/variant) genotypes can result in a functionally different enzyme (e.g., a fast, slow or inactive enzyme). In summary, SNPs result in different groups of individuals (e.g., carriers: homozygous variant + heterozygous genotypes; non-carriers: homozygous wild type genotype) with distinct traits (inter-individual variation) in a given population. These SNPs can lead to functionally different xenobiotic enzymes involved in the biotransformation of tobacco and alcohol pro-carcinogens, which, in turn, can result in differential SCCHN risks among individuals with different genotypes.
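The link between a minor allele frequency and the share of carriers in a population can be sketched as follows. The calculation assumes Hardy-Weinberg equilibrium (random mating, no selection at the locus), which is an assumption of this illustration and not a claim made in the text; the 10% allele frequency is a hypothetical value and the function names are illustrative.

```python
def genotype_frequencies(minor_allele_freq):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium."""
    q = minor_allele_freq
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    p = 1.0 - q  # frequency of the wild type allele
    return {
        "wild/wild": p * p,
        "wild/variant": 2.0 * p * q,
        "variant/variant": q * q,
    }


def carrier_frequency(minor_allele_freq):
    """Share of individuals carrying at least one variant allele."""
    freqs = genotype_frequencies(minor_allele_freq)
    return freqs["wild/variant"] + freqs["variant/variant"]


# Hypothetical minor allele frequency of 10%.
print(genotype_frequencies(0.10))
print(f"carriers: {carrier_frequency(0.10):.0%}")
```

With a 10% allele frequency, about 19% of individuals are expected to be carriers but only about 1% homozygous for the variant, which is one reason studies often pool heterozygous and homozygous variant genotypes into a single carrier group.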

2.3.2.3 Candidate genes and SNPs associated with carcinogen metabolism and risk for SCCHN

The genes encoding phase I and phase II XMEs are highly polymorphic and various SNPs are associated with these genes (130). These SNPs can lead to enzyme products with increased, altered, decreased or no activity (129, 131). SNPs enhancing the activity of phase I CYP enzymes (e.g., CYP1A1*2A, CYP1A1*2C, CYP2E1c2) result in faster conversion of tobacco pro-carcinogens to reactive carcinogenic metabolites (132-135). Similar functional changes to ADH1B enzymes (e.g., ADH1B*2) lead to a higher conversion rate of ethanol to acetaldehyde (36, 136). Certain SNPs related to phase II XMEs (e.g., GSTP1Val) that decrease the activity of the corresponding enzymes may result in decreased detoxification and excretion of these genotoxic metabolites (137). Overall, these functional changes may result in an overload of reactive carcinogens in the human body, which can lead to an increased risk of SCCHN. Other groups of SNPs that decrease the activity of phase I XMEs (e.g., CYP2D6null) or phase II XMEs (e.g., GSTM1null) may result in decreased production or a decreased rate of detoxification of these metabolites, respectively, resulting in differential risk for SCCHN (138, 139). Furthermore, because tobacco smoke is one of the richest sources of carcinogenic chemicals that are substrates for these enzymes, the association between these SNPs and SCCHN risk can vary with the level of tobacco smoking. This gene-environment interaction can result in sub-groups with differential risk for SCCHN within a population. The identification of high-risk groups can ultimately aid in targeting prevention activities. Hence, in this work, we focus on the association between several widely studied SNPs altering the functions of phase I and phase II XMEs and SCCHN risk, alone or in interaction with tobacco smoking. We also consider the ADH1B*2 SNP associated with alcohol metabolism. A summary of the characteristics of these genetic variants is provided in Table 4 and described in the sub-sections below.

Table 4: Characteristics of SNPs involved in tobacco and alcohol metabolism

a Chr: Chromosome; rs_number*- stands for reference SNP cluster ID. It is an accession number that is a stable and unique identifier for SNPs.

CYP1A1*2A, CYP1A1*2C and SCCHN risk

CYP1A1 is a highly active CYP enzyme involved mainly in the activation of pro-carcinogens such as polycyclic aromatic hydrocarbons (e.g., benzo[a]pyrene) and aromatic amines found in tobacco smoke, environmental pollutants and smoked food (133). The enzyme is encoded by the CYP1A1 gene, found on chromosome 15. Polycyclic aromatic hydrocarbons induce expression of this gene (140). The SNPs designated CYP1A1*2A, which was the first variant to be identified for the CYP1A1 gene, and CYP1A1*2C are two of the most widely studied polymorphisms in this gene (141). These SNPs are inherited together, resulting in a non-random association (linkage disequilibrium, LD) between them (142). The frequencies of the minor alleles (C allele for CYP1A1*2A and G allele for CYP1A1*2C) vary across ethnicities, with 5-10% reported for the C allele and 3-5% for the G allele among Caucasians (143, 144). These SNPs, which occur on the restriction sites that control enzyme activity, result in increased enzyme activity (approximately 2-fold) (143, 144). Based on the hypothesis that increased enzyme activity leads to enhanced activation of pro-carcinogens to carcinogens, these SNPs are considered to increase the risk of SCCHN (142, 145-151).

Multiple meta-analytical reviews have aimed to clarify the association between the two CYP1A1 SNPs and SCCHN risk (132, 133, 146, 152-155). An increased risk association of CYP1A1*2A (113, 133, 155) and CYP1A1*2C (148, 154) with SCCHN has been reported when combining all ethnicities. However, this association is inconsistent among Caucasians (34, 35, 133, 146, 148, 150).

Meta-analytical reviews have also considered the combined effect of these polymorphisms and smoking on the risk of SCCHN (35, 132, 133, 146, 153). In a 2003 review and pooled analysis, Hashibe et al. reported no evidence for interaction between CYP1A1*2C and smoking, and suggested that this result was due to the heterogeneous ethnicity among studies (152). However, Liu et al. (2013) reported that, compared to non-smokers who were non-carriers of the CYP1A1*2C allele (AA genotype), carriers (AG/GG genotype) who smoked had the highest risk (approximately 2.4-fold), followed by non-carriers who smoked (2-fold risk) (146). Similar associations were identified for the joint effects of CYP1A1*2A and smoking on SCCHN. Relative to non-carriers (TT genotype) who did not smoke, carriers (TC/CC genotype) who smoked had an approximately 3-fold increase in risk, followed by non-carriers who smoked (1.78 times the risk). Overall, they reported a positive multiplicative interaction estimate of 1.5 between risky genotype categories of both SNPs and smoking. Qin et al. (2014) reported a positive interaction (1.51 times the risk) between carriers of the CYP1A1*2C variant and smoking (132). For CYP1A1*2A, He et al. (2014) reported 2.37 times the risk for SCCHN among carriers who were smokers (133). However, all these meta-analyses pooled studies on Asian and Caucasian populations. Hence, although extant research suggests that the combined effect of these SNPs and tobacco smoking intensifies the risk for SCCHN, more studies comprehensively reporting interaction results among Caucasian populations are required (156).

CYP2E1c2 and SCCHN risk

CYP2E1 is involved in the metabolic activation of compounds such as benzene, acrylonitrile, N-dimethyl nitrosamines and ether from tobacco smoke. It is encoded by the CYP2E1 gene, located on chromosome 10. The gene is inducible by low-dose nicotine and ethanol (157). A SNP designated as CYP2E1c2 is widely studied with regard to various tobacco-related cancers (134, 158-164); its minor allele (c2 or C) has a frequency of less than 10% among Caucasians (134, 143). The allele is associated with increased enzyme activity [i.e., the c2/c2 genotype (CC genotype) has almost 10 times more carcinogen-activating capacity than the c1/c1 genotype (GG genotype)] and hence is hypothesised to increase the risk for SCCHN (134, 135, 147, 163, 165-167). The most recent meta-analysis, conducted on 43 studies, suggested that carriers of the c2 allele are at increased risk for SCCHN among Asians and mixed populations, but not among Caucasians (168). A previous meta-analysis reported similar findings (134, 169). However, studies considering the combined effect of CYP2E1c2 and smoking (mainly stratum-specific effects) provide conflicting results (163, 170), and there is a lack of studies comprehensively analysing the possibility of an interaction between CYP2E1c2 and various tobacco smoking levels.

GSTP1Val and SCCHN risk

Belonging to a superfamily of multi-functional Phase II XME, the GSTP1 enzyme metabolizes a large variety of substrates and is involved in the detoxification of activated carcinogenic compounds from tobacco (e.g., diol epoxides of polycyclic aromatic hydrocarbons) (165). It is encoded by the GSTP1 gene located on chromosome 11. A SNP in this gene designated as GSTP1105Val has been studied in relation to multiple cancers including SCCHN (144, 147, 158-160, 163). In Caucasians, the frequency of the minor allele (G) is approximately 10-40% (171-173). Compared to the wild type A allele, the G allele encodes an enzyme that is 2-3 times less stable and hence less efficient in detoxifying phase I metabolites of tobacco procarcinogen metabolism (137, 152, 174). However, the three meta-analyses conducted thus far have failed to identify any association between carrying the G allele and SCCHN risk (137, 152, 175). They also did not identify any conclusive evidence supporting interaction between ever smoking and the G allele. Nevertheless, increased risk estimates for the joint effect of carrying the G allele and smoking have been reported with increasing levels of daily cigarette consumption and pack-years (34). To complicate the picture further, GSTP1Val is known to be very substrate specific and highly efficient in detoxifying the carcinogenic epoxide of benzo(a)pyrene specifically (176, 177); indeed, a lower risk for SCCHN among Val allele carriers has been documented (178). In summary, more studies are required to comprehensively investigate joint effects and interaction between GSTP1Val and tobacco smoking.

2.3.2.4       Copy number variants

Copy number variants (CNV) or polymorphisms have been defined as DNA segments present in variable copy numbers (repeats) in comparison with a reference genome (179). These segments are 1 kilobase or larger in size (from one kilobase to several mega-bases) and include deletion, duplication, insertion, inversion or complex recombination (Figure 2) (120). These structural variants are as important as SNPs in their contribution to genome variation. Genetic variants containing 0-13 gene copies have been reported across human populations (120). CNVs in genes involved in tobacco carcinogen activation and detoxification have been identified and are reported to alter SCCHN susceptibility (120). The identification of CNVs is advantageous in estimating the risk associated with various copy numbers of a variant rather than broad categorizations such as carriers vs non-carriers of the variant. In this work, apart from the SNPs already described, we consider CNVs in two genes, one encoding a phase I (CYP2D6) and the other a phase II (GSTM1) enzyme, the null variants of which render these respective enzymes non-functional.

Figure 2. A gene recombination event between two genes can result in gene duplication or multiplication (n=2,3…) or gene deletion (n=0). A duplicated gene could carry mutations from the original copy (red column).
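The contrast drawn above between copy-number-specific risk estimates and a broad carrier vs non-carrier categorization can be illustrated with a small calculation. The following is a minimal sketch, not an analysis from this work: the case and control counts are hypothetical values chosen purely for illustration, the names `odds_ratio`, `counts`, `or_by_copy` and `or_carrier` are assumptions introduced here, and the odds ratios are crude (unadjusted).

```python
def odds_ratio(cases_exposed, controls_exposed, cases_ref, controls_ref):
    """Crude odds ratio of an exposure category against a reference category."""
    return (cases_exposed * controls_ref) / (controls_exposed * cases_ref)


# Hypothetical case-control counts by gene copy number: (cases, controls).
# These numbers are invented for illustration only.
counts = {0: (60, 40), 1: (90, 90), 2: (50, 70)}

reference = 0  # zero copies (homozygous deletion) as the reference category
ref_cases, ref_controls = counts[reference]

# Copy-number-specific odds ratios (1 copy vs 0, 2 copies vs 0).
or_by_copy = {
    n: odds_ratio(cases, controls, ref_cases, ref_controls)
    for n, (cases, controls) in counts.items()
    if n != reference
}

# Broad categorization: at least one copy versus none.
carrier_cases = sum(c for n, (c, _) in counts.items() if n != reference)
carrier_controls = sum(k for n, (_, k) in counts.items() if n != reference)
or_carrier = odds_ratio(carrier_cases, carrier_controls, ref_cases, ref_controls)

for n in sorted(or_by_copy):
    print(f"{n} copies vs 0: OR = {or_by_copy[n]:.2f}")
print(f"any copy vs 0: OR = {or_carrier:.2f}")
```

In this hypothetical table the single carrier vs non-carrier estimate lies between the two copy-number-specific estimates, so a gradient across copy numbers would be hidden by the broad categorization.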

CYP2D6 non-functional (null) CNV and SCCHN risk

CYP2D6 (debrisoquine hydroxylase) is the most genetically polymorphic of the metabolic enzymes, with approximately 80 variants identified. It plays a major role in the metabolism of approximately 20-25% of clinically used drugs (180, 181) and of pro-carcinogens from tobacco (e.g., various amines, nicotine) (182). The XME is encoded by the CYP2D6 gene located on chromosome 22. The variants identified comprise SNPs, deletions and insertions, and include normal activity, reduced activity and non-functional alleles (120). There is no detectable activity for this enzyme when it is encoded by CYP2D6 non-functional alleles (null alleles). The approximately 6-10% of Caucasians who harbour these null alleles are termed poor metabolizers of enzyme substrates (182-185). Due to a lower activation of pro-carcinogens to carcinogens, CYP2D6 null is hypothesised to be associated with a lower risk for tobacco-related cancers such as SCCHN compared with highly active functional variants. However, evidence on this association has been inconsistent (138, 186-188). CNVs exist for CYP2D6 null (120), and individuals with fewer copies of the null variant could have an increased risk for SCCHN compared to those with more copies. However, neither this hypothesis nor the interaction between CYP2D6 null CNV and tobacco in the risk for SCCHN has been explored yet.

GSTM1 CNV and SCCHN risk

Similar to GSTP1, the GSTM1 enzyme is involved in the detoxification of a variety of activated compounds from tobacco smoke with carcinogenic potential. The GSTM1 gene on chromosome 1 encodes the GST-mu enzyme (189). Among the three polymorphisms isolated for this gene, the GSTM1 null gene renders the GST-mu enzyme inactive, and individuals with this allele do not detoxify tobacco-related carcinogenic compounds efficiently (190). An accumulation of such compounds, which can form DNA adducts, could increase the risk for cancers such as SCCHN. The null allele has a frequency of 40-60% among Caucasians (189). Multiple meta-analytical reviews support the hypothesis that GSTM1 null is associated with an increased risk of SCCHN in various ethnicities including Caucasians (113, 139, 150, 152, 191, 192). A higher risk has also been identified among smokers, suggesting an interaction between GSTM1 null and tobacco smoking (193, 194). Relative to non-smokers with normally active GSTM1 (non-null), smokers who were GSTM1 null carriers have up to 5 times greater risk for SCCHN, with the risk increasing with heavier levels of tobacco smoked (6 times for GSTM1 null + consumption of more than 20 cigarettes daily, 7.4 times for GSTM1 null + more than 40 pack-years of tobacco) (34). CNVs have been identified for GSTM1. Approximately 10% of Caucasians have up to 2 copies of the GSTM1 homozygous deletion (120, 195). Although studies on SCCHN (primary tumors, second primary tumors and recurrent tumors), bladder and prostate cancer documented no risk associated with one copy of GSTM1, the presence of at least 2 copies of GSTM1 was associated with a lower risk for these outcomes compared to GSTM1 homozygous deletion (196-199). The interaction between CNVs for GSTM1 null and tobacco smoking has yet to be reported comprehensively among Caucasians.

2.3.2.5       SNPs associated with tobacco and alcohol risk behaviours and risk for SCCHN

The genetic variants discussed so far are hypothesised to be associated with SCCHN risk independently or in interaction with smoking. However, there are SNPs that not only have the potential to interact with risk behaviours, but are documented to affect tobacco and alcohol risk behaviours. CYP2A6*2 and ADH1B*2 are two such variants that influence tobacco and alcohol consumption behaviours respectively. These variants are the focus of manuscript III and are described in the sub-sections below.

CYP2A6*2, intensity of smoking and SCCHN risk

CYP2A6*2 and nicotine metabolism

Tobacco smoking is a complex behaviour influenced by social, environmental, psychological and genetic risk factors (200-202). The phases identified in the continuum of this behaviour include the preparatory stage, initial trying (initiation), repeated irregular/sporadic use (experimentation), regular use, nicotine dependence/addiction, cessation and relapse (200, 203). Following initiation, this rewarding behaviour is strongly determined by nicotine, the addictive agent in tobacco (50). Within 10-20 seconds of its inhalation, nicotine reaches the brain and starts exerting its psychoactive effects (204). However, nicotine has a short half-life (8 minutes on average) as it is rapidly inactivated and removed from the body, lowering its levels in plasma and tissues (204). Hence, to attain and maintain optimal levels of nicotine in the brain, the individual has to smoke again. Thus, factors affecting the metabolism of nicotine may influence various phases of smoking behaviour.

Approximately 70-80% of the nicotine entering the body is metabolized/inactivated into cotinine through a 2-step process (204, 205): nicotine is first converted to nicotine iminium ion, which is later oxidized into cotinine. The first part of the process is the rate-limiting step and is catalyzed by the phase I CYP2A6 enzyme, mainly in the liver. Overall, 80-90% of the inactivation of nicotine to cotinine is catalyzed by the CYP2A6 enzyme encoded by the CYP2A6 gene on chromosome 19 (205-207). Although several SNPs have been identified in this gene, only a few have been functionally characterised as capable of altering enzyme activity (207-209). Based on their activity, carriers of the functional SNPs have been grouped as slow nicotine metabolizers (individuals hypothesised to smoke less), intermediary metabolizers (individuals hypothesised to be moderate smokers) and normal metabolizers (individuals hypothesised to smoke heavily) (208). Of these genetic variants, the first to be characterised and one of the most widely studied is CYP2A6*2 (206), which is categorized under slow metabolizers. The homozygous variant (AA) and heterozygosity (AT) of this allele result in complete and partial inactivity of the CYP2A6 enzyme, respectively (202, 210). Consequently, relative to the homozygous wild type (TT genotype), smokers who are carriers of the variant (AA or AT genotypes) exhibit higher plasma nicotine levels for a given amount of nicotine ingested (due to a lower conversion rate of nicotine to cotinine). Based on this mechanism, the CYP2A6*2 allele was hypothesised to have an inverse association with smoking behaviour (e.g., number of cigarettes smoked per day, nicotine dependence).

Association of CYP2A6*2 with cigarettes smoked per day

There is strong evidence for the association between the CYP2A6*2 allele and number of cigarettes smoked per day among Caucasian adult smokers. Inter-ethnic variation has been reported in the frequency distribution of the CYP2A6*2 allele. Whereas the allele is rare (0-0.7%) in Chinese, Korean and Japanese populations, its frequency ranges from 1% to 3% among Canadian, American and European Caucasians (206). Several (211-215) but not all (216-218) studies examining the association between CYP2A6*2 and smoking behaviour among Caucasian adult smokers reported that carriers of the CYP2A6*2 allele (AT/AA genotype) were protected against becoming nicotine dependent and smoked fewer cigarettes per day relative to non-carriers (TT genotype). A meta-analysis including observational studies published between 1998 and 2004 documented no overall association between the CYP2A6 gene (multiple variants) and smoking behaviour (202). However, the majority of the studies included in the review used broad definitions of smoking (e.g., ever/never/current/former smoker), which may have led to misclassification of the outcome and obscured significant differences between the groups. Given the existence of a well demonstrated biological mechanism connecting the gene, nicotine metabolism and smoking behaviour, the authors attributed their results mainly to a lack of methodological rigour in the studies investigated and emphasised the importance of specifically defining the smoking variable (202). Another meta-analysis in the same year provided evidence that smokers who were carriers of at least one CYP2A6*2 allele smoked significantly fewer cigarettes per day, and also had higher chances of quitting smoking (214). The first ever study on CYP2A6 poor metabolizers, conducted among Canadian Caucasians in 1998 (211), was reanalysed by Rao et al. (212) using stringent analytical methods.
These included: a) a new genotyping method which, unlike that of the original study, removed the chance of CYP2A6*2 false positives (219), b) a precise definition of the smoking outcome through multiple indices, and c) proper control for population stratification (confounding by variation in ethnicity) by the restriction of participants to Caucasian smokers who had at least 3 grandparents of Caucasian ethnicity. This study reported that among smokers, relative to non-carriers of the CYP2A6*2 allele (TT genotype), carriers (AT or AA) smoked fewer cigarettes per day [13.5 vs 19.5 overall (P<0.03), and 19 vs 29 at times of heavy smoking (P<0.001)], and had lower breath carbon monoxide levels and lower cotinine levels. A similar study conducted in another North American population reported that among dependent smokers, slow metabolizers (which included carriers of at least one *2 allele) smoked 7 fewer cigarettes per day on average relative to non-carriers (21.3 vs 28.3 cigarettes per day) (213). Also, among Caucasians who smoked at least 10 cigarettes per day, those who were slow metabolizers had significantly lower mean and overall puff volume compared to normal or intermediary metabolizers (220).

A genome-wide meta-analysis conducted in 2010, which analysed 710 SNPs on chromosomes 15, 19 and 8 among adult participants of European ancestry, documented a strong association between the CYP2A6*2 allele and number of cigarettes smoked per day (221). A recent meta-analysis on slow metabolizers of CYP2A6 reported similar findings (215).

Overall, findings from the observational studies and meta-analyses reported thus far indicate that, because of the allele's involvement in nicotine metabolism, smokers who are homozygous (AA) or heterozygous (AT) for the CYP2A6*2 allele smoke with less intensity (cigarettes per day) relative to homozygous non-carriers (TT).

CYP2A6*2 and SCCHN risk

Based on their involvement in the activation of tobacco pro-carcinogens to carcinogens, CYP2A6 genetic variants have been implicated in the risk for SCCHN. However, the literature on this association is sparse. Three studies have investigated the role of CYP2A6*4 in the risk for tobacco-related cancers (39, 163, 222). This SNP has a lower frequency among Caucasians (0.5-1%) compared to CYP2A6*2 (206). However, similar to CYP2A6*2, CYP2A6*4 renders the enzyme inactive, resulting in decreased bio-activation of substrates such as nicotinate and NNK, NNN and NDEA pro-carcinogens found in tobacco (206). Carriers of the CYP2A6*4 allele have been associated with a significantly lower risk for tobacco related cancers including those of the upper aerodigestive tract. Furthermore, this variant is suggested to affect cancer risk solely in smokers (39, 223). Based on similarities with CYP2A6*4, it can be hypothesised that smokers who are carriers of the CYP2A6*2 allele (AT or AA) are at a lower risk for SCCHN. However, no studies have yet investigated the role of CYP2A6*2 in SCCHN risk nor its interaction with smoking.

ADH1B*2, alcohol consumption and SCCHN risk

Alcohol consumption patterns are influenced by social, environmental, psychological and genetic factors with inter- and intra-ethnic variability (224-227). Much of the inter-individual variability in alcohol use is attributable to factors underlying the metabolism of ethanol (226-228). Ethanol entering the human body is first metabolized into acetaldehyde and later to acetate before being removed from the body. The oxidation of ethanol to acetaldehyde is catalyzed by ADH and its isoenzymes, mainly in the liver. They are also expressed in the stomach, gut and upper aerodigestive tract in detectable quantities. Similar to nicotine metabolism, the inter-individual variability in the metabolism of alcohol to acetaldehyde is mostly attributed to genetic polymorphisms in the ADH genes encoding ADH enzymes. Of these, the SNPs related to the ADH1B and ADH1C isoenzymes of ADH, namely ADH1B*2 and ADH1C*1, are two of the most functionally polymorphic and well characterised variants in adults. These SNPs are not only associated with alcohol consumption behaviour, but also with altered risk for SCCHN among alcohol consumers in various ethnicities (229). Although they seem to be in linkage disequilibrium, studies in both Caucasian and Asian populations suggest that ADH1B*2 has a significant effect on the risk of SCCHN after adjustment for ADH1C*1 (230, 231). Also, among the multiple ADH SNPs studied, ADH1B*2 has the strongest association with alcohol consumption behaviour and SCCHN (136, 228, 229). Hence, we will be focusing on the role of ADH1B*2 in relation to both SCCHN risk and alcohol consumption behaviour.

Association between ADH1B*2 and alcohol metabolism

The role of ADH1B*2 in alcohol consumption behaviour has been widely investigated (232). The frequency of this allele varies across ethnicities [Asian: 69% (range: 19%-91%), European: 5.5% (range: 1%-43%), Mexican: 3% (range: 2%-7%)] (232). The homozygous variant (AA genotype) and heterozygosity (AG genotype) of this allele result in an ADH enzyme that rapidly oxidizes ethanol to acetaldehyde (an up to 50-100-fold increase in activity has been reported) (36, 224, 227, 228). Carriers of this allele (AA or AG genotype) are at decreased risk of alcohol dependence compared to non-carriers (GG genotype). This is hypothesised to be due to the prompt build-up of acetaldehyde (resulting from the rapid oxidation of ethanol), which leads to a set of negative physiological reactions termed alcohol-induced flushing, characterised by cutaneous flushing, increased skin temperature, decreased blood pressure, tachycardia, dizziness, anxiety, nausea, headache and generalised weakness (233). These aversive reactions lead to decreased alcohol consumption.

Association of ADH1B*2 with alcohol consumption behaviour

The association between ADH1B*2 and alcohol consumption behaviour was first investigated in East Asians (234-237), and later among Europeans and other ethnicities (230, 238, 239). Among East Asians, this allele decreases the risk of alcohol dependence by about 80% relative to non-carriers (236, 238). A study on 4,597 Australian twins (3 studies combined) reported that non-carriers of the ADH1B*2 allele (GG genotype) had fewer negative reactions after alcohol consumption (p=8.2×10⁻⁷), consumed a higher number of drinks per day (p=2.7×10⁻⁶) and had a greater overall cumulative alcohol consumption (p=8.9×10⁻⁸) relative to carriers (228). On average, participants with GG, GA and AA genotypes consumed 5.1, 4.1 and 1.9 drinks per day, respectively. A recent meta-analysis (2,298 alcohol-dependent cases and 3,334 non-dependent controls) documented that the ADH1B*2 allele was associated with a significant reduction (by 66%) in alcohol dependence and in number of drinks per day among European-Americans. A meta-analysis of all studies published between 1990 and 2011 also reported robust associations (232). Overall, the accumulated evidence is consistent with the hypothesis that an elevation in acetaldehyde leads to an increased sensitivity to alcohol among ADH1B*2 carriers, reducing the likelihood of alcohol dependence and the number of drinks per day among Caucasian adults.

ADH1B*2 and SCCHN risk

ADH1B*2 has been strongly implicated in the risk for upper aerodigestive tract cancers among various ethnicities. Acetaldehyde, the initial metabolite of ethanol, has been suggested to exert multiple mutagenic and carcinogenic effects, qualifying alcohol as an initiator of the cancer pathway (96-99). Hence, it was hypothesised that fast metabolizers of ethanol (GA or AA genotype) have a higher exposure to acetaldehyde, increasing the risk for SCCHN (36). However, contrary to this hypothesis, the first reported study investigating this association (among Japanese alcoholics) reported an increased risk for SCCHN among the GG genotype relative to the GA or AA genotype (240). Brennan et al. attributed this to residual confounding by alcohol consumption (36). However, studies since then have consistently shown a decreased risk (up to a 50% reduction) for SCCHN among carriers of the GA or AA genotype (136, 229, 241). No association was identified among never-drinkers (229, 242), and the protective effect was significant at higher levels of alcohol consumption. These reports hypothesize alternative mechanisms of carcinogenesis.

The combined effect of the ADH1B*2 allele and alcohol consumption has also been investigated. A joint effects analysis conducted in the Japanese population reported that, compared to non-drinkers who were GA or AA genotype carriers, GG genotype carriers who were drinkers were at significantly increased risk for the disease. The effect was more pronounced among heavy drinkers (9-26 times higher risk) (231, 243). A Korean study documented a higher risk for the GG genotype compared to the AA genotype within the moderate and heavy drinker strata of alcohol consumption (244). Two recent studies among Caucasians did not document any interaction between ADH1B*2 and alcohol consumption levels (241, 245). However, large European studies, which documented a significantly lower risk among carriers of this allele (GA/AA genotype) within the strata of medium and heavy drinkers, do indicate the possibility of a negative interaction on an additive or multiplicative scale within this ethnicity (136, 229). Studies among Asian and Caucasian populations have consistently documented no altered risk among never-drinkers, whether carriers or non-carriers of the ADH1B*2 allele. Overall, studies investigating both main effects and stratum-specific effects indicate the possibility of interaction between ADH1B*2 and measures of alcohol consumption.

Hypotheses underlying the association between ADH1B*2 and risk for SCCHN

Multiple potential pathways (not mutually exclusive) underlying the association between ADH1B*2 and SCCHN among alcohol consumers have been proposed. Most of them are based on a direct carcinogenic action of acetaldehyde. Hashibe et al. reasoned that the fast metabolism of ethanol (among GA/AA genotypes), leading to increased acetaldehyde exposure, may initiate alternative mechanisms that clear the acetaldehyde peak. However, such mechanisms may not be activated among GG genotype carriers, who have a moderate initial metabolism, leading to acetaldehyde build-up, which in turn increases the risk for cancer (136). In addition, compared to ADH enzymes, the expression of the acetaldehyde dehydrogenase (ALDH2) enzymes that are chiefly responsible for degrading acetaldehyde to acetate is extremely weak in the upper aerodigestive tract (246). The resulting inefficient degradation of acetaldehyde may also contribute to additional acetaldehyde exposure among the GG genotype, especially among those consuming moderate to high levels of alcohol (231). Furthermore, apart from ADH enzymes, certain oral microflora can also convert ethanol to acetaldehyde (247-249). Following alcohol consumption, higher levels of acetaldehyde have been found in saliva relative to other parts of the body (especially in individuals with poor oral hygiene) (99, 250, 251). This oral microflora-salivary acetaldehyde pathway can contribute to peak acetaldehyde concentrations among the GG genotype (231, 252). Another hypothesis, independent of the acetaldehyde pathway, is that the fast metabolism of ethanol may result in lower local exposure to ethanol (136, 229). Hence, alcohol may not be able to exert its promoter effect (aiding dissolution of other carcinogens), conferring protection against neoplastic changes in the head and neck region among GA/AA genotypes.

To summarise, although most SNPs described above are associated with SCCHN based on their involvement in the bio-activation of tobacco-related pro-carcinogens and detoxification of carcinogenic metabolites, a comprehensive characterisation of their interaction with different levels of smoking, incorporating all aspects such as interaction on both multiplicative and additive scales, joint effects and stratum-specific risks, has not been reported (156). Furthermore, since CYP2A6*2 and ADH1B*2 affect specific measures of tobacco and alcohol consumption behaviours respectively, these behaviours may not only interact with these SNPs but also mediate the causal pathways between them and SCCHN risk. These pathways have not been elucidated yet.
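The four quantities named above (joint effects, stratum-specific effects, and interaction on the multiplicative and additive scales) can all be derived from three odds ratios that share a single reference group. The sketch below uses standard definitions: with OR10 for carriers who do not smoke, OR01 for non-carriers who smoke and OR11 for carriers who smoke (all relative to non-carrier non-smokers), the multiplicative interaction is OR11 / (OR10 × OR01) and the relative excess risk due to interaction (RERI) is OR11 − OR10 − OR01 + 1. The class and function names and the input values are hypothetical, introduced only for illustration; they are not estimates from any study cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointOddsRatios:
    """Odds ratios relative to the doubly unexposed group (reference OR = 1)."""
    or_gene_only: float      # carriers, non-smokers (OR10)
    or_exposure_only: float  # non-carriers, smokers (OR01)
    or_both: float           # carriers, smokers (OR11)


def multiplicative_interaction(j: JointOddsRatios) -> float:
    """Ratio of the joint OR to the product of the single-factor ORs.
    A value of 1 indicates no interaction on the multiplicative scale."""
    return j.or_both / (j.or_gene_only * j.or_exposure_only)


def reri(j: JointOddsRatios) -> float:
    """Relative excess risk due to interaction.
    A value of 0 indicates no interaction on the additive scale."""
    return j.or_both - j.or_gene_only - j.or_exposure_only + 1.0


def gene_effect_by_stratum(j: JointOddsRatios) -> tuple[float, float]:
    """Stratum-specific gene effect: among non-smokers, then among smokers."""
    return j.or_gene_only, j.or_both / j.or_exposure_only


# Hypothetical joint effects, for illustration only.
example = JointOddsRatios(or_gene_only=1.2, or_exposure_only=2.0, or_both=3.0)

print(f"multiplicative interaction: {multiplicative_interaction(example):.2f}")
print(f"RERI: {reri(example):.2f}")
print("gene effect (non-smokers, smokers):", gene_effect_by_stratum(example))
```

Reporting all of these together matters because the two scales can disagree: a set of joint effects can show no departure from multiplicativity while still showing a positive RERI.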

2.3.3        Human papillomavirus (HPV)

In the past decade, HPV infection has emerged as a strong risk factor for SCCHN. A trend of decreasing incidence of oral cavity cancers (consistent with a decrease in tobacco use) and an increase in the incidence of oropharyngeal cancers (tonsils, base of tongue) have been documented in many developed countries, especially among men (8, 26, 27, 253, 254). The increased incidence of oropharyngeal cancers has been attributed to HPV infection. This infection has been detected in approximately 25% of SCCHN cases worldwide (255). The majority of HPV-positive SCCHN are oropharyngeal cancers. The virus is transmitted through skin-to-skin and skin-to-mucosa contact. Hence, unprotected sexual behaviours, notably oral sex, have been identified as routes of HPV transmission with respect to anogenital cancers and SCCHN. More than 100 sub-types of HPV have been identified, among which HPV 16, 18, 31, 33 and 35 have been classified as high-risk sub-types in relation to cancer. More than two-thirds of HPV-positive SCCHN have been attributed to HPV-16 infection. Results from a 2006 meta-analysis show that the association between HPV-16 and SCCHN was strongest for tonsillar cancers (15-fold), followed by oropharyngeal cancers (4-fold), and oral and laryngeal cancers (2-fold) (256). A recent prospective cohort study (2016) conducted in the USA reported an up to 7-fold increase in risk associated with HPV-16 for incident SCCHN cases, with a positive association only for oropharyngeal cancers (257). The researchers also reported that HPV-16 infection preceded SCCHN incidence. HPV-positive SCCHN are clinically distinct from HPV-negative cases, and patients with HPV-positive disease have better survival than HPV-negative patients (three-year survival of 84% vs. 57%, respectively) (258). Based on recent trends in the incidence of oral cavity and oropharyngeal cancers, the existence of two distinct SCCHN risk groups (tobacco and alcohol related, and HPV related) has been suggested.
However, a large study from IARC reported that relative to HPV-negative/non-smokers, HPV-positive/smokers had the greatest risk for both oral cavity and oropharyngeal cancers, greater than HPV-positive/non-smokers or HPV-negative/smokers (259). Evidence from other studies also indicates interaction between risk behaviours and HPV status in the risk for SCCHN (260-262).

2.3.4        Socioeconomic position (SEP)

Similar to genetic factors, socioeconomic position (SEP) is a well-documented distal determinant of health outcomes including SCCHN (263-274). In addition, behavioural risk factors such as tobacco use and alcohol consumption are socially patterned (275-281). Hence, whether they are the primary focus or not, it is essential to consider measures of SEP in most epidemiologic studies. In this thesis, different measures of SEP are used, either as the main exposure (manuscript I) or as an important confounder between exposures and the outcome of SCCHN. Therefore, in the following sub-sections I present an overview of the complex construct of SEP, various methods to measure this exposure and their association with SCCHN risk.

2.3.4.1       Definition of SEP

The term ‘socioeconomic position’ refers to the economic and social well-being of a person assessed through components such as occupation, income, wealth, education and social status. Krieger (1997) defines SEP as an aggregate concept that includes both resource based (income, wealth, education) and prestige based (individuals’ rank or status in the social hierarchy, evaluated with reference to people’s access to and consumption of goods, services and knowledge) measures that are linked to both childhood and adult social class position (282).

2.3.4.2       Indicators of SEP

Based on theory (indicating social class, status or position), correlations with health outcomes, suitability for particular societies and availability of data across the life course, observational studies use various indicators in an attempt to measure SEP. Commonly used measures of SEP are addressed below.

Asset/wealth index

An asset or wealth index is a measure of the material endowment of an individual or household. It is considered an acceptably reliable proxy for consumption and thus SEP, particularly in low to middle income societies (283, 284). The wealth index is calculated using readily observable household characteristics such as durable assets and household amenities (e.g., car, refrigerator, television, bicycle, livestock, radio, sewing machine), housing characteristics or conditions (floor, roof and wall materials, toilet facilities, water supply), access to services (e.g., electricity supply, drinking water sources), and housing tenure (status of house, land or farm ownership) (284-287). The asset index is said to have been developed on the basis of availability and convenience, especially in more agrarian societies, rather than on a plausible direct causal relationship between wealth or asset possession and health (284). There is also an argument that the index is unlikely to capture the broad concept of SEP (288). However, poor housing is associated with a wide range of health conditions (289). Indicators such as overcrowding in houses have been associated with sanitation and the spread of infections. Moreover, health and mortality are sensitive to fine gradations in neo-material conditions such as access to cars, home ownership, presence of a home garden and healthier food (290, 291). Furthermore, housing tenure, conditions, assets and amenities reflect an individual’s educational and occupational status and income (284). The wealth index gained popularity through its use in Demographic and Health Surveys (DHS) data sets to quantify and compare socioeconomic inequalities across approximately 35 countries, most of which were low and middle income countries (283, 292). This measure was utilized because of a lack of reliable data on income and expenditures.
Also, household assets are resistant to change in response to short-term economic shocks, which are a feature of low and middle income settings. Based on its slower response to economic shocks, it is also argued that the wealth index captures long term stable aspects of economic status (288, 293). Unlike other indictors such as education and current income, information on components of the wealth index is available across life and hence is an SEP measure available at multiple periods of life.

Education

Education is one of the most widely used individual-level measures of SEP. Education marks the transition from childhood to adolescence or early adulthood and indicates an individual’s independence from parental care (294). An individual’s educational attainment could determine that individual’s health through its influence on decision-making skills, awareness about opportunities, general awareness and interactions with people, access to information and health care, choices of lifestyle behaviours, job and income levels, housing conditions, status in the society and stress coping mechanisms (31, 295). Relative to other measures of SEP such as income and occupation, education is easier to measure, can be assessed in people who are not in active labour, is equally available to both sexes especially in developed countries, has a high response rate with the exclusion of only a few members of the population and is less subject to negative adult health selection. Together, these attributes make education a useful and important measure of SEP (295-297). However, education is usually acquired early in life and is stable after early adulthood, and thus represents SEP only during a short window of the life course (285, 295). Commonly used markers of education include number of years of formal education and highest level of education attained in life (285, 294). However, the analysis of these markers can be complicated. The number of years of education does not convey any information regarding the quality of the education and its social and economic value. Furthermore, the meaning of a particular level of education and number of years of education are not the same everywhere, and are related to age and birth cohort, social class position, race/ethnicity and cultural norms (282). For example, significant social and educational reforms took place in the state of Kerala in India in the mid-1900s (298).
Until that time, a feudalistic system existed for land ownership, wealth, access to education and privileges. Education was considered the privilege of people of the higher caste (hierarchy in the Hindu religion based on occupation) and Syrian Christians, whereas people from the backward caste and most females were denied formal education (298). Completing four years of education was a high educational attainment. However, political movements since the Indian independence (1947), especially in the late 1950s, resulted in free and compulsory education until 14 years of age (8 years of education), and education was given a higher importance in the society (299). This educational reform played an important role in lifting people out of poverty by providing the means for upward social mobility. Such features specific to societies and birth cohorts must be considered when using and analysing markers of education as measures of SEP.

Occupation and income

Occupation and income are commonly used measures of SEP. Occupational status is a direct measure of social class in most societies and is the major structural link between education and income (294). Income is a direct indicator of SEP and is the result of an individual’s occupation (300). Occupation plays an important role in positioning an individual within the social structure that directly controls access to resources, interaction with peers, exposure to job-related environments and physical exposures, psychological risks and risk behaviours such as tobacco and alcohol consumption (295). Income levels impact health outcomes by influencing the material circumstances of an individual such as quality, type and location of housing, food, clothing, health care, transportation, opportunities for cultural, recreational and physical activities, child care and exposure to various toxins (294). Overall, these features make occupation and income suitable measures of SEP in health research. However, occupation and income can be difficult to measure with precision, especially in low and middle income societies (268, 283, 284, 293). This can be attributed to features such as higher non-response rates, missing information on people who are not part of active labour (e.g., home makers) and fluctuations with short-term economic shocks (293, 295). Furthermore, most occupational classifications have been developed and validated on working men (295). These factors pose a challenge when using occupation and income as measures of SEP.

2.3.4.3       Association of SEP with risk for SCCHN

As demonstrated with health outcomes such as cardiovascular diseases, mortality, allostatic load, multiple cancers and oral health conditions, cumulative disadvantageous SEP over the life course has been associated with increased risk for SCCHN, independent of behavioural risk factors (272, 290, 301-305). A large meta-analytical review of case-control studies by Conway et al. (2008), which included 24 and 17 studies from high and low income countries, respectively, examined the association between three measures of SEP (income, occupation and education) and oral cancer risk (267). Participants with low educational attainment, low occupational status and low income had 1.85, 1.84 and 2.41 times the risk, respectively, of developing oral cancer relative to their higher SEP counterparts. In addition, disadvantageous SEP was independently associated with increased oral cancer risk in high and low income countries across the world. Most (269, 306-308) but not all studies (309) conducted subsequently in developed and developing countries have shown that a disadvantageous SEP is independently associated with an increased risk of SCCHN.

2.4       Complex exposures – Need for comprehensive conceptual and analytical frameworks

Genetic exposures such as SNPs are fixed at birth and are well defined. By contrast, exposures such as behavioural risk factors and SEP have a complex dynamic nature. An individual’s SEP may not remain the same from childhood to early to late adulthood stages of their life (276, 285, 310). The situation is similar for behavioural risk factors such as tobacco and alcohol habits, as individuals’ behavioural patterns can vary (e.g., frequency, duration, type of tobacco or beverage) over the course of life (311). Thus, these exposures are time-varying. Capturing the dynamic nature of these exposures within an epidemiologic study and addressing it in the analysis is challenging. The challenge is compounded by the bi-directional associations within these exposures at multiple time periods, and between these variables and the health outcome. For example, SEP is considered to affect risk behaviours. However, such behaviours (e.g., alcohol consumption) have also been considered as determinants of socioeconomic consequences, especially in developing societies (312). In addition, these risk behaviours are highly correlated. Hence, SEP in an earlier period of life, for example childhood, may affect risk behaviours in adolescence and early adult life, which can in turn affect social conditions in subsequent late adult life. In short, this time-varying nature produces a complex feedback loop between these variables acting as multiple confounders and mediators in the causal pathways to the health outcome (313). A further concern is the possibility of reverse causality. Based on the social causation perspective, an individual’s SEP components can influence their health positively or negatively. For example, following a low educational attainment, one could get a job that exposes them to chemicals and physical hazards including carcinogens, physical and psychological stress, noise, heat, cold, unsafe conditions, and dust, among others. 
These exposures lead to an increased risk of disease. The same person could also face unemployment, which increases the risk of depression, anxiety and disability, and may lead to unhealthy coping practices (e.g., cigarette smoking and alcohol consumption). In contrast, based on the selection hypothesis, healthy people may obtain and retain their occupational status. These bidirectional associations make it imperative to collect repeated data on these exposures at multiple time points and to assess their temporal relationship with the health outcome. Addressing these issues requires a comprehensive theoretical study framework, a study design that is appropriate for the health outcome being investigated, a suitable analytical framework and associated techniques. In this thesis, I used the conceptual framework of life course epidemiology, a case-control study design that is advantageous for studying rare disease outcomes such as SCCHN, a counterfactual causal inference analytical framework to incorporate repeated measures of exposures and estimate causal effects of exposures on the outcome, and causal diagrams. A brief overview of these elements of my thesis is presented in the subsections below.

2.4.1        Life course epidemiology – Definition and origin

Kuh and Ben-Shlomo define life course epidemiology as “the study of long-term effects on later health or disease risk of physical or social exposures during gestation, childhood, adolescence, young adulthood and later adult life” (314). Research in the 1950s by Sir Richard Doll and colleagues suggested that smoking was a strong risk factor for lung cancer (and concomitantly for laryngeal, oesophageal and bladder cancers). This marked a paradigm shift in risk factor research: the focus of chronic disease investigations shifted to an adult lifestyle approach where multiple adult life exposures were implicated in the risk for later life health outcomes (315). However, Forsdahl (1977) documented a strong correlation between infant mortality rates and mortality in middle age for the same generation in specific counties in Norway (316). Similar results linking early life events to adult health outcomes were documented in ecological studies conducted in the USA and Britain, and historical cohort studies (e.g., British birth cohorts) during the following 15 years (317-321). These observations gave rise to the concept of biological programming based on the fetal origins hypothesis. According to this hypothesis, “environmental exposures such as undernutrition during critical periods of growth and development in utero may have long term effects on adult chronic disease risk by ‘programming’ the structure or function of organs, tissues, or body systems” (319). In combination, the above observations supported the importance of biological, behavioural, and psychosocial processes that may operate throughout an individual’s life course, or across generations, to influence disease risk, rather than just an adult lifestyle approach to chronic diseases (322).
This research became the foundation for the conceptual framework of life-course epidemiology, conceived in the late 1990s, which gives importance to time (duration) and timing of biological, behavioural and social exposures that may act independently, cumulatively or interactively to influence disease risk (314, 323).

2.4.2        Models under the life course epidemiology framework

The main aim of the life course epidemiology framework is to elucidate pathways linking exposures across the life course to later life health outcomes. To achieve this objective, various theoretical models linking exposures to health outcomes have been proposed. They are described below.

2.4.2.1       Accumulation model

The accumulation model is considered the most fundamental of all life-course models and gives importance to time (duration) of exposures (324). The model proposes that exposures clustered at different periods of life may accumulate longitudinally over the course of life, leading to differential risk for chronic disease outcomes (323). This concept is in line with the notion of allostatic load, which is the wear and tear on biological systems resulting from chronic overactivity or inactivity of normal physiological systems in response to increased exposures (in number and/or duration) from the external environment (323, 325). Indeed, Kuh et al. (1997) describe an individual’s biological resources accumulated over the life course as their ‘health capital’, which describes and influences current and future health (314). Ben-Shlomo and Kuh (2002) propose that risk can accumulate with independent and uncorrelated insults (no interaction between exposures), or with correlated insults (e.g., SEP, smoking, alcohol) that cluster together leading to a health outcome, or with similar insults (disadvantageous SEP at different life stages) that form a chain leading to the outcome (323).

2.4.2.2       Critical period model

Stemming directly from the concept of biological programming and the fetal origins hypothesis, the critical period model gives importance to the timing of exposures. In its strict sense, the critical period model posits that exposures during specific periods of life can cause irreversible biological damage and have a long-lasting effect on biological systems, irrespective of exposures in prior or later periods of life (323). The sensitive period model is a variation of the critical period model which recognizes that although periods with a higher sensitivity to the effects of an exposure may exist, the effects can be modified or even reversed with prior or later exposure profiles (322).

2.4.2.3       Mobility or pathways model

The mobility or pathways model is considered to be a variation of the accumulation model and is mostly examined in studies of SEP (326). It focuses on the cumulative effect of exposures along life trajectories and implicates differential exposure throughout the life course in adult disease causation. This model implies the interaction of exposures at multiple periods of life (e.g., SEP in childhood, early and late adulthood). Different hypotheses proposed within the pathways model posit different health effects. For example, under the natural health selection hypothesis, less healthy individuals experience downward mobility (moving from an advantageous to a disadvantageous SEP) and healthier individuals tend to experience upward mobility (moving from a disadvantageous to an advantageous SEP) (327, 328). These mobile groups are separated from the individuals who do not show any mobility across life periods, as both groups are considered to have distinct traits that make them mobile or non-mobile. In contrast, under a gradient/health constraint hypothesis, mobile groups (either upward or downward mobility between different time periods) possess health traits of both the group they leave and the one they join, thus minimizing the health difference between the SEP groups (327-329). The risk associated with mobile groups will be intermediate between the two non-mobile groups (greater than that of the group with advantageous SEP in all time periods, and lower than that of the non-mobile group with disadvantageous SEP at all time points). Interestingly, an elevated risk for a health outcome (e.g., cardiovascular mortality) has been documented among individuals who experience deprivation in early life, followed by later life affluence (316). Forsdahl (1977) hypothesized that this was partly due to risky exposures associated with an affluent lifestyle (e.g., elevation in adult cholesterol levels) (316).

Life course epidemiology allows considerable overlap between the models specified above. Hence, the models are not mutually exclusive and are empirically difficult to disentangle (330). For example, under a social mobility model, a disadvantageous SEP in childhood can interact with an advantageous or disadvantageous SEP in early adulthood to confer a particular risk for SCCHN. However, this is in effect a chain of risk as described under the accumulation model. Furthermore, the critical period model with effect modification in prior or later periods (322), or sensitive period model, is reflected in the various interactions of the exposure that are possible under the social mobility concept.
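The overlap between the models can be made concrete with a small sketch. The Python code below is an illustration only, not the coding scheme used in this thesis: it assumes a binary SEP indicator (1 = disadvantageous, 0 = advantageous) measured at three hypothetical life periods, and the function names and category labels are invented for the example. It codes the same trajectory three ways: as an accumulation score, as a critical period indicator for childhood, and as a mobility category.

```python
from itertools import product

# Hypothetical life periods; SEP is coded 1 = disadvantageous, 0 = advantageous.
PERIODS = ("childhood", "early_adulthood", "late_adulthood")


def accumulation_score(sep):
    """Accumulation model: number of periods spent in disadvantageous SEP."""
    return sum(sep)


def critical_period_exposed(sep, period="childhood"):
    """Critical period model: exposure in one specific period only."""
    return sep[PERIODS.index(period)] == 1


def mobility_category(sep):
    """Mobility model: compare the first and last periods of the trajectory."""
    if all(s == sep[0] for s in sep):
        return "stable disadvantaged" if sep[0] == 1 else "stable advantaged"
    if sep[0] == 1 and sep[-1] == 0:
        return "upward"
    if sep[0] == 0 and sep[-1] == 1:
        return "downward"
    return "mixed"  # same start and end, but a change in between


# Enumerate all eight possible trajectories and show the three codings.
for trajectory in product((0, 1), repeat=len(PERIODS)):
    print(
        trajectory,
        accumulation_score(trajectory),
        critical_period_exposed(trajectory),
        mobility_category(trajectory),
    )
```

Trajectories such as (1, 1, 0) and (0, 1, 1) share the same accumulation score but fall into different mobility categories and differ on childhood exposure, which is one way to see why the models are hard to separate empirically.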

2.4.3        Suitability of the life course framework to study social, genetic and behavioural risk factors

The life course framework is particularly well suited for this work exploring genetic, behavioural and social risk factors of SCCHN, as the multiple ways in which exposures can lead to the cancer outcome can be encompassed within this framework. For example, the time-dependent aspect of SEP and associated behavioural risk factors can be effectively captured under this framework and tested under the accumulation, critical period and social mobility models. Familial risk factors such as SNPs are already fixed and exert an effect throughout life, which can be visualized under an accumulation model. For example, SNPs such as CYP1A1*2A and CYP2E1c2 increase the risk of SCCHN among Asians independent of smoking. This could be an example of an independent insult causing the health outcome, as explained under the accumulation model. However, the effect of these SNPs on the risk of SCCHN among Caucasians might be present only in the presence of heavy smoking (interaction). Similarly, SNPs such as ADH1B*2 and CYP2A6*2 can interact with alcohol and smoking. They also affect alcohol consumption and other risk behaviours. Hence, the effect of these SNPs on the risk of SCCHN can be partly through these risk behaviours, which is referred to as mediation. The concept of interacting and mediating causal pathways leading to a health outcome has been defined under the life course framework and is reflected in the accumulation model (322). Thus, the possible causal pathways to SCCHN involving potentially confounding, interacting and mediating factors can be tested under the life-course framework. However, this study framework needs to be complemented by a suitable study design incorporating life course epidemiology to specifically study the relatively rare outcome of SCCHN.

2.5       Study designs for observational epidemiologic studies

Two of the main observational study designs for epidemiologic research are cohort and case-control designs (331). In this thesis, we used a hospital based case-control design and novel approaches to existing analytical techniques originally developed for cohort data. Hence, the principles of these designs are described below with emphasis on case-control studies.

2.5.1        Cohort studies

In a typical cohort study, a group of individuals, sampled based on exposure to certain conditions, is identified and traced over time for the occurrence of health outcomes (332). A commonly used measure of disease frequency is the incidence rate, which is the number of new cases per population at risk in a given time period. The incidence rate can be calculated in both the exposed and unexposed groups, from which both absolute and relative measures of association between exposure and outcome can be derived. The difference between the incidence rate in the exposed and that in the unexposed group provides the incidence rate difference (on an absolute scale), whereas the ratio of the incidence rate in the exposed to that in the unexposed group gives the relative risk (RR) (on the relative scale) (333). The calculation of these measures is possible and straightforward in a cohort study, as the probability of outcome in the non-exposed is known (334). This study design is useful because it provides information on multiple exposures and outcomes and their variation over time, and ascertains temporality (cause precedes effect). However, its time-consuming nature makes it a poor choice to study rare outcomes such as cancers (under a rare disease assumption, health outcomes with a prevalence of less than 10% in the population are considered rare), as following the entire population for long periods of time would be impractical, and the sample would not yield sufficient cases to derive reasonably precise measures of association.
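As a worked illustration of the absolute and relative measures described above, the following Python sketch computes an incidence rate difference and a rate ratio. The counts are hypothetical and chosen only for easy arithmetic; they do not come from any study cited here, and the function and variable names are assumptions made for the example.

```python
def incidence_rate(new_cases, population_at_risk):
    """New cases per person at risk over the follow-up period."""
    return new_cases / population_at_risk


# Hypothetical cohort counts over one follow-up period (illustrative only).
exposed_cases, exposed_at_risk = 30, 10_000
unexposed_cases, unexposed_at_risk = 10, 10_000

rate_exposed = incidence_rate(exposed_cases, exposed_at_risk)
rate_unexposed = incidence_rate(unexposed_cases, unexposed_at_risk)

# Absolute scale: how many extra cases per person at risk.
rate_difference = rate_exposed - rate_unexposed

# Relative scale: how many times higher the rate is in the exposed group.
rate_ratio = rate_exposed / rate_unexposed

print(f"Rate difference per 1,000 at risk: {rate_difference * 1000:.1f}")
print(f"Rate ratio: {rate_ratio:.1f}")
```

Both measures need the denominators (the populations at risk), which is exactly the information a cohort design provides.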

2.5.2        Case-control study design

The case-control design can be advantageous compared to cohort studies, especially when investigating rare disease outcomes, because of its efficient way of sampling individuals from the source population based on the outcome (333, 335). Compared to a cohort study, a case-control design includes a larger fraction of individuals from a source population who develop the outcome (cases) and a lower proportion of those who do not (controls). This design attained significance in the 1920s through studies on rare outcomes such as lip, oral cavity and breast cancers (336). In a case-control study, an adequate number of cases from a source population are first selected and classified as exposed or unexposed. Next, their exposure profile is compared with that of controls, who are sampled from and representative of (with respect to exposure distribution) the same source population from which the cases were recruited (337). The source population, or the underlying “hypothetical” cohort from which participants in a case-control study are sampled, is elusive; that is, participants are neither drawn from a roster nor followed to record outcomes. The controls are selected independent of their exposure status. Cancer case-control studies are usually population or hospital based, depending on the source population from which cases and controls are sampled (332).

Because the numbers of cases and controls are fixed by the investigator in a case-control study, the probability of the outcome among the source population remains unknown (334). Hence, relative risk cannot be estimated directly using this study design unless techniques are used to correct for the sampling strategy that gave rise to the data, that is, to account for the probability with which cases and controls are sampled (sampling fraction) into the study from the underlying population (334, 335). However, since the counts of participants among cases and controls with and without the exposure are available, the measure of association derived from case-control studies is the odds ratio (OR) (334). The OR is defined as the ratio of the odds of the exposure among cases to that among the controls (exposure OR). The exposure OR and the outcome OR are mathematically equivalent, making the OR a valid measure of association between exposure and outcome (334). For rare outcomes such as cancers (incidence less than 10% in a population), the OR approximates the RR (334, 336).
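The equivalence of the exposure OR and the outcome OR, and the rare-outcome approximation, can be checked with a short calculation. The Python sketch below uses two hypothetical 2x2 tables (the counts are invented for illustration and the variable names are assumptions): one for a case-control sample, and one for a full cohort in which the outcome is rare.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2x2 table.

    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    return (a * d) / (b * c)


# Hypothetical case-control sample (illustrative counts only).
a, b, c, d = 60, 40, 30, 70

# Odds of exposure among cases divided by odds of exposure among controls.
exposure_or = (a / b) / (c / d)

# Odds of being a case among exposed divided by that among unexposed.
outcome_or = (a / c) / (b / d)

print(exposure_or, outcome_or, odds_ratio(a, b, c, d))

# Hypothetical full cohort with a rare outcome: 30 cases among 10,000
# exposed people and 10 cases among 10,000 unexposed people.
cohort_rr = (30 / 10_000) / (10 / 10_000)
cohort_or = odds_ratio(30, 10, 10_000 - 30, 10_000 - 10)

print(cohort_rr, cohort_or)
```

Both ways of forming the OR reduce to the same cross-product, which is why sampling on the outcome does not invalidate it. In the cohort table the OR and the RR differ only in the third decimal place because the outcome is rare.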

Although case-control studies are suitable for the investigation of cancer outcomes, the design itself poses challenges with respect to certain research questions. First, unlike cohort studies, data on exposures that change over time (e.g., SEP, smoking) are usually not available from case-control studies. This makes it difficult to assess exposures under a time-varying framework. Second, the estimation of association between exposure and outcome is limited to the health outcome on which the study sampling was based. Hence, a researcher might refrain from exploring research questions that require the use of analytical techniques for which a variable other than the main outcome of interest must be used as a dependent variable (e.g., mediation analysis, multi-step modelling such as inverse probability weighted marginal structural models). However, such scenarios are encountered when research questions aim to elucidate causal pathways and mechanisms underlying exposure-outcome relationships. The case-control design should be explicitly taken into account while answering these questions and appropriate study frameworks such as life course epidemiology, control sampling techniques and statistical methods are needed to mitigate these challenges (335).

2.6       Causal inference and causal effect estimation

“One commonly heard argument is that epidemiologic studies are about associations, not causations. According to this proposition, epidemiologists should not worry too much about fishy causal concepts but rather focus their efforts on estimating correct associations. This is certainly a safer strategy but also a dangerous one because it can make much of epidemiology close to irrelevant for both scientists and policy makers”.  – Hernán (2005)

Information on cause and effect relationships between exposures and health outcomes is the fundamental contribution of epidemiology to the improvement of health (338, 339). Causality and causal inference have been a subject of great interest and contentious debate since the 18th century (340). These concepts further evolved during the 19th century through pioneering works on infectious diseases (Henle-Koch postulates), social causation of disease (Rudolph Virchow), and smoking and various cancers (341). The evidence linking smoking and lung cancer in the 1950s [and concomitantly with other health outcomes (larynx, esophagus and bladder cancers)] led to the formulation of the “Bradford Hill criteria” (1965) for causation, which include strength of association, consistency, specificity, temporality, biological gradient, plausibility, coherence, experimental evidence, and analogy. The adaptation of the Bradford Hill criteria led to the Surgeon General’s criteria (1964 and 1982) to assess causality (342-345). Another development occurred in 1976, when Rothman conceived the causal pie model, which posits that the causal mechanism (the relation between cause and effect) results from multiple interacting component causes or exposures (346). Today, causal inference is largely viewed as an exercise in the measurement of the causal effect of an exposure rather than as a process to be evaluated based on criteria or guidelines (339). This exercise mainly involves a) defining a clear causal question, even if one thinks it is unlikely that the estimates can be interpreted as causal, b) choosing causal diagrams, statistical parameters and analytical techniques that help address the causal question, and c) specifying the assumptions under which the statistical parameters we estimate would correspond with the answer to the causal question.

2.6.1        Causal inference and causal effects under the counterfactual/potential outcomes framework

Apart from substantive knowledge on the outcome and exposures, causal effect estimation requires appropriate causal models/frameworks, causal diagrams depicting assumed relationships between variables, and rigorous analytical techniques based on the study design (347). A statistical association between two random variables X and Y could reflect five possibilities: a) X causes Y, b) Y causes X, c) X and Y have a common cause (confounding), d) random fluctuation, and e) the association was induced by conditioning on a common effect of X and Y (347). Given these possibilities, the statistical association between exposure X and outcome Y can be defined as causal if changing the value of X would make a difference in the value of Y, provided nothing else temporally prior to or simultaneous with X changed (347). The measurement of a causal effect fundamentally requires contrasting the value of Y in the presence of a temporally prior variable X (observed) with the potential value of Y in the absence (i.e., any other value) of X (counter to the fact, unobserved) (332). This understanding, known as the counterfactual concept, originally conceived by Scottish philosopher David Hume in the 18th century, gave rise to the counterfactual/potential outcomes model for causal inference (348). Here, a counterfactual/potential outcome is defined as the outcome Y that one would have had, possibly contrary to the fact, under an exposure other than X (348). In an empirical setting, an individual is either exposed or unexposed and one potential outcome is always missing. Hence, although it is not possible to ascertain the causal effect of an exposure on an outcome for an individual, the counterfactual model allows the estimation of the average of individual causal effects in a target population as a parameter in a statistical model using observed data (349).
However, this estimation is only possible if three basic identifiability assumptions are met (349): exchangeability, counterfactual consistency and positivity. Two study groups are exchangeable if the probability of the outcome in one group is the same as that of the second group, had the exposures been reversed (i.e., the potential outcome is independent of the exposure). In a well-designed randomized clinical trial (RCT), the exchangeability assumption is met as participants are randomized into groups, which essentially ensures that the exposure is independent of other covariates and the potential outcomes. Counterfactual consistency is the rule that allows the potential outcome to be linked to the observed outcome. It states that the potential outcome under the observed exposure is the observed outcome. This assumption is usually considered to be met if the exposure is well-defined and manipulable by intervention (e.g., dose of drugs, dose of a specific measure of a specific tobacco type rather than dose of tobacco smoking in general) and is violated in the case of under-defined, non-manipulable exposures such as social exposures (e.g., SEP). Positivity means that the probability of exposure at every level of all covariates in the model is above 0. Two types of positivity violations are: a) stochastic or chance positivity violation, in which no one is observed to be exposed at a certain level of a covariate because of a small sample size (e.g., genetic polymorphisms with low minor allele frequency), and b) deterministic positivity violation, in which the individual has no chance of being exposed (e.g., positive exposure to alcohol among non-alcohol consumers).
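The ideas of potential outcomes, consistency and positivity can be illustrated with a toy data set. The Python sketch below is purely hypothetical: it lists both potential outcomes for every person, which is never possible with real data, and the stratum labels, records and function name are invented for the example.

```python
# Each record: (covariate stratum, exposure received X,
#               potential outcome under X=1, potential outcome under X=0).
# Knowing both potential outcomes is a teaching device only.
people = [
    ("A", 1, 1, 0),
    ("A", 1, 1, 1),
    ("A", 0, 1, 0),
    ("B", 0, 0, 0),
    ("B", 0, 1, 0),
    ("B", 0, 0, 0),
]

# Average causal effect: mean of the individual contrasts Y(1) - Y(0).
average_causal_effect = sum(y1 - y0 for _, _, y1, y0 in people) / len(people)

# Consistency: the observed outcome equals the potential outcome under
# the exposure actually received; the other potential outcome is missing.
observed_outcomes = [y1 if x == 1 else y0 for _, x, y1, y0 in people]


def positivity_violations(records):
    """Return strata in which exposed or unexposed people are absent."""
    exposures_seen = {}
    for stratum, x, _, _ in records:
        exposures_seen.setdefault(stratum, set()).add(x)
    return [s for s, xs in exposures_seen.items() if xs != {0, 1}]


print(average_causal_effect)
print(observed_outcomes)
print(positivity_violations(people))
```

In this toy table stratum B contains no exposed individuals, so the effect of exposure within that stratum cannot be learned from the observed data alone. Whether such a gap is a chance or a deterministic violation depends on subject-matter knowledge, not on the data.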

2.6.2        Causal inference in observational studies

The counterfactual model of causal inference has largely dominated the scientific discourse on the estimation of causal effects in the health sciences since the last century. This model stimulated the development of the randomized trial study design by Ronald A. Fisher, and associated inferential statistics in the 1920s by Fisher, Jerzy Neyman and Egon Pearson (348). Because this study design achieves a valid substitution for counterfactual experience, and the randomization procedure ensures that exchangeability and positivity assumptions are met, RCTs are the study design of choice to estimate causal effects of well-defined manipulable exposures/interventions on outcomes. However, not all exposures can be manipulated under experimental conditions (e.g., social risk factors) or can be randomized and assigned among humans due to ethical concerns (e.g., smoking). This limitation of RCTs created a need to infer causality utilizing non-experimental observational study designs (e.g., case-control, longitudinal), which have been the mainstay of the majority of epidemiologic studies. However, the greater probability of violating identifiability assumptions in these designs made causal inference from observational studies a challenge. To address this challenge, Rubin (1974) developed the model into a general framework for causal inference that can be applied to non-experimental studies as well, and demonstrated the feasibility of causal inference utilizing these study designs (350).

2.6.3        Causal inference from study settings with complex time-dependent feedback loops

As discussed in previous sub-sections, exposures such as SEP and risk behaviours are dynamic and time-varying. These exposures measured at one time point can affect the exposure measured at subsequent time points. Along the way, they can also affect or be affected by other covariates that may bias the causal association between the exposure and the outcome. In other words, time-varying systems are subject to complex feedback loops that compound the challenge of causal inference. To overcome this problem, James Robins introduced three powerful analytical methods stimulated by the counterfactual framework, under the umbrella term of g-methods: the parametric g-computation formula (1986), g-estimation of structural nested models (1989), and inverse-probability weighted marginal structural models (1998) (351-354). These methods made the estimation of causal effects under time-varying feedback conditions achievable with longitudinal data. However, these techniques have not been implemented in a case-control study including a combination of time-varying exposures and confounders. Recent advancements in inferential statistics through the work of Tyler VanderWeele, Stijn Vansteelandt, Miguel Hernán and colleagues have also made the estimation of direct and indirect effects (mediation), as well as the attribution of effects to pathways underlying the causal association between exposures and outcome (e.g., 4-way decomposition), empirically possible with longitudinal data. However, their demonstration within a case-control study is limited, and software code for easy implementation in commonly used analytical software such as Stata is lacking (335, 355).

2.7       Directed Acyclic Graphs for causal inference

“Epidemiologists are acutely conscious of the danger of over-interpreting associations as causal, and it may be as a consequence of this that they sometimes avoid thinking about the potentially causal nature of associations between exposures of interest and potential confounders. It is all too easy to fall into a purely empirical approach to analysis, where covariates are added to the model one by one and retained if they seem to make a difference. Valid inference would be better served if, perhaps with the aid of causal diagrams, careful consideration was given to whether each factor should be in the model, particularly if the factor may have been caused in part by the exposure under study.” Weinberg (1993)

The scientific discourse on causal inference has been supported by a rapid growth in the last two decades in the availability and accessibility of concepts and tools that allow the rigorous and systematic assessment of whether statistical associations are causal. One such important methodological advancement has been the development and increasing adoption of causal diagrams or directed acyclic graphs (DAGs). The DAG is a graphical tool proposed by Judea Pearl and colleagues, which was introduced into the epidemiology literature in 1995 (356, 357). These graphs are diagrams with formal rules that mainly help in: a) designing epidemiologic studies, b) understanding the causal and non-causal relations among variables related to a specific substantive research question, and c) evaluating structural relationships that may pose a threat to study validity (e.g., confounding, selection/collider bias, information bias) (347). A confounder is most commonly defined as a variable that is ‘associated’ with both exposure and outcome and is not an intermediate variable between them. Adjusting for traditionally defined confounders when they are in fact non-confounders, as revealed through DAGs, would induce bias in the estimates (e.g., over-adjustment, M-bias) (347). DAGs are extensively used in this thesis to demonstrate underlying causal relations (e.g., time-varying framework, mediation), facilitate various analytical decisions (e.g., identification of confounders for assessment of total, direct and indirect effects) and explain potential biases (e.g., confounding, selection bias). To facilitate their understanding in the Methods sections of this thesis, I describe below the basic terminology, rules and concepts underlying DAGs, the structural definitions of confounding and selection bias, the steps to follow to estimate the total effect of an exposure on the outcome using DAGs, and the special case of time-varying confounding affected by prior exposure.

2.7.1        Basic DAG terminology

A DAG consists of a set of random variables (nodes or vertices), both measured (e.g., X, Y, Z in DAG 1) and unmeasured (typically represented by U as in DAG 1), with connected pairs of variables joined by a single arrow (directed edge).

The graph is directed as each arrow has only one arrowhead and points from one variable to one other variable. It is also acyclic as no variable can cause itself either directly or through other variables. An exception to this is a time-varying variable, where an arrow from the variable measured at one point in time (e.g., SEP in childhood) can point to the same variable measured at a subsequent point in time (SEP in early or late adulthood). Unlike traditional confounder diagrams, in which there is uncertainty in the meaning of the arrows used (i.e., whether the arrow represents association, prediction or causation), each arrow in a DAG depicts causation (as per the definition of cause provided in sub-section 2.5.1). The variable from which an arrow originates (parent) is a direct cause (causative or preventive) of the variable to which the arrowhead points (child, a direct descendant). In DAG 1, X is a direct cause of Y. Similarly, Z is a cause of X, and U of Z and Y. All common causes of any pair of variables must be included in a causal graph. The arrows do not specify the magnitude or direction of causation.

2.7.2        Paths in DAGs

Each path in a DAG goes between the exposure and the outcome without passing through a node more than once. A path can be open or closed. Open paths have an expected association flowing along them (e.g., path 1 in DAG 2).

Some paths are open naturally, that is, prior to the intervention of the researcher. Causal paths are naturally open paths in which all arrows point in the same direction from exposure to outcome either directly (e.g., X→Y in DAG 1) or through one or more intermediates (e.g., X→M→Y in DAG 2). All such causal open paths contribute to the total effect of the exposure on the outcome. Such paths can, however, be closed mistakenly by conditioning (restriction or matching by study design, stratification or covariate adjustment in statistical models during analysis) on the intermediates/mediators. For example, conditioning on M closes the open causal path between X and Y in DAG 2, creating a biased estimate of the total effect of X on Y. Conversely, certain non-causal paths can be left open naturally (e.g., paths 2 and 3 in DAG 2). Such paths structurally define confounding and include variables that are common causes of the exposure and the outcome (e.g., C2 in path 2 of DAG 2). These paths create a bias and the expectation of an association between exposure and outcome that is non-causal. This bias is termed confounding and can be removed by conditioning on any variable along non-causal naturally open paths (e.g., conditioning on either C1 or C2 or C3 or C4 can block the non-causal naturally open path 2).

Closed paths are those through which no association flows; they are considered blocked either naturally or by conditioning on variables along them (e.g., conditioning on a confounder makes an open non-causal path closed). For example, path 4 (X → M ← C4 → Y) in DAG 2 is blocked at M, which has arrows originating from C4 and X colliding on it. M is termed a collider on path 4. Conditioning on a collider can mistakenly open the blocked non-causal path, allowing a biased association to flow between exposure and outcome, and is considered a selection bias. It is to be noted that a collider is path-specific. Also, a variable can have different meanings depending on the path. For example, in DAG 2, M is a collider on path 4, but not on paths 1 (X → M → Y) and 3 (X ← C1 ← C2 → C3 → C4 → M → Y); M is a mediator on path 1 (X → M → Y), but not on paths 3 and 4; and M is a confounder on path 3, but not on paths 1 and 4.

DAGs also help us to structurally define selection bias as any bias occurring due to conditioning on the common effect of two variables, one of which is either the exposure or cause of exposure, and the other is the outcome or cause of the outcome (358).
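The path rules described in this sub-section can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the thesis: the edge list for DAG 2 is reconstructed from the path descriptions in the text (the layout of path 2, X ← C1 ← C2 → C3 → C4 → Y, is an assumption), and the function names are hypothetical.

```python
# Directed edges (parent, child) of DAG 2, reconstructed from the text.
# The layout of path 2 is an assumption made for this sketch.
EDGES = [("X", "M"), ("M", "Y"), ("C1", "X"), ("C2", "C1"),
         ("C2", "C3"), ("C3", "C4"), ("C4", "Y"), ("C4", "M")]


def descendants(node, edges):
    """Return all nodes reachable from `node` by following arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(start, end, edges):
    """List paths from start to end that ignore arrow direction
    and visit no node more than once."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    paths = []

    def walk(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbours[path[-1]]):
            if nxt not in path:
                walk(path + [nxt])

    walk([start])
    return paths


def is_causal(path, edges):
    """A path is causal if every arrow points from exposure towards outcome."""
    return all((a, b) in edges for a, b in zip(path, path[1:]))


def is_open(path, edges, conditioned=frozenset()):
    """Apply the blocking rules to every intermediate node on the path."""
    conditioned = set(conditioned)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # A collider blocks the path unless it (or a descendant) is conditioned on.
            if not ({node} | descendants(node, edges)) & conditioned:
                return False
        elif node in conditioned:
            # A conditioned non-collider blocks the path.
            return False
    return True


for p in simple_paths("X", "Y", EDGES):
    kind = "causal" if is_causal(p, EDGES) else "non-causal"
    state = "open" if is_open(p, EDGES) else "closed"
    print(" - ".join(p), "|", kind, "|", state)
```

Calling `is_open(path, EDGES, {"M"})` on the same paths shows the two consequences of conditioning on M described above: the causal path through M becomes closed, while the path on which M is a collider becomes open.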

2.7.3        Steps to estimate the total effect of an exposure on the outcome using DAGs

To estimate the total causal effect of an exposure on an outcome, five steps should be followed: 1) draw the best DAG; 2) find all the paths between the exposure and the outcome; 3) separate the causal and non-causal paths; 4) separate open and closed paths; and 5) find the minimally sufficient set(s) of conditioning variables. A sufficient set is a set of variables, conditioning on which leaves all causal paths open and closes all non-causal paths; a minimally sufficient set is a sufficient set of which no proper subset is sufficient.

In the case of DAG 1, the minimally sufficient set of conditioning variables needed to estimate the total effect of X on Y will be {Z}. Although U, an unmeasured variable, is also on the confounding path, conditioning on Z blocks this path, so that U no longer confounds the effect of X on Y. However, this same statistical model with Y fitted on X and Z cannot be used to identify the total effect of Z on Y. This is because, according to DAG 1 (Figure 3), X would be a mediator between Z and Y, and having X in the model would block the causal path between Z and Y. In other words, DAGs inform us if a separate statistical model is needed to estimate the total effect of each exposure. For DAG 2, multiple minimally sufficient sets are possible, e.g., {C1} or {C2} or {C3} or {C4}. Figure 6 depicts conditioning on C3, leaving only the causal path 1 open. The selection of any of these 4 variables for conditioning depends on whether the variables have missing data, measurement error or specification error. Although M in DAG 2 is on the confounding path 3, conditioning on it will close the only open causal path between X and Y (M is a mediator on path 1) and will open a blocked non-causal path (path 4).
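Step 5 can be illustrated with a brute-force search over candidate conditioning sets. The sketch below is an assumption-laden illustration rather than the method used in the thesis: the edges of DAG 2 are reconstructed from the text (path 2 is assumed to be X ← C1 ← C2 → C3 → C4 → Y), all function names are hypothetical, and exhaustive enumeration is only practical for small graphs.

```python
from itertools import combinations

# DAG 2 as reconstructed from the text (the layout of path 2 is an assumption).
EDGES = {("X", "M"), ("M", "Y"), ("C1", "X"), ("C2", "C1"),
         ("C2", "C3"), ("C3", "C4"), ("C4", "Y"), ("C4", "M")}
NODES = {node for edge in EDGES for node in edge}


def descendants(node):
    """Return all nodes reachable from `node` by following arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in EDGES:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def paths(start, end, seen=()):
    """Yield simple paths from start to end, ignoring arrow direction."""
    seen = seen + (start,)
    if start == end:
        yield seen
        return
    for a, b in EDGES:
        for here, there in ((a, b), (b, a)):
            if here == start and there not in seen:
                yield from paths(there, end, seen)


def is_causal(path):
    return all(pair in EDGES for pair in zip(path, path[1:]))


def is_open(path, conditioned):
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        if (prev, node) in EDGES and (nxt, node) in EDGES:  # collider
            if not ({node} | descendants(node)) & conditioned:
                return False
        elif node in conditioned:  # conditioned non-collider
            return False
    return True


def is_sufficient(exposure, outcome, conditioned):
    """Sufficient: every causal path is open and every non-causal path is closed."""
    return all(is_open(p, conditioned) == is_causal(p)
               for p in paths(exposure, outcome))


def minimal_sufficient_sets(exposure, outcome, measured):
    """Keep the sufficient sets that have no sufficient proper subset."""
    sufficient = [set(c) for size in range(len(measured) + 1)
                  for c in combinations(sorted(measured), size)
                  if is_sufficient(exposure, outcome, set(c))]
    return [s for s in sufficient if not any(t < s for t in sufficient)]


result = minimal_sufficient_sets("X", "Y", NODES - {"X", "Y"})
print(result)
```

Under the assumed edge list, the search returns the four single-variable sets named in the text, and any set containing M is rejected because it closes the causal path.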

2.7.4        Time-varying confounding affected by prior exposure

Bias due to confounding and selection bias is compounded while attempting to estimate the effect of a time-varying exposure under conditions of time-varying confounding affected by prior exposure (i.e., covariates can act as both confounders and mediators) (358). A hypothetical time-varying situation involving SEP in childhood (CH SEP), early adulthood (EAH SEP), confounders measured during childhood (C1) and early adulthood (C2) periods under a specific temporal relation with respect to outcome (oral cancer) is depicted in Figure 7.

Any method that involves conditioning on C2a to estimate the magnitude of the blue lines may induce bias by creating a non-causal association between CH SEP and oral cancer through the path CH SEP → C2a ← C1 → oral cancer (i.e., opening this naturally blocked non-causal path). However, not adjusting for C2a leaves open the non-causal path EAH SEP ← C2a ← C1 → oral cancer, and thus results in a confounded association between EAH SEP and oral cancer. This situation arises because the effect of EAH SEP on oral cancer is confounded by C2a, and C2a is affected by CH SEP (prior exposure); in other words, time-varying confounding is affected by prior exposure. Such situations can only be addressed using the g-methods described in sub-section 2.6.3.

3         Rationale and study objectives to be written

4         Methods

This dissertation comprises three manuscripts based on three empirical studies, each addressing one specific objective of this work. These empirical studies utilized data from an international collaborative study. Although studies at each site had a similar overall study design and data collection procedures (as they followed the same study protocol), the distribution and types of risk factors at each study site, variables used in each manuscript and statistical analyses performed to achieve the objectives were different. The overall study design, data and sample collection procedures, as well as specific methodologies for each manuscript are explained in the sub-sections below.

4.1       Overall study design

The Head and Neck Cancer (HeNCe) Life study is an international multi-center hospital based case‐control study investigating the aetiology of SCCHN focusing on social, psychosocial, lifestyle, biological and genetic factors, using the life-course framework. This collaborative study was conducted in Canada, India and Brazil. Manuscript I uses data from the Indian site where the incidence of SCCHN, especially oral cancers, is on the rise, and where large social inequalities have been reported (359, 360). Manuscripts II and III rely on data from the Canadian site, where genetic data were available and smoking and alcohol have been the strongest risk factors for SCCHN. Although study sites followed similar protocols, study instruments were culturally adapted through multiple pilot studies.

4.2       Target populations and samples

The target populations for the studies were male and female adult residents of the Malabar region of Kerala in India, and of the Greater Montreal area in Canada. The eligibility criteria were: (i) English, French or Malayalam (Kerala native language) speaking; (ii) to be born in India or Canada; and (iii) to live within a 150 or 50 km radius from the recruiting hospitals in Calicut (Kerala) and Montreal, respectively. In addition, the participants should not have had any: (iv) previous history of any type of cancer or cancer treatment; (v) mental or cognitive disorders; (vi) communication problems (e.g., inability to speak because of lesions); and (vii) diseases related to immuno‐compromise (e.g., HIV/AIDS). Lastly, participants who were too sick or in palliative care were not eligible to participate.

In India, cases (N=350) were recruited from the oral pathology clinic at the Government Dental College, and from the cancer outpatient unit of the Government Medical College, Calicut, Kerala (both institutions catering to the same catchment area) between 2008 and 2012. Controls (N=371) were recruited from other outpatient clinics in these institutions during the same study period.

In Canada, cases (N=460) were recruited from Ear, Nose and Throat (ENT) and radio‐oncology clinics of four major referral hospitals in Montreal (Jewish General Hospital, Montreal General Hospital, Royal Victoria Hospital, and Notre‐Dame Hospital) between 2005 and 2013. Controls (N=458) were recruited from other clinics in the same hospitals.

4.3       Case definition and selection

Incident cases diagnosed with stage I to IV histologically confirmed squamous cell carcinomas of the head and neck region, which included cancers of the tongue, gum, floor of the mouth, and other locations in the mouth, oropharynx, hypopharynx and larynx (C01‐C06, C09, C10, C12‐C14, and C32, under the International Statistical Classification of Diseases, 10th Revision, Version: 2010), were eligible for this study. Lip (C00), salivary gland (C07‐08) and nasopharyngeal (C11) cancers were excluded due to their different aetiologies (361-363). For logistic reasons, only oral cancer cases (C01-C06, and C09 under the International Classification of Diseases, 10th Revision, Version: 2010) were recruited at the Indian site.

4.4       Control definition and selection

Non-cancer controls were frequency matched to cases by 5-year age group and sex. They were randomly selected from several outpatient clinics in the same hospitals, among patients with diseases from a list of non‐chronic conditions that were not documented to be strongly associated with tobacco and alcohol consumption, to mitigate Berkson’s bias (364). The participation of controls from each clinic was restricted to less than 20% to limit overrepresentation of a single diagnostic/disease group (365). The genetic profile of the participants was not known during recruitment. The list of clinics from which control participants were recruited and the distribution of controls at the Indian and Canadian sites are given in Table 4.

4.5       Ethics approval and informed consent

4.6       Data collection

The data collection procedures consisted of (i) questionnaire-based interviews and (ii) biological sample collection.

4.6.1        Questionnaire based interviews

or Biological sample collection

Following the interviews, biological samples were collected from each participant to perform genetic and HPV analyses (366). SNPs associated with tobacco and alcohol metabolism were the main exposures in manuscripts II and III which used the Canadian data. In addition, exposure to HPV was used as a potential confounder in these manuscripts. Hence, although sample collection was performed at both study sites, this sub-section focuses on genetic analysis at the Canadian site.

Oral epithelial cells, a reliable source of genetic material and HPV DNA, were collected through a validated and reliable protocol using mouthwash and brush biopsies (366-368). The latter was used to collect epithelial cells from the lesion (in cases) as well as from normal mucosa in the oral cavity and oropharyngeal areas (both cases and controls) (details of sample collection are available in Appendix V) (368). Both the mouthwash and brush biopsy methods are simple, non‐invasive and inexpensive, and have a high acceptance rate among participants. Also, these methods provide high yields of both human DNA and HPV DNA after purification (366, 369-372). Following collection, the samples were stored at 4 °C as soon as possible and at ‐20 °C at the sample analysis site. For the Canadian participants, genetic analyses and HPV detection were performed at laboratories at the Albert Einstein College of Medicine in New York and the CHUM in Montreal, respectively.

4.6.2        Genotyping analysis for DNA polymorphism

To idHPV detection

HPV DNA detection was performed using a standardized PCR protocol (373, 374). The samples were centrifuged (at 1000 x g for 10 minutes), and the DNA was extracted from the pellet with a small quantity of supernatant by a modified Gentra Puregene protocol (375). The purified DNA underwent PCR amplification. To ascertain the integrity of the DNA and that there was sufficient sample available for PCR analysis, beta-globin testing was performed. An absence of beta- and 84 (376-378).

4.7       Data quality control and management

4.8       Measures – Manuscript I

In Manuscript I, we investigated the association between SEP collected at three periods of the participants’ lives and oral cancer risk using the accumulation, critical period and social mobility life course models. The dependent variable (oral cancer), main exposure (SEP) and potential confounders are described below.

4.8.1        Dependent (outcome) variable – Oral cancer status

4.8.1.1       Asset/wealth index and principal component analysis (PCA)

The asset/wealth index was created from a list of questions on various assets (housing characteristics, durable assets and access to services) available at the participant’s longest place of residence during three time periods: childhood (0-16 years), early adulthood (17-30 years), and late adulthood (above 30 years). As detailed in Appendix VII, Table 1, information on nine assets/items from childhood, eleven from early adulthood and twelve from late adulthood was used.

An issue in using housing indicators (which are all correlated) is that each of them could have a different relationship with SEP and may not be sufficient to differentiate household SEP when used individually (283). Hence, different indicators are aggregated to derive a uni-dimensional measure that can be further categorized to reflect different levels of SEP. Summing up the indicators is a common practice (379). However, this assumes an equal weight for each indicator. In this study, we overcame these challenges using principal component analysis (PCA), a data reduction method that is increasingly employed (e.g., in World Bank Demographic and Health Surveys data sets) for creating uni-dimensional SEP measures from data on different assets (283, 284, 292, 293).

Principal component analysis

With PCA, multiple original variables can be summarized with relatively few dimensions that capture the maximum possible information (variation) from the original variables. Mathematically, from an initial set of n correlated variables (original), PCA creates uncorrelated components, where each component is a linear weighted combination of the original variables (380). For example, if $X_1, X_2, \dots, X_n$ are n original indicators, then the first component ($PC_1$) is given by

$$PC_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1n}X_n$$

and the mth component is given by

$$PC_m = a_{m1}X_1 + a_{m2}X_2 + \dots + a_{mn}X_n$$

where $a_{mn}$ is the weight for the mth principal component and the nth variable.

Since PCA aims to maximize the variance, it is sensitive to scale differences in the original variables. For example, in our study, responses to some of the questions on housing were nominal (e.g., type of material for the floor, roof, wall) while others were binary (e.g., presence or absence of radio, clock, TV) or categorical. Hence, the original variables must be standardized and converted to a correlation matrix before performing a PCA (381). The weights for each component are given by the eigenvectors of the correlation matrix, and the variance of each component is given by the eigenvalue of the corresponding eigenvector (380). The components are arranged so that the first component explains the largest possible amount of variation in the original data. The second component is uncorrelated with the first and explains a smaller amount of additional variance, unexplained by the first component. Subsequent components are uncorrelated with the first and second components and explain smaller and smaller additional proportions of the variation of the original variables (380).
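The extraction of the first component can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the analysis performed in the study: the data are simulated, a Pearson correlation matrix is used in place of the tetrachoric matrix described later, and power iteration is swapped in for a full eigen-decomposition (in practice a routine such as `numpy.linalg.eigh` would normally be used). All function names are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def standardise(column):
    """Rescale one indicator to mean 0 and standard deviation 1."""
    centre, spread = mean(column), stdev(column)
    return [(value - centre) / spread for value in column]


def correlation_matrix(z_columns):
    """Pearson correlation matrix of already-standardised columns."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]


def first_component(corr, iterations=500):
    """Approximate the leading eigenvector (weights) and its eigenvalue
    by power iteration."""
    weights = [1.0] * len(corr)
    for _ in range(iterations):
        product = [sum(r * w for r, w in zip(row, weights)) for row in corr]
        norm = math.sqrt(sum(p * p for p in product))
        weights = [p / norm for p in product]
    product = [sum(r * w for r, w in zip(row, weights)) for row in corr]
    eigenvalue = sum(w * p for w, p in zip(weights, product))
    return weights, eigenvalue


# Simulated data (an assumption): four indicators driven by one latent factor.
random.seed(0)
latent = [random.gauss(0, 1) for _ in range(200)]
columns = [[value + random.gauss(0, 0.5) for value in latent] for _ in range(4)]

z_columns = [standardise(column) for column in columns]
corr = correlation_matrix(z_columns)
weights, eigenvalue = first_component(corr)

# PC1 score per participant: weighted sum of the standardised indicators.
scores = [sum(w * z for w, z in zip(weights, row)) for row in zip(*z_columns)]

# The eigenvalues of a correlation matrix sum to the number of variables.
proportion_explained = eigenvalue / len(corr)
```

The variance of the resulting scores equals the eigenvalue of the first component, which is why the eigenvalue divided by the number of variables gives the proportion of variance explained.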

4.8.1.2       Creating the asset index as a measure of SEP using PCA

To standardise the original asset indicators, first, responses to all questions on assets were binary coded into advantageous and disadvantageous SEP based on the type of material used and facilities available, according to the context of Kerala, India. Next, a tetrachoric correlation matrix (381) was created from these binary variables for each life period (Appendix VII, Tables 2, 3 and 4). If any variable correlated highly (|r| ≥ 0.8) with other variables, only one variable from the pair of correlated variables was retained for further analysis. In addition, variables were excluded in a stepwise manner until a factorable correlation matrix with a Kaiser-Meyer-Olkin (KMO) value > 0.7 was attained for each period separately (293). Assets with low test-retest reliability (intra-class correlation) were also removed (Appendix VI, Table 1). The final variables retained in the matrix for each period were: Childhood: crowding, floor, wall, window, piped water, bath, clock; KMO=0.832; Early adulthood: crowding, wall, window, piped water, clock, bicycle; KMO=0.771; Late adulthood: crowding, wall, window, piped water, clock, radio, television, phone; KMO=0.801. A PCA was conducted on the final correlation matrices to assess the dimensionality of the assets, and the component that explained the maximum variance in each life period was extracted (the first component explained 65% of the variance in childhood, and 64% each in early and late adulthood) (283). Continuous scores were predicted from these components. The continuous score for each life period was then dichotomized using the median of the distribution as the cut-off, generating binary variables representing the SEP exposure (0=advantageous SEP, 1=exposure to disadvantageous SEP) for the childhood, early adulthood and late adulthood periods of life.

4.8.1.3       SEP exposure measure for critical period models

The binary variables (0=advantageous SEP, 1=disadvantageous SEP) representing SEP in childhood, early adulthood and late adulthood were used as the main exposures in the critical period models representing each of these life periods.

4.8.1.4       SEP exposure measure for the accumulation model

A summation of the binary variables representing SEP in each life period generated a variable with four categories reflecting an increasing number of periods of exposure to disadvantageous SEP. This variable represented the accumulation model. The variable was coded as: 0=0 periods: participants who were in advantageous SEP in all 3 periods of life; 1=1 period: participants who were exposed to disadvantageous SEP in any 1 period and non-exposed in the other 2 periods of life; 2=2 periods: participants who were exposed to disadvantageous SEP in any 2 periods and non-exposed in the other period of life; and 3=3 periods: participants who were exposed to disadvantageous SEP in all three periods of life.
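The coding of the accumulation variable amounts to a simple sum of the three binary indicators, which can be sketched as follows. The function and argument names are hypothetical and chosen for illustration only.

```python
def accumulation_score(childhood, early_adulthood, late_adulthood):
    """Number of life periods (0 to 3) spent in disadvantageous SEP.

    Each argument is the binary SEP indicator for that period
    (0 = advantageous SEP, 1 = disadvantageous SEP).
    """
    periods = (childhood, early_adulthood, late_adulthood)
    if any(period not in (0, 1) for period in periods):
        raise ValueError("SEP indicators must be coded 0 or 1")
    return sum(periods)


# Example: disadvantaged in childhood and late adulthood only.
example = accumulation_score(1, 0, 1)
```

Because the score only counts periods, it treats exposure in any period as interchangeable, which is the defining feature of the accumulation model as described here.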

4.8.1.5       SEP exposure measure for social mobility models

Two models were tested for mobility: childhood to early adulthood mobility, and early to late adulthood mobility.

Childhood to early adulthood mobility – The SEP measure representing this model was a 4-category variable. Stable advantageous SEP (0, 0): Participants who maintained a stable advantageous SEP in both childhood and early adulthood were coded as 0. Upward mobility (1, 0): Participants who were exposed to a disadvantageous SEP in childhood but went on to attain an advantageous SEP in early adulthood were coded as 1. Downward mobility (0, 1): Participants who had an advantageous SEP in childhood but disadvantageous SEP in early adulthood were coded as 2. Stable disadvantageous SEP (1, 1): Participants who maintained a stable disadvantageous SEP in both childhood and early adulthood were coded as 3; all categories were assigned irrespective of their SEP in late adulthood.

Early to late adulthood mobility – A similar strategy was adopted to create the 4-category SEP variable representing social mobility between early and late adulthood by considering participants’ SEP in these 2 periods of life.
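The four mobility categories can be expressed as a lookup on the pair of binary SEP indicators for the two periods being compared. The names below are hypothetical; the codes follow the categories listed above.

```python
# Keys are (SEP in earlier period, SEP in later period),
# with 0 = advantageous SEP and 1 = disadvantageous SEP.
MOBILITY_CODES = {
    (0, 0): 0,  # stable advantageous SEP
    (1, 0): 1,  # upward mobility
    (0, 1): 2,  # downward mobility
    (1, 1): 3,  # stable disadvantageous SEP
}


def mobility_category(earlier, later):
    """Return the 4-category mobility code for two consecutive life periods."""
    try:
        return MOBILITY_CODES[(earlier, later)]
    except KeyError:
        raise ValueError("SEP indicators must be coded 0 or 1") from None


# Childhood to early adulthood: disadvantaged, then advantaged.
upward = mobility_category(1, 0)
```

The same function serves both mobility models: it is called with the childhood and early adulthood indicators for the first model, and with the early and late adulthood indicators for the second.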

4.8.2        Covariates used as potential confounders

One of the main challenges addressed in manuscript I is the nature (both static and dynamic) of potential confounders and their temporal ordering with respect to the time-varying exposure of SEP across three time periods and oral cancer. We identified both time-invariant factors [age, sex, caste (i.e., a hierarchy in the Hindu religion based on occupation) and education] and time-varying factors (cigarette smoking, bidi smoking, paan chewing and alcohol consumption) as potential confounders.

4.8.2.1 Baseline confounders (time-invariant)

Age, sex and caste

Age and sex are strong risk factors for oral cancers. They can also determine an individual’s SEP at different periods of life. Hence, to mitigate confounding, controls were frequency matched to cases based on 5-year age group and sex. However, there might exist differences within each age group that may result in residual confounding (333). Furthermore, age and sex stand for unknown or unmeasured potential confounders that may determine both the SEP and cancer status of an individual. Hence, these variables were further adjusted for in the statistical analysis. Age was used as a continuous variable and sex was binary coded (0=females, 1=males). Caste is a hierarchy in the Hindu religion based on occupation, and may determine an individual’s SEP as well as the outcome of cancer. In this study, we collected details on forward caste, backward caste, other backward caste, scheduled caste, scheduled tribe and others, as classified by the government of Kerala. We adjusted for this variable using a categorical variable (0=higher caste, 1=middle caste comprising backward caste, 2=other backward caste/scheduled caste/scheduled tribe/others).

Education (time-invariant)

As discussed previously, several indicators are used to measure SEP and they may capture different dimensions of this complex construct (please refer to sub-section 2.3.4.1). Education may capture a different dimension of SEP than the wealth index. Also, it is an independent risk factor for oral cancers, and the education an individual attains (education is mostly stable after childhood or adolescence) may determine their asset/wealth index in adulthood. Detailed information regarding education was collected from each participant (please refer to the questionnaire page…. Appendix… ). We used the number of years of formal education in the form of a binary variable (0: high education; 1: low education) as an indicator. However, the measure of education is subject to bias if the differences in birth cohorts of participants from a range of age groups included in a study are unaccounted for (285, 382, 383). With respect to the Kerala study site, considerable educational and sociopolitical reforms took place in the mid-1950s, which changed the landscape of education in this state of India (as noted in sub-section 2.3.4.2, education). This information was used to mitigate bias in the categorization of education. The participants were first divided into 2 groups (older: those born before 1950; younger: those born after 1950). For the older cohort, 0-3 years of formal education was considered a low level, and 4 years and above was considered a high level of education. For the younger cohort, 8 years of formal education was used as the cut-off for this binary categorization.

4.8.2.2 Time-varying confounders

Tobacco smoking

We used

Paan / betel quid chewing

Similar to tobacco

Alcohol consumption

4.8.3        Temporal relationship of confounders in relation to SEP in three periods of life and oral cancer

The temporal ordering of exposures and covariates with respect to the outcome is imperative when testing life-course models (323). Furthermore, to estimate causal effects (or when applying frameworks for causal inference or associated analytical techniques), the precedence of the causal factor in relation to its effect is an absolute necessity. Whereas temporal ordering is easier in cohort studies (refer to sub-section 2.5 observational study designs), it is a challenge in case-control studies. However, our detailed and comprehensive data collection methods and techniques to handle confounders (as described in sub-section 4.9.3.2) in our life-course based study allowed us to achieve an approximate temporal ordering of variables with respect to SEP in several periods of life and oral cancer diagnosis. As shown in the causal diagram in Figure 9, the vector C0 represented the time-invariant covariates such as age, sex and caste that temporally precede every other variable under consideration. The vector C1 represented covariates that were measured for the period between 0-16 years of age. We included education in C1 because it is usually attained during this period, and could causally affect the subsequent life events of an individual. Other variables represented in C1 and subsequent vectors C2a, C2b, C3a and C3b were time-varying risk behaviours (cigarette, bidi, paan and alcohol use). As mentioned previously in sub-section 4.9.3.2 on confounders, the cumulative measures of these risk behaviours were calculated for 0-16 years, 17-23 years, 24-30 years, 31-50 years, and above 50 years. Risk factors collected for the period between 0-16 years might be an effect rather than a cause of SEP between 0-16 years of age and were included in C1. However, we suspected that the association between early adulthood SEP (17-30 years) and habits captured during 17-30 years was bi-directional, that is, SEP and habits can influence each other causally. Bidirectional arrows cannot occur in causal structures at the same time point (347, 356, 384). To overcome this, we split the habits in this period into vectors C2a (17-23 years) and C2b (24-30 years). This was done assuming that C2a would be affected by C0, C1 and CH SEP, but would influence part of SEP in 17-30 years and other subsequent variables, and that C2b would be affected by C0, C1, C2a, CH SEP and EAH SEP. The choice of cut-point (i.e., 23 years) was arbitrary. A similar strategy was used with risk behaviours recorded for above 30 years of age. Risk behaviours recorded during the period 31-50 years of age were represented by C3a, and those recorded for above 50 years (the eldest participant was 88 years old) were represented by C3b. This approximate temporal ordering identified complex feedback loops between the variables under study, as any given variable/vector represented in Figure 2 had an arrow pointing from it to any other variable/vector temporally subsequent to it.

4.9 Measures- Manuscript II

In Manuscript II, we considered the interactive effects of the SNPs investigated in this study and smoking on the risk of SCCHN. Hence, the dependent variable (SCCHN), main exposures (SNPs and smoking) and associated potential confounders are described below.

4.9.1 Dependent (outcome) variable – SCCHN

SCCHN cases were selected as described in section 3.3. Only histologically confirmed squamous cell carcinomas were included in the study. The outcome variable was treated as binary, with the presence of any oral or pharyngeal or laryngeal cancers coded as 1 (cases) and the absence of all coded as 0 (controls).

4.9.2 Independent (main exposure) variable ‐ Genetic variants

The genetic variants associated with CYP450 genes coding for phase I XMEs are involved in the bio-activation of a variety of tobacco smoke chemicals into electrophilic reactive moieties with carcinogenic potential. The variants associated with GST genes encoding phase II enzymes are involved in the detoxification of reactive metabolites of phase I biotransformation. The characteristics of these SNPs and their association with SCCHN have been described in detail in sub-section 2.3.2, and Tables 2 and 3. In general, all genetic exposures were considered as binary variables, with the category coded 0 taken as the reference. The genotypes were collapsed into two categories mainly because the minor allele frequencies of these SNPs in the Caucasian population (except those related to GST enzymes) were low. Specific details on the categorization of these genetic measures are given below.

4.9.2.1 Single nucleotide polymorphisms in CYP and GST genes

Dominant models of inheritance were tested for CYP1A1*2A, *2C, CYP2E1c2, CYP2A6*2 and GSTP1105Val. The dominant model assumes that the mere presence of the variant allele, in either the homozygous variant or the heterozygous variant/wild genotype, is enough to mask the effect of the wild allele. Hence, carriers of these variant alleles, considered as the exposed group (assuming equal risk for the homozygous variant and heterozygous wild/variant groups), were compared with non-carriers, assumed unexposed. Thus, CT/CC genotypes for CYP1A1*2A, AG/GG genotypes for CYP1A1*2C, GC/CC genotypes for CYP2E1c2, AT/AA genotypes for CYP2A6*2 and AG/GG for GSTP1105Val were respectively coded 1 (carriers, exposed), and TT genotypes for CYP1A1*2A, AA genotypes for CYP1A1*2C, GG genotypes for CYP2E1c2, TT genotypes for CYP2A6*2 and AA for GSTP1105Val were respectively coded 0 (non-carriers, unexposed).
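The dominant coding described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `code_dominant` and the `SNP_ALLELES` table are hypothetical, and the wild/variant alleles are read off the genotype lists given in the text.

```python
# Hypothetical lookup: SNP -> (wild allele, variant allele), inferred from
# the genotype lists in the text (e.g. CT/CC vs TT for CYP1A1*2A).
SNP_ALLELES = {
    "CYP1A1*2A": ("T", "C"),
    "CYP1A1*2C": ("A", "G"),
    "CYP2E1c2": ("G", "C"),
    "CYP2A6*2": ("T", "A"),
    "GSTP1105Val": ("A", "G"),
}


def code_dominant(snp: str, genotype: str) -> int:
    """Return 1 for carriers of the variant allele, 0 for non-carriers."""
    wild, variant = SNP_ALLELES[snp]
    alleles = genotype.strip().upper()
    if len(alleles) != 2 or any(a not in (wild, variant) for a in alleles):
        raise ValueError(f"unexpected genotype {genotype!r} for {snp}")
    # Dominant model: one copy of the variant allele is enough to be exposed.
    return 1 if variant in alleles else 0
```

Checking for the variant allele, instead of matching whole genotype strings, makes the coding insensitive to allele order (CT and TC are treated alike).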

4.9.2.2       Copy number variants in CYP2D6 and GSTM1 genes

In this study, we identified 1 to 9 copies of the CYP2D6 non-functional null allele in our sample. Individuals with a lower number of copies of this null allele are hypothesized to be at relatively higher risk for SCCHN than those with a higher number of copies. Based on the distribution of these CNVs in this study, this genetic exposure was coded as binary: 1 to 2 copies were considered exposed (coded 1) and 3 to 9 copies unexposed (coded 0). For GSTM1, we identified 0 to 3 copies. To ensure sufficient numbers in each category, the GSTM1 CNV classification was limited to Null (0 copies, coded 1) and Non-null (1-3 copies, coded 0).
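As a minimal sketch of the binary CNV coding just described (the function names are hypothetical, and the range checks simply reflect the copy numbers reported for this sample):

```python
def code_cyp2d6_cnv(copies: int) -> int:
    """CYP2D6: 1-2 copies -> exposed (1); 3-9 copies -> unexposed (0)."""
    if not 1 <= copies <= 9:
        raise ValueError("copy number outside the 1-9 range reported in the text")
    return 1 if copies <= 2 else 0


def code_gstm1_cnv(copies: int) -> int:
    """GSTM1: Null (0 copies) -> 1; Non-null (1-3 copies) -> 0."""
    if not 0 <= copies <= 3:
        raise ValueError("copy number outside the 0-3 range reported in the text")
    return 1 if copies == 0 else 0
```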

4.9.3        Independent (main exposure) variable ‐ Pack-years of cigarette smoking

To incorporate the effect of correlated measures such as frequency and duration of smoking, and to avoid issues related to collinearity between these measures during statistical analysis, it is recommended to use cumulative measures of smoking in studies investigating the impact of this risk behaviour on cancers (385-387). Hence, in this study, we used cigarette pack-years to represent tobacco smoking history (388). Pack-years was computed as the product of the average smoking intensity over the lifetime and the total duration of smoking at the time of diagnosis for cases and at the time of interview for controls.

Cigarette pack-years was derived from information on participants’ history of cigarette (filtered, unfiltered or hand-rolled), cigar and pipe smoking along the life-course, using a method similar to that described in sub-section 4.9.3.2, Tobacco smoking. Hand-rolled cigarettes, cigars and pipes were first converted to their commercial cigarette equivalent (20 commercial cigarettes = 4 hand-rolled cigarettes = 4 cigars = 5 pipes = 1 pack of commercial cigarettes) (79). This information was used to derive the total duration of smoking and the average number of packs smoked per day over the lifetime. The product of these two generated a continuous measure of pack-years of cigarettes smoked over the lifetime. Certain participants had a combination of active periods of smoking and periods of abstinence over their life-course. Periods of abstinence were excluded when calculating total duration, as we assumed a very low probability of misclassification (including or excluding such periods gave similar results [e.g., total duration of smoking including periods of abstinence (mean=32.25 years ±15.45) and excluding such periods (mean=31.47 years ±15.46)]). Furthermore, from information on time since smoking cessation (age at interview minus age at cessation), we identified that participants who stopped smoking ≤ 2 years prior to recruitment had a higher risk for the outcome than actual current smokers (time since cessation=0) (Manuscript II, Supplemental material 1). Hence, to minimize the probability of protopathic/reverse causality bias, we used a cut-off of 2 years prior to interview to define ex-smokers, and excluded details of any exposure (e.g., frequency, duration) during this period from pack-year calculations.
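The pack-year derivation above can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the study's code: the function name, the tuple layout of a smoking period and the product labels are hypothetical, while the conversion factors and the 2-year exclusion window are taken from the text.

```python
# Units of each product equivalent to one pack of commercial cigarettes,
# as stated in the text (20 commercial = 4 hand-rolled = 4 cigars = 5 pipes).
UNITS_PER_PACK = {"commercial": 20, "hand_rolled": 4, "cigar": 4, "pipe": 5}


def pack_years(periods, reference_age, lag_years=2.0):
    """Return (total duration, average packs/day, pack-years).

    periods: iterable of (start_age, end_age, product, units_per_day) for
             active smoking periods only, so abstinence is excluded by design.
    reference_age: age at diagnosis (cases) or at interview (controls).
    lag_years: exposure in the last `lag_years` before reference_age is
               dropped (hypothetical parameter name; 2 years in the text).
    """
    cutoff = reference_age - lag_years
    total_years = 0.0
    total_pack_years = 0.0
    for start_age, end_age, product, units_per_day in periods:
        years = min(end_age, cutoff) - start_age
        if years <= 0:
            continue  # the period lies entirely inside the excluded window
        packs_per_day = units_per_day / UNITS_PER_PACK[product]
        total_years += years
        total_pack_years += packs_per_day * years
    average_packs = total_pack_years / total_years if total_years else 0.0
    # Pack-years = average intensity x total duration.
    return total_years, average_packs, average_packs * total_years
```

For example, a participant interviewed at 60 who smoked 20 commercial cigarettes a day from 20 to 30 and 10 a day from 35 to 60 contributes 10 + 23 years of active smoking, that is 10 + 11.5 = 21.5 pack-years once the final 2 years are excluded.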

To estimate the effect of various SNPs at different levels of smoking, we categorised the cigarette pack-year variable into 3 categories. The optimal cut-off point for categorization was informed through multiple rigorous modelling approaches.

The first step was to determine the correct functional form of pack-years using dose-response curves. For this, an outcome model with pack-years entered in linear form was first fitted following the guidelines proposed by Leffondre et al. 2002 (386). Subsequently, I fitted multiple logistic regression models, each with pack-years in a restricted cubic spline functional form determined by knots at various percentiles of its distribution (5, 50 and 95; 10, 50 and 90; 25, 50 and 75; 5, 25 and 75; as well as the modified knot positions recommended by Harrell) (389). Next, among these spline models, the best-fitting model was chosen by comparing Akaike's information criterion (AIC) values (390). The model with knot positions at the 5th, 50th and 95th percentiles had the lowest AIC value and was deemed the best fit. Subsequently, using a likelihood ratio test, the fit of this model was compared with that of the linear model, under the assumption that the linear model was nested within the model with spline parameters. The spline model had a superior fit. Using this model with spline parameters, the shape of the dose-response curve between pack-years and the SCCHN outcome was constructed and determined to be non-linear (Manuscript II, Supplemental material 2). The curve indicated that the risk for the outcome increased sharply up to approximately 70 pack-years, beyond which the risk plateaued. This informed us that the risk point (optimal cut-off) would lie somewhere between >0 and 70 pack-years.
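For reference, the two model-comparison criteria used in this step can be written in their standard form. The notation is an assumption introduced here: k is the number of estimated parameters and L-hat the maximized likelihood of a model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}

\mathrm{LR} = 2\left(\ln\hat{L}_{\text{spline}} - \ln\hat{L}_{\text{linear}}\right)
\;\sim\; \chi^{2}_{\,k_{\text{spline}} - k_{\text{linear}}}
```

With a three-knot restricted cubic spline, pack-years enters the model through two parameters (one linear and one non-linear term), so the comparison against the linear model would be expected to have one degree of freedom.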

In the second step, a parametric outcome-based approach, developed to identify the optimal cut-off for continuous covariates (including those with a non-linear functional form) with respect to a binary outcome, was used to identify the optimal cut-point among smokers (391). This approach a) maximized the difference in risk between participants in the two outcome groups, and b) applied a Bonferroni correction at alpha=5% (to guard against inflation of the Type 1 error in the identified cut-point due to multiple comparisons across the possible cut-points in the range of >0 to 70 pack-years). The optimal cut-off was identified to be at 32 pack-years (defined as smoking 32 packs of commercial cigarettes per day for a year, or 16 packs/day for 2 years, or 8 packs/day for 4 years, and so on).

4.9.4 Covariates used as potential confounders

It has been recommended that, when assessing interactive effects between two variables, all measured potential confounders for the relation between each exposure variable (i.e., genetic variants and smoking) and the outcome (SCCHN) must be present in the full confounder set. Variables considered as confounders for estimating the total effect of genetic variants on health outcomes are usually limited to those that address population stratification (biased association between genetic variant and outcome due to heterogeneous ethnicity/population sub-structure), SNPs in linkage disequilibrium (LD), and sex. However, many enzymes coded by the genes carrying the SNPs considered in this study are induced by polycyclic aromatic hydrocarbons (CYP1A1), nicotine (CYP2A6, CYP2E1) and ethanol (CYP2E1) found in pollutants, occupational exposures, diet, tobacco smoke and alcohol, among others. The SNPs under study are in effect noisy proxies for the enzymes they code for. It can therefore be argued that the sources of these exposures may be confounders for the relation between SNPs and SCCHN. Hence, to rule out the possibility of confounding by these exposures, we considered ethnicity, SNPs in LD (e.g., CYP1A1*2A and *2C), age, sex, alcohol (ethanol) and education (SEP proxy for occupation and diet, as information on them was not available) as potential confounders for the respective SNP-SCCHN associations, as depicted in Appendix VIII, DAGs figures xxxx to XXXX. DAGs were constructed using DAGitty software version 1.1 (392, 393). To mitigate confounding by ethnicity (population stratification), all analyses were restricted to Caucasians.

The covariates included in the final set of confounders for any or all gene-environment interaction models in this study are described below.

Alcohol consumption

The frequency of ethanol consumption (average amount of ethanol in ml consumed per day) was used as the measure of alcohol consumption. This measure was derived from detailed information on wine, beer/cider, hard liquor, aperitif, or other alcoholic beverages consumed by the participants, collected using a method similar to that described in sub-section 4.9.3.2. Each beverage was converted to ethanol equivalents (10% ethanol in wine and aperitif, 5% in beer/cider, and 50% in hard liquor) (78). The frequency for each stable period was converted to millilitres of ethanol consumed per day for that period. This information was used to calculate the total duration and total frequency of ethanol in ml consumed over the lifetime, from which the average amount of ethanol consumed per day in ml was calculated. Similar to tobacco pack-years, the correct functional form of this ethanol frequency variable was determined (by comparing the fit of linear and restricted cubic spline models and fitting dose-response curves) to be non-linear (figure XXX). The two spline parameters in continuous form were used to represent the frequency of ethanol consumed per day.
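A minimal Python sketch of the ethanol conversion follows. The data layout (one dictionary of beverage volumes per stable period), the beverage labels and the function name are hypothetical; only the ethanol fractions come from the text, and "other alcoholic beverages" are left out because no conversion factor is given for them.

```python
# Ethanol fraction by beverage, as stated in the text.
ETHANOL_FRACTION = {
    "wine": 0.10,
    "aperitif": 0.10,
    "beer_cider": 0.05,
    "hard_liquor": 0.50,
}


def average_ethanol_ml_per_day(periods):
    """Duration-weighted average ml of ethanol per day over drinking periods.

    periods: iterable of (start_age, end_age, {beverage: ml_per_day}),
             one entry per stable period of consumption.
    """
    total_years = 0.0
    weighted_ethanol = 0.0
    for start_age, end_age, beverages in periods:
        years = end_age - start_age
        if years <= 0:
            continue
        # Convert each beverage volume to ml of ethanol for this period.
        ethanol_per_day = sum(
            ml * ETHANOL_FRACTION[name] for name, ml in beverages.items()
        )
        total_years += years
        weighted_ethanol += ethanol_per_day * years
    return weighted_ethanol / total_years if total_years else 0.0
```

For instance, 100 ml of wine plus 200 ml of beer a day corresponds to 10 + 10 = 20 ml of ethanol a day for that period.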

Socioeconomic position – Education

Socioeconomic position (SEP) is a determinant of tobacco smoking and a distal risk factor for SCCHN. Detailed information regarding education was collected from each participant (please refer to the questionnaire page…. Appendix… ). For this specific analysis, we used the number of years of formal education as the measure of SEP, used as a continuous variable in its linear functional form.

HPV status

As described in sub-section 4.6.4, HPV status was recorded for 35 HPV types. Based on their oncogenic potential, these types were assigned to hierarchical categories: 1) HPV 16: all participants positive for HPV 16, alone or in combination with other types (coded 3); 2) High-risk HPV types: all participants positive for high-risk HPV types other than HPV 16, i.e., HPV 18, 31, 33, 35, 39, 51 (coded 2); 3) Low-risk HPV types: all other participants positive for any remaining low-risk HPV types (coded 1); 4) HPV-negative: participants in whom no HPV type was detected (394, 395).
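The hierarchy above amounts to assigning each participant the highest-ranking category among the HPV types detected. A minimal sketch, assuming a hypothetical function name and assuming that HPV-negative participants are coded 0 (the text does not state their code):

```python
# High-risk types other than HPV 16, as listed in the text.
OTHER_HIGH_RISK_TYPES = {18, 31, 33, 35, 39, 51}


def hpv_category(detected_types) -> int:
    """Map the set of detected HPV types to the hierarchical category code."""
    types = set(detected_types)
    if 16 in types:
        return 3  # HPV 16, alone or with other types
    if types & OTHER_HIGH_RISK_TYPES:
        return 2  # another high-risk type, no HPV 16
    if types:
        return 1  # only low-risk types detected
    return 0  # HPV-negative (code 0 is an assumption, see lead-in)
```

Testing the categories from highest to lowest risk is what makes the coding hierarchical: a participant positive for both HPV 16 and a low-risk type is coded 3, not 1.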

4.10   Measures- Manuscript III

Manuscript III aimed to estimate the effects of CYP2A6*2 and ADH1B*2 on SCCHN through interactive and mediating pathways involving smoking and alcohol intensities, respectively. Hence, the dependent variable (SCCHN), exposures (CYP2A6*2 and ADH1B*2) and associated potential confounders are described below.

4.10.1    Dependent (outcome) variable – SCCHN

The dependent variable was SCCHN as described in sub-section 4.9.1.

4.10.2    Independent (main exposure) variables – CYP2A6*2 and ADH1B*2

In this study, CYP2A6*2 was genotyped as TT, AA and AT (A = minor allele). Relative to carriers of this allele (AT or AA genotype), non-carriers (TT genotype) are documented to smoke with

4.10.3    Mediators – Intensity of smoking and alcohol consumption

Among the various dimensions of smoking and alcohol consumption behaviour, CYP2A6*2 is strongly associated with the intensity of smoking, and ADH1B*2 with the intensity of alcohol consumption (241, 396). Hence, we used the intensity measures of these behaviours as mediators.

Details of the smoking data collection are described in sub-section 4.9.3. All tobacco types were converted to a commercial cigarette equivalent based on their nicotine content (1/9 cigar = 1/3.5 pipe = 1/2 hand-rolled cigarette = 1 commercial cigarette) (50). From the total duration and frequency of commercial cigarettes used, we calculated the average number of commercial cigarettes smoked per day over the lifetime. Using techniques described in sub-section 4.9.3, the

The data collection on alcohol consumption as well as the creation of an intensity measure of ethanol consumption were described in sub-section 4.9.4. Using a technique similar to the one employed for the categorization of smoking intensity, the optimal cut-off point to categorize the average amount of ethanol in millilitres consumed per day over the lifetime was identified to be at 25ml of ethanol. The final intensity measure for alcohol was represented by a binary variable: mild drinkers (coded 0): participants who consumed up to 25ml of ethanol per day and heavy drinkers (coded 1): participants who consumed more than 25ml of ethanol per day considered as

4.10.4 Covariates used as potential confounders

Manuscript III involved analyses of mediation and interaction based on the counterfactual causal framework. For the estimation and causal interpretation of effects in mediation studies using the counterfactual framework, four no-confounding assumptions are required along with correct model specification (335): there is no unmeasured confounding of the effects of (i) the genetic exposure on SCCHN, (ii) the genetic exposure on the associated mediating risk behaviour, and (iii) the mediating risk behaviour on SCCHN, and (iv) none of the mediating risk behaviour-SCCHN confounders is affected by the associated genetic exposure. We addressed (i) and (ii) by restricting our analysis to Caucasians, thus mitigating confounding due to population stratification (397). For (iii), we adjusted for potential confounders of the relationship between risk behaviours and SCCHN. For the smoking intensity-SCCHN association, we identified duration and time since cessation of smoking (continuous, mean centred, current and non-smokers recoded to zero), and intensity of alcohol (continuous, adjusted for restricted cubic spline) as confounders. For the alcohol intensity-SCCHN association, time since stoppage of alcohol use (continuous, mean centred, current and non-users recoded to zero) and pack-years of commercial cigarette equivalence (as described previously in sub-section 4.9.2) were identified. Additionally, we adjusted for age (continuous), sex, number of years of education (continuous) and HPV risk types for both associations. These variables are not known to be affected by the associated genetic exposures, which may help to address the fourth no-confounding assumption. Please refer to sub-section 4.9.4 for details on these confounders.

4.11 Statistical analysis

This section presents the details of general and specific statistical techniques used to analyse the data for each manuscript.

4.11.1 General considerations

Descriptive statistical analysis was performed to explore the distribution of the variables used in the study among cases and controls. T-tests were used to compare the means of continuous variables between the two groups, while chi-square tests based on cross-tabulations were used to compare categorical data (398). For Manuscripts II and III, which involved genetic variants, deviations from the Hardy-Weinberg equilibrium were assessed among the control population using chi-square tests. Minor allele frequencies were estimated among controls.
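As an illustration of the Hardy-Weinberg check among controls, the following Python sketch computes the one-degree-of-freedom chi-square statistic from genotype counts for a biallelic SNP. The function name and argument layout are hypothetical; the p-value uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Return (chi-square, p-value, minor allele frequency) from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 df
    return chi2, p_value, min(p, q)
```

Genotype counts of 81, 18 and 1 match the Hardy-Weinberg expectation for allele frequencies 0.9 and 0.1, so the statistic is essentially zero; counts of 50, 0 and 50 (no heterozygotes) give a statistic of 100 and a vanishingly small p-value.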

The primary dependent/outcome variable investigated in each manuscript was binary. Furthermore, the exposure models used to create inverse probability weights for the marginal structural models in Manuscript I, and the mediator models fitted in Manuscript III, had a binary dependent variable. Hence, all manuscripts relied on binary logistic regression models to calculate association or effect estimates.


Binary logistic regression

Binary logistic regression is a type of generalized linear model used to estimate the probability of a binary response (dependent) variable as a function of any number of independent predictor variables by fitting the data to a logistic curve (390). If P is the probability of a disease occurring and 1-P is the probability of the disease not occurring, then P/(1-P) gives the odds of the disease occurring. A log transformation allows the odds of a disease to be expressed as a linear function of the independent variables as:
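The relationship described above is commonly written in the following standard form, where the intercept and coefficients (beta) and the predictors X_1 to X_k are assumed notation introduced here:

```latex
\ln\!\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k
```

Exponentiating a coefficient gives the odds ratio associated with a one-unit increase in the corresponding predictor, holding the others fixed.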

A01* = downward mobility = unexposed in CH and exposed in EAH, irrespective of exposure in LAH, and A11* = exposed to disadvantageous SEP in both CH and EAH irrespective of exposure status in LAH.

Where A*10 = being exposed in EAH and unexposed in LAH, irrespective of exposure status in CH, A*01 = unexposed in EAH and exposed in LAH, irrespective of exposure in CH, and A*11 = exposed in both EAH and LAH irrespective of exposure status in CH.

The reference category for each mobility pattern is being unexposed at both time periods (no mobility), irrespective of exposure status in the other time period which is not included in a specific mobility testing.
