Inclusion and ethics statement
We included researchers from the iPSYCH biobank and the PGC, who played a role in study design. This research was not restricted or prohibited in the setting of any of the included researchers. All studies were approved by local instituational research boards and ethics review committees. MVP was approved by the Veterans Affairs central instituational research board. We do not believe our results will result in stigmatization, incrimination, discrimination or personal risk to participants.
Cohorts
We used data release version 4 of the MVP. Linked and de-identified EHRs were queried using the Veterans Affairs Informatics and Computing Infrastructure to identify individuals with International Classification of Disease (ICD) codes for cannabis dependence or cannabis abuse (together, CanUD) (Supplementary Tables 2 and 3). The range of diagnosis dates was between May 1992 and December 2019. Two classifications were investigated: (1) cases identified by at least two separate outpatient visits or any number of inpatient visits to a US Veterans Affairs (VA) medical center for CanUD and (2) cases identified by at least one inpatient or outpatient visit for CanUD. Genetic correlation analysis indicated that these traits were almost identical from a genetic perspective (rG = 0.99) and SNP-based heritability (h2) was not statistically different (definition 1, h2 = 0.075, s.e. 0.0053, z = 14.1; definition 2, h2 = 0.087, s.e. 0.0062, z = 14.0; Pdiff = 0.14), so case definition per the second classification was retained for further analysis (that is, at least one inpatient or outpatient visit). All individuals diagnosed under the first disease definition were also diagnosed under the second more inclusive definition. Controls were defined as individuals without any VA EHR ICD codes for cannabis dependence, cannabis abuse or cannabis use (cannabis use codes included in ICD-9: 305.29 and included in ICD-10: F12.90, F12.920, F12.921, F12.922, F12.929, F12.93, F12.950, F12.951, F12.959, F12.980, F12.988 and F12.99). The PGC cohort was as previously described and was made up of 16 cohorts with varying phenotype definitions and ascertainments10. A leave-one-out analysis was performed to remove the iPSYCH1 sample, leaving 18,370 cases and 304,838 controls for European and African ancestries in the remaining PGC/deCODE sumstats. An updated expanded iPSYCH2 cohort was then added via meta-analysis (4,733 cases and 95,657 controls, all EUR). We also included samples from MGB Biobank (456 cases and 24,088 controls, all EUR) and new data from the Yale–Penn cohort46 beyond the individuals already included in the PGC study (an additional 310 cases and 1,471 controls for EUR, and 271 cases and 666 controls for AFR). Table 1 gives numbers for each cohort.
MVP genotyping, imputation, quality control, and GWAS and meta-analysis
Genotyping and imputation of MVP participants has been described previously11. Briefly, a customized Affymetrix Axiom Array was used for genotyping. MVP genotype data for biallelic SNPs were imputed using Minimac4 and a reference panel from the African Genome Resources panel by the Sanger Institute. Indels and complex variants were imputed independently using the 1000 Genomes (1KG) phase 3 panel and merged in an approach similar to that employed by the UK Biobank. Designation of broad ancestries was based on genetic assignment with comparison to 1KG reference panels47.
MVP GWAS was conducted using logistic regression in PLINK 2.0 using the first ten positive controls, sex and age as covariates. Variants were excluded if call missingness in the best-guess genotype exceeded 20%. Alleles with MAF <0.1% were excluded in EUR, AFR and AMR. Alleles with MAF <1% were removed from EAS due to smaller sample size. The MVP data represented the largest and most diverse cohort with 22,260 cases and 423,587 controls (EUR), 14,946 cases and 97,580 controls (AFR), 2,774 cases and 35,515 controls (AMR) and 194 cases and 6,649 controls (EAS) (Table 1). GWAS meta-analyses in the PGC datasets of the deCODE and PGC samples were conducted as previously described, although a leave-one-out analysis was conducted to remove data from iPSYCH1 so that a larger cohort could be independently analyzed10. This leave-one-out PGC meta-analysis contained 14,522 EUR cases and 298,941 controls and 3,848 AFR cases with 5,897 controls. This study includes new genotypes from iPSYCH (referred to as iPSYCH2), and all iPSYCH data (iPSYCH1 + 2) has been reprocessed. Pre-imputation quality control and imputation were performed on genotypes from the full set of genotyped individuals for iPSYCH1 and iPSYCH2 separately, using standard procedures for GWAS data. The iPSYCH1 samples were genotyped in 23 genotyping waves and thus additional steps were taken to eliminate potential batch effects. Only variants present in more than 20 waves and with no significant association with wave status were retained. Imputation was done using the pre-phasing/imputation stepwise approach implemented in EAGLE v2.3.548 and Minimac49, using the Haplotype Reference Consortium50 panel v1.0. GWAS of 4,733 EUR cases and 95,657 controls and was done on a merged set of best-guess genotypes with MAF >0.01 and imputation info score >0.8 (in both iPSYCH1 and iPSYCH2) using logistic regression with appropriate covariates (age, sex, psychiatric diagnoses (attention deficit hyperactivity disorder, autism spectrum disorder, SCZ, bipolar disorder and MDD), first ten positive controls and iPSYCH cohort of origin). A new Yale–Penn tranche was analyzed using PLINK 1.9 in unrelated individuals not previously included in any other GWAS or meta-analysis. This contributed 310 cases and 1,471 controls (EUR) and 271 cases and 666 controls (AFR). Finally, MGH Partners BioBank51 contributed 456 cases and 24,088 controls (EUR).
EUR cohorts were combined in a GWAS meta-analysis (Table 1). For AFR, we performed meta-analysis between the MVP, PGC and Yale–Penn cohorts. For AMR and EAS, only MVP included data so no meta-analysis was possible within these ancestries. GWAS meta-analyses were conducted using inverse variance weighing in METAL52 for both EUR and AFR. For within-ancestry meta-analyses, there were 42,281 EUR cases with 843,744 controls, and 19,065 AFR cases with 104,143 controls. The multi-ancestry meta-analysis53 included 1,044,620 total participants of EUR, AFR, AMR and EAS ancestries. Sex-stratified analysis was conducted in the only cohort available individual GWAS for the analysis—the MVP (Supplementary Fig. 7).
LDSC and SNP-based heritability
LDSC was used to calculate SNP-based heritability on the liability scale, using a lifetime population prevalence54 of 2% and a sample prevalence of 5% for EUR, 13.2% for AFR, and 7.2% for AMR within the MVP55. We used the lifetime population prevalence reported in the PGC/deCODE/iPSYCH1 cannabis paper10 for comparability. Typically, calculating SNP-based heritability depends on reliable reference ancestry to account for nonindependence of some variance due to LD. This is easily done for EUR, but admixed non-European ancestries pose a statistical challenge. Covariate LDSC12 uses sample covariates such as those derived from principal components analysis (a dimension reduction technique that produces eigenvalues for each variant) carried out in the study sample to adjust LD scores to enable calculation of SNP heritability in each ancestry using sample-specific LD scores. LDSC as implemented by the Complex Traits Genomics Virtual Lab56 was used to estimate genetic correlations57 to identify common genetic architecture across all 1,335 traits available for comparison. Additionally, LDSC was used to compare genetic correlations between CanUD and cannabis use (from a previously published study18).
Cross-ancestry genetic correlation
POPCORN19 was used to generate cross-ancestry covariance scores using 1KG reference panels from EUR and AFR. This method was applied to calculate genetic correlations between the AFR CanUD generated in this study against traits from Fig. 2 that had available allele frequencies and n count.
Mendelian randomization
Several traits with significant genetic correlation with CanUD and high public health importance were selected for follow-up MR analysis in EUR ancestry datasets (‘type of physical activity in the last four weeks = none’, multi-site chronic pain, Alzheimer’s disease, SCZ and lung cancer). These traits were first tested for polygenic overlap with CanUD; one trait did not survive this step (Alzheimer’s disease), and the remaining three traits moved on to MR analysis. MR was conducted using the TwoSampleMR package in R Studio58. We conducted MR Egger analysis to test for the effect of horizontal pleiotropy.
Conditional analysis
mtCOJO was carried out to study possible confounding of smoking for CanUD. GWAS summary statistics for smoking initiation and cigarettes per day from the GWAS and Sequencing Consortium of Alcohol and Nicotine use Phase 2 study of EUR ancestry were used for smoking20. The CanUD (target trait) GWAS data were conditioned on smoking initiation and cigarettes per day (covariate traits) GWAS data individually using the Genome-wide Complex Trait Analysis mtCOJO utility59. Output summary statistics from conditioned CanUD was then used to re-test the MR relationship between CanUD and lung cancer.
Transcriptome-wide association study
Transcriptome-wide association studies (TWAS) and FUSION60 software were employed to use variant–gene expression associations to enrich GWAS variant findings for genes involved with CanUD. The TWAS models were trained using prior published evidence for gene expression from adult brain cortex33 (1,695 samples; 14,750 models) and fetal brain frontal cortex34 (201 samples; 3,784 genes), with each gene having estimated positive cis-heritability at nominal P < 0.01 and the corresponding predictive model achieving five-fold cross-validation R2 > 0.01 at a nominal P < 0.01. Using a weighted burden test60, we generated a Wald-type Z score for each gene–trait association, with transcriptome-wide significance defined at P < 2.5 × 10−6, the Bonferroni-corrected significance level across 20,000 tests. To ensure proper alignment to the genetic ancestry of the eQTL and GWAS cohorts, we use a reference panel from EUR individuals in 1KG61. The TWAS samples did not include any ascertainment for CanUD in the brain tissue used for analysis.
For every transcriptome-wide significant gene–trait association, we conducted a permutation test by shuffling the SNP-gene weights in the prediction model 10,000 times60,62. This permutation generates a null distribution to compare to the original TWAS Z score to quantify the significance of the expression–trait associations conditional on the SNP–trait effects at the locus60. For genes that passed both transcriptome-wide significance and the permutation test at P < 0.05 within 1 Mb of another significant gene, we conducted probabilistic gene-level fine-mapping using FOCUS to estimate 90% credible sets of genes that explain the trait association signal at a locus63. We conducted FOCUS fine-mapping across genes detected by models trained in either adult or fetal brain tissue.
Partitioned SNP-based heritability estimation
To assess differences in enrichment of SNP-based trait heritability in the regions around eQTLs of adult and fetal expression, we employed stratified LDSC61. Genes with at least one significant eQTL were designated ‘eGenes’. We generated LD score annotations for 500-bp windows around lead eQTLs of eGenes from Genotype-Tissue Expression brain cortex (n = 205) and fetal brain frontal cortex (n = 201). We used Genotype-Tissue Expression to ensure similar sample sizes. We define the enrichment of SNP-based heritability as the proportion of heritability explained by a set of SNPs in the annotation divided by the proportion of all SNPs included in the annotation.
gSEM
gSEM64 was used to perform EFA and CFA of CanUD and 14 additional traits of interest that were genetically correlated. For EFA, factor structures composed of one to ten factors were examined. EFA model fit was evaluated by the amount of cumulative variance explained by the overall factor structure, the SS loadings (SS loading ≥1) for each included factor and the proportion of explained variance accounted for by each of the individual factors (that is, ≥10%). Traits with EFA factor loadings ≥0.20 were evaluated for optimal CFA model fit determined by conventional fit indices64. CFA models were estimated using diagonally weighted least squares estimation and a smoothed genetic covariance matrix. The 1KG phase 3 EUR reference panel was used for LD calculation47.
Multi-trait analysis of GWAS
We applied the MTAG method65 for the joint analysis of the genome-wide association statistics of CanUD (EUR meta-analysis from the present study), AUD (n = 167,721)66 and nicotine dependence (based on the FTND; n = 58,000)67. First, SNPs that were duplicated, had MAF ≤0.01 or had strand ambiguity were removed from the GWAS datasets. Of the 14,768,834 SNPs available from the GWAS meta-analysis of CanUD, 5,894,946 SNPs remained for the MTAG analysis after quality control. After the MTAG analysis with AUD and nicotine dependence, 3,540,940 SNPs remained. Significant variants were defined at P < 5 × 10−8.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.