Introduction

EraSOR is a Python program that removes the bias introduced by sample overlap between the base GWAS data and the target genotype data.

To run EraSOR, you will need LD scores and the GWAS summary statistics obtained from both the base and target samples. EraSOR calculates the bivariate LD score intercepts and adjusts the base summary statistics accordingly.

Download EraSOR

You can either download the source script directly from the GitLab release page or clone this repository from the command line (requires git to be installed):

git clone https://gitlab.com/choishingwan/EraSOR.git

Warning

This document is written for Unix-like operating systems (i.e. not Windows). In theory, EraSOR should also work on Windows, but I am not familiar with running Python software on Windows.

Dependency

EraSOR is a Python 3 program and requires Python 3 to be installed. You can find instructions here.

You will also need to install the following Python packages for EraSOR to work:

  • numpy
  • pandas
  • scipy

You can install them with the following command (assuming python3 is in your environment):

python3 -m pip install numpy pandas scipy
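
If you want to confirm that the installation worked, a minimal sketch such as the one below can report which of the three packages the current interpreter cannot find. The function name missing_packages is only illustrative and is not part of EraSOR.

```python
import importlib.util

# Packages listed in the Dependency section above.
REQUIRED_PACKAGES = ("numpy", "pandas", "scipy")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that the current Python interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Run it with the same python3 that you intend to use for EraSOR, since each interpreter has its own set of installed packages.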

Tips

You can check whether python3 is in your path by typing which python3. This command returns a path if python3 is found.

You can use pyenv to manage multiple versions of Python.

Before you start

Before you start, please make sure you have the following:

  1. GWAS summary statistics from your base data

    • Sample size information of your GWAS (or a column containing sample size information)
    • Must contain the following columns:
      1. SNP ID
      2. Effective allele
      3. Effect size (either in \(\beta\), Odds Ratio, or logOR)
      4. P-value
  2. GWAS summary statistics from your target data

    • Your target data should have the same phenotype as the base
    • Must contain the following columns:
      1. SNP ID
      2. Effective allele
      3. Effect size (either in \(\beta\), Odds Ratio, or logOR)
      4. P-value
  3. LD Score calculated using samples representative of your target population

    • You can also generate a separate copy of the LD scores with the MHC region removed (or containing only SNPs that were genotyped in your target data)
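
The column requirements above can be checked before running EraSOR. The following is a rough sketch, not part of EraSOR itself: it assumes the summary statistics file is whitespace- or tab-delimited with a single header line, and the helper names read_header and missing_columns are hypothetical.

```python
import gzip


def read_header(path):
    """Return the column names from the first line of a whitespace-delimited file.

    Files ending in .gz are opened with gzip (an assumption about naming).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.readline().split()


def missing_columns(path, required):
    """Return the required column names that are absent from the file header."""
    header = set(read_header(path))
    return [name for name in required if name not in header]
```

For example, missing_columns("my.sumstat.txt", ["SNP", "A1", "BETA", "P"]) returns an empty list when all four columns are present. The file name and column names here are placeholders; substitute the ones used in your own data.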

Note

In theory, EraSOR might also work for cross-trait analyses. However, we have not performed any simulations for this case and are therefore uncertain about the potential bias of EraSOR in such a scenario.

Quick start

Tips

You can see all available parameters by typing

EraSOR.py --help

Assuming you have the following files

  • Base summary statistics

    • Name: phenotype.sumstat.txt
    • SNP ID Column: SNP
    • Effect Size column: BETA
    • Is beta?: true
    • Effective allele column: A1
    • Non-effective allele column: A2
    • P-value column: p
    • Sample size column: N
  • Target summary statistics

    • Name: data.sumstat
    • SNP ID Column: ID
    • Effect Size column: OR
    • Is beta?: false
    • Effective allele column: Effective
    • Non-effective allele column: NonEffective
    • P-value column: P-value
    • Sample size column: OBS_CT
  • LD scores
    • Assuming they are separated by chromosome, with the following format (# represents the chromosome number):
      • baseline-#.l2.ldscore.gz
      • baseline-#.l2.M
      • baseline-#.l2.M_5_50
    • And the weight scores in the following format:
      • weight-#.l2.ldscore.gz
      • weight-#.l2.M
      • weight-#.l2.M_5_50
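
Before running EraSOR, it may help to confirm that every chromosome-separated LD score file is present. The sketch below builds the file names from the patterns listed above and reports the ones that are missing. It assumes chromosomes 1 to 22, which is not stated in this document, and the function names are hypothetical.

```python
from pathlib import Path

# File suffixes taken from the LD score formats listed above.
SUFFIXES = (".l2.ldscore.gz", ".l2.M", ".l2.M_5_50")


def expected_ld_files(prefix, chromosomes=range(1, 23)):
    """Build the expected file names for a prefix such as 'baseline-'.

    The default of chromosomes 1-22 is an assumption; adjust as needed.
    """
    return [f"{prefix}{chrom}{suffix}" for chrom in chromosomes for suffix in SUFFIXES]


def missing_ld_files(prefix, chromosomes=range(1, 23)):
    """Return the expected LD score files that do not exist on disk."""
    return [name for name in expected_ld_files(prefix, chromosomes)
            if not Path(name).is_file()]
```

Calling missing_ld_files("baseline-") and missing_ld_files("weight-") from the directory that holds the LD scores should each return an empty list when nothing is missing.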

You can run EraSOR with the following command:

python EraSOR.py \
    --base phenotype.sumstat.txt \
    --base-snp SNP \
    --base-signed-sumstats BETA,0 \
    --base-a1 A1 \
    --base-a2 A2 \
    --base-p p \
    --base-N-col N \
    --target data.sumstat \
    --target-snp ID \
    --target-signed-sumstats OR,1 \
    --target-a1 Effective \
    --target-a2 NonEffective \
    --target-p P-value \
    --target-N-col OBS_CT \
    --ref-ld-chr baseline- \
    --w-ld-chr weight- \
    --out EraSOR.adjusted

This will generate two files:

  1. EraSOR.adjusted.assoc.gz

    • This file contains the adjusted summary statistics that can be used for downstream polygenic risk score analyses
    • The Z column contains the adjusted effect size and the P column contains the adjusted p-value
  2. EraSOR.adjusted.meta

    • Contains the information used for the adjustment, including the heritability estimates of the base and target GWAS, the intercepts, and the value of the adjustment
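
As an illustration of how the adjusted file could be inspected, the sketch below reads the Z and P columns from a gzip-compressed output file. It assumes the file is whitespace-delimited with a single header line, which this document does not state explicitly; the function name read_adjusted is hypothetical.

```python
import gzip


def read_adjusted(path, columns=("Z", "P")):
    """Yield one dict per row, keeping only the requested columns.

    Values are returned as strings so that entries such as NA are preserved.
    Raises ValueError if a requested column is not in the header.
    """
    with gzip.open(path, "rt") as handle:
        header = handle.readline().split()
        index = {name: header.index(name) for name in columns}
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                yield {name: fields[i] for name, i in index.items()}
```

For example, next(read_adjusted("EraSOR.adjusted.assoc.gz")) returns the Z and P values of the first SNP in the file.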

Note

Most parameters of EraSOR are identical to those of LDSC, with an added prefix of --base- or --target- to indicate whether a parameter applies to the base or the target GWAS.

If the column names of the base and target GWAS are identical, you can provide the column name once with the --base-* parameter and then use --same.