Emperical Bayes footprint detection#

Overview#

To call footprints jointly considering hundreds-to-thousands of datasets, we developed an Empirical Bayes approach that computes the posterior probability that a particular region of the genome is occupied by a motif.

\[p(\theta_+|X, Y) = \frac{P(\theta_+) \mathcal{L}(X,Y|\theta_+)}{ p(\theta_+) \mathcal{L}(X,Y|\theta_+) + (1-P(\theta_+)) \mathcal{L}(X,Y|\theta_-)}\]

where \(X\) corresponds to the observed cleavage rates and \(Y\) represents the expected cleavage rates. The footprint prior, \(P(\theta_+)\), is the number of datasets that a nucleotide is found within a footprint (FDR<0.05; user-defined) divided by the number datasets in which that nucleotide lays within a DNase I hypersensitive site (as defined by hotspot2). The likelihood function corresponding an unoccupied nucleotide, \(\mathcal{L}(X,Y|\theta_-)\), is the product of individual negative binomial probabilities corresponding to the dispersion model parameterized by the expected cleavage rate.

\[\begin{split}\mathcal{L}(X,Y|\theta_-) = \sum_{j=i-3}^{i+3} \left( \begin{array}{c} x_j + r_j -1 \\ x_j \end{array} \right) (1-p_j)^{r_j} {p_j}^{x_j}\end{split}\]
\begin{gather*} \theta_- = \Phi_{NB}(y_j) = (\mu_j, r_j)\\ p_j = \frac{r_j}{r_j + \mu_j} \end{gather*}

Here, \(\Phi_{NB}(y_j)\) is a function that returns the parameters \((\mu, r)\) of the fitted negative binomial at the expected cleavage rate \(y_j\) (see Building a dispersion model).

The likelihood function corresponding to an occupied nucleotide, \(\mathcal{L}(X,Y │θ_+)\), is determined similar to the unoccupied case after scaling the expected cleavage rate by the expected depletion of cleavage at occupied nucleotides:

\[\theta_+ = \Phi_{NB}(y_j \lambda_j) = (\mu^{+}_j, r^{+}_j)\]

where \(\lambda_j\) is the expected depletion of cleavage at an occupied nucleotide. We determine \(\lambda_j\) by considering all datasets with an FDR 5% footprint at position \(j\). First, for each dataset we fit a Beta distribution to the ratio of observed over expected cleavages (depletion ratio) at all FDR 5% footprints identified within individual datasets (capping the ratio values at 1.0). Then, for each nucleotide we re-estimate the depletion ratio by updating the Beta distribution (\(\alpha' = \alpha + x_i\), \(\beta’ = \beta + (y_i–x_i)\)). These updated parameters are used to generate maximum a posteriori (MAP) estimates of the depletion ratio (\(\mu_{MAP}\)) and expected variation of this ratio (\(\sigma^2_{MAP}\)) at each nucleotide. A per-nucleotide footprint depletion estimate is then finally calculated from the average of the MAP mean estimates weighted by the inverse of the MAP standard deviation considering all datasets with an identified footprint at that nucleotide.

Step-by-step guide#

Step 1: Call footprints in individual datasets#

To build the priors and likelihood functions, footprints need to first be identified in each individual dataset. Please see command_detect for this task.

Step 2: Compute the per-dataset footprint priors#

Used the command command_learn_bm to compute the hyperparameters of the Beta prior.

ftd learn_beta \
        --fdr-cutoff 0.05 --exp-cutoff 10 \
        interval.bedgraph

Note

This command is run for every dataset before continuing to Step 2.

Step 2: Create a metadata file#

Next, we create a metadata file that contains the pertinent information for each dataset. The format of this file is tab-delimited and must contain the the following columns in a header (additional columns are permissable). Lines are ignore when # is the first character.

#

Column

Description

1

id

Dataset identifier

2

dm_file

Dispersion model filepath (JSON file)

3

tabix_file

Output file from command_detect Note: must be gzipped with tabix index

4

beta_a

Beta distribution parameters Output from learn_beta

5

beta_b

Step 3: Compute posterior probabilites#

The posterior footprint probabilities are called using the command command_posterior. This command takes both the metadata file created above and a BED-formated file containing the genomic regions where footprint detection will occur. Typically, the input regions are defined by merging the DNase I hotspots across all samples.

ftd posterior sample_data.txt intervals.bed

Example output:

This script writes a bedgrah-like file. Each row consists of an individual nucleotide and columns correspond to datasets (in the same order as the input metdata file).

Note

Because this is a potentially huge operation (millions of DHS vs. hundreds of samples), we typicall split the input file (DHSs) into chunks and the parallel process the chunks on a high-performance computing cluster.

cat regions.bed | split -l 5000 -a 4 -d - regions.chunk.

regions.chunk.0000
regions.chunk.0001
regions.chunk.0002
...

Step 4: Retrieve footprints#

Footprints (per dataset) can be retrieved by thresholding on posterior probabilities

cat per-nucleotide.posterior.bedgraph \
    | awk -v OFS="\t" -v col=45 -v thresh=0.01 \ # set column to dataset column
             '$(col) >= -log(thresh) { print $1, $2-3, $3+3; }' \
    | sort-bed --max-mem 8G - \
    | bedops -m - \
> footprints.bed