The reproducibility crisis:
Enhancer-gene maps (Sakaue et al., 2024)

1 University of California, San Francisco
2 Biological and Medical Informatics Graduate Program
3 Bioengineering Graduate Program
| # | Argument name (format) | Descriptions |
|---|---|---|
| 1 | rna (sparse matrix) | A gene-by-cell count matrix from multimodal RNA-seq data. This is a raw count matrix without any normalization. The row names should be the gene names used in the peak.info file. The column names are the cell names which should be the same names used in the cell column of the dataframe specified for meta.data. Sparse matrix format is required. |
| # | Argument name (format) | Descriptions |
|---|---|---|
| 2 | atac (sparse matrix) | A peak-by-cell count matrix from multimodal ATAC-seq data. This is a raw count matrix without any normalization. The row names should be the peak names used in the peak.info file. The column names are the cell names which should be the same names used in rna and the cellcolumn of dataframe specified for meta.data. The matrix may not be binarized while it will be binarized within the function. Sparse matrix format is required. |
| # | Argument name (format) | Descriptions |
|---|---|---|
| 3 | meta.data (dataframe) | A meta data frame for cells (rows are cells, and cell names should be in the column named as “cell”; see below example). Additionally, this text should include covariates to use in the model. Examples include: % mitochondrial reads, log(nUMI), sample, and batch as covariates. Dataframe format is required. |
| # | Argument name (format) | Descriptions |
|---|---|---|
| 5 | covariates (a vector of character) | A vector of character fields that denote the covariates listed in the meta.data. For example, a set of covariates can be: %mitochondrial reads, log_nUMI, sample, and batch. Additionally the user can specify transformations to the covariates such as log transformation on nUMI counts for direct usage in the SCENT algorithm invoking poisson glm. We recommend users to at least use log(number_of_total_RNA_UMI_count_per_cell) as the base model is Poisson regression and we do not include the offset term into the default model. |
| # | Argument name (format) | Descriptions |
|---|---|---|
| 6 | celltypes (character) | User specified naming of the celltype column in the meta.data file. This column should contain the names of the celltypes you want to test in this association analysis. |
| # | Argument name (format) | Descriptions |
|---|---|---|
| 4 | peak.info (dataframe) | A table with two columns indicating which gene-peak pairs you want to test in this chunk (see below example) genes should be in the 1st column and peaks in the 2nd column. We highly recommend splitting gene-peak pairs into many chunks to increase computational efficiency (See Parallelized Jobs Info in Section 2). List(Dataframe) format which is a list of multiple data frames for parallelization is required. * |
README.md
“We usually only select peaks of which the center falls within 500 kb from the target gene (cis analysis).” (Github, 2024)
[1] "ENSG00000243485"
[1] "chr1:817095-817595"
[1] "Sample" "TSSEnrichment" "ReadsInTSS" "ReadsInPromoter"
[5] "ReadsInBlacklist" "PromoterRatio" "PassQC" "NucleosomeRatio"
[9] "nMultiFrags" "nMonoFrags" "nFrags" "nDiFrags"
[13] "BlacklistRatio" "Clusters" "ReadsInPeaks" "FRIP"
library(biomaRt)
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
# These correspond to IFNG, IL2, BCL11B, IL4, and FOXP3
gene_list <- (c("ENSG00000111537", "ENSG00000109471", "ENSG00000127152", "ENSG00000113520", "ENSG00000049768"))
gene_coords <- getBM(
attributes = c("ensembl_gene_id", "chromosome_name", "start_position", "end_position"),
filters = "ensembl_gene_id",
values = gene_list,
mart = ensembl
) |> mutate(chromosome_name = paste0("chr", chromosome_name))
atac_peak_coords <- tibble(peak = colnames(atac)) |>
separate(peak, into = c("chr", "pos"), sep = ":") |>
separate(pos, into = c("start", "end"), sep = "-", convert = TRUE)
distance <- 500000
gene_peak_candidates <- gene_coords |>
rowwise() |>
mutate(peaks = list({
peaks_df <- atac_peak_coords |>
filter(
chr == chromosome_name &
start >= start_position - distance &
start <= start_position + distance
)
if(nrow(peaks_df) == 0) tibble(chr = character(), start = integer(), end = integer()) else peaks_df
}
)) |>
ungroup() |>
unnest(peaks) |>
mutate(peak = paste0(chr, ":", start, "-", end)) |>
dplyr::select(ensembl_gene_id, peak)# A tibble: 6 × 2
ensembl_gene_id peak
<chr> <chr>
1 ENSG00000049768 chrX:48801154-48801654
2 ENSG00000049768 chrX:48801740-48802240
3 ENSG00000049768 chrX:48834850-48835350
4 ENSG00000049768 chrX:48835572-48836072
5 ENSG00000049768 chrX:48848794-48849294
6 ENSG00000049768 chrX:48891008-48891508