生信分析工作室
Bulk RNA-seq
GEO
limma
GO/GSEA
骨质疏松

GSE56815 单核细胞转录组实战(上)

从 GEO 矩阵到差异通路,一份能交付的 R 流程

12 分钟阅读
GSE56815 单核细胞转录组实战(上)

骨质疏松(osteoporosis)是绝经后女性最常见的代谢性骨病。循环单核细胞是破骨细胞的前体,并且释放调节骨代谢的因子,因此被认为是研究骨质疏松机制的最便捷外周窗口之一。GSE56815(HG-U133A 微阵列)一次性测了 80 例白人女性的外周血单核细胞,2×2 设计 —— 高/低骨密度 × 绝经前/后,每格 20 例 —— 非常适合用来演示 bulk 转录组的全套常规分析。本篇先把前 4 步走完:读入与注释、质控可视化、limma 差异分析、GO/GSEA 通路富集。下一篇会接着写 WGCNA、免疫信号打分、机器学习筛 biomarker。

第一步 · 读入矩阵 + 表型构建 + 探针注释

我们直接用 GEOquery 解析本地 series matrix,避免重复下载。表达矩阵自动检测对数与否,然后做 quantile normalization;表型从 Sample_title 文本里正则抽 BMD 与绝经状态;最后用 hgu133a.db 把 Affymetrix 探针映射到 SYMBOL,对重复基因取均值最高的那条探针。

R · 01_preprocess.R
gset <- getGEO(filename = DATA_FILE, AnnotGPL = FALSE, getGPL = FALSE)
expr <- normalizeBetweenArrays(log2(exprs(gset)), method = "quantile")

# phenotype from sample title (e.g. "postmonopausal high BMD")
bmd  <- ifelse(grepl("high BMD", title, TRUE), "High", "Low")
meno <- ifelse(grepl("postm",    title, TRUE), "Post", "Pre")

# Affymetrix probe -> SYMBOL (hgu133a.db), keep best probe per gene
map <- AnnotationDbi::select(hgu133a.db, keys = rownames(expr),
                             columns = "SYMBOL", keytype = "PROBEID")

第二步 · QC 全套:分布、PCA、相关性、聚类

微阵列数据上机批次效应可能很大,QC 必须先看清四件事:

(1)分布是否对齐;(2)PCA 上分组结构强弱;(3)样本间相关性矩阵是否有离群点;(4)聚类树有没有不该聚到一起的样本。下面的图按这四个角度依次给出。

2.1 分布对齐:箱线图 + 密度图

Per-sample expression boxplots after quantile normalization. All 80 samples are aligned, no obvious outlier requires rem
Per-sample expression boxplots after quantile normalization. All 80 samples are aligned, no obvious outlier requires removal.
Per-sample density curves overlap tightly across all four groups, indicating consistent dynamic range.
Per-sample density curves overlap tightly across all four groups, indicating consistent dynamic range.

2.2 PCA:BMD / 绝经 / 联合分组

在 top-5000 高方差基因上做 PCA。需要重点关注的是:BMD 这个我们最关心的变量在前两个主成分上是否能区分。

PCA coloured by BMD. The two clouds overlap heavily — BMD signal is subtle in monocyte transcriptome.
PCA coloured by BMD. The two clouds overlap heavily — BMD signal is subtle in monocyte transcriptome.
PCA coloured by menopausal status. Pre vs Post separates slightly more clearly than BMD does.
PCA coloured by menopausal status. Pre vs Post separates slightly more clearly than BMD does.
PCA coloured by the 2×2 group label.
PCA coloured by the 2×2 group label.

2.3 样本相关性 + 层次聚类

Sample-sample Pearson correlation heatmap with BMD / Menopause annotations. The full matrix is uniformly high (>0.94), n
Sample-sample Pearson correlation heatmap with BMD / Menopause annotations. The full matrix is uniformly high (>0.94), no sample needs to be excluded.
Hierarchical clustering (average linkage). Branches do not split cleanly by BMD — consistent with the PCA.
Hierarchical clustering (average linkage). Branches do not split cleanly by BMD — consistent with the PCA.

第三步 · limma 2×2 差异分析:五个对比一次出

用 ~ 0 + Group 的 cell-means 设计加上 makeContrasts,可以一次拿到五个生物学问题的答案:

R · 03_deg.R
design <- model.matrix(~ 0 + Group, data = pheno)
cont <- makeContrasts(
  BMD_main       = (High_Pre + High_Post)/2 - (Low_Pre + Low_Post)/2,
  Menopause_main = (Low_Post + High_Post)/2 - (Low_Pre + High_Pre)/2,
  Interaction    = (High_Post - Low_Post) - (High_Pre - Low_Pre),
  HighvsLow_Pre  = High_Pre  - Low_Pre,
  HighvsLow_Post = High_Post - Low_Post,
  levels = design)
fit2 <- eBayes(contrasts.fit(lmFit(expr, design), cont), trend = TRUE)

阈值取 adj.P < 0.05 且 |log2FC| > log2(1.2),五个对比的 DEG 数量如下:

3.1 五张火山图

Volcano — BMD main effect (High vs Low, both menopausal states).
Volcano — BMD main effect (High vs Low, both menopausal states).
Volcano — Menopausal main effect (Post vs Pre).
Volcano — Menopausal main effect (Post vs Pre).
Volcano — High vs Low BMD, premenopausal only.
Volcano — High vs Low BMD, premenopausal only.
Volcano — High vs Low BMD, postmenopausal only (by far the largest gene set).
Volcano — High vs Low BMD, postmenopausal only (by far the largest gene set).
Volcano — Interaction (BMD × Menopause). Only 1 gene survives, so the two main effects are largely additive.
Volcano — Interaction (BMD × Menopause). Only 1 gene survives, so the two main effects are largely additive.

3.2 Top BMD DEG 的样本聚类热图

Top BMD DEGs (limma BMD_main, adj.P<0.05) shown as z-score with BMD and Menopause annotations.
Top BMD DEGs (limma BMD_main, adj.P<0.05) shown as z-score with BMD and Menopause annotations.

第四步 · 功能富集:GO + GSEA

GO 用 clusterProfiler 的 enrichGO 做超几何检验;同时把 limma 输出的 logFC 排成基因列表,跑 gseGO(BP)和 MSigDB Hallmark 的 GSEA,兼顾『DEG 集合』和『全谱表达趋势』两种视角。

R · 04_enrichment.R
ego <- enrichGO(gene = entrez_deg, OrgDb = org.Hs.eg.db, ont = "BP",
                pAdjustMethod = "BH", pvalueCutoff = 0.05, readable = TRUE,
                universe = entrez_all)

# GSEA on full ranked logFC list
glist <- sort(setNames(tt$logFC, tt$ENTREZID), decreasing = TRUE)
gsea_h <- GSEA(glist, TERM2GENE = hallmark_t2g, pvalueCutoff = 0.25)

4.1 GO 三本体气泡图(BP / MF / CC)

GO enrichment — top 10 terms per ontology. BP is dominated by TNF / cytokine production.
GO enrichment — top 10 terms per ontology. BP is dominated by TNF / cytokine production.

4.2 GSEA · GO BP 山脊图

GSEA on GO BP, top pathways shown as ridges coloured by NES.
GSEA on GO BP, top pathways shown as ridges coloured by NES.

4.3 GSEA · Hallmark 通路条形图

Hallmark GSEA bar plot. TNFα signaling via NF-κB shows the strongest positive NES.
Hallmark GSEA bar plot. TNFα signaling via NF-κB shows the strongest positive NES.

上篇小结 + 下篇预告

到这里上篇完成了从原始矩阵到通路解释的闭环:BMD 在外周单核细胞上的转录组效应虽然总体偏弱,但分层后绝经后差异显著放大,且高度聚焦于 TNFα/NF-κB 轴 —— 这是一条与破骨细胞分化、骨重塑息息相关的经典通路。

下一篇会接着写 WGCNA 共表达模块、免疫与单核细胞亚群信号打分、以及 LASSO + 随机森林 + SVM-RFE 三重共识筛 biomarker + ROC 验证。

本文相关服务

Bulk 转录组分析

想把类似的分析跑在你自己的数据上?可以直接看服务详情或发起咨询。

更多案例