정리 조금/Bioinformatics & Biostatistics

Weighted Gene Co-expression Network Analysis

Turtle0105 2023. 10. 26. 22:48

예전에 정리해둔 노션을 옮겨왔습니다. 중간중간 옮겨오며 오류가 나는 부분이 있을 수 있습니다.

내용이 너무 많아 한글로 다시 정리하기 힘들어 거의 그냥 가져다놓았습니다 ㅠㅠ 


Intro

  WGCNA는 유전자간의 correlation을 기반으로 유전자 군집화를 진행 할 수 있게하는 알고리즘이다. 이 방법은 상당히 직관적인 아이디어로 구성되어있으며, 필요에 따라 biweight midcorrelation과 dynamic tree cut과 같은 독자적인 방법론을 만들어 적용하였다. 개선가능해보이는 부분 또는 풀리지 않은 질문은 이렇게 하이라이트 해놓았다. 총 네 단계로 이루어져 있으며 아래와 같이 작동한다.

왼쪽 위부터 오른 아래순서

이제부터 정리 시작!

 

Weighted Gene Coexpression Network Analysis (WGCNA) is a guilt by association approach.

  • Encourages hypotheses about genes based on their close network neighbors.
📖 Guilt by association means the implicative system that any person suffers unfavorable treatment or punishment on account of an act not of his own doing but committed by a relative within a certain degree. (연좌제/連(緣)坐制 in Korean)

WGCNA

1. Weighted correlation network of genes

Input data

$X=[x_{ij}]=\left[ {\begin{array}{cc} \underline{x}_{1} \\ ... \\ \underline{x}_{n} \\ \end{array} } \right] \in \mathbb{R}^{n \times m}$

$n$: # of network nodes (genes), $m$: # of sample

$i^{th}$ row vector: $\underline{x_i} \in \mathbb{R}^{1 \times m}$

Correlation

$s_{ij}=|cor(\underline{x}_i, \underline{x}_j)|$

💡 Correlations implemented in WGCNA

Pearson
- Linear correlation
- Sensitive to the outliers

Spearman, Kendall
- Robust to the outliers
- Includes non-linear correlation
- Less sensitive to the gene expression differences

Biweight midcorrelation
- Really robust to outliers
    - The outlier larger than $med(x)$ more than $9mad(x)$ is considered as 0
    - Since this coefficient excludes the outliers, I think this coefficient is more robust than the other rank-based correlation coefficients.
- Author recommended

아래는 Biweight midcorrelation 조금 정리해놓은 문서
Biweight midcorrelation.html
0.10MB


Correlations should be selected with overall situation. After reading the part 1 you can understnad below Q&A
 

WGCNA pearson or bicor? Also, should I adjust module correlation Pvalues by Bonferroni?

WGCNA pearson or bicor? Also, should I adjust module correlation Pvalues by Bonferroni? 1 Hello! I am performing a network analysis on WGCNA. I had to separate the samples into 2 networks (depending on the tissue) due to the high variation between tissues.

www.biostars.org

Power term

$a_{ij}=s_{ij}^{\beta}$

$\beta$ is a power term parameter which amplified disparity (fold differences) between strong and weak corrlations

eg)

$cor(\underline{x}_i, \underline{x}_j)=0.8$ & $cor(\underline{x}_k, \underline{x}_l)=0.2$

$\beta=4$ $→$ $0.8/0.2=4$ vs. $0.8^{4}/0.2^{4}=256$

💡 As $\beta$ increases, the difference between weights increases, ultimately resulting in a "pronounced" (laeger contrast) overall graph connectivity. However, if $\beta$ becomes too large, small weights converge to 0 fastly, making analysis impossible. Therefore, one should choose a small $\beta$ that allows for finding an appropriate scale-free network. This task is typically carried out by examining the R-squared and mean connectivity values of the scale-free topology model based on $\beta$ . With a little consideration, it becomes apparent that both the quality of the scale-free network and the mean connectivity of the entire network are indicators of a "pronounced" network.

There are several hyperparameter optimization approaches. Since the range is provided (where?), the grid search is good enough to apply.


$R^2$ index

The $R^2$ index for this plot is not for the original complex graph model, but for the log model ($log(p(k))$ & $log(k)$).

Why the $R^2$ of linear regression is important to check goodness of fit of scale free network. How does the log transformed Poisson data looks like? Isn’t it linear??

There are three distibutions which satisfy the gene co-expression networks approximately.

1. Barabasi (1999), Scale Free
- $log(p(k))=c_0+c_1log(k)$
2. Csanyi-Szendroi (2004), Exponentially Truncated SFT
- $log(p(k))=c_0+c_1log(k)+c_2k$
3. Horvath, Dong (2005), Log Log SFT
- $log(p(k))=c_0+c_1log(k)+c_2log(log(k))$

$→$ Scatter plot

Connectivity index

Connectivity measurement of a gene is a sum of the weights of all edges connecting to this gene
$k_i=\sum a_{ij}$

Network

Construct a fully connected network with adjacency matrix whose elements are $a_{ij}$

Then, $a_{ij}$ become the weight of edges

💡 3 types of networks

Unsigned (Default)
- $a_{ij}=|cor(\underline{x}_i, \underline{x}_j)|^{\beta}$
- No differences between positive and negative correlations
  - Use this if negative correlation are also of interest

Signed
- Direction does matter (interpretability, only same directions)
    - eg) If profiles go up together, or …
- $a_{ij}=({1+cor(x_i, x_j)\over{2}})^{\beta}$
- Negative correlations are considered unconnected.

Signed Hybrid
- Signed sometimes produces non zero for real zero correlations
  - So author utilizes the hard thresholding concept (that’s why we call this “hybrid”)
- $a_{ij}=|cor(\underline{x}_i, \underline{x}_j)|^{\beta}$ for $cor(\underline{x}_i, \underline{x}_j)>0$
- $a_{ij}=0$ for $otherwise$
- Only positive correlations are taken into account
- Negative correlations are set to 0
  - Use this if negative correlations are NOT of interest
- This makes optimal $\beta$ much smaller than signed network


Author recommended

Signed
- Can consider the (positive) direction

Hybrid Signed
- Good for selecting the small optimal $\beta$

We can also use the fuzzy measurements of negatively connected genes to the eigengenes.


⚠️ Unsigned?? Signed??

For our analysis, I think we need to check the negatively correlated pairs, too. Since all we have to do is to find the reversely expressed genes, after the drug treatment… If we check only positively correlated gene pairs, there could be information loss. However, the loss could not be that much..! For example, we can think about the situation like below,

(Middle) Module with Unsigned Network & (Right) Modules with Signed Network

We need to think about the trade-off. If we lose some informations about the connectivities, we can get some interpretability (i.e. We can perform analysis on the module level, like up/down regulated module (GSEA))

⚠️ Think About the Triplet!!

If we use unsigned network, one more problem occures calculating TOM….! This will be discussed below (TOM section)

References
Unsigned vs. Signed       Signed vs. Signed hybrid

2. Modules of genes with similar profile

I. Compute dissimilarity between genes

Generalized version to the weighted networks (Hovath & Zhang, 2005)

Hovath & Zhang suggested generalized version for weighted graph whose edges are real numbers between 0 and 1. The main idea of TOM is “reinforcement” of connectivity by “neighbor”

$$TOM_{ij}={{\sum_u{a_{iu}a_{uj}+a_{ij}}}\over{min(k_i,k_j)+1-a_{ij}}} = \omega_{ij}$$

$$DistTOM_{ij}=1-TOM_{ij}$$

if $TOM_{ij}=1$, $i, j$ have identical set of neighbor

if $TOM_{ij}\neq 1$, $i, j$ have no set of neighbor

 

Practically, the denominator can be used mean value alternative to min value.

 

This can serve as a filter that decreases the effect of spurious or weak connections and it can lead to more robust networks. (Hovath, 2017)

 

💡 TOM based connectivity measure

$\omega_i=\sum_{j=1}^n\omega_{ij}$

This can be a good alternative to the connectivity measure

Topological Overlap Measure (TOM) is a pairwise similarity between genes. (Ravasz, 2002)

The idea of TOM is origined from unweighted graph so that $a_{ij}$s’ are binary entries. The below equation is same as right v.s. equation.
$$ \omega_{ij}=\left\{\begin{matrix} {{|N_1(i)\cap N_1(j)|+a_{ij}}\over{min{(|N_1(i)|, |N_1(j)|)+1-a_{ij}}}}&& if & i\neq j\\ 1 && if & i=j \end{matrix}\right. $$
Where $N_1(i)$ denotes the set of neighbors of $i$ excluding $i$ itself and $|\cdot|$ denotes the number of elements (cardinality) in its argument. The quantitiy $|N_1(i)\cap N_1(j)|$ measures the number of common neighbors that nodes $i$ and $j$ share whereas $|N_1(i)|$ gives the number of neighbors of $i$.


Generalized (path length) version (Hovarth & Yip, 2007)

The generalized version of TOM is motivated by the observation that original TOM formula can be expressed v.s. (the original one) This version can consider the range of neighborhood. By denoting $N_m(i)$ the set of nodes (excluding $i$ itself) that are reachable from $i$ within a path of length $m$, i.e.,

$$ N_m(i)=\{j\neq i | dist(i,j)\leq m\} $$

where $dist(i,j)$ is the geodesic distance between $i$ and $j$, we obtain a very natural generalization of the TOM, which reads as follows

$$ \omega_{ij}^{[m]}=\left\{\begin{matrix} {{|N_m(i)\cap N_m(j)|+a_{ij}}\over{min{(|N_m(i)|, |N_m(j)|)+1-a_{ij}}}}&& if & i\neq j\\ 1 && if & i=j \end{matrix}\right. $$

We call the matrix $\Omega^{[m]}=[\omega_{ij}^{[m]}]$ the $m^{th}$ order generalized topological overlap matrix.

$$ \omega_{ij}^{[m]}={{|N_m(i)\cap N_m(j)|+a_{ij}+I_{i=j}}\over{min{(|N_m(i)|, |N_m(j)|)+1-a_{ij}}}} $$

GTOM0 is an adjcency matrix, GTOM1 is TOM and GTOM2 is generalized TOM

Signed version (Langfelder, 2013)

In the case of triplet, the weights of neighbor from unsigned network would not reinforce the connectivity. Problematic cases like (+, +, −) or (−, −, −) could be occured by noise. The connectivity of neighbor is called “anti-reinforced” the connectivity. TOM should distinguish the anti-reinforcing connections from reinforcing ones. To be able to do this, we should resurrect the signs of connections. This can be achieved by defining a modified equation below,

$$ TOM_{ij}^{signed}={{|\sum_u{\tilde a_{iu}\tilde a_{uj}+a_{ij}|}}\over{min(k_i,k_j)+1-|a_{ij}|}} $$

where $\tilde a$ is signed value of $a$.

In this context, the author recommended signed network, since these situations would not be occured in the Signed network.

References
A General Framework for Weighted Gene Co-Expression Network Analysis (Zhang & Horvath, 2005)
In the motivated example, they provided in the paper, show that GTOM2 represents more biological meaningful results than GTOM1. GTOM2 generates pronounced modules and larger module size, and GTOM1 generates smaller modules.

Signed vs. Unsigned TOM
TOM R library documentation

⚠️ I couldn’t find the Weighted + Generalized + (Un)Signed TOM yet…

II. Perfrom hierarchical clustering of genes

Hierarchical clustering using the average linkage with TOM matrix

We can check the MDS plot using the TOM or correlation matrix as a part of EDA. Just perform spectral decomposition on the matrics would produces eigen vectors and values to reduce the dimensionality for visualizing (2D or 3D)

III. Divide clusterd genes into modules

Cutting the tree branches procudes clusters (modules)

💡 About the module size…

There is no way to specify the number of each module. However, we can make lower bound of the module size setting the parameter “minModuleSize” from blockwiseModules function.

On the other hand, we can adjusting the size of module with the cutoff level (parameter “mergeCutHeight”) in the dendrogram. For instance, If the level is high then the cluster becomes larger. As the author commented, we can set the float value of “mergeCutHeight” with the sample level. For the sample size 50-100, we can set the float value 0.25-0.3

References
How to select mergeCutHeight in WGCNA?
Insights from a billion correlations

IV. Merge very similar modules

Use module “eigengenes”. Eigengene is the first principal component of expression data in the module which could explain the greatest variance of the module. Why didn’t they utilize other distance matrix, say TOM? Like classical MDS plotting, we can get eigen value and vector by spectral decompositioning on distance matrix. I think, there could be pitfall I never know… This one vector which has the same dimentionality as each gene becomes eigengene. The eigengene represents the module.

Since each eigengene represents each module, performing hierarchical clustering on the eigengenes would preduce the relationships between modules.

3. Correlation of modules with clinical traits

With the eigengene, we can measure the correlations between modules and the traits.

💡 If there are binary traits, we can consider the Pearson corelation (Point-Biserial Correlation), but the
Biweight midcorrelation should be modified. (How?)

4. Identification of potential driver genes

Driver (key) genes

  • Influence the expression or function of other genes
  • Related to factors for a trait of interest

We can identify the driver genes which are hub genes and most strongly correlated with a trait of interest.

  • Hub gene
    • genes with highest intramodular connectivity
      • Sum of in-module edge weights
    • genes with highest module membership
      • Correlation of a gene to module eigengene

Potential driver genes

5. Evaluate module quality

Assess overall quality of modules for the correct analysis, or for the further analysis

Module sizes

Too large

  • This makes biological sense, but hard to handle

Too small

  • Level of merging is not sufficient

We can compare our network to the random clustered network using statistical test.

Connectivity

  • Mean intra-module connectivity
  • Mean ratio of intra-module / total connectivity

Trait correlations

  • Strong correlation between module eigengenes and traits of interest.
  • Strong correlation between gene module membership gene-trait correlation

Functional enrichment

  • Many functionally related genes in the same module

$→$ We can apply bootstrap or permutation test to obtain p-values.

Caveats

  1. Should receive normalized data
  2. Basic hierarchical clustering is applied, but this is not always the best approach.
  3. Unlike the literature, only undirected graph structures are used.
  4. The key genes are not necessarily causations

References

I really appreciated to the reference authors!!

WGCNA: an R package for weighted correlation network analysis - BMC Bioinformatics

Tutorials for WGCNA R package   “Blockwise” network analysis of large data

PPT1   PPT2

Outro

본문의 내용은 상당히 직관적으로, 계산과 기초통계를 중심으로 gene과 그 network에 대해 컴퓨터가 인식할 수 있도록 직관적으로 잘 표현했다. 이런 semantic gap이 적은 방법론이 아무래도 해석하기 용이하다보니 과학자들, 특히 생물정보분야에서 선호하는 것 같다. 그래서 인지 피인용수가 현재 15K가 넘어간다. 제법 직관적인 방법론이다보니, 이해하기에 재미도 있었고, 크게 어렵지는 않았다. 다만 마지막 계층적 군집화의 대안으로, 저자가 구현한 Dynamic tree cut 알고리즘을 원활히 이해하기 어려웠다. 감사하게도 저자중 한분인 Peter Langfelder 님이 내 질문이 담긴 메일에 답변을 잘 해주셔서 어느정도 이해가 되었다. 하지만, 아직까지 뭔가 찝찝하다. 해당 알고리즘에대한 확신이 내게는 없어서인듯 하다.

그리고 다시봐도 필요에 따라 새로운 방법론을 만들고, 논문으로 출판하고... 엄청난 열정인것 같다.

 

혹시 잘못 정리한 내용이나, 빠진 부분있다면 알려주세요! 감사합니다~

'정리 조금 > Bioinformatics & Biostatistics' 카테고리의 다른 글

Survival Analysis  (1) 2023.12.01
Gene Set Enrichment Analysis  (0) 2023.08.29
The Characteristic Direction  (0) 2023.08.28
ZIBseq  (0) 2023.01.12
edgeR & DESeq2  (0) 2023.01.11