SEEK uses a hubbiness correction algorithm to prevent retrieving generally hubby genes (i.e., well-connected genes; see Barabasi et al., Han et al., Xulvi-Brunet et al.) that might have high coexpression to the query regardless of the query composition.
For each gene in the retrieved list (such a gene is known as the target), SEEK subtracts the target's average coexpression score, calculated from the coexpression of the target to all genes in the genome.
As a result, a highly connected target gene drops in the ranking because its higher average coexpression score is subtracted. This balances out the degree of the genes in the coexpression network, so the search results reflect genes that are more specifically correlated with the query.
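The correction described above can be illustrated with a minimal sketch. It is not SEEK's actual implementation: the function name `hubbiness_correct`, the dense gene-by-gene matrix layout, and the use of a plain mean over the query genes as the raw score are all assumptions made for illustration.

```python
import numpy as np

def hubbiness_correct(coexpr, query_idx):
    """Illustrative sketch (hypothetical helper, not SEEK's code).

    coexpr    : (n_genes, n_genes) symmetric coexpression matrix (assumed layout)
    query_idx : list of row/column indices of the query genes
    Returns (raw, corrected) score arrays, one entry per target gene.
    """
    coexpr = np.asarray(coexpr, dtype=float)
    # Raw score: mean coexpression of each target to the query genes.
    raw = coexpr[:, query_idx].mean(axis=1)
    # Background: mean coexpression of each target to all genes in the genome.
    background = coexpr.mean(axis=1)
    # Subtracting the background penalizes targets that are coexpressed with everything.
    return raw - background + 0.0, raw

# Toy genome: gene 0 is the query, gene 1 is a hub, gene 2 is query-specific.
coexpr = np.array([
    [0.0, 0.8, 0.7, 0.1],
    [0.8, 0.0, 0.8, 0.8],
    [0.7, 0.8, 0.0, 0.1],
    [0.1, 0.8, 0.1, 0.0],
])
corrected, raw = hubbiness_correct(coexpr, [0])
top_before = int(np.argmax(raw))        # hub gene ranks first on raw score
top_after = int(np.argmax(corrected))   # query-specific gene ranks first after correction
```

In this toy matrix the hub (gene 1) has the highest raw score, but its large genome-wide average pulls it below the query-specific gene 2 once the background is subtracted.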
Evaluation and example
We tested this on a group of 344 GO Biological Process slim terms, retrieving co-annotated genes from each slim term. The hubbiness correction improved performance for 219 GO terms, with an average performance improvement of 124%.
In the other 125 GO terms, where performance did not significantly improve or became worse, the correction procedure retained >83% of the original performance.
The performance is measured in terms of the precision at 10% recall.
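For readers unfamiliar with the metric, the sketch below shows one common way to compute precision at a fixed recall level from a ranked gene list. The function name `precision_at_recall` and the gene labels are hypothetical, and the exact tie-handling and rounding used in the evaluation above are not stated in the text, so this is an assumption rather than the evaluation code itself.

```python
import math

def precision_at_recall(ranked_genes, positives, recall=0.10):
    """Precision at the first rank where the given fraction of positives is recovered.

    ranked_genes : genes ordered from best to worst score
    positives    : genes annotated to the term (the gold standard)
    """
    positives = set(positives)
    # Number of positives that must be retrieved to reach the recall level;
    # round() guards against floating-point noise such as 0.1 * 30.
    needed = max(1, math.ceil(round(recall * len(positives), 9)))
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            hits += 1
            if hits >= needed:
                return hits / rank
    return 0.0  # recall level never reached

# Hypothetical example: 10 annotated genes, first one found at rank 2.
positives = [f"g{i}" for i in range(1, 11)]
ranked = ["x1", "g1", "x2", "g2", "g3"]
p10 = precision_at_recall(ranked, positives)  # 1 hit at rank 2
```

With 10 annotated genes, 10% recall requires one hit; that hit occurs at rank 2, so the precision is 1/2.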
In another evaluation, we sought to determine whether SEEK successfully downweights frequently retrieved genes.
Specifically, we checked the rank difference that the correction makes on specific genes. We ran searches with 1000 randomly selected queries. The table below shows how often the hubby genes appear in the top 100 rank positions before and after the correction procedure.