Supplementary Material
  • SEEK hubbiness correction
    SEEK applies a hubbiness correction so that generally "hubby" genes (i.e., well-connected genes; see Barabasi et al, Han et al, Xulvi-Brunet et al) are not retrieved merely because they have high coexpression with the query regardless of the query composition. For each gene in the retrieved list (called the target), SEEK subtracts the target's average coexpression score, computed from the coexpression of the target to all genes in the genome.

    The effect is that a highly connected target gene moves down in the ranking, because its higher average coexpression score is subtracted. This balances out gene degree in the coexpression network, so the search result reflects genes that are more specifically correlated with the query.
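
    The subtraction described above can be illustrated with a minimal sketch. It is not SEEK's implementation: the function name, the toy matrix, and the query scores are hypothetical, and excluding a gene's coexpression with itself from the average is an assumption made here for illustration.

```python
def hubbiness_correct(query_scores, coexpression):
    """Subtract each target's genome-wide average coexpression from its query score.

    query_scores: list of coexpression scores between the query and each target gene.
    coexpression: symmetric gene-by-gene matrix (list of lists) over all genes.
    """
    n_genes = len(coexpression)
    corrected = []
    for g in range(n_genes):
        # Average coexpression of target g to all other genes
        # (self-coexpression is skipped; this is an assumption of the sketch).
        others = [coexpression[g][h] for h in range(n_genes) if h != g]
        average = sum(others) / len(others)
        corrected.append(query_scores[g] - average)
    return corrected


def rank_genes(scores):
    """Return gene indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda g: -scores[g])


# Toy data: gene 0 is a hub that is strongly coexpressed with every other gene.
coexpression = [
    [1.0, 0.8, 0.7, 0.9],
    [0.8, 1.0, 0.1, 0.2],
    [0.7, 0.1, 1.0, 0.0],
    [0.9, 0.2, 0.0, 1.0],
]
query_scores = [0.8, 0.7, 0.3, 0.2]  # hypothetical query-to-target scores

corrected = hubbiness_correct(query_scores, coexpression)
ranking_before = rank_genes(query_scores)
ranking_after = rank_genes(corrected)
```

    In this toy case the hub (gene 0) ranks first before the correction and drops below genes 1 and 2 afterwards, because its large average coexpression is subtracted from its score.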

    Evaluation and example

    We tested the correction on 344 GO Biological Process slim terms, retrieving co-annotated genes from each slim term. The hubbiness correction improved performance for 219 GO terms, with an average performance improvement of 124%.

    In the other 125 GO terms, where performance did not improve significantly or got worse, the correction procedure retained >83% of the original performance. Performance is measured as precision at 10% recall.

    In another evaluation, we tested whether SEEK successfully downweights frequently retrieved genes. Specifically, we checked the difference the correction makes to the ranks of specific genes. We ran 1000 randomly selected queries. The Table below shows how often the hubby genes appear in the top 100 rank positions before and after the correction procedure.
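
    The counting behind such a comparison can be sketched as follows. This is only an illustration under stated assumptions: the gene names and rankings are made up, the function name is hypothetical, and a cutoff of 2 stands in for the top 100 positions so the toy data stays small.

```python
from collections import Counter


def top_k_frequency(rankings, k=100):
    """Count, for each gene, the number of queries in which it ranks in the top k.

    rankings: one ranked gene list per query, best match first.
    """
    counts = Counter()
    for ranking in rankings:
        # A set ensures a gene is counted at most once per query.
        counts.update(set(ranking[:k]))
    return counts


# Hypothetical results for three queries, before and after the correction.
rankings_before = [
    ["HUB1", "A", "B"],
    ["HUB1", "C", "A"],
    ["B", "HUB1", "C"],
]
rankings_after = [
    ["A", "B", "HUB1"],
    ["C", "HUB1", "A"],
    ["B", "C", "HUB1"],
]

freq_before = top_k_frequency(rankings_before, k=2)
freq_after = top_k_frequency(rankings_after, k=2)
```

    Comparing `freq_before["HUB1"]` with `freq_after["HUB1"]` shows how often the hypothetical hub gene reaches the top positions with and without the correction.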

  • SEEK vs SPELL comparison
    SPELL (Hibbs et al) is an earlier algorithm designed to search for coexpressed genes in the yeast expression compendium. Although it worked well for yeast, it was insufficient for searching the large human compendium, which contains roughly 20 times more datasets (~5,000 compared to 300) and about 4 times more genes (25,000 compared to 6,000).

    We found that SEEK handles the dramatic increase in the size of the human data better than SPELL, in a setting where human genes also exhibit substantially more heterogeneous expression patterns. SEEK includes many data-structure changes and optimizations to the system and its implementation, built on the Sleipnir library. The search algorithm is also fundamentally different from SPELL's. The Figure below shows that SEEK outperforms SPELL in 248 out of 344 evaluated GO biological processes (when we searched with a subset of each process's genes to retrieve the rest).

    The average performance improvement is 154% in precision at 10% recall. Much of the improvement comes from the cross-validated dataset weighting algorithm, which is flexible enough to detect partial coexpression among the query genes using a robust rank-based framework. In the Figure, n1 is the number of GO terms where SPELL outperforms SEEK, and n2 is the number where SEEK outperforms SPELL.
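
    For readers unfamiliar with the metric, the sketch below shows one common way to compute precision at 10% recall: walk down the ranked list until 10% of the known positive genes have been retrieved, then report the fraction of genes seen so far that are positives. The function name and toy data are hypothetical, and the exact evaluation code used for SEEK and SPELL may handle ties and rounding differently.

```python
import math


def precision_at_recall(ranked_genes, positives, recall=0.10):
    """Precision at the first rank where the given recall level is reached."""
    positives = set(positives)
    if not positives:
        raise ValueError("at least one positive gene is required")
    # Number of positives that must be retrieved; the small epsilon guards
    # against floating-point round-up, and at least one hit is always required.
    needed = max(1, math.ceil(recall * len(positives) - 1e-9))
    hits = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in positives:
            hits += 1
            if hits >= needed:
                return hits / rank
    return 0.0  # the recall level is never reached in this ranking


# Toy example: 20 annotated genes, so 10% recall means retrieving 2 of them.
positives = ["g%d" % i for i in range(20)]
ranked_genes = ["g0", "x0", "x1", "g1", "x2", "g2"]

precision = precision_at_recall(ranked_genes, positives)
```

    Here the second positive appears at rank 4, so the precision at 10% recall is 2/4.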