- What is SEEK?
SEEK stands for Search-based Exploration of Expression Compendium. It is a gene-based human co-expression search system. Given a query gene-set, the system prioritizes thousands of expression datasets (deposited in the public repository GEO) in order to find those that may be relevant to the query. Additionally, SEEK integrates datasets to identify other genes that are co-expressed with the query genes.
- What is SEEK used for?
SEEK has a number of usage scenarios. A different way to look at this question is: what are the scenarios in which finding co-expressions could be useful? Here they are:
- When users define a single-gene query, SEEK can retrieve co-expressed genes to reveal insights about the function of the query gene.
- Biologists might have a small set of candidate genes from genetic screens or other genomic studies. When users input them as a query gene-set, SEEK can retrieve other genes that are part of the common biological theme underlying the query gene-set (a biological process, pathway, molecular function, common miRNA or TF regulator, etc.).
- The co-expressed genes may also identify possible gene-interactions involving the query.
- Because SEEK prioritizes datasets, it also helps to establish associations between the query gene-set and tissues, diseases, and cell types (which are described in the dataset metadata). Users can ask questions such as:
- What are the datasets in the compendium where my query genes are co-expressed?
- Do the datasets with query co-expression seem to be associated with a particular disease or tissue type?
- What are SEEK’s advantages?
The advantages include:
- robust, cross-platform integration of co-expressed genes: co-expression results from multiple platforms are combined to give a robust gene ranking
- a large collection of expression datasets used for integration (5,500 datasets with 155,000 arrays, including RNA-Seq datasets)
- global or area-specific co-expression search
- attractive visualization of expression patterns with flexible attribute-based condition display and clustering
- What is the dataset weighting algorithm used by SEEK?
The weight of each dataset is calculated at search time from the query genes.
The rationale is to up-weight datasets where the query genes are co-expressed. The more co-expressed they are in a dataset, the more relevant the dataset is, and the higher its weight will be.
A cross-validation-based algorithm is used to give robust dataset weights. It divides the query into several parts, chooses one part as a sub-query, and then evaluates how well the dataset retrieves the remaining query parts.
Frequently, the query genes are only partially co-expressed even in the most informative datasets. As a result, the low correlations among the non-co-expressed query genes can drag down the weight of a dataset, even though that weight should reflect the co-expressed, informative part of the query. To address this, SEEK uses a rank-based procedure, inspired by rank-biased precision from information retrieval, that emphasizes the high correlations between genes in the query.
Since the weighting of datasets is based on the similarity of the query genes, datasets where the query genes have incoherent expression are automatically ignored in integration (these could be low-quality datasets, datasets with spurious correlations related to the query, or irrelevant datasets). In this way the algorithm achieves automatic data quality control.
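The weighting idea described above can be sketched in a few lines. This is a minimal illustration, not SEEK's implementation: it assumes each query gene is used in turn as the sub-query, that each dataset is given as a nested dictionary of pairwise correlations, and that the rank-biased precision parameter `p` is 0.99. The function name `dataset_weight` and all of these choices are assumptions made for the example.

```python
def dataset_weight(corr, query, p=0.99):
    """Cross-validated, rank-biased weight of one dataset for a query.

    corr  : dict gene -> dict gene -> correlation within this dataset
    query : list of at least two query genes, all present in `corr`
    p     : rank-biased precision persistence (assumed value)
    """
    fold_scores = []
    for sub_query in query:
        held_out = [g for g in query if g != sub_query]
        # Rank every other gene by decreasing correlation to the sub-query.
        others = [g for g in corr if g != sub_query]
        others.sort(key=lambda g: corr[sub_query][g], reverse=True)
        rank = {g: i for i, g in enumerate(others)}  # 0 = most correlated
        # Held-out query genes near the top of the ranking dominate the score,
        # so a poorly correlated part of the query costs very little.
        fold_scores.append(sum((1 - p) * p ** rank[g] for g in held_out))
    # Average over the cross-validation folds.
    return sum(fold_scores) / len(fold_scores)
```

A dataset in which the query genes rank near the top of each other's correlation lists receives a higher weight than one in which they are scattered through the ranking, which is the behavior the answer above describes.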
- How does SEEK compute significance for dataset weight?
The significance P-value is computed from a background distribution of random co-expression edges, made from a random set of genes of the same size as the real query. Such a background distribution is specific to each dataset and to each query size. Random trials made up of 1,000 random queries were used, and a generalized Pareto distribution was fitted to extract the parameters of the background distribution for easy computation of the P-value.
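As a rough illustration of how a fitted generalized Pareto distribution turns a dataset weight into a P-value, the sketch below fits the distribution to a set of background weights and evaluates its upper tail. The fitting method (method of moments with the location fixed at zero), the function names, and the synthetic background values are all assumptions for the example; the fitting procedure SEEK actually uses is not described here.

```python
import math
import random
import statistics

def fit_gpd_moments(samples):
    """Method-of-moments fit of a generalized Pareto distribution (location 0).

    Returns (shape, scale). Assumes the samples are non-negative weights.
    """
    m = statistics.mean(samples)
    v = statistics.variance(samples)
    ratio = m * m / v
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * m * (1.0 + ratio)
    return shape, scale

def gpd_p_value(weight, shape, scale):
    """Upper-tail probability P(W >= weight) under the fitted distribution."""
    if weight <= 0:
        return 1.0
    if abs(shape) < 1e-9:
        return math.exp(-weight / scale)  # exponential limit of the GPD
    base = 1.0 + shape * weight / scale
    if base <= 0:
        return 0.0  # beyond the finite upper end point (negative shape)
    return base ** (-1.0 / shape)

# Stand-in background: 1000 synthetic weights in place of the weights
# obtained from 1000 real random queries on one dataset.
random.seed(0)
background = [random.expovariate(50.0) for _ in range(1000)]
shape, scale = fit_gpd_moments(background)
p_value = gpd_p_value(0.1, shape, scale)
```

Once the two parameters are stored for a given dataset and query size, the P-value of any observed weight is a closed-form evaluation, which is the convenience the answer above refers to.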
- How is the score of each gene computed?
Computing the final gene score uses the dataset weights (previously discussed) in order to reflect the co-expressions found in the most relevant datasets. For each gene g, the final score combines the gene's per-dataset scores s_d(g), weighted by the dataset weights, over the set of datasets that contain g. The score of g in each dataset, s_d(g), is computed from r, the correlation between g and the genes in Q, where Q is the query. To reduce the bias caused by genes with insufficient dataset coverage, we discard genes that are covered by fewer than 50% of the datasets in the compendium. These genes automatically have the lowest score.
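One plausible reading of this scoring scheme is sketched below. It assumes that s_d(g) is the mean correlation between g and the query genes, and that the final score is the weight-normalized average of s_d(g) over the datasets containing g. Both formulas, the function names, and the use of negative infinity as "the lowest score" are assumptions for illustration, not a statement of SEEK's exact equations.

```python
def per_dataset_score(corr, gene, query):
    """s_d(g), assumed here to be the mean correlation r between `gene`
    and the genes of the query Q within one dataset."""
    return sum(corr[gene][q] for q in query) / len(query)

def final_gene_scores(dataset_scores, dataset_weights, min_coverage=0.5):
    """Integrate per-dataset scores into one score per gene.

    dataset_scores  : dict dataset -> dict gene -> s_d(g)
    dataset_weights : dict dataset -> non-negative dataset weight
    min_coverage    : genes present in fewer than this fraction of the
                      datasets receive the lowest possible score
    """
    n_datasets = len(dataset_scores)
    genes = set()
    for scores in dataset_scores.values():
        genes.update(scores)
    final = {}
    for g in genes:
        # Datasets that contain gene g.
        containing = [d for d, s in dataset_scores.items() if g in s]
        total_weight = sum(dataset_weights[d] for d in containing)
        if len(containing) < min_coverage * n_datasets or total_weight == 0:
            final[g] = float("-inf")
            continue
        weighted = sum(dataset_weights[d] * dataset_scores[d][g] for d in containing)
        final[g] = weighted / total_weight
    return final
```

Normalizing by the total weight of the datasets that contain g keeps genes measured on fewer platforms comparable with genes measured everywhere, while the coverage cutoff removes genes seen in too few datasets to be scored reliably.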
- How do I know if the co-expressed genes retrieved by SEEK are significant?
In order to assess the significance of the retrieved genes, we adopt a null model where we assume that the query is random (i.e., a random set of genes). We generated 10,000 random queries of sizes ranging from 1 to 100 genes. We searched all random queries in SEEK and produced a set of gene rankings. Given a true query, to estimate the significance of gene x in the true query's ranking, we estimate the fraction of random queries where the rank of x is higher than the rank of x in the true query. We note that the null model is generally very similar across query sizes beyond 10 genes, so we can use a size-free estimation for these query sizes.
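The estimate described above is an empirical P-value, which can be sketched as follows. The function name and the convention that rank 1 is the most co-expressed gene (so "higher rank" means a smaller number) are assumptions for the example.

```python
def empirical_gene_p_value(true_rank, random_ranks):
    """Fraction of random queries that rank the gene higher (closer to the
    top) than the true query does. Ranks are 1-based; rank 1 is the top."""
    better = sum(1 for r in random_ranks if r < true_rank)
    return better / len(random_ranks)

# Hypothetical ranks of one gene across five random queries.
p = empirical_gene_p_value(20, [5, 50, 500, 5000, 10])
```

A small value means random queries rarely place the gene as high as the true query does, so its position in the true ranking is unlikely to be due to chance.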
- How do I know if my query is co-expressed or not?
Since the dataset weight is calculated from query co-expression, the dataset weight can directly answer this question. In general, the query is considered co-expressed if there is a subset of datasets in the compendium with sufficiently high dataset weight.
The significance of the dataset weight indicates how the query's co-expression compares with that of a random gene set. The number of datasets with a significant dataset weight (given some P-value threshold) indicates whether the query's co-expression occurs widely in the compendium or is restricted to a subset of datasets.
- What is a dataset keyword?
A dataset keyword is a curated term (in a controlled vocabulary) that describes a dataset. In SEEK, keywords come from the UMLS controlled vocabulary, which specifies a comprehensive set of tissue and disease types. To determine which keywords are annotated to each dataset, SEEK uses a semi-automatic strategy that involves text mining followed by manual curation. The text-mining step looks for controlled vocabulary terms within the dataset description and sample description texts associated with the dataset. In manual curation, we review and correct the mappings for commonly mismapped keywords.
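The text-mining step can be pictured as whole-word matching of vocabulary terms against the description text. The sketch below is a simplified illustration under that assumption; the function name, the tiny vocabulary, and the sample description are hypothetical and are not taken from UMLS or from SEEK.

```python
import re

def find_keywords(text, vocabulary):
    """Return the vocabulary terms that occur in `text` as whole words,
    ignoring case. Terms are matched literally, not as patterns."""
    found = []
    for term in vocabulary:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found

# Hypothetical vocabulary and dataset description.
vocabulary = ["breast", "liver", "carcinoma", "ear"]
description = "Expression profiling of early-stage breast carcinoma biopsies"
keywords = find_keywords(description, vocabulary)
```

Matching on word boundaries avoids some false hits (here "ear" inside "early"), but ambiguous terms still slip through, which is why a manual curation pass follows the automatic step.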
- How do I narrow down the scope of datasets used in the search of a query?
SEEK by default utilizes all of the thousands of datasets in the compendium for the query search. Users can, however, restrict the search to datasets with particular characteristics, such as disease type or tissue type. To do so:
- Search the query genes globally (simply enter the query and click "Search").
- On the result page, look for the link "Refine Search" and proceed.
- Once datasets are selected, click "Refine".
SEEK will now weight only those datasets within the selection and use them to return the query's co-expressed genes.
- How do I get the complete list of genes or datasets prioritized to the given query?
On the result page of the query, scroll down to the bottom and click on the link "See the complete gene-list ranked by co-expression score" or "See the complete dataset-list ranked by query-relevance".
- How can I check the rank for a gene or dataset of interest?
Get the complete list of co-expressed genes or datasets (see previous question). Then, in the new window that opens with the text results, use the browser's search function to find where the gene or dataset of interest is ranked.
- How can I visualize the expression for a particular gene of interest?
First, check where the gene of interest is ranked (see previous question). Then go back to the main Expression View and use the gene navigation box to move to the page containing the gene of interest. For example, if BRCA1 is your gene of interest and it is ranked at position 150 (after checking the full gene ranking), then go to page "Genes (101-200)" in the navigation box.
- How can I see the genes that are anti-correlated with the query?
In the Expression/Co-expression View, find the gene navigation box, and then go to the rank positions at the bottom of the ranking (e.g., Genes: 17901 – 17920).
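The navigation box groups the gene ranking into fixed-size pages, so the page holding a given rank follows from simple integer arithmetic. The helper below is a small sketch; the function name is hypothetical, and the page size is passed in because the two examples in this FAQ ("Genes (101-200)" and "Genes: 17901 – 17920") imply different page sizes.

```python
def page_for_rank(rank, page_size):
    """Return (first, last) rank shown on the page that contains `rank`.
    Ranks are 1-based."""
    first = ((rank - 1) // page_size) * page_size + 1
    return first, first + page_size - 1

# A gene ranked 150th with 100 genes per page sits on page (101, 200).
brca1_page = page_for_rank(150, 100)
```

The same helper locates the bottom of the ranking: with 20 genes per page, rank 17901 falls on the page covering ranks 17901 to 17920.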
- How can I remember the dataset selection for subsequent queries?
At the top of the window beside the query box, click on "Options", and then select "True" for "Remember dataset selection for subsequent queries".
- Where do I see the retrieved genes' significance?
The main Expression/Co-expression View only shows the co-expression score. Click on "See the complete gene-list ranked by co-expression score" located at the bottom. The new page will show the significance of each gene.
- What are the datasets used in searching my query?
SEEK by default considers all of the thousands of datasets in the compendium for the query search. Datasets are weighted differently, according to what the query genes are. To see those datasets used for your query, look at the dataset weight.
- How large a query can SEEK handle?
SEEK can accept both single-gene and multi-gene queries. While queries involving several hundred genes are technically feasible, we do not recommend them, because such large queries are likely to have heterogeneous expression patterns, which can lead to poor results. They also consume a large amount of computing resources. We therefore recommend queries with fewer than about 200 genes. Please contact us if a larger query is needed.
- How much time does searching a query take?
The time depends on the size of the query and the volume of traffic. If the server is not busy, the search speed is approximately 3 seconds per query gene and the time scales up linearly for larger queries. For example, searching a 3-gene query takes about 9 seconds.
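The linear scaling above gives a quick way to estimate the wait for a query of a given size. This is a back-of-the-envelope sketch that assumes an idle server and the approximate rate of 3 seconds per query gene quoted above; the function name is hypothetical.

```python
def estimated_search_seconds(n_query_genes, seconds_per_gene=3.0):
    """Rough search time on an idle server, assuming linear scaling."""
    return n_query_genes * seconds_per_gene

three_gene_estimate = estimated_search_seconds(3)
```

By the same arithmetic, a 200-gene query would take on the order of 600 seconds, or about ten minutes, before accounting for server traffic.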
- What if my query contains one or more unrecognized gene names?
SEEK ignores these unrecognized genes, and reports them as "Gene X is not found" in the progress box.
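The behavior described above amounts to splitting the query into recognized and unrecognized names before searching. The sketch below illustrates that split; the function name, the set of known genes, and the exact, case-sensitive matching are assumptions, and only the message format comes from the answer above.

```python
def split_query(query, known_genes):
    """Separate a query into recognized genes and progress-box messages
    for the unrecognized ones. Matching is exact (an assumption)."""
    recognized = [g for g in query if g in known_genes]
    messages = [f"Gene {g} is not found" for g in query if g not in known_genes]
    return recognized, messages

# Hypothetical gene catalogue and query.
known = {"BRCA1", "TP53", "EGFR"}
recognized, messages = split_query(["BRCA1", "FOO1", "TP53"], known)
```

The search then proceeds with the recognized genes only, so it is worth checking the progress box for misspelled or outdated gene names before interpreting the results.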