These installation instructions are written for Linux/Unix environments.
Installing SEEK on a local computer enables you to use the system on a user-specified collection of datasets, or on a batch of queries. Please note the commit version (2329130) when you obtain SEEK from the source repository (see below).
Installation
- Prerequisites for installing Sleipnir
- GCC 4.5+
- Log4cpp (ubuntu packages: liblog4cpp5, liblog4cpp5-dev) (source)
- GNU GSL library (ubuntu packages: gsl-bin, libgsl0-dbg, libgsl0-dev, libgsl0ldbl) (source)
- GNU Gengetopt (ubuntu package: gengetopt) (source)
- Compile and install SEEK from the source code
- SEEK's code is hosted on GitHub as part of Sleipnir (see the Sleipnir documentation at https://functionlab.github.io/sleipnir-docs/). Clone the repository with:
git clone https://github.com/FunctionLab/sleipnir.git
The above obtains the latest development branch of SEEK. We recommend that you use the stable version, which is an earlier commit (2329130). You can check it out using the following command (if you obtained SEEK after 2/16/2015, you likely need to do this step). Note that hg is the Mercurial client, so this command applies to a Mercurial checkout of the repository:
hg update -r 2329130 -C
Verify that you have the correct commit version with hg summary.
- Auto-create make files.
./gen_auto
- Configure step.
If you are using Ubuntu, there is no need to specify the locations of GSL, log4cpp, etc., because these packages are installed in standard system-wide locations:
./configure --prefix=/home/qzhu/sleipnir_build
On a server without root access, where log4cpp, gengetopt, and gsl are installed in custom locations:
./configure --prefix=/r03/qzhu/sleipnir_build \
    --with-log4cpp=/r03/qzhu/usr \
    --with-gengetopt=/r03/qzhu/usr \
    --with-gsl=/r03/qzhu/usr
Change the directories to fit your own paths. Once you have entered the configure command, you will see a list of which packages were found or not found, and which Sleipnir tools will not be built. You should see "found installed" for log4cpp, pthread, gsl, OpenMP, and gengetopt. Messages about other Sleipnir dependencies, such as SMILE and SVMPerf, can be ignored since we are not building the entire Sleipnir toolset.
- Next, we will compile and install the SeekMiner, SeekEvaluator, SeekPrep, Distancer, Data2DB, and PCL2Bin tools, which are required for SEEK. In short:
SeekMiner is the main search program.
SeekEvaluator is the program that displays the search results.
SeekPrep is the initial preparation program (computes gene-based z-average for hubbiness correction).
Distancer computes the correlation matrix.
Data2DB splits correlation matrix into gene-based correlation vectors for ease of searching.
PCL2Bin converts PCL to a binary format.
To compile and install, enter:
cd src && make
cd ../tools/SeekMiner && make && make install
cd ../SeekEvaluator && make && make install
cd ../SeekPrep && make && make install
cd ../Distancer && make && make install
cd ../Data2DB && make && make install
cd ../PCL2Bin && make && make install
- Verify that Sleipnir is installed.
Test the installation by running:
/home/qzhu/sleipnir_build/bin/SeekMiner -h
On Ubuntu, you should not get any errors.
On a custom install, usually on a machine without root access, you might get an error saying that a prerequisite library was not found, because the prerequisites are installed in user-specified directories. In this case, set up the environment as follows:
GSL_PATH=<your path to GSL lib directory>
LOG4CPP_PATH=<your path to LOG4CPP lib directory>
GENGETOPT_PATH=<your path to Gengetopt lib directory>
export LD_LIBRARY_PATH=$GSL_PATH:$LOG4CPP_PATH:$GENGETOPT_PATH:$LD_LIBRARY_PATH
Then try testing SeekMiner -h again. It is a good idea to put these lines of code in an initialization bash script to be executed anytime before running SEEK.
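As a complement to the manual SeekMiner -h test, the following Python sketch checks that all six tools named in this tutorial were installed into the bin directory. The function name missing_tools and the example path are illustrative assumptions, not part of Sleipnir; the check only confirms that the files exist and are executable, not that their shared libraries can be loaded.

```python
import os

# Tool names taken from the compile-and-install step of this tutorial.
REQUIRED_TOOLS = ["SeekMiner", "SeekEvaluator", "SeekPrep",
                  "Distancer", "Data2DB", "PCL2Bin"]


def missing_tools(bin_dir):
    """Return the required tools that are not executable files in bin_dir."""
    missing = []
    for tool in REQUIRED_TOOLS:
        path = os.path.join(bin_dir, tool)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(tool)
    return missing


# Example path only; replace it with your own --prefix followed by /bin.
print(missing_tools(os.path.expanduser("~/sleipnir_build/bin")))
```

An empty list means every tool was found. A tool that is present may still fail to start if LD_LIBRARY_PATH is not set as described above.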
- Congratulations. You have set up Sleipnir. Move on to the SEEK setup tutorial.
Setting up SEEK
- Define a set of datasets as your compendium
Required format of each dataset:
Each dataset file must be a tab-delimited matrix: the first row should be condition names, the first column should be gene names, and the matrix entries should be expression values. For simplicity, genes must be identified by Entrez IDs (we select Entrez because it is the gene naming system widely mapped across different human platforms). Datasets may originate from any gene expression hybridization or sequencing platform, though in the latter case users should collapse read abundances into gene-based values. The conditions within each dataset are expected to be comparable in their expression; this can be achieved by a standard normalization procedure such as quantile normalization or variance-stabilizing normalization. Generally speaking, expression values need not be comparable across datasets, as SEEK will examine the correlation structure within each dataset and normalize appropriately at the level of correlations.
Although each dataset must have more than 3 conditions to make Pearson correlation computation possible, having more conditions per dataset leads to better search performance. More conditions can be introduced through replicates or by including a diverse set of conditions in each dataset. Datasets with fold-change values are also acceptable inputs, as long as each condition's expression fold-change values follow an approximately normal distribution. Expression values must be log-normalized, or log(1+counts) normalized in the case of sequencing data.
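The format requirements above can be checked before building a compendium. The Python sketch below is a hypothetical helper (check_dataset is not part of SEEK) that assumes the simple layout described here: one header row of condition names, then one row per gene with a numeric Entrez ID in the first column. Empty cells are reported as non-numeric, so adapt the check if your files mark missing values differently.

```python
import csv
import io


def check_dataset(handle, min_conditions=4):
    """Return a list of problems found in a tab-delimited expression matrix."""
    rows = csv.reader(handle, delimiter="\t")
    header = next(rows, None)
    if header is None:
        return ["file is empty"]
    problems = []
    n_conditions = len(header) - 1  # the first column holds gene names
    if n_conditions < min_conditions:
        problems.append(f"only {n_conditions} conditions; more than 3 are required")
    for line_no, row in enumerate(rows, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
            continue
        if not row[0].isdigit():
            problems.append(f"line {line_no}: gene '{row[0]}' is not a numeric Entrez ID")
        for value in row[1:]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value '{value}'")
                break  # one report per row is enough
    return problems


# Tiny in-memory example; for a real file use open(path, newline="").
sample = "gene\tc1\tc2\tc3\tc4\n2735\t1.0\t2.5\t3.1\t0.4\n"
print(check_dataset(io.StringIO(sample)))
```

The check does not verify that values are log-normalized or that conditions are comparable; those properties still need to be confirmed from how the data was processed.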
Create a dataset list:
Once the datasets have been prepared, next prepare a dataset description file, which defines the datasets.
This is a 3-column tab-delimited file with the columns file_name, dataset_name, platform_name. For your reference, below is a sample dataset description. The platform column is a GEO-specific code indicating the experimental technology. In this case, this is GPL570, the popular Affymetrix HGU133-Plus 2 microarray platform. It can also be any value that the user defines.
GSE13494.GPL570.pcl GSE13494.GPL570 GPL570
GSE17907.GPL570.pcl GSE17907.GPL570 GPL570
GSE45584.GPL6480.pcl GSE45584.GPL6480 GPL6480
Examples
A collection of datasets: breast_cancer_dset.tar.gz
The corresponding dataset.description.txt file: dataset.description.txt
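A dataset description file in this 3-column layout can be parsed and sanity-checked with a few lines of Python. This is a sketch under stated assumptions: read_dataset_description is a hypothetical helper, and the rule that dataset names must be unique is our assumption rather than something the SEEK scripts are documented to enforce.

```python
def read_dataset_description(lines):
    """Parse tab-delimited lines into (file_name, dataset_name, platform_name) tuples."""
    entries = []
    seen = set()
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-delimited columns, got {len(fields)}")
        file_name, dataset_name, platform_name = fields
        if dataset_name in seen:  # assumption: names should be unique
            raise ValueError(f"line {line_no}: duplicate dataset name '{dataset_name}'")
        seen.add(dataset_name)
        entries.append((file_name, dataset_name, platform_name))
    return entries


sample = [
    "GSE13494.GPL570.pcl\tGSE13494.GPL570\tGPL570\n",
    "GSE45584.GPL6480.pcl\tGSE45584.GPL6480\tGPL6480\n",
]
for entry in read_dataset_description(sample):
    print(entry)
```

To check a real file, pass an open file object, for example read_dataset_description(open("dataset.description.txt")).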
- Create the compendium and all necessary files
Creating the compendium entails several tedious but necessary steps, including calculating the correlation matrix, normalizing correlations, and organizing the data for efficient searching. To make the process of creating the compendium easier, we provide below a set of scripts that automate these tasks.
Download: scripts.tar.gz
Please ensure that you have enough space: building a 10-dataset compendium requires 12GB.
Extract: tar -zxf scripts.tar.gz
Next, run prepare_seek.py from the extracted scripts:
./prepare_seek.py <dataset directory> \
    <dataset list> \
    <setting directory> \
    <path to sleipnir binary> \
    <output directory>
where dataset directory contains the dataset matrix files; dataset list is the dataset description file; setting directory contains the setting files (gene_map.txt and quant2); path to sleipnir binary is the directory where the Sleipnir tools are installed; and output directory is where the compendium will be created.
Example workflow:
mkdir test.seek
cd test.seek
tar -zxf ~/Downloads/breast_cancer_dset.tar.gz
cp ~/Downloads/dataset.description.txt .
~/scripts/prepare_seek.py pcl dataset.description.txt \
    ~/scripts ~/sleipnir_build/bin .
What this script will do:
- Calculate correlation matrix
- Calculate gene z-score average for gene hubbiness correction
- Join correlation matrices, then split them into per-gene correlation vectors to facilitate searching
- Calculate platform-wide gene z-score average for normalization
The script makes a number of decisions that should work well in most situations. These decisions are described next.
For the correlation measure, SEEK chooses the Pearson correlation coefficient followed by Fisher's transform and then a z-score normalization of the resulting normal distribution. It is found that Pearson works well in most datasets that have been properly normalized (RMA or MAS5). It has a very comparable performance to Spearman correlation. SEEK bounds all correlation z-score values within the range [-5, 5] in order to reduce the effect of extreme z-score values. During the gene-hubbiness computation stage, SEEK calculates an average connectivity score or average z-score for each gene to correct for hubby genes (these can have the undesirable effect of overwhelming the coexpressed gene ranking).
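The normalization steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not SEEK's actual implementation: the function names are hypothetical, the z-score uses the population standard deviation over all gene pairs in one dataset, and the small epsilon that keeps Fisher's transform finite at r = 1 or r = -1 is our own choice.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fisher(r, eps=1e-7):
    """Fisher's transform atanh(r); eps (an assumption) avoids infinities."""
    return math.atanh(max(-1 + eps, min(1 - eps, r)))


def normalized_correlations(expr, bound=5.0):
    """Map each gene pair to a z-scored, bounded Fisher-transformed correlation."""
    genes = sorted(expr)
    pairs = [(a, b) for i, a in enumerate(genes) for b in genes[i + 1:]]
    fz = {p: fisher(pearson(expr[p[0]], expr[p[1]])) for p in pairs}
    values = list(fz.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return {p: 0.0 for p in pairs}
    # Bound z-scores to [-bound, bound] to limit the effect of extreme values.
    return {p: max(-bound, min(bound, (v - mu) / sd)) for p, v in fz.items()}


def gene_average(z):
    """Average z-score per gene, the quantity used for hubbiness correction."""
    totals, counts = {}, {}
    for (a, b), v in z.items():
        for g in (a, b):
            totals[g] = totals.get(g, 0.0) + v
            counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


# Toy expression values for three Entrez IDs (made-up numbers).
expr = {"2735": [1, 2, 3, 4, 5], "2736": [2, 4, 6, 8, 11], "5727": [5, 3, 4, 1, 2]}
z = normalized_correlations(expr)
print(z)
print(gene_average(z))
```

A gene with a high average z-score is correlated with many other genes. Correcting for that average keeps such hubby genes from dominating the coexpressed gene ranking.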
Completing the full set of tasks in the script takes ~3 hours per 10 datasets and needs 5GB of RAM.
Please note that the bottleneck is the Data2DB step (joining and splitting correlation matrices). Its memory usage is quite heavy, but it can be adjusted.
For a large compendium with 100-300 datasets, the current setting in the script processes 50 datasets at a time, consuming 25GB of physical memory in the process. If you do not have enough memory, adjust 50 to a lower number (e.g. 20) (see the -B parameter in the Data2DB line).
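As a rough guide for choosing a lower -B value, the sketch below scales the figures quoted above (50 datasets per batch using 25GB). It assumes memory grows linearly with batch size, which this tutorial does not state, so treat the result as a starting estimate. The function name max_batch_size is hypothetical.

```python
import math


def max_batch_size(available_gb, gb_per_batch=25.0, datasets_per_batch=50):
    """Estimate a batch size that fits in available_gb, assuming linear scaling."""
    gb_per_dataset = gb_per_batch / datasets_per_batch  # 0.5 GB with the quoted figures
    return max(1, math.floor(available_gb / gb_per_dataset))


print(max_batch_size(10))  # 20
```

Under this assumption, a machine with 10GB of free memory would use a batch size of 20, the same lower value suggested above.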
Afterward, the SEEK compendium is set up. You will find the following new directories created: sinfo, plat, prep, dab, db, pclbin.
Running SEEK
- Define your queries
The main search functions are performed by the SeekMiner tool. SeekMiner can run many queries sequentially (even tens of thousands).
The first step is to define a query file containing all your queries of interest. The query file must have one query per line, with the genes on each line separated by spaces and given as Entrez IDs only. It can have as many lines as you want.
Example (three queries focusing on: GLI1 (2735), GLI2 (2736), PTCH1 (5727)):
2735 2736 5727
2736 5727
2735 5727
Note that single-gene queries and multi-gene queries cannot be mixed together. They need to be separate search instances.
Examples:
single-gene queries: queries.1, multi-gene queries: queries.2
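Because single-gene and multi-gene queries must be run as separate search instances, a mixed list can be split into two groups beforehand. The Python sketch below is a hypothetical helper (split_queries is not part of the SEEK scripts). It assumes Entrez IDs are purely numeric and rejects anything else.

```python
def split_queries(lines):
    """Split query lines into (single_gene_queries, multi_gene_queries)."""
    single, multi = [], []
    for line_no, line in enumerate(lines, start=1):
        genes = line.split()
        if not genes:
            continue  # skip blank lines
        bad = [g for g in genes if not g.isdigit()]
        if bad:
            raise ValueError(f"line {line_no}: not Entrez IDs: {bad}")
        (single if len(genes) == 1 else multi).append(genes)
    return single, multi


single, multi = split_queries(["2735 2736 5727\n", "2736\n", "2735 5727\n"])
print(single)
print(multi)
```

Each group can then be written to its own file, one space-delimited query per line, and searched separately.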
- Run SeekMiner
The process of performing the query is automated by the script run_seek.py. This is the wrapper for SeekMiner. We recommend that this script be used instead of SeekMiner, since most users need not be concerned about most of the parameters in SeekMiner. Nonetheless, it is important to be aware of the default options.
The dataset weighting algorithm (-V) employed for multiple-gene queries is query cross-validated weighting (CV). Essentially, this is a cross-validation scheme: use 1 query gene to search, and see how well the remaining query genes are retrieved in each dataset. By repeating the procedure for all 1-gene subsets of the query, we consider all possible scenarios in which query genes may be coexpressed or partially coexpressed with each other. The cumulative rank-biased precision measures the overall accuracy (and becomes the weight of the dataset).
We found that this weighting could accurately prioritize datasets and capture partial coexpression.
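The cross-validated weighting idea can be illustrated with a short Python sketch. It is not SeekMiner's code: the function names, the nested-dictionary layout of the correlation scores, and the persistence parameter p = 0.99 are assumptions made for illustration. Rank-biased precision is computed here in its standard form, (1 - p) times the sum of p^(rank - 1) over the ranks at which a relevant gene appears.

```python
def rank_biased_precision(ranked_genes, relevant, p=0.99):
    """RBP of a ranking: (1 - p) * sum of p**(rank - 1) over relevant ranks."""
    return (1 - p) * sum(p ** i for i, g in enumerate(ranked_genes) if g in relevant)


def cv_dataset_weight(query, corr, p=0.99):
    """Weight one dataset by how well each query gene retrieves the others.

    corr maps gene -> {other gene -> correlation score} within the dataset.
    """
    present = [g for g in query if g in corr]
    weight = 0.0
    for held in present:
        others = set(present) - {held}
        if not others:
            continue  # nothing to retrieve with a single present query gene
        ranking = sorted((g for g in corr[held] if g != held),
                         key=lambda g: corr[held][g], reverse=True)
        weight += rank_biased_precision(ranking, others, p)
    return weight


# Toy scores: in dataset_a the two query genes rank each other first.
dataset_a = {"2735": {"2736": 0.9, "X": 0.1, "Y": 0.0},
             "2736": {"2735": 0.9, "X": 0.2, "Y": 0.1}}
dataset_b = {"2735": {"2736": -0.5, "X": 0.8, "Y": 0.7},
             "2736": {"2735": -0.5, "X": 0.6, "Y": 0.5}}
query = ["2735", "2736"]
print(cv_dataset_weight(query, dataset_a) > cv_dataset_weight(query, dataset_b))
```

In this toy case the dataset where the query genes are strongly coexpressed receives the larger weight, which is the behavior the weighting scheme is meant to produce.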
The equal weighting algorithm (EQUAL) is the default policy for single-gene queries. To prioritize datasets for this type of query, a master coexpressed gene ranking is first computed by combining all datasets. Prioritization is then performed by comparing each dataset's coexpressed gene ranking to the master ranking.
Another important option is (-C), the fraction of query genes (0 - 1.0) required to be present in a dataset. When this option is set and a dataset does not meet the fraction, the dataset is skipped (or assigned a weight of 0). Some users may prefer to set this to 1.0 because they wish to look strictly at datasets with perfect query gene coverage. By default, SEEK adopts the most relaxed policy, -C 0 (this means no restriction, or a minimum of two query genes present is sufficient to weight the dataset).
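The effect of the -C option on a multi-gene query can be pictured with the following Python sketch. It is our reading of the description above, not SeekMiner's code: the function name, its parameters, and the way the two-gene minimum is combined with the fraction are assumptions.

```python
def dataset_is_eligible(query, dataset_genes, min_fraction=0.0, min_present=2):
    """Decide whether a dataset covers enough query genes to be weighted.

    min_fraction plays the role of -C; min_present reflects the stated minimum
    of two query genes under the default -C 0 (multi-gene queries only).
    """
    unique_query = set(query)
    present = sum(1 for g in unique_query if g in dataset_genes)
    fraction = present / len(unique_query)
    return present >= min_present and fraction >= min_fraction


dataset_genes = {"2735", "2736", "1000"}  # toy gene set; 5727 is absent
query = ["2735", "2736", "5727"]
print(dataset_is_eligible(query, dataset_genes))
print(dataset_is_eligible(query, dataset_genes, min_fraction=1.0))
```

With the default fraction of 0 the toy dataset is kept because two query genes are present. Requiring full coverage (1.0) skips it because one query gene is missing.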
Additionally, hubbiness correction can be turned on or off using the (-m) flag. Platform-specific correction is controlled by the (-M) flag. Both are enabled by SEEK in the default case.
By default, SeekMiner runs in multi-threaded mode (-T), using 8 threads to reduce running time.
SeekMiner has many more options than can be described in this limited space. See the SeekMiner manual (./SeekMiner -h).
To start the wrapper script:
~/scripts/run_seek.py <path to sleipnir> \
    <compendium directory> \
    <query file> \
    <output directory>
Example:
cd test.seek
~/scripts/run_seek.py ~/sleipnir_build/bin . \
    queries.1 results
The output directory will be called results.single if the queries are single-gene, or results.multi if multiple genes.
(Optional): subsetting compendium for query search
SeekMiner by default will utilize all datasets in the compendium, weight each dataset, and integrate them. But you can manually override this setting to specifically subset the compendium for search of your queries.
To enable compendium subsetting, add another argument that points to the file containing the user-selected datasets in the compendium:
~/scripts/run_seek.py ~/sleipnir_build/bin . \
    queries.1 results selected_datasets.txt
Note that if this option is selected, then the dataset selection must be provided for each query in your query list.
See selected_datasets.txt for an example of the needed input file. (This file must have N lines for N queries, and the i-th line should specify the user-selected datasets for the i-th query. Each line is a space-delimited list of dataset names.)
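Since the selection file must line up with the query file, a quick consistency check can catch mismatches before a search is started. The Python sketch below is a hypothetical helper (check_selection is not part of the SEEK scripts). It assumes blank lines are not meaningful in either file and that the known dataset names come from the second column of the dataset description file.

```python
def check_selection(query_lines, selection_lines, known_datasets):
    """Validate a dataset selection against the queries and return it parsed."""
    queries = [line for line in query_lines if line.strip()]
    selections = [line.split() for line in selection_lines if line.strip()]
    if len(queries) != len(selections):
        raise ValueError(
            f"{len(queries)} queries but {len(selections)} selection lines")
    for i, names in enumerate(selections, start=1):
        unknown = [n for n in names if n not in known_datasets]
        if unknown:
            raise ValueError(f"line {i}: unknown datasets {unknown}")
    return selections


known = {"GSE13494.GPL570", "GSE17907.GPL570", "GSE45584.GPL6480"}
queries = ["2735\n", "2736\n"]
selection = ["GSE13494.GPL570 GSE17907.GPL570\n", "GSE45584.GPL6480\n"]
print(check_selection(queries, selection, known))
```

A ValueError here points to either a line-count mismatch or a dataset name that does not appear in the compendium.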
- Display search results
~/scripts/show_seek_results.py <path to sleipnir> \
    <setting directory> <query file> \
    <datasets/genes>
For the last argument, choose "datasets" to display the dataset ranking for a particular query, or "genes" to display the integrated coexpressed genes.
Example:
cd test.seek
~/scripts/show_seek_results.py ~/sleipnir_build/bin . \
    results.single/0.query datasets
~/scripts/show_seek_results.py ~/sleipnir_build/bin . \
    results.single/0.query genes