HOW TO USE SiGN-SSM

SiGN-SSM estimates SSM model parameters from time series (time course, or temporal) gene expression data. The estimated parameters can be used for simulating, predicting, and analyzing temporal patterns of gene expressions. SiGN-SSM can also handle multiple replicate data, missing time points and/or missing single observed values. It is recommended that the input time series data are measured for equally interval time points, but this is a not necessary condition. SiGN-SSM can handle time series data with irregular time intervals.

This page is an online tutorial explaining how to use SiGN-SSM. For installation and/or compilation, please go to DOWNLOAD page. For the full description of the options (arguments) are described in MANUAL page.

Input Time Series Data File Format

SiGN-SSM accepts a tab-separated plain text file as an input time series dataset. The first two header rows in a file define time poins and replicate IDs of columns. For specifying time points, the left-most column needs to be"@time", and for replicate IDs, the left-most column, "@replicate". The order of these two headers are exchangeable. Time points and replicate IDs need to be integer values starting from one (1). Therefore, if the dataset has different intervals, then you need to normalize such intervals so that each time value becomes an integer multiple of the minimum interval. Replicate ID header can be omitted if you have a single replicate data set. See below for example.

From the third line, each row represents the time series expression profiles for a gene (or a probe). The leftmost column represents the gene names of rows.

Observed time points are not necessary to be the same in a single replicate. In other words, each replicate can have the different number of observed time points from others.

Here is an example of the duplicate data. Assume that you have a time series dataset observed in duplicate for time points 0hr, 30min, 1hr, 2hr, but 1hr is missing for the 2nd replicate.

@replicate1111 222
@time1235 125
Gene_11.0560.1241.554 1.7321.0620.1251.734
Gene_22.1342.1442.1562.211 2.0342.0442.111
......

By default, SiGN-SSM performs mean shifting for each gene in the input time series data before running the EM algorithm. This is important and effective for the parameter estimation to be done correctly. Mean shifting can be canceled by "--shift 0" option.

SiGN-SSM can accept missing values and handles them appropriately during the estimation. To represent a missing value, use "NAN" in the input file.

^ Go to Top

How to Run

SiGN-SSM is command-line user interface (CUI) application. You need to give options (arguments) to the program in order to specify the input file and to change the various settings of the SSM estimation by typing on the terminal (command prompt) application. You need to be familiar with doing this. No GUI is provided at this moment.

By default, SiGN-SSM performs the EM algorithm 100 times with different initial values and selects the best parameter set for a 4-dimensional (modules) system model. You can change the number of times of performing the EM algorithm for a single estimation by -n option. Longer execution may possibly result in better estimation result. For example, repeating 1,000 times (-n 1000) is preferred if you have enough computational resources.

By specifying the region of dimensions (-d option), SiGN-SSM automatically performs the estimation for the particular specified dimensions. SiGN-SSM does not automatically select the best model and the best size of dimensions from the specified region. Selecting the best model is up to users. See ABOUT SSM for details. As described there, BIC is a way of selecting the best dimension and model. See later how to see BICs of the estimated parameters.

Also, as described in ABOUT SSM, SiGN-SSM can estimate multiple models (set of parameters) for the same size of dimensions. By default, SiGN-SSM estimates only one model for one size of dimensions. By specifying the number of sets (-s option), SiGN-SSM automatically repeats the network estimation for a single size of dimensions.

See MANUAL for the full description of the options.

About parallel execution support

SiGN-SSM supports various parallel execution environments. For the PC cluster system, the above sets can be computed in parallel by bulk (array) jobs via job dispatch system. The special script for Sun (Oracle) grid engine is prepared and SiGN-SSM provides an easy solution to do this. (Combination of -i option and --sge option.)

For massively parallel computers, SiGN-SSM supports parallel execution using MPI (Message Passing Interface). SiGN-SSM is confirmed to run scalably with up to 256 CPU cores at this moment (Fig. 3). With MPI, SiGN-SSM can perform the every single EM algorithm execution in parallel. On the other hand, on PC clusters, only sets can be computed in parallel as described above.

For running on your PC, if your PC has multiple cores, them SiGN-SSM can run on these cores as multi-threaded application. Use --thread option to tell the number of cores to use them simultaneously. Like MPI support, SiGN-SSM can perform the every single EM algorithm execution in parallel.


Fig. 3: Scalability (parallelization efficiency) to the number of processes (cores) by MPI parallelization. In the MPI parallelization, because one process is used only for load balancing in the program, the parallelization efficiency is slightly low with less processes. We confirmed that the parallelization efficiency reached 0.980 with 256 cores. This is observed on RIKEN RICC with "--state off -d 4-8 -n 2000 -s 10" option.

On the HGC supercomputer system

On the HGC supercomputer system, you do not need to install the pre-compiled binary or compile the source code on your directory. It is already installed on the system.

To run SiGN-SSM, submit the Grid Engine (GE) array job using the special shell script signssm_sge.sh in /usr/local/bin/ or ~tamada/sign/.

$ qsub -t 1-N [other GE options] /usr/local/bin/signssm_sge.sh --sge [options] input_file

or

$ qsub -t 1-N [other GE options] ~tamada/sign/signssm_sge.sh --dir ~tamada/sign --sge [options] input_file

where N needs to be (# of dimensions) x (# of sets). You can copy and modify this shell script for your job submission if you require.

On your PC or other supercomputers

For the single process and multi-thread execution, simple use signssm command with options.

$ signssm [options] input_file

By --thread option you can specify the number of cores (threads) to use.

$ signssm --threads N [other options] input_file

For MPI execution, follow the instruction of your MPI implementation to run MPI application. Typically, mpirun command is used to run MPI application and -np option specifies the number of processes. On a general supercomputer system, you may need to prepare a shell script to run the program and then need to submit it the job dispatch system.

$ mpirun -np N $INSTALLPATH/signssm [options] input_file

where $INSTALLPATH represents the place of the binary.

To execute as an array (bulk) job on your supercomputer system, use -i to tell the index (ID) of the array (bulk) job task and --sge option to run the appropriate job (task) based on the task index (ID). With --sge option, a single execution of SiGN-SSM generates only one model and quits. Dimensions and set ID are determined by the value of -i option.

^ Go to Top

Output Files

By default, the single estimated result of model parameters (set of matrices and vectors) are stored in a single file. The file name would be "prefix.D000.S000.A.dat" where prefix is specified by -o option, D000 is the size of dimensions, and S000 represents the set ID (index) of the estimated models. By default "result" is used as the prefix. The prefix can contain the directory path. Therefore, users can specify the output directory as well as the output file name prefix. If you estimate more than one set and/or size of dimensions, each result of trail/dimension is stored in a single file with the appropriate set ID in its file name.

The output summary file is stored in "prefix.D000.S000.B.dat" containing the size of dimensions, the set ID, the log-likelihood, the BIC, the number of loops until converged, whether or not the likelihood increased monotone during the EM algorithm, and whether or not the algorithm converged within the loop limitation. These values are separated by a tab and output in a single line. By using the cat command for multiple results, it is easy to compare them. Especially, the BIC column is useful when you choose the best model among results with different sizes of dimensions.

In addition to the above two files, 7 different matrices per replicate are output in "prefix.D000.S000.K.dat". The first six matrices are N time point collections of one-step-ahead prediction, filtering, smoothing vectors of the estimated state variables and observation variables, where N represents the total number of time points. Each row corresponds to a vector for a time point. The last matrix is the mean shifted (by default) input data. If the input data set is multiple replicated data, these matrices are calculated for all the replicate. Thus, if the dataset is measured in triplicate for 7 time points, 21 (= 7 x 3) matrices are output in the file. See ABOUT SSM for details about the first 6 matrices. This file is mainly intended for plotting the result by Gnuplot. The shell script signssm_plot.sh can be used for easily processing this output file. This shell script generates a PDF file consisting of the plot of one-ahead prediction of gene expressions for all the genes. The usage of signssm_plot.sh is

$ signssm_plot.sh #reps state_file input_file output.pdf

where #reps represents the number of replicates, state_file the output file described above (*.K.dat), and input_file the input time series data file used for the estimation. By looking at the generated PDF file, you can check the estimation result. If you put a letter "f" ("s") as the last argument for the script, it generates filtering (smoothing) plot instead of one-step-ahead prediction plot.

If you want to use the estimated parameters, and calculated state and observation variables in different software, SiGN-SSM can output each matrix/vector in a single file. This can be specified by "--each on" option. These single files can be easily read by, for example, R (free statistical computing software). The output files are "prefix.D000.S000.*.dat" where "*" corresponds to the following matrices/vectors:

In addition, "prefix.Y.r.dat" is output and contains the mean shifted (by default) input data of the r -th replicate.

If the --pvalues option is set to on (default), two additional files are generated: "prefix.D000.S000.P.dat" and "prefix.D000.S000.m.dat" files. The former file containts a (p, n)-matrix whose (i, j) element corresponds the p value of the one-step ahead prediction for the i-th gene at the j-th observed time point, where p is the number of genes and n the total number of observed time points (= data columns of the input file. Not the total time points in a replicate). Thus, the p values in the matrix corresponds to the input data value at the same position of the matrix in the input data file exclucing headers and gene names. The latter file is a single column, (p)-vector whose i-th element (row) corresponds to the integrated p value of the above p values for all the observed time points for the i-th gene, calculated by the statistical meta analysis. These p values are useful especially when the esitmated model parameters are applied to the different data from one used for estimating parameters. See the next section for details.

^ Go to Top

Applying to Different Data

The estimated model parameters can be applied to different time series data. By doing this, for example, you can evaluate whether the model can predict another dataset. Only the condition to do this is that the number of genes in the input data set file is identical to the original one used to estimate the mode parameters. To perform this, use --ssm ssm_file option where ssm_file is the output file (*.A.dat file) containing the estimated SSM model parameters, and specify the time series data to be applied as the input file. Basically, this is done very quickly. So you do not need supercomputers.

$ signssm --ssm ssm_file -o output_prefix new_input_file

The calculated p values described above represents the statistical significance of genes whose expressions are differentially regulated in the data based on the given model. For example, assume that you have two types of data: one for the case observed with some stimulus (e.g. dosing a drug) and the other for the control observed without the stimulus. After estimating parameters with the control data, apply it to the case data. By doing this, you can extract differentially expressed (regulated) genes to the control based on the estimated dynamical model. See Yamaguchi et al. (2008) [3] in REFERENCE for details of this type of analysis.

^ Go to Top

Permutation Test

By performing permutation test, elements of gene-to-gene coefficient matrix Ψ = D ' L F D  can be tested to determine if particular gene-to-gene relationships exist. This means that, by permutation test, the gene network structure can be determined by statistical test.

Permutation test iterates the SSM estimation many times and compares the estimated results with the target model to be tested. Therefore, permutation test requires relatively heavy computation, and you need to use high performance computers such as supercomputers or PC cluster systems. The number of iteration determines the minimum p values of the particular gene-to-gene relationships. Therefore, we recommend to iterate the network estimation more than 1,000 times (at least 100 times). To specify to perform permutation test, use --perm ssm_file where ssm_file represents the file containing the estimated model to be tested (*.A.dat file).

Currently SiGN-SSM does not support the parallel execution using multi-threads (multi-cores) and/or MPI. You need to run multiple processes, or, in other words, execute the program many times, by yourself to iterate the required execution of permutation test. If you use the HGC supercomputer system or some PC cluster system that has Orcle/Univa Grid Engine, it helps such a multiple execution as an array job. You can specify the number of iterations as the number of tasks of the array job of the Grid Engine. In summary, execute the following command on the HGC supercomputer system:

$ qsub -t 1-N [other GE options] /usr/local/bin/signssm_sge.sh --perm ssm_file -o path/prefix [other options] input_file

or

$ qsub -t 1-N [other GE options] ~tamada/sign/signssm_sge.sh --dir ~tamada/sign --perm ssm_file -o path/prefix [other options] input_file

where N represents the number of iterations.

If you perform permutation test by yourself not on the HGC system, do not forget to give the -i option. It determines the random number generation and the output file names. Repeat the execution N times with the different increasing number to the -i option for N time interations of the permutation test.

By the above execution, in total N files will be generated. Each file contains the result of single permutation test. To generate a list of statistically significant gene-to-gene relationships from these output files, use the --ssmperm option.

$ signssm --ssmperm prefix=dir/prefix,ed=N,ssm=ssm_file,th=threshold -o output_file input_file

Note: From rel 1.1.0, you do not need signproc for this process.

A tab separated file output_file is created. Each row in the file represents a statistically significant edge (gene-to-gene relationship). Each column represents parent gene name, its child, p value of the permutation test, and gene-to-gene coefficient. Only edges whose p values are less than or equal to threshold are stored in the file.

To output in CSML format, use SiGN-Proc software with --csml.

$ signproc --ssmperm prefix=dir/prefix,ed=N,ssm=ssm_file,th=threshold,input=input_file --csml output_file

CSML format files can be opened in Cell Illustrator Online 5.0. In CSML files, some additional information is stored compared to plain text output files. These two options can be used together. If so, two files are created separately.

If the number of edges in the processed final network is very large, then the number of iterations may be insufficient. By giving -s X option with --perm option for signssm, you can specify to repeat the execution X times within a single execution of the program (single task of the array job of GE). Therefore, the final number of iterations of the permutation test becomes X x N times. This can suppress the number of generated files. Each generated (output) file will contain X results of the permutation test. For example,

$ qsub -t 1-1000 [other GE options] /usr/local/bin/signssm_sge.sh --perm ssm_file -s 100 -o path/prefix [other options] input_file

performs 1,000 x 100 = 100,000 iterations of the permutation test.

Also consider to use the smaller significant level such as 0.001 or less using the th= argument.

If the number of iterations is very large, the compilation process often takes very long time. If you use the HGC supercomputer system, please do not run the compilation process in the log-in (gate way) node, and consider to do it as a Grid Engine job. See below for the example.

^ Go to Top

Example with a sample data

In this section, let's try to estimate a gene network from the sample short time series data. The file can be downloaded from DOWNLOAD. The sample file consists of expression profiles of 100 genes measured for 7 time points at 1, 3, 6, 12, 18, 24, and 48 hours after stimulating cells.

At first let's try 4 different sizes of dimensions with few times iterations of the EM algorithm so that it finishes quickly.

$ signssm -d 4-12 -n 10 -o result sample.tsv

If your computer is equipped with multi-cores, try --threads option to run faster.

$ signssm -d 4-12 -n 10 -o result --threads 8 sample.tsv

In the HGC supercomputer system, you need to run with Grid Engine (job dispatch system).

$ qsub -t 1-9 -o so -e se ~tamada/sign/signssm_sge.sh --dir ~tamada/sign --sge -d 4-12 -n 10 -o result sample.tsv

After the estimation, make sure that the files are properly generated. Let's compare the results by BICs.

$ cat result.*.B.dat
004	001	1830.443713	-2096.002893	2751	Monotone	Converged
005	001	2040.100298	-2192.596684	4246	Monotone	Converged
006	001	2204.849962	-2196.332112	10543	Monotone	Converged
007	001	2373.999855	-2205.823475	22857	Monotone	Converged
008	001	2545.571095	-2217.113009	6379	Monotone	Converged
009	001	2531.265104	-1853.603559	11318	Monotone	Converged
010	001	2810.734627	-2074.600615	5228	Monotone	Converged
011	001	3035.270766	-2182.686379	1022	Monotone	Converged
012	001	3177.805567	-2123.724946	1187	Monotone	Converged

The fourth column represents BICs. A model with smaller BIC means the better estimation of the parameters. Thus, in this example, the 8 dimension model is the best model within the estimated models.

Next, let's look at the plot of one-step-ahead prediction by the selected model.

$ signssm_plot.sh 3 result.D008.S001.K.dat sample.tsv output.pdf

A PDF file output.pdf is generated. The PDF consists of 100 pages. Each page corresponds to a plot of each gene. (Note that signssm_plot.sh does not work properly in the HGC supercomputer system Shirokane 2.) If it looks good, then let's try permutation test on the HGC supercomputer system.

$ mkdir perm

$ qsub -t 1-1000 -o so -e se ~tamada/sign/signssm_sge.sh --dir ~tamada/sign --perm result.D008.S001.A.dat -o perm/result sample.tsv

After finishing the job, compile the generated files into a single network file.

$ signproc --ssmperm prefix=perm/result,ssm=result.D008.S001.A.dat,ed=1000,th=0.05,input=sample.tsv --ssmtxt final.txt

You can also use signssm to do this (rel 1.10.0 or later):

$ signssm --ssmperm prefix=perm/result,ssm=result.D008.S001.A.dat,ed=1000,th=0.05 -o final.txt sample.tsv

A file final.txt is a plain text file containing the statistically significant gene-to-gene relationships.

If the compilation of the generated files takes a very long time (e.g. more than an hour), then please execute it as a GE job. This is very important for other users sharing the gateway nodes of the supercomputer system. To do so, execute the qsub command with -b y -o so -e se -cwd option.

$ qsub -b y -o so -e se -cwd signproc --ssmperm prefix=perm/result,ssm=result.D008.S001.A.dat,ed=1000,th=0.05,input=sample.tsv --ssmtxt final.txt

The messages from signproc command will be output in the file "se".

^ Go to Top