SiGN-SSM estimates SSM model parameters from time series (time course, or temporal) gene expression data. The estimated parameters can be used for simulating, predicting, and analyzing temporal patterns of gene expressions. SiGN-SSM can also handle multiple replicate data, missing time points and/or missing single observed values. It is recommended that the input time series data are measured for equally interval time points, but this is a not necessary condition. SiGN-SSM can handle time series data with irregular time intervals.

This page is an online tutorial explaining how to use SiGN-SSM. For installation and/or compilation, please go to DOWNLOAD page. For the full description of the options (arguments) are described in MANUAL page.

Links inside this page:

Input Time Series Data File Format

How to Run

Output Files

Applying to Different Data

Permutation Test

Example with a sample data

SiGN-SSM accepts a tab-separated plain text file as an input time series dataset.
The first two header rows in a file define time poins and replicate IDs of columns.
For specifying time points, the left-most column needs to be"`@time`",
and for replicate IDs, the left-most column, "`@replicate`".
The order of these two headers are exchangeable.
Time points and replicate IDs need to be integer values starting from one (1).
Therefore, if the dataset has different intervals, then you need to normalize
such intervals so that each time value becomes an integer multiple of the minimum
interval. Replicate ID header can be omitted if you have a single replicate data set.
See below for example.

From the third line, each row represents the time series expression profiles for a gene (or a probe). The leftmost column represents the gene names of rows.

Observed time points are not necessary to be the same in a single replicate. In other words, each replicate can have the different number of observed time points from others.

Here is an example of the duplicate data. Assume that you have a time series dataset observed in duplicate for time points 0hr, 30min, 1hr, 2hr, but 1hr is missing for the 2nd replicate.

@replicate | 1 | 1 | 1 | 1 | 2 | 2 | 2 |

@time | 1 | 2 | 3 | 5 | 1 | 2 | 5 |

Gene_1 | 1.056 | 0.124 | 1.554 | 1.732 | 1.062 | 0.125 | 1.734 |

Gene_2 | 2.134 | 2.144 | 2.156 | 2.211 | 2.034 | 2.044 | 2.111 |

... | ... |

By default, SiGN-SSM performs *mean shifting* for each gene in the input time
series data before running the EM algorithm. This is important and effective for the
parameter estimation to be done correctly. Mean shifting can be canceled by
"`--shift 0`" option.

SiGN-SSM can accept missing values and handles them appropriately
during the estimation. To represent a missing value, use
"`NAN`" in the input file.

SiGN-SSM is command-line user interface (CUI) application. You need to give options (arguments) to the program in order to specify the input file and to change the various settings of the SSM estimation by typing on the terminal (command prompt) application. You need to be familiar with doing this. No GUI is provided at this moment.

By default, SiGN-SSM performs the EM algorithm 100 times with different
initial values and selects the best parameter set for a 4-dimensional (modules)
system model.
You can change the number of times of performing the EM algorithm for a single
estimation by `-n` option. Longer execution may possibly result in
better estimation result. For example, repeating 1,000 times (`-n 1000`)
is preferred if you have enough computational resources.

By specifying the region of dimensions (`-d` option), SiGN-SSM
automatically performs the estimation for the particular specified dimensions.
SiGN-SSM does not automatically select the best model and the best size of
dimensions from the specified region. Selecting the best model is up to users.
See ABOUT SSM for details. As described there, BIC is a
way of selecting the best dimension and model. See later how to see BICs of the
estimated parameters.

Also, as described in ABOUT SSM, SiGN-SSM can estimate
multiple models (set of parameters) for the same size of dimensions.
By default, SiGN-SSM
estimates only one model for one size of dimensions. By specifying the number
of *sets* (`-s` option), SiGN-SSM automatically repeats the network
estimation for a single size of dimensions.

See MANUAL for the full description of the options.

SiGN-SSM supports various parallel execution environments.
For the PC cluster system, the above sets can be
computed in parallel by bulk (array) jobs via job dispatch system. The special
script for Sun (Oracle) grid engine is prepared and SiGN-SSM provides an easy
solution to do this. (Combination of `-i` option and `--sge`
option.)

For massively parallel computers, SiGN-SSM supports parallel execution using MPI (Message Passing Interface). SiGN-SSM is confirmed to run scalably with up to 256 CPU cores at this moment (Fig. 3). With MPI, SiGN-SSM can perform the every single EM algorithm execution in parallel. On the other hand, on PC clusters, only sets can be computed in parallel as described above.

For running on your PC, if your PC has multiple cores, them SiGN-SSM can
run on these cores as multi-threaded application. Use `--thread` option
to tell the number of cores to use them simultaneously.
Like MPI support, SiGN-SSM can perform the
every single EM algorithm execution in parallel.

Fig. 3: Scalability (parallelization efficiency) to the number of processes (cores) by MPI parallelization.
In the MPI parallelization, because one process is used only for load balancing in the program,
the parallelization efficiency is slightly low with less processes.
We confirmed that the parallelization efficiency reached 0.980 with 256 cores.
This is observed on RIKEN RICC with "`--state off -d 4-8 -n 2000 -s 10`" option.

On the HGC supercomputer system, you do not need to install the pre-compiled binary or compile the source code on your directory. It is already installed on the system.

To run SiGN-SSM, submit a Grid Engine (GE) array job using the special shell script
`signssm.sh` in `~tamada/sign/`.

$ qsub -t 1-*N* [other *GE options*] ~tamada/sign/signssm.sh --sge [*options*] *input_file*

where ` N` needs to be (# of dimensions) x (# of sets).
You can copy and modify this shell script for your job submission if you require.

For the single process and multi-thread execution,
simple use `signssm` command with options.

$ signssm [*options*] *input_file*

By `--thread` option you can specify the number of cores
(threads) to use.

$ signssm --threads *N* [*other options*] *input_file*

For MPI execution, follow the instruction of your MPI implementation to
run MPI application. Typically, `mpirun` command is used to run MPI application
and `-np` option specifies the number of processes.
On a general supercomputer system, you may need to prepare a shell script
to run the program and then need to submit it the job dispatch system.

$ mpirun -np *N* *$INSTALLPATH*/signssm [*options*] *input_file*

where ` $INSTALLPATH` represents the place of the binary.

To execute as an array (bulk) job on your supercomputer system,
use `-i` to tell the index (ID) of the array (bulk) job task and
`--sge` option to run the appropriate job (task) based on the task index (ID).
With `--sge` option, a single execution of SiGN-SSM generates only one model
and quits. Dimensions and set ID are determined by the value of `-i` option.

By default, the single estimated result of model parameters (set of matrices and
vectors) are stored in a single file. The file name would be
"` prefix.D000.S000.A.dat`" where

The output summary file is stored in
"` prefix.D000.S000.B.dat`"
containing the size of dimensions, the set ID, the log-likelihood,
the BIC, the number of loops until converged, whether or not the likelihood
increased monotone during the EM algorithm, and whether or not the algorithm
converged within the loop limitation. These values are separated by a tab and
output in a single line. By using the

In addition to the above two files, 7 different matrices per replicate
are output in
"` prefix.D000.S000.K.dat`".
The first six matrices are N time point
collections of one-step-ahead prediction, filtering, smoothing vectors of the
estimated state variables and observation variables, where N
represents the total number of time points. Each row corresponds to
a vector for a time point. The last matrix is the mean shifted (by default)
input data. If the input data set is multiple replicated data, these matrices
are calculated for all the replicate. Thus, if the dataset is measured in
triplicate for 7 time points, 21 (= 7 x 3) matrices are output in the file.
See ABOUT SSM for details about the first 6 matrices.
This file is mainly intended for plotting the result by Gnuplot. The shell script

$ signssm_plot.sh *#reps* *state_file* *input_file* *output.pdf*

where ` #reps` represents the number of replicates,

If you want to use the estimated parameters, and calculated state and observation
variables in different software,
SiGN-SSM can output each matrix/vector in a single file.
This can be specified by "`--each on`" option.
These single files can be easily read by, for example, R (free statistical
computing software). The output files are
"` prefix.D000.S000.*.dat`" where
"

`H`: observation matrix H.`R`: observation noise vector (diagonal elements of) R.`F`: system transition matrix F.`D`: gene-to-module projection matrix D.`L`: diagonal elements of*L*=*H*'*R*^{−1}*H*.`x`: initial state variable*x*_{0}.`Xp.`: one-ahead-prediction of the state variables for the*r*-th replicate.*r*`Xf.`: filtering of the state variables.*r*`Xs.`: smoothing of the state variables.*r*`Yp.`: one-ahead prediction of the observation variables.*r*`Yf.`: filtering of the observation variables.*r*`Ys.`: smoothing of the observation variables.*r*

If the `--pvalues ` option is set to `on` (default),
two additional files are generated:
"` prefix.D000.S000.P.dat`" and
"

The estimated model parameters can be applied to different time series data.
By doing this, for example, you can evaluate whether the model can predict
another dataset. Only the condition to do this is that the number of genes
in the input data set file is identical to the original one used to estimate
the mode parameters. To perform this, use `--ssm ssm_file` option
where

$ signssm --ssm *ssm_file* -o *output_prefix* *new_input_file*

The calculated p values described above represents the statistical significance of genes whose expressions are differentially regulated in the data based on the given model. For example, assume that you have two types of data: one for the case observed with some stimulus (e.g. dosing a drug) and the other for the control observed without the stimulus. After estimating parameters with the control data, apply it to the case data. By doing this, you can extract differentially expressed (regulated) genes to the control based on the estimated dynamical model. See Yamaguchi et al. (2008) [3] in REFERENCE for details of this type of analysis.

By performing permutation test, elements of gene-to-gene coefficient
matrix *Ψ* =
*D ' L F D*
can be tested to determine if particular gene-to-gene relationships exist.
This means that, by permutation test, the gene network structure can be
determined by statistical test.

Permutation test iterates the SSM estimation many times and compares the
estimated results with the target model to be tested.
Therefore, permutation test requires relatively heavy computation, and
you need to use high performance computers such as supercomputers
or PC cluster systems.
The number of iteration determines the minimum p values of the particular
gene-to-gene relationships. Therefore, we recommend to iterate the network
estimation more than 1,000 times (at least 100 times).
To specify to perform permutation test, use `--perm ssm_file`
where

Currently SiGN-SSM does not support the parallel execution using multi-threads (multi-cores) and/or MPI. You need to run multiple processes, or, in other words, execute the program many times, by yourself to iterate the required execution of permutation test. If you use the HGC supercomputer system or some PC cluster system that has Orcle/Univa Grid Engine, it helps such a multiple execution as an array job. You can specify the number of iterations as the number of tasks of the array job of the Grid Engine. In summary, execute the following command on the HGC supercomputer system:

$ qsub -t 1-*N* [*other GE options*] ~tamada/sign/signssm.sh
--perm *ssm_file* -o *path/prefix* [*other options*] *input_file*

where ` N` represents the number of iterations.

If you perform permutation test by yourself not on the HGC system,
do not forget to give the `-i` option.
It determines the random number generation and the output file names.
Repeat the execution *N* times with the different increasing number to
the `-i` option for *N* time interations of the permutation test.

By the above execution, in total ` N` files will be generated.
Each file contains the result of single permutation test.
To generate a list of statistically significant gene-to-gene relationships from
these output files, use the

$ ~tamada/sign/signssm --ssmperm prefix=*dir/prefix*,ed=*N*,ssm=*ssm_file*,th=*threshold*
-o *output_file* *input_file*

Note: From rel 1.1.0, you do not need `signproc` for this process.

A tab separated file *output_file* is created. Each row in the file represents
a statistically significant edge (gene-to-gene relationship). Each column
represents parent gene name, its child, p value of the permutation test,
and gene-to-gene coefficient.
Only edges whose p values are less than or equal to ` threshold`
are stored in the file.

To output in CSML format, use SiGN-Proc software with `--csml`.

$ ~/tamada/sign/signproc --ssmperm prefix=*dir/prefix*,ed=*N*,ssm=*ssm_file*,th=*threshold*,input=*input_file* --csml *output_file*

CSML format files can be opened in Cell Illustrator Online 5.0. In CSML files, some additional information is stored compared to plain text output files. These two options can be used together. If so, two files are created separately.

If the number of edges in the processed final network is very large, then
the number of iterations may be insufficient. By giving `-s` ` X`
option with

$ qsub -t 1-**1000** [*other GE options*] ~tamada/sign/signssm.sh
--perm *ssm_file* **-s 100** -o *path/prefix* [*other options*] *input_file*

performs 1,000 x 100 = 100,000 iterations of the permutation test.

Also consider to use the smaller significant level such as 0.001 or less using the `th=`
argument.

If the number of iterations is very large, the compilation process often takes very long time. If you use the HGC supercomputer system, please do not run the compilation process in the log-in (gate way) node, and consider to do it as a Grid Engine job. See below for the example.

In this section, let's try to estimate a gene network from the sample short time series data. The file can be downloaded from DOWNLOAD. The sample file consists of expression profiles of 100 genes measured for 7 time points at 1, 3, 6, 12, 18, 24, and 48 hours after stimulating cells.

At first let's try 4 different sizes of dimensions with few times iterations of the EM algorithm so that it finishes quickly.

$ signssm -d 4-12 -n 10 -o result sample.tsv

If your computer is equipped with multi-cores, try `--threads` option to
run faster.

$ signssm -d 4-12 -n 10 -o result --threads 8 sample.tsv

In the HGC supercomputer system, you need to run with Grid Engine (job dispatch system).

$ qsub -t 1-9 -o so -e se ~tamada/sign/signssm.sh --sge -d 4-12 -n 10 -o result sample.tsv

After the estimation, make sure that the files are properly generated. Let's compare the results by BICs.

$ cat result.*.B.dat

004 001 1830.443713 -2096.002893 2751 Monotone Converged 005 001 2040.100298 -2192.596684 4246 Monotone Converged 006 001 2204.849962 -2196.332112 10543 Monotone Converged 007 001 2373.999855 -2205.823475 22857 Monotone Converged 008 001 2545.571095 -2217.113009 6379 Monotone Converged 009 001 2531.265104 -1853.603559 11318 Monotone Converged 010 001 2810.734627 -2074.600615 5228 Monotone Converged 011 001 3035.270766 -2182.686379 1022 Monotone Converged 012 001 3177.805567 -2123.724946 1187 Monotone Converged

The fourth column represents BICs.
A model with smaller BIC means the *better* estimation of the parameters.
Thus, in this example, the 8 dimension model is the best model within the estimated
models.

Next, let's look at the plot of one-step-ahead prediction by the selected model.

$ ~tamada/sign/signssm_plot.sh 3 result.D008.S001.K.dat sample.tsv output.pdf

A PDF file `output.pdf` is generated. The PDF consists of 100 pages.
Each page corresponds to a plot of each gene.
If it looks good, then let's try permutation test on the HGC supercomputer system.

$ mkdir perm

$ qsub -t 1-1000 -o so -e se ~tamada/sign/signssm.sh --perm result.D008.S001.A.dat -o perm/result sample.tsv

After finishing the job, compile the generated files into a single network file.

$ ~tamada/sign/signproc --ssmperm prefix=perm/result,ssm=result.D008.S001.A.dat,ed=1000,th=0.05,input=sample.tsv --ssmtxt final.txt

You can also use `signssm` to do this (rel 1.10.0 or later):

$ ~tamada/sign/signssm --ssmperm prefix=perm/result,ssm=result.D008.S001.A.dat,ed=1000,th=0.05 -o final.txt sample.tsv

A file `final.txt` is a tab separated plain text file containing the statistically significant gene-to-gene relationships. The file can be analyzed by your own program or can be read using Cytoscape etc.

If the compilation of the generated files takes a very long time (e.g. more than an hour),
then please execute it as a GE job.
This is very important for other users sharing the gateway nodes of the supercomputer system.
To do so, execute the `qsub` command with
`-b y -o so -e se -cwd` option.

$ qsub -b y -o so -e se -cwd ~tamada/sign/signproc --ssmperm prefix=perm/result,ssm=result.D008.S001.A.dat,ed=1000,th=0.05,input=sample.tsv --ssmtxt final.txt

The messages from `signproc` command will be output in the file "`se`".