SiGN-SSM MANUAL

Name

signssm -- SiGN-SSM: Gene Network Estimation with State Space Model.

Synopsis

Single process-single/multi thread

signssm [ options ] input_file

Parallel execution via MPI

mpirun [ MPI options ] INSTALLPATH/signssm [ options ] input_file

Parallel execution via Grid Engine in HGC Shirokane1/2

qsub -t 1-N [ GE options ] INSTALLPATH/signssm.sh [ options ] input_file

Output files

After the estimation finishes, SiGN-SSM produces three output files per model.

where prefix is the file name prefix specified by -o option (see below), D000 is the size of dimensions, and S000 represents the set index (ID) of the estimated models.

The first file ("*.A.dat") contains the estimated SSM model parameters. The file consists of 9 matrices, each starting with a header line followed by a tab-separated matrix. A header line consists of 3 tab-separated columns: the matrix (parameter) name, the number of rows, and the number of columns. A vector is represented as a single-column matrix. The matrices are separated by an empty line. The matrix names are as follows:

This file can be passed to the --ssm option.
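The block layout described above (a three-column header line, then a tab-separated matrix, with blocks separated by empty lines) can be read with a few lines of Python. This is a minimal sketch, not part of SiGN-SSM: the function name parse_matrices and the sample values are hypothetical, and it assumes every matrix cell is a plain decimal number.

```python
def parse_matrices(text):
    """Parse 'name<TAB>rows<TAB>cols' headers, each followed by a matrix.

    Returns a dict mapping the matrix name to a list of rows (lists of floats).
    Assumption: cells are plain numbers and blocks are separated by empty lines.
    """
    matrices = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if not lines[i].strip():          # skip the empty separator lines
            i += 1
            continue
        name, n_rows, n_cols = lines[i].strip().split("\t")
        n_rows, n_cols = int(n_rows), int(n_cols)
        rows = []
        for k in range(n_rows):
            cells = lines[i + 1 + k].strip().split("\t")
            if len(cells) != n_cols:
                raise ValueError(f"matrix {name}: row {k} has {len(cells)} columns")
            rows.append([float(c) for c in cells])
        matrices[name] = rows
        i += 1 + n_rows                   # jump past the header and the matrix
    return matrices


# Illustrative input only; real files contain 9 matrices.
sample = "F\t2\t2\n0.5\t0.1\n0.0\t0.9\n\nx\t2\t1\n1.0\n-1.0\n"
params = parse_matrices(sample)
print(params["F"])   # the 2 x 2 matrix
print(params["x"])   # a vector stored as a single-column matrix
```

Keeping vectors as single-column matrices mirrors the file format, so the same code path handles both cases.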

The second file ("*.B.dat") consists of one tab-separated line. Its columns give, in order, the size of dimensions, the set ID, the log-likelihood, the BIC, the number of loops, whether or not the likelihood decreased monotonically during the EM algorithm, and whether or not the algorithm converged within the loop limit.
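Because each "*.B.dat" file holds a single summary line, comparing several estimated models amounts to parsing those lines and ranking them. The sketch below is illustrative only: the names Summary, parse_summary, and best_by_bic are hypothetical, the two flag columns are kept as raw strings because their encoding is not described here, and it assumes the conventional BIC definition in which a smaller value is better.

```python
from collections import namedtuple

# Field order follows the column order described for the "*.B.dat" file.
Summary = namedtuple(
    "Summary", "dimension set_id loglik bic loops monotone converged"
)


def parse_summary(line):
    """Parse the single tab-separated line of a "*.B.dat" file."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 7:
        raise ValueError(f"expected 7 columns, got {len(cols)}")
    return Summary(
        dimension=int(cols[0]),
        set_id=int(cols[1]),
        loglik=float(cols[2]),
        bic=float(cols[3]),
        loops=int(cols[4]),
        monotone=cols[5],     # kept as text: flag encoding is an unknown here
        converged=cols[6],    # kept as text: flag encoding is an unknown here
    )


def best_by_bic(summaries):
    """Return the summary with the smallest BIC (assumed: smaller is better)."""
    return min(summaries, key=lambda s: s.bic)


# Made-up example lines, not real program output.
lines = [
    "4\t1\t-1200.5\t2600.0\t350\t1\t1",
    "5\t1\t-1150.0\t2550.5\t410\t1\t1",
]
best = best_by_bic([parse_summary(l) for l in lines])
print(best.dimension, best.bic)
```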

The third file ("*.K.dat") contains the state variables and observation variables calculated from the estimated model parameters. This file is intended to be read by Gnuplot. The file contains 7 matrices. The order and meaning of the matrices are explained in Output Files in HOW TO USE.

In addition to these files, if "--pvalues on" is specified (default), the following two files are generated.

The former file contains a matrix of the p values. Each p value corresponds to the statistical significance (the result of the statistical test) of the value in the input data at the same position (excluding header rows and gene names). The latter file contains the integrated p values for genes obtained by statistical meta-analysis. The calculation and generation of these files can be suppressed with the "--pvalues off" option.


Options

Basic options

(alphabetical order)

-d X [ -Y [:Z ]], ...

Region of dimensions. X, Y, and Z are integer values. (default: 4)

If you specify only X, the program estimates an SSM for the single specified dimension. Specifying X-Y estimates multiple SSMs for the dimensions ranging from X to Y. If you specify Z, it is the increment step from X to Y; if you omit it, Z=1 is assumed. For example, "-d 4-6" makes the program estimate SSMs for dimensions 4, 5, and 6. Multiple regions can be specified by concatenating them with commas. For example, "-d 4,8-10" specifies dimensions 4, 8, 9, and 10. The program does not select the best dimension automatically. Users may select one model by comparing the BIC (Bayesian Information Criterion) of the estimated models.
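The region syntax can be checked before submitting a long run by expanding it into the explicit list of dimensions. The following is a minimal sketch of that expansion, not the program's own parser: the function name expand_dimensions is hypothetical, and it assumes that Y is included when the step Z lands on it exactly.

```python
def expand_dimensions(spec):
    """Expand a -d style region such as "4,8-10" or "2-10:4" into a list."""
    dims = []
    for part in spec.split(","):
        part = part.strip()
        step = 1
        if ":" in part:                       # optional ":Z" increment
            part, step_text = part.split(":", 1)
            step = int(step_text)
        if "-" in part:                       # "X-Y" range
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:                                 # single dimension "X"
            start = end = int(part)
        if step < 1 or end < start:
            raise ValueError(f"invalid region: {part!r}")
        dims.extend(range(start, end + 1, step))
    return dims


print(expand_dimensions("4-6"))      # [4, 5, 6]
print(expand_dimensions("4,8-10"))   # [4, 8, 9, 10]
print(expand_dimensions("2-10:4"))   # [2, 6, 10]
```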

-h

Show the help message and quit.

-i N

Integer ID to distinguish concurrent executions. (default: N=1)

This value is used to initialize a random number generator. Therefore, the program produces the same result if you specify the same ID and the same random seed (see the -r option). Use this option to avoid producing the same results when you run in parallel on a job dispatch/queueing system such as Grid Engine. Do not specify this if you run the program with signssm.sh on Grid Engine.

-L { 0 | 1 | 2 }

Log file mode (default: 0).

  • 0: Automatic mode. When running as a single process (multi-thread) program, log messages are written to the standard error. When running with MPI, only the root process generates a log file named "prefix.log", where prefix is specified by the -o option. If the --sge or --perm option is specified, each process produces a log file named "prefix.log.XXXXXX", where XXXXXX is a 6-digit number that represents the task ID.
  • 1: Force each process to output a log file named "prefix.log.XXXXXX".
  • 2: Redirect all messages to the standard error. No log files are produced.
  • Other: Suppress all log messages.

--perm ssm_file

Permutation test mode. (default: not specified)

This mode reads the SSM model parameters from ssm_file and then performs a single execution of the permutation test. To perform a permutation test, you need to run this mode many times and compile the results into a single file using the signproc program. Each run outputs a single file "prefix.XXXXXX", where XXXXXX is a six-digit ID number specified by the -i option. If this is specified, the -d option is ignored. If the -s N option is specified, SiGN-SSM performs N tests and writes the N test results to one file. This is useful to reduce the number of output files when performing the test with many iterations.
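Since a permutation test is spread over many runs, it can help to confirm that every expected "prefix.XXXXXX" file exists before compiling the results. The sketch below is a hypothetical helper, not part of SiGN-SSM. It assumes the six-digit ID is zero-padded and that no further extension is appended to the file name.

```python
import os


def permutation_result_paths(prefix, first=1, last=1000):
    """List the expected result files for IDs first..last (inclusive)."""
    # Assumption: the ID is rendered as a zero-padded six-digit number.
    return [f"{prefix}.{i:06d}" for i in range(first, last + 1)]


def missing_results(prefix, first=1, last=1000):
    """Return the expected result files that are not present on disk."""
    return [
        path
        for path in permutation_result_paths(prefix, first, last)
        if not os.path.exists(path)
    ]


print(permutation_result_paths("perm", 1, 3))
# ['perm.000001', 'perm.000002', 'perm.000003']
```

An empty list from missing_results suggests that all runs in the chosen ID range produced a file; it does not check that the files are complete.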

--ssmperm key1=value1,key2=value2,...

Compilation of permutation test result mode. (default: not specified)

This mode compiles the permutation test result files generated by the permutation test mode (--perm option) into a single network and outputs it as a tab-separated text file. The output file name can be given by the -o option. You need to specify arguments in the key=value format. Available arguments are listed below:

    prefix=prefix | The prefix of the files to be processed.

    ssm=ssm_file | File containing the SSM model parameters (*.A.dat file).

    bg=N | The first index of the suffix of the processing files. (default: 1)

    ed=N | The last index of the suffix of the processing files. (default: 1000)

    th=threshold | The significance level of the p value left in the final network. (default: 0.05)
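When the command line is generated by a script, the key=value list can be assembled from the keys above. This is a minimal sketch with a hypothetical function name (ssmperm_argument) and made-up file names. It assumes that values may not contain commas or equals signs, because the format shown gives no way to escape them.

```python
def ssmperm_argument(prefix, ssm, bg=1, ed=1000, th=0.05):
    """Build the comma-separated key=value string for --ssmperm."""
    pairs = [("prefix", prefix), ("ssm", ssm), ("bg", bg), ("ed", ed), ("th", th)]
    for key, value in pairs:
        text = str(value)
        if "," in text or "=" in text:
            # Assumption: the format has no escaping, so reject such values.
            raise ValueError(f"value for {key} must not contain ',' or '='")
    return ",".join(f"{key}={value}" for key, value in pairs)


# Hypothetical file names, shown only to illustrate the resulting string.
print(ssmperm_argument("perm", "model.A.dat"))
# prefix=perm,ssm=model.A.dat,bg=1,ed=1000,th=0.05
```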

-r N

Integer random seed. (default: N=38)

This value is used to initialize the random number generator together with -i option.

-s N

Number of sets (or executions) for a single dimension. (default: N=1)

The program produces N results for each size of dimensions specified by the -d option. If --perm is specified with this option, a single job (process) performs N tests and writes the N test results to a single output file.

--shift { 0 | 1 | 2 }

Mean shift mode. (default: 1)

  • 0: Do not perform mean shift of the input data.
  • 1: Perform mean shift for each replicate in the input data before estimation.
  • 2: Perform mean shift for the entire time point in the input data before estimation.

--sge

Grid Engine mode. (default: not specified)

If this is specified, the program runs for only 1 set with the iterations given by the -n option, regardless of the -s and -d options. That is, the program estimates the i-th set, where i is the ID specified by the -i option. The total number of sets is (sets) x (dimensions), and each execution corresponds to one of these sets. This is useful when you execute via Grid Engine.
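The total of (sets) x (dimensions) determines how many array tasks are needed to cover every set. The sketch below computes that total and formats it as a 1-N range. The function name grid_task_range is hypothetical, and treating N in "qsub -t 1-N" as this total is an assumption drawn from the description above rather than a documented rule.

```python
def grid_task_range(dimensions, sets):
    """Return a "1-N" task range, where N = number of dimensions x sets."""
    if not dimensions or sets < 1:
        raise ValueError("need at least one dimension and one set")
    total = len(dimensions) * sets   # one task per (dimension, set) pair
    return f"1-{total}"


# Three dimensions (4, 5, 6) with 10 sets each gives 30 tasks.
print(grid_task_range([4, 5, 6], 10))   # 1-30
```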

--ssm ssm_file

Read the SSM file and apply it to the input data. (default: not specified)

If this is specified, the program reads the SSM model parameters from a file and does not estimate them from the input data. This is useful when you want to apply estimated model parameters to a different input data set in order to calculate the state and observation variables from the model and that data.

--threads N

Number of threads. (default: N=1) AVAILABLE ONLY FOR SINGLE PROCESS EXECUTION.

This specifies the number of threads to be used when it runs as a single process. Specify a value less than or equal to the number of CPU cores in your computer.


Output related options

(alphabetical order)

-e EXT

The suffix (extension) of the output file names. (default: "dat")

--each { on | off }

Output matrices and vectors of the result into separate files. (default: not specified)

The estimated SSM model parameters ("*.A.dat" file) are stored in files named "prefix.D000.S000.*.dat" where "*" corresponds to the following matrices/vectors:

  • H : observation matrix H.
  • R : observation noise vector (diagonal elements of) R.
  • F : system transition matrix F.
  • D : gene-to-module projection matrix D.
  • L : diagonal elements of L = H' R^-1 H.
  • x : initial state variable x0.

The state and observation variables ("*.K.dat" file) are stored in files named "prefix.D000.S000.*.dat" where "*" corresponds to the following matrices/vectors:

  • Xp.r : one-step-ahead prediction of the state variables for the r-th replicate.
  • Xf.r : filtering of the state variables.
  • Xs.r : smoothing of the state variables.
  • Yp.r : one-step-ahead prediction of the observation variables.
  • Yf.r : filtering of the observation variables.
  • Ys.r : smoothing of the observation variables.

In addition, "prefix.Y.r.dat" is output and contains the mean-shifted (by default) input data of the r-th replicate.

-o PREFIX

The prefix of the output file names. (default: "result")

--proc-file
-P

Insert the number of processes into the output file prefix. (default: not specified) AVAILABLE FOR MPI EXECUTION ONLY.

If this is specified, the string ".P0000" is appended to the prefix (-o option), where 0000 is the 4-digit number of processes (MPI size).

--pvalues { on | off }

Calculate and output the p values of the input time series data. (default: on)

--state { on | off }

Output the estimated state and observation variables. (default: on)


EM algorithm options

(alphabetical order)

--em-loop N
-l N

Maximum number of loops in a single EM algorithm execution before convergence. (default: N=40000)

-F { on | off }
--constrain-F { on | off }

Apply constraint on diagonal elements of F (system coefficient matrix). (default: on)

-g X
--constrain-Fg X

Strength of constraint on F. (default: X=0.8)

A real value ranging from 0.0 to 1.0 can be specified. A strong constraint may make it difficult to fit the model parameters to the input data. This is used only when "-F on" is specified (default).

-n N

Number of iterations (number of initial values) for a single EM algorithm execution. (default: N=100)

The program chooses the best result from N executions of the EM algorithm.

--retry N

Maximum retry count. -1 for unlimited retries. (default: N=-1)

The EM algorithm sometimes fails when the initial values are bad. If SiGN-SSM detects an estimation failure, it automatically retries the estimation with different initial values. This option specifies the maximum retry count.

--RrI { yes | no }
-R { yes | no }

Whether or not to assume that the observation noise is R = r I. (default: no)

If yes is specified, then R = r I is assumed, and if no, then R = diag(r1, ..., rp ) is assumed.

--update-mu { on | off }

Whether or not to update μ0 (= x0). (default: on)

If off is specified, the initial state variable x0 is fixed and not updated during the EM algorithm.


Initial value related options

(alphabetical order)

--F-max X

Upper bound of random values for initializing F. (default: 1.5)

--F-min X

Lower bound of random values for initializing F. (default: -1.5)

--H-mean X

Mean of normally distributed random values for initializing H. (default: 0.0)

--H-SD X

Standard deviation of normally distributed random values for initializing H. (default: 1.0)

--mu X

Mean of normally distributed random values for initializing x0. (default: 0.0)

--SD X

Standard deviation of normally distributed random values for initializing x0. (default: 1.0)
