signssm -- SiGN-SSM: Gene Network Estimation with State Space Model.
Single process-single/multi thread
Parallel execution via MPI
Parallel execution via Grid Engine in HGC Shirokane1/2
After the estimation finishes, SiGN-SSM produces three output files for each model.
The first file ("*.A.dat") contains the estimated SSM model parameters. The file consists of 9 matrices. Each matrix starts with a header line, followed by the tab-separated matrix itself. A header line consists of 3 tab-separated columns: the matrix (parameter) name, the number of rows, and the number of columns. A vector is represented as a single-column matrix. The matrices are separated by an empty line. The matrix names are as follows:
The second file ("*.B.dat") consists of one tab-separated line. Its columns give, in order, the size of the dimension, the set ID, the log-likelihood, the BIC, the number of loops, whether or not the likelihood decreased monotonically during the EM algorithm, and whether or not the algorithm converged within the loop limit.
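The layout described above (header line, matrix rows, blank separator) can be read with a few lines of code. The following is a minimal sketch, not part of SiGN-SSM; the function name read_matrices and the sample content (two small matrices named F and x0) are made up for illustration, and a real "*.A.dat" file holds 9 matrices.

```python
import io


def read_matrices(stream):
    """Read blocks of 'name<TAB>rows<TAB>cols' followed by `rows` tab-separated lines."""
    matrices = {}
    lines = iter(stream)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # empty line separating two matrices
        name, n_rows, n_cols = line.split("\t")
        n_rows, n_cols = int(n_rows), int(n_cols)
        rows = []
        for _ in range(n_rows):
            values = [float(v) for v in next(lines).rstrip("\r\n").split("\t")]
            if len(values) != n_cols:
                raise ValueError(f"{name}: expected {n_cols} columns, got {len(values)}")
            rows.append(values)
        matrices[name] = rows
    return matrices


# Made-up sample in the documented layout; a vector is a single-column matrix.
sample = "F\t2\t2\n0.5\t0.1\n-0.2\t0.9\n\nx0\t2\t1\n0.0\n1.5\n"
params = read_matrices(io.StringIO(sample))
print(params["F"])
print(params["x0"])
```

For a real file, pass an open file object instead of the in-memory sample.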
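Because each "*.B.dat" file is a single tab-separated line, the summaries of several models are easy to collect and compare. The sketch below is illustrative only: the names FitSummary and parse_summary are hypothetical, the sample values are made up, the set ID is assumed to be an integer, and the two flag columns are kept as raw strings because their encoding is not described here. Picking the smallest BIC follows the usual convention and is an assumption about how SiGN-SSM reports it.

```python
from typing import NamedTuple


class FitSummary(NamedTuple):
    dimension: int
    set_id: int
    log_likelihood: float
    bic: float
    loops: int
    monotone: str   # raw flag text; encoding not documented here
    converged: str  # raw flag text; encoding not documented here


def parse_summary(line):
    """Split one summary line into its seven documented columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    return FitSummary(int(fields[0]), int(fields[1]), float(fields[2]),
                      float(fields[3]), int(fields[4]), fields[5], fields[6])


# Made-up summary lines for three models (dimensions 4, 5, and 6).
lines = [
    "4\t1\t-120.5\t310.2\t850\t1\t1",
    "5\t1\t-110.0\t305.7\t1200\t1\t1",
    "6\t1\t-108.9\t322.4\t40000\t1\t0",
]
summaries = [parse_summary(line) for line in lines]

# Assumption: a lower BIC indicates the preferred model.
best = min(summaries, key=lambda s: s.bic)
print(best.dimension, best.bic)
```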
The third file ("*.K.dat") contains the state variables and observation variables calculated from the estimated model parameters. This file is intended to be read by Gnuplot. The file contains 7 matrices. The order and meaning of the matrices are explained in Output Files in HOW TO USE.
In addition to these files, if "--pvalues on" is specified (default), the following two files are generated.
The former file contains a matrix of p values. Each p value gives the statistical significance (the result of the statistical test) of the value at the same position in the input data (excluding header rows and gene names). The latter file contains the integrated p values for genes, obtained by statistical meta-analysis. The calculation and generation of these files can be suppressed with the "--pvalues off" option.
(alphabetical order)
-d X [ -Y [:Z ]], ...
Range of dimensions. X, Y, and Z are integer values. (default: 4)
If you specify only X, the program estimates an SSM for that single dimension. Specifying X-Y estimates multiple SSMs, one for each dimension from X to Y. Z, if given, is the increment step from X to Y; if omitted, Z=1 is assumed. For example, "-d 4-6" makes the program estimate SSMs for dimensions 4, 5, and 6. Multiple ranges can be specified by concatenating them with commas. For example, "-d 4,8-10" specifies dimensions 4, 8, 9, and 10. The program does not select the best dimension automatically. Users may select one model by comparing the BIC (Bayesian Information Criterion) of the estimated models.
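To make the range syntax concrete, the following sketch expands a "-d" argument into the list of dimensions it denotes. It is an illustration written for this page, not code from SiGN-SSM; the function name expand_dimensions is hypothetical, and the handling of a step that does not land exactly on Y (Y is then simply not reached) is an assumption.

```python
def expand_dimensions(spec):
    """Expand a spec such as '4', '4-6', '2-10:4', or '4,8-10' into a list of ints."""
    dims = []
    for part in spec.split(","):
        part = part.strip()
        step = 1  # Z defaults to 1 when omitted
        if ":" in part:
            part, step_text = part.split(":", 1)
            step = int(step_text)
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)  # only X given: a single dimension
        if step < 1 or start > end:
            raise ValueError(f"invalid range: {part!r}")
        dims.extend(range(start, end + 1, step))
    return dims


print(expand_dimensions("4-6"))
print(expand_dimensions("4,8-10"))
print(expand_dimensions("2-10:4"))
```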
-h
Show the help message and quit.
-i N
Integer ID to distinguish concurrent executions. (default: N=1)
This value is used to initialize the random number generator. Therefore, the program produces the same result if you specify the same ID and the same random seed (see the -r option). Use this option to avoid producing identical results when you run jobs in parallel on a job dispatch/queueing system such as Grid Engine. Do not specify this option if you run the program with signssm.sh on Grid Engine.
-L { 0 | 1 | 2 }
Log file mode (default: 0).
--perm ssm_file
Permutation test mode. (default: not specified)
This mode reads the SSM model parameters from ssm_file and then performs a single execution of the permutation test. To perform a full permutation test, you need to run this mode many times and compile the results into a single file using the signproc program. Each run outputs a single file "prefix.XXXXXX", where XXXXXX is a six-digit ID number specified by the -i option. If this option is specified, the -d option is ignored. If the -s N option is specified, SiGN-SSM performs N tests and writes the N test results to one file. This is useful for reducing the number of output files when performing the test with many iterations.
--ssmperm key1=value1,key2=value2,...
Permutation test result compilation mode. (default: not specified)
This mode compiles the permutation test result files generated by the permutation test mode (--perm option) into a single network and outputs it as a tab-separated text file. The output file name can be given with the -o option. Arguments are specified in key=value format. The available arguments are listed below:
prefix=prefix | The prefix of the files to be processed.
ssm=ssm_file | File containing the SSM model parameters (*.A.dat file).
bg=N | The first index of the suffix of the processing files. (default: 1)
ed=N | The last index of the suffix of the processing files. (default: 1000)
th=threshold | The significance level of the p value left in the final network. (default: 0.05)
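As an illustration of how these arguments fit together, the sketch below parses a key=value string, applies the documented defaults, and lists the result files that the bg and ed indices would cover. The helper names parse_ssmperm and result_files are hypothetical, and zero-padding the six-digit suffix is an assumption based on the "prefix.XXXXXX" naming of the --perm output.

```python
def parse_ssmperm(arg):
    """Parse 'key1=value1,key2=value2,...' and fill in the documented defaults."""
    opts = {"bg": "1", "ed": "1000", "th": "0.05"}
    for pair in arg.split(","):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        opts[key.strip()] = value.strip()
    return opts


def result_files(opts):
    """List the files from index bg to index ed, both inclusive."""
    first, last = int(opts["bg"]), int(opts["ed"])
    # Assumption: the six-digit suffix is zero-padded.
    return [f"{opts['prefix']}.{i:06d}" for i in range(first, last + 1)]


opts = parse_ssmperm("prefix=perm,ssm=result.A.dat,ed=3")
files = result_files(opts)
print(files)
print(opts["th"])
```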
-r N
Integer random seed. (default: N=38)
This value is used to initialize the random number generator together with -i option.
-s N
Number of sets (or executions) for a single dimension. (default: N=1)
The program produces N results for each dimension specified by the -d option. If --perm is specified together with this option, a single job (process) performs N tests and writes the N test results to a single output file.
--shift { 0 | 1 | 2 }
Mean shift mode. (default: 1)
--sge
Grid Engine mode. (default: not specified)
If this is specified, the program runs only 1 set, with the number of iterations given by the -n option, regardless of the -s and -d options. That is, the program estimates the i-th set, where i is the ID specified by the -i option. The total number of sets is (sets) x (dimensions), and each execution corresponds to one of these sets. This is useful when you execute via Grid Engine.
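How the ID is mapped to a particular dimension and set is not stated here. The sketch below shows one plausible scheme purely as an illustration: IDs are assumed to start at 1 and to walk through all sets of the first dimension before moving to the next. The function name task_for_id and this ordering are assumptions, not the documented behavior of SiGN-SSM.

```python
def task_for_id(task_id, dims, sets):
    """Map a 1-based task ID to a (dimension, set number) pair.

    Assumed ordering: all sets of dims[0] first, then all sets of dims[1], and so on.
    """
    total = len(dims) * sets  # (sets) x (dimensions)
    if not 1 <= task_id <= total:
        raise ValueError(f"ID must be between 1 and {total}")
    index = task_id - 1
    return dims[index // sets], index % sets + 1


dims = [4, 5, 6]  # e.g. from "-d 4-6"
sets = 2          # e.g. from "-s 2"
tasks = [task_for_id(i, dims, sets) for i in range(1, len(dims) * sets + 1)]
print(tasks)
```

Under this assumed scheme, an array job with 6 tasks would cover every (dimension, set) pair exactly once.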
--ssm ssm_file
Read the SSM file and apply it to the input data. (default: not specified)
If this is specified, the program reads the SSM model parameters from a file and does not estimate them from the input data. This is useful when you want to apply previously estimated model parameters to a different input data set, calculating the state and observation variables from the model and the new input data.
--threads N
Number of threads. (default: N=1) AVAILABLE ONLY FOR SINGLE PROCESS EXECUTION.
This specifies the number of threads to be used when it runs as a single process. Specify a value less than or equal to the number of CPU cores in your computer.
(alphabetical order)
-e EXT
The suffix (extension) of the output file names. (default: "dat")
--each { on | off }
Output matrices and vectors of the result into separate files. (default: not specified)
The estimated SSM model parameters ("*.A.dat" file) are stored in files named "prefix.D000.S000.*.dat" where "*" corresponds to the following matrices/vectors:
The state and observation variables ("*.K.dat" file) are stored in files named "prefix.D000.S000.*.dat" where "*" corresponds to the following matrices/vectors:
In addition, "prefix.Y.r.dat" is output and contains the mean-shifted (by default) input data of the r-th replicate.
-o PREFIX
The prefix of the output file names. (default: "result")
--proc-file
-P
Insert the number of processes into the output file prefix. (default: not specified) AVAILABLE FOR MPI EXECUTION ONLY.
If this is specified, the string ".P0000" is appended to the prefix (-o option), where 0000 is the number of processes (MPI size) written as a 4-digit number.
--pvalues { on | off }
Calculate and output the p values of the input time series data. (default: on)
--state { on | off }
Output the estimated state and observation variables. (default: on)
(alphabetical order)
--em-loop N
-l N
Maximum number of loops allowed for a single EM algorithm execution to converge. (default: N=40000)
-F { on | off }
--constrain-F { on | off }
Apply constraint on diagonal elements of F (system coefficient matrix). (default: on)
-g X
--constrain-Fg X
Strength of constraint on F. (default: X=0.8)
A real value from 0.0 to 1.0 can be specified. A strong constraint may make it difficult to estimate model parameters that fit the input data. This is used only when "-F on" is specified (default).
-n N
Number of iterations (number of initial values) for a single EM algorithm execution. (default: N=100)
The program chooses the best result from N executions of the EM algorithm.
--retry N
Maximum retry count. -1 for unlimited retries. (default: N=-1)
The EM algorithm sometimes fails when the initial values are bad. If SiGN-SSM detects an estimation failure, it automatically retries the estimation with different initial values. This option specifies the maximum retry count.
--RrI { yes | no }
-R { yes | no }
Whether or not to assume that the observation noise covariance is R = r I. (default: no)
If yes is specified, R = r I is assumed; if no, R = diag(r1, ..., rp) is assumed.
--update-mu { on | off }
Whether or not to update μ0 (= x0). (default: on)
If off is specified, the initial state variable x0 is fixed and not updated during the EM algorithm.
(alphabetical order)
--F-max X
Upper bound of random values for initializing F. (default: 1.5)
--F-min X
Lower bound of random values for initializing F. (default: -1.5)
--H-mean X
Mean of normally distributed random values for initializing H. (default: 0.0)
--H-SD X
Standard deviation of normally distributed random values for initializing H. (default: 1.0)
--mu X
Mean of normally distributed random values for initializing x0. (default: 0.0)
--SD X
Standard deviation of normally distributed random values for initializing x0. (default: 1.0)
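The options above describe the distributions from which initial values are drawn. The sketch below shows what such a draw could look like, using the documented defaults. It is an illustration only: the function name draw_initial_values is hypothetical, F is assumed to be drawn uniformly between its bounds (only the bounds are stated above), and the shapes (F as k x k, H as p x k, x0 of length k, for k state dimensions and p observed variables) are assumptions based on the usual state space model layout. Python's generator is not the one SiGN-SSM uses, so the numbers will not match the program's.

```python
import random


def draw_initial_values(k, p, rng, f_min=-1.5, f_max=1.5,
                        h_mean=0.0, h_sd=1.0, mu=0.0, sd=1.0):
    """Draw initial F, H, and x0 for k state dimensions and p observed variables."""
    # Assumption: F entries are uniform on [f_min, f_max].
    F = [[rng.uniform(f_min, f_max) for _ in range(k)] for _ in range(k)]
    # H entries are normal with mean h_mean and standard deviation h_sd.
    H = [[rng.gauss(h_mean, h_sd) for _ in range(k)] for _ in range(p)]
    # x0 entries are normal with mean mu and standard deviation sd.
    x0 = [rng.gauss(mu, sd) for _ in range(k)]
    return F, H, x0


rng = random.Random(38)  # 38 mirrors the default of the -r option
F, H, x0 = draw_initial_values(k=4, p=10, rng=rng)
print(len(F), len(F[0]), len(H), len(H[0]), len(x0))
```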