SiGN-L1 MANUAL

Name

signl1 -- SiGN-L1 : Gene Network Estimation with L1 regularization

Synopsis

Single process-single/multi thread

[ OMP_NUM_THREADS=threads ] INSTALL_PATH/signl1 [ SiGN-L1 options ] input_file

Parallel execution via MPI

[ OMP_NUM_THREADS=threads ] { mpirun | mpiexec } [ MPI options ] INSTALL_PATH/signl1 [ SiGN-L1 options ] input_file

Description

SiGN-L1 estimates gene networks from gene expression data. For simple network estimation, a structural equation model is available, and SiGN-L1 estimates the network structure with L1-regularized sparse learning algorithms such as lasso. The network profiler estimates multiple network structures using additional data, called modulators, that characterize the individual samples.

With the structural equation model, the network structure can be output as an edge list file, a coefficient matrix, or a CSML file. In the network profiler mode, the network structures can be output as edge list files or coefficient matrices. Note that the network profiler estimates a network for every sample in the input data matrix.

Parallel Execution

SiGN-L1 supports single-process multi-thread execution and multi-process multi-thread execution with MPI (hybrid parallelization). Thread parallelization is implemented with OpenMP, so the number of threads can be controlled through the OpenMP environment variables, e.g. OMP_NUM_THREADS for the number of threads per process.
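As an illustration of the two invocation forms in the Synopsis, the following minimal Python sketch assembles the environment and argument list for a run without executing anything. The helper name build_signl1_command, the file name input.edf, and the launcher option "-n" are assumptions made for this example; consult the documentation of your MPI launcher for its actual process-count option.

```python
import os
import shlex


def build_signl1_command(input_file, options=(), processes=1, threads=1,
                         signl1="INSTALL_PATH/signl1", launcher="mpirun"):
    """Return (env, argv) for a SiGN-L1 run; nothing is executed here."""
    env = dict(os.environ)
    # OMP_NUM_THREADS sets the number of OpenMP threads per process.
    env["OMP_NUM_THREADS"] = str(threads)
    argv = []
    if processes > 1:
        # Assumption: the launcher takes the process count as "-n N".
        argv += [launcher, "-n", str(processes)]
    argv += [signl1, *options, input_file]
    return env, argv


# Hypothetical hybrid run: 4 processes with 8 threads each.
env, argv = build_signl1_command(
    "input.edf",
    options=["-m", "semlasso", "--out-list", "result"],
    processes=4,
    threads=8,
)
print("OMP_NUM_THREADS=" + env["OMP_NUM_THREADS"], shlex.join(argv))
```

The returned pair can be passed to subprocess.run(argv, env=env) once INSTALL_PATH has been replaced with the real installation directory.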

For network estimation with the structural equation model, parallelization is achieved by splitting the target (child) genes. The degree of parallelization (the number of concurrent executions) is therefore limited to the number of children.
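The limit described above can be made concrete with a small calculation. This is only a sketch: it assumes that each child gene is an indivisible unit of work and that every thread of every process counts as one worker. The function name busy_workers is hypothetical.

```python
def busy_workers(n_children, processes, threads_per_process):
    """Number of workers that can be kept busy when work is split by child gene."""
    total_workers = processes * threads_per_process
    # No more workers than children can run concurrently.
    return min(total_workers, n_children)


# 16 processes x 8 threads = 128 workers, but only 100 children to estimate.
print(busy_workers(n_children=100, processes=16, threads_per_process=8))
```

Under these assumptions, requesting more than n_children workers in the semlasso mode leaves the surplus idle.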

The network profiler performs leave-one-out cross validation and structure estimation for every sample in the input data. A higher degree of parallelization is therefore achievable, and in practice you rarely need to worry about the upper limit.

Input Files

SiGN-L1 accepts an EDF file as the input data matrix. The other list files (-x, -y, -z, --select-sample-cv, and --select-sample-final options) are text files in which each line contains one gene name or one 1-origin (1-based) index number.
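A list file in the form described above can be produced with a few lines of Python. The sketch below is illustrative only: the helper name write_list_file, the file name parents.txt, and the gene names are assumptions, and the file is written to a temporary directory.

```python
import tempfile
from pathlib import Path


def write_list_file(path, entries):
    """Write one entry (gene name or 1-based index) per line; return the count."""
    lines = [str(entry).strip() for entry in entries]
    if any(not line or "\n" in line for line in lines):
        raise ValueError("each entry must be a non-empty single line")
    Path(path).write_text("\n".join(lines) + "\n")
    return len(lines)


# Hypothetical regulator list for the -x option.
with tempfile.TemporaryDirectory() as workdir:
    parents = Path(workdir) / "parents.txt"
    count = write_list_file(parents, ["TP53", "MYC", "EGFR"])
    print(count, "entries written")
    print(parents.read_text(), end="")
```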

Options

-m { semlasso | npflasso | npfenet | npfrenet }

Method (mode, or algorithm) of the program. The semlasso method performs network estimation by the structural equation model with lasso. The npflasso method performs the network profiler with lasso. The npfenet method performs the network profiler with elastic net. The npfrenet method performs the network profiler with recursive elastic net. The default is npflasso.

-x file

Parent candidate (regulator) list file. The file must be a text file in which each line contains the name of a single regulator gene.

-y file

Children (target) list file. The file must be a text file in which each line contains the name of a single target gene.

-z file

Modulator list file. Available for the network profiler mode (-m npflasso and -m npfenet) only. The file must be a text file in which each line contains the name of a single modulator gene.

-Z file

Modulator data file. Available for the network profiler mode (-m npflasso and -m npfenet) only. If this option is not specified, the input data file is used for the modulator data. The file can be a tab-separated matrix file or an EDF file; specify the file type with the --Z-type option. By default, an EDF file is assumed. The file must contain the same number of samples as the input file.

--Z-type { edf | matrix }

The file format type of the modulator data file for the -Z option.

list: The file contains the list of gene names. Each line has one gene name.

matrix: The file is a tab-separated text file containing a matrix of modulator values.

--Z-args key=value,...

Options for the modulator data file given by the -Z option. See File Formats for available options.

--out-list prefix

Output the estimated networks as an edge list. Multiple files are generated, depending on the parallelization. Several list formats are available; see the --out-list-type option.

--out-list-type { 1 | 2 | 3 }

Type of the list format for the --out-list option. Type 3 is the most compact format.

--out-B prefix
-B prefix

Output the estimated coefficient matrix. The coefficients for a single child are stored in a single file. Each file holds an n-by-p matrix (n rows and p columns), where n is the number of samples and p is the number of parent genes (regulators). In the semlasso mode, the file has only one row. The files are distinguished by a postfix number in the file name. By default, the 1-based (1-origin) index of the children list given by the -y option in the input file is used as the postfix number. Specify the --out-B-name option to use the gene name as the file postfix instead.
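The matrix dimensions described above can be summarized in a small helper, which may be useful for checking output files after a run. This is a sketch only; the function name coefficient_matrix_shape is hypothetical, and the mode names are those listed for the -m option.

```python
NETWORK_PROFILER_MODES = {"npflasso", "npfenet", "npfrenet"}


def coefficient_matrix_shape(mode, n_samples, n_parents):
    """Expected (rows, columns) of one per-child coefficient file."""
    if mode == "semlasso":
        # One network for all samples: a single row of p coefficients.
        return (1, n_parents)
    if mode in NETWORK_PROFILER_MODES:
        # One network per sample: n rows of p coefficients.
        return (n_samples, n_parents)
    raise ValueError(f"unknown mode: {mode}")


print(coefficient_matrix_shape("semlasso", n_samples=200, n_parents=50))
print(coefficient_matrix_shape("npflasso", n_samples=200, n_parents=50))
```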

--out-CSML file

Output the estimated network in the CSML format. This is available for the semlasso mode only.

-H value,...

Comma-separated list of real values used as hyperparameter candidates. Available only for the network profiler mode. Cross validation is performed to determine the hyperparameter, and the best value is chosen from the list. The default list is 0.01,0.03,0.05,0.10,0.20. The computational time grows linearly with the number of values in the list.

-G value,...

Comma-separated list of real values used as λ2 candidates. Available only for the network profiler mode with the elastic net model (-m npfenet). Cross validation is performed to determine λ2, and the best value is chosen from the list. The default list is 0.00001,0.0001,0.001,0.1,1.0. The computational time grows linearly with the number of values in the list.
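The following sketch formats candidate lists for the -H and -G options and estimates how many candidate settings the cross validation has to evaluate. It assumes that every pair of -H and -G values is evaluated, which is consistent with the linear cost stated for each list but is not confirmed by this manual. The function names are hypothetical. Values are kept as strings so that small numbers are not rewritten in exponent notation.

```python
DEFAULT_H = ["0.01", "0.03", "0.05", "0.10", "0.20"]
DEFAULT_G = ["0.00001", "0.0001", "0.001", "0.1", "1.0"]


def candidate_arg(values):
    """Join candidate values into the comma-separated form used by -H and -G."""
    texts = [str(value) for value in values]
    for text in texts:
        float(text)  # raises ValueError if the text is not a real number
    return ",".join(texts)


def n_candidate_settings(h_values, g_values=None):
    """Candidate settings to evaluate, assuming all (H, G) pairs are tried."""
    return len(h_values) * (len(g_values) if g_values else 1)


print("-H", candidate_arg(DEFAULT_H))
print("-G", candidate_arg(DEFAULT_G))
print(n_candidate_settings(DEFAULT_H), "settings without -G")
print(n_candidate_settings(DEFAULT_H, DEFAULT_G), "settings with -G")
```

Shortening either list is the most direct way to reduce the cross validation time.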

--select-sample-cv file

List of 1-origin sample indexes (IDs) used for the cross validation. The file contains one index per line.

--select-sample-final file

List of 1-origin sample indexes (IDs) used for the final parameter estimation. The file contains one index per line.
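Because the sample indexes are 1-origin, positions taken from a 0-based environment such as Python must be shifted by one before they are written to these files. The sketch below shows the conversion; the function names and the choice of samples are assumptions made for this example.

```python
def to_one_origin(zero_based_indexes):
    """Convert 0-based sample positions to 1-origin sample indexes."""
    converted = []
    for index in zero_based_indexes:
        if index < 0:
            raise ValueError("sample positions must be non-negative")
        converted.append(index + 1)
    return converted


def index_file_text(zero_based_indexes):
    """File content with one 1-origin index per line."""
    return "".join(f"{index}\n" for index in to_one_origin(zero_based_indexes))


# Hypothetical choice: every second sample of six for the cross validation.
cv_positions = list(range(0, 6, 2))
print(index_file_text(cv_positions), end="")
```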

--fix-cv-mui

Do not distribute (parallelize) samples during the cross validation.

--fix-final-mui

Do not distribute (parallelize) samples during the final parameter estimation.

--log-mode n
-L n

--help
-h

Show the help message and quit.

-v n

Verbose mode. By default, 0 is assumed.