SiGN-BN MANUAL

SiGN-BN HC+Bootstrap
SiGN-BN NNSR
SiGN-BN Para-OS

SiGN-BN HC+Bootstrap

Name

signbn-hcbs.sh -- SiGN-BN: Shell script for executing SiGN-BN HC+Bootstrap on HGC Shirokane.

Synopsis

Parallel execution via Grid Engine on HGC supercomputer system

qsub -t 1-N [ Grid Engine options ] ~tamada/sign/signbn-hcbs.sh [ options ] input_file

Note: N corresponds to the number of iterations of the bootstrap method.

Compiling the bootstrapped networks into a single gene network

qsub [ Grid Engine options ] ~tamada/sign/signbn-hcbs.sh --bin signproc --bs prefix=file_prefix[,other options] --output type=file_type,file=output_file

Parallel execution with MPI

OMP_NUM_THREADS=n { mpiexec | mpirun } [ MPI options ] signbn [ options ] input_file

Note: Use with a specific MPI execution program and a job script available and/or acceptable on your system.

Description

signbn-hcbs.sh is a shell script to run SiGN-BN (HC+Bootstrap) on the HGC supercomputer system Shirokane 3. The program is executed as an array job on Grid Engine, the job queuing (dispatch) system installed on Shirokane 3. The number of tasks of the array job, specified by the -t option of Grid Engine, therefore corresponds to the number of iterations of the bootstrap method. Each bootstrap execution estimates a gene network from a resampled data set and outputs the estimated gene network as a single file. The array job tasks can run in parallel via Grid Engine. To obtain a final gene network from these bootstrapped network files, you need to use the SiGN-Proc tool, which can also be executed through the signbn-hcbs.sh script. An EDF file can be specified as an input file. See File Formats for details of EDF.

SiGN-Proc is a program which can be used to compile output files generated by executing signbn-hcbs.sh into a single gene network file. It can generate a resultant network in CSML, plain text format, and so on. See SiGN-Proc Manual for the details.
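The two-step workflow described above (run the bootstrap as an array job, then compile the results with SiGN-Proc) can be sketched as follows. This is a minimal dry-run sketch that only prints the qsub command lines instead of submitting them. The iteration count and the names result/bs, input.edf.txt, file_type and network_file are hypothetical placeholders; see the SiGN-Proc Manual for the valid output types.

```shell
#!/bin/sh
# Dry-run sketch: prints the command lines, does not submit anything.
# Path as given in the synopsis; quoted so that it is printed verbatim.
SIGN_SCRIPT='~tamada/sign/signbn-hcbs.sh'

# Step 1: one array-job task per bootstrap iteration.
# $1 = number of bootstrap iterations, $2 = output prefix, $3 = input file
bootstrap_cmd() {
    printf 'qsub -t 1-%s %s --bs -o %s %s\n' "$1" "$SIGN_SCRIPT" "$2" "$3"
}

# Step 2: compile the bootstrapped networks into a single network.
# $1 = prefix used in step 1, $2 = output type, $3 = output file
compile_cmd() {
    printf 'qsub %s --bin signproc --bs prefix=%s --output type=%s,file=%s\n' \
        "$SIGN_SCRIPT" "$1" "$2" "$3"
}

# Hypothetical example values (placeholders, not defaults of SiGN-BN).
bootstrap_cmd 1000 result/bs input.edf.txt
compile_cmd result/bs file_type network_file
```

The second command should be submitted only after all tasks of the first job have finished, because it reads the files they produce.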

For multi-node parallel execution, SiGN-BN supports MPI. Multi-threaded execution is also supported. To specify the number of threads per node, set the OMP_NUM_THREADS environment variable to the appropriate number. When running with MPI, the estimated networks are compiled automatically.
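As an illustration of the MPI form of the synopsis, the following dry-run sketch prints a command line that combines OMP_NUM_THREADS with an MPI launcher. It assumes that your launcher accepts the standard "-n" option of mpiexec for the number of processes; the process count, thread count, and file names are hypothetical, and the launcher and job script must be adapted to your system.

```shell
#!/bin/sh
# Dry-run sketch: prints the command line, does not run MPI.
# $1 = MPI processes, $2 = threads per process, $3 = bootstrap iterations,
# $4 = output file, $5 = input file
mpi_cmd() {
    printf 'OMP_NUM_THREADS=%s mpiexec -n %s signbn -N %s -o %s %s\n' \
        "$2" "$1" "$3" "$4" "$5"
}

# Hypothetical example: 8 processes with 4 threads each, 1000 iterations.
mpi_cmd 8 4 1000 network_file input.edf.txt
```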

Grid Engine options

In signbn-hcbs.sh for Shirokane 3, "-e se -o so -cwd" (and some other minor options) are set for the qsub command. Therefore, the standard error and the standard output are written to the files se and so, respectively, in the current directory from which the user submits the Grid Engine job.

In Shirokane 3, the "-e" and "-o" options are not set in the script by default. Thus, Grid Engine generates many files for storing the messages written to the standard output and the standard error. The generated file names are signbn-hcbs.sh.X.<job_id>.<task_id>, where X is o for the standard output or e for the standard error. You can avoid this by specifying other file names with the "-e" and "-o" options of the qsub command.

These default Grid Engine options can be overridden by specifying Grid Engine options before the job script.
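The placement of Grid Engine options before the job script can be sketched as below. This is a dry-run sketch that prints the command line only; the file names bs.err and bs.out and the other argument values are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: prints the command line, does not submit anything.
SIGN_SCRIPT='~tamada/sign/signbn-hcbs.sh'

# $1 = bootstrap iterations, $2 = stderr file, $3 = stdout file,
# remaining arguments = options and input file passed to signbn-hcbs.sh
submit_with_logs() {
    n=$1; err=$2; out=$3
    shift 3
    # Grid Engine options (-t, -e, -o) come before the job script;
    # everything after the script name is passed to the script itself.
    printf 'qsub -t 1-%s -e %s -o %s %s %s\n' "$n" "$err" "$out" "$SIGN_SCRIPT" "$*"
}

submit_with_logs 100 bs.err bs.out --bs -o result/bs input.edf.txt
```

Note that the "-o" before the script name is the Grid Engine option, while the "-o" after it is the output prefix option of signbn-hcbs.sh.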

Special options available only for signbn-hcbs.sh

--bs

Perform the bootstrap method. If the dynamic model is specified (-y), the pseudo bootstrap method is performed. The number of bootstrap iterations is determined by the number of tasks of the array job when submitting the job to Grid Engine. If this option is specified, the -B option is implied with the appropriate ID numbers.

-o file_prefix

The prefix of the output file names. It can contain directories. The actual output file will have the 6-digit ID number as its suffix. The number corresponds to a Grid Engine array job task ID. By default, "--log-mode 4 --log-file file_prefix.log" is implied for the first task.
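To check that every task produced its file, it can help to list the expected names. The sketch below assumes that the 6-digit suffix is zero-padded and joined to the prefix with a dot, by analogy with the file.000000 naming described for the MPI mode, and that task IDs start at 1 as in "-t 1-N". Both points are assumptions, so verify them against the files that are actually generated.

```shell
#!/bin/sh
# Lists the output file names expected from an array job.
# ASSUMPTION: names have the form <prefix>.<6-digit task ID>.
# $1 = file prefix, $2 = number of array-job tasks
expected_outputs() {
    i=1
    while [ "$i" -le "$2" ]; do
        printf '%s.%06d\n' "$1" "$i"
        i=$((i + 1))
    done
}

# Hypothetical prefix and task count.
expected_outputs result/bs 3
```

The listed names can then be tested one by one with `[ -e "$name" ]` to find tasks that need to be resubmitted.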

--rel release_number

Use the specified release number of SiGN-BN instead of the latest stable release.

--dir path

The path where the SiGN programs are installed. A trailing slash ("/") is not required. Use this option to run a beta or older version installed in another directory.

Options for SiGN-BN HC+Bootstrap

-y

Dynamic model. If the input file is not in the EDF file format, you need to specify the number of replicates for each time point with the --replicates option.

-m n

The maximum number of parents that each gene can have. By default, n = 10. Specify a small value to avoid overfitting if you have relatively few data samples.

-p n

The number of parent candidates of the greedy algorithm.

-o file

Output file name. The file name can contain directories (relative path).

-O output_type

Output file format. See File Formats for details. Do not specify this if you perform the bootstrap method.

--replicates v1,v2,...

The list of the numbers of replicates of time points. This is required if the input data file is not an EDF file.

--blocks n

The number of consecutive time point blocks of the pseudo bootstrap method. The final number of samples used becomes (# of time points) x n.

-s score

The name of the score function. By default, BNRC is used.

-S key=value,...

The score-specific options in key=value format. See below for the available options.

--algo-args key=value,...
-A key=value,...

The algorithm specific options.

-r seed

The integer random seed value.

--select-nodes file

The network estimation is performed only for the genes listed in file. The file is a tab-separated text file with one entry per line. By default, the first column is read and used as the names of the genes to be selected for the network estimation. To change the column position to read, use the --select-nodes-col option.

--select-nodes-col n

The 1-based (1-origin) column position for the --select-nodes option.
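The following sketch shows what a gene list for --select-nodes might look like and previews the names in a given 1-based column. The file contents and gene names are hypothetical, and the preview uses the standard cut utility only to mimic the column selection as described above; it is not the parser used by SiGN-BN.

```shell
#!/bin/sh
# Build a hypothetical two-column, tab-separated gene list in a temp file.
genes_file=$(mktemp)
printf 'probe_1\tGENE_A\nprobe_2\tGENE_B\nprobe_3\tGENE_C\n' > "$genes_file"

# Print the entries of one column (default: column 1, as for --select-nodes).
# $1 = file, $2 = 1-based column position (as for --select-nodes-col)
preview_nodes() {
    cut -f "${2:-1}" "$1"
}

preview_nodes "$genes_file"      # first column: probe_1, probe_2, probe_3
preview_nodes "$genes_file" 2    # second column: GENE_A, GENE_B, GENE_C
rm -f "$genes_file"
```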

--algo algorithm

The name of the structure learning algorithm. By default, "hc2" is used, which is version 2.0 of the greedy hill-climbing algorithm implementation. You can specify "tshc" for the Two Step HC algorithm. See Two Step HC for more details.

--log-mode n
-L n

Log mode. By default, only 1 file is generated by the first job (Grid Engine) or the root process (MPI).

--log-file file

If specified, the log message is written in file.

--cache n

Specifies the cache algorithm. By default, 3 is used.

-N n
--iteration n

[MPI] The number of iterations of the estimation. Use this option to specify the number of bootstrap iterations. By default, 1 is assumed. If a value other than 1 is specified, the -B option is implied. This option is available for parallel execution with MPI.

--compile { on | off }

[MPI] If on is given, compile the estimated networks into a single network. If a value other than 1 is specified for the -N or --iteration option, on is assumed by default.

--threshold th
-T th

[MPI] Threshold used for compiling the estimated bootstrapped networks. By default, th = 0.05.

--local-output { on | off }

[MPI] Save the estimated bootstrap networks in files. Each process (MPI rank) produces a single file named file.000000, where file is the file name specified by the -o option and 000000 is a six-digit number corresponding to the rank ID, and stores its networks in that file. All the networks estimated by the same process are stored in the same file. The resulting files can be processed (compiled) into a single network by the signproc tool. By default, "off" is assumed.

--hybrid

[MPI] Enables the hybrid parallelization mode, which performs a single network estimation with multiple threads in an MPI process. This mode is generally less efficient, but it is useful if a single network estimation of the bootstrap method cannot finish within the elapsed-time limit set in your computation environment.

Options of the score functions

The following comma-concatenated key=value style extra arguments are available for the -S option. A white space can be inserted after the comma. The available options differ depending on the score function specified by the -s option.

The BNRC score function

hyper_num=n
hn=n

The number of hyperparameters to search.

hyper_bg=x
hb=x

The initial value of the sequence of hyperparameters to search.

hyper_inc=y
hi=y

linear

Linear mode. This is actually an alias for "hn=2,hb=2.0,hi=1.0".

level=n

Pre-calculation level.
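A value for the -S option can be assembled from individual key=value pairs as sketched below. The helper name join_opts is hypothetical; the example values are the ones listed above as the expansion of the "linear" alias.

```shell
#!/bin/sh
# Join all arguments with commas: a=1 b=2 -> a=1,b=2
join_opts() {
    # "$*" joins the arguments with the first character of IFS;
    # the subshell keeps the IFS change local.
    ( IFS=,; printf '%s\n' "$*" )
}

# The expansion of the "linear" alias, written out explicitly.
score_opts=$(join_opts hn=2 hb=2.0 hi=1.0)
printf 'option: -S %s\n' "$score_opts"   # prints "option: -S hn=2,hb=2.0,hi=1.0"
```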

Options of the search algorithms

The following comma-concatenated key=value style extra arguments are available for the --algo-args or -A option. A white space can be inserted after the comma. The available options differ depending on the algorithm.

The HC algorithm

trials=n
t=n

The number of trials. The HC algorithm performs the greedy algorithm n times and returns the best scored network as the result of the algorithm. By default, n = 10.

max_loops=n


SiGN-BN NNSR

Name

signmpi.sh -- Shell script for Grid Engine on HGC for executing SiGN-BN NNSR algorithm.

Synopsis

In HGC Supercomputer System Shirokane 3

qsub -pe { mpi-fillup | mpi | mpi_8 | mpi_4 } N ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.X.X.X [ Options ] input_file

Note: N corresponds to the number of processes used simultaneously.

Note: Use signmpi.sh and signbnnnsr.X.X.X under ~tamada/sign for the latest release.

Description

signmpi.sh is a shell script to run SiGN-BN NNSR on the HGC supercomputer system Shirokane 3. The program is parallelized with MPI (Message Passing Interface), which is a standard way of parallelizing programs. Therefore, you need to run it as an MPI job. The Grid Engine on Shirokane supports the parallel execution of programs with MPI. Unlike an array job of the Grid Engine, an MPI job requires the specified number of CPU cores simultaneously during its execution. The time required to finish the calculation depends on the number of CPU cores you specify. The more CPU cores you specify, the faster SiGN-BN NNSR runs for the same input data and parameters. However, if the supercomputer is very crowded, a job that requests many CPU cores has fewer chances of being executed.

Similar to other SiGN programs, SiGN-BN NNSR accepts an EDF format gene expression file as its input.

Grid Engine Options

In the signmpi.sh script, "-e se -o so -cwd" (and other minor options) are assumed by default. As noted above, SiGN-BN NNSR requires MPI. Therefore, you have to specify the "-pe" option to choose an MPI environment and the number of CPU cores (Grid Engine slots). The available MPI environments are mpi-fillup, mpi, mpi_8, and mpi_4. The recommended environment is mpi-fillup, with which the Grid Engine tries to execute as many processes as possible on the same computation node. In contrast, mpi tries to execute as few processes as possible on the same computation node. mpi_8 and mpi_4 guarantee that exactly 8 or 4 processes, respectively, are executed per computation node. Therefore, with these environments, N (the number of processes) has to be a multiple of 8 or 4.
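The constraint on N for mpi_8 and mpi_4 can be checked before submitting a job. The following is a minimal sketch; the function name check_slots is hypothetical, and the environment names are those listed above.

```shell
#!/bin/sh
# Check that the slot count fits the chosen MPI environment.
# $1 = parallel environment name, $2 = number of processes (slots)
# Returns 0 if acceptable, 1 if N is not a valid multiple, 2 if unknown.
check_slots() {
    case "$1" in
        mpi_8) per_node=8 ;;
        mpi_4) per_node=4 ;;
        mpi-fillup|mpi) return 0 ;;   # no multiple-of constraint stated
        *) echo "unknown environment: $1" >&2; return 2 ;;
    esac
    if [ $(( $2 % per_node )) -ne 0 ]; then
        echo "N=$2 is not a multiple of $per_node for $1" >&2
        return 1
    fi
    return 0
}

check_slots mpi_8 32 && echo "mpi_8 with 32 slots: ok"
check_slots mpi_4 10 2>/dev/null || echo "mpi_4 with 10 slots: rejected"
```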

Options for SiGN-BN NNSR

-o, -O, -L, -s, -S, --algo, -A, -y, --blocks

These options are the same as SiGN-BN HC+Bootstrap. See above for details.

-T n

The number of iterations of the subnetwork estimation by the neighbor node sampling and repeat algorithm.

-t n

The number of iterations of the Random Sampling phase. By default, n = 0. Normally, you do not need to change this value. This option is provided to reproduce the Random Sampling phase that appears in our paper.

SiGN-BN Para-OS

Name

paraos.X.Y.Z -- SiGN-BN Para-OS algorithm for optimal gene network estimation.

Synopsis

Parallel execution via Grid Engine on HGC supercomputer system

qsub -pe MPI_environment N job_script ~tamada/sign/paraos.X.Y.Z [ Options ] input_file

Parameters

Description

SiGN-BN Para-OS calculates the optimal structure of a gene network from the data. Because the calculation of the optimal structure is difficult, you need a large amount of computational resources, i.e., many CPUs or computation nodes. The program is implemented with MPI, and the binary also supports multi-threaded execution. Therefore, the number of CPU cores used by the program is equal to the number of processes × the number of threads.

SiGN-BN Para-OS supports only the BNRC score function and the static Bayesian network model. It does not support the dynamic model, which uses time-series data.

Note that the computation time grows exponentially with the number of genes in the data set. That is, adding one gene to a data set doubles the computation time. On Shirokane 3, the computation time for the sample data GN-16-50.edf.txt is about 8 minutes using 16 computation nodes (processes) with 1 thread per process.
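The doubling rule gives a rough way to extrapolate run times. The sketch below assumes exact doubling per additional gene and the same nodes, threads, and parameters as the baseline run, so treat the result as an order-of-magnitude estimate only. The function name estimate_minutes is hypothetical; the 8-minute baseline is the figure quoted above.

```shell
#!/bin/sh
# Extrapolate the run time by doubling once per additional gene.
# $1 = baseline minutes, $2 = number of additional genes
estimate_minutes() {
    minutes=$1
    extra=$2
    while [ "$extra" -gt 0 ]; do
        minutes=$((minutes * 2))
        extra=$((extra - 1))
    done
    echo "$minutes"
}

# Four more genes than the 8-minute baseline: 8 * 2^4 = 128 minutes.
estimate_minutes 8 4
```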

Options

-o, -S, -L

These options are the same as SiGN-BN HC+Bootstrap. See above for details.

--log file

If specified, the log message is written in file.
