HOW TO USE SiGN-BN

Introduction
Prepare the Input Expression Data File
Choose the Algorithm Appropriate for Your Data
Running SiGN-BN
Analyzing Estimated Gene Networks with Cell Illustrator Online or Cytoscape
Example with a Sample Data Set

Introduction

SiGN-BN is gene network estimation software. It estimates gene-to-gene regulatory dependencies as a gene network from gene expression data using Bayesian networks. SiGN-BN is designed primarily to run on supercomputers, because estimating a Bayesian network structure from observed data requires a huge amount of computational time.

The executable binary is available in DOWNLOAD. It is also installed on SHIROKANE, the supercomputer system of the Human Genome Center, Institute of Medical Science, The University of Tokyo.

SiGN-BN requires a huge amount of computational resources (CPUs). The time required to estimate a gene network depends on the size of your data, the settings, the algorithm, and the resources available on the system, so it is difficult to predict exactly when a calculation will finish. Please start with a small data set to understand how SiGN-BN behaves on the system with your data and settings. Please also refer to our papers for the details of the method; see REFERENCE for the list of our SiGN-BN papers. Understanding the theoretical details of the algorithms is very important for applying them to your research.

This tutorial is written for users of SHIROKANE.


Prepare the Input Expression Data File

To estimate gene networks with SiGN-BN, you first need to prepare your expression data in the Expression Data Format (EDF). See File Formats for the details of EDF.

Although the EDF specification allows missing values, the input data set for SiGN-BN must not contain any.

It is very difficult to state how many samples (microarrays) are required, because the number depends entirely on the data and the algorithm. Please refer to our previous research results to see how many microarrays we used.


Choose the Algorithm Appropriate for Your Data

Next, select the algorithm. The available choices are the HC+Bootstrap algorithm, the NNSR algorithm, and the Para-OS algorithm. The HC+Bootstrap algorithm has been used in many of our research projects over a relatively long time. However, it is applicable to at most 1000 genes, so you need to select in advance the genes to include in the network estimation. Choose them carefully: with this algorithm, the selection of genes is an essential factor in the gene network estimation.

The NNSR algorithm is applicable to all genes, i.e., more than 20,000 genes. However, the estimated network may be less detailed than that of the HC+Bootstrap algorithm. The third choice is the Para-OS algorithm. Unlike the two algorithms above, it can estimate the optimal (best-scoring) structure of the Bayesian network. However, it is applicable to at most 32 genes, and unfortunately we do not yet have good results from it in biological applications.


Running SiGN-BN

Before running SiGN-BN, we recommend making a directory on the login node of the HGC supercomputer for your analysis or data set. Then, transfer your EDF file to that directory. If you choose the HC+Bootstrap algorithm, make one more directory inside it to store the bootstrapped networks. SiGN-BN will generate at least 1000 files, so it is better to keep them in a separate directory.

Log in to the login node of SHIROKANE. Submit a Grid Engine job for your gene network estimation as shown below.
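The directory layout described above can be sketched as a small shell helper. The function name prepare_analysis_dir is hypothetical and is not part of SiGN-BN; the sketch only assumes a POSIX shell with mkdir.

```shell
# Hypothetical helper (not part of SiGN-BN): create an analysis directory
# together with a bs/ subdirectory for the bootstrapped networks.
prepare_analysis_dir() {
    workdir="$1"
    # -p creates missing parent directories and does not fail
    # if the directory already exists.
    mkdir -p "$workdir/bs"
}

# Example call, commented out so that nothing is created by accident:
# prepare_analysis_dir "$HOME/dir"
```

Calling the helper a second time on the same path is harmless, which is convenient when a job is resubmitted.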

HC+Bootstrap

Suppose that the EDF file is input.EDF.txt, and you made directories ~/dir for this analysis and ~/dir/bs for storing the bootstrapped networks. Suppose that you have already sent input.EDF.txt to ~/dir. First, move to that directory.

$ cd ~/dir

$ ls

bs/     input.EDF.txt

Next, submit a job. The following command estimates a static gene network with the default settings (options).

$ qsub -t 1-N ~tamada/sign/signbn-hcbs.sh -o bs/result input.EDF.txt

Here N is the number of bootstrap iterations. Please specify a small value (such as 10) at first to check that it works properly. For the real execution, we recommend an N of more than 1000. See the SiGN-BN Manual for the available options.

After the network estimation, that is, once the submitted job has finished, the estimated bootstrapped network files are generated in the bs directory. There must be N + 1 files. If not, the network estimation was not successful. Whether it succeeded or not, a log file named result.log may be generated in the same directory, and it may help to solve the problem. An individual bootstrapped network file is not usable for your analysis. You need to compile the bootstrapped networks into a single network.
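The advice to try a small N first can be sketched as a helper that only prints the qsub line, so that it can be inspected before being pasted into the terminal. The function name hcbs_qsub_line is hypothetical; the command it prints is the one shown above, with N and the EDF file name filled in.

```shell
# Hypothetical helper: print (do not run) the qsub line for a given
# number of bootstrap iterations and input EDF file.
hcbs_qsub_line() {
    n="$1"      # number of bootstrap iterations (N)
    edf="$2"    # input EDF file
    # The single-quoted format string keeps ~tamada unexpanded in the output.
    printf 'qsub -t 1-%s ~tamada/sign/signbn-hcbs.sh -o bs/result %s\n' "$n" "$edf"
}

# Trial run first, then the real execution:
# hcbs_qsub_line 10 input.EDF.txt
# hcbs_qsub_line 1000 input.EDF.txt
```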
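The N + 1 file check can be sketched as follows. The function name check_bootstrap_files is hypothetical, and the sketch assumes that every regular file directly inside the bs directory counts toward N + 1; the text does not say whether result.log is included, so please adjust the check if your directory differs.

```shell
# Hypothetical helper: succeed only if DIR holds exactly N + 1 regular files.
# Assumption: all regular files directly inside DIR are counted.
check_bootstrap_files() {
    dir="$1"
    n="$2"
    expected=$((n + 1))
    # Count regular files directly inside dir, without recursing.
    actual=$(find "$dir" -maxdepth 1 -type f | wc -l)
    actual=$((actual))   # normalize any padding added by wc
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual files"
    else
        echo "MISMATCH: found $actual files, expected $expected"
        return 1
    fi
}

# Example for a trial run with N = 10:
# check_bootstrap_files bs 10
```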

$ qsub ~tamada/sign/signbn-hcbs.sh --bin signproc --bs prefix=bs/result --output type=csml,file=result.csml

This generates result.csml from the bootstrapped networks, and it is the final result of the network estimation.

NNSR

Suppose that the input EDF file is input.EDF.txt, and you made a directory ~/dir for this analysis. First, move to that directory.

$ cd ~/dir

$ ls

input.EDF.txt

Next, submit a job. The NNSR algorithm is implemented with MPI, a library used to parallelize applications on supercomputers. Thus, you need to submit the job as an MPI job. Currently, NNSR is not installed in /usr/local/bin/. The executable binary signbnnnsr.X.X.X, where X.X.X is the version number, and the job script for MPI applications, signmpi.sh, are provided in ~tamada/sign/. NNSR is run through the job script.

$ qsub -pe mpi-fillup N ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.0.9.16 -o result input.EDF.txt

Here N is the number of CPU cores used simultaneously. Running with many cores makes the computation much faster. However, the job then has to wait longer to acquire the specified number of cores. Please check the maximum number of cores you are allowed to use. It depends on your account type and the status of the certification (in Japanese).

After the calculation has finished, the output file result.sgn3 is generated. This is an SGN3 format file. You can convert it into CSML or the plain text format with SiGN-Proc. Note that with NNSR the network is generally very large, so the CSML format is not recommended.

Para-OS

Suppose that the input EDF file is input.EDF.txt. First, copy the job script ~tamada/sign/signmpi.sh into your working directory. If you do not need to change anything in the job script, you can use the one in ~tamada/sign directly. Suppose that the working directory is ~/dir.

$ cd ~/dir

$ cp ~tamada/sign/signmpi.sh .

$ ls

input.EDF.txt   signmpi.sh

Next, submit a job to the Grid Engine. The Para-OS algorithm is implemented with MPI. Thus, you need to submit the job as an MPI job. To do so, use the -pe option of the Grid Engine. For example, if you want to use 16 processes, execute the qsub command as follows.

$ qsub -pe mpi-fillup 16 signmpi.sh ~tamada/sign/paraos.0.1.2 -o result.sgn3 input.EDF.txt

Specify the number of processes you want to use instead of 16 in the above example.

After the calculation has finished, the output file result.sgn3 is generated. This is an SGN3 format file. You can convert it into CSML or the plain text format with SiGN-Proc.


Analyzing Estimated Gene Networks with Cell Illustrator Online or Cytoscape

After the estimation, you will have your gene network in the CSML format. It can be viewed and analyzed in Cell Illustrator Online (CIO). For a huge network of more than 1,000 genes, however, it may be difficult to analyze it with CIO. SiGN-Proc can be used to analyze such a huge network. It can extract sub-networks around the genes you are interested in.

You can also use Cytoscape to analyze the estimated network. To do so, convert the network to the TXT format with the SiGN-Proc tool. For example, to convert an SGN3 format network file to TXT, run the following:

$ ~tamada/sign/signproc -t sgn3 input.sgn3 --output file=output.txt,type=TXT,args=\{H,N,P\}

This generates a tab-separated file output.txt in which each row corresponds to an edge with its parent and child genes. The first line is the header, and the Parent and Child columns are the source and sink nodes of an edge. The file can be imported into Cytoscape.
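Before importing the file into Cytoscape, it can be useful to inspect the edges of one gene on the command line. The following is a minimal sketch, assuming that the header line spells the two column names exactly as Parent and Child; the function name children_of is hypothetical. The columns are located by name, so any additional columns in the file are ignored.

```shell
# Hypothetical helper: list the child genes of one parent gene in a
# tab-separated edge file whose header contains "Parent" and "Child".
children_of() {
    gene="$1"
    file="$2"
    awk -F '\t' -v gene="$gene" '
        NR == 1 {
            # Locate the Parent and Child columns by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "Parent") p = i
                if ($i == "Child")  c = i
            }
            next
        }
        # Print the child of every edge that starts at the requested gene.
        p && c && $p == gene { print $c }
    ' "$file"
}

# Example (GENE_A is a placeholder gene name):
# children_of GENE_A output.txt
```

If the header does not contain both column names, the helper prints nothing rather than guessing the column positions.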


Example with a Sample Data Set

HC+Bootstrap

The sample EDF file sample003.edf.txt is located in ~tamada/sign/samples/ on SHIROKANE. Let's estimate a gene network from it.

First, prepare directories for this network estimation, and then copy the sample file.

$ mkdir ~/example

$ mkdir ~/example/bs

$ cd ~/example

$ cp ~tamada/sign/samples/sample003.edf.txt .

$

Next, submit a job.

$ qsub -t 1-1000 ~tamada/sign/signbn-hcbs.sh -o bs/result sample003.edf.txt

Your job-array XXXXXXX.1-1000:1 ("signbn-hc.sh") has been submitted

$

Here XXXXXXX represents your Grid Engine job ID. You can check your job status by the qstat command.
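Waiting for the job can be sketched as a polling loop around qstat. This sketch assumes that `qstat -j JOB_ID` returns a non-zero exit status once the job is no longer known to the Grid Engine, which may differ between Grid Engine versions and site configurations; the function name wait_for_job and the default interval of 60 seconds are hypothetical choices.

```shell
# Hypothetical helper: block until the given Grid Engine job has left the queue.
# Assumption: "qstat -j ID" fails (non-zero exit status) when the job is gone.
wait_for_job() {
    job_id="$1"
    interval="${2:-60}"   # seconds between checks
    while qstat -j "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
}

# Example, replacing XXXXXXX with the job ID printed by qsub:
# wait_for_job XXXXXXX
```

A long interval keeps the load on the login node low; the job itself is not affected by the polling.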

After the job has finished, compile the bootstrapped networks.

$ qsub ~tamada/sign/signbn-hcbs.sh --bin signproc --bs prefix=bs/result --output type=csml,file=sample.csml

Your job XXXXXXX ("signbn-hc.sh") has been submitted

$

Again, wait for your job to finish. You will then have your network in sample.csml.

$ ls

bs/  sample.csml  sample003.edf.txt  se  so

$

Files se and so are generated automatically by the Grid Engine job and contain the messages written to the standard error and the standard output, respectively.


NNSR

The sample EDF file GN-10k-500-100.txt is available in ~tamada/sign/samples on the gateway node of Shirokane 2 and 3. It contains the first 100 genes of GN-10k-500.txt, a data set of 10,540 genes with 500 samples that was used in our paper published in TCBB. GN-10k-500.txt is in the same directory. Let's use them as sample data files.

First, prepare a directory for this network estimation, and then copy the input file.

$ mkdir ~/example2

$ cd ~/example2

$ cp ~tamada/sign/samples/GN-10k-500-100.txt .

$

Next, submit a job. In this example, let's use the -m and -T options to shorten the execution time.

$ qsub -pe mpi-fillup 16 ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.0.9.16 -T 100 -m 50 -o result GN-10k-500-100.txt

Your job XXXXXX ("signmpi.sh") has been submitted

$

Wait for a while until the job finishes. After the job finishes, four files will be generated.

$ ls

result.log   result.sgn3   se   so

$

Files se and so are generated automatically by the Grid Engine and contain the messages written to the standard error and the standard output, respectively. If the job does not finish correctly, check these files. The log file result.log may also be helpful for resolving the problem.

This is a very short test execution. To simulate a more realistic situation, use GN-10k-500.txt without the -m and -T options and with many more CPU cores. For example:

$ qsub -pe mpi-fillup 256 ~tamada/sign/signmpi.sh ~tamada/sign/signbnnnsr.0.9.16 -o result GN-10k-500.txt

Your job XXXXXX ("signmpi.sh") has been submitted

$
