This page describes the file formats supported by SiGN.
(Last update: Oct. 11, 2021)
EDF is a text file format designed for representing a gene expression data set, originally developed for Cell Illustrator. It can be used as the input data matrix format for SiGN.
EDF is a tab- or comma-separated text file format. The file consists of three parts: a meta data section, an attribute section, and a data section. In the attribute and data sections, each column corresponds to a sample, and in the data section, each line holds the sample values of one gene (probe). The full EDF specification is very large, so SiGN supports only the part of it that is required for gene network estimation. Here is an example of a tab-separated EDF file.
    # This is comment
    $Version 1.0
    @PrimaryKeyGroupID 1 1 2 2 3 3 3 3
    @SecondaryKeyGroupID 1 2 1 2 1 2 2 3
    gene1 1.1 2.2 3.3 4.4 5.5 6.6 101.0
    gene2 7.7 8.8 9.9 10.1 11.11 12.12 102.0
    gene3 13.13 14.14 15.15 16.16 17.17 18.18 103.0
In the meta data section, each line starts with '$' and defines global information about this data set or attributes in the following attribute section.
In the attribute section, each line starts with '@' and defines the attributes of expression samples. The first column is an attribute key followed by the attributes of samples. In the above example, two attribute keys PrimaryKeyGroupID and SecondaryKeyGroupID are specified for the samples.
In the data section, each line represents the expression data of a gene (or a probe). The first column is the name of the gene, followed by its expression values.
Both the meta data section and the attribute section can be omitted. Therefore, the simplest form of the input data is as follows.
    gene1 1.1 2.2 3.3 4.4 5.5 6.6 101.0
    gene2 7.7 8.8 9.9 10.1 11.11 12.12 102.0
    gene3 13.13 14.14 15.15 16.16 17.17 18.18 103.0
This represents an input data file consisting of three genes with seven samples.
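The three-part layout described above can be read with a few lines of code. The following Python sketch is not part of SiGN: the function name `parse_edf` and the shape of the returned dictionaries are assumptions made for illustration, and it covers only the subset shown on this page (lines starting with `#` treated as comments as in the example, `$` meta data lines, `@` attribute lines, and data lines).

```python
def parse_edf(text, delimiter="\t"):
    """Split EDF text into meta data, attributes, and expression data.

    Hypothetical helper for illustration; not the SiGN reader.
    """
    meta, attributes, data = {}, {}, {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split(delimiter)
        key, values = fields[0], fields[1:]
        if key.startswith("$"):
            meta[key[1:]] = values          # meta data section
        elif key.startswith("@"):
            attributes[key[1:]] = values    # attribute section
        else:
            data[key] = [float(v) for v in values]  # data section
    return meta, attributes, data


# A small made-up input with three samples per gene.
sample = "\n".join([
    "# This is comment",
    "$Version\t1.0",
    "@PrimaryKeyGroupID\t1\t2\t3",
    "gene1\t1.1\t2.2\t3.3",
    "gene2\t7.7\t8.8\t9.9",
])
meta, attributes, data = parse_edf(sample)
print(meta)
print(attributes)
print(data)
```

Because the meta data and attribute sections are optional, the same function also accepts a file that contains only data lines; the first two dictionaries are then simply empty.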
Some sample files are available at the SiGN-BN DOWNLOAD page.
The above PrimaryKeyGroupID and SecondaryKeyGroupID are pre-defined attributes that specify the primary and secondary IDs of samples. In SiGN, the primary ID corresponds to the time point, and the secondary ID corresponds to the replicate ID. In EDF, you can arrange columns in any order as long as the primary and secondary IDs are given. The program automatically extracts the required values using the primary and secondary IDs in the file. IDs must start at 1 (minimum ID = 1). These two sample IDs are very important if you want to use time series data.
Basically, all the predefined attribute keys can be re-defined in the meta data section. To re-define PrimaryKeyGroupID, use the $PrimaryKeyGroupID meta data key and specify its new name. For example,
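Since columns may appear in any order, a reader has to rearrange them using the two IDs. The sketch below shows one way to do that in Python; it is an illustration under assumptions, not SiGN's actual procedure. The helper name `sorted_column_order` is hypothetical, and rejecting IDs below 1 is this sketch's reading of the "minimum ID = 1" rule.

```python
def sorted_column_order(primary_ids, secondary_ids):
    """Return column indices ordered by (time point, replicate).

    Hypothetical helper; IDs are expected to be integers >= 1.
    """
    if len(primary_ids) != len(secondary_ids):
        raise ValueError("both ID rows must have one entry per sample")
    if any(i < 1 for i in list(primary_ids) + list(secondary_ids)):
        raise ValueError("IDs must start at 1")
    return sorted(
        range(len(primary_ids)),
        key=lambda i: (primary_ids[i], secondary_ids[i]),
    )


# Made-up example: four samples stored out of order.
primary = [2, 1, 1, 2]      # time points
secondary = [1, 2, 1, 2]    # replicates
values = [3.3, 2.2, 1.1, 4.4]

order = sorted_column_order(primary, secondary)
reordered = [values[i] for i in order]
print(order)
print(reordered)
```

Sorting indices rather than values keeps the same ordering reusable for every gene row in the file.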
    ...
    $PrimaryKeyGroupID time
    @time 1 2 3 4
    gene1 10.5 12.1 43.1 23.4
    ...
This re-defines PrimaryKeyGroupID as time and uses the new attribute time in the attribute section. Likewise, the SecondaryKeyGroupID attribute key can be re-defined with the $SecondaryKeyGroupID meta data key.
Here is another example:
    $Version 1.0
    $PrimaryKeyGroupID time
    $SecondaryKeyGroupID rep
    @time 1 1 2 2 3 3 3 3
    @rep 1 2 1 2 1 2 2 3
    gene1 1.1 2.2 3.3 4.4 5.5 6.6 101.0
    gene2 7.7 8.8 9.9 10.1 11.11 12.12 102.0
    gene3 13.13 14.14 15.15 16.16 17.17 18.18 103.0
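The renaming mechanism amounts to a lookup: if the meta data section names a replacement for a predefined key, the replacement is used, otherwise the predefined name stands. A minimal Python sketch of that idea follows. The helper `resolve_attribute_key` and the dictionary layout are assumptions for illustration, not SiGN internals.

```python
def resolve_attribute_key(meta, predefined_key):
    """Return the attribute name that plays the role of predefined_key.

    `meta` maps meta data keys (without '$') to their values.
    Hypothetical helper for illustration only.
    """
    return meta.get(predefined_key, predefined_key)


# Meta data and attributes as they might be read from a file
# that renames both ID keys (made-up values).
meta = {
    "Version": "1.0",
    "PrimaryKeyGroupID": "time",
    "SecondaryKeyGroupID": "rep",
}
attributes = {"time": [1, 1, 2, 2], "rep": [1, 2, 1, 2]}

time_key = resolve_attribute_key(meta, "PrimaryKeyGroupID")
rep_key = resolve_attribute_key(meta, "SecondaryKeyGroupID")
print(time_key, attributes[time_key])
print(rep_key, attributes[rep_key])
```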
$PrimaryKeyGroupID : Re-defines the PrimaryKeyGroupID attribute key name.
$SecondaryKeyGroupID : Re-defines the SecondaryKeyGroupID attribute key name.
$KeywordOfNA : keyword regarded as N/A (missing) values.
Other meta data keys are simply ignored when reading EDF files.
Note: N/A values are supported only by SiGN-SSM.
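To illustrate what a missing-value keyword means for a reader, the Python sketch below converts the fields of one data line to numbers and maps the keyword to NaN. This is only a sketch: the helper `parse_values` is hypothetical, the keyword `NA` is an arbitrary example, and representing missing values as NaN is an assumption rather than a description of how SiGN-SSM stores them.

```python
import math


def parse_values(fields, na_keyword=None):
    """Convert string fields to floats, mapping na_keyword to NaN.

    Hypothetical helper; NaN is this sketch's choice of marker.
    """
    values = []
    for field in fields:
        if na_keyword is not None and field == na_keyword:
            values.append(math.nan)   # missing value
        else:
            values.append(float(field))
    return values


# Made-up data line in which the second sample is missing.
row = parse_values(["1.1", "NA", "3.3"], na_keyword="NA")
missing = [math.isnan(v) for v in row]
print(row)
print(missing)
```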
@PrimaryKeyGroupID : Used as time point IDs.
@SecondaryKeyGroupID : Used as replicate IDs.
By default, SiGN assumes that the delimiter of EDF files is a tab (i.e. a tab-separated text file).
If the extension of the file name is csv (i.e. the file name ends with .csv), SiGN assumes that the delimiter is a comma instead of a tab.
In SiGN-SSM, the attribute keys PrimaryKeyGroupID and SecondaryKeyGroupID are already renamed to time and replicate.
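The delimiter rule can be mirrored in a script that prepares or checks EDF files. The Python sketch below is an illustration only; `edf_delimiter` is a hypothetical name, and matching the `.csv` suffix case-sensitively is an assumption, since this page does not say how upper-case extensions are treated.

```python
def edf_delimiter(filename):
    """Pick the field delimiter from the file name, following the rule above.

    Hypothetical helper; the suffix test is case-sensitive by assumption.
    """
    return "," if filename.endswith(".csv") else "\t"


# Made-up file names and lines.
csv_fields = "gene1,1.1,2.2".split(edf_delimiter("expression.csv"))
tab_fields = "gene1\t1.1\t2.2".split(edf_delimiter("expression.edf"))
print(csv_fields)
print(tab_fields)
```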
The following keys and their values are available for the -I option when the program reads an EDF file.
allow_nan= { on | off }
nan= str
Here are the output network file formats supported by SiGN programs. Some formats accept optional arguments when outputting files. For example, in SiGN-BN, you can specify the arguments with the --output-args option in the comma-separated key=value style. The available arguments are listed in the explanation of each file format below.
CSML (Cell System Markup Language) is an XML-based network format originally developed for describing simulatable cell system networks used in Cell Illustrator.
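For readers who generate such argument strings from scripts, the comma-separated key=value style can be pictured with the Python sketch below. It does not reproduce SiGN's own option parser: `parse_output_args` is a hypothetical helper, and treating a key written without a value as `on` is an assumption based on how the optional `[ =on ]` notation on this page is read here.

```python
def parse_output_args(arg_string):
    """Parse a comma-separated key=value string into a dictionary.

    Hypothetical helper; a bare key is assumed to mean 'on'.
    """
    args = {}
    for item in arg_string.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty items such as a trailing comma
        key, sep, value = item.partition("=")
        args[key.strip()] = value.strip() if sep else "on"
    return args


parsed = parse_output_args("header=on,prop")
print(parsed)
```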
SiGN Native network format.
Simple tab-separated edge list. By default, in a TXT format file, each tab-separated line corresponds to an edge, with the parent node in the first column and its child node in the second column. Note that the TXT format file contains only nodes that are connected to others. This means that the TXT format does not contain the full information of the estimated network. Thus, we do not recommend relying on this format alone for important results.
header [ =on ]
H [ =on ]
prop [ =on ]
P [ =on ]
nprop [ =on ]
N [ =on ]
name [ =on ]
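A file in the default TXT layout (parent in the first column, child in the second, none of the optional arguments applied) can be loaded into a simple adjacency structure. The Python sketch below is an illustration under that assumption; `read_edge_list` is a hypothetical helper. It also makes the limitation noted above concrete: a node with no edges never appears in the file, so it cannot appear in the result.

```python
def read_edge_list(text):
    """Build a parent -> children mapping from a default-layout TXT edge list.

    Hypothetical helper; assumes two tab-separated columns per line.
    """
    children = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        parent, child = fields[0], fields[1]
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])  # make sure leaf nodes are listed too
    return children


# Made-up network: gene1 is the parent of gene2 and gene3.
edges = "gene1\tgene2\ngene1\tgene3\n"
network = read_edge_list(edges)
print(network)
```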
This is not really a network format, but it can be specified as an output network format.
The NODELIST file is a tab separated text file containing the list of nodes, instead of edges. Each line describes the properties of nodes such as hubness.
This is an output-only format, so it cannot be specified when you read a network from a file.
betweenness
closeness
This is a simple parent list format using base64 encoding, designed to reduce the size of the networks produced over a large number of iterations of the bootstrap method.
It saves only the network structure. It does not save node names or any other information.
Copyright © 2010-2021 Yoshinori Tamada, Hirosaki University, Kyoto University, SiGN Project Members and Laboratory of DNA Information Analysis & Laboratory of Sequence Analysis, Human Genome Center, Institute of Medical Science, The University of Tokyo. All Rights Reserved.