SiGN File Formats


Introduction
Expression Data Format (EDF)
Network File Formats

Introduction

This page describes the file formats supported by SiGN.

(Last update: Oct. 11, 2021)


Expression Data Format (EDF)

EDF is a text file format designed for representing a gene expression data set, originally developed for Cell Illustrator. It can be used as the input data matrix format for SiGN.

EDF is a tab- or comma-separated text file format. A file consists of three parts: a meta data section, an attribute section, and a data section. In the attribute and data sections, each column corresponds to a sample; in the data section, each line holds the sample values of one gene (probe). The full EDF specification is very large, so SiGN supports only the part that is required for gene network estimation. Here is an example of an EDF file in the tab-separated version.

# This is comment
$Version	1.0
@PrimaryKeyGroupID	1	1	2	2	3	3	3	3
@SecondaryKeyGroupID	1	2	1	2	1	2	2	3
gene1	1.1	2.2	3.3	4.4	5.5	6.6	101.0
gene2	7.7	8.8	9.9	10.1	11.11	12.12	102.0
gene3	13.13	14.14	15.15	16.16	17.17	18.18	103.0

In the meta data section, each line starts with '$' and defines global information about this data set or attributes in the following attribute section.

In the attribute section, each line starts with '@' and defines the attributes of expression samples. The first column is an attribute key followed by the attributes of samples. In the above example, two attribute keys PrimaryKeyGroupID and SecondaryKeyGroupID are specified for the samples.

In the data section, each line represents the expression data of a gene (or a probe). The first column is the name of the gene, followed by its expression values.

Both the meta data section and the attribute section can be omitted. The simplest form of the input data is therefore as follows.

gene1	1.1	2.2	3.3	4.4	5.5	6.6	101.0
gene2	7.7	8.8	9.9	10.1	11.11	12.12	102.0
gene3	13.13	14.14	15.15	16.16	17.17	18.18	103.0

This represents an input data file consisting of three genes with seven samples.
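
The three-part layout described above can be sketched as a small reader. This is only an illustration, not SiGN's own parser: the function name parse_edf is hypothetical, comments are assumed to be whole lines starting with '#' (as in the first example), and every data value is assumed to be numeric.

```python
import csv
import io


def parse_edf(text, delimiter="\t"):
    """Split EDF text into meta data, attribute, and data sections."""
    meta, attributes, data = {}, {}, {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row or not row[0].strip():
            continue  # skip blank lines
        first = row[0]
        if first.startswith("#"):
            continue  # assumed: '#' marks a whole-line comment
        if first.startswith("$"):
            meta[first[1:]] = row[1:]        # meta data section
        elif first.startswith("@"):
            attributes[first[1:]] = row[1:]  # attribute section
        else:
            data[first] = [float(v) for v in row[1:]]  # data section
    return meta, attributes, data


sample = (
    "# This is comment\n"
    "$Version\t1.0\n"
    "@PrimaryKeyGroupID\t1\t2\t3\n"
    "gene1\t1.1\t2.2\t3.3\n"
    "gene2\t7.7\t8.8\t9.9\n"
)
meta, attributes, data = parse_edf(sample)
print(meta)
print(attributes)
print(data["gene1"])
```

Because the meta data and attribute sections are optional, the same function also accepts a file that contains only data lines; the first two dictionaries are then simply empty.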

Some sample files are available at the SiGN-BN DOWNLOAD page.

Primary and Secondary Key Group IDs

The PrimaryKeyGroupID and SecondaryKeyGroupID attributes shown above are predefined attributes that specify the primary and secondary IDs of samples. In SiGN, the primary ID corresponds to the time point and the secondary ID corresponds to the replicate ID. In EDF, columns can appear in any order as long as the primary and secondary IDs are given; the program automatically extracts the required values using the primary and secondary IDs in the file. IDs must start at 1 (minimum ID = 1). These two IDs are particularly important if you want to use time series data.
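
To illustrate why column order does not matter once both IDs are present, the following sketch sorts column indices by (primary ID, secondary ID). It is not SiGN's implementation; the function name order_columns is hypothetical, and the check that IDs are at least 1 reflects the minimum ID stated above.

```python
def order_columns(primary_ids, secondary_ids):
    """Return column indices sorted by (primary ID, secondary ID)."""
    if len(primary_ids) != len(secondary_ids):
        raise ValueError("ID lists must have the same length")
    keys = [(int(p), int(s)) for p, s in zip(primary_ids, secondary_ids)]
    if any(p < 1 or s < 1 for p, s in keys):
        raise ValueError("IDs must be 1 or greater")
    return sorted(range(len(keys)), key=lambda i: keys[i])


# Columns given in shuffled order, as (time point, replicate) pairs:
# (2,1), (1,2), (1,1), (2,2)
primary = ["2", "1", "1", "2"]
secondary = ["1", "2", "1", "2"]
order = order_columns(primary, secondary)

values = [3.3, 2.2, 1.1, 4.4]
print(order)                       # [2, 1, 0, 3]
print([values[i] for i in order])  # [1.1, 2.2, 3.3, 4.4]
```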

Redefinition of attribute keys

In principle, every predefined attribute key can be re-defined in the meta data section. To re-define PrimaryKeyGroupID, use the $PrimaryKeyGroupID meta data key and specify the new name. For example,

...
$PrimaryKeyGroupID	time
@time	1	2	3	4
gene1	10.5	12.1	43.1	23.4
...

this re-defines PrimaryKeyGroupID as time and uses the new attribute name time in the attribute section. The SecondaryKeyGroupID attribute key can likewise be re-defined with the $SecondaryKeyGroupID meta data key.

Here is another example:

$Version	1.0
$PrimaryKeyGroupID	time
$SecondaryKeyGroupID	rep
@time	1	1	2	2	3	3	3	3
@rep	1	2	1	2	1	2	2	3
gene1	1.1	2.2	3.3	4.4	5.5	6.6	101.0
gene2	7.7	8.8	9.9	10.1	11.11	12.12	102.0
gene3	13.13	14.14	15.15	16.16	17.17	18.18	103.0
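
The renaming rule can be sketched as a lookup: if the meta data section re-defines a predefined key, use the new name, otherwise fall back to the predefined one. The helper name resolve_key and the dictionary layout (meta data key mapped to its list of values) are assumptions made for this illustration only.

```python
def resolve_key(meta, default_key):
    """Return the attribute name to look up for a predefined key."""
    renamed = meta.get(default_key)
    return renamed[0] if renamed else default_key


# Hypothetical parsed sections: only the primary key is re-defined here.
meta = {"Version": ["1.0"], "PrimaryKeyGroupID": ["time"]}
attributes = {
    "time": ["1", "2", "3", "4"],
    "SecondaryKeyGroupID": ["1", "1", "1", "1"],
}

primary_name = resolve_key(meta, "PrimaryKeyGroupID")
secondary_name = resolve_key(meta, "SecondaryKeyGroupID")
print(primary_name)                # time
print(secondary_name)              # SecondaryKeyGroupID
print(attributes[primary_name])    # ['1', '2', '3', '4']
```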

Supported meta data keys in SiGN

$PrimaryKeyGroupID : Re-defines the PrimaryKeyGroupID attribute key name.

$SecondaryKeyGroupID : Re-defines the SecondaryKeyGroupID attribute key name.

$KeywordOfNA : Specifies the keyword regarded as an N/A (missing) value.

Other meta data keys are simply ignored when reading EDF files.

Note: N/A values are supported only by SiGN-SSM.

Supported attribute keys in SiGN

@PrimaryKeyGroupID : Used as time point IDs.

@SecondaryKeyGroupID : Used as replicate IDs.

Comma vs Tab

By default, SiGN assumes that the delimiter of EDF files is a tab (i.e., the file is a tab-separated text file).

If the extension of the file name is csv (i.e., the file name ends with .csv), SiGN assumes that the delimiter is a comma instead of a tab.
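
The delimiter rule amounts to a single check on the file name. The function name edf_delimiter and the file names below are hypothetical; whether SiGN compares the extension case-insensitively is not stated here, so this sketch matches ".csv" exactly.

```python
def edf_delimiter(filename):
    """Return ',' for file names ending with .csv, otherwise a tab."""
    return "," if filename.endswith(".csv") else "\t"


print(repr(edf_delimiter("expression.csv")))  # ','
print(repr(edf_delimiter("expression.edf")))  # '\t'
print(repr(edf_delimiter("expression.txt")))  # '\t'
```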

SiGN-SSM

In SiGN-SSM, the attribute keys PrimaryKeyGroupID and SecondaryKeyGroupID are already renamed to time and replicate.

Input parameters

The following keys and their values are available for the -I option when the program reads an EDF file.

allow_nan= { on | off }

Specifies whether NaN (Not a Number), i.e. missing, values are allowed in the file. How NaN values are treated depends on the score function. Currently, only the BNRCMV score function accepts NaN values.

nan= str

The string that is treated as a NaN value. By default, "NAN" is used.
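
The effect of these two parameters on a single cell can be sketched as follows. This is an illustration rather than SiGN's code: the function name parse_value is hypothetical, the keyword comparison is assumed to be exact and case-sensitive, and raising an error when NaN values are not allowed is also an assumption.

```python
import math


def parse_value(token, allow_nan=False, nan="NAN"):
    """Convert one EDF cell to a float, mapping the NaN keyword to math.nan."""
    if token == nan:
        if not allow_nan:
            raise ValueError(f"NaN keyword {token!r} found but allow_nan is off")
        return math.nan
    value = float(token)  # raises ValueError for any other non-numeric text
    if math.isnan(value):
        # Python's float() itself accepts spellings such as "nan"; reject them
        # so that only the configured keyword marks a missing value.
        raise ValueError(f"unexpected NaN spelling {token!r}")
    return value


row = ["1.1", "NA", "3.3"]
values = [parse_value(v, allow_nan=True, nan="NA") for v in row]
print(values)  # [1.1, nan, 3.3]
```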

Network File Formats

The following output network file formats are supported by SiGN programs. Some formats accept optional arguments when writing files. For example, in SiGN-BN, you can specify the arguments with the --output-args option in a comma-separated key=value format. The available arguments are listed in the description of each file format below.
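
The comma-separated key=value style can be sketched as a small parser. The function name parse_output_args is hypothetical, and treating a key given without a value as "on" is an assumption based on the optional [ =on ] notation used for the arguments on this page; it is not a description of how SiGN parses the option internally.

```python
def parse_output_args(spec):
    """Parse 'key=value,key2' into a dict; a bare key is treated as 'on'."""
    args = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty items such as a trailing comma
        key, sep, value = item.partition("=")
        args[key] = value if sep else "on"
    return args


print(parse_output_args("header=on,prop"))
# {'header': 'on', 'prop': 'on'}
```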

CSML

CSML (Cell System Markup Language) is an XML-based network format originally developed for describing simulatable cell system networks used in Cell Illustrator.

SGN3

The SiGN native network format.

TXT

A simple tab-separated edge list. By default, each line of a TXT format file corresponds to an edge, with the parent node in the first column and its child node in the second column. Note that a TXT format file contains only nodes that are connected to other nodes, so it does not contain the full information of the estimated network. We therefore do not recommend relying on this format alone for important results.
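
Reading the default TXT layout back can be sketched as below. The function name read_edge_list and the sample edges are hypothetical, and the sketch assumes the default output (no header line and no extra property columns).

```python
def read_edge_list(text):
    """Read (parent, child) pairs from a tab-separated edge list."""
    edges = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        columns = line.split("\t")
        edges.append((columns[0], columns[1]))  # parent first, child second
    return edges


sample = "gene1\tgene2\ngene1\tgene3\n"
edges = read_edge_list(sample)
print(edges)  # [('gene1', 'gene2'), ('gene1', 'gene3')]

# Only nodes that appear in some edge can be recovered from this format.
nodes = sorted({name for edge in edges for name in edge})
print(nodes)  # ['gene1', 'gene2', 'gene3']
```

A node without any edge never appears in such a file, which is why the node set recovered this way can be smaller than the set of genes in the input data.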

Output arguments

header [ =on ]
H [ =on ]

Inserts a header at the first line that explains the meaning of each column.

prop [ =on ]
P [ =on ]

Outputs the additional properties of edges after the names of the parent and the child.

nprop [ =on ]
N [ =on ]

Outputs the additional properties of nodes after the names of the parent and the child, and after the edge properties if those are specified.

name [ =on ]

Includes a name column containing the unique edge names.

NODELIST

This is not really a network format, but it can be specified as an output network format.

The NODELIST file is a tab-separated text file containing a list of nodes instead of edges. Each line describes the properties of a node, such as its hubness.

This is an output-only format and therefore cannot be specified when you read a network from a file.

Output arguments

betweenness

Calculates betweenness centrality values for all nodes.

closeness

Calculates closeness centrality values for all nodes. This takes edge direction into account.

BSF

This is a simple parent list format using base64 encoding, designed to reduce the size of the network files produced over a large number of bootstrap iterations.

This format saves only the network structure. It does not save node names or any other information.


Copyright © 2010-2021 Yoshinori Tamada, Hirosaki University, Kyoto University, SiGN Project Members and Laboratory of DNA Information Analysis & Laboratory of Sequence Analysis, Human Genome Center, Institute of Medical Science, The University of Tokyo. All Rights Reserved.