INGOR
ytGDF Class Reference

Reads and writes ytData in GDF format.

#include <ytGDF.h>

Public Member Functions

ytData * ytGDF_read_fp (FILE *fp, ytKeyValues *args)
 Reads a GDF file from a file stream.
 
void ytGDF_write_fp (FILE *fp, ytData *data, ytKeyValues *args)
 Writes a ytData instance in GDF into a file stream.
 

Detailed Description

Reads and writes ytData in GDF format.

GDF is a tab-delimited text file format for various kinds of input data. It can represent both row=sample and column=sample style data. In the row=sample style, each row represents a sample (a case) and each column represents a variable (an attribute of a sample). The column=sample style is the opposite: each row represents a variable and each column a sample.

Here is an example of a GDF file.

$GDF 1.0
# This is a comment line
$NA nan
@type cont cont disc
@name c1 c2 c3
r1 1.0 1.5 0
r2 2.2 nan 1

There are four sections in a GDF file. The first section is the meta data section, in which users can describe attributes of the data set itself. Meta data are pairs of a meta key and its value. A meta key begins with "$" and is followed by a tab and its value. Every line that begins with "$" is a single meta data entry in the meta data section.

The second section is the column attribute section, which specifies attributes for columns. The example above specifies two attribute keys, "type" and "name", for columns.

In addition to column attributes, GDF can define attributes for rows. The third section is the row attribute section. It is a single row consisting of attribute keys, each beginning with "%". Here is an example.

$GDF 1.0
- @type cont cont
- @name c1 c2
%name %alias
r1 R1 1.0 1.5
r2 R2 2.2 2.4

If the row attribute section is omitted, the first column is assumed to hold the names of the rows. If you specify multiple row attribute keys, the column attribute section needs empty cells so that the attributes stay aligned. A hyphen (minus) "-" is used to fill these empty cells.
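
As an illustration of how the sections can be told apart, the following C sketch classifies a single line by its leading characters. It is not the ytGDF parser: the function gdf_classify_line and its enum are hypothetical names, the "@" prefix for column attribute keys and the "#" prefix for comments are inferred from the examples above, cells are assumed to be separated by a single tab, and every remaining line is assumed to be a data row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical line categories; not part of ytGDF.h. */
typedef enum {
    GDF_LINE_EMPTY,
    GDF_LINE_COMMENT,   /* begins with "#" (inferred from the example) */
    GDF_LINE_META,      /* begins with "$" */
    GDF_LINE_COL_ATTR,  /* first non-filler cell begins with "@" */
    GDF_LINE_ROW_ATTR,  /* begins with "%" */
    GDF_LINE_DATA       /* anything else (assumption) */
} gdf_line_kind;

/* Classifies one tab-delimited line of a GDF file. */
gdf_line_kind gdf_classify_line(const char *line)
{
    assert(line != NULL);
    const char *p = line;

    if (*p == '\0' || *p == '\n')
        return GDF_LINE_EMPTY;
    if (*p == '#')
        return GDF_LINE_COMMENT;
    if (*p == '$')
        return GDF_LINE_META;
    if (*p == '%')
        return GDF_LINE_ROW_ATTR;

    /* Skip leading "-" filler cells used to align column attributes. */
    while (p[0] == '-' && p[1] == '\t')
        p += 2;
    if (*p == '@')
        return GDF_LINE_COL_ATTR;

    return GDF_LINE_DATA;
}
```

A cell such as "-1.0" is not mistaken for a filler, because a filler must be a lone hyphen followed directly by a tab.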

Two IDs are defined to distinguish samples: the primary ID and the secondary ID. The primary ID identifies an individual, the same experiment under different conditions, and so on. The secondary ID is the time index for the dynamic model; that is, samples at two consecutive indices are assumed to depend on each other.

By default, the "name" attribute is used as the primary ID, and the "time" attribute, a 1-origin integer index, is used as the secondary ID.

$GDF 1.0
- @type cont cont
- @name c1 c2
%name %time
id1 1 1.0 5.1
id1 2 2.0 5.2
id1 3 3.0 5.3
id2 1 1.1 3.3
id2 2 2.1 3.4
id2 3 3.1 3.5
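
The relation between the two IDs in this example can be checked mechanically. The sketch below verifies that, within each primary ID, the secondary IDs run 1, 2, 3, and so on. The function gdf_times_consecutive is a hypothetical helper, not part of ytGDF, and it assumes that rows sharing a primary ID are contiguous, as they are in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if, for every run of equal primary IDs, the secondary IDs
 * are the 1-origin consecutive integers 1, 2, 3, ...; returns 0 otherwise.
 * Assumes rows with the same primary ID are adjacent.
 */
int gdf_times_consecutive(const char *const *ids, const int *times, size_t n)
{
    assert(ids != NULL && times != NULL);

    for (size_t i = 0; i < n; i++) {
        /* A new primary ID restarts the time index at 1. */
        int starts_new = (i == 0) || strcmp(ids[i], ids[i - 1]) != 0;
        int expected = starts_new ? 1 : times[i - 1] + 1;
        if (times[i] != expected)
            return 0;
    }
    return 1;
}
```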

The GDF format is a superset of the EDF format, which is used in the SiGN software. Therefore, the GDF routines can read and write an EDF file as a GDF format file. EDF and GDF differ in several respects, so use the edf option (argument) when reading an EDF format file. In INGOR, give an argument like "-I edf".

Supported meta keys

$GDF

The GDF format version. This meta key should be on the first line so that it can serve as a marker identifying the file type.

$KeywordOfNA
$NA
$NAN

Keyword (string expression) of NaN (Not a Number) and missing values. The default is "NA".

$PrimaryKey
$PrimaryKeyGroupID

Name of the attribute used as the primary ID of samples. The primary ID is used to identify the same individual, gene, and so on. The default is "name".

$SecondaryKey
$SecondaryKeyGroupID

Name of the attribute used as the secondary ID of samples. The secondary ID is used to identify samples observed at the same time, year, and so on. The default is "time".

$PrimaryKeyType

Attribute value type for primary IDs. The default type is "string". (The primary ID key is "name".) The possible values are string, integer, and double.
string : character string.
integer : 1-origin integer value.
double : double-precision floating-point real value.

$SecondaryKeyType

Attribute value type for secondary IDs. The default type is "integer". (The secondary ID key is "time".) See above for possible values and their meanings.
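
The keyword declared with "$KeywordOfNA", "$NA" or "$NAN" has to be compared with each data cell. The documentation of the reader's na= argument states that the keyword is case insensitive, so a comparison could look like the sketch below. gdf_is_missing is a hypothetical helper, not a ytGDF function; it avoids the POSIX-only strcasecmp so that it compiles with any standard C library.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Returns 1 if the cell equals the missing-value keyword,
 * ignoring letter case; returns 0 otherwise.
 */
int gdf_is_missing(const char *cell, const char *na_keyword)
{
    assert(cell != NULL && na_keyword != NULL);

    while (*cell != '\0' && *na_keyword != '\0') {
        /* Cast to unsigned char: tolower() is undefined for negative values. */
        if (tolower((unsigned char)*cell) != tolower((unsigned char)*na_keyword))
            return 0;
        cell++;
        na_keyword++;
    }
    /* Both strings must end together for a full match. */
    return *cell == '\0' && *na_keyword == '\0';
}
```

With the default keyword "NA", the cells "NA", "na" and "Na" all count as missing, while "nan" does not unless the keyword is changed.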

Supported attribute keys

type, typeID

Specifies the type of values in columns or rows. This can appear either as a row attribute or a column attribute.

continuous, cont, real, c

Continuous, floating-point real values.

ordinal, integer, int

Ordinal integer numbers.

discrete, disc, d

Discrete values represented by 0-origin integers, i.e., 0, 1, ...

categorical, cat, nominal

Discrete values represented by string keywords.

name

Column/row names.

alias

Alias names. An alias is a second name for a variable or a sample.

time

This is available only for samples. By default, time attributes are 1-origin consecutive integers (values beginning from 1). The key name "time" can be changed with the "$SecondaryKey" meta key.
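
The type keywords listed under "type, typeID" map several spellings onto four value types. A table lookup is one way to express that mapping. The sketch below is illustrative only: gdf_parse_type and the gdf_type enum are hypothetical names, and exact, case-sensitive matching is an assumption, since this page does not say how the library compares type keywords.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical value types; not the identifiers used by ytGDF. */
typedef enum {
    GDF_TYPE_UNKNOWN,
    GDF_TYPE_CONTINUOUS,
    GDF_TYPE_ORDINAL,
    GDF_TYPE_DISCRETE,
    GDF_TYPE_CATEGORICAL
} gdf_type;

/* Maps a type keyword from the list above to its value type. */
gdf_type gdf_parse_type(const char *word)
{
    static const struct {
        const char *name;
        gdf_type type;
    } table[] = {
        {"continuous", GDF_TYPE_CONTINUOUS}, {"cont", GDF_TYPE_CONTINUOUS},
        {"real", GDF_TYPE_CONTINUOUS},       {"c", GDF_TYPE_CONTINUOUS},
        {"ordinal", GDF_TYPE_ORDINAL},       {"integer", GDF_TYPE_ORDINAL},
        {"int", GDF_TYPE_ORDINAL},
        {"discrete", GDF_TYPE_DISCRETE},     {"disc", GDF_TYPE_DISCRETE},
        {"d", GDF_TYPE_DISCRETE},
        {"categorical", GDF_TYPE_CATEGORICAL}, {"cat", GDF_TYPE_CATEGORICAL},
        {"nominal", GDF_TYPE_CATEGORICAL}
    };

    assert(word != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(word, table[i].name) == 0)
            return table[i].type;
    }
    return GDF_TYPE_UNKNOWN;   /* keyword not in the documented list */
}
```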

Member Function Documentation

◆ ytGDF_read_fp()

ytData * ytGDF_read_fp ( FILE *  fp,
ytKeyValues *  args 
)

Reads a GDF file from a file stream.

The following key-value arguments are accepted. They override the settings written in the file.

na=string
nan=string

The keyword representing a missing value. If a data value is identical to string, it is regarded as NaN (not a number). By default, "NA" is used. The keyword is case insensitive.
This overrides the keyword specified by the meta data key "$KeywordOfNA".

type=string

The default variable type. The value string can be any type name defined in GDF.

types=type1:type2 ...

Variable types.

label_cols=n
l=n

The number of columns that are not data values. For example, specify n=0 if the file does not have any label columns.

header_rows=n
h=n

The number of rows that are not data values. The first n lines will be ignored when reading a file.

name_row=n

Line number of the row that contains the names of columns. This is useful when you read a simple tab-delimited text file, which often has column names in the first row.

row_var

Each row represents a variable. Specify this for reading an EDF file.

col_var

Each column represents a variable. This is the default.

edf

EDF mode (SiGN compatible mode). This sets "PrimaryKeyGroupID" to the secondary key name, so that its values are handled as consecutive time points for the dynamic model, and sets "SecondaryKeyGroupID" to the primary key name. (Note: The meanings of the primary and secondary IDs are swapped between GDF and EDF.)

write_edf=file

Writes the data that was read into the specified file in EDF format.

empty

Ignores (allows) empty cells in the data section. By default, an empty cell in the data section of the input file causes an error.

assume_real

Assumes all the variables to be real values. This does not automatically convert discrete values into one-hot vectorized data. For categorical data, this uses the internal indices of categories as their real values.

split_xy

If this is specified, the first half of the samples is regarded as data for the explanatory variables and the second half as data for the objective variables in the regression model. This is mainly intended for the dynamic model, when the input data have been converted manually.

csv

Uses a comma character as the field delimiter. By default, a TAB character is used.

rhl

This is an alias for row_var,name_row=1,label_cols=1. That is, each row corresponds to a variable, the first row is a header row and represents the names of samples (columns), and the first column represents the names of variables.

chl

This is an alias for col_var,name_row=1,label_cols=1. That is, each column corresponds to a variable, the first row is a header row and represents the names of variables, and the first column represents the names of samples.
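
Several of the arguments above (csv, empty, label_cols) concern how a line is divided into cells. The following sketch shows one way to split a line on a configurable delimiter while keeping empty cells, which a strtok-based split would silently drop. gdf_split_cells is a hypothetical helper, not the routine used by ytGDF_read_fp, and it does not handle quoted CSV fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Splits line in place on delim ('\t' by default in GDF, ',' for csv).
 * Stores pointers to at most max_cells cells in cells[] and returns the
 * total number of cells found, including empty ones.
 */
size_t gdf_split_cells(char *line, char delim, char **cells, size_t max_cells)
{
    assert(line != NULL && cells != NULL);

    line[strcspn(line, "\r\n")] = '\0';   /* drop the line terminator */

    size_t count = 0;
    char *start = line;
    for (;;) {
        char *end = strchr(start, delim);
        if (end != NULL)
            *end = '\0';                  /* terminate the current cell */
        if (count < max_cells)
            cells[count] = start;
        count++;
        if (end == NULL)
            break;
        start = end + 1;
    }
    return count;
}
```

A caller can then test each cell for the empty string to decide whether to report an error or, as with the empty argument, accept it.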

◆ ytGDF_write_fp()

void ytGDF_write_fp ( FILE *  fp,
ytData *  data,
ytKeyValues *  args 
)

Writes a ytData instance in GDF into a file stream.

Arguments

na=string
nan=string

String used for NA, NaN (Not a Number), and missing values.

edf

Outputs in the EDF format.

tsv

Outputs a simple tab-separated text file. The header row is output as the first row.

v=n

Verbose level. (default: n=0)

Parameters

fp    file stream to which the given data is output.
data  data to output.
args  output arguments.
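
The na= argument decides how a missing value appears in the output. As a rough sketch of that substitution, the helper below formats one numeric cell and writes the chosen keyword when the value is NaN. gdf_format_value is a hypothetical name, and the "%g" format is an assumption; this page does not state which numeric format ytGDF_write_fp uses.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Formats one cell into buf. NaN is written as na_keyword, any other
 * value with "%g". Returns the result of snprintf().
 */
int gdf_format_value(char *buf, size_t size, double value, const char *na_keyword)
{
    assert(buf != NULL && na_keyword != NULL);

    if (value != value)   /* only NaN compares unequal to itself */
        return snprintf(buf, size, "%s", na_keyword);
    return snprintf(buf, size, "%g", value);
}
```

Note that "%g" keeps only six significant digits, so a real writer would probably choose a wider format to round-trip double values.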
