Haplin reads data in two formats:
Haplin’s own text file format;
PED format (as generated by plink 1.9
).
Both types of data are read in through the use of
genDataRead
function:
dir.exmpl <- system.file( "extdata", package = "Haplin" )
exemplary.file1 <- paste0( dir.exmpl, "/HAPLIN.trialdata.txt" )
my.gen.data.haplin <- genDataRead( file.in = exemplary.file1, file.out = "trial_data1",
dir.out = ".", format = "haplin", n.vars = 0 )
## Reading the data in chunks...
## -- chunk 1 --
## -- chunk 2 --
## ... done reading.
## Reading the marker names...
## Warning: No map file given, map file empty or the number of map file rows not
## equal to the number of markers in data; will generate dummy marker names.
## ...done
## Preparing data...
## ... done preparing
## Saving data...
## ... saved to files: ./trial_data1_gen.ffData, ./trial_data1_gen.RData
exemplary.file3 <- paste0( dir.exmpl, "/exmpl_data.ped" )
my.gen.data <- genDataRead( exemplary.file3, file.out = "ped_data", dir.out = ".",
format = "ped" )
## 'n.vars' was not given explicitly and will be set to 6 based on the format given.
## Reading the data in chunks...
## -- chunk 1 --
## -- chunk 2 --
## ... done reading.
## Reading the marker names...
## Warning: No map file given, map file empty or the number of map file rows not
## equal to the number of markers in data; will generate dummy marker names.
## ...done
## Preparing data...
## ... done preparing
## Saving data...
## ... saved to files: ./ped_data_gen.ffData, ./ped_data_gen.RData
The function reads in all the data in the file, creates
ff
objects to store the genetic information and
data.frame
to store covariate data (if any). These objects
are saved in .RData
and .ffData
files, which
can be later on easily uploaded to R (with genDataLoad
) and
re-used.
CAUTION: This can take a long time for large datasets (such
as from GWAS analysis, e.g., reading in a 7 GB file will take ca.15
minutes), however, this needs to be run only once and then, the next
time you need to use the data, use the genDataLoad
function
(see section Re-using the data, below). Be careful
NOT TO DELETE the output files .ffData and .RData!
The genDataRead
function returns a list object with
three elements:
data.frame
with covariate data
(if available in the input file);To see all the available arguments and usage examples, type:
(this works also with any other function)
The PED-formatted data has by default 6 first columns reserved for
family information and covariates. However, if the user has covariate
data that are in a separate file, this can be read together with the
genotype data, using the same function genDataRead
. The
main file can be either in haplin or PED format.
add.cov.file <- paste0( dir.exmpl, "/add_cov_data2.dat" )
my.gen.data.haplin3 <- genDataRead( file.in = exemplary.file1, file.out = "trial_data3",
dir.out = ".", format = "haplin", n.vars = 0, cov.file.in = add.cov.file )
## Reading the data in chunks...
## -- chunk 1 --
## -- chunk 2 --
## ... done reading.
## Reading the marker names...
## Warning: No map file given, map file empty or the number of map file rows not
## equal to the number of markers in data; will generate dummy marker names.
## ...done
## Preparing data...
## ... done preparing
## Reading covariate file...
## 'cov.header' not given - assuming the first line is the header...
## ...done
## Saving data...
## ... saved to files: ./trial_data3_gen.ffData, ./trial_data3_gen.RData
## This is raw genetic data read in through genDataRead.
##
## It contains the following parts:
## cov.data, gen.data, aux
##
## with following dimensions:
## - covariate variables = smoke, vitamin, pheno1, pheno2
## (total 4 covariate variables),
## - number of markers = 2 ,
## - number of data lines = 249
add.cov.file2 <- paste0( dir.exmpl, "/add_cov_data.dat" )
my.gen.data2 <- genDataRead( exemplary.file3, file.out = "ped_data2", dir.out = ".",
format = "ped", cov.file.in = add.cov.file2 )
## 'n.vars' was not given explicitly and will be set to 6 based on the format given.
## Reading the data in chunks...
## -- chunk 1 --
## -- chunk 2 --
## ... done reading.
## Reading the marker names...
## Warning: No map file given, map file empty or the number of map file rows not
## equal to the number of markers in data; will generate dummy marker names.
## ...done
## Preparing data...
## ... done preparing
## Reading covariate file...
## 'cov.header' not given - assuming the first line is the header...
## ...done
## Saving data...
## ... saved to files: ./ped_data2_gen.ffData, ./ped_data2_gen.RData
## This is raw genetic data read in through genDataRead.
##
## It contains the following parts:
## cov.data, gen.data, aux
##
## with following dimensions:
## - covariate variables = id.fam, id.c, id.f, id.m, sex, cc, smoke, vitamin, pheno1, pheno2
## (total 10 covariate variables),
## - number of markers = 429 ,
## - number of data lines = 1659
NOTE: The file with the additional information should have a header with names of the data columns!
The object created by genDataRead
includes a lot of
information. We have created functions that will help the user to
navigate it.
First of all, when you type in the name of the object, a short summary will be displayed:
## This is raw genetic data read in through genDataRead.
##
## It contains the following parts:
## cov.data, gen.data, aux
##
## with following dimensions:
## - covariate variables = id.fam, id.c, id.f, id.m, sex, cc
## (total 6 covariate variables),
## - number of markers = 429 ,
## - number of data lines = 1659
If you want to show and/or extract part of phenotype information, you
can use the showPheno
function:
## id.fam id.c id.f id.m sex cc
## [1,] "1" "1" "3" "2" "2" "1"
## [2,] "1" "2" "0" "0" "2" "0"
## [3,] "1" "3" "0" "0" "1" "1"
## [4,] "2" "1" "3" "2" "1" "0"
## [5,] "2" "2" "0" "0" "2" "0"
## id.fam id.c id.f id.m sex cc
## [1,] "1" "1" "3" "2" "2" "1"
## [2,] "1" "2" "0" "0" "2" "0"
## [3,] "1" "3" "0" "0" "1" "1"
## [4,] "2" "1" "3" "2" "1" "0"
## [5,] "2" "2" "0" "0" "2" "0"
## [6,] "2" "3" "0" "0" "1" "1"
## [7,] "3" "1" "3" "2" "1" "0"
## [8,] "3" "2" "0" "0" "2" "1"
## [9,] "3" "3" "0" "0" "1" "0"
## [10,] "4" "1" "3" "2" "1" "1"
## [11,] "4" "2" "0" "0" "2" "0"
## [12,] "4" "3" "0" "0" "1" "0"
## [13,] "5" "1" "3" "2" "2" "1"
## [14,] "5" "2" "0" "0" "2" "0"
## [15,] "5" "3" "0" "0" "1" "1"
## [16,] "6" "1" "3" "2" "1" "1"
## [17,] "6" "2" "0" "0" "2" "1"
## [18,] "6" "3" "0" "0" "1" "0"
## [19,] "7" "1" "3" "2" "2" "1"
## [20,] "7" "2" "0" "0" "2" "1"
## id.fam id.c id.f id.m sex cc
## [1,] "2" "1" "3" "2" "1" "0"
## [2,] "2" "2" "0" "0" "2" "0"
## [3,] "2" "3" "0" "0" "1" "1"
## [4,] "3" "1" "3" "2" "1" "0"
## [5,] "3" "2" "0" "0" "2" "1"
## [6,] "3" "3" "0" "0" "1" "0"
## [7,] "4" "1" "3" "2" "1" "1"
## [8,] "4" "2" "0" "0" "2" "0"
## [9,] "4" "3" "0" "0" "1" "0"
## [10,] "5" "1" "3" "2" "2" "1"
## [11,] "5" "2" "0" "0" "2" "0"
## [12,] "5" "3" "0" "0" "1" "1"
## id.fam id.c id.f id.m sex cc
## [1,] "1" "1" "3" "2" "2" "1"
## [2,] "1" "2" "0" "0" "2" "0"
## [3,] "2" "2" "0" "0" "2" "0"
## [4,] "3" "2" "0" "0" "2" "1"
## [5,] "4" "2" "0" "0" "2" "0"
## [6,] "5" "1" "3" "2" "2" "1"
## [7,] "5" "2" "0" "0" "2" "0"
## [8,] "6" "2" "0" "0" "2" "1"
## [9,] "7" "1" "3" "2" "2" "1"
## [10,] "7" "2" "0" "0" "2" "1"
## [11,] "8" "1" "3" "2" "2" "0"
## [12,] "8" "2" "0" "0" "2" "1"
## [13,] "9" "1" "3" "2" "2" "1"
## [14,] "9" "2" "0" "0" "2" "0"
## [15,] "10" "2" "0" "0" "2" "1"
## [16,] "11" "1" "3" "2" "2" "1"
## [17,] "11" "2" "0" "0" "2" "1"
## [18,] "12" "2" "0" "0" "2" "1"
## [19,] "13" "2" "0" "0" "2" "0"
## [20,] "14" "1" "3" "2" "2" "1"
The output can be saved to an object:
## id.fam id.c id.f id.m sex cc
## [1,] "1" "1" "3" "2" "2" "1"
## [2,] "1" "2" "0" "0" "2" "0"
## [3,] "2" "2" "0" "0" "2" "0"
## [4,] "3" "2" "0" "0" "2" "1"
## [5,] "4" "2" "0" "0" "2" "0"
## [6,] "5" "1" "3" "2" "2" "1"
With the functions nindiv
and nfam
, you can
get the number of individuals or number of families in your data:
## [1] 1659
## [1] 550
Note that the nfam
function assumes that your dataset is
from a triad or dyad study, i.e., includes information on the child and
at least one parent.
The function nsnps
will tell us how many markers/SNPs
there is in the data. Be careful since per default it assumes that the
data is triad data (i.e., mother, father, and child were genotyped), so
if your data is from a case-control study, be sure to specify that with
argument design = "cc"
.
## [1] 429
To get the SNP names use:
## [1] "m1" "m2" "m3" "m4" "m5"
## [1] "m12" "m13" "m14" "m15" "m16" "m17" "m18" "m19" "m20" "m21" "m22" "m23"
## [13] "m24" "m25" "m26" "m27" "m28" "m29" "m30" "m31"
To extract and/or show genotypes for specific individuals or markers,
use the showGen
function:
## opening ff /tmp/Rtmpsqe1qO/ff/ff10e4241211a5.ff
## ff (open) byte length=30 (30) dim=c(5,6) dimorder=c(1,2) levels: G <NA> C T A
## l_m10_a l_m10_b l_m15_a l_m15_b l_m121_a l_m121_b
## [1,] T T A T T T
## [2,] T T T T T T
## [3,] T T A T T T
## [4,] A T A T T T
## [5,] A T T T T T
## ff (open) byte length=2010 (2010) dim=c(201,10) dimorder=c(1,2) levels: G <NA> C T A
## l_m1_a l_m1_b l_m2_a l_m2_b l_m3_a l_m3_b l_m4_a l_m4_b l_m5_a l_m5_b
## [1,] G G T A T T G G G G
## [2,] G G A A T T G G G G
## [3,] G G T A T T G G G G
## [4,] G G A A A T G G C G
## [5,] G G T A A T G G C G
## [6,] G G A A A T G G C G
## [7,] G G A A A T G G G G
## [8,] G G A A A T G G G G
## : : : : : : : : : : :
## [194,] G G A A A A G G C G
## [195,] G G A A A A G G G G
## [196,] G G T A A T G G C C
## [197,] G G T A A T G G G G
## [198,] G G T A A T G G G G
## [199,] G G A A T T G G G G
## [200,] C G A A T T G G G G
## [201,] C G A A A T G G C G
As above, this output can be saved to an object:
## ff (open) byte length=1206 (1206) dim=c(201,6) dimorder=c(1,2) levels: G <NA> C T A
## l_m10_a l_m10_b l_m15_a l_m15_b l_m121_a l_m121_b
## [1,] A T T T A T
## [2,] A T T T T T
## [3,] A A T T A T
## [4,] T T T T A T
## [5,] A T T T A T
## [6,] T T T T A T
## [7,] A T T T T T
## [8,] A T T T T T
## : : : : : : :
## [194,] A T T T T T
## [195,] A A T T T T
## [196,] T T T T T T
## [197,] A T T T NA NA
## [198,] A T T T T T
## [199,] A A T T A T
## [200,] A A T T T T
## [201,] A A T T T T
CAUTION: Note that these functions work only with objects
resulting from genDataRead
, and not
genDataPreprocess
, as the preprocessing disturbs the coding
of the data, thus making the output not easy to understand.
After loading the data, it is necessary to pre-process it to the internal format used by Haplin. This is done by evoking the command:
my.prepared.gen.data <- genDataPreprocess( data.in = my.gen.data, map.file =
"my_gen_data.map", design = "triad", file.out = "my_prepared_gen_data",
dir.out = "." )
CAUTION: This action can be very time-consuming for large datasets (e.g., estimated time for ca.45,000 SNPs and 1,600 individuals, a PED file of ca.700MB, is ca.6 minutes on a 8-core CPU). However, this needs to be done only once and the output, stored in small files, can be used for the subsequent analysis repeatedly. (See also section Choosing a subset of data )
This will also create .RData
and .ffData
files, which take much less space than the input PED files. Be careful
not to delete these files, as they can be re-used by simply loading into
R (the genDataLoad
function) right before Haplin
analysis.
NOTE: The information about the
my.prepared.gen.data
object can be displayed by simply
writing the name of the object.
The output of the genDataPreprocess
function (or
genDataLoad
) can then be used to run the analysis.
If you know that you want to focus your analysis on a certain region of the entire SNP set, or perhaps you’re impatient and want to check out Haplin without waiting a long time for the preprocessing to finish, you can easily choose a subset of the data to pre-process and analyze. This can be done with the command:
gen.data.subset <- genDataGetPart( data.in = my.gen.data, markers = c( 3:15,22 ),
design = "triad", file.out = "my_gen_data_subset", dir.out = "." )
This function allows you to specify the subset in various ways:
If you give a combination of these parameters, the result will be the
intersection of the subsets defined by each of the parameters alone. The
subset is then available in the gen.data.subset
object and
written to file.out
file(s). These can be loaded and
re-used multiple times.
To make it easier to choose certain subsets of data, we created the following functions:
getChildren
- extracts only the children’s data from
the triad-design data;getMothers
and getFathers
- extracts only
the fathers or mothers, respectively;getFullTriads
- extracts only the full families, if the
input data contains also dyads.IMPORTANT: Remember that each time you start a new R session, you need to load the data into the memory with the command:
This takes much less time than re-reading and running the data preparations!