||||
|\/|
|/\|
||||
SIB-PAIR 1.00b (26 Sep 2008)
by
David L. Duffy
(1995-2008)
A program for elementary genetical analyses
David L. Duffy, MBBS PhD.
Queensland Institute of Medical Research,
300 Herston Road,
Herston, Queensland 4029, Australia.
Email: davidD@qimr.edu.au
CONTENTS
Program Sib-pair performs a number of simple analyses of family data that
tend to be "nonparametric" or "robust" in nature. It is modelled to some
extent on the Genetic Analysis System [Young, 1995] in terms of the command
language and types of analysis. Included are routines for:
- Imputation of genotypes.
- Estimation of allele frequencies in codominant genetic systems.
- Simple and complex segregation analysis of a binary trait.
- Estimation of familial correlations and sibship variances for a
quantitative trait. Variance components analysis of quantitative and
binary traits using a variety of likelihoods.
- Haseman-Elston sib-pair regression of a quantitative (or binary) trait
using full and half-sib data, and variance components linkage
analysis for normally distributed quantitative traits.
- Carrying out multiple versions of the transmission-disequilibrium test.
- Testing allelic association with a binary or quantitative trait -- Monte
Carlo simulation of null distribution of simple tests, or
"measured genotype" familial analysis including combined
association and linkage analysis.
- Single locus Affected Pedigree Method identity-by-state and
identity-by-descent linkage analysis. This includes Wards [1993] extensions
to include unaffected pedigree members.
- Writing out of locus and pedigree files in the formats used by the
programs APM, Arlequin, ASPEX, CRIMAP, FISHER, GAS, GDA, Genehunter,
LINKAGE, LOKI, MENDEL, MERLIN, PAP and SAGE.
An example of use
Sib-pair is command line oriented, and writes output to the standard
output (the screen if you are using it interactively). Output can be
saved to a file by diverting it to a file using the "out"
command (in which case you will no longer see it coming to the screen).
Alternatively, you may use another program that can copy from the
standard output, such as tee, emacs (the powerful editor
usually found on Unix systems), syn (a powerful Windows editor)
or the Tcl/Tk based GUI program isp which can found on the same
site as Sib-pair itself.
Bearing this in mind, here is a sample interactive session. We start
at the command line of your operating system shell:
> sib-pair
|||| SIB-PAIR: A program for simple genetic analysis
|\/| Version : Version 1.00.beta (12-Sep-2008)
|/\| Author : David L Duffy (c) 1995-2008
|||| Job run : Tue Sep 16 12:25:34 2008 (moonboom)
Type "help" for help, "quit" to quit, "ctrl-C" to interrupt.
>>
A double arrow command prompt appears. We read in a previously prepared
script:
>> include ex.in
The contents of ex.in are a series of Sib-pair commands:
# declare four loci
set loc a affection
set loc b quantitative
set loc m1 marker 0.0 cM
set loc m2 marker 5.1 cM
# read the pedigree data
read ped inline
ex1 1a x x m n x 1 3 1 2
ex1 1b x x f n x 1 2 3 4
ex1 2a 1a 1b m n 3.5 1 2 1 3
ex1 2b x x f n 1.1 2 2 2 3
ex1 3a 2a 2b m y 4.3 1 2 1 2
ex1 3b 2a 2b m n 2.0 2 2 2 3
ex1 3c 2a 2b f n 0.8 2 2 3 3
ex1 4a 3c 3d f y x 1 2 2 3
ex1 3d x x m n x x x x x
ex1 4b 3b 3e m y 4.7 1 2 3 4
ex1 4c 3b 3f m n 1.6 2 2 1 3
;;;;
# The four semicolons ends the in-line data
run
The "run" command actually starts the initial processing of the
pedigree.
NOTE: Imputation level 1. Imputing untyped parental
genotypes where unequivocal.
Pedigree file = inline.ped
Number of loci = 4
Locus Type Position
---------- ---- --------
a a 6
b q 7
m1 m 8-- 9
m2 m 10-- 11
Number of marker loci= 2
Bonferroni corr. 5% = 0.025321
Bonferroni corr. 1% = 0.005013
Bonferroni corr. 0.1%= 0.000500
NOTE: Creating dummy record for ex1-3e.
NOTE: Creating dummy record for ex1-3f.
NOTE: Father and mother of person ex1-4a appear to be swapped around. Reordering.
NOTE: Person ex1-3e appears as a mother and sex was unspecified.
Setting sex to female.
NOTE: Person ex1-3f appears as a mother and sex was unspecified.
Setting sex to female.
Pedigree: ex1 No. members: 13 No. founders: 6 No. sibships: 5
Total number of pedigrees = 1
Number with only 1 member = 0
Total number of sibships = 5
Total number of subjects = 13
Total subjects genotyped = 10 ( 76.9%)
Total number of genotypes = 20
Largest pedigree (members) = 13 (Pedigree ex1)
Deepest pedigree (genrtns) = 4 (Pedigree ex1)
Mean size of pedigrees = 13.0
Mean pedigree depth = 4.0
Mean size where >1 members = 13.0
Mean depth where >1 geners = 4.0
Number of imputed genotypes= 0
Number of pedigree errors = 0
Number of deleted records = 0
We obtain a few routine messages and summary statistics. The small table
of Bonferroni corrections is a reminder that Sib-pair does not usually
correct P-values for multiple tests; it is up to the user to decide what the
appropriate significance thresholds are.
Blank records are created for named but missing parents. This is
necessary, as a formal pedigree should have neither or both parents
present for each person.
>> ls
a* b* m1 m2
The "ls" commands shows trait loci (with an asterisk appended),
and marker loci.
>> drop $m
>> ls
>> undrop
a* b* (m1) (m2)
The "drop" commands drops loci from the scope of subsequent
commands, while the "undrop" command returns access to all
loci.
>> describe m1
------------------------------------------------
Allele frequencies for locus "m1 "
------------------------------------------------
Allele Frequency Count Histogram
1 0.3000 6 ******
2 0.6500 13 *************
3 0.0500 1 *
Number of alleles = 3
Heterozygosity (Hu) = 0.5105
Poly. Inf. Content = 0.4064
Number persons typed = 10 ( 76.9%)
>> describe snp
Marker NAll Allele(s) Freq Het Ntyped
-------------- ---- ----------- ------ ------ ------
m1 3 1 .. 3 - 0.5105 10
m2 4 1 .. 4 - 0.7211 10
The "describe" command gives summary information about
loci. For marker loci, it tabulates simple allele counts and
proportions in the dataset. Given the keyword "snp", it gives
a summary for all markers, one line per locus. For traits,
"describe" gives familial correlations or recurrence risks.
The "tabulate" gives simpler tables:
>> tab a
a x: 2 y: 3 n: 8
>> tab a m1
------------------------------------------------
Cross-tabulation of "a " ... "m1 "
------------------------------------------------
m1
a 1/2 1/3 2/2
---------------------------------------
n 2 (.286) 1 (.143) 4 (.571)
y 3 (1.00) 0 (.000) 0 (.000)
No. complete observations = 10
LR contingency chi-square = 5.5
Degrees of freedom = 2
P-value =0.0643
There are a few other simple descriptive commands. The
"count" command gives information about families and
individuals:
>> count b>1
Count where "b > 1":
Pedigree Con=T Num ASPs Trios 4+
---------- ----- ----- ----- ----- -----
ex1 6 13 1 0 0
Total 6 13 1 0 0
The "select" command selects pedigrees containing a specified
number (or one or more) individuals meeting the criterion. The
"print" command is individual oriented:
>> print b>1
Print where "b > 1":
ped=ex1 id=2b fa=x mo=x sex=f b=1.1000 m1=2/2 m2=2/3
ped=ex1 id=4c fa=3b mo=3f sex=m b=1.6000 m1=2/2 m2=1/3
ped=ex1 id=4b fa=3b mo=3e sex=m b=4.7000 m1=1/2 m2=3/4
ped=ex1 id=3a fa=2a mo=2b sex=m b=4.3000 m1=1/2 m2=1/2
ped=ex1 id=3b fa=2a mo=2b sex=m b=2.0000 m1=2/2 m2=2/3
ped=ex1 id=2a fa=1a mo=1b sex=m b=3.5000 m1=1/2 m2=1/3
Number of matched persons = 6 out of 13 ( 46.2%)
Number of matched pedigrees = 1 out of 1 (100.0%)
If we had instead issued:
>> keep $m
>> print b>1
we would obtain:
Print where "b > 1":
ped=ex1 id=2b fa=x mo=x sex=f m1=2/2 m2=2/3
ped=ex1 id=4c fa=3b mo=3f sex=m m1=2/2 m2=1/3
...
As of 2006-Mar-01, we can write expressions containing genotypes:
>> undrop
>> print m2=="3/4"
Print where "m2 = = 3/4":
ped=ex1 id=1b fa=x mo=x sex=f a=n b=x m1=1/2 m2=3/4
ped=ex1 id=4b fa=3b mo=3e sex=m a=y b=4.7000 m1=1/2 m2=3/4
The "associate" command gives results from family based
tests of allelic association for a trait versus all active marker
loci:
>> ass a
--------------------------------------------------
Allelic association testing for trait "a "
--------------------------------------------------
Marker Typed Allels Chi-square Asy P Emp P Iters
---------- ------ ------ ---------- ------ ------ ------
m1 10 3 1.9 0.3930 0.1575 127 AssX2-HWE .
m1 1 3 1.1 0.5978 0.5978 23 RC-TDT .
m2 10 4 0.9 0.8192 0.7692 26 AssX2-HWE .
m2 1 4 2.1 0.4471 0.4471 160 RC-TDT .
The "sibpair" command gives results from regression based
tests of linkage for a trait versus all active marker loci. The
P-values for these tests can be "empirically" estimated by
gene-dropping marker alleles under the null hypothesis of no
linkage:
>> sib b sim
---------------------------------------------------------
Sham S+D H-E for trait "b " v. all markers
---------------------------------------------------------
Marker FSibs HSibs t-value Asy P Emp P Iters
---------- ------ ------ ---------- ------ ------ ------
m1 3 1 2.6 0.1180 0.0288 695 H-E +
m2 3 1 1.5 0.1908 0.2128 94 H-E .
Finally, we log transform the quantitative trait values and write out a
Genehunter type pedigree file, so we can further examine our data using a
multipoint program:
>> b=log(b+1)
>> keep b -- m2
>> write locus merlin merlin.dat
>> write merlin merlin.ped
>> write map merlin merlin.map
The "dummy" keyword adds a dummy binary trait variable to the
Genehunter pedigree file, so that program will calculate ibds for
us. If we like, Merlin can be run from within Sib-pair:
>> $ merlin --vc
The "$" command shells out to run another program.
When we exit from Merlin, we will be returned to Sib-pair:
>> quit
This job took 1.8 minutes
There are a number of operations useful for manipulating pedigree data
prior to analysis. These allow you to prune ("prune")
pedigrees down to selected individuals and only those relatives needed
to connect the index people, to reduce families to unrelated cases and
controls ("case"), to break up large pedigrees into component
nuclear families ("nuclear") or into unrelated cliques
("subped"), if the pedigree file does not specify all the
connecting relatives (between different branches say). One can also
select ("select" and "unselect") particular groups
of pedigrees, and specify different values of a variable in each
group:
# Select out ASP nuclear families
>> nuclear
>> select containing exactly 2 where dementia and isnon and anytyp
# or EDAC nuclear families
>> unselect
>> select containing 1 where IgE>1000 and isnon and anytyp
>> select containing 1 where IgE<50 and isnon and anytyp
Sib-pair also can be used as a calculator, and has a few genetics
utilities, notably the "sml" and "grr" commands
which gives expected recurrence risks, ibd's and genotype
frequencies for a specified diallelic model:
>> sml 0.01 0.5 0.2 0.1
------------------------------------------------
Single Major Locus Recurrence Risk Calculation
------------------------------------------------
Frequency(A): 0.010000; Pen(AA): 0.500; Pen(AB): 0.200; Pen(BB): 0.100
Trait Prev : 0.102020; Pop AR: 2.0%; Var(Add): 0.000206; Var(Dom): 0.000004
Measure MZ Twin Sib-Sib Par-Off Second
---------- ---------- ---------- --------- ----------
Rec risk 0.104 0.103 0.103 0.103
Rel risk 1.023 1.011 1.011 1.006
Odds rat 1.025 1.012 1.012 1.006
PRR 1.020 1.010 1.010 1.005
ibd|A-A 1.000 0.502 0.500 0.251
ibd|A-U 1.000 0.500 0.500 0.250
Freq of A if Affected: 0.019898 (0.000,0.039,0.961)
Freq of A if Unaffctd: 0.008875 (0.000,0.018,0.982)
Mating Proportion Risk to offspring
---------- ----------- ------------------
UnA x UnA 0.806 0.102
Aff x UnA 0.183 0.103
Aff x Aff 0.010 0.104
Finally, most commands can be interrupted by typing "ctrl-C" (that
is holding the control and C keys down simultaneously). This returns
you to the Sib-pair prompt, after finishing the current command as quickly
as possible. To interrupt the program completely, you need
to strike "ctrl-C" six times.
Analytic Methods
Imputation of unobserved genotypes. This is performed using
the algorithm described by Lange & Goradia [1987]. Firstly,
(0) A phenoset (all possible genotypes) for one locus is
generated for each individual in the pedigree. Then, iterate by nuclear
families, repeating the next two steps until no further updates:
(1) Parental genotypes inconsistent with their offspring are
removed; (2) child genotypes inconsistent with their parents are
removed. Finally, (3) If zero genotypes remain, report an
inconsistency; if one genotype remains, this becomes the imputed
genotype; if the joint spouse genotype is unambiguous, but the specific
genotype each spouse carries is ambiguous, if requested, randomly assign
a genotype to each parent. These latter genotypes might be used only for
calculation of statistics for offspring of the pair, but not for the
parents themselves. A further extension is to sequentially (founders
then nonfounders) impute the remaining missing genotypes as the most
likely member of the phenoset. This or a faster randomised algorithm is
always run (unless the imputation flag is set to "-1", see
below) to give starting values for the MCMC methods, but these simulated
genotypes will not be saved unless the imputation level is set to "3" or
"full".
Sex imputation. Likelihood of observed homozygosity at
multiple sex-linked loci is calculated under hypothesis of male and
female sex, assuming 0.1% of male and female genotypes are miscalled
as heterozygotes.
Allele frequencies. These are for codominant systems only.
Either a straight allele count is used, or the contribution of each
pedigree is weighted by the number of founders it contains.
Alternatively, the imputed and observed genotypes in the founders can be
counted, or the MLE of the founder allele frequencies calculated by
MCMC (see below).
Hardy-Weinberg proportions. These are tested by a Pearson
chi-square, with the P-value estimated via "gene-dropping"
(Monte-Carlo,MC) simulation. These are based on the genotypes in the
founders of that pedigree, where typed, or on the gene frequencies in
the total sample where the founder is untyped, and the structure of the
pedigree. For diallelic markers, the exact test of Hardy-Weinberg
equilibrium (assuming all individuals is unrelated) is also
calculated.
Segregation ratios. The default analysis assumes the pedigrees
are unascertained, and gives the naive estimators. The ascertainment
corrected analysis is for nuclear families and follows Davie [1979]:
p=(r-j)/(t-j), where p is the risk, r is the total
number of affected children, j, the number of sibships containing
exactly one proband, t, the number of children. A fairly
efficient approximate sampling variance is also given.
Haplotype reconstruction. This routine is not currently
available in the Fortran 95 version of Sib-pair. This is performed on a
nuclear family by family basis, though incorporating grandparental
information where available. Initial reconstruction is performed using
a simulated annealing algorithm that maximizes a sharing based criterion
based on length of runs of the same alleles on a putative chromosome
among sib pairs, parent-offspring pairs, and grandparent-child pairs.
The order of loci in the pedigree file is treated as the linkage order,
and map distance information is not used. Missing parental
haplotypes are filled in using the childrens' haplotypes in a simple
fashion. Mendelian inconsistencies are flagged in the printout.
The second algorithm is more ambitious and attempts to construct
recombination minimized haplotypes, again on a nuclear family by family
basis, and dealing more intelligently with missing data. A simulated
annealing algorithm using multiple restarts is used. Recombination
events and Mendelian errors are flagged in the output.
Admixture analysis. This refers to testing for a mixture of
specified distributions in the empirical distribution of a quantitative
trait, usually a mixture of normals, though Sib-pair also offers
mixtures of exponential and Poisson distributions. Information from the
relationship between family members is not utilised. The usual EM
approach is used.
Test for normality. The Filliben correlation [Filliben 1975]
is calculated as a test for normality. This is the correlation between
the observed data and the rankits expected under the assumption of
normality. The P-value for this statistic is approximate, and is
produced using an approach modelled on that used by Royston [1993] for
the related Shapiro-Francia W':
log(1-rf) ~ N(m,s)
m:=1.0402 (log(log(n))-log(n)) - 1.99196
s:=0.788392/log(n)) + 0.31293
where n is the sample size. This performs reasonably well versus
the empirical percentiles:
| | Coverage
|
|---|
| n | 5 | 10 | 20 | 50 | 100 | 500 | 1000 | 5000
|
|---|
| Nominal P=0.05 | 0.021 | 0.044 | 0.055 | 0.057 | 0.059 | 0.052 | 0.045 | 0.033
|
|---|
| Nominal P=0.01 | 0.00 | 0.006 | 0.009 | 0.014 | 0.019 | 0.007 | 0.007 | 0.007
|
|---|
A common use of the Filliben correlation is as the criterion for
selecting an optimal transformation of the data.
The J0.02 statistic is a variance-corrected order-statistic
based skewness measure:
[(P0.02+P0.98)/2-P0.50]/[P0.75-P0.25]
[David & Johnson 1956]. A test of normality can be constructed,
using the standard error of J0.02. This is based on
interpolation of results from Monte-Carlo simulations
(SE(J0.02):=1.36/sqrt(N) cf Resek [1974]).
Sibship variance test. This is the linear model suggested by Fain
[1977] for the detection of the phenotypic effects of quantitative trait
loci. Briefly, if parental trait values are at the extreme of the population
distribution, then they will be carrying multiple increasing or decreasing
alleles at the QTLs. As a result, the trait variance among their offspring is
decreased compared to sibships whose parents have trait values close to the
population mean. This U-shaped relationship between midparent value and
sibship variance can be detected by fitting linear or quadratic curves.
Variance components analysis. This is the usual mixed effects
analysis of quantitative traits assuming multivariate normality. The
log-likelihood:
LL = -0.5 [log(det(S)) + (y-u)' inv(S) (y-u)]
is maximized, where S is the variance-covariance matrix for the
trait values for each phenotyped pedigree member, and y and u
are the trait values and their expected values:
u= B'X,
reducing to the grand mean in the absence of covariates. The main
diagonal elements of S takes the value:
VA+VQ+VD+VE,
and the off-diagonal elements:
Ri,jVA+ibdi,jVQ+Ki,jVD,
where,
Ri,j is the coefficient of relationship for the
i-jth pair of relatives
K is the coefficient of fraternity
ibd is the average ibd sharing at the marker location being
tested for linkage to a QTL.
The maximization is performed using AS319 (variable metric minimizer
with numerically estimated gradients). The phenometric models (ADE, AE,
E) are fitted to the intact entire pedigree, but the models including
VQ can be fitted to the entire pedigree, or to
sibships only.
Categorical trait association analysis. This is the Pearson
goodness-of-fit based test for equality of allelic gene frequencies at a
marker locus in individuals expressing different trait values (RxC
table). Both the nominal (ignoring relatedness of the sample) and empirical
P-values for the test are output. The empiric P-value is estimated via
gene-dropping simulation. These are based on the gene frequencies in the
total sample, and the structure of the pedigree, and are conditional on the
observed occurrence of the binary trait in the families. The gene-dropping
can also be further conditioned on identity by descent at another marker
(which may in fact be the marker being tested for association). This
therefore gives a "model-free" test of association conditional on
linkage.
Formerly, a sibship permutation based test was provided, calculating the
same chi-square statistic for members of sibships that contain at least
one affected and one unaffected typed individual, and generating a
(within-sibship) permutation P-value. This is now replaced by a more
powerful score (FBAT) test combining the within-sibship association and
transmission-disequilibrium tests after Knapp [1999] and Laird et al
[2000]. In this approach, the complete or partial genotype of untyped
parents is reconstructed from the genotypes of the affected and
unaffected children. The transmission of alleles from these
reconstructed parents is conditioned on the genotypes of the children
used in the reconstruction. The appropriate conditional distribution
under the null hypothesis is approximated via Monte Carlo simulation and
rejection sampling. For binary traits, a sibship permutation test in
the form of a conditional logistic regression stratifying on sibship is
also available
In the case of multiple generation pedigrees, Sib-pair scans the
pedigree upwards nuclear family by nuclear family, saving imputed
genotypes, so that these contribute as children to subsequent
nuclear families. I don't think this will be biased, but have yet to check.
This behaviour can be turned off.
Population genetic F statistics (FIS, FIT and
FST) are also calculated for each marker assuming individuals
with different indicator trait values come from separate related demes.
These are estimated following the approach of Pons and Chaouche [1995],
as described by Excoffier [2001]. Results for each locus are
combined to give multilocus summaries for the sample as:
| F*IS = (Σ HS - Σ HO)/Σ HS
|
| F*IT = (Σ HT - Σ HO)/Σ HT
|
| F*ST = (Σ HT - Σ HS)/Σ HT
|
Quantitative trait association analysis. This fits an additive
(allelic means) model predicting an individual's trait value from his/her
genotype at a marker locus. The residual sum-of-squares is compared to those
obtained via a gene-dropping simulation of the pedigrees, giving an empiric
P-value.
Two-locus and N-locus linkage disequilibrium estimation. One
algorithm finds informative founder matings, or informative matings
where all the grandparents are untyped, and imputes the two-locus
haplotypes transmitted to the offspring. The loci are assumed to be
tightly linked, so that four parental haplotypes are counted, as opposed
to the more usual two haplotypes from the child. This is the default
for markers where there are more than 6 alleles per marker. Both
D and D' measures are calculated.
For two locus analyses, the second algorithm combines phased and
unphased genotype data in an EM fitted log-linear model. This can
handle X-linked loci. For more than two markers, and for testing
haplotype association to a trait, phase information is discarded (ie all
individuals are treated as unrelated), and only autosomal markers can be
used.
Homozygosity analysis. This tests for an increase in observed
homozygosity at a marker locus in individuals expressing a binary trait,
comparing this to the predicted homozygosity based on the allele frequencies
in the total sample. This may occur in the presence of allelic association
with a recessive trait locus, and/or deletional loss of heterozygosity (the
parents would be untyped for such an individual not to have been flagged as a
Mendelian inconsistency of course). A one-tailed empiric P-value is estimated
via "gene-dropping" simulation, based on the gene frequencies in the total
sample, and the structure of the pedigree. The multipoint homozygosity
analysis uses the mean maximum marker homozygosity run length in the set
of cases. Again, gene dropping is used to produce the distribution of
this statistic under the null given the marker map, allele frequencies and
observed pedigrees.
Transmission-disequilibrium test. The original formulation of the
TDT is for a diallelic marker [Spielman et al 1993]. The TDT statistic
calculated by the program is the Pearson goodness-of-fit based test of
symmetry in the square table of transmitted versus nontransmitted alleles to
each affected child [Haberman, 1979]. Empiric P-values are produced by
randomization of the table. This global allelic form does not correct for the
(usually small) correlation in parental genotypes induced by linkage
disequilibrium (absent of course under the null hypothesis). Another allelic
test provided is the marginal allelic test suggested by Spielman and Ewens
[1996], which is probably slightly more powerful than the global symmetry
test [Kaplan et al 1997].
The genotypic TDT P-value is estimated via gene-dropping based on the
genotypes of typed ancestors of probands (where both parents of the proband
and all antecedents must be typed), and the structure of the pedigree. The
test statistic compares the observed number of each genotype transmitted with
the number expected based on the parental genotypes. Pairs of cells whose
total count is less than a given cutoff may be excluded from the analysis to
increase power. The P-value for the TDT testing each allele versus all others
in turn ("allele-by-allele") is the exact two-sided binomial probability (via
the beta distribution). When the plevel is zero, only the best
of the allele-by-allele test results are printed, and the P-value is
Bonferroni corrected for the number of alleles at that marker.
Note that the default option is to use probands for the TDT only one
parent is typed. For the diallelic marker case with unequal allele
frequencies, using one parent families does lead to biased results. The
unified transmission test available via the "assoc" command
does not suffer from this problem (see above).
The Schaid and Sommer [1993] genotypic risk ratio tests for familial
association under the assumption of Hardy-Weinberg equilibrium, or
conditional on parental genotypes, is also offered. This uses
log-linear modelling (implemented as iteratively reweighted least
squares) of a biallelic locus with both parents genotyped. The attributable
risk is also produced.
Haplotype Relative Risk analysis. This is the original
familial association statistic comparing transmitted and nontransmitted
allele frequencies unmatched on family (Falk and Rubinstein 1987; Knapp
et al 1993). As usual, an empiric P-value is estimated via
"gene-dropping" simulation, based on the gene frequencies in the total
sample, and the structure of the pedigree.
Sib-pair analysis. Identity by descent estimation is based on the
sib pair and parental genotypes when available. In the case of untyped
parents, the full-sib sharing is the sum of sharing for each possible set of
parental genotypes weighted by their likelihood based on all children in the
sibship. Half-sib sharing is estimated based only on known genotypes, whether
observed or unequivocally imputed. The effective degrees of freedom for the
t-test of the slope of the regression line is given as the number of
individuals in the sample (counted once only) who are in a sib pair where
both members are typed at trait and marker, minus two. For a sample made up
of nuclear families (no halfsibs), this will be equivalent to the Σ(sibship
size-1)-2 value used by SIBPAL 2.6, and originally suggested by Hodge
[1984]. For binary traits, the same ordinary-least-squares analysis is
performed -- the t-statistics from these results are only really
applicable to large samples, and tend to be too liberal. The quantity
regressed is not the usual Haseman-Elston squared trait difference, but
a function of the squared trait sums and differences following Sham and
Purcell [2001]. This approach is supposed to approach the power of the
variance components approach according to those authors,
and gives appropriate Type 1 error rates.
For the Fulker & Cardon methods, the expected ibd through the
interval between the two markers is estimated using the equation given in
Olson [1995]. Haseman-Elston regressions are performed at a series of points
across the interval using the ibd sharing of the two flanking markers,
and the given size of the interval. The Haldane mapping function is used.
Affected sib pair analysis. This is the original IBS based approach
described by Lange [1986a], extended to half-sibs as per Bishop and
Williamson [1990]. No correction for sibship size is made -- that is all
possible pairs are treated as independent. The usual two d.f. chi-square is
calculated, with expected counts being calculated based on the observed gene
frequencies in the total sample. The IBD based mean test is also
calculated.
Affected Pedigree Method. This uses the measures of genetic
similarity described by Weeks and Lange [1988; see also, Lange, 1986a,
1986b], Whittemore and Halpern [1994], and Ward [1993, 1995]. The expected
mean and variance for each pedigree is estimated via gene-dropping
simulation. These are based on the observed gene frequencies in the total
sample, and the structure of the pedigree. Both ibs and ibd based family
scores can be estimated. The original APM statistics, the APM statistic of
Whittemore and Halpern and the T(AB) and GPM statistics of Ward are
calculated.
Multilocus IBS sharing statistics. These are used to confirm the
pedigree structure using marker data. One approach calculates the overall
probability of sharing two alleles IBS for full and half sibs, summing over
all loci, and ignoring any linkage between markers. The second approach uses
gene dropping based on the given marker map. Two lists are generated, one by
individual, the other by relative pair.
Martingale residuals. The elegant approach of Commenges [1994] to
genetic analysis of age-at-onset is to analyse the residuals obtained from a
nonparametric or semiparametric survival analysis. Sib-pair implements the
former, calculating the martingale residuals using the Nelson-Aalen estimator
for the integrated hazard [eg Andersen et al 1993], which are then
transformed following Therneau et al [1990] to give a more symmetrical
distribution.
Generalized linear models and survival regression. These
are the usual IRLS algorithms (using AS 164, [Stirling 1981]). The
exponential and Weibull regressions are implemented as Poisson regressions
(with log time as offset) as per Aitken and Clayton [1980]. Conditional
logistic regression calls the orginal logccs subroutine of Breslow
and coworkers [Smith et al 1981].
Other standard statistical tests. A number of classical tests
for independent data (eg unrelated cases) are implemented, such as
contingency chi-square test (with Monte Carlo "exact"
P-values, see below), ordinal by ordinal trend test for contingency
tables [Yates 1948], and nonparametric one way analysis of variance via
the Kruskal-Wallis test. Sib-pair can calculate the Pearson correlation
coefficient for between-trait association, the Kaplan-Meier estimator
for the survivor function, and Nelson-Aalen estimator for the hazard
function, and measures of agreement for contingency tables.
Deterministic Estimation of Pedigree Likelihoods
Rather than the original Elston-Stewart algorithm [Elston and Stewart
1971] augmented by pedigree traversal, Sib-pair uses the iterative
peeling approach described by Wang et al [1996] (following Janss et al
[1992]). This has the advantage of "automatically" dealing
with loops, although the resulting likelihood in that case is only
approximate: this approximation can be improved by several methods, but
these are not currently implemented. The algorithm follows the detailed
description in Chapter 2.1 of Schelling [2004], though at present only
allows evaluation for a single codominant locus.
Briefly, the iterative approach peels up and down simultaneously by
calculating anterior and posterior values for each
individual, where the anterior and posterior values represent the scaled
likelihood contributions for ancestors and siblings, and mates and
descendants respectively. The values for any individual rely on those
for the other relatives, so these are reciprocally updated over multiple
iterations until they converge. Usually a maximum of 10-20 iterations
suffices, and if an ideal ordering of evaluations is used, a single
iteration in unlooped pedigrees is sufficient ie it is equivalent to
the usual pedigree traversal algorithms eg Lange and Boehnke [1983]; Wang
et al [1996], Fernandez and Fernando [2001].
Monte Carlo Algorithms
Gene-dropping. The "unconditional" algorithm producing null
distributions is as follows. Repeat the following 1-3 steps a large
number of times. (1) Founder genotypes are assigned using the allele
frequencies in the observed sample, assuming panmixia and Hardy- Weinberg
equilibrium (HWE). Iterate, until all genotypes are assigned: (2) If
both parental genotypes are nonmissing, randomly assign the index a genotype
based on Mendelian autosomal inheritance (ie if parental genotypes are {1/2}
and {3/4}, a child's genotype is randomly selected from {{1/3}, {1/4}, {2/3},
{2/4}}, with each genotype having a probability of selection of 0.25). Once
complete, (3) calculate the test statistic based on the family's simulated
genotype. Following completion of the outer loop, (4) summarize the
distribution of the resulting test statistic. This procedure is used to
generate null distributions for the association Pearson chi-square.
For the share command, this also allows for recombination between
markers based on the given linkage map
The null distributions for the genotypic marginal TDT is generated using a
"conditional on founders" algorithm, that takes observed founder/ancestor
genotypes as given. Only typed nonfounders genotypes where both parents were
typed are simulated.
ibd estimation. The modification to calculate ibd
distributions gives each founder (two) unique alleles in his/her "typing
genotype". A simple gene-drop gives the null distributions for the
ibs and ibd statistics of the APM method.
I based the "conditional on observed genotypes" (gene-drop with
rejection sampling) algorithm for calculating ibd on that
described by Blangero et al [1995]. As before, each founder is assigned
a typing genotype made up of two unique alleles. Offspring are only
assigned ibd-typing genotypes that are consistent with the
observed genotype at the observed locus of interest. For example, say
the observed genotypes in the parents are 100/102 and 100/102, and the
typing genotypes associated with these are set
to {1/2} and {3/4} respectively. If the child is 100/102, assignment of
typing genotypes {1/3} and {2/4} will be rejected. This is equivalent to a
child's genotype being randomly selected from {{1/4}, {2/3}}, with each
genotype having a probability of selection of 0.5. The resulting ibd
statistics based on the typing genotype will approximate ibd for the
marker locus of interest.
Multipoint ibd estimation. This is a simplification of the
Cardon-Fulker approach as extended by Almasy and Blangero [1998]. The
ibd estimates from sets of markers in complete linkage are
combined by calculating the variance-weighted mean ibd for each
pair of relatives over the set. The estimates of the ibd
variances for each marker arise as a byproduct of the MCMC algorithm.
Gene-dropping conditional on ibd. The previous algorithms can be
combined by gene dropping a marker conditional on "ibd
indicator" genotypes that represent segregation at either that
same locus, or a neighbouring more informative marker. This allows us
to simulate the null distribution assuming linkage but no association.
Missing genotype simulation by Monte-Carlo Markov Chain (MCMC).
The calculation of ibd in the presence of missing genotypes is
performed via a Metropolis algorithm. This algorithm is a multiallelic
extension of that described by Lange and Matthysse [1989]. One iteration
of the generation of a legal constellation of imputed and observed
genotypes is produced by:
(1) Perform (a) (b) (c) or (d):
(a) Simulate ibd, then "mutating" up to four imputed founder
alleles. These propagate through the pedigree using the current pattern
of ibd transmission as indicated by the ibd-typing
alleles, and are rejected and resimulated if an (unordered)
inconsistency with an observed genotype occurs.
(b) Simulate ibd, then switch the parent of origin for an
individual heterozygote. Propagate this change up through the pedigree
to the originating founder(s), but not below the chosen "pivot"
individual.
(c) A "conditional on observed genotypes" dropping of ibd-typing
alleles, with the refinement that this ignores imputed genotypes.
Inconsistencies thus generated for imputed genotypes are resolved by
changing the imputed genotypes. This procedure will be slow in the
presence of a large number of untyped nonfounders.
For (a)-(c), additional local proposals (as below) are compounded to
these, and the resulting proposed constellation is accepted or rejected
via the Metropolis criterion.
(d) Resimulate all untyped x untyped founder mating joint genotypes
conditional on their offspring and other spouses, then other pedigree
members singly, again conditional on surrounding genotypes. This is a
simple Gibbs sampler, and is more efficient than the above when there
are many missing genotypes in larger pedigrees.
MCMC burn-in. In releases prior to version 0.96.0, there was no
burn-in for this Metropolis algorithm, as preliminary empirical tests had
found the results from this program agreed well with "exact" results from
programs such as GENEHUNTER. Subsequently, I have found some pedigrees where
using the starting genotypes from the Lange-Goradia approach does lead to
biased ibd estimates for certain pairs of relatives. Therefore, the
program now performs a number of burn-in iterations (default 100) prior to
those used to estimate ibd. The required number of such iterations
depends on the number of missing genotypes in the pedigree.
Metropolis generalized linear mixed model and finite polygenic model
sampler. This is either a "standard" or "slice"
Metropolis sampler, where the simulated variables include diallelic QTL
genotypes, Gaussian breeding values, a single QTL allele frequency
(shared by all QTLs in the FPM), up to three genotypic means (shared by
all QTLs in the FPM), polygenic and environmental variances (including
pedigree ("VC") and maternally-derived sibship
("VS") variances).
The trait model can be gaussian, binomial (with identity, probit or
logit link), poisson (including log link), weibull or MFT.
Proposals for diallelic QTLs genotypes are straightforward to generate,
and do not usually give rise to noncommunication between sets of
legal genotype proposals. Proposals for continuous variables are
generated from random normal deviates, and a tuning parameter can be
set that alters the variance of these proposal distributions.
The likelihood contribution from the ith individual to the
Metropolis criterion for these models is (see for example, Guo and
Thompson [1993]):
LL = F*Σ(log(P(Gj))) + F*log(f(a|VA)) +
(1-F)*log(f(a|aFA, aMO)) + log(c|VA) +
I*log(f(y|G1,...Gj, a, c, VE))
where,
P(x) denotes the probability of x,
f(x) denotes the density of x,
y is the trait value,
a is the breeding value,
c is the pedigree-specific intercept,
Gj is the genotype at the jth QTL,
VA is the additive polygenic variance,
VC is the familial environmental variance,
VE is the error variance,
F=1 when a founder, 0 when a nonfounder
I=1 when phenotype observed, 0 when unobserved.
The conditional density for the breeding values of offspring includes
the correction for inbreeding (the segregation variance being
1-0.5*(FFA+FMO). The random effects are modelled
as zero-mean gaussian.
The realizations of the parameters are summarized as means, and
approximate standard errors produced by batching (default
B=iter1/2 [Jones et al 2005]). The interbatch lag-1
serial correlation is calculated as a diagnostic for the appropriate
number of values to simulate [Ripley 1987].
The implementation of the generalized linear mixed models is quite
straightforward in the chosen Metropolis paradigm (it would be more work
to produce a Gibbs sampler, I believe), but for the "standard"
sampler, it is important to check that the proposal acceptance rates are
in the optimum range (usually stated as 0.2-0.6, Ripley 1987). This is
less critical for the slice sampler, where the tabulated acceptance
rates are actually the ratio of accepted proposals to the number of
function evaluations (and so are just a measure of algorithm
efficiency). Models fitting VC and
VS or VA are two-level GLMMs and so
I have fitted a number of test datasets from the literature. There are
surprising differences between results from standard software for some
of these datasets, so although Sib-pair sometimes does not give
identical results to that from non-simulation-based maximum likelihood
methods, this may reflect approximations used by other programs.
Increasing the number of random effects chains is realized by
duplicating the families the appropriate number of times and correcting
the likelihood and standard errors. One is essentially averaging over
multiple estimates of the random effects for each individual, as global
parameters such the fixed effects regression coefficients and overall
variances are the same over the replicate chains at that iteration.
This seems to reduce bias in the estimation of the random effects, but
with the side effect of increasing the between-batch correlation, and so
slowing estimation. The tabulated results below generally used 4 chains
run for 10000 iterations after a 1000 iteration burnin.
Binomial GLMM analysis of seed germination dataset of Crowder et al
(1978) using different approaches. PQL1 is the penalised
quasilikelihood approach implemented as glmmPQL() in the MASS package
[Venables and Ripley 2002], while PQL2, AGQ are results from lmer() in
the lme4 package of Bates and Sarkar [2005] using penalized
quasilikelihood, adaptive Gaussian Quadrature respectively. The BUGS
results are from the BUGS Examples manual.
| | Parameter Estimate (SE)
|
|---|
| Method | Sib-pair | AGQ | PQL1 | PQL2 | BUGS
|
| SD of Plate Effect | 0.28 (0.17) | 0.24 (0.09) | 0.23 | 0.24 | 0.29 (0.15)
|
| Intercept | -0.51 (0.13) | -0.54 (0.17) | -0.54 (0.17) | -0.54 (0.16) | -0.51
|
| Seed | 0.06 (0.32) | 0.10 (0.28) | 0.09 (0.27) | -0.09 (0.28) |
|
| Root | 1.31 (0.26) | 1.33 (0.24) | 1.32 (0.23) | 1.32 (0.24) |
|
| Seed x Root | -0.79 (0.44) | -0.81 (0.38) | -0.81 (0.38) | -0.81 (0.38) |
|
Binomial GLMM analysis of "bacteria" dataset from the R MASS
package [Venables and Ripley 2002] using 5 different approaches. PQL1
is the penalised quasilikelihood approach implemented as glmmPQL() by
Ripley in the MASS package, while PQL2, AGQ and Laplace are results from
lmer() in the lme4 package of Bates and Sarkar [2005] using penalized
quasilikelihood, adaptive Gaussian Quadrature and the Laplace
approximation respectively.
| | Parameter Estimate (SE)
|
|---|
| Method | Sib-pair | AGQ | PQL1 | PQL2 | Laplace
|
| RE Variance | 2.05 (1.23) | 1.70 (1.05) | 1.98 | 3.27 | 1.66
|
| Intercept | 3.70 (0.76) | 2.86 (0.48) | 2.74 (0.38) | 2.75 (0.48) | 2.81 (0.48)
|
| Low dose | -1.46 (0.74) | -1.36 (0.82) | -1.25 (0.64) | -1.25 (0.82) | -1.35 (0.82)
|
| High dose | 0.60 (0.74) | 0.58 (0.85) | 0.49 (0.67) | 0.49 (0.85) | 0.58 (0.85)
|
| Week>2 | -1.66 (0.51) | -1.63 (0.46) | -1.61 (0.36) | -1.61 (0.46) | -1.57 (0.46)
|
Binomial GLMM analysis of contraception usage data
from the 1988 Bangladesh Fertility Survey [Steele et al 1996].
| | Parameter Estimate (SE)
|
|---|
| Method | Sib-pair | PQL
|
| RE Variance | 0.25 (0.06) | 0.22
|
| Intercept | -1.67 (0.16) | -1.66 (0.15)
|
| Age | -0.03 (0.01) | -0.03 (0.01)
|
| Urban | 0.72 (0.07) | 0.72 (0.12)
|
| 1 child | 1.10 (0.12) | 1.09 (0.16)
|
| 2 children | 1.36 (0.15) | 1.35 (0.17)
|
| 3+ children | 1.32 (0.17) | 1.32 (0.18)
|
Poisson GLMM analysis of epileptic seizure count data of Thall and Vail
[1990] using different approaches. PQL1 is the penalised
quasilikelihood approach implemented as glmmPQL() in the MASS package
[Ripley and Venables 2002], while PQL2, AGQ are results from lmer() in
the lme4 package of Bates and Sarkar (2005).
| | Parameter Estimate (SE)
|
|---|
| Method | Sib-pair | AGQ | PQL1 | PQL2
|
| RE variance | 0.268 (0.101) | 0.252 | 0.197 | 0.101
|
| Intercept | 1.834 (0.112) | 1.833 (0.074) | 1.870 (0.106) | 1.870 (0.074)
|
| Progabide | -0.346 (0.152) | -0.334 (0.105) | -0.310 (0.149) | -0.309 (0.105)
|
| log basal rate | 0.861 (0.119) | 0.883 (0.091) | 0.882 (0.129) | 0.882 (0.091)
|
| Base:therapy | 0.394 (0.157) | 0.339 (0.143) | 0.342 (0.203) | 0.342 (0.143)
|
| log age | 0.513 (0.330) | 0.481 (0.244) | 0.534 (0.346) | 0.533 (0.244)
|
| Period 4 | -0.159 (0.054) | -0.160 (0.055) | -0.160 (0.077) | -0.160 (0.143)
|
Poisson GLMM analysis of European male melanoma death rate dataset of
Langford et al (1998) using different approaches. PQL1 is the penalised
quasilikelihood approach implemented as glmmPQL() by Ripley and Venables
[2002] in the MASS package, while PQL2, AGQ are results from lmer() in
the lme4 package of Bates and Sarkar (2005). The STATA result used the
xtpois command, and comes from the review article at
http://www.mlwin.com/softrev/revstata.html.
| | Parameter Estimate (SE)
|
|---|
| Method | Sib-pair | AGQ | PQL1 | PQL2 | STATA
|
| Region variance | 0.188 (0.037) | 0.170 (-) | 0.161 | 0.125 | 0.102 (-)
|
| Intercept | -0.151 (0.058) | -0.139 (0.043) | -0.129 (0.049) | -0.129 (0.043) | -0.138 (0.017)
|
| UVB insolation | -0.035 (0.011) | -0.034 (0.009) | -0.038 (0.010) | -0.038 (0.009) | -0.056 (0.004)
|
The final results are sometimes sensitive to the choice of starting
values for the random effects (the fixed effects are started
automatically from the marginal model parameter estimates using
"reg"), and to the proposal step size. Because of the
correlation between random and fixed effects in GLMM's other than
Gaussian (since the intercept affects variance), differences in the
estimated random effect size do alter the fixed effects regression
coefficients.
The output (with plevel set to 1) allows plotting of parameter
estimates to assess convergence of the chain.
Multiple genotype imputation. This is one approach to dealing
with association in families where trait data is available for some
individuals without marker genotype data. The genotype sampler is used
to simulate missing marker genotypes conditional only on observed marker
genotypes in the pedigree and the population allele frequencies for that
marker (which can be fixed to a specified value). It is not,
unfortunately, simulated conditional on values at a given trait, so it
assumes data is missing-completely-at-random (MCAR). Multiple cycles of
imputation followed by association analysis of the augmented data are
carried out, then averaged results are produced. The estimate of the
mean effect (with standard error) for an allele is produced following
Rubin [1987].
Randomized TDT. The randomization test for the global allelic TDT
permutes the transmission table by randomly selecting a single proband-parent
pair and reversing the transmitted and nontransitted alleles. One "shuffle"
of the table involves N such permutations, where N is the number of such
informative parent-proband pairs in the observed pedigrees (this reduces the
correlation between successive tables in the random walk to close to
zero).
Sequential empiric P-values. The Monte-Carlo P-values provided for
the various MC-based tests are produced using the sequential approach
described by Besag and Clifford [1991]. In this refinement, we only generate
as many pseudosamples as is necessary to give a P-value numerator of size
mincount; the denominator is the number of pseudosamples. The practical
effect of this procedure is that if the true P-value is large, then
relatively few pseudosamples are generated to give a less precise estimate of
this uninteresting value. Besag and Clifford suggest a value for
mincount of 10-20. It is necessary to set a maximum denominator to avoid
excessive computation for "highly significant" results.
Other empiric P-values. An exception to this is the algorithm used
for empirical P-values for the APM. Here, a P-value for each family is
simulated at the same time as the mean and variance. These P-values were
previously combined using the procedure due to Fisher [Hedges and Olkin
1985], that is, twice the sum of the natural logarithms of the P-values was
treated as a chi-square variate with 2*N degrees of freedom, where N is the
number of contributing families. This does not seem to be particularly
powerful, so each P-value is now inverse-normal transformed to a Z-score, and
these combined in an unweighted fashion [Hedges and Olkin 1985].
Shortest paths. A shortest path through a pedigree between
two individuals is generated using Djikstra's algorithm. The presence
of multiple paths is not flagged, and the path is not purely genetic
relationships.
Pedigree loops. This is a depth-first search with backtracking.
Currently prints only one loop per pedigree.
The program reads commands from standard input, and writes results to
standard output. Therefore, the program can be run interactively, or if a
series of commands is to be found in a file, in batch mode. If the input file
was test.in, entering "sib-pair <test.in >test.out"
would perform the commands in test.in, and write results to test.out.
A command is a single line of keywords, locus names and/or variable
values. If a "\" character is the last word of a line, the next
line is interpreted as a continuation of the previous command. Sib-pair is
case-sensitive, so that the keyword "READ" is not equivalent to "read".
Commands are either global, which can be entered at any time; descriptive
(set impute, set locus, read pedigree), which must
precede the run statement; the run statement, that causes the
dataset to be read and processed; or analytic, which act only after the
run statement.
One command, set plevel, controls verbosity of output. Some useful
descriptive tables are only printed if plevel is at least 1.
Sib-pair's parser can evaluate simple algebraic and logical
expressions for each record in a datafile, but does not directly allow
complex programming. It does offer simple macro facilities including a
loop construct. The eval command accesses a Scheme interpreter
that can carry out more sophisticated computing.
There are command line options that may save a few keystrokes.
Running "sib-pair -i test.in" starts Sib-pair and
"includes" the contents of test.in. The command
"sib-pair -l test.in" starts Sib-pair and runs the
"locus" command on the contents of test.in. The
command "sib-pair -h" (or --help) starts
Sib-pair and runs the help command. If the program has been compiled
with g95, the --g95 flag prints useful information
about environmental variables that can be tweaked to alter performance,
and run time error messages.
Global commands
- !|#. The rest of the line is a comment, and is echoed to
standard output.
- echo. Prints the rest of the line to standard output.
- $ <operating system command>. The rest of the line (up to position 80) is a command,
and is passed to the shell for execution.
- dir [<OS specific args>]. Prints list of files in
current directory.
- pwd [<new directory>]. Prints name of current
directory, or changes directory to that specified.
- file delete|rename <name> [<new_name>].
Deletes or renames a file.
- clear. Restarts the program, closing all workfiles and zeroing all
arrays.
- help [All|Examples|Globals|Operators|Data|
Analysis|<search_string>].
Prints a brief description of the commands -- either all, a subset,
or all matching the search string.
- info. Information about program settings and the
current dataset. For the latter, gives counts of active and
inactive pedigrees, individuals, and loci and a table
of numbers observed for every trait and marker.
- list|ls|which [<loc1> [[to] <locN>]]
[$(a|q|m|h|x)[m|r]].
List of loci in current analysis. The "$<letter>" keyword
denotes the list of of one class of locus, which can be reordered according
to genetic map position (m) or in reverse (r) to the
dataset ordering. The which command gives the position/rank of each
locus argument in the total list of loci.
- show loci. List of active loci along with table of
numbers available (as per info).
- show pedigrees. Same as gener, with print level 0.
- show ids. List of indidivual IDs tabulated versus pedigree(s).
- show map. Shows the current marker map.
- show macros. Shows the current macro definitions.
- time. Print time elapsed since start of the program.
- set timer [on|off]. Show
time taken by each command.
- include <command_file>. Read in Sib-pair
commands from a file. If the command file is not specified, a
directory browser will start up.
- output [<output_file>]. Divert Sib-pair
output from standard output (the screen) to a file. Issuing the
output command a second time closes the output file. If the command
file is not specified on the first call, a directory browser will start up.
- macro <macro_name> [ (= <macro value>) |
(<- all <marker> | bur | che |
epo | imp | ite | ls | min | ple |
pwd| sex | twi)].
Create a macro variable or function. If an equals sign is present, the
rest of the line is taken as the macro variable body. If the <-
(setting) sign is present, the macro variable takes the current
value of the named program setting (or list of alleles or markers).
Otherwise, multiple lines of the body of a macro function are expected
to follow on subsequent lines, terminating with an empty line or
";;;;". The function body contains Sib-pair commands and
parameters for substitution, represented as %1, %2 etc.
The macro is called either as a function or a variable.
To call a macro as a function, the entire macro name must be the first
word on the command line, followed by any required parameters (which can
be any legal string). The function body is evaluated, and the resulting
Sib-pair command(s) is then acted upon. For example:
# Draw a pedigree (need to have dot and gv):
# First set up macro
>> macro draw
draw> select pedigree %1
draw> write dot %%.dot %2
draw> $ dot -Tps -o %%.ps %%.dot
draw> $ gv --scale=-4 %%.ps
draw> $ rm %%.dot %%.ps
draw> unselect
draw>
# Call macro
>> draw ped1 trait
The %% keyword is replaced by the macro call ID, a unique 5
character string. The %0 keyword is replaced by all the macro
parameters, while %+2, %+3... is replaced by
that parameter and all subsequent parameters.
To call the macro as a variable, the entire macro name is prepended by
the "%" character. It can then appear anywhere in a command.
The macro variable is replaced by the macro body, but there is no
evaluation of macro function parameters or %%. The variable name can be
delimited by brackets, so that it can be embedded in another string.
The resulting command is then parsed:
>> macro a=1
>> macro b=+
>> macro b1=2
>> %a%b%a
=> 2.
>> 1%b1
=> 12.
>> 1%(b)1
=> 2.
>> macro a=D14S52 D14S43
>> tab %a
------------------------------------------------
Cross-tabulation of "D14S52" ... "D14S43"
------------------------------------------------
[...]
- eval [<Scheme expression>]. Accesses the Scheme
read-eval-print-loop. If called without arguments, presents the Scheme
prompt "%%", otherwise evaluates the rest of the line as if it
were a Scheme expression. Sib-pair macro variables and functions are
stored within the Scheme environment, and so can read and written to.
The Sib-pair specific extensions to Scheme are
"(nloci)", "(ls)",
"(loc)", "(loctyp)",
"(locord)", "(locnotes)",
"(read-line)", "(run)" and
"(help)".
| (ls ["m" | "x" | "h" | "a" | "q" | "d[mxhqa]"]) | List locus names
|
| (nloci) | returns total number of loci
|
(loc )| returns locus at that position in the locus list
| | (locord <loc>) | returns position of a locus in the locus list
| | (locnotes <loc>) | returns notes for a locus
| | (loctyp <loc>) | evaluates type of a locus ("adhmqx")
| | (read-line <port>) | Read next line from port as a string
| | (run "<Sib-pair command>") | Run a Sib-pair command
| | (string-split <string> [<sep>] | Convert from string to list of
words split on white-space (or a specified character separator)
| | (system "<operating system command>") | Run an OS command
| | (help) | Information about this Scheme implementation
| |
See the appendix for more details.
- last [<line_number>]. With no argument, displays
the command history, otherwise submits that line of the history for
reevaluation. A negative line number counts backwards from the current
line. The command history is saved to a file "sib-pair.log".
- quit|bye. Halts the program.
- set prompt [on|off]. Displays a prompt, and
activates/resets the command line history.
- set gui [on|off]. Activates or stops the GUI.
The GUI is either Java based (using the japi library to interface
with the Java AWT library), or GTK2.0 based (using the pilib library
to interface), depending on the compile time choice.
- set ndecimal_points [<nwid>]
<ndec>. The total width (number of characters) of a
quantitative variable written to a new pedigree file defaults to 9 (and
is fixed to 8 for some files, notably MENDEL and FISHER) but can be set
as high as 20 for GAS and LINKAGE format files. The number of decimal
places can be set to ndec.
- set epoch [iso|jul|<epoch>]. Set or
show the epoch used for julian dates. Defaults to "iso" epoch of
1970-01-01.
- set out|plevel
<level>|verbose|on|off. Increasing the
print level causes more information to be printed by almost all procedures.
Print level 1 prints out the identities and genotypes of parents
imputed where the genotype was missing, raw counts of genotypes for the
hwe procedure, expected ibs probabilities for the asp
procedure etc. Print level 2 (or verbose) writes out the
statistics for each simulated dataset for the MC based procedures, the
intrapair variance and ibd sharing for each pair in the sib pair analysis,
etc. Print level -1 omits outputting a list of pedigrees.
- set printstyle [rectangular|pairwise].
Sets the format used by the print command. The default type is
rectangular, so that the selected records are printed on one line with
values at each variable in regular columns. If the pairs style
has been set, then each value is printed as a
variable_name=value pair.
- set table_separator <separator_character>. Sets the
character used to separate columns in summary tables of test results.
Defaults to " " (a space). This affects the output from
the summary, hwe, frequency, assoc commands.
- set weight founders|imputed. Weights contribution of each
pedigree to the allele frequencies by the number of typed founders, or
alternatively gives the count of the founder alleles, observed and
imputed.
- set analysis [imputed | observed]. Includes
imputed genotypes in generalized linear models fitted using the
regress command. Genotypes are automatically reimputed each time
regress is called. The idea of this is to allow multiple
imputation association analysis. Imputation is performed using
Sib-pair's single locus genotype MCMC sampler. The regress
command with imputed genotypes respects the set frequencies
command, so that the population allele frequencies can be specified in
advance.
- set burn-in <number of iterations>. Controls the
number of Markov Chain Monte-Carlo iterations used by the apm
algorithm discarded before estimation commences. Default is 100 iterations.
Setting bur to zero means no burn-in is performed (the old
default).
- set iteration<number of iterations>. Controls the
number of iterations used by the various Monte Carlo algorithms. Default is
200 iterations. Setting ite to zero means the Monte Carlo procedures
are not performed.
- set emit <number of iterations>. Controls the
number of (Monte-Carlo) Expectation Maximization iterations used by the
mcf algorithm.
- set batch <number of batches>. Controls the
number of batches used by the fpm algorithm for the estimation of
parameter standard errors.
- set chain <number of MCMC chains>. Controls the
number of chains used by the fpm algorithm.
- set tune <MCMC tuning parameter>. Controls the
multiplier for the MCMC proposal step size used by the fpm algorithm.
The base step size is usually the fixed effects model standard error for that
parameter, and tune defaults to 0.3.
- set mincount <minimum numerator of P-value>. Controls
the number used for Monte Carlo simulation of a P-value. Default is 20
pseudosamples with a test statistic more extreme than that for the observed
statistic. Set mincount equal to iter if this is not
desired.
- set seeds <seed1> <seed2><
seed3>. Initializes random number generator seeds to given values,
rather than via system time.
- set fbatimpute on|off. Enables (default) or
disables imputation of missing child alleles based on their own
offspring.
- set tdt bot[h parents]|one
[parent]|first. Limits TDT statistic to cases where either
both parents or at least one parent is typed, or one proband per family
where both parents typed.
- set hre zero|children. Assume zero recombinants
between markers for dis command where parents genotyped, thus
counting four imputed parental haplotypes. Alternatively, only
utilize two haplotypes from children.
- set map function kosambi|haldane. Set the mapping function used
by multipoint analytic and locus file output routines.
- set sml <Frequency of A allele> <Penetrance
of AA genotype> <Penetrance of AB genotype>
<Penetrance of BB genotype>. Set default single
major locus model written to locus files (usually p=0.05, pen=(0.5, 0.5,
0.05)).
- set liability <binary trait> <liability
class indicator> <number of classes>. Declare
a quantitative trait to indicate liability class for the named binary
trait. Used to generate Linkage format locus and pedigree files.
Utilities
- pchisq <chi-square> <degrees of freedom>
[<df2>].
Calculate P-value for central Chi-square distribution (or F-distribution).
- chisq <nrows> <ncols>.
Calculate contingency chi-square and permutation P for
flat table entered via keyboard.
- proportion <numerator> <denominator>
<confidence interval width>. Calculate accurate confidence
interval following Wilson (as described by Agresti and Coull) for a
proportion.
- sml <Frequency of A allele> <Penetrance
of AA genotype> <Penetrance of AB genotype>
<Penetrance of BB genotype>. Calculates recurrence risks
and segregation ratios under a specified diallelic generalized single major
locus model.
- sml <Frequency of A allele> <Mean
for AA genotype> <Mean for AB genotype>
<Mean for BB genotype>
<standard deviation for AA genotype>.
[<AB SD> [<BB SD>]].
Calculates mean, variance components and parent-offspring regression results
under a specified diallelic generalized single major locus model.
- grr <trait prevalence> (<Frequency of
A allele> <genotypic risk ratio>
[add|dom|rec]) | (<Frequency of A
allele in cases> <Frequency of A allele in
controls> case-control). Calculates recurrence risks and
segregation ratios under a diallelic generalized single major locus
model specified via trait prevalence, ratio of penetrances and pattern
of inheritance (codominant multiplicative, dominant or recessive). If
the case-control modifier is present, instead it expects the
prevalence, risk allele frequency in cases, and risk allele frequency in
controls.
- ito <Frequency of A allele>
[<Penetrance of AA genotype> <Penetrance of
AB genotype> <Penetrance of BB
genotype>]. Calculates conditional genotype frequencies for a
relative of a proband of given genotype (if only allele frequency
provided) or affection status under the specified diallelic generalized
single major locus model (if penetrances were also specified). ITO refers
to the matrices used to perform the calculation for pairs of relatives.
Algebraic operators and functions
- "<allele1>/<allele2>".
Double quotes mark the contained text for special evaluation by the
parser. A constant genotype is written as two numbers (1-999) or letters
(a-zA-Z) separated by a slash and surrounded by quotes. Other
quoted items are passed intact to be read, either as a reserved command
or as a single Fortran real, so "1+3" is evaluated as 1000,
and "1 1" as 11.
- (<value>|<locus>) *|/|+|-|^
(<value>|<locus>). Arithmetic
operations combining numerical constants and/or trait values. The
result of an operation involving constants is a single constant, but an
operation involving a trait value results in nobs results (where
nobs is the number of individuals in the pedigree file.
- (<value>|<locus>) <|>|=<|=>|ge|le|
eq|==|ne|^=|and|or (<variable>|<locus>). Logical
operations comparing numerical constants and/or trait values. when
operating on genotypes, the equality and inequality operators require
both pairs of alleles to meet the criterion, but the comparison operators
test true if either pair of alleles meets the criterion. That
is "2/2">"1/3" evaluates to True, but
"1/2"=="2/2" evaluates to False.
- if <logical expression>
then <action>
[else if <logical expression> then
<action>]...
[else <action>]. Conditional evaluation of expressions.
The if statements can be nested.
- log|log10|sqrt|exp|sin|cos|tan|asin|acos|atan|abs|int|round
(<variable>|<locus>). Functions
acting on numerical constants and/or trait values.
- rand|rnorm. Produce a random value from U(0..1) or N(0,1).
- istyp|untyp <marker>. Test if genotyped
at given marker. Necessary since if imputation is higher than -1, all
untyped individuals have a genotype containing negative allele numbers (used
to start MCMC algorithm).
- ishom|ishet <marker>. Test if homozygous
(or heterozygous) at given marker.
- alla|allb <marker>. Return the
first or second allele for each individual at the given marker.
- marcom. Show the maximum of the number of markers an individual
and any of his relatives are both genotyped at.
- numtyp|anytyp|alltyp. Show number of markers an individual
is genotyped at, or indicate whether genotyped at any one or all marker
loci.
- male|female|isfou|isnon. Test sex and founder status of
individual.
- num|nfound. Number of members and number of founders of the
pedigree containing an individual.
- famnum|index. Position of pedigree and of individual in the
active dataset.
- <expr> : <expr> [:...]. The
colon separates the members of a sequence of algebraic expressions. Its
main function is to allow multiple expressions in the branches of an
if-then-else statement, but it will minimize overheads in repetitive
calculations. If multiple bracket-enclosed expressions
are encountered, then a : is automatically inserted:
>> (a=rnorm) (b=a^2)
Command iteration
- ... { (<val1> <val2>
...<valN>)|($[mxqa]) } .... Repeat the
associated command, placing each value of the iterator list into the
position the list currently occupies. There may be multiple iterators,
and iterator lists may be nested. This macro extension allows much of the
functionality of proper computer languages:
Iteration over a range of numerical values of a function | sml {0.1
0.3 0.5} 1 0 0
|
Iteration over a range of loci of a function | tab a1 {m1 m2}
|
Iteration over a range of functions | {tdt ass} a1
|
Combinatorial generation of strings eg locus names | set loc a{1 2 3} aff
|
| Compound statements | if (male) then m{1 2}=x
|
Data Declaration commands
- set datadirectory <pathname>. Sets directory to be
searched for pedigree files.
- set workdirectory <pathname>. Sets directory to which
temporary files are written.
- set impute
off|on|full|sequential|lange-goradia|nil|<level>. Toggles
imputation routine.
- Imputation level 0 (off) does not impute genotypes.
It does generate legal genotypes via gene-dropping for all untyped
individuals, that are used to start the MCMC genotype sampler. These
genotypes are "hidden" to other routines. The gene-dropping
approach is slowed by, and can fail (stochastically) in large
pedigrees. In this case, the imputation level can be set to -1
or nil (completely off!).
- Imputation level 1 (on) imputes single individual's
genotypes if unambiguous using results of the Lange-Goradia imputation
algorithm. All other missing genotypes are silently imputed via
gene-dropping.
- Imputation level 2 (full) carries out the same
imputation as level 1, but the imputed values for all missing
genotypes are printable and saveable using write. If
imputation has already been carried out, changing the imputation level
to 2 affects just the printing of "hidden" imputed
values of unobserved genotypes.
- Imputation level 3 (lange-goradia or
sequential) imputes values for all missing genotypes via
sequential application of the Lange-Goradia genotype elimination
algorithm. That is, the missing genotype is imputed to be the most
likely genotype conditional on the typed members of the pedigree, and
those genotypes imputed prior to that iteration. The imputed genotypes are
not visible to non-MCMC commands (eg write) unless the
imputation level is changed to 2.
- set errordrop off|on|<level>. Toggles
automatic deletion of genotypes that give rise to a Mendelian inconsistency,
either an entire nuclear family (level 1), or an entire pedigree (level 2,
the default).
- set checking off|on. Toggles the first level testing
for Mendelian inconsistencies within nuclear families.
- set locus <locus name> <locus type>
[<map position> [<description...>]].
Declares position (by order within list), name and type of locus within
pedigree file. Locus type may be either:
| marker | an autosomal (fully) codominant marker
|
| xmarker | X-linked codominant marker
|
| haploid | Y or mitochondrial codominant marker
|
| quantitative | quantitative (or interval or ordinal) trait
|
| affection | binary trait
|
It is best to avoid a locus name containing reserved characters (eg
"+-*/()^"), if algebraic manipulation of that variable will be required
(otherwise quotation of the name is required). Names identical to commands
also cause trouble unless protected by brackets.
The fifth column (optionally) contains the genetic map position. All
subsequent words (up to a total of 40 characters) are stored as an
annotation. The annotation is appended to the long form of output of
some commands (eg show loci or list), and is searchable by
some commands (currently keep|drop where). If you wish to annotate,
but do not have a map position, a "." for the fifth column is
unobtrusive and accepted by Sib-pair.
- declare loci
<number>(m|x|q|a)
[<number>(m|x|q|a)... ].
Declare a batch of loci, automatically generating names: "trait1",
"trait2"..."traitN", "mar1",
"mar2"..."marN" as appropriate.
- rename <locus name> [to] <new name>.
Change name of previously declared locus.
- loci <command file>. Read in Sib-pair locus and
pedigree file declarations from a file. If the command file is
not specified, a directory browser will start up.
- read locus linkage <locus file name>. Read locus
names, types and map positions from a Linkage-format locus (.dat)
file. Does not recognise factor coding of genotypes, but does
create a new quantitative trait for liability class
- read locus merlin <locus file name> [xli].
Read locus names, types from a Merlin-format locus (.dat) file.
Markers are treated as sex-linked when the xli modifier is used.
- read locus plink <locus file name>. Read locus
names, types and map positions from a PLINK locus (.map) file.
- read pedigree <pedigree file name>|inline.
[skip <lines_at_beginning>].
Reads a GAS type pedigree file either from an external file, or inline
following the command. The inline data is terminated by a line containing
";;;;". The skip keyword leads to the skipping
past that number of lines at the beginning of the file.
- read linkage <pedigree file name>|inline.
Reads a pre-MAKEPED LINKAGE type pedigree file.
- read ppd <pedigree file name>|inline.
Reads a post-MAKEPED LINKAGE type pedigree file.
- read cases <data file name>|inline
[sex] [skip <lines_at_beginning>].
Reads a data file containing unrelated individuals. The
individual ID is the first column of data, which is followed by
all the phenotypes. If the sex keyword is present then the
second column of data is expected to be the sex. The skip
keyword leads to the skipping past that number of lines at the beginning
of the file.
- read bin <data file name>. Reads a Sib-pair
"binary" pedigree file. The "run" statement is
bypassed, so this command reads in locus and pedigree data immediately.
No checking is done for pedigree and Mendelian errors and reordering of
pedigree members is not performed. These can be large files, but on
systems where gzip is available, can be automatically compressed (see
write bin) and decompressed. The default format for the file
arises from a Fortran unformatted write of the locus and pedigree
arrays, and so will be compiler and platform specific.
- read plink <file prefix> [compress].
Reads a PLINK .bed and ancillary files:
<prefix>.bim, <prefix>.fam and
<prefix>.bed. The "run" statement is bypassed,
so this command reads in locus and pedigree data immediately. No
checking is done for pedigree and Mendelian errors or reordering of
pedigree members -- this is assumed to have been performed by the
program that prepared the files. If the compress modifier is
present, then the genotype data is stored in a 4 bits per genotype format
internally.
- set sex marker <locus name>. Declares
a sex-informative marker, such as Amelogenin.
- set twin <quantitative trait>. Declares
a variable to identify monozygotic (MZ) twins. All individuals within a
pedigree with the same nonzero value of this variable are taken to be
part of the same MZ sibship (twin pair or higher order multiple).
Different values indicate different MZ pairs within the same family.
This information is used to write out MZ twin indicators to the pedigree
files used by MENDEL, MERLIN and SOLAR, and to test for marker
inconsistencies.
- set skiplines <slines>. Skip first slines lines
in pedigree file) when reading in.
- order <loc1>...[<locB> to
<locC>]...
[$(m|x|q|a)[r|m]]...<locN>. Set
order of loci. Addition of r to a class eg $mr, reverses
the order of all members of that class, while the m modifier
causes the order to be the genetic map order. You may have to revise
the genetic map order (by set map or set dist to get
sensible export files for some programs such as Linkage (Sib-pair assumes a
map position lower than the preceding position implies the markers are
unlinked).
- set map <pos1>...<posN>. Set map positions for
the marker loci. This will overwrite any original map positions.
- set distances <dis1_2> <dis2_3>...<posN-1_N>.
Set interlocus map distances map positions for the marker loci. Distances
are in centiMorgans. This will overwrite any original map positions.
- set chromosome <chr1> [...<chrN>].
Assign each marker locus to a specific chromosome. If there are more
markers than specified chromosomes, the last specified chromosome is
reused.
- read map <map file name> [[k]bp]. Read in map
positions for loci from a file, matching via names of previously
declared markers. Should recognize most formats of map file
automatically eg Merlin, Mendel, Solar. Tests number of columns and
whether column contents are numeric or alphabetical, skipping first row
as possible header. The bp (kbp) modifier tells
Sib-pair to divide the positions by 106 (103);
thus the map distances become Mbp, and remain readable.
- set frequencies <marker>
[<allele_freq1>...<allele_freqN>]. Sets the
"population" allele frequencies for a marker to be used by
MCMC procedures that simulate missing genotypes for that marker. This
currently only affects the set analysis and gpe commands.
Only one marker at a time can have specified frequencies. To
free the frequencies (allow calculation from the dataset), call the
command without specifying any frequencies after the marker name.
- run. Reads in pedigree file and creates working pedigree
file. Imputes genotypes if requested.
Data manipulation commands
- keep|drop <loc1>...[<locB> to
<locC>]...
[$(m|x|q|a)]...<locN>.
Retain or exclude loci for subsequent analysis. Consecutive loci can be
summarized as a range, as can all members of a particular class of locus
type (marker, quantitative, affection) via a class
($type) token. Note that dropped variables can still be used in
algebraic and logical expressions.
- keep|drop where (monomorphic |
max <frequency> | number_typed <ntyp>
| chromosome <chr1> [...<chrN>] |
| distance <smallest_gap> |
| r2 <smallest_r2> |
every number_skipped | position
<start_position> <end_position>; |
hwe_p [<critical_P>] |
test_p [<critical_P>] |
search_string>). Retain or exclude loci for analysis. Note
that dropped variables can still be used in algebraic and logical
expressions. The where condition can be used to match the set of
loci meeting that condition. Available conditions are: test that a
marker is monomorphic, that the commonest marker allele frequency
exceeds a threshold, the number of individuals typed falls below a
threshold, the marker is closer than a set amount to the last included
marker (map distance), marker is in too high linkage disequilibrium (r2)
with the last included marker, every Nth marker in list, on a given
chromosome or chromosomes, within an interval on the genetic map, the
HWE test or most recently applied test P-value is smaller than the
critical level (defaults to 0.05/Nmarkers), or the marker annotation
matches the search string.
- undrop [(<loc1>...[<locB>
to <locC>]...
[$(m|x|q|a) ] ...<locN>]) |
where <search string>].
Return previously dropped loci to analysis. Default is to undrop all
dropped loci. Loci can be selected on name (including wild cards),
class, or annotation (where). This is not the reverse of the
delete command.
- select [containing|exactly
<nprobands>] [where] <a logical
expression>. Select pedigrees containing one or more individuals
with a trait value meeting the criterion.
- select pedigree|id [[not] in]
<ped1>...<pedN>. Select pedigrees included or
excluded from a list of pedigree or individual names. The names can contain
wildcard characters: "." (match any character in the target at that
position in the search string) and "*" (match any characters zero or
more times in the target at that position in the search string).
- unselect [<Nth>]. Returns all pedigrees excluded by a
select command back to the analysis. If an integer argument is given,
this gives how many select statement subsettings to roll back.
The argument can be negative (reversing the effect of a previous
targetted unselect).
- pack loci|pedigrees. Permanently delete all
loci currently excluded by a drop command, or all pedigrees
currently excluded by a select command from the work file.
- edit <pedigree> <person>|all
<trait>
to <value> [<new value>]. Allows editing
of trait values or genotypes. The all keyword performs the
action on all members of that pedigree: since wildcards can now be
used, an equivalent is "edit <ped> *".
- copy <pedigree1> <person1>
<pedigree2> <person2> Copies all
active trait values and genotypes from first record to second
record.
- update|merge [<locus1> ...<locusN>]
<phenotype_file_name>. Updates phenotype/genotype data in
the current dataset using values read from a file. The first line of
the file gives the names of the variables that are included in the
subsequent lines. The first column is the pedigree_ID and is
named ped[igree]; the second column is the
individual_ID, and is named id. The remaining columns
should have names that match locus names in the current dataset (data
for nonmatching names are skipped).
ped id locus_name1 locus_name2 locus_name3 ...
1 1 A/A 12.434 y ...
1 2 A/B x n ...
...
Where the pedigree and individual IDs for a record in the update file
match that of an active individual in the current dataset, the
corresponding phenotype and genotype values for that individual are
updated using the values read from the file. If merge was
called, only missing values in the current dataset are updated, but
all values are overwritten if the call was to update.
By specifying locus names on the command line, you can further control the
loci that are updated from the new file.
- delete <pedigree> <person>|all
Sets all data to missing for a specified individual. The
all keyword performs the action on all members of that pedigree.
- delete [<locus1>...<locusN>]
[whe|where] <a logical expression>.
Sets specified data to missing for all individuals meeting particular
criteria.
- get <relationship> <statistic>
<trait> [<newtrait>]
| get | all | mean | <trait> | [<newtrait>]
|
| | offspring|children | minimum
|
| | sons | maximum
|
| | daughters | sum
|
| | parents | count
|
| | father | sample
|
| | mother |
|
| | spouse |
|
| | husband |
|
| | wife |
|
| | siblings |
|
| | brothers |
|
| | sisters |
|
Summarizes trait values of all relatives of the specified class of all individuals,
saving the result to a trait locus if requested. The sample option
samples with replacement unless the all option is used, when the
sampling is without replacement.
- recode (<marker>|$(m|x)) [frequencies].
Recodes alleles at that marker or set of markers to 1..N, where the
ordering defaults to the allele size (or collation order for letter
alleles). If the freq modifier is present, the numbering is by
ascending allele frequency. If the let modifier is present,
the numbering is "1..4" for "ACGT", and the reverse
for num.
- recode <marker>
<all1|value1>...<allN|valueN>
to <new allele|new value>
[...<newN>]. Allows pooling and/or recoding of marker
alleles prior to subsequent analysis. If there are fewer new values than
old values, the last new value is recycled.
- combine <marker1> [...<markerN>]
[<threshold>]. Pool rare alleles for a marker into one new
allele. "Rare" defaults to a frequency of 5%, but can be changed
via the last parameter on the command line.
- flip <marker> [to <marker2>
] [...<markerN>]. Recode SNP marker alleles to the
complementary strand coding ie "[ACGT]" ->
"[TGCA]".
- date <quantitative_trait>
[julian|gregorian|year]. Convert a numeric
variable from Julian to Gregorian, Gregorian to Julian, or Gregorian to
"decimal" year. The "chronological" Julian date is
the number of days since the epoch, usually 1970-01-01 or -4712-01-01.
Gregorian dates are represented as 8 (or 9) digit integers of the form
of (-)YYYYMMDD. The decimal years are YYYY.x, where the decimal part is
the day of year number (from 1...366) divided by the length of that year
(365 or 366).
- date (<yyyymmdd> julian)|
(<juldate> gregorian). Convert a
single date from Julian to Gregorian or Gregorian to Julian.
- test dob <DOB_variable> [gregorian]
[<threshold>]. Test that parent and offspring DOBs are
consistent. The threshold controls the minimum age of parents at
birth of offspring, and defaults to 4380 days, if gregorian has been
used to declare the DOB variable to represent Gregorian dates.
- test age <age_variable>
[<threshold>]. Test that parent and offspring ages are
consistent. The threshold controls the minimum age of parents at
birth of offspring, and defaults to zero, since the units are unknown.
- test sex. Uses X-chromosome and/or Amelogenin to test the
putative sexes of genotyped individuals.
- test haploid [mitochondrial]. Uses Y-chromosome or
mitochondrial markers to test the putative relationships of
genotyped individuals.
- test locus [<marker 1> [...<marker N>]].
Carry out the usual Sib-pair Mendelian error screen on the specified locus and
only on active pedigrees. Useful if checking was turned off when the pedigree
was first read in.
Analysis commands
- transform <xtrait> <divisor>
<subtractand> <power> <lower
threshold> <higher threshold>. This transform the
quantitative trait xtrait as:
boxcox({xtrait-subtractand}/divisor)
where boxcox() is (a slightly altered) Box-Cox transformation, so that:
- if power=0, the transformation is log({x-s}/d);
- if power=1, it is {x-s}/d;
- and otherwise [({x-s}/d)p-1]/p.
The resulting transformed value can then be truncated above or below using a
specified low or high threshold.
- standardize <trait> [familywise]. Replace each
trait value with its Z-score, ie (x-xbar)/sd, where
xbar is the total sample mean, and sd the total sample standard
deviation. This can also be performed using the individual's family mean and
standard deviation, if the fam keyword is included.
- adjust <ytrait> on <xtrait>
[to <adjustment value of xtrait|m|f>].
Performs linear regression of quantitative trait ytrait on quantitative or
binary trait xtrait (or sex, if sex is set to on), calculates
residuals, and adds adjustment value or, if not specified, the mean
value of xtrait. The residuals then replace the original values of
ytrait. A multiple regressive adjustment of Y on
X1 and X2 requires
sequential adjustment of Y on
X2,X1 on
X2, and then Y on the adjusted
X1. Has been superceded by the more powerful residuals
command.
- residuals <ytrait> on
<loc1>...[to]...<locN> [complete_obs].
Replace quantitative trait with the residuals from the multivariate
regression on the list of predictors (which may include the average
allele length of a marker locus). The com option means only
individuals with no missing values for any of the listed traits will be
updated. Otherwise, missing values are replaced with the sample mean
for that phenotype when calculating the predicted value.
- predict <ytrait> on
<loc1>...[to]...<locN> [complete_obs] Replace
quantitative trait with the predicted value from the multivariate
regression on the list of predictors (which may include the average
allele length of a marker locus). The com option means only
individuals with no missing values for any of the listed traits will be
updated. Otherwise, missing values are replaced with the sample mean
for that phenotype. when calculating the predicted value
- impute <ytrait>. Replace missing quantitative trait
values with the predicted values from a regression on the spouse, sibling and
offspring observed values. Designed mainly for imputing missing age or date
of birth.
- impute <ytrait> on
<loc1>...[to]...<locN> [complete_obs] Replace
missing quantitative trait values with the predicted value from the
multivariate regression on the list of predictors (which may include the
average allele length of a marker locus). The com option means
only individuals with no missing values for any of the listed predictor
traits will be updated. Otherwise, missing values are replaced with the
sample mean for that phenotype when calculating the predicted
value.
- kaplan-meier <age-at-onset> <censor>
[residuals]. Prints the product-limit estimator for the survivor
function for the quantitative trait age-at-onset, where censor
is the binary outcome trait, which is affected when
age-at-onset represents the age at which the individual first expressed
the trait. The age-at-onset is replaced by a nonparametric residual
when requested. If affected, this is:
sgn(1-H(t)).(-2(1-H(t)+log(H(t))))1/2
If unaffected:
-(-2H(t))1/2
where H(t) is the Nelson-Aalen estimate of the integrated
hazard function at that age t.
- lifetable (<start>|0) <end>
<censor> [<cohort stratum width> [<period
stratum width>]] [time] [covariate
<covariate>]. Prints the life table for the quantitative
trait end, where censor is the binary outcome trait, which
is affected when end represents the age at which the
individual first expressed the trait. Start represents the time of
entry into the study of the individual. If the start trait
is "0", the start of observation is taken to be 0 for all
individuals. The <cohort stratum
width> is the bin width used to divide up the entry times into
the study. The <period stratum width> is the bin width used
to divide up person time of followup (and defaults to years). The
<time> modifier causes the start and end values to be
treated as a continous measure, rather than a Julian date (the default).
The covariate keyword declares the following named variable to be
a categorical covariate which will be used to stratify the dataset.
- rank <trait> <rank>. Write the ranks
of a quantitative trait to the quantitative variable rank.
- blom <trait> <blom_score>. Write
the approximate inverse normal scores for a quantitative trait to the
quantitative variable blom_score.
- simulate <trait> [<h2>
[<linked extant marker>]]. The data for the named trait
is replaced by simulated data for a trait under the control of a QTL
(or polygenes) with a total heritability of h2 (defaulting to 50%).
If a second marker name is given, the controlling QTL is simulated as being
completely linked to the second marker.
- simulate <marker> [<linked extant
marker>] [<number of equifrequent marker alleles> |
<allele 1 frequency>...<allele N frequency>].
The data for the named autosomal marker is replaced by simulated data.
If a second marker name is given, the new marker is simulated as being
completely linked to the second marker. Either a set of allele frequencies,
or the number of (equifrequent) alleles, can be given for the simulation.
If the sum of the given allele frequencies is less than 1, an
extra allele will be added automatically.
- simulate pedigrees [<nped> [<ngen>
[<min_number_of offspring>
[<max_number_of_offspring> ]]]]. Generate a set of
nped (default 100) random pedigrees, each of ngen (default
2) generations. The component nuclear families each contain between
min_number_of offspring and max_number_of_offpsring
offspring (defaulting to a range 0-2). The generated pedigrees are each
descended from a single founder couple (with marry-ins).
- nuclear [maxsibs] [grandparents]. Split pedigrees
into component nuclear families, duplicating individuals as necessary. If
maxsibs is set, then sibships containing more than maxsibs members
are truncated. The gra option includes the grandparents as well.
- subpedigrees. Split nominal pedigrees into component true
pedigrees. Sib-pair normally can analyse a group of individuals with the same
pedigree ID, even if they are not all related. This command splits such
groups into uniquely named formal pedigrees.
- join <pedname1> [<pedname2>...<pednameN>] Join or rejoin pedigrees into a single pedigree,
appropriately dealing with shared individual IDs and their associated
data. Note that Sib-pair allows noncontiguous pedigree blocks to use
the same pedigree name, so multiple pedigrees of the same name will be
collected up.
- prune [<binary trait> [|<quantitative
trait> over|under <threshold>]].
Reduce pedigree to contain probands and minimum number of connecting
relatives.
- cases <locus>.
Reduce pedigree to unrelated individuals with non-missing values at the
trait i.e. the informative founders, and any informative nonfounders
who are not directly related to any individuals already selected.
- unique_id [sequential]. Generate unique consecutive
(within family) numerical IDs for all individuals (as well as new
numeric pedigree IDs). The sequential gives IDs from
1...total_records, instead of 10001, 10002...20001...
- hash [ <file_of_IDs> | <pedigree_ID>
<individual_ID> | show | delete | size
<percent_table_size>]. Sets up, show, deletes or utilizes a
hashed index for searching out IDs. Does not allow use of wild cards cf
print. If the single argument is a file name, each line in the file is
expected to contain a pedigree and individual ID as the first two words.
These will be searched for in the current dataset. Increasing the
plevel gives lists of unmatched and matched IDs. If two
arguments are given, these are taken to be a pedigree and individual ID
to be searched for. The show, delete, and size are
for tuning, and will not be needed for ordinary use.
- print [where] <a logical expression>.
Print trait values for individuals, with a combination of trait values
meeting the criterion.
- print ped <Ped1>...<PedN>
[id <Id1>...<IdN>]
Print trait values for individuals, with specified combination of
pedigree and individuals IDs. The pedigree and ID names can contain
wildcard characters: "." (match any character at that position
in the search string) and "*" (match zero or more characters).
- write [<pedigree file name>]. Writes a GAS type
pedigree file from the current dataset. Default is to screen.
- head|tail [<nrec>|
(<skip> <nrec>)]. Writes the first or last
nrec records (default 20) of a pedigree file to the screen. If
two arguments are present, then the first represents the number of
records to skip over before writing nrec records.
- more [<nrec>]. Pages through
the current dataset, nrec records (default 20) per page.
Allows paging backwards and forwards by full or half pages.
- write pap. Writes the required pedigree files trip.dat
and phen.dat (note that you may have to sort
trip.dat).
- write bin <pedigree file name>
[compress]. Writes a Sib-pair "binary" pedigree file.
The file actually contains both locus descriptions and
pedigree/genotype/phenotype data, and so saves the state of the program
at that time. These can be large files, but on systems where
gzip is available, will be compressed if the compress
modifier is present. Such files are automatically decompressed by the
read bin command (if the filename has a .gz suffix. The
default format for the file arises from a Fortran unformatted write of
the locus and pedigree arrays, and so will be compiler and platform
specific.
- write <format> <pedigree file name> <modifier>
| write | pedigree|gas
| <pedigree file name> |
|
| | arl | | [par|all]
|
| | asp|tcl | |
|
| | beagle | | [fou]
|
| | crimap|tcl | |
|
| | csv | |
|
| | dot | |
|
| | fisher | |
|
| | gda | | [all]
|
| | haploview | |
|
| | linkage|ppd|gh | | [dummy] [numbered_alleles]
|
| | mendel | |
|
| | merlin | |
|
| | phe | |
|
| | sage | |
|
| | solar | | [phe] [nopedigreeID]
|
| | structure | | [fou]
|
Use of the keywords pedigree or gas writes a GAS type
pedigree file from the current dataset. Quantitative values are written
as F9.x or F8.4. The keyword gda writes a GDA Nexus datafile
containing all current marker genotypes for founders. If the keyword
all is added, nonfounders will be included as well, but the
"gdatype" format will not differentiate between relatives. Similarly,
arlequin writes a data file for the program Arlequin containing
haplotypes from one informative child per family, or two parents of such
a child if the par keyword is added. Only if the all
keyword is added will all genotyped individuals be printed. The
keywords linkage and ppd write a pedigree file from the
current dataset suitable for use by the LINKAGE (and FASTLINK) programs,
the former type requires preprocessing by the Makeped program (note that
if a quantitative trait value is zero -- that is nonmissing -- it is
recoded to 0.0001); aspex (or tcl) writes a linkage style
pedigree file but with the marker locus names as the first line, as the
ASPEX programs prefer; gh writes a linkage style pedigree file
with a dummy affection trait as the first trait and all the quantitative
traits last, with "-" for missing quantitative trait values.
The dummy option added to linkage or gh writes a
dummy affection locus as the the first trait (everybody affected). The
numbered_allele option skips recoding alleles to numbered
alleles. The haploview option is a linkage style file with
recoding of letter alleles from "ACGT" to "1234".
The sage keyword writes a pedigree file from the current dataset
suitable for use by the program FSP included in the SAGE package;
mendel writes a pedigree file from the current dataset suitable
for use by the programs MENDEL or SIMWALK; merlin writes a pedigree
file suitable for Merlin -- actually a LINKAGE "pre" format file
with zygosity included as the first trait, if the "set twin"
command has been previously issued; fisher writes a
pedigree file from the current dataset suitable for use by the program
FISHER; cri writes the ".gen" style file required by CRI-MAP;
phe writes the "pheno.dat" style file required by
Mapmaker-Sibs; both csv and solar give a comma delimited
file, with header naming columns, from which the pedigree ID column can
be dropped via the nop option, and the SOLAR phenotype (or
genotype, depending on a prior keep statement) file written by the
phe option. The SOLAR pedigree file has two additional columns:
MZ twin indicator (requiring a previous "set twin") and a
household (actually pedigree) indicator. The structure and
beagle commands write genotype data files for Structure and
Beagle respectively (and can be restricted to writing just founder data
using the founders option).
- write map mendel|merlin|loki
<map file name>. Writes out the map file required by
MENDEL 4.0, MERLIN or LOKI.
- write locus pap. Writes the required locus files
header.dat and popln.dat.
-
| write locus | aspex|tcl
| <locus file name> |
|
| | fisher | |
|
| | gas | |
|
| | haploview | |
|
| | linkage|gh | | [dummy] [xlinked]
|
| | loki | |
|
| | mendel | | [trait]
|
| | merlin | |
|
| | relpair | |
|
| | sage | |
|
| | sib-pair | | [<pedigree file name>]
|
| | structure | | <pedigree file name>
|
| | superlink | | [dummy] [xlinked]
|
Use of the keyword gas writes a GAS type locus file from the
current dataset; haploview writes an "info" file,
giving the marker position as the first word of the annotation if it is
numerical, or the map position multiplied by 106;
linkage writes a locus file from the current dataset suitable for
use by the LINKAGE (and FASTLINK) programs; gh writes the same as
linkage save that map distances are in cM. The dummy
option is used when the first trait is a dummy trait generated by
write linkage <file> dummy, while the xlinked option
declares the markers to be all X-linked. The keyword loki writes
a control file for LOKI's prep program; sage writes a
locus file from the current dataset suitable for use by the program FSP
included in the SAGE package; mendel writes a locus file from the
current dataset suitable for use by the programs MENDEL or SIMWALK (with
binary traits defaulting to a factor, but given as a diallelic locus if
the trait modifier is present); fisher writes a locus file
from the current dataset suitable for use by the program FISHER;
merlin for MERLIN; tcl or aspex writes the tcl
command file required by ASPEX programs such as SIB_PHASE;
sib-pair writes a Sib-pair style script; superlink writes
a LINKAGE style locus file modified for the SUPERGH option of SUPERLINK;
relpair writes the RELPAIR format (modified from that for
MENDEL): it infers chromosome number from the map position (as multiples
of 1000) or from the locus name (if it takes the form of
"DxSxxx").
- write var [mendel] <var file name>.
Writes out the var file (list of quantitative traits) required by
MENDEL.
- generations [<quantitative trait>
[reverse]]. List founders/marry-ins and sibships by generation
number for all pedigrees, (over)writing the generation number to a
quantitative trait if requested. If the reverse modifier is
present, the generation number counts up from the bottom of the
pedigree, rather than from the top.
- loops [<binary trait>]. Prints marital or
inbreeding loops in the active pedigrees. If a binary trait is
specified, the members of the loop are flagged as "y" at that
trait, with nonmembers set to missing at the trait.
- gpe <codominant marker> [mcmc]
[<allele dose estimate>]. Gives iterative peeling or MCMC
(Monte-Carlo Markov Chain) estimates of the genotype probability
estimates for the given marker for each individual: a vector of
probabilities corresponding to the possible genotypes at that marker.
For an observed genotype, this is 100% for the observed value and 0% for
all other possible genotype values. For an unobserved genotype, it
gives the probability distribution of possible genotypes conditional on
the sample allele frequencies (assuming Hardy-Weinberg Equilibrium) and
the observed genotypes in the individual's relatives. If the name of an
extant quantitative trait is appended to the command, the expected gene
dose for the first (ie lowest in the collation order) allele will
be written to this variable. The gpe command respects the set
frequencies command, so that the population allele frequencies can
be specified in advance.
- peel <codominant marker>. Calculate the
pedigree likelihood for the given marker.
- haplotypes <marker1> <marker2>
<newmarker> [<threshold>]. This infers phased
genotypes when the two markers are in complete (or near-complete) LD.
The threshold sets the maximum number of the rare haplotype that is
acceptable when LD is not complete.
- triads This routine lists haplotypes inferred from
fully typed parent-offspring triads, along with counts of
obligate recombinants.
- relatives <ped> <id>. This routine lists relatives of an index
individual: parents, sibs, spouses, offpsring and descendants. If the
pedigree is small (<12 members) or the plevel is set to
1 or higher, then a list of shortest paths from each pedigree
member to the index individual is also printed.
- ancestors <binary trait> |(<quantitative
trait>
>|>=|<|<=|==|^=
< threshold>). This
prints the IDs of the ancestor (and ancestral mating) shared by the greatest
possible number of probands in a family. The mean intrafamilial
inbreeding coefficient for the probands is also output.
- typed [<binary trait>|<quantitative
trait>]. Prints number of individuals phenotyped at each locus.
If a stratifying variable is specified, these counts are versus each
level of the trait.
- frequencies|describe [[<codominant
marker>| <binary trait>| <quantitative
trait>]...[to]...<trait>] | [snp] .
Print allele frequencies for marker loci, segregation ratios for binary
trait, or means, variances, familial correlations and a sibship variance
test for a quantitative trait. Default is to describe all loci. The
snp option prints minor allele frequencies and number typed for
all diallelic marker loci.
- mcfrequency <codominant marker> | $m.
Print MCEM (Monte-Carlo Expectation-Maximization) estimates of the
founder allele frequencies for marker loci. A fixed number of EM
iterations are carried out, usually 20. This can be set higher if
desired using the set emit command.
- count [where] <a logical expression>.
Count individuals, and sibships and pedigrees containing such
individuals, with a combination of trait values meeting the criterion.
- print [where] <a logical expression>.
Print phenotype data for individuals with a combination of trait values
meeting the criterion.
- hist <quantitative trait> [<number_of_bins>]. Produce
Alternative interface to mix with one distribution. The
number of bins in the lineprinter histogram can be set.
- plot <quantitative trait>
<quantitative trait> [<file>]. Produce
an Encapsulated Postscript scatterplot for two quantitative traits.
Graphic file name defaults to "sib-pair.eps".
- tab <trait 1>...[<trait N>].
Print contingency table for one, two or N traits, along with contingency
chi-squares, Kruskal-Wallis test or odds ratio if appropriate. For
RxC contingency tables where the second variable is a diallelic marker
locus, allele frequencies and exact P-values testing Hardy-Weinberg
Equilibrium are printed for each level of the first trait.
- llm <trait 1> [[+] <trait 2>...]
[[+] <trait 1> * <trait 2>...] [-1]. Carry out
log-linear modelling of a multidimensional contingency table under the
specified model. The "lrt" command can be used to compare
sequentially fitted models.
- kruskal-wallis <quantitative trait> <trait>.
Print table of means for the quantitative trait for each level of
factor, along with the Kruskal-Wallis chi-square.
- regress <ytrait> =
<x1>...[to]... <xN> [offset
<offset>] [poisson|(exponential|weibull
[<censoring_trait>)] [shape <shape>]
[sim] [rep <nreplicates>]. Performs linear or
logistic or poisson or weibull regression of trait ytrait on set
of loci x1...xN. If an x variable is a marker genotype,
that independent variable is the mean allele size in the genotype, with
the exception of the first marker locus encountered in the list, which
is fully allelic effect coded. The offset option reads an offset
for the linear predictor from the specified trait. Addition of a binary
trait name to the end of the keyword list when the regression is
weibull or exponential declares this as the censoring
indicator. The shape keyword declares a starting value for the
solution of the Weibull distribution shape parameter. The sim
keyword gives a gene-dropped P-value for the first marker locus in the
list. The rep keyword specifies a number of replicates for
multiple imputation of the test marker locus genotypes, and is usually used
when set analysis imputed has already been issued.
- clreg <ytrait> =
<x1>...[to]... <xN> [ped].
Performs conditional logistic regression of trait ytrait on set
of loci x1...xN. If an x variable is a marker genotype,
that independent variable is the mean allele size in the genotype, with
the exception of the first marker locus encountered in the list, which
is fully allelic effect coded. The default stratification is on sibship, but
the addition of the ped keyword gives an analysis stratified on
pedigree.
- mixture <quantitative trait> [<Number of
distributions> [normal|pooled_normal|exponential|poisson]]. Estimate
mixing proportions, means and standard deviations for a 1..5 component
mixture model describing the specified quantititative trait. The default is a
mixture of Normal (Gaussian) distributions with different means and
variances, but a common variance can alternatively be specified. Other
distributions available are the exponential and Poisson. A line-printer type
histogram is produced.
- kinship [inbreeding [mc] |
pairwise | dominance |
<binary trait> [|<quantitative trait>
[>|>=|<|<=|==|^=
<threshold>]]. Write the numerator relationship matrix
(matrix of coefficients of relationship) for each pedigree in a lower
triangular form or as a list of pairs (in the latter case, the
coefficient of fraternity is also printed, along with an indicator as to
whether a pair are full sibs. When the dominance keyword is
present, a pairwise list is printed of bilineally-related pairs (defined
as K>0) that are not full sibs. Alternatively, if requested,
print a list of individuals with a non-zero inbreeding coefficient,
using a Monte Carlo estimator if mc modifier is present. If a
binary trait is specified, the NRM is only for the affecteds if
plevel=1; for plevel=0, only a summary for each pedigree
is printed: number of affecteds, number of "sporadic" cases ie
cases unrelated to any other affected family members (eg marry-ins with
no affected descendants), mean coefficients of relationship for affected
relative pairs and of inbreeding for cases.
- ibd <codominant marker> [pairwise]. Write the
estimated mean identity-by-descent sharing at a marker for all relative pairs
in a pedigree as a lower triangular matrix or a list of pairs.
- ibs <codominant marker> [pairwise]. Write the
estimated mean identity-by-state sharing at a marker for all relative pairs
in a pedigree as a lower triangular matrix or a list of pairs.
- hwe [founders] [<mar1> ..[to]..
<marN>] [$(m | x)]. Prints chi-square statistic for Hardy-Weinberg equilibrium for
all marker loci. Analysis may be restricted to founders, and if the
marker is diallelic, an exact test is carried out. If nonfounders are
included, then a gene-dropping simulated P-value is produced. The mean
IBS sharing for all typed matings is also calculated, and compared to
its expected value. This latter test may allow detection of homogamy
or assortment.
- cksib. Lists all sib pairs, and the mean of IBS at all
marker loci where both members of the pair are typed at the marker,
comparing this to that expected if related as specified by the pedigree
structure. The