| Title: | Fast PCA for Large Pedigrees |
|---|---|
| Description: | Carry out principal component analysis (PCA) of very large pedigrees such as found in breeding populations! This package exploits sparse matrices and randomised linear algebra to deliver a gazillion-times speed-up compared to naive singular value decoposition (SVD) (and eigen decomposition). |
| Authors: | Hanbin Lee [aut], Hannes Becher [cre], Ros Craddock [aut], Gregor Gorjanc [aut] |
| Maintainer: | Hannes Becher <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.1.3 |
| Built: | 2026-05-27 08:10:11 UTC |
| Source: | https://github.com/cran/randPedPCA |
This index is used by plot.rppca to downsample the col (colour) values. It is stored in the rppca object's ds slot.
dspc(pc, to = 10000)dspc(pc, to = 10000)
pc |
an object of class rppca |
to |
The down-sampling parameter. A numeric > 0 or a vector or NA. Interpreted as a proportion or integer or a index vector, see details. |
The parameter to is used to specify and possibly which individuals are sampled.
If NA, all individuals are retained. If to is of length one and is between 0 and 1,
then it is interpreted as a proportion. If it is greater than 1, it is taken to be
the number of individuals to be sampled (possibly rounded by sample.int). If
to is a logical or an integer vector, it is used for logical or integer indexing, respectively.
The integer indices of the sample individuals are written to the ds slot.
If ds exists, it is overwritten with a warning.
An (invisible) object of class rppca with a slot ds added.
Follows Skorski, M. (2021). Modern Analysis of Hutchinson’s Trace Estimator. 2021 55th Annual Conference on Information Sciences and Systems (CISS), 1–5. https://doi.org/10.1109/CISS50987.2021.9400306
getNumVectorsHutchinson(e, d)getNumVectorsHutchinson(e, d)
e |
A |
d |
A |
a scalar
Hutch++ trace estimation
hutchpp( B, num_queries = 10, sketch_frac = 2/3, center = FALSE, oraculum = oraculumLi )hutchpp( B, num_queries = 10, sketch_frac = 2/3, center = FALSE, oraculum = oraculumLi )
B |
An object related to the matrix A for which the trace is to be estimated |
num_queries |
Number of random vectors to draw |
sketch_frac |
Hutch++ detail |
center |
Whether or not to implicitly centre |
oraculum |
The oracle function to be used |
The Hutch++ algorithm (Meyer et al. 2021, https://doi.org/10.48550/arXiv.2010.09649) estimates the trace of a matrix A by evaluating matrix vector products of A and (sub-gaussian) random vectors. This is used on a matrix B which is related to A through some function. The oracle function has to be chosen so that oracle(B, G) returns the product A the oracle function is set to work on a pedigree's L inverse matrix. But this implementation is general and should work - given a custom oracle function - on other input too.
In the context of pedigree PCA, this is used to estimate the trace of an (implicitly) centred additive relationship matrix.
There logical parameter center allows for a pedigree's L matrix to be (implicitly) centred. This is important because centring changes the total variance of the data and thus the trace of A.
An estimate of A's trace - numeric
hutchpp(pedLInv) hutchpp(pedLInv, center=TRUE)hutchpp(pedLInv) hutchpp(pedLInv, center=TRUE)
RandPedPCA relies in the spam onject format. But matrices are commonly
stored in other formats.
importLinv(pth)importLinv(pth)
pth |
path to matrix market file for L inverse matrix in dgTMatrix format |
A spam sparse matrix
An L inverse matrix generated from an AlphaSimR simulation of 20 generations.
pedLInvpedLInv
## 'pedLInv' Matrix object of class 'spam' of dimension 2100x2100, with 6100 (row-wise) nonzero elements. Density of the matrix is 0.138 Class 'spam' (32-bit)
Simulation
An L inverse matrix generated from an AlphaSimR simulation of 20 generations. Two diverged populations A and B. After a number of generations, crossbreeding starts.
pedLInv2pedLInv2
## 'pedLInv2' Matrix object of class 'spam' of dimension 2650x2650, with 7750 (row-wise) nonzero elements. Density of the matrix is 0.11 Class 'spam' (32-bit)
Simulation
An L inverse matrix generated from an AlphaSimR simulation of 20 generations. One population, ABCD, is split into four, A, B, C, D.
pedLInv4pedLInv4
## 'pedLInv4' Matrix object of class 'spam' of dimension 4200x4200, with 12200 (row-wise) nonzero elements. Density of the matrix is 0.0692 Class 'spam' (32-bit)
Simulation
A dataframe.
pedMetapedMeta
## 'pedMeta' A 'data.frame' of 2100 individuals (rows) with 9 variables (cols):
Integer individual ID
Population code. A, B or AB
Generation of the individual
dam ID
sire ID
genetic value
phenotypic value
genetic value
phenotypic value
Simulation
A dataframe.
pedMeta2pedMeta2
## 'pedMeta2' A 'data.frame' of 2650 individuals (rows) with 12 variables (cols):
Integer individual ID
Population code. A, B or AB
Generation of the individual
dam ID
sire ID
genetic value
phenotypic value
genetic value
phenotypic value
genetic value
phenotypic value
for plotting
Simulation
A dataframe.
pedMeta4pedMeta4
## 'pedMeta4' A 'data.frame' of 4200 individuals (rows) with 9 variables (cols):
Integer individual ID
Population code. A, B, C, D, or ABCD
Generation of the individual
dam ID
sire ID
genetic value
phenotypic value
genetic value
phenotypic value
Simulation
A simple wrapper around rgl's pot3d function.
plot3D(x, dims = c(1, 2, 3), xlab = NULL, ylab = NULL, zlab = NULL, ...)plot3D(x, dims = c(1, 2, 3), xlab = NULL, ylab = NULL, zlab = NULL, ...)
x |
an rppca object |
dims |
vector of length 3 - indices of the PCs to plot |
xlab |
(optional) x axis label |
ylab |
(optional) yaxis label |
zlab |
(optional) xz axis label |
... |
additional arguments passed to rgl::plot3d |
Note, different to plot.rppca, which is relatively slow, plot3D does
not down-sample the principal components and it ignores the ds slot of an
rppca object if present.
No return value, called for its side effects.
pc <- rppca(pedLInv) plot3D(pc) ped <- pedigree(sire=pedMeta$fid, dam=pedMeta$mid, label=pedMeta$id) pc2 <- rppca(ped) plot3D(pc2)pc <- rppca(pedLInv) plot3D(pc) ped <- pedigree(sire=pedMeta$fid, dam=pedMeta$mid, label=pedMeta$id) pc2 <- rppca(ped) plot3D(pc2)
3D plot of PC scores with projections on coordinate planes
plot3DWithProj( pc, dims = c(1, 2, 3), plotProj = TRUE, grid = TRUE, col = 1, ff = 0.5, theta = -45, phi = 25 )plot3DWithProj( pc, dims = c(1, 2, 3), plotProj = TRUE, grid = TRUE, col = 1, ff = 0.5, theta = -45, phi = 25 )
pc |
An |
dims |
|
plotProj |
|
grid |
|
col |
the dot colours, integer or string, scalar or vector |
ff |
|
theta, phi
|
polar coordinates in degrees. |
nothing
ped <- pedigree(pedMeta$fid, pedMeta$mid, pedMeta$id ) pc <- rppca(ped) plot3DWithProj(pc, col=as.numeric(factor(pedMeta$population)))ped <- pedigree(pedMeta$fid, pedMeta$mid, pedMeta$id ) pc <- rppca(ped) plot3DWithProj(pc, col=as.numeric(factor(pedMeta$population)))
Generate range matrix for SVD
randRangeFinder(L, rank, depth, numVectors, cent = FALSE)randRangeFinder(L, rank, depth, numVectors, cent = FALSE)
L |
a pedigree's L inverse matrix in sparse 'spam' format |
rank |
An |
depth |
|
numVectors |
An |
cent |
|
The range matrix for randSVD
Uses randomised linear algebra, see Halko et al. (2010). Singular value
decomposition (SVD) decomposes a matrix
randSVD(L, rank, depth, numVectors, cent = FALSE)randSVD(L, rank, depth, numVectors, cent = FALSE)
L |
a pedigree's L inverse matrix in sparse 'spam' format |
rank |
An |
depth |
|
numVectors |
An |
cent |
|
A list of three: u (=U), d (=Sigma), and v (=W^T)
Using Hutchinson's method
randTraceHutchinson(L, numVectors)randTraceHutchinson(L, numVectors)
L |
A pedigree's L inverse matrix |
numVectors |
an If you do not have a good reason to do otherwise, use the function The higher |
a scalar
Fast pedigree PCA using sparse matrices and randomised linear algebra
rppca(X, ...) ## S3 method for class 'spam' rppca( X, method = "randSVD", rank = 10, depth = 3, numVectors, totVar = NULL, center = FALSE, ... ) ## S3 method for class 'pedigree' rppca( X, method = "randSVD", rank = 10, depth = 3, numVectors, totVar = NULL, center = FALSE, ... )rppca(X, ...) ## S3 method for class 'spam' rppca( X, method = "randSVD", rank = 10, depth = 3, numVectors, totVar = NULL, center = FALSE, ... ) ## S3 method for class 'pedigree' rppca( X, method = "randSVD", rank = 10, depth = 3, numVectors, totVar = NULL, center = FALSE, ... )
X |
A representation of a pedigree, see Details. |
... |
optional arguments passed to methods |
method |
|
rank |
|
depth |
|
numVectors |
|
totVar |
|
center |
|
The output slots are named like those of R's built in prcomp function.
Rotation is not returned by default as it is the transpose of the PC scores,
which are returned in x. scale and center are set to FALSE.
Which method performs better depends on the number of PC requested, whether
centring is applied, and on the structure of the pedigree. As a rule of thumb,
"rspec" is faster than the default when rank is 8 or greater.
A list containing:
xthe principal components
sdevthe variance components of each PC. Note that the total variance is
not known per se and this these components cannot be used to compute the
proportion of the total variance accounted for by each PC. However, if
nVecTraceEst is specified, rppca will estimate the total variance and
return variance proportions.
vPropthe estimated variance proportions accounted for by each PC.
Only returned if totVar is set.
scalealways FALSE
centerlogical indicating whether or not the implicit data matrix was centred
rotationthe right singular values of the relationship matrix.
Only returned if returnRotation == TRUE
varPropsproportion of the total variance explained by each PC. Only
returned if starting from a pedigree object without centring, or if totVar is supplied.
pc <- rppca(pedLInv) ped <- pedigree(sire=pedMeta$fid, dam=pedMeta$mid, label=pedMeta$id ) pc2 <- rppca(ped)pc <- rppca(pedLInv) ped <- pedigree(sire=pedMeta$fid, dam=pedMeta$mid, label=pedMeta$id ) pc2 <- rppca(ped)
Convert generic sparse matrix to spam format
sparse2spam(sprs)sparse2spam(sprs)
sprs |
A sparse matrix. |
A spam sparse matrix