This vignette details how multivariate sex differences in personality
traits can be estimated using regularized logistic regression and the
gender diagnosticity approach with the *multid* package.

This approach can be used for regularized measurement of multivariate differences between any two groups. In this example, differences between females and males in “Answers to the Big Five Personality Test, constructed with items from the International Personality Item Pool” (a dataset that can be obtained from https://openpsychometrics.org/) are examined.

The predictive approach, as implemented with regularized methods, allows for examination of group-membership probabilities and their distributions across individuals. In the context of statistical prediction of sex, these distributions are an updated variant of the gender-typicality distributions used in gender diagnosticity methodology (Lippa & Connelly, 1990).

For more details and empirical examples of the integration of multivariate sex difference estimation and gender diagnosticity methodology, see Ilmarinen, Vainikainen, and Lönnqvist (2022) and Lönnqvist and Ilmarinen (2021).

You can install the released version of *multid* from CRAN with:

`install.packages("multid")`

You can install the development version from GitHub with:

```
install.packages("devtools")
devtools::install_github("vjilmari/multid")
```

Load *multid* and the other packages needed for this vignette:

```
library(rio)
library(multid)
library(dplyr)
library(overlapping)
```

In this vignette, the openly available dataset “Answers to the Big Five Personality Test, constructed with items from the International Personality Item Pool” is used. The dataset is downloaded directly from https://openpsychometrics.org (following the approach presented at https://stackoverflow.com/a/66351588/8995170), so there is no need to save it to your drive to reproduce the following analyses.

```
# create a temporary directory
td <- tempdir()

# create a temporary file
tf <- tempfile(tmpdir = td, fileext = ".zip")

# download file from internet into temporary location
download.file("http://openpsychometrics.org/_rawdata/BIG5.zip", tf)

# list zip archive
file_names <- unzip(tf, list = TRUE)

# extract files from zip file
unzip(tf, exdir = td, overwrite = TRUE)

# use when zip file has only one file
dat <- import(file.path(td, "BIG5/data.csv"))

# delete the files and directories
unlink(td)
```

```
# save names of personality items to a vector
per.items <- names(dat)[which(names(dat)=="E1"):which(names(dat)=="O10")]

# code item response 0 (zero) as not available (NA) on all these items
dat[,per.items] <- sapply(dat[,per.items], function(x){ifelse(x==0, NA, x)})

# check that the numerical range makes sense
range(dat[,per.items], na.rm = TRUE)
```

`#> [1] 1 5`

```
# calculate sum scores for Big Five (reverse-coded items subtracted from 6)
# see codebook from https://openpsychometrics.org for item content and keys
dat$N <- rowMeans(
  cbind(dat[,c("N1","N3","N5","N6","N7","N8","N9","N10")],
        6-dat[,c("N2","N4")]), na.rm = TRUE)
dat$E <- rowMeans(
  cbind(dat[,c("E1","E3","E5","E7","E9")],
        6-dat[,c("E2","E4","E6","E8","E10")]), na.rm = TRUE)
dat$O <- rowMeans(
  cbind(dat[,c("O1","O3","O5","O7","O8","O9","O10")],
        6-dat[,c("O2","O4","O6")]), na.rm = TRUE)
dat$A <- rowMeans(
  cbind(dat[,c("A2","A4","A6","A8","A9","A10")],
        6-dat[,c("A1","A3","A5","A7")]), na.rm = TRUE)
dat$C <- rowMeans(
  cbind(dat[,c("C1","C3","C5","C7","C9","C10")],
        6-dat[,c("C2","C4","C6","C8")]), na.rm = TRUE)
```

For the purpose of this example, only US responses will be analyzed for sex differences. Furthermore, only participants who reported being male (gender = 1) or female (gender = 2) and who had no missing responses are included.

```
US.dat <- na.omit(dat[dat$country=="US" & (dat$gender==1 | dat$gender==2),
                      c("gender", per.items, "N","E","O","A","C")])

# code a categorical gender variable
US.dat$gender.cat <- ifelse(US.dat$gender==1, "Male", "Female")
```

In this section, several different ways of estimating sex differences in personality are demonstrated. We start with typical univariate differences in the broad Big Five domains, followed by multivariate differences based both on the domains and on personality items as the multivariate set. Special attention is given to regularized methods, because this approach allows examination of both the mean differences and the distributions underlying them.

The **d_pooled_sd** -function is used to calculate Cohen's
*d* with a pooled standard deviation across the male and female
groups for each Big Five trait separately. The arguments include the
analyzed Big Five trait (**var**), sex as the binary
grouping variable (**group.var**), the values that
represent these groups (**group.values**), and a logical indicating
whether statistical inference is wanted (**infer**). All
arguments except statistical inference are also included in the
multivariate -function **D_regularized** presented further
below.

```
d.N <- d_pooled_sd(data = US.dat, var = "N",
                   group.var = "gender.cat", group.values = c("Male","Female"),
                   infer = TRUE)
d.E <- d_pooled_sd(data = US.dat, var = "E",
                   group.var = "gender.cat", group.values = c("Male","Female"),
                   infer = TRUE)
d.O <- d_pooled_sd(data = US.dat, var = "O",
                   group.var = "gender.cat", group.values = c("Male","Female"),
                   infer = TRUE)
d.A <- d_pooled_sd(data = US.dat, var = "A",
                   group.var = "gender.cat", group.values = c("Male","Female"),
                   infer = TRUE)
d.C <- d_pooled_sd(data = US.dat, var = "C",
                   group.var = "gender.cat", group.values = c("Male","Female"),
                   infer = TRUE)

# compile to a table
uni.d <- rbind(d.N, d.E, d.O, d.A, d.C)
rownames(uni.d) <- c("N","E","O","A","C")
round(uni.d, 2)
```

```
#> n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff D
#> N 2945 5739 2.79 3.16 0.88 0.86 0.87 -0.37 -0.42
#> E 2945 5739 3.00 3.06 0.96 0.96 0.96 -0.07 -0.07
#> O 2945 5739 4.07 3.88 0.58 0.63 0.62 0.18 0.30
#> A 2945 5739 3.68 4.03 0.75 0.68 0.71 -0.35 -0.50
#> C 2945 5739 3.38 3.44 0.72 0.75 0.74 -0.07 -0.09
#> t_stat df p
#> N -18.45 5845.38 0
#> E -3.09 5919.79 0
#> O 13.44 6458.37 0
#> A -21.33 5455.81 0
#> C -3.95 6119.69 0
```

Column D indicates Cohen’s *d*. Females score higher in
Neuroticism (N; *d* = -0.42) and Agreeableness (A; *d* =
-0.50), while males score higher in Openness (O; *d* = 0.30).
Differences in Extraversion (E) and Conscientiousness (C), albeit
statistically significant (*p* < .001), seem negligible in
magnitude (|*d*| < .10).

The average absolute difference across the Big Five traits is
therefore *d* = 0.27. Importantly, this average absolute
difference is not informative about the overall distance between males
and females. Multivariate methods should be used for examining questions
regarding the overall distance in a certain domain, such as personality.
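This average can be recomputed directly from the unrounded estimates in the `uni.d` table compiled above; a minimal sketch:

```
# average absolute univariate difference across the Big Five
mean(abs(uni.d[,"D"]))
```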

Other columns in the output are:

n.Male and n.Female = sample sizes

m.Male and m.Female = mean levels

sd.Male and sd.Female = standard deviations

pooled.sd = pooled standard deviation

diff = the raw (non-standardized) difference

t_stat = t statistic

df = Welch modified degrees of freedom

p = p-value

Although the univariate differences presented above estimate the sex
difference in each broad Big Five trait separately, these single Cohen’s
*d* estimates do not provide a clear picture of the overall
distance between males and females. Multivariate methods, which use multiple
variables and the differences in these variables at the same time, while
accounting for their covariability (intercorrelations), should be used
for understanding overall distances. The resulting *D* values can be
interpreted in the same way as Cohen’s *d*, i.e., as standardized
distances between males and females.

Mahalanobis’ *D* is the standardized distance between the male
and female group centroids in multivariate space. Its calculation requires
the correlation matrix of the multivariate set and a
vector of standardized mean differences. The ordering of the
variables must be the same in both inputs.
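In matrix notation, with $d$ the vector of standardized mean differences and $R$ the correlation matrix of the variables:

$$
D = \sqrt{d^{\top} R^{-1} d}
$$

This is exactly what `t(d_vector) %*% solve(cor_mat) %*% d_vector` computes below.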

```
# let's use the previously obtained univariate d-values
d_vector <- uni.d[,"D"]

# calculate intercorrelations
cor_mat <- cor(US.dat[,c("N","E","O","A","C")])

# test that the ordering of these variables matches
names(d_vector) == colnames(cor_mat)
```

`#> [1] TRUE TRUE TRUE TRUE TRUE`

```
# print the correlation matrix
round(cor_mat,2)
```

```
#> N E O A C
#> N 1.00 -0.25 -0.10 -0.10 -0.26
#> E -0.25 1.00 0.15 0.31 0.10
#> O -0.10 0.15 1.00 0.10 0.05
#> A -0.10 0.31 0.10 1.00 0.17
#> C -0.26 0.10 0.05 0.17 1.00
```

```
# Calculate D
D_maha <- sqrt(t(d_vector) %*% solve(cor_mat) %*% d_vector)
D_maha
```

```
#> [,1]
#> [1,] 0.7681502
```

Mahalanobis’ *D* is, of course, larger than any of the
univariate *d*’s for single variables. Here the standardized
distance between the group centroids of males and females in the
multivariate space is *D* = 0.77.

The regularized variant of *D* is calculated using a predictive
approach whereby the binary sex variable is predicted from the
multivariate variable set with regularized logistic regression. The
predicted value for each individual is the logit of the
probability of being male or female.

These femininity-masculinity (FM) scores can be used like any other variable to calculate standardized distances between the mean FM of males and the mean FM of females with pooled standard deviations. Beyond estimating mean differences, many other features of the multivariate/FM distributions can be informative.

In the **D_regularized** -function, separate partitions
of the original data are used for regularization (i.e., training) and
prediction (i.e., testing). Regularization means that the
beta coefficients of the variables in the multivariate set are penalized
through a specified function (here, the default elastic net penalty with
10-fold cross-validation is used). The obtained coefficient
weights are then applied in the prediction partition of the data to obtain FM measures
for each individual. The arguments **out** (a logical indicating whether
predictions are made “out-of-bag”) and **size** (the size of the
training partition of the data, equal for males and females) inform the
function that a partition is wanted. The multivariate
set is supplied with the **mv.vars** -argument.
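For reference, the elastic net estimate as used by glmnet minimizes the penalized negative log-likelihood, where $\ell(\beta_0,\beta)$ is the logistic log-likelihood, $\lambda$ is the penalty strength chosen by cross-validation, and $\alpha$ mixes the ridge and lasso penalties:

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{n}\,\ell(\beta_0,\beta) + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
$$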

On request, it is also possible to calculate **pcc**
(probability of correct classification), **auc** (area under
the receiver operating characteristic curve), and frequencies indicating
the proportions of each group that fall within certain cutoffs of
predicted probabilities from 0 to 1
(**pred.prob**).

```
# set seed number for reproducibility
set.seed(37127)

# decide the size of the regularization (training) data fold
# here, half of the smaller of the male and female groups is used
size <- round(min(table(US.dat$gender.cat))/2)

# run D_regularized
D_out <- D_regularized(data = US.dat,
                       mv.vars = c("N","E","O","A","C"),
                       group.var = "gender.cat",
                       group.values = c("Male","Female"),
                       out = TRUE,
                       size = size,
                       pcc = TRUE,
                       auc = TRUE,
                       pred.prob = TRUE)

# print the output
round(D_out$D, 2)
```

```
#> n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff D
#> [1,] 1473 4267 0.34 -0.3 0.86 0.79 0.81 0.64 0.79
#> pcc.Male pcc.Female pcc.total auc
#> [1,] 0.63 0.67 0.66 0.71
```

From the output, we see that the sample sizes are smaller than in the univariate estimates, because they reflect only the prediction/testing partition of the data.

The standardized multivariate difference between males and females was
estimated at *D* = 0.79, very similar in magnitude to the
multivariate sex difference indexed by Mahalanobis’ *D*.

There are additional features in this output that might be of
interest. Similarly to the univariate *d* estimates, we get
descriptive statistics for both groups on the FM-score metric, as well as a
pooled estimate of the standard deviation that is also used in the
calculation of D. **pcc** indicates the probability of
correctly classified individuals, and **auc** is the area
under the receiver operating characteristic curve. These are both commonly
used indices of predictive performance; here they index the
performance of the broad Big Five domains in predicting sex. The
predicted probabilities of being male across certain cutoffs (the defaults
were used in this example) for both groups can be obtained from the
**P.table** part of the output:

`round(100*D_out$P.table,1)`

```
#>
#> [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#> Female 7.2 39.0 37.0 14.1 2.7
#> Male 1.2 18.1 37.1 32.6 11.0
```

From this table we see that the majority of males are nevertheless not very clearly classified as males based on the Big Five personality traits (11% in the 0.80-1 range), and the same is true for classifying females as females (7% in the 0-0.20 range).

For examining the regularized beta weights obtained in the training
phase, call **coefficients** on the **cv.mod**
object of the output. The argument **s** is the value of the
penalty parameter lambda at which the predictions are calculated. The
default “lambda.min” uses the value at which the mean cross-validated
error is minimized, and “lambda.1se” uses the largest value of lambda
such that the error is within 1 standard error of the minimum. To obtain
predicted values for a certain **s**, it must already be supplied
to the **D_regularized** -function, but the weights
given by the different approaches can be examined regardless of the chosen
value of **s**. It is also possible to plot the
cross-validation curve to visualize how “lambda.min” and “lambda.1se”
were obtained. For more information on regularized regression conducted
with the glmnet package, please refer to https://glmnet.stanford.edu/articles/glmnet.html.

```
coefficients(D_out$cv.mod,
s = "lambda.min")
```

```
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) 3.37297381
#> N -0.54865910
#> E -0.02070834
#> O 0.64983681
#> A -0.88854735
#> C -0.23751055
```

```
coefficients(D_out$cv.mod,
s = "lambda.1se")
```

```
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#> s1
#> (Intercept) 2.3670214
#> N -0.3900418
#> E .
#> O 0.4626928
#> A -0.6930196
#> C -0.1053781
```

`plot(D_out$cv.mod)`

The same pattern can also be observed by visually inspecting the
distributions of the FM-scores and assessing the
**overlap** between the FM-score distributions of males
and females.
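For two normal distributions with equal variances separated by a standardized difference $D$, the overlapping coefficient relative to a single distribution (OVL) and its variant relative to the joint distribution (OVL2) are:

$$
\mathrm{OVL} = 2\,\Phi\!\left(-\frac{D}{2}\right), \qquad
\mathrm{OVL2} = \frac{\mathrm{OVL}}{2-\mathrm{OVL}}
$$

where $\Phi$ is the standard normal cumulative distribution function. These are the parametric formulas applied in the code below.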

```
# Obtain the D-estimate
D <- unname(D_out$D[,"D"])

# Overlap relative to a single distribution (OVL)
OVL <- 2*pnorm(-D/2)

# Proportion of overlap relative to the joint distribution
OVL2 <- OVL/(2-OVL)

# non-parametric overlap with the overlapping package
np.overlap <- overlap(x = list(
  D_out$pred.dat[D_out$pred.dat$group=="Male","pred"],
  D_out$pred.dat[D_out$pred.dat$group=="Female","pred"]),
  plot = TRUE)
```

```
# this corresponds to the proportion of overlap relative to the joint distribution (OVL2)
np.OVL2 <- unname(np.overlap$OV)

# from which the proportion of overlap relative to a single distribution (OVL) is approximated
np.OVL <- (2*np.OVL2)/(1+np.OVL2)

# comparison of the different overlap estimates
round(cbind(OVL, np.OVL, OVL2, np.OVL2), 2)
```

```
#> OVL np.OVL OVL2 np.OVL2
#> [1,] 0.69 0.7 0.53 0.54
```

In this example, the non-parametric (np) overlap estimates for these distributions are almost identical to the parametric overlap estimates.

To obtain confidence intervals for the *D* estimates in the
regularized techniques, a bootstrap approach can be used: *D* is
estimated in multiple bootstrap samples redrawn from the
original sample, and the distribution of these *D* values is used
to construct a percentile confidence interval around the estimate.
In this example, we take only 10 redraws with replacement, but in
practice one should take at least 1,000 or 5,000.

```
# number of redraws
reps <- 10

# placeholder for D-estimates
D_out_boot <- rep(NA, reps)

# repeat the procedure to obtain a distribution
t1 <- Sys.time()
for (i in 1:reps){
  # draw samples with replacement of the same size as the original data
  boot.dat <- sample_n(US.dat, size = nrow(US.dat), replace = TRUE)
  # run D_regularized for each sample
  D_out_boot[i] <- D_regularized(data = boot.dat,
                                 mv.vars = c("N","E","O","A","C"),
                                 group.var = "gender.cat",
                                 group.values = c("Male","Female"),
                                 out = TRUE,
                                 size = size)$D[,"D"]
}
t2 <- Sys.time()
t2-t1
```

`#> Time difference of 6.277245 secs`

```
# print 95% CI percentile interval
round(quantile(D_out_boot,c(.025,.975)),2)
```

```
#> 2.5% 97.5%
#> 0.77 0.83
```

Thus, it can be reported that the multivariate sex difference
*D* = 0.79, 95% CI [0.77, 0.83].

Recently, multiple studies in personality psychology have highlighted the importance of narrow personality traits, also in terms of their ability to predict criteria over and above the more commonly used broad trait domains, such as the Big Five (e.g., Seeboth & Mõttus, 2018). These narrow personality traits are usually facets or nuances that are commonly modeled as being part of a certain Big Five trait, although their unique predictive utility suggests that they have trait-like properties (stability, cross-informant agreement, etc.) that go beyond the broad Big Five domains (e.g., Mõttus et al., 2019). That is, the narrower traits can be considered unique traits as well. It is therefore interesting to see whether using item scores as the variable set in multivariate estimation would also yield better predictions of sex, and different estimates of sex differences (see Ilmarinen et al., 2022).

Because the number of variables is increased (from five to fifty
in this dataset), the use of regularization methods and different data
partitions for training and testing is strongly recommended to avoid
overfitting the data (for illustrative purposes, Mahalanobis’ *D*
is also calculated below).

```
# placeholder vector for univariate Cohen d's for items
d_vector_items <- rep(NA, length(per.items))

# loop through the items and obtain d for each item
for (i in 1:length(per.items)){
  d_vector_items[i] <- d_pooled_sd(data = US.dat,
                                   var = per.items[i],
                                   group.var = "gender.cat",
                                   group.values = c("Male","Female"))[,"D"]
}

cor_mat_items <- cor(US.dat[,per.items])

# Calculate D
D_maha_items <- sqrt(t(d_vector_items) %*% solve(cor_mat_items) %*% d_vector_items)
D_maha_items
```

```
#> [,1]
#> [1,] 0.9212846
```

The item-based Mahalanobis’ *D* is somewhat larger than any of
those based on the Big Five domains. The standardized distance between the
group centroids of males and females in a space comprised of personality
items is *D* = 0.92.

It is straightforward to use the **D_regularized** -function even with large variable sets (especially when the variable names are saved in a vector first). Separate folds are again used for regularization and prediction. The domain-based results are printed for comparison.

```
D_items_out <- D_regularized(data = US.dat,
                             mv.vars = per.items,
                             group.var = "gender.cat",
                             group.values = c("Male","Female"),
                             out = TRUE,
                             size = size, pcc = TRUE, auc = TRUE, pred.prob = TRUE)

# Item-based D
round(D_items_out$D, 2)
```

```
#> n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff D
#> [1,] 1473 4267 0.46 -0.4 0.9 0.86 0.87 0.86 0.99
#> pcc.Male pcc.Female pcc.total auc
#> [1,] 0.7 0.69 0.69 0.76
```

```
# Big Five domain-based D
round(D_out$D,2)
```

```
#> n.Male n.Female m.Male m.Female sd.Male sd.Female pooled.sd diff D
#> [1,] 1473 4267 0.34 -0.3 0.86 0.79 0.81 0.64 0.79
#> pcc.Male pcc.Female pcc.total auc
#> [1,] 0.63 0.67 0.66 0.71
```

With items as the multivariate set, *D* = 0.99 is slightly
larger than with Big Five trait domains, *D* = 0.79.

Differences in predictive performance can also be illustrated from comparing the predicted probabilities.

`round(100*D_items_out$P.table,1)`

```
#>
#> [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#> Female 11.6 38.5 33.4 14.3 2.2
#> Male 1.6 15.1 32.9 35.4 15.0
```

`round(100*D_out$P.table,1)`

```
#>
#> [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
#> Female 7.2 39.0 37.0 14.1 2.7
#> Male 1.2 18.1 37.1 32.6 11.0
```

We see that there are more people in the extremes, but based on the pcc and auc values printed above, the overall predictive performance is not drastically improved. A large proportion (more than 30%) of the data is still classified in the 40%-60% range, meaning that these personality profiles are rather uninformative with regard to sex.

Overlaps give similar information but in another format:

```
# Obtain the D-estimate
D_items <- unname(D_items_out$D[,"D"])

# Overlap relative to a single distribution (OVL)
OVL_items <- 2*pnorm(-D_items/2)

# Proportion of overlap relative to the joint distribution
OVL2_items <- OVL_items/(2-OVL_items)

# non-parametric overlap
np.overlap_items <- overlap(x = list(
  D_items_out$pred.dat[D_items_out$pred.dat$group=="Male","pred"],
  D_items_out$pred.dat[D_items_out$pred.dat$group=="Female","pred"]),
  plot = TRUE)
```

```
# this corresponds to the proportion of overlap relative to the joint distribution (OVL2)
np.OVL2_items <- unname(np.overlap_items$OV)

# from which the proportion of overlap relative to a single distribution (OVL) is approximated
np.OVL_items <- (2*np.OVL2_items)/(1+np.OVL2_items)

# compare the different overlap estimates
round(cbind(OVL_items, np.OVL_items, OVL2_items, np.OVL2_items), 2)
```

```
#> OVL_items np.OVL_items OVL2_items np.OVL2_items
#> [1,] 0.62 0.63 0.45 0.46
```

```
# print Big Five domain based overlaps for comparison
round(cbind(OVL,np.OVL,OVL2,np.OVL2),2)
```

```
#> OVL np.OVL OVL2 np.OVL2
#> [1,] 0.69 0.7 0.53 0.54
```

The overlap between the FM-distributions of males and females is somewhat smaller when individual items are used as the multivariate set rather than the broad Big Five domains.

`sessionInfo()`

```
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=Finnish_Finland.utf8
#> [3] LC_MONETARY=Finnish_Finland.utf8 LC_NUMERIC=C
#> [5] LC_TIME=Finnish_Finland.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] overlapping_1.8 testthat_3.1.5 ggplot2_3.3.5 dplyr_1.0.10
#> [5] multid_0.7.1 rio_0.5.29
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.9 lattice_0.20-45 assertthat_0.2.1 glmnet_4.1-4
#> [5] digest_0.6.29 foreach_1.5.2 utf8_1.2.2 R6_2.5.1
#> [9] cellranger_1.1.0 plyr_1.8.7 evaluate_0.17 highr_0.9
#> [13] pillar_1.8.1 rlang_1.0.6 curl_4.3.2 readxl_1.4.0
#> [17] rstudioapi_0.13 data.table_1.14.2 jquerylib_0.1.4 Matrix_1.5-0
#> [21] rmarkdown_2.15 labeling_0.4.2 splines_4.2.2 stringr_1.4.0
#> [25] foreign_0.8-83 munsell_0.5.0 compiler_4.2.2 xfun_0.30
#> [29] pkgconfig_2.0.3 shape_1.4.6 htmltools_0.5.2 tidyselect_1.2.0
#> [33] tibble_3.1.8 codetools_0.2-18 fansi_1.0.3 withr_2.5.0
#> [37] brio_1.1.3 grid_4.2.2 jsonlite_1.8.2 gtable_0.3.0
#> [41] lifecycle_1.0.3 DBI_1.1.2 magrittr_2.0.3 pROC_1.18.0
#> [45] scales_1.2.0 zip_2.2.0 cli_3.4.1 stringi_1.7.6
#> [49] farver_2.1.0 bslib_0.3.1 ellipsis_0.3.2 generics_0.1.3
#> [53] vctrs_0.4.2 openxlsx_4.2.5 iterators_1.0.14 tools_4.2.2
#> [57] forcats_0.5.1 glue_1.6.2 hms_1.1.1 fastmap_1.1.0
#> [61] survival_3.4-0 yaml_2.3.5 colorspace_2.0-3 knitr_1.39
#> [65] haven_2.5.0 sass_0.4.1
```