CRAN Task View: Official Statistics & Survey Statistics
|Maintainer:||Matthias Templ, Alexander Kowarik, Tobias Schoch|
|Contact:||matthias.templ at gmail.com|
|Contributions:||Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide.|
|Citation:||Matthias Templ, Alexander Kowarik, Tobias Schoch (2023). CRAN Task View: Official Statistics & Survey Statistics. Version 2023-02-19. URL https://CRAN.R-project.org/view=OfficialStatistics.|
|Installation:||The packages from this task view can be installed automatically using the ctv package. For example, |
ctv::install.views("OfficialStatistics", coreOnly = TRUE) installs all the core packages or
ctv::update.views("OfficialStatistics") installs all packages that are not yet installed and up-to-date. See the CRAN Task View Initiative for more details.
This CRAN Task View contains a list of packages with methods typically used in official statistics and survey statistics. Many packages provide functions for more than one of the topics listed below. Therefore, this list is not a strict categorization and packages may be listed more than once.
The task view is split into three parts:
- First part: “Producing Official Statistics”. This first part is targeted at people working at national statistical institutes, national banks, international organizations, etc. who are involved in the production of official statistics and using methods from survey statistics. It is loosely aligned to the “Generic Statistical Business Process Model”.
- Second part: “Access to Official Statistics”. This second part’s target audience is everyone interested in using official statistics results directly from within R.
- Third part: “Related Methods”. This third part shows packages that are important in official and survey statistics but do not directly fit into the production of official statistics. It is complemented by a subsection on “Miscellaneous”, a collection of packages that are loosely linked to official statistics or that provide limited complements to official statistics and survey methods.
First Part: Production of Official Statistics
1 Preparations/ Management/ Planning (questionnaire design, etc.)
- questionr package contains a set of functions to make the processing and analysis of surveys easier. It provides interactive shiny apps and addins for data recoding, contingency tables, dataset metadata handling, and several convenience functions.
- surveydata makes it easy to keep track of metadata from surveys, and to easily extract columns with specific questions.
- blaise implements functions for reading and writing files in the Blaise Format (Statistics Netherlands).
2 Sampling
- sampling includes many different algorithms (Brewer, Midzuno, pps, systematic, Sampford, balanced (cluster or stratified) sampling via the cube method, etc.) for drawing survey samples and calibrating the design weights.
- pps contains functions to select samples using pps sampling. Stratified simple random sampling is also possible, as is the computation of joint inclusion probabilities for Sampford’s method of pps sampling.
- BalancedSampling provides functions to select balanced and spatially balanced probability samples in multi-dimensional spaces with any prescribed inclusion probabilities. It also includes the local pivot method, the cube and local cube method and a few more methods.
- PracTools contains functions for sample size calculation for survey samples using stratified or clustered one-, two-, and three-stage sample designs as well as functions to compute variance components for multistage designs and sample sizes in two-phase designs.
- surveyplanning includes tools for sample survey planning, including sample size calculation, estimation of expected precision for the estimates of totals, and calculation of optimal sample size allocation.
- stratification allows univariate stratification of survey populations with a generalisation of the Lavallee-Hidiroglou method.
- SamplingStrata offers an approach for choosing the best stratification of a sampling frame in a multivariate and multidomain setting, where the sample sizes in each stratum are determined in order to satisfy accuracy constraints on target estimates. To evaluate the distribution of target variables in different strata, information from the sampling frame, or data from previous rounds of the same survey, may be used.
- R2BEAT provides functions for multivariate, domain-specific optimal sample size allocation for one- and two-stage stratified sampling designs (i.e., generalization of the allocation methods of Neyman and Tschuprow to the case of several variables).
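As a rough illustration of the allocation idea mentioned for R2BEAT, the following base R sketch computes a univariate Neyman allocation, where the stratum sample size is proportional to the stratum size times the stratum standard deviation. The function name `neyman_allocation` and all numbers are invented for this example; it does not use or reproduce the API of any package listed above, and it neither rounds the result nor caps it at the stratum size.

```r
# Univariate Neyman allocation (base R only, illustrative values).
# N_h: stratum population sizes; S_h: stratum standard deviations of the
# target variable; n: total sample size to distribute.
neyman_allocation <- function(N_h, S_h, n) {
  stopifnot(length(N_h) == length(S_h), all(N_h > 0), all(S_h >= 0), n > 0)
  share <- N_h * S_h / sum(N_h * S_h)  # n_h is proportional to N_h * S_h
  n * share                            # real-valued; round in practice
}

N_h <- c(1000, 500, 100)   # hypothetical stratum sizes
S_h <- c(2, 4, 10)         # hypothetical stratum standard deviations
alloc <- neyman_allocation(N_h, S_h, n = 100)
print(alloc)
```

The small but highly variable third stratum receives a much larger sampling fraction than the first, which is the intended effect of this allocation rule.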
3 Data Collection (incl. record linkage)
3.1 Data Integration (Statistical Matching and Record Linkage)
- StatMatch provides functions to perform statistical matching between two data sources sharing a number of common variables. It creates a synthetic data set after matching of two data sources via a likelihood approach or via hot-deck.
- MatchIt allows nearest neighbor matching, exact matching, optimal matching and full matching amongst other matching methods. If two data sets have to be matched, the data must come as one data frame including a factor variable which includes information about the membership of each observation.
- MatchThem provides tools for matching and weighting multiply imputed datasets to control for effects of confounders. Multiply imputed data files from mice and Amelia can be used directly.
- stringdist can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler).
- reclin is a record linkage toolkit to assist in performing probabilistic record linkage and deduplication.
- XBRL allows the extraction of business financial information from XBRL Documents.
- RecordLinkage implements the Fellegi-Sunter method for record linkage.
- fastLink implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information. Documentation can be found on http://imai.princeton.edu/research/linkage.html
- fuzzyjoin provides functions for joining tables based on exact or similar matches. It allows for matching records based on inaccurate keys.
- PPRL implements privacy-preserving record linkage, which is especially useful when personal IDs cannot be used to link two data sets. This approach protects the identity of persons.
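To make the idea of linking on inaccurate keys concrete, here is a minimal sketch that uses the edit (Levenshtein) distance from base R's `utils::adist()` to find the closest candidate name in a second list. The names are invented, and this is only a nearest-string lookup, not the probabilistic Fellegi-Sunter approach implemented by the packages above.

```r
# Edit-distance matching of names across two sources (base R only).
names_a <- c("Jon Smith", "Maria Gonzales")                   # source A (invented)
names_b <- c("John Smith", "Maria Gonzalez", "Peter Miller")  # source B (invented)

# Distance matrix: rows correspond to names_a, columns to names_b
d <- utils::adist(names_a, names_b)
dimnames(d) <- list(names_a, names_b)

# For each name in A, pick the candidate in B with the smallest distance
best <- names_b[apply(d, 1, which.min)]
print(d)
print(data.frame(names_a, best))
```

In practice a distance threshold and blocking variables would be needed to avoid accepting poor matches.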
3.2 Web Scraping
Web scraping is nowadays used more frequently in the production of official statistics. For example, in price statistics, the collection of product prices, formerly done by hand over the web or by in-person visits to stores, is being replaced by scraping specific homepages. Tools for this process step are not listed here, but a detailed overview can be found in the CRAN task view on WebTechnologies.
4 Data Processing
4.1 Weighting and Calibration
- survey allows for post-stratification, generalized raking/calibration, GREG estimation and trimming of weights.
- sampling provides the function calib() to calibrate for nonresponse (with response homogeneity groups) for stratified samples.
- laeken provides the function calibWeights() for calibration, which is possibly faster (depending on the example) than calib() from sampling.
- icarus focuses on calibration and re-weighting in survey sampling and was designed to provide a familiar setting in R for users of the SAS macro Calmar developed by INSEE.
- CalibrateSSB includes a function to calculate weights and estimates for panel data with non-response.
- Frames2 allows point and interval estimation in dual frame surveys. When two probability samples (one from each frame) are drawn, the information collected is suitably combined to get estimators of the parameter of interest.
- surveysd provides calibration by iterative proportional fitting, a calibrated bootstrap optimized for complex surveys and error estimation based on it.
- inca performs calibration weighting with integer weights.
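The iterative proportional fitting (raking) idea behind several of these packages can be sketched in a few lines of base R. The helper names `margin` and `rake_weights`, the sample and the population margins are all made up for illustration; the packages above offer bounded, trimmed and far more robust variants.

```r
# Raking of survey weights to two known population margins (base R only).
margin <- function(w, f) vapply(split(w, f), sum, numeric(1))

rake_weights <- function(w, f1, f2, target1, target2,
                         tol = 1e-8, max_iter = 100) {
  f1 <- as.character(f1)
  f2 <- as.character(f2)
  for (i in seq_len(max_iter)) {
    # Scale weights so that the first margin is reproduced exactly
    w <- w * unname(target1[f1] / margin(w, f1)[f1])
    # Scale weights so that the second margin is reproduced exactly
    w <- w * unname(target2[f2] / margin(w, f2)[f2])
    # Stop once the first margin is still reproduced after the second step
    m1 <- margin(w, f1)
    if (max(abs(m1 - target1[names(m1)])) < tol) break
  }
  w
}

sex        <- c("f", "f", "m", "m", "m", "f")
region     <- c("north", "south", "north", "south", "south", "north")
w0         <- rep(10, 6)                  # initial design weights (invented)
pop_sex    <- c(f = 52, m = 48)           # hypothetical known totals
pop_region <- c(north = 45, south = 55)   # hypothetical known totals

w <- rake_weights(w0, sex, region, pop_sex, pop_region)
print(round(margin(w, sex), 4))
print(round(margin(w, region), 4))
```

Both sets of totals must sum to the same population size, and every cell of the cross-classification needs at least one respondent for the procedure to converge.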
4.2 Editing (including outlier detection)
- validate includes rule management and data validation, and package validatetools checks and simplifies sets of validation rules.
- errorlocate includes error localisation based on the principle of Fellegi and Holt. It supports categorical and/or numeric data and linear equalities, inequalities and conditional rules. The package includes a configurable backend for MIP-based error localization.
- editrules converts readable linear (in)equalities into matrix form.
- deducorrect depends on package editrules and applies deductive correction of simple rounding, typing and sign errors based on balanced edits. Values are changed so that the given balanced edits are fulfilled. To determine which values are changed, the Levenshtein metric is applied.
- deductive allows for data correction and imputation using deductive methods.
- rspa implements functions to minimally adjust numerical records so they obey (in)equation restrictions.
- surveyoutliers winsorizes values of a variable of interest.
- univOutl includes various methods for detecting univariate outliers, e.g. the Hidiroglou-Berthelot method.
- extremevalues is designed to detect univariate outliers based on modeling the bulk distribution.
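The general idea of checking records against edit rules can be illustrated with base R alone. The rules and the data below are invented, and the sketch only flags violations; it does not use the validate API and does not localise or correct errors as errorlocate or deducorrect do.

```r
# Evaluate a set of edit rules on each record (base R only, invented data).
records <- data.frame(
  age    = c(34, -2, 51, 130),
  income = c(2500, 1800, NA, 4000),
  tax    = c(500, 2000, 300, 800)
)

# Each rule is an unevaluated expression over the column names
rules <- list(
  age_in_range     = quote(age >= 0 & age <= 120),
  income_nonneg    = quote(income >= 0),
  tax_below_income = quote(tax <= income)
)

# Logical matrix: one row per record, one column per rule.
# NA means the rule could not be evaluated because of a missing value.
checks <- sapply(rules, function(rule) eval(rule, envir = records))

# Number of rules each record violates (unevaluable rules are not counted)
n_failed <- rowSums(!checks, na.rm = TRUE)
print(checks)
print(n_failed)
```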
4.3 Imputation
A general overview of imputation methods can be found in the CRAN Task View on Missing Data, MissingData. However, most of the methods presented there do not take into account the specificities of surveys with complex designs, i.e., they are not specifically designed for official statistics and surveys. For example, the criteria for applying a method often depend on the scale of the data, which in official statistics are usually a mixture of continuous, semi-continuous, binary, categorical, and count variables. In addition, measurement error can greatly affect non-robust imputation methods.
Packages commonly used within statistical agencies are VIM and simputation, which implement fast k-nearest neighbor (knn) algorithms for general distances and (robust) EM-based multiple imputation algorithms.
4.4 Seasonal Adjustment
Seasonal adjustment is an important step in producing official statistics, and only a very limited set of methodologies is frequently used here, e.g. X13-ARIMA-SEATS developed by the US Census Bureau. R packages for this can be found in the seasonal adjustment section of the CRAN Task View TimeSeries.
5 Analysis of Survey Data
5.1 Estimation and Variance Estimation
- survey works with survey samples. It allows specifying a complex survey design (stratified sampling design, cluster sampling, multi-stage sampling and pps sampling with or without replacement). Once the given survey design is specified within the function svydesign(), point and variance estimates can be computed. The resulting object can be used to estimate (Horvitz-Thompson-) totals, means, ratios and quantiles for domains or the whole survey sample, and to apply regression models. Variance estimation for means, totals and ratios can be done either by Taylor linearization or resampling (BRR, jackknife, bootstrap or user-defined).
- robsurvey provides functions for the computation of robust (outlier-resistant) estimators of finite population characteristics (means, totals, ratios, regression, etc.) using weight reduction, trimming, winsorization and M-estimation. The package complements survey.
- surveysd offers calibration, bootstrap and error estimation for complex surveys (incl. rotational designs).
- gustave provides a toolkit for analytical variance estimation in survey sampling.
- lavaan.survey provides a wrapper function for packages survey and lavaan. It can be used for fitting structural equation models (SEM) on samples from complex designs (clustering, stratification, sampling weights, and finite population corrections). Using the design object functionality from package survey, lavaan objects are re-fit (corrected) with the lavaan.survey() function. This function also accommodates replicate weights and multiply imputed datasets.
- vardpoor calculates the linearisation of several nonlinear population statistics, variance estimation of sample surveys by the ultimate cluster method, variance estimation for longitudinal and cross-sectional measures, and measures of change for any stage cluster sampling designs.
- rpms fits a linear model to survey data in each node obtained by recursively partitioning the data. The algorithm accounts for one stage of stratification and clustering as well as unequal probability of selection.
- collapse implements advanced and computationally fast methods for grouped and weighted statistics and multi-type data aggregation (e.g. mean, variance, statistical mode etc.), fast (grouped, weighted) transformations of time series and panel data (e.g. scaling, centering, differences, growth rates), and fast (grouped, weighted, panel-decomposed) summary statistics for complex multilevel / panel data.
- srvyr is inspired by the syntactic style of the dplyr package (i.e., piping, verbs like summarize). It offers summary statistics for design objects of the survey package.
- weights provides a variety of functions for producing simple weighted statistics, such as weighted Pearson’s correlations, partial correlations, Chi-Squared statistics, histograms and t-tests.
- svrep provides tools for creating, updating and analyzing survey replicate weights as an extension of survey. Non-response adjustments to both full-sample and replicate weights can be applied. Bootstrap replicate weights can be created for a variety of sampling designs, including stratified multistage samples and samples selected using systematic or unequal probability sampling.
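To show what design-based point and variance estimation amounts to in the simplest case, the sketch below estimates a population total under stratified simple random sampling without replacement, including the finite population correction. The function name `strat_total` and the data are invented; the packages above handle far more general designs and should be preferred in practice.

```r
# Estimated total and standard error under stratified SRS without
# replacement (base R only, invented data).
strat_total <- function(y, stratum, N_h) {
  groups <- split(y, stratum)
  n_h    <- vapply(groups, length, integer(1))  # sample size per stratum
  ybar_h <- vapply(groups, mean, numeric(1))    # sample mean per stratum
  s2_h   <- vapply(groups, var, numeric(1))     # sample variance per stratum
  N_h    <- N_h[names(groups)]                  # align population sizes
  total  <- sum(N_h * ybar_h)
  # Variance estimate with finite population correction (1 - n_h / N_h)
  v      <- sum(N_h^2 * (1 - n_h / N_h) * s2_h / n_h)
  c(total = total, se = sqrt(v))
}

y       <- c(10, 12, 14, 20, 24, 28, 32)
stratum <- c("a", "a", "a", "b", "b", "b", "b")
N_h     <- c(a = 30, b = 40)   # hypothetical stratum population sizes

est <- strat_total(y, stratum, N_h)
print(est)
```

Each stratum needs at least two sampled units, otherwise the within-stratum variance is undefined.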
5.2 Visualization
- VIM is designed to visualize missing values using suitable plot methods. It can be used to analyse the structure of missing values in microdata using univariate, bivariate, multiple and multivariate plots where the information of missing values from specified variables are highlighted in selected variables. It also comes with a graphical user interface.
- longCatEDA extends the matrixplot from package VIM to check for monotone missingness in longitudinal data.
- treemap provides treemaps. A treemap is a space-filling visualization of aggregates of data with hierarchical structures. Colors can be used to highlight differences between comparable aggregates.
- tmap offers a layer-based way to make thematic maps, like choropleths and bubble maps.
- rworldmap outlines how to map country-referenced data and supports users in visualizing their own data. Examples are given, e.g., maps for the World Bank and UN. It also provides new ways to visualize maps.
6 Statistical Disclosure Control
Data from statistical agencies and other institutions are mostly confidential in their raw form, and data providers have to ensure confidentiality both by modifying the original data so that no statistical units can be re-identified and by guaranteeing a minimum amount of information loss.
Unit-level data (microdata)
- sdcMicro can be used to anonymize data, i.e. to create anonymized files for public and scientific use. It implements a wide range of methods for anonymizing categorical and continuous (key) variables. The package also contains a graphical user interface, which is available by calling the function
- simPop uses linear and robust regression methods, random forests (and many more methods) to simulate synthetic data from given complex data. It is also suitable for producing synthetic data when the data have hierarchical and cluster information (such as persons in households) as well as when the data have been collected with a complex sampling design. It makes use of parallel computing internally.
- synthpop uses regression tree methods to simulate synthetic data from given data. It is suitable for producing synthetic data when the data have no hierarchical and cluster information (such as households) and when the data were not collected with a complex sampling design.
Aggregated information (tabular data)
- sdcTable can be used to provide confidential (hierarchical) tabular data. It includes the HITAS and the HYPERCUBE technique and uses linear programming packages (Rglpk and lpSolveAPI) for solving (a large number of) linear programs.
- sdcSpatial can be used to smooth and/or suppress raster cells in a map. This is useful when plotting raster-based counts on a map.
- sdcHierarchies provides methods to generate, modify, import and convert nested hierarchies that are often used when defining inputs for statistical disclosure control methods.
- SmallCountRounding can be used to protect frequency tables by rounding necessary inner cells so that cross-classifications to be published are safe.
- GaussSuppression can be used to protect tables by suppression using the Gaussian elimination secondary suppression algorithm.
- DSI is an interface to DataSHIELD. DataSHIELD is an infrastructure and series of R packages that enables the remote and non-disclosive analysis of sensitive research data.
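A first step in many disclosure risk assessments for microdata is to count how often each combination of key variables occurs. The base R sketch below flags records whose key combination is shared by fewer than k units, i.e. records that violate k-anonymity. The data, the choice of key variables and the threshold are invented, and the sketch does not reproduce the risk measures or the API of sdcMicro.

```r
# Frequency counts of key-variable combinations and a k-anonymity check
# (base R only, invented data).
micro <- data.frame(
  age_group = c("20-29", "20-29", "30-39", "30-39", "30-39", "40-49"),
  region    = c("north", "north", "south", "south", "south", "north"),
  income    = c(2100, 2500, 3100, 2900, 3300, 4000)
)
key_vars <- c("age_group", "region")  # assumed quasi-identifiers
k <- 2                                # assumed anonymity threshold

# Build one key string per record and count records sharing that key
key <- do.call(paste, c(micro[key_vars], sep = "|"))
fk  <- ave(rep(1, nrow(micro)), key, FUN = sum)

# Records with fewer than k units in their key combination are at risk
at_risk <- fk < k
print(cbind(micro[key_vars], fk, at_risk))
```

Records flagged here would be candidates for recoding or local suppression before release.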
Second Part: Access to Official Statistics
Access to data from international organizations and multiple organizations
- OECD searches and extracts data from the OECD.
- Rilostat contains tools to download data from the International Labour Organization (ILO) database together with search and manipulation utilities. It can also import ilostat data that are available in their database in SDMX format.
- eurostat provides search for and access to data from Eurostat, the statistical agency for the European Union.
- ipumsr provides an easy way to import census, survey and geographic data provided by IPUMS.
- FAOSTAT can be used to download data from the FAOSTAT database of the Food and Agricultural Organization (FAO) of the United Nations.
- pxweb provides a generic interface for the PX-Web/PC-Axis API used by many National Statistical Agencies.
- PxWebApiData provides easy API access to e.g. Statistics Norway, Statistics Sweden and Statistics Finland.
- rdhs interacts with The Demographic and Health Surveys (DHS) Program datasets.
- prevR implements functions (see import.dhs()) to import data from the Demographic Health Survey.
- rsdmx provides easy access to data from statistical organisations that support SDMX web services. The package contains a list of SDMX access points of various national and international statistical institutes.
- readsdmx implements functions to read SDMX into data frames from a local SDMX-ML file or web service. By OECD.
- regions offers tools to process regional statistics focusing on European data.
- statcodelists makes the internationally standardized SDMX code lists available for the R user.
- rdbnomics provides access to the DB.nomics database on macroeconomic data from 38 official providers such as INSEE, Eurostat, World Bank, etc.
- iotables makes input-output tables tidy, and allows for economic and environmental impact analysis with formatting the data received from the Eurostat data warehouse into appropriate, validated, matrix forms.
- npi provides access to the API for the U.S. National Provider Identifier Registry, which is the authoritative data source for National Provider Identifier records in the healthcare domain.
Access to data from national organizations
- tidyqwi provides an API for accessing the United States Census Bureau’s Quarterly Workforce Indicators.
- tidyBdE provides access to official statistics provided by the Spanish Banking Authority Banco de Espana.
- cancensus provides access to Statistics Canada’s Census data with the option to retrieve all data as spatial data.
- sorvi provides access to Finnish open government data.
- insee searches and extracts data from the Insee’s BDM database.
- acs downloads, manipulates, and presents the American Community Survey and decennial data from the US Census.
- censusapi implements a wrapper for the U.S. Census Bureau APIs that returns data frames of Census data and meta data.
- censusGeography (archived) converts specific United States Census geographic code for city, state (FIP and ICP), region, and birthplace.
- idbr implements functions to make requests to the US Census Bureau’s International Data Base API.
- tidycensus provides an integrated R interface to the decennial US Census and American Community Survey APIs and the US Census Bureau’s geographic boundary files.
- inegiR provides access to data published by INEGI, Mexico’s official statistics agency.
- cbsodataR provides access to Statistics Netherlands’ (CBS) open data API.
- EdSurvey includes analysis of NCES Education Survey and Assessment Data.
- nomisr gives access to Nomis UK Labour Market Data including Census and Labour Force Survey.
- readabs implements functions to download and tidy time series data from the Australian Bureau of Statistics.
- BIFIEsurvey includes tools for survey statistics in educational assessment including data with replication weights (e.g. from bootstrap).
- CANSIM2R provides functions to extract CANSIM (Statistics Canada) tables and transform them into readily usable data.
- statcanR provides an R connection to Statistics Canada’s Web Data Service. Open economic data (formerly CANSIM tables) are accessible as a data frame in the R environment.
- cdlTools provides functions to download USDA National Agricultural Statistics Service (NASS) cropscape data for a specified state.
- csodata provides functions to download data from Central Statistics Office (CSO) of Ireland.
Third Part: Related Methods
Small Area Estimation
- sae provides functions for small area estimation (basic area- and unit-level model, Fay-Herriot model with spatial/ temporal correlations), for example, direct estimators, the empirical best predictor and composite estimators.
- rsae provides functions to estimate the parameters of the basic unit-level small area estimation (SAE) model (aka nested error regression model) by means of maximum likelihood (ML) or robust M-estimation. On the basis of the estimated parameters, robust predictions of the area-specific means are computed (incl. MSE estimates; parametric bootstrap).
- emdi provides functions that support estimating, assessing and mapping regional disaggregated indicators. So far, estimation methods comprise direct estimation, the model-based unit-level approach Empirical Best Prediction, the area-level model and various extensions of it, as well as their precision estimates. The assessment of the used model is supported by a summary and diagnostic plots. For a suitable presentation of estimates, map plots can be easily created and exported.
- hbsae provides functions to compute small area estimates based on a basic area or unit-level model. The model is fit using restricted maximum likelihood, or in a hierarchical Bayesian way. Auxiliary information can be either counts resulting from categorical variables or means from continuous population information.
- BayesSAE provides Bayesian estimation methods that range from the basic Fay-Herriot model to its improvement such as You-Chapman models, unmatched models, spatial models and so on.
- SAEval provides diagnostics and graphic tools for the evaluation of small area estimators.
- mind provides multivariate prediction and inference (mean square error) for domains using mixed linear models as proposed in Datta, Day, and Basawa (1999, J. Stat. Plan. Inference).
- JoSAE provides point and variance estimation at the domain level for the generalized regression (GREG) estimator and a unit-level empirical best linear unbiased prediction (EBLUP) estimator. It basically provides wrapper functions to the nlme package that is used to fit the basic random effects models.
Microsimulation and Synthetic Data
- simPop produces synthetic population data, sometimes needed as a starting population for microsimulations.
- sms provides facilities to simulate micro-data from given area-based macro-data. Simulated annealing is used to best satisfy the available description of an area. For computational issues, the calculations can be run in parallel mode.
- saeSim implements tools for the simulation of data in the context of small area estimation.
- SimSurvey simulates age-structured spatio-temporal populations given built-in or user-defined sampling protocols.
Indices, Indicators, Tables and Visualization of Indicators
- laeken provides functions to estimate popular risk-of-poverty and inequality indicators (at-risk-of-poverty rate, quintile share ratio, relative median risk-of-poverty gap, Gini coefficient). In addition, standard and robust methods for tail modeling of Pareto distributions are provided for semi-parametric estimation of indicators from continuous univariate distributions such as income variables.
- convey estimates variances on indicators of income concentration and poverty using familiar linearized and replication-based designs created by the survey package such as the Gini coefficient, Atkinson index, at-risk-of-poverty threshold, and more than a dozen others.
- ineq computes various inequality measures (Gini, Theil, entropy, among others), concentration measures (Herfindahl, Rosenbluth), and poverty measures (Watts, Sen, SST, and Foster). It also computes and draws empirical and theoretical Lorenz curves as well as Pen’s parade. It is not designed to deal with sampling weights directly (these could only be emulated via
- wINEQ fills the gap of ineq and allows for sampling weights directly. It contains various inequality measures such as Gini, Theil, Leti index, Palma ratio, 20:20 ratio, Allison and Foster index, Jenkins index, Cowell and Flachaire index.
- DHS.rates estimates key indicators (especially fertility rates) and their variances for the Demographic and Health Survey (DHS) data.
- micEconIndex implements functions to compute price indices (of type Paasche, Fisher and Laspeyres); see priceIndex(). For estimating quantities (of goods, for example) see function
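The three price index types named for micEconIndex follow directly from their textbook definitions, which can be written out in base R as below. The function name `price_indices` and the prices and quantities are invented; this sketch is not the interface of priceIndex().

```r
# Laspeyres, Paasche and Fisher price indices (base R only, invented data).
# p0, q0: base-period prices and quantities; p1, q1: current-period values.
price_indices <- function(p0, p1, q0, q1) {
  stopifnot(length(p0) == length(p1), length(p0) == length(q0),
            length(p0) == length(q1))
  laspeyres <- sum(p1 * q0) / sum(p0 * q0)  # base-period quantities as weights
  paasche   <- sum(p1 * q1) / sum(p0 * q1)  # current-period quantities as weights
  fisher    <- sqrt(laspeyres * paasche)    # geometric mean of the two
  c(laspeyres = laspeyres, paasche = paasche, fisher = fisher)
}

p0 <- c(2, 5);  p1 <- c(3, 5)   # prices of two goods in both periods
q0 <- c(10, 4); q1 <- c(8, 6)   # quantities of two goods in both periods

idx <- price_indices(p0, p1, q0, q1)
print(round(idx, 4))
```

Because consumers in this example shift away from the good whose price rose, the Paasche index lies below the Laspeyres index, with the Fisher index in between.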
Miscellaneous
- samplingbook includes sampling procedures from the book ‘Stichproben. Methoden und praktische Umsetzung mit R’ by Goeran Kauermann and Helmut Kuechenhoff (2010).
- SDaA is designed to reproduce results from Lohr, S. (1999) ‘Sampling: Design and Analysis’, Duxbury, and includes the data sets from this book.
- samplingVarEst implements jackknife methods for variance estimation under unequal probability sampling with one- or two-stage designs.
- memisc includes tools for the management of survey data, graphics and simulation.
- anesrake provides a comprehensive system for selecting variables and weighting data to match the specifications of the American National Election Studies.
- spsurvey includes facilities for spatial survey design and analysis for equal and unequal probability (stratified) sampling.
- FFD provides functions to calculate optimal sample sizes of a population of animals living in herds for surveys to substantiate freedom from disease. The criteria for estimating the sample sizes take the herd-level clustering of diseases as well as imperfect diagnostic tests into account and select the samples based on a two-stage design. Inclusion probabilities are not considered in the estimation. The package provides a graphical user interface as well.
- mipfp provides multidimensional iterative proportional fitting to calibrate n-dimensional arrays given target marginal tables.
- MBHdesign provides spatially balanced designs from a set of (contiguous) potential sampling locations in a study region.
- quantification provides different functions for quantifying qualitative survey data. It supports the Carlson-Parkin method, the regression approach, the balance approach and the conditional expectations method.
- surveybootstrap includes tools for using different kinds of bootstrap for estimating sampling variation using complex survey data.
- RRreg implements univariate and multivariate analysis (correlation, linear, and logistic regression) for several variants of the randomized response technique, a survey method for eliminating response biases due to social desirability.
- RRTCS includes randomized response techniques for complex surveys.
- panelaggregation aggregates business tendency survey data (and other qualitative surveys) to time series at various aggregation levels.
- rtrim implements functions to study trends and indices for monitoring data. It provides tools for estimating animal/plant populations based on site counts, including occurrence of missing data.
- rjstat reads and writes data sets in the JSON-stat format.
- diffpriv implements the perturbation of statistics with differential privacy.
- easySdcTable provides a graphical interface to a small selection of functionality of package sdcTable.
- MicSim includes methods for microsimulations. Given an initial population, transition rates (mortality, divorce, marriage, education changes, etc.) and their transition matrix can be defined and included for the simulation of future states of the population. The package does not contain compiled code, but functionality to run the microsimulation in parallel is provided.
|Core:||errorlocate, sae, sampling, SamplingStrata, sdcMicro, sdcTable, simPop, survey, surveysd, validate, validatetools, VIM.|
|Regular:||acs, anesrake, BalancedSampling, BayesSAE, BIFIEsurvey, blaise, CalibrateSSB, cancensus, CANSIM2R, cbsodataR, cdlTools, censusapi, collapse, convey, csodata, deducorrect, deductive, DHS.rates, diffpriv, DSI, easySdcTable, editrules, EdSurvey, emdi, eurostat, extremevalues, FAOSTAT, fastLink, FFD, Frames2, fuzzyjoin, GaussSuppression, gustave, hbsae, icarus, idbr, inca, inegiR, ineq, insee, iotables, ipumsr, JoSAE, laeken, lavaan, lavaan.survey, longCatEDA, MatchIt, MatchThem, MBHdesign, memisc, micEconIndex, MicSim, mind, mipfp, nlme, nomisr, npi, OECD, panelaggregation, PPRL, pps, PracTools, prevR, pxweb, PxWebApiData, quantification, questionr, R2BEAT, rdbnomics, rdhs, readabs, readsdmx, reclin, RecordLinkage, regions, Rilostat, rjstat, robsurvey, rpms, RRreg, RRTCS, rsae, rsdmx, rspa, rtrim, rworldmap, saeSim, SAEval, samplingbook, samplingVarEst, SDaA, sdcHierarchies, sdcSpatial, simputation, SimSurvey, SmallCountRounding, sms, sorvi, spsurvey, srvyr, statcanR, statcodelists, StatMatch, stratification, stringdist, surveybootstrap, surveydata, surveyoutliers, surveyplanning, svrep, synthpop, tidyBdE, tidycensus, tidyqwi, tmap, treemap, univOutl, vardpoor, weights, wINEQ, XBRL.|