# refset - subsets with reference semantics

## The skinny

### Installation

# stable version from CRAN:
install.packages("refset")
# development version from github:
library(devtools)
install_github("hughjonesd/refset")

### Creating a refset

library(refset)
employees <- data.frame(
  id=1:4,
  name=c("James", "Sylvia", "Meng Qi", "Luis"),
  age=c(28, 44, 38, 23),
  gender=factor(c("M", "F", "F", "M")),
  stringsAsFactors=FALSE)

refset(rs, employees[1:2,])

### Refsets refer to the original

rs
##   id   name age gender
## 1  1  James  28      M
## 2  2 Sylvia  44      F
employees$age
## [1] 29 45 38 23

### You can have refsets of refsets

refset(rs2, rs$id)
rs2
## [1] 1 2
rs$id <- rs$id + 1000
rs2
## [1] 1001 1002
rs2 <- 101:102
employees$id
## [1] 101 102   3   4

### Refset size can change dynamically

# the multi-argument form. Note the empty argument, to select all columns:
refset(rsd, employees, age < 30, , drop=FALSE)
rsd
##    id  name age gender
## 1 101 Jimmy  29      M
## 4   4  Luis  23      M
employees$age <- employees$age + 1
rsd
##   id name age gender
## 4  4 Luis  24      M

### You can refset any subsettable object…

vec <- 1:10
refset(rs, vec, 4:6)
rs <- rs*10
vec
##  [1]  1  2  3 40 50 60  7  8  9 10

### … using any form of subsetting

lst <- list(a="text", b=42, NA)
refset(rsl, lst$b)
rsl <- "more text"
lst$b
## [1] "more text"

### The short form

rs %r% employees[1:3,] # equivalent to refset(rs, employees[1:3,])

### To pass a refset into a function, use wrapset to create a parcel:

f <- function(x) {
  cx <- contents(x)
  contents(x)$name <- paste(cx$name, "the", sample(c("Kid", "Terrible", "Silent",
    "Fair"), nrow(cx), replace=TRUE))
}
parcel <- wrapset(employees[])
f(parcel)
employees
##    id                 name age gender
## 1 101     Jimmy the Silent  30      M
## 2 102       Sylvia the Kid  46      F
## 3   3 Meng Qi the Terrible  39      F
## 4   4         Luis the Kid  24      M

## Introduction

Normally, R uses “pass by value”. This means that when you run

b <- a

you have two independent copies of the same data. Similarly, the code:

f <- function(x) {x <- x*2}
a <- 4
f(a)
a
## [1] 4

does not change the value of a, since the function f gets passed the contents of a rather than the variable a itself.

This is fine for most cases, especially for traditional uses of R in which the programmer or statistician passes in a value to a function, and sees the result on the command line. However, in some cases we would like to work with a single object, rather than multiple copies. For example:

- when working on a complex dataset, an analyst may wish to work with part of the dataset, but to have any changes reflected in the whole data frame.
- if a data frame represents objects in a relational database, changes to the database on disk should be reflected in the data frame.
- for large datasets, assigning into multiple copies can take up memory.

The refset package allows you to do this, by creating objects that refer to other objects, or subsets of them.

To create a refset, call refset with two arguments:

dfr <- data.frame(x1=1:5, x2=rnorm(5), alpha=letters[1:5])
refset(rs, dfr[dfr$x1 <= 3, c("x1", "alpha")])

The call above creates a new variable rs in your environment. (Strictly, it creates a new binding, but we needn’t worry about that for now.) For comparison, we’ll also create a standard subset.

ss <- dfr[dfr$x1 <= 3, c("x1", "alpha")]
rs
##   x1 alpha
## 1  1     a
## 2  2     b
## 3  3     c
ss
##   x1 alpha
## 1  1     a
## 2  2     b
## 3  3     c

rs and ss look and behave just the same:

c(class(rs), class(ss))
## [1] "data.frame" "data.frame"
c(mean(rs$x1), mean(ss$x1))
## [1] 2 2

To see the difference, let’s change the data in dfr:

dfr$alpha <- c(NA, letters[23:26])
rs
##   x1 alpha
## 1  1  <NA>
## 2  2     w
## 3  3     x
ss
##   x1 alpha
## 1  1     a
## 2  2     b
## 3  3     c

As is normal, ss has not updated to reflect changes in the original data frame. But rs has.

The connection also works the other way, if you change rs.

rs$alpha <- LETTERS[1:3]
rs
##   x1 alpha
## 1  1     A
## 2  2     B
## 3  3     C
dfr
##   x1         x2 alpha
## 1  1 -1.4065121     A
## 2  2  0.9061236     B
## 3  3 -1.4289599     C
## 4  4 -0.1141033     y
## 5  5  1.2529014     z

Everything that you do to rs will be reflected in the original data, and vice versa. Well, almost everything: remember that rs refers to a subset of the data. If you can’t do it to a subset, you probably can’t do it to a refset. For example, changing the names of a refset doesn’t work, because assigning to the names of a subset of your data doesn’t change the original names.

## Ways to call refset

There are three ways to create a refset. The first you have already seen: call refset(name, data[indices]) where name is the name of the variable you want to create, and data[indices] is the subset you want to look at.

You aren’t limited to using data frames. You can refset any object which you can subset, and you can use any of the three standard ways to subset data: $, [[ and [.

vec <- 1:10
refset(rvec, vec[2:3])
mylist <- list(a="some", b="more", c="data")
refset(rls, mylist$b)
refset(rls2, mylist[["c"]])
rvec
## [1] 2 3
c(rls, rls2)
## [1] "more" "data"

However, this won’t work:

myss <- subset(dfr, x1>1)
refset(rs, myss)
## Error in substitute(data)[[1]]: object of type 'symbol' is not subsettable

You have to specifically write out the subset you want: you can’t put it in a variable.

The second way to call refset is using the %r% infix operator. This is conveniently short, and also makes it clearer that you are assigning to a variable.

top4 %r% dfr[1:4,]
exists("top4")
## [1] TRUE

The last way to create a refset is the 3-or-more argument form of the function. This works like the subset command in base R: you can refer to data frame columns by name directly.

refset(large, dfr, x2 > 0,)
large
##   x1        x2 alpha
## 2  2 0.9061236     B
## 5  5 1.2529014     z

Notice that we’ve included an empty argument. This is just the same as when you call dfr[dfr$x2 > 0, ] with an empty argument after the comma: it includes all the columns.

## Dynamic indexing

Refsets don’t just sync their data with their “parent”. They also update their indices dynamically. For example, suppose we have a database of employees, including hours worked in the past month.

employees <- data.frame(
  id=1:4,
  name=c("James", "Sylvia", "Meng Qi", "Luis"),
  age=c(28, 44, 38, 23),
  gender=factor(c("M", "F", "F", "M")),
  hours=c(160, 130, 185, 145),
  pay=c(60000, 50000, 70000, 60000),
  stringsAsFactors=FALSE)

We can create a refset of employees who worked overtime:

overtimers %r% employees[employees$hours > 140,]
overtimers
##   id    name age gender hours   pay
## 1  1   James  28      M   160 60000
## 3  3 Meng Qi  38      F   185 70000
## 4  4    Luis  23      M   145 60000

When the new monthly data comes in, the set of people in overtimers will change:

employees$hours <- c(135, 150, 70, 145)
overtimers
##   id   name age gender hours   pay
## 2  2 Sylvia  44      F   150 50000
## 4  4   Luis  23      M   145 60000

Sometimes you may wish to turn this behaviour off. For example, you may want to look at a particular subset that had a certain characteristic at a point in time. For this, use the argument dyn.idx=FALSE to refset.

# people who worked long hours last month:
refset(overtimers_static, employees, hours > 140, , dyn.idx=FALSE)
# give them a holiday...
overtimers_static$hours <- 0
# ... and a pay rise
overtimers_static$pay <- overtimers_static$pay * 1.1
overtimers_static
##   id   name age gender hours   pay
## 2  2 Sylvia  44      F     0 55000
## 4  4   Luis  23      M     0 66000

Without the dyn.idx=FALSE argument, the refset would have zero rows after the call setting hours to 0.

## Delinking from the parent, and using parcels

If you want to break the link to the parent dataset, simply assign your refset to a new variable.

copy <- overtimers
copy$pay <- copy$pay * 2
employees$pay # still the same :/
## [1] 60000 55000 70000 66000

Refsets are implemented using an R feature called “active binding”, which calls a function when you access or change a variable. Reassigning to a new variable reassigns the contents, rather than the binding.

This causes a problem if you want to pass a reference into functions, rather than passing the value of the refset – for example, if you would like to change the refset in the body of the function, and have this affect the original data. When you use a refset in a function argument, it binds it to a new value, breaking the link with the parent.
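To make the mechanism concrete, here is a minimal sketch of an active binding in base R, using only `makeActiveBinding`. It is not refset's actual implementation. The names `view`, `idx` and `bump` are made up for illustration. It shows a binding that reads from and writes to part of a parent vector, and how copying it or passing it to a function yields a plain value that is no longer linked.

```r
# Illustration only: a hand-rolled "reference to a subset" in base R.
# `view`, `idx` and `bump` are hypothetical names, not part of refset.
vec <- 1:10
idx <- 4:6

makeActiveBinding("view", function(value) {
  if (missing(value)) {
    vec[idx]              # read: recompute the subset from the parent
  } else {
    vec[idx] <<- value    # write: assign back into the parent
    invisible(value)
  }
}, environment())

view <- view * 10         # goes through the binding, so vec changes
vec[4:6]                  # now 40 50 60

copy <- view              # copies the current value, not the binding
copy[1] <- 0
vec[4]                    # still 40: copy is delinked

bump <- function(x) { x[1] <- 0; x }
bumped <- bump(view)      # the function receives a value, not the binding
vec[4]                    # still 40
```

Only assignments made directly to `view` reach `vec`. This is the behaviour that parcels are designed to work around.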

If you are writing your own code, you can avoid this problem by creating a refset which is “wrapped” in a parcel object. Parcels simply contain an expression and an environment in which the expression should be evaluated. For example, they can contain the name of a refset. When the contents function is called on a parcel, the expression is reevaluated. Here’s how to write a function that changes the name of our employees:

rs %r% employees[1:3,]
f <- function(x) {
  cx <- contents(x)
  contents(x)$name <- paste(cx$name, "the", sample(c("Kid", "Terrible", "Silent",
    "Fair"), nrow(cx), replace=TRUE))
}

parcel <- wrapset(employees[])
f(parcel)
employees
##   id             name age gender hours   pay
## 1  1 James the Silent  28      M   135 60000
## 2  2   Sylvia the Kid  44      F     0 55000
## 3  3  Meng Qi the Kid  38      F    70 70000
## 4  4  Luis the Silent  23      M     0 66000

As the above shows, you can assign to contents(parcel) as well as read from it. You can also create a new variable from the parcel by using unwrap_as. Another way to write the function above would be:

f <- function(parcel) {
  unwrap_as(emps, parcel)
  emps$name <- paste(emps$name, "the", sample(c("Kid", "Terrible", "Silent",
    "Fair"), nrow(emps), replace=TRUE))
}
f(parcel)
employees
##   id                         name age gender hours   pay
## 1  1     James the Silent the Kid  28      M   135 60000
## 2  2  Sylvia the Kid the Terrible  44      F     0 55000
## 3  3   Meng Qi the Kid the Silent  38      F    70 70000
## 4  4 Luis the Silent the Terrible  23      M     0 66000

Using parcels is a way to pass references around code. You could also do this using non-standard evaluation (NSE). Parcels have the nice feature that they store the environment where they should be evaluated.
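The idea of storing an expression together with its environment can be sketched in a few lines of base R. This is a rough illustration under assumptions, not how refset implements parcels: `make_parcel`, `parcel_get` and `parcel_set` are hypothetical helpers standing in for `wrapset` and `contents`.

```r
# Illustration only: a parcel-like object is an unevaluated expression
# plus the environment to evaluate it in. All helper names are hypothetical.
make_parcel <- function(expr, env = parent.frame()) {
  force(env)                          # pin down the caller's environment
  list(expr = substitute(expr), env = env)
}

# Re-evaluate the stored expression each time the parcel is read.
parcel_get <- function(p) eval(p$expr, p$env)

# Build the call `<expr> <- value` and run it where the parcel was made.
parcel_set <- function(p, value) {
  eval(call("<-", p$expr, value), p$env)
  invisible(NULL)
}

scores <- c(a = 1, b = 2, c = 3)
p <- make_parcel(scores[c("a", "b")])

# A function can now modify the original data through the parcel.
double_it <- function(p) parcel_set(p, parcel_get(p) * 2)
double_it(p)
scores    # a = 2, b = 4, c = 3
```

Because the environment travels with the expression, `double_it` does not need to know where `scores` lives. A plain NSE approach would have to work that out at each call site.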