pph R Documentation

Differentially Private Projected Histograms

Description

The pph package provides a way to create training data for classifiers in a differentially private manner.

pdata is a differentially private procedure that first projects data onto a subset of dimensions conditioned on a particular target dimension, and subsequently creates data by rounding a noisy truncated histogram of the projected data.

defk is used to estimate the number of covariate dimensions needed.

gamma is used to distribute the privacy budget between the two tasks of projecting and generating a noisy histogram.

kmda is used to compute a size k projection onto predictor columns for a target attribute in a differentially private manner.

to01 is a convenience function that scales numeric columns of a data frame into the unit interval.

Usage

pdata(data, target = ncol(data), eps = 1, A = getOption('pph.A', 0.5),
      k = defk(data, target = target), gamma. = gamma(data, k = k),
      jitter = NULL, histogram.out=FALSE, verbose=FALSE)
defk(data, tau=0.1, target=ncol(data))
gamma(data, p=0.95, tau=0.05, B=getOption('pph.B', 0.5), k = NULL)
kmda(object, ...)
to01(data, warn=FALSE)

## S3 method for class 'numeric'
kmda(object, m, k, epsilon=1, verbose=FALSE)
## S3 method for class 'data.frame'
kmda(object, target=ncol(object), ...)
## S3 method for class 'matrix'
kmda(object, target=ncol(object), ...)
## S3 method for class 'formula'
kmda(object, data=NULL, ...)

Arguments

data

A data frame containing data intended for building a classifier for a target attribute. In kmda.formula it is the data frame for which we seek a set of size k of predictor columns.

target

The index of the target attribute in data (or object in kmda.data.frame and kmda.matrix).

eps

The privacy level.

epsilon

The privacy level for kmda. If this is set to NA, kmda will run in non-private mode.

A

A tuning parameter used to filter the histogram by removing entries that are less than or equal to A*log(nrow(data))/eps.
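
As a rough illustration of this threshold, the following base R sketch computes A*log(nrow(data))/eps for the built-in iris data using the default values from Usage, and applies it to a made-up vector of noisy counts. The names threshold, noisy_counts and kept are hypothetical and not part of pph; how pdata applies the filter internally is not shown here.

```r
# Filtering threshold with the defaults A = 0.5 and eps = 1 from Usage.
A <- 0.5
eps <- 1
n <- nrow(iris)                 # 150 rows in the built-in iris data
threshold <- A * log(n) / eps   # log() is the natural logarithm

# Hypothetical noisy histogram counts, for illustration only.
noisy_counts <- c(a = 0.7, b = 2.4, c = 3.1, d = 12.9)

# Entries less than or equal to the threshold are removed.
kept <- noisy_counts[noisy_counts > threshold]
print(threshold)
print(kept)
```

A larger A or a smaller eps raises the threshold, so fewer histogram entries survive.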

k

The number of predictor (covariate, independent) dimensions. If NULL in gamma then defk will be called to provide a value.

tau

1 - tau is the fraction of target pairs discerned in the estimation of the needed k in defk.

p

The target probability that an original data point makes it into the truncated histogram.

B

A tuning parameter for the importance of projection in the computations in gamma.

gamma.

Distributes the privacy budget eps between computing which k dimensions to project onto, and building the noisy histogram. The former gets (1 - gamma.)*eps, while the latter gets gamma.*eps.
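
The split can be sketched in base R as follows. The value 0.75 is a hypothetical choice made only for illustration (by default pdata obtains gamma. from the gamma function), and eps_projection and eps_histogram are illustrative names, not pph objects.

```r
eps <- 1          # total privacy budget
gamma. <- 0.75    # hypothetical value; normally computed by gamma()

eps_projection <- (1 - gamma.) * eps  # budget for choosing the k dimensions
eps_histogram  <- gamma. * eps        # budget for the noisy histogram

# The two parts add up to the total budget.
stopifnot(isTRUE(all.equal(eps_projection + eps_histogram, eps)))
```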

jitter

If non-NULL, this adds small numeric noise to numeric values to smooth out the effects of discretization by calling base::jitter with a factor argument taking the value of jitter. The noise added is uniform in the range [-a, a] where a = jitter * d/5, where d is the smallest difference between two different discretized values.
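
The noise range described above can be checked directly with base::jitter. This is a sketch on a made-up vector x of already discretized values; f, d and a are illustrative names, and the small tolerance only guards against floating point rounding.

```r
set.seed(1)

# Hypothetical discretized values with smallest spacing d = 0.2.
x <- c(0.1, 0.3, 0.5, 0.3, 0.1)
d <- min(diff(sort(unique(x))))

f <- 1            # plays the role of the jitter argument of pdata
a <- f * d / 5    # half-width of the uniform noise, here 0.04

xj <- jitter(x, factor = f)

# Every jittered value stays within [x - a, x + a].
within_range <- all(abs(xj - x) <= a + 1e-12)
print(within_range)
```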

histogram.out

If TRUE, pdata returns a hash object representing the noisy truncated histogram. If FALSE, a data frame created from the histogram is returned.

verbose

print information about progress and parameters if TRUE.

object

In kmda.numeric this is the index of the target column in m. In kmda.matrix and kmda.data.frame it is a matrix or data frame, respectively. In kmda.formula it is a formula that describes the columns among which predictor columns for the target attribute should be sought.

m

A matrix containing the attribute columns. object is the index of the target column in m for which k predictor columns are sought.

...

Parameters to be passed to kmda.numeric from the other kmda methods.

warn

warn if columns were scaled.

Details

Columns in the data frame data are expected to be either numeric or factors. A numeric column in data must contain only values from the unit interval in order for pdata to be differentially private. For convenience, pdata will scale numeric columns that fall outside the unit interval into the unit interval and issue a warning. Numeric data is discretized into bins of equal width. This width is computed as a function of the size and dimensionality of the data. A minimum width can be set through the option pph.minbw using options.
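
To make the two preprocessing steps concrete, here is a minimal base R sketch. It assumes min-max scaling and a fixed bin width of 0.25; neither is confirmed to be what to01 or pdata do internally, and scale01, bw and bins are hypothetical names that are not part of pph.

```r
# Min-max scaling of a numeric vector into [0, 1] (an assumed scheme).
scale01 <- function(x) {
  r <- range(x, na.rm = TRUE)
  if (r[1] == r[2]) return(rep(0, length(x)))  # constant column
  (x - r[1]) / (r[2] - r[1])
}

x <- scale01(iris$Sepal.Length)

# Equal-width binning with a hypothetical bin width of 0.25.
bw <- 0.25
breaks <- seq(0, 1, by = bw)
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Counts per bin, i.e. a one-dimensional (non-private) histogram.
print(table(bins))
```

Note that scaling by the observed minimum and maximum itself depends on the data, which is why the Warning section asks for numeric data that is already constrained to the unit interval.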

Value

pdata returns a data frame, or a hash object representing the noisy truncated histogram if histogram.out is TRUE.

kmda.formula returns a formula representation of the projection. The other kmda methods return a list containing two items: S, a vector of column indices into object or m of the computed predictors, and lo, a measure of pairs not discerned.

Warning

pdata is only differentially private if numeric data is constrained to the unit interval.

In kmda, the lo component is not produced in a differentially private manner.

The code currently applies exhaustive exploration of possible histogram entries as opposed to the more efficient sampling method presented in the reference on which the method is based.

Note

For now, this package can be installed by issuing install.packages('pph', repos='http://laats.github.io/sw/R').

This implementation was in part supported by NIH NLM grant 7R01LM007273-07 and NIH Roadmap for Medical Research grant U54 HL108460.

Author(s)

Staal A. Vinterbo <sav@ucsd.edu>

References

S. A. Vinterbo. Differentially Private Projected Histograms: Construction and Use for Prediction. Proc. ECML-PKDD 2012, to appear.

See Also

See also hash, options.

Examples

  data(iris)
  # scale numeric covariates into the unit interval
  iris <- to01(iris)

  # Differentially private logistic regression:
  model <- glm(I(Species == 'virginica') ~ ., binomial, pdata(iris))
  summary(model)
  p <- predict(model, iris, type='response')

  ## show results:
  boxplot(p ~ s, data.frame(p=p, s=iris$Species), ylab='P(virginica|x)',
          xlab='Actual Class')

  # Differentially private multinomial logistic regression
  # (not run due to nnet dependency)
  ## Not run: 
    library(nnet) # load multinom
    data(iris)
    iris <- to01(iris)
    model <- multinom(Species ~ ., data=pdata(iris))
    p <- predict(model, iris)
  
## End(Not run)
  # compute a projection
  m <- data.frame(matrix(sample(0:1, 100, replace=TRUE), ncol=5))
  pr <- kmda(m, k=3)
  pr <- kmda(m, target=5, k=3)
  pr <- kmda(X5 ~ ., m, k=3)
  # the above are all equivalent, except that the last one
  # returns a formula instead of a list.