Efficient Heuristics
Many of the data release algorithms we saw in class would not be
practical to implement on a real dataset. Several heuristics can speed
these algorithms up while preserving the same privacy guarantees, at
the cost of their worst-case utility guarantees. For example:
Hardt
and Rothblum suggest running the multiplicative weights
algorithm on a randomly selected (polynomially sized) subset of the
data universe. This gives average-case guarantees for randomly selected
databases, but how does it do on real data?
How about random projections of the database, as used by
Blum and Roth? This approach gives guarantees for sparse queries, but
what about more realistic queries? What about multiplicative weights run
on a small projection of the database?
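If you want to experiment with ideas in this direction, here is a minimal sketch of a multiplicative-weights style release run over a randomly sampled sub-universe. It is a simplified toy, not the algorithm from the paper: the per-round budget split, the restriction to counting queries, and all parameter defaults are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

def mw_release(data, universe, queries, eps=1.0, rounds=10, sample_size=1000):
    """Toy multiplicative-weights release over a randomly sampled sub-universe.

    data:     list of records (elements of `universe`)
    universe: list of all possible records (the full data universe)
    queries:  list of predicates q(record) -> {0, 1}, treated as normalized counting queries
    """
    n = len(data)
    # Heuristic step: restrict attention to a random, polynomially sized sub-universe.
    size = min(sample_size, len(universe))
    sub = [universe[i] for i in rng.choice(len(universe), size=size, replace=False)]

    # Precompute query values on the real data and on the sub-universe.
    true_ans = np.array([sum(q(x) for x in data) / n for q in queries])
    q_vals = np.array([[q(x) for x in sub] for q in queries], dtype=float)

    # Synthetic distribution over the sub-universe, initially uniform.
    dist = np.ones(len(sub)) / len(sub)
    eps_round = eps / (2 * rounds)  # naive split: half per round for selection, half for measurement

    for _ in range(rounds):
        syn_ans = q_vals @ dist
        errors = np.abs(true_ans - syn_ans)
        # Exponential mechanism: pick a badly answered query (the error score has sensitivity 1/n).
        logits = eps_round * n * errors / 2.0
        scores = np.exp(logits - logits.max())
        i = rng.choice(len(queries), p=scores / scores.sum())
        # Laplace mechanism: noisy measurement of the chosen query.
        measurement = true_ans[i] + rng.laplace(scale=1.0 / (n * eps_round))
        # Multiplicative weights update toward the noisy measurement.
        dist *= np.exp(q_vals[i] * (measurement - syn_ans[i]) / 2.0)
        dist /= dist.sum()

    return sub, dist  # a synthetic distribution that approximately answers the queries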
Privacy Preserving Machine Learning Algorithms
Kasiviswanathan,
Lee, Nissim, Raskhodnikova and Smith show that in principle,
private machine learning algorithms are essentially as powerful as
non-private machine learning algorithms (at least in the theoretical
PAC model of machine learning). These generic private learning
algorithms aren't efficient, however.
Blum,
Dwork, McSherry, and Nissim show that algorithms in the
SQ-model do have
efficient
versions that are differentially private. But what is the cost of
privacy? How does the performance of these algorithms degrade on your
favorite data set at different levels of privacy? What if you apply
composition theorems to get (epsilon, delta)-differential privacy?
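As a concrete starting point for such an experiment, one can simulate a private statistical query oracle: each SQ query asks for the expectation of a predicate with range [0, 1] over the dataset, which has sensitivity 1/n and so can be answered with Laplace noise. The sketch below is a generic illustration, not the construction from the paper; in particular, the even budget split across a declared number of queries is just the naive basic-composition choice.

import numpy as np

rng = np.random.default_rng(0)

class PrivateSQOracle:
    """Answers statistical queries q: record -> [0, 1] with Laplace noise.

    A single query E_x[q(x)] over n records has sensitivity 1/n, so Laplace noise of
    scale 1/(n * eps_q) gives eps_q-differential privacy per query; here the total
    budget is split evenly across a declared number of queries (basic composition).
    """

    def __init__(self, data, total_eps, num_queries):
        self.data = data
        self.n = len(data)
        self.eps_q = total_eps / num_queries

    def query(self, q):
        true_answer = sum(q(x) for x in self.data) / self.n
        return true_answer + rng.laplace(scale=1.0 / (self.n * self.eps_q))

# Toy usage: estimate the fraction of records whose first feature exceeds 0.5.
data = list(rng.random((1000, 5)))
oracle = PrivateSQOracle(data, total_eps=1.0, num_queries=10)
print(oracle.query(lambda x: 1.0 if x[0] > 0.5 else 0.0))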
How about private versions of more sophisticated machine learning
algorithms, such as the SVM algorithms given by
Chaudhuri, Monteleoni, and Sarwate, or by
Rubinstein, Bartlett, Huang, and Taft?
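One of the simplest mechanisms in this line of work is output perturbation: train a regularized linear classifier, then add noise calibrated to the stability that the regularization provides. The sketch below is only a rough illustration of that idea under the usual assumptions (feature vectors of norm at most 1, a 1-Lipschitz loss, L2 regularization); the exact conditions, constants, and alternatives such as objective perturbation are in the cited papers.

import numpy as np

rng = np.random.default_rng(0)

def private_linear_classifier(X, y, lam=0.1, eps=1.0, steps=500, lr=0.1):
    """Output-perturbation sketch for L2-regularized logistic regression.

    Assumes rows of X have L2 norm at most 1 and y in {-1, +1}; under those
    assumptions the trained weight vector has L2 sensitivity about 2/(n*lam),
    so noise with density proportional to exp(-eps*n*lam*||b||/2) suffices.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):  # plain gradient descent on the regularized logistic loss
        margins = y * (X @ w)
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0) + lam * w
        w -= lr * grad

    # Sample the perturbation: norm ~ Gamma(d, 2/(eps*n*lam)), direction uniform on the sphere.
    norm = rng.gamma(shape=d, scale=2.0 / (eps * n * lam))
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    return w + norm * direction

# Toy usage on synthetic data with unit-norm rows.
X = rng.normal(size=(500, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.where(X[:, 0] + 0.1 * rng.normal(size=500) > 0, 1.0, -1.0)
w_priv = private_linear_classifier(X, y)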
Distributed Differential Privacy
Most of what we saw in class concerned the
centralized model
of differential privacy, in which a trusted data curator holds (and
gets to look at) the entire private database, and computes on it in a
differentially private way. But what if the dataset is divided among
multiple curators who are mutually untrusting, and so they have to
compute by communicating differentially private messages between
themselves? What kinds of things can you do?
Kasiviswanathan,
Lee, Nissim, Raskhodnikova and Smith characterize what you
can learn in the
local
privacy model (i.e. everyone holds their own data -- we have n
databases each of size 1).
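The canonical primitive in the local model is randomized response, in which each individual randomizes their own bit before reporting it and the analyst debiases the aggregate. A minimal sketch (the parameter choices and the toy data are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)

def randomize_bit(bit, eps):
    """Each user reports their own bit truthfully with probability e^eps / (1 + e^eps)."""
    p_truth = np.exp(eps) / (1.0 + np.exp(eps))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_mean(reports, eps):
    """Unbiased estimate of the true fraction of 1s, recovered from the noisy reports."""
    p_truth = np.exp(eps) / (1.0 + np.exp(eps))
    return (np.mean(reports) - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

# Toy usage: 10,000 users, roughly 30% of whom hold a 1.
bits = (rng.random(10000) < 0.3).astype(int)
reports = [randomize_bit(b, eps=0.5) for b in bits]
print(estimate_mean(reports, eps=0.5))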
McGregor,
Mironov, Pitassi, Reingold, Talwar, and Vadhan show a
lower bound for the problem of computing the Hamming distance between
two databases in the two-party setting (i.e., there are two data
curators, each of which holds half of the database). But almost nothing
is known when the number of curators lies between 2 and n. Even in the
local privacy setting and the 2-party setting, little is known beyond
the results in the cited papers. This is a good topic for open-ended
theoretical exploration.
Privacy and Game Theory
McSherry
and Talwar first proposed designing auction mechanisms using
differentially private mechanisms (in particular, the exponential
mechanism) as a building block. The resulting mechanism, while private,
is only approximately truthful.
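For reference, the exponential mechanism selects an outcome with probability growing exponentially in a quality score, scaled by that score's sensitivity. The generic sketch below, including the toy fixed-price auction used as the score, is illustrative only; the sensitivity bound passed in is an assumption the caller has to justify.

import numpy as np

rng = np.random.default_rng(0)

def exponential_mechanism(data, outcomes, score, sensitivity, eps):
    """Select an outcome with probability proportional to exp(eps * score / (2 * sensitivity)).

    `score(data, r)` is any real-valued quality function; `sensitivity` bounds how
    much the score can change when one individual's data changes.
    """
    logits = np.array([eps * score(data, r) / (2.0 * sensitivity) for r in outcomes])
    probs = np.exp(logits - logits.max())  # stabilized softmax
    probs /= probs.sum()
    return outcomes[rng.choice(len(outcomes), p=probs)]

# Toy usage: privately pick a price for a digital-goods auction.
bids = [0.10, 0.25, 0.25, 0.40, 0.90]
prices = [0.10, 0.25, 0.40, 0.90]
revenue = lambda b, p: p * sum(1 for v in b if v >= p)   # score = revenue at price p
print(exponential_mechanism(bids, prices, revenue, sensitivity=max(prices), eps=1.0))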
Nissim,
Smorodinsky, and Tennenholtz show how to convert (in some
settings) differentially private mechanisms into
exactly truthful
mechanisms. However, in doing so, the mechanism loses its
privacy properties.
Xiao
asks how to design mechanisms that are both truthful and private, and
gives an answer in a setting in which individuals do not explicitly
model privacy in their utility functions. But what about when they do?
Similar issues arise in the question of how to sell access to private
data, studied by
Ghosh and Roth.
Privacy and Approximation Algorithms
Gupta, Ligett, McSherry, Roth, and Talwar
give algorithms for various combinatorial optimization problems that
preserve differential privacy. However, this paper only analyzes
specific algorithms based on combinatorial (greedy) methods, without
giving any kind of general theory. What about linear-programming-based
approximation algorithms (perhaps solved approximately using a
multiplicative weights method)?
Can any of these be made private? Is there any class of approximation
algorithms that admits a generic reduction to privacy-preserving
versions, while preserving some of their utility guarantees?
Pan-Privacy and Streaming Algorithms
Suppose we do have a trusted central database administrator.
Nevertheless, the threat of computer intrusion or government
subpoena might at some future date expose the internal records
and state of the database administrator's algorithm. Pan-private
algorithms address this problem by requiring that the internal state of
the algorithm itself be differentially private. Because this means
storing only randomized "hashes" of the data, this setting is amenable
to problems usually considered for streaming algorithms, in which
hashes are often used because of space constraints. There is some work
in this area, beginning with
Dwork, Naor, Pitassi, Rothblum, and Yekhanin, and continuing with Mir, Muthukrishnan, Nikolov, and Wright (
1 and
2).
This is a good area for exploration. What can be computed in the
pan-private setting? Does it have any relation to what can be computed
in a distributed setting?
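To get a feel for the model, here is a rough sketch in the spirit of a pan-private density (distinct-users) estimator: the internal state is one randomized bit per user whose distribution barely depends on whether that user ever appeared, and the final output receives additional Laplace noise. The constants and the output-noise scale below are loose, illustrative choices rather than the calibration from the papers.

import numpy as np

rng = np.random.default_rng(0)

def pan_private_density(stream, universe, eps):
    """Rough sketch of a pan-private estimate of the fraction of users who appear.

    The internal state is one randomized bit per user: Bernoulli(1/2) if the user
    has never been seen, redrawn as Bernoulli(1/2 + eps/4) each time they appear.
    Either way the state stays close to uniform, so a snapshot of it reveals little
    about any single user; the final output gets additional Laplace noise.
    """
    index = {x: i for i, x in enumerate(universe)}
    bits = rng.random(len(universe)) < 0.5          # state before the stream starts

    for x in stream:                                # single pass over the stream
        bits[index[x]] = rng.random() < 0.5 + eps / 4.0

    # Debias: E[mean(bits)] = 1/2 + (eps/4) * density.
    estimate = (bits.mean() - 0.5) * 4.0 / eps
    # Extra Laplace noise so the output is private too (the scale here is a rough choice).
    return estimate + rng.laplace(scale=4.0 / (eps * eps * len(universe)))

# Toy usage: roughly 300 distinct users out of a universe of 1000 appear in the stream.
universe = list(range(1000))
stream = [int(rng.integers(0, 300)) for _ in range(5000)]
print(pan_private_density(stream, universe, eps=0.5))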
Computational Complexity in Differential Privacy
This course has mostly focused on information theoretic upper and lower
bounds. But even when a problem in data privacy is information
theoretically solvable, there may be computational barriers to solving
it quickly.
Dwork, Naor, Reingold, Rothblum, and Vadhan showed that under
certain cryptographic assumptions, general release mechanisms (such as
the net mechanism) cannot be implemented in polynomial time.
Ullman and Vadhan
then extended this result to show hardness even for algorithms that
release small conjunctions (of only 2 literals!) using synthetic data as
their output representation. Their proof is a PCP-based reduction from
the synthetic data hardness result of DNRRV. Of course, this hardness result is
specific to the output representation, since we can efficiently release
the (numeric) answers to all conjunctions of size 2 using the Laplace
mechanism...
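For comparison, the efficient numeric release alluded to above is just the Laplace mechanism applied to every pairwise marginal. A sketch over binary data (positive literals only, with a deliberately naive even budget split across the queries) might look like this:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def noisy_two_way_conjunctions(X, eps):
    """Release noisy answers to all 2-literal conjunction (pairwise marginal) queries.

    X is an n-by-d binary matrix. Each query asks what fraction of rows have both
    column i and column j set, so each has sensitivity 1/n; the total budget eps is
    split evenly over all queries via basic composition (a deliberately naive choice).
    """
    n, d = X.shape
    pairs = list(combinations(range(d), 2))
    eps_per_query = eps / len(pairs)
    answers = {}
    for i, j in pairs:
        true_frac = np.mean(X[:, i] & X[:, j])
        answers[(i, j)] = true_frac + rng.laplace(scale=1.0 / (n * eps_per_query))
    return answers

# Toy usage on random binary data.
X = (rng.random((2000, 8)) < 0.3).astype(int)
print(noisy_two_way_conjunctions(X, eps=1.0)[(0, 1)])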
Privacy and Statistics
Smith studies
the convergence rates of certain statistical estimators, and gives
differentially private versions of these estimators which have the same
(optimal) convergence rates. The theorem comes with certain technical
conditions, though (e.g., the dimension of the statistic can't be too
large, epsilon can't be too small, etc.). Can you extend theorems like
this to hold with less restrictive conditions? The theorems are also
asymptotic, giving guarantees as the number of data points goes to
infinity. How do they work in practice, with (finite) samples of real
data? Compare the empirical performance of these optimal private
statistical estimators with non-private versions.
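A simple place to start such an empirical comparison is the most basic estimator of all: a private mean over data clipped to a known range, via the Laplace mechanism. The clipping bounds below are an assumption you would have to justify for real data.

import numpy as np

rng = np.random.default_rng(0)

def private_mean(x, lower, upper, eps):
    """Laplace-mechanism estimate of the mean of data known (or clipped) to lie in [lower, upper].

    Clipping bounds the sensitivity of the sample mean by (upper - lower) / n.
    """
    x = np.clip(x, lower, upper)
    n = len(x)
    return x.mean() + rng.laplace(scale=(upper - lower) / (n * eps))

# Compare to the non-private sample mean on a toy sample.
x = rng.normal(loc=2.0, scale=1.0, size=5000)
for eps in [0.1, 0.5, 1.0]:
    print(eps, x.mean(), private_mean(x, lower=-2.0, upper=6.0, eps=eps))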
Axiomatic Approaches to Privacy
Is differential privacy a good
definition? Is it too strong? One way to formalize questions like this
is to derive differential privacy as the solution to a set of axioms;
then if you wish to weaken differential privacy, you can reduce the
problem to objecting to one of the basic axioms.
Kifer and Lin begin this process, but there's plenty of room here to explore.
Private Programming Languages and Implementations
Wouldn't
it be nice if you could just write a program and be guaranteed that it
was privacy preserving, instead of having to prove a theorem every time
you come up with some algorithm? That's the idea behind differentially
private programming languages. There are now several such languages:
Pinq,
Airavat, and (here at Penn)
Fuzz. One thing you have to worry about in practice is side-channel attacks, recently studied by
Haeberlen, Pierce, and Narayan.
What are the limitations of these languages? What can you implement in
them, and what can't you? Are these really limitations, or can you get
around them with clever implementations? How might you extend these
languages, and are there other attacks you might be able to mount?
Testing Function Sensitivity
Jha and Raskhodnikova
give algorithms for testing the global sensitivity of a function, and
for reconstructing insensitive functions given only black-box access to
a (possibly) sensitive function. But global sensitivity is not the only
relevant parameter in differential privacy. For example,
Nissim, Raskhodnikova, and Smith introduce
smooth sensitivity,
which can in many cases be much lower than the global sensitivity of a
function. Can similar techniques be applied to smooth sensitivity?
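As a starting point for experiments, here is a sketch that follows the structure of the median example from the smooth sensitivity paper: data are assumed to lie in a known interval, and indices past the ends of the sorted sample are padded with the interval endpoints. Treat the indexing details as something to double-check against the paper before relying on them.

import numpy as np

def smooth_sensitivity_median(x, beta, lower, upper):
    """Smooth upper bound on the local sensitivity of the median (sorted-data sketch).

    Data are clipped to [lower, upper]; out-of-range indices in the sorted array are
    padded with the interval endpoints, following the structure of the NRS median example.
    """
    x = np.sort(np.clip(np.asarray(x, dtype=float), lower, upper))
    n = len(x)
    m = n // 2  # index of the median (n assumed odd for simplicity)

    def get(i):  # x_i, with out-of-range indices clamped to the interval endpoints
        if i < 0:
            return lower
        if i >= n:
            return upper
        return x[i]

    best = 0.0
    for k in range(n + 1):
        # Local sensitivity at distance k: widest gap achievable by moving k points.
        a_k = max(get(m + t) - get(m + t - k - 1) for t in range(k + 2))
        best = max(best, np.exp(-beta * k) * a_k)
    return best

# Toy usage: data concentrated near 0.5 inside [0, 1].
rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=0.05, size=101)
print(smooth_sensitivity_median(x, beta=0.1, lower=0.0, upper=1.0))
# Typically far below the worst-case (global) sensitivity, which here is the full interval width.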