PENN CIS 625, SPRING 2018: THEORETICAL FOUNDATIONS OF MACHINE LEARNING (aka Computational Learning Theory)
Prof. Michael Kearns
mkearns@cis.upenn.edu
Time: Mondays
12-3 PM
Location: Active Learning Classroom, 3401 Walnut St., fourth floor.
(Elevator lobby just left of the Starbucks entrance)
IMPORTANT NOTE: As per the University schedule, the first course meeting will be on Weds Jan 10 from 12-3. From then on we will meet Mondays 12-3.
URL for this page:
www.cis.upenn.edu/~mkearns/teaching/COLT
Previous incarnations of this course:
www.cis.upenn.edu/~mkearns/teaching/COLT/colt17.html
www.cis.upenn.edu/~mkearns/teaching/COLT/colt16.html
www.cis.upenn.edu/~mkearns/teaching/COLT/colt15.html (with Grigory Yaroslavtsev)
www.cis.upenn.edu/~mkearns/teaching/COLT/colt12.html (with Jake Abernethy)
www.cis.upenn.edu/~mkearns/teaching/COLT/colt08.html (with Koby Crammer)
COURSE DESCRIPTION
This course is an introduction to the theory of machine learning, which attempts to provide algorithmic, complexity-theoretic, and probabilistic foundations for modern machine learning and related topics.
The first part of the course will closely follow portions of An Introduction to Computational Learning Theory, by M. Kearns and U. Vazirani (MIT Press). We will cover perhaps 6 or 7 of the chapters in K&V over (approximately) the first half of the course, often supplementing with additional readings and materials. Copies of K&V will be available at the Penn bookstore. The second portion of the course will focus on a number of models and topics in learning theory and related areas not covered in K&V.
The course will give a broad overview of the kinds of problems and techniques typically studied in theoretical machine learning, and provide a basic arsenal of powerful mathematical tools for analyzing machine learning problems.
Topics likely to be covered include:
Additional topics we may also cover include:
COURSE FORMAT, REQUIREMENTS, AND PREREQUISITES
Much of the course will be in fairly traditional "chalk talk" lecture format, but with ample opportunity for discussion, participation, and critique. The course will meet once a week on Mondays from 12 to 3, with the first meeting on Weds Jan 10. Lunch will be served.
The course will be taught at the doctoral level and will involve advanced mathematical material, with formal proofs examined in detail throughout. While there are no specific formal prerequisites, background or courses in algorithms, complexity theory, discrete math, combinatorics, probability theory, and statistics will prove helpful, as will "mathematical maturity" in general. If you have questions about the desired background, please ask. Auditors and occasional participants are welcome.
The course requirements for registered students will be a mixture of active in-class participation, problem sets, possibly leading a class discussion, and a final project. Final projects can range from original research to a literature survey to solving additional problems.
MEETING/TOPIC SCHEDULE
Wed Jan 10
READING:
K&V Chapter 1.
Mon Jan 22
READING:
K&V Chapter 1.
Mon Jan 29
READING:
K&V Chapter 1
and Chapter 2.
PROBLEM SET #1 (due in hardcopy form in class Feb 5):
1. As carefully as you can, prove the PAC learnability of axis-aligned rectangles
in n dimensions in time polynomial in n, 1/epsilon and 1/delta. Describe
the algorithm precisely, provide as detailed a proof as you can, and
calculate the sample size needed.
For problems 2. and 3. below, you may assume that the input distribution/density D is
uniform over the unit square [0,1] x [0,1]. For extra credit, sketch how you
might generalize your results to the case of unknown and arbitrary D. You might also find
it useful to know about Chernoff bounds and related inequalities, which are discussed both
here
and in the appendix of K&V.
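For intuition about the Chernoff-Hoeffding bounds mentioned above, here is a small simulation (not part of the assignment; the parameters are illustrative) that compares the empirical probability of a large deviation of a coin's empirical bias against the Hoeffding bound 2*exp(-2*gamma^2*m):

```python
import math
import random

# Hoeffding bound: for m i.i.d. coin flips with bias p, the empirical
# frequency p_hat satisfies Pr[|p_hat - p| >= gamma] <= 2*exp(-2*gamma^2*m).
# We estimate the left-hand side by simulation and compare it to the bound.
def empirical_deviation_prob(p, m, gamma, trials=2000, seed=0):
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        p_hat = sum(rng.random() < p for _ in range(m)) / m
        if abs(p_hat - p) >= gamma:
            bad += 1
    return bad / trials

p, m, gamma = 0.5, 200, 0.1
est = empirical_deviation_prob(p, m, gamma)
bound = 2 * math.exp(-2 * gamma**2 * m)
print(est, bound)  # the empirical probability should fall below the bound
```

The bound here is about 0.037, while the true deviation probability for these parameters is much smaller, which is consistent with Hoeffding being a worst-case guarantee.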
2. Prove the PAC learnability of unions of 2 axis-aligned rectangles in the real plane in
time polynomial in 1/epsilon and 1/delta.
3. Consider the variant of the PAC model with classification noise: each time
the learner asks for a random example of the target concept c, instead
of receiving x,c(x) for x drawn from D, the learner receives x,y
where y = c(x) with probability 2/3, and y = -c(x) with probability
1/3. Here we assume that c(x) is +1 or -1, and that the noise is
independent for each example. Prove the PAC learnability of axis-aligned
rectangles in the real plane in this modified model.
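The classification-noise oracle in problem 3 can be simulated directly. A hypothetical sketch (the target rectangle and distribution below are illustrative, not prescribed by the problem):

```python
import random

def noisy_example(c, draw_x, rng, flip_prob=1/3):
    """One call to the classification-noise oracle: draw x ~ D, then
    independently negate the {+1,-1} label c(x) with probability 1/3."""
    x = draw_x(rng)
    y = c(x)
    if rng.random() < flip_prob:
        y = -y
    return x, y

# Illustrative target: the rectangle [0.2, 0.8] x [0.3, 0.7] in the plane,
# with D uniform over the unit square.
def c(p):
    x, y = p
    return 1 if 0.2 <= x <= 0.8 and 0.3 <= y <= 0.7 else -1

rng = random.Random(0)
draw = lambda r: (r.random(), r.random())
sample = [noisy_example(c, draw, rng) for _ in range(3000)]
noisy_frac = sum(y != c(x) for x, y in sample) / len(sample)
print(noisy_frac)  # close to 1/3
```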
Collaboration on the problem sets is permitted, but everyone must turn in their own independent writeup, in which you acknowledge your collaborators.
Mon Feb 5
READING:
K&V Chapters 2 and 3.
Mon Feb 12
What Cannot Be Learned With Bounded Memory
How does computational learning change when one cannot store all the examples one sees in memory? This question has seen a burst of interest in the past couple of years, leading to the surprising theorem that there exist simple concepts (parities) that require an extraordinary amount of time to learn unless one has quite a lot of memory.
In this work we show that in fact most concepts cannot be learned without sufficient memory. This subsumes the aforementioned theorem and implies similar results for other concepts of interest. The new results follow from a general combinatorial framework that we developed to prove lower bounds for space bounded learning.
Joint work with Michal Moshkovitz, Hebrew University.
READING:
K&V Chapter 3, and here is a link to a
paper
related to Dana's talk.
Mon Feb 19
Complete proof of VC-dimension based upper bound on the sample complexity
of learning in the PAC model via Sauer's Lemma and the two-sample trick;
matching lower bound; extensions to unrealizable/agnostic setting; trading
off approximation error/model complexity with estimation error via
structural risk minimization.
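For reference, the bounds discussed here take the following standard form (stated in the common asymptotic version, not verbatim from K&V): a consistent learner using a hypothesis class of VC dimension d succeeds in the realizable PAC model with sample size

```latex
% Upper bound (realizable case), via Sauer's Lemma and the two-sample trick:
m = O\!\left(\frac{1}{\epsilon}\left(d\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)
% Matching lower bound, up to the log(1/epsilon) factor:
m = \Omega\!\left(\frac{d}{\epsilon} + \frac{1}{\epsilon}\log\frac{1}{\delta}\right)
```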
READING:
K&V Chapter 3;
and here is a link to a very nice survey paper generalizing VC-style bounds
to a wide variety of other learning models:
"Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications",
D. Haussler 1992.
Mon Feb 26
Brief overview of cryptographic hardness results for PAC learning
boolean formulae, finite automata, and neural networks; boosting.
READING: K&V Chapter 4 and the following papers:
PROBLEM SET #2 (due in hardcopy form in class Mar 19):
1. Solve K&V Problem 3.2.
2. Consider the concept class of
parity functions
defined over n-bit strings x: for every subset T of the indices {1,...,n},
the function f_T(x) = 1 if and only if the number of 1s in x on just the indices in T
is odd; otherwise f_T(x) = 0. For example, if n = 6 and T = {1,2,5} then
f_T(001011) = 1 and
f_T(101011) = 0.
Give the best upper and lower bounds that you can on the
VC dimension of this class. Then give a computationally efficient algorithm for PAC learning parity functions.
(Hint: try viewing the problem from a linear algebra perspective.)
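To make the definition of f_T above concrete, here is a short evaluator (names hypothetical) that reproduces the worked example from the problem statement:

```python
def parity(x, T):
    """f_T(x): 1 iff the number of 1s in the bit string x on the
    (1-indexed) positions in T is odd, else 0."""
    return sum(int(x[i - 1]) for i in T) % 2

# The example from the problem: n = 6 and T = {1, 2, 5}.
print(parity("001011", {1, 2, 5}))  # -> 1
print(parity("101011", {1, 2, 5}))  # -> 0
```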
3. Define the
consistency dimension
of a concept class C to be the smallest d such that for any c in C, there
exists a sample S labeled by c of size at most d, and for which c is the only concept in C
consistent with S. In other words, the consistency dimension of C is the smallest d
such that every concept in C can be uniquely identified by a sample of d points. Note that
the sample S is allowed to depend on c.
Restricting attention to finite C, carefully describe and analyze
(a) a concept class C in which the consistency
dimension of C is much smaller than the VC dimension of C, and
(b) a concept class C in which the consistency
dimension of C is much larger than the VC dimension of C.
4. Solve K&V Problem 3.6, which is incorrect as stated --- you simply
need to find *some* set of labelings/concepts of size phi_d(m)
for which the VC dimension
is d.
5. Consider the problem of learning in the presence of
adversarial
errors or noise in the PAC model. In this setting, there is an error rate eta >= 0.
Each time the learning algorithm asks for an example, with probability 1-eta, it receives
a "correct" example (x,y) in which x is drawn from the target distribution D, and y = c(x),
where c is the target concept in C. But with probability eta, the algorithm receives a
pair (x,y) about which no assumptions whatsoever can be made. In particular, both x and y
can be chosen by an adversary who knows the current state of the algorithm, and is deliberately
trying to foil the algorithm. Let's call a concept class C
nontrivial
if there exist c1 and c2 in C, and inputs u, v, w, x such that
c1(u) = 1 and c2(u) = 1;
c1(v) = 1 and c2(v) = 0;
c1(w) = 0 and c2(w) = 1;
and
c1(x) = 0 and c2(x) = 0.
Show that in the adversarial noise model, PAC learning for any nontrivial C is impossible unless
eta < epsilon/(1+epsilon), where epsilon is the desired error
of the algorithm's hypothesis with respect to c and D.
(Hint: find a "bad" distribution D that allows the adversary to "confuse" c1 and c2.)
Mon Mar 5
Spring Break, no meeting.
Mon Mar 12
Boosting continued; introduction to PAC learning with classification noise.
READING: K&V Chapter 5.
The exact timing and set of topics below will depend on our progress and will be updated
as we proceed.
In the first meeting, we will go over course mechanics and present a course overview, then
immediately begin investigating the Probably Approximately Correct (PAC) model of learning.
Detailed topics covered: Learning rectangles in the real plane; definition of the PAC model;
PAC learnability of rectangles in d dimensions.
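The tightest-fit algorithm from this lecture can be sketched in a few lines (shown in 2D with an illustrative target and the uniform distribution; the d-dimensional version and the formal epsilon-delta analysis are as in K&V Chapter 1):

```python
import random

def tightest_fit(samples):
    """Learn an axis-aligned rectangle from labeled points ((x, y), label):
    return the smallest rectangle (xmin, xmax, ymin, ymax) containing all
    positive examples, or None if there are no positives."""
    pos = [p for p, label in samples if label == 1]
    if not pos:
        return None
    xs = [x for x, _ in pos]
    ys = [y for _, y in pos]
    return (min(xs), max(xs), min(ys), max(ys))

# Illustrative target rectangle [0.2, 0.8] x [0.3, 0.7], D uniform on [0,1]^2.
target = (0.2, 0.8, 0.3, 0.7)

def label(p, r):
    x, y = p
    return 1 if r[0] <= x <= r[1] and r[2] <= y <= r[3] else 0

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(500)]
h = tightest_fit([(p, label(p, target)) for p in points])
# The hypothesis is always contained in the target, so it errs only by
# false negatives in the thin strips near the target's boundary.
print(h)
```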
PAC learnability of boolean conjunctions;
intractability of PAC learning 3-term DNF.
Intractability of PAC learning 3-term DNF continued;
PAC learning 3-term DNF by 3CNF;
learning, consistency and compression;
Occam's Razor.
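Quantitatively, the finite-class form of Occam's Razor (the standard cardinality version; K&V state it more generally in terms of compression) says that consistency plus a small hypothesis class implies generalization:

```latex
% If h \in H is consistent with m i.i.d. examples of the target, and
m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right),
% then with probability at least 1 - \delta, err(h) \le \epsilon.
```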
PAC learning yields weak compression;
consistency and learning in the agnostic/unrealizable setting;
introduction to VC dimension.
To start today, we will have a special guest lecture from Prof.
Dana Moshkovitz
of UT Austin on some
very exciting recent results in the PAC model; afterwards we will continue
with the VC dimension. Dana's abstract: