# Random words, longest increasing subsequences, and quantum PCA

Suppose you have access to iid samples from an unknown probability distribution $p = (p_1, …, p_d)$ on $[d]$, and you want to learn or test something about it. For example, if one wants to estimate $p$ itself, then the empirical distribution will suffice when the number of samples, $n$, is $O(d/epsilon^2)$. In general, you can ask many more specific questions about $p$---Is it close to some known distribution $q$? Does it have high entropy? etc. etc.----and for many of these questions the optimal sample complexity has only been determined over the last $10$ years in the computer science literature.

The natural quantum version of these problems involves being given samples of an unknown $d$-dimensional quantum mixed state $\rho$, which is a $d \times d$ PSD matrix with trace $1$. Many questions from learning and testing probability carry over naturally to this setting. In this talk, we will focus on the most basic of these questions: how many samples of $\rho$ are necessary to produce a good approximation of it? Our main result is an algorithm for learning $\rho$ with optimal sample complexity. Furthermore, in the case when $\rho$ is almost low-rank, we show how to perform PCA on it with optimal sample complexity.

Surprisingly, we are able to reduce the analysis of our algorithm to questions dealing with the combinatorics of longest increasing subsequences (LISes) in random words. In particular, the main technical question we have to solve is this: given a random word $w \in [d]^n$, where each letter $w_i$ is drawn iid from some distribution $p$, what do we expect $\mathrm{LIS}(w)$ to be? Answering this question requires diversions into the RSK algorithm, representation theory of the symmetric group, the theory of symmetric polynomials, and many other interesting areas.

This is joint work with Ryan O'Donnell.