Publications and Preprints

E. Aiken, G. Bedoya, A. Coville, and J. E. Blumenstock, “Targeting Development Aid with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan,” ACM SIGCAS Computing and Sustainable Societies (COMPASS '20), 2021. Publisher's VersionAbstract
Can mobile phone data improve program targeting? By combining rich survey data from a “big push” anti-poverty program in Afghanistan with detailed mobile phone logs from program beneficiaries, we study the extent to which machine learning methods can accurately differentiate ultra-poor households eligible for program benefits from ineligible households. We show that supervised learning methods leveraging mobile phone data can identify ultra-poor households nearly as accurately as survey-based measures of consumption and wealth; and that combining survey-based measures with mobile phone data produces classifications more accurate than those based on a single data source.
S. Kakade, P. Liang, V. Sharan, and G. Valiant, Prediction with a Short Memory. STOC: ArXiv Report, 2020. Publisher's VersionAbstract
We consider the problem of predicting the next observation given a sequence of past observations, and consider the extent to which accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observation together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by I, then a simple Markov model over the most recent I/ϵ observations obtains expected KL error ϵ---and hence ℓ1 error ϵ√---with respect to the optimal predictor that has access to the entire past and knows the data generating distribution. For a Hidden Markov Model with n hidden states, I is bounded by logn, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length O(logn/ϵ) windows of observations achieves this error, provided the length of the sequence is dΩ(logn/ϵ), where d is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses: First, for HMMs with n hidden states, a window length of logn/ϵ is information-theoretically necessary to achieve expected ℓ1 error ϵ√. Second, the dΘ(logn/ϵ) samples required to estimate the Markov model for an observation alphabet of size d is necessary for any computationally tractable learning algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
W. Kong, R. Somani, Z. Song, S. Kakade, and S. Oh, Robust Meta-learning for Mixed Linear Regression with Small Batches. NeurIPS: ArXiv Report, 2020. Publisher's VersionAbstract
A common challenge faced in practical supervised learning, such as medical image processing and robotic interactions, is that there are plenty of tasks but each task cannot afford to collect enough labeled examples to be learned in isolation. However, by exploiting the similarities across those tasks, one can hope to overcome such data scarcity. Under a canonical scenario where each task is drawn from a mixture of k linear regressions, we study a fundamental question: can abundant small-data tasks compensate for the lack of big-data tasks? Existing second moment based approaches show that such a trade-off is efficiently achievable, with the help of medium-sized tasks with Ω(k1/2) examples each. However, this algorithm is brittle in two important scenarios. The predictions can be arbitrarily bad (i) even with only a few outliers in the dataset; or (ii) even if the medium-sized tasks are slightly smaller with o(k1/2) examples each. We introduce a spectral approach that is simultaneously robust under both scenarios. To this end, we first design a novel outlier-robust principal component analysis algorithm that achieves an optimal accuracy. This is followed by a sum-of-squares algorithm to exploit the information from higher order moments. Together, this approach is robust against outliers and achieves a graceful statistical trade-off; the lack of Ω(k1/2)-size tasks can be compensated for with smaller tasks, which can now be as small as O(logk).
W. Kong, R. Somani, Z. Song, S. Kakade, and S. Oh, Meta-learning for mixed linear regression. ICML: ArXiv Report, 2020. Publisher's VersionAbstract
In modern supervised learning, there are a large number of tasks, but many of them are associated with only a small amount of labeled data. These include data from medical image processing and robotic interaction. Even though each individual task cannot be meaningfully trained in isolation, one seeks to meta-learn across the tasks from past experiences by exploiting some similarities. We study a fundamental question of interest: When can abundant tasks with small data compensate for lack of tasks with big data? We focus on a canonical scenario where each task is drawn from a mixture of k linear regressions, and identify sufficient conditions for such a graceful exchange to hold; The total number of examples necessary with only small data tasks scales similarly as when big data tasks are available. To this end, we introduce a novel spectral approach and show that we can efficiently utilize small data tasks with the help of Ω̃ (k3/2) medium data tasks each with Ω̃ (k1/2) examples.
G. Cadamuro, R. K. Vinayak, J. Blumenstock, S. Kakade, and J. N. Shapiro, “The Illusion of Change: Correcting for Bias when Inferring Changes in Sparse, Societal-Scale Data,” The Web Conference 2019, pp. 2608–2615, 2019. Publisher's VersionAbstract
Societal-scale data is playing an increasingly prominent role in social science research; examples from research on geopolitical events include questions on how emergency events impact the diffusion of information or how new policies change patterns of social interaction. Such research often draws critical inferences from observing how an exogenous event changes meaningful metrics like network degree or network entropy. However, as we show in this work, standard estimation methodologies make systematically incorrect inferences when the event also changes the sparsity of the data. To address this issue, we provide a general framework for inferring changes in social metrics when dealing with non-stationary sparsity. We propose a plug-in correction that can be applied to any estimator, including several recently proposed procedures. Using both simulated and real data, we demonstrate that the correction significantly improves the accuracy of the estimated change under a variety of plausible data generating processes. In particular, using a large dataset of calls from Afghanistan, we show that whereas traditional methods substantially overestimate the impact of a violent event on social diversity, the plug-in correction reveals the true response to be much more modest.
R. K. Vinayak, W. Kong, G. Valiant, and S. M. Kakade, Maximum Likelihood Estimation for Learning Populations of Parameters. ArXiv Report, 2019. Publisher's VersionAbstract
Consider a setting with N independent individuals, each with an unknown parameter, pi∈[0,1] drawn from some unknown distribution P⋆. After observing the outcomes of t independent Bernoulli trials, i.e., Xi∼Binomial(t,pi) per individual, our objective is to accurately estimate P⋆. This problem arises in numerous domains, including the social sciences, psychology, health-care, and biology, where the size of the population under study is usually large while the number of observations per individual is often limited. Our main result shows that, in the regime where t≪N, the maximum likelihood estimator (MLE) is both statistically minimax optimal and efficiently computable. Precisely, for sufficiently large N, the MLE achieves the information theoretic optimal error bound of (1t) for t
Q. Huang, S. Kakade, W. Kong, and G. Valian, Recovering Structured Probability Matrices. ITCS: ArXiv Report, 2018. Publisher's VersionAbstract
We consider the problem of accurately recovering a matrix B of size M by M , which represents a probability distribution over M2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient algorithm that accurately recovers the underlying M by M matrix using Theta(M) samples. This result easily translates to Theta(M) sample algorithms for learning topic models and learning hidden Markov Models. These linear sample complexities are optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Omega(M) samples. We provide an even stronger lower bound where distinguishing whether a sequence of observations were drawn from the uniform distribution over M observations versus being generated by an HMM with two hidden states requires Omega(M) observations. This precludes sublinear-sample hypothesis tests for basic properties, such as identity or uniformity, as well as sublinear sample estimators for quantities such as the entropy rate of HMMs.
M. R. Khan and J. Blumenstock, Determinants of Mobile Money Adoption in Pakistan. 2017. Publisher's VersionAbstract
In this work, we analyze the problem of adoption of mobile money in Pakistan by using the call detail records of a major telecom company as our input. Our results highlight the fact that different sections of the society have different patterns of adoption of digital financial services but user mobility related features are the most important one when it comes to adopting and using mobile money services.
M. R. Khan and J. Blumenstock, “Predictors without Borders: Behavioral Modeling of Product Adoption in Three Developing Countries,” KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 145–154, 2016. Publisher's VersionAbstract
Billions of people around the world live without access to banks or other formal financial institutions. In the past several years, many mobile operators have launched "Mobile Money" platforms that deliver basic financial services over the mobile phone network. While many believe that these services can improve the lives of the poor, in many countries adoption of Mobile Money still remains anemic. In this paper, we develop a predictive model of Mobile Money adoption that uses billions of mobile phone communications records to understand the behavioral determinants of adoption. We describe a novel approach to feature engineering that uses a Deterministic Finite Automaton to construct thousands of behavioral metrics of phone use from a concise set of recursive rules. These features provide the foundation for a predictive model that is tested on mobile phone operators logs from Ghana, Pakistan, and Zambia, three very different developing-country contexts. The results highlight the key correlates of Mobile Money use in each country, as well as the potential for such methods to predict and drive adoption. More generally, our analysis provides insight into the extent to which homogenized supervised learning methods can generalize across geographic contexts. We find that without careful tuning, a model that performs very well in one country frequently does not generalize to another.

Related technical methods for scalability, sparsity, and time series modeling

V. Ramanujan and R. Somani, Soft Threshold Weight Reparameterization for Learnable Sparsity. ICML: ArXiv Report, 2020. Publisher's VersionAbstract
Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus of maximizing prediction accuracy given an overall parameter budget. Existing methods rely on uniform or heuristic non-uniform sparsity budgets which have sub-optimal layer-wise parameter allocation resulting in a) lower prediction accuracy or b) higher inference cost (FLOPs). This work proposes Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. STR smoothly induces sparsity while learning pruning thresholds thereby obtaining a non-uniform sparsity budget. Our method achieves state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and MobileNetV1 on ImageNet-1K), and, additionally, learns non-uniform budgets that empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy over existing results by up to 10% in the ultra sparse (99%) regime and can also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is a simple mechanism which learns effective sparsity budgets that contrast with popular heuristics. Code, pretrained models and sparsity budgets are at this https URL.
R. Kidambi, P. Netrapalli, P. Jain, and S. M. Kakade, On the insufficiency of existing momentum schemes for Stochastic Optimization. ICLR: ArXiv Report, 2018. Publisher's VersionAbstract
Momentum based stochastic gradient methods such as heavy ball (HB) and Nesterov's accelerated gradient descent (NAG) method are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, "fast gradient" methods have provable improvements over gradient descent only for the deterministic case, where the gradients are exact. In the stochastic case, the popular explanations for their wide applicability is that when these fast gradient methods are applied in the stochastic case, they partially mimic their exact gradient counterparts, resulting in some practical gain. This work provides a counterpoint to this belief by proving that there exist simple problem instances where these methods cannot outperform SGD despite the best setting of its parameters. These negative problem instances are, in an informal sense, generic; they do not look like carefully constructed pathological instances. These results suggest (along with empirical evidence) that HB or NAG's practical performance gains are a by-product of mini-batching. Furthermore, this work provides a viable (and provable) alternative, which, on the same set of problem instances, significantly improves over HB, NAG, and SGD's performance. This algorithm, referred to as Accelerated Stochastic Gradient Descent (ASGD), is a simple to implement stochastic algorithm, based on a relatively less popular variant of Nesterov's Acceleration. Extensive empirical results in this paper show that ASGD has performance gains over HB, NAG, and SGD.
P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford, Accelerating Stochastic Gradient Descent. COLT: ArXiv Report, 2018. Publisher's VersionAbstract
There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes the conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and non-convex optimization problems.
V. Sharan, S. Kakade, P. Liang, and G. Valiant., Learning Overcomplete HMMs. NIPS: ArXiv Report, 2017. Publisher's VersionAbstract
We study the problem of learning overcomplete HMMs---those that have many hidden states but a small output alphabet. Despite having significant practical importance, such HMMs are poorly understood with no known positive or negative results for efficient learning. In this paper, we present several new results---both positive and negative---which help define the boundaries between the tractable and intractable settings. Specifically, we show positive results for a large subclass of HMMs whose transition matrices are sparse, well-conditioned, and have small probability mass on short cycles. On the other hand, we show that learning is impossible given only a polynomial number of samples for HMMs with a small output alphabet and whose transition matrices are random regular graphs with large degree. We also discuss these results in the context of learning HMMs which can capture long-term dependencies.