Date Published: January 30, 2017
Publisher: Public Library of Science
Author(s): Bartosz Hawelka, Izabela Sitko, Pavlos Kazakopoulos, Euro Beinat, Renaud Lambiotte.
We present and test a sequential learning algorithm for the prediction of human mobility that leverages large datasets of sequences to improve prediction accuracy, in particular for users with a short and non-repetitive data history such as tourists in a foreign country. The algorithm compensates for the difficulty of predicting the next location when there is limited evidence of past behavior by leveraging the availability of sequences of other users in the same system that provide redundant records of typical behavioral patterns. We test the method on a dataset of 10 million roaming mobile phone users in a European country. The average prediction accuracy is significantly higher than that of individual sequence prediction algorithms, primarily constant order Markov models derived from the user’s own data, that have been shown to achieve high accuracy in previous studies of human mobility. The proposed algorithm is generally applicable to improve any sequential prediction when there is a sufficiently rich and diverse dataset of sequences.
The problem of algorithmic prediction of human mobility has received significant attention in the literature in recent years, for its potential applications and its inherent theoretical value. The problem is posed as asking for a prediction of the short-term future location of an individual, given his or hers previous locations and possibly other side information. One can distinguish the algorithms that have been studied in the mobility prediction literature in two broad classes. In the first class we include algorithms such as Markov models that use only the single user’s past locations, without any other information, to estimate the next location. This individual sequence prediction is closely related to lossless compression of sequential data [1–3]. The main feature of this class of algorithms is the reliance on a single data source, which is a major practical advantage. The main disadvantage, however, is the cold start. When an agent is added to the system, a prediction algorithm utilizing only the past data of this agent to produce predictions faces the problem of having to wait enough time until a statistically significant amount of data has been accumulated, before reliable predictions can be produced. Furthermore, when the sequence of states of an agent is not stationary (the patterns of states of the agent cannot be modeled by a probability distribution function immutable in time), then a prediction algorithm based on sequentially approximating a single probability distribution will fail. This is the case for instance, when the sequences of data are short (agents are part of the system for a short amount of time, such as tourists in a foreign country). The second class comprises methods that take advantage of data beyond the user’s own past locations to improve prediction quality, mitigate the cold start and stationarity problems. This can include, for instance, prior knowledge of velocity (the user travels by car or walks) or information about the likely locations of interest for the user (e.g. the network of connections of the user’s social interactions [4–6]). The main disadvantages are of practical nature, as there are many cases in which useful secondary information is not available or is insufficient.
We have shown that a large number of individual sequence prediction algorithms derived from the mobility traces of mobile phone users can be combined by an Exponential Weights forecaster to provide accurate next-hour location predictions for individual users. Using a dataset of mobility traces, we have demonstrated the potential of the method to predict short trips of transient populations such as tourists, which, in general, are non-stationary. The method can easily be implemented for CDRs, even when the data is highly incomplete, with many gaps in time. It outperforms the Markov model standard for individual sequence human mobility prediction while requiring only time stamped locations at the input.