Date Published: July 18, 2017
Publisher: Public Library of Science
Author(s): Sanda Martinčić-Ipšić, Edvin Močibob, Matjaž Perc, Tobias Preis.
With over 300 million active users, Twitter is among the largest online news and social networking services in existence today. Open access to information on Twitter makes it a valuable source of data for research on social interactions, sentiment analysis, content diffusion, link prediction, and the dynamics behind human collective behaviour in general. Here we use Twitter data to construct co-occurrence language networks, based either on hashtags or on all the words in tweets, and we use these networks to study link prediction with different methods and evaluation metrics. In addition to using five known methods, we propose two effective weighted similarity measures, and we compare the outcomes depending on the selected semantic context of topics on Twitter. We find that hashtag networks yield largely the same results as all-word networks, thus supporting the claim that hashtags alone robustly capture the semantic context of tweets, and as such are useful and suitable for studying content and categorization. We also introduce ranking diagrams as an efficient tool for comparing the performance of different link prediction algorithms across multiple datasets. Our research indicates that successful link prediction algorithms correctly foretell highly probable links even if the information about the network structure is incomplete, and they do so even if the semantic context is reduced to hashtags.
Our cumulative culture relies on our ability to carry the knowledge from previous generations forward. For millennia, we have been upholding a cumulative culture, which leads to an exponential increase in our cultural output, and it has given us evolutionary advantages that no other species on the planet can compete with. Unprecedented technological progress and scientific breakthroughs today make the amount of information to carry forward staggering. This requires information sharing, worldwide collaboration, the algorithmic prowess of search engines, as well as the selfless efforts of countless volunteers to maintain, categorize, and help navigate what we know. The task is made easier by the fact that much of what we know has been digitized [2, 3]. The combination of the data deluge with recent advances in the theory and modeling of social systems and networks [4–12] enables quantitative explorations of our culture that were unimaginable even a decade ago. Recent research has been devoted to enhanced disease surveillance, the spreading of misinformation [14, 15], human mobility patterns [16, 17], the dynamics of online popularity, trading behavior [19, 20], the dynamics of our economic life, as well as universality in voting behavior, political polarity, and emotional blogging [24, 25], to name just some examples.
A network G = (V, E) is a pair consisting of a set of nodes (or vertices) V and a set of links (or edges) E, where N is the number of nodes and K is the number of links. In weighted networks, every link connecting two nodes u and v has an associated weight w_uv. The degree deg(u) of a node u is the number of links incident to u, and the set of neighbors of u is denoted as Γ(u). The strength s_u of a node u is the sum of the weights of all the links incident to u. More details about complex networks analysis can be found in  and all measures used to quantify the properties of the studied networks are listed in S1 Text.
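The definitions above can be illustrated with a minimal sketch. The dictionary-of-dictionaries representation, the function names, and the toy hashtags and weights are assumptions made for illustration only; they are not taken from the study's data or code.

```python
from collections import defaultdict


def build_weighted_network(weighted_links):
    """Build an undirected weighted network from (u, v, w_uv) triples."""
    adjacency = defaultdict(dict)
    for u, v, w in weighted_links:
        adjacency[u][v] = w  # store the weight in both directions,
        adjacency[v][u] = w  # because the network is undirected
    return dict(adjacency)


def neighbors(adjacency, u):
    """Gamma(u): the set of nodes linked to u."""
    return set(adjacency.get(u, {}))


def degree(adjacency, u):
    """deg(u): the number of links incident to u."""
    return len(adjacency.get(u, {}))


def strength(adjacency, u):
    """s_u: the sum of the weights of all links incident to u."""
    return sum(adjacency.get(u, {}).values())


# Hypothetical toy data: weights stand for co-occurrence counts.
links = [
    ("#data", "#science", 3),
    ("#data", "#python", 2),
    ("#science", "#python", 1),
    ("#python", "#code", 4),
]
network = build_weighted_network(links)

N = len(network)                                  # number of nodes
K = sum(len(nb) for nb in network.values()) // 2  # each link is stored twice

print(N, K, degree(network, "#python"), strength(network, "#python"))
```

In this toy network the node "#python" has three neighbors, so its degree is 3, while its strength is the sum of the three incident weights.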
In this section, we show all the results needed to communicate the main message of our research, while additional results are provided in the S1 Text, together with the definition of a standard set of network measures used for exploring the structure of networks.
Precision and F1 score values decrease as the share of links included in the networks grows from 25% to 75%, and this trend is present in both the all-words and the hashtags networks. In networks created from 25% of the data, many probable links are left out. At the same time, the most probable links are the most likely to be predicted, and the link prediction measures are most successful in predicting highly probable links. With more data in the 50% and 75% networks, the majority of highly probable links are already included in the network, so the prediction measure is expected to predict less probable links, which causes the drop in the prediction precision and the F1 score. At the same time AUC is prone to this effect. Zhao et al. observe similar problems in the dataset for testing, which they overcome by computing the odds ratio for correcting the prediction results. Following the same principle, we plan to introduce the odds ratio into the evaluation of link prediction in language networks.
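To make the evaluation metrics discussed above concrete, the following sketch computes precision, recall, F1 score, and AUC for a ranked list of candidate links. It assumes the standard textbook definitions; the exact evaluation protocol of the study is not given in this section, and the function names, scores, and held-out links below are hypothetical.

```python
def precision_recall_f1(ranked_pairs, true_links, top_n):
    """Precision, recall and F1 for the top_n highest-scored candidate links."""
    predicted = ranked_pairs[:top_n]
    hits = sum(1 for pair in predicted if pair in true_links)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(true_links) if true_links else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def auc(scores, true_links):
    """Probability that a held-out link outscores a non-link (ties count 0.5)."""
    pos = [s for pair, s in scores.items() if pair in true_links]
    neg = [s for pair, s in scores.items() if pair not in true_links]
    if not pos or not neg:
        raise ValueError("AUC needs at least one true link and one non-link")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical similarity scores for candidate (not yet observed) links.
scores = {("a", "b"): 0.9, ("a", "c"): 0.7, ("b", "d"): 0.4, ("c", "d"): 0.2}
# Hypothetical held-out links that appear in the later part of the data.
true_links = {("a", "b"), ("b", "d")}

ranked = sorted(scores, key=scores.get, reverse=True)
precision, recall, f1 = precision_recall_f1(ranked, true_links, top_n=2)
print(precision, recall, f1, auc(scores, true_links))
```

Precision and F1 depend on the cut-off `top_n`, whereas AUC compares the scores of all held-out links against all non-links.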
In this work we analysed link prediction based on local similarity measures in networks constructed from the content of tweets, using either all words or only hashtags. The main goal of this analysis is to find which measure performs better in the task of predicting the future linking of words and hashtags in the content of tweets, which can be utilized for the propagation of information and opinion in social networks.
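As an illustration of what a local similarity measure looks like in practice, the sketch below scores every unlinked node pair by its number of common neighbors and by the Jaccard index. These two widely used measures are chosen here as examples only; this section does not list the measures evaluated in the study, and the toy hashtag network and function names are hypothetical.

```python
from itertools import combinations


def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def rank_candidate_links(adj, score):
    """Rank all currently unlinked node pairs by descending similarity."""
    candidates = [
        (u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]
    ]
    return sorted(candidates, key=lambda pair: score(adj, *pair), reverse=True)


# Hypothetical unweighted hashtag network: node -> set of neighbors.
adj = {
    "#ai": {"#ml", "#data"},
    "#ml": {"#ai", "#data", "#python"},
    "#data": {"#ai", "#ml", "#python"},
    "#python": {"#ml", "#data", "#code"},
    "#code": {"#python"},
}

ranked = rank_candidate_links(adj, common_neighbors)
print(ranked[0], common_neighbors(adj, *ranked[0]), jaccard(adj, *ranked[0]))
```

Both measures use only the immediate neighborhoods of the two nodes, which is what makes them local: no paths longer than two links are inspected.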