On the Estimation and Prediction of Tie Strength in Social Networks
Abstract
Humans interact with each other both online and in person, forming and dissolving social ties throughout their lives. The flexible architecture of networks, or graphs, makes them a useful paradigm for modeling these complex relationships at the individual, group, and population levels. Social networks have been shown to have a direct impact on public health: network properties have been leveraged to target highly connected individuals in public health interventions, and households that refuse to have their children vaccinated against polio have been found to have a disproportionate number of social ties to other vaccine-reluctant and vaccine-refusing households. Social network data have traditionally been collected through surveys, mostly capturing small, static snapshots of a network at a single point in time. Dozens of metrics have been created to quantify and study the structure of these simple networks. However, with the recent availability of increasingly rich and complex network data, the limitations of these metrics have become clear. In the first chapter of this dissertation, we extend definitions of edge overlap, the proportion of friends two connected individuals share, to weighted and directed networks, and we present closed-form expressions for the mean and variance of each version for the classic Erdős-Rényi random graph and its weighted and directed counterparts. We apply these results to social network data collected in rural villages in India, and we use our analytical results to quantify the extent to which the average edge overlap in the empirical social networks deviates from that of corresponding random graphs.
Finally, we carry out comparisons across attribute categories including sex, caste, and age, finding that women tend to form more tightly clustered friendship circles than men, where the extent of overlap depends on the nature of the social interaction in question.
In social networks, the notion of tie strength, and the factors that influence it, have received much attention in a myriad of disciplines for decades. With the internet and cellular phones providing additional avenues of communication, measuring and inferring tie strength has become much more complex. Measuring and predicting tie strength, and understanding the factors that drive it, has been an expanding area of interest, with increasing utility and complexity as forms of communication via mobile phones and social media continue to multiply. Knowledge of the strength of a tie, as well as of the social dynamics contributing to tie strength, has been shown to increase the accuracy of link prediction, enhance the modeling of the spread of disease and information, and lead to more targeted marketing. Numerous models incorporating indicators of tie strength have been proposed and used to quantify relationships in both online and offline social networks, and a standard set of structural network metrics has been applied, predominantly to online social media sites, to predict tie strength. The second chapter of this dissertation details tie strength prediction methodology. We introduce the "social bow tie" framework, which for any given network tie is a small subgraph of the network consisting of the nodes and ties that surround the tie of interest, forming a topological structure that resembles a bow tie. We also define several intuitive and interpretable metrics that quantify properties of the bow tie, which enable us to investigate associations between the strength of the "central" dyadic tie and properties of the bow tie.
We combine the bow tie framework with machine learning to investigate which aspects of the bow tie are most predictive of tie strength in two very different types of social networks: a collection of medium-sized social networks from 75 rural villages in India and a nationwide call network of European mobile phone users. For two connected individuals, we find that the more their friendship circles overlap, the stronger the tie between them. Conversely, the more close-knit each individual's separate friendship network, the weaker the tie between them. Our findings also demonstrate that incorporating properties of the bow tie results in more accurate predictions of tie strength and a more nuanced understanding of the factors that are associated with it.
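To make the notion of edge overlap concrete, the following is a minimal sketch for an unweighted, undirected network. It assumes the common neighborhood-overlap form (shared neighbors divided by all other neighbors of either endpoint); the dissertation's own definitions, including the weighted and directed extensions, are not given in this abstract and may differ. The function name `edge_overlap` and the toy edge list are hypothetical.

```python
def edge_overlap(adj, i, j):
    """Overlap of the tie (i, j): the number of neighbors shared by i and j,
    divided by the number of distinct neighbors of i or j other than i and j.

    Assumption: this is the common neighborhood-overlap definition for
    unweighted, undirected networks, not necessarily the dissertation's.
    """
    if j not in adj[i]:
        raise ValueError("i and j must be connected by a tie")
    shared = adj[i] & adj[j]
    others = (adj[i] | adj[j]) - {i, j}
    if not others:
        # Neither endpoint has any other neighbor; define the overlap as 0.
        return 0.0
    return len(shared) / len(others)


# Toy undirected network (made-up data for illustration only).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("b", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

overlap_ab = edge_overlap(adj, "a", "b")  # a and b share c; others are c, d, e
overlap_ad = edge_overlap(adj, "a", "d")  # a and d share no neighbors
```

In this toy network the tie between a and b has one shared neighbor out of three other neighbors, while the tie between a and d has none, so the first tie has the higher overlap.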
Missing data and non-response are common occurrences in, and great hindrances to, the analysis of social network data. While any kind of statistical analysis can be negatively affected by missingness, the effects can be even more detrimental in network data analysis because of the high sensitivity of network topology to missing data and the complexity of network surveys and data collection. Many imputation methods have been introduced in the classical statistics literature as a way to maintain power and sample size in the presence of missing data. However, the extension of these methods to the network framework has scarcely been studied. The third chapter of this dissertation addresses the issue of missing data in statistical analyses of network data. We use Super Learner to impute both edge and nodal attributes of a nationwide call network of European mobile phone users with varying amounts of missingness. We impute the age, age category, and sex of individuals, and the total call duration and text message communication between two individuals over a three-month time period. We find that Super Learner performs as well as or better than any individual learning algorithm alone for the imputation of each attribute, and that the amount of missingness does not significantly affect performance. Additionally, we find that the accuracy of age category imputation is sensitive to the choice of categorical thresholding. A thresholding scheme that results in approximately equal proportions of individuals in each category ensures a gain in age-stratified accuracy over the null accuracy of random assignment, but yields a lower overall accuracy than thresholding that results in imbalanced categories.
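As an illustration of thresholding that yields approximately equal proportions per category, the sketch below derives cut points from sample quantiles and assigns each individual to a category. The ages, the choice of three categories, and the helper names are all hypothetical; the abstract does not state the thresholds or the number of categories actually used.

```python
import bisect
import statistics
from collections import Counter


def equal_proportion_cutpoints(values, n_categories):
    """Return n_categories - 1 cut points taken from the sample quantiles."""
    return statistics.quantiles(values, n=n_categories)


def categorize(value, cutpoints):
    """Index of the category that value falls into (0 = lowest)."""
    return bisect.bisect_right(cutpoints, value)


# Made-up ages for illustration only.
ages = [18, 19, 21, 22, 25, 27, 30, 33, 38, 41, 47, 52, 60, 64, 71]

cuts = equal_proportion_cutpoints(ages, 3)
labels = [categorize(age, cuts) for age in ages]
counts = Counter(labels)

# Share of individuals per category; roughly equal by construction.
proportions = {category: n / len(ages) for category, n in sorted(counts.items())}
```

With balanced categories, no single category dominates, so a method cannot score well simply by always predicting the largest group; this is one plausible reading of why balanced thresholds help stratified accuracy while lowering overall accuracy.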
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Citable link to this page
http://nrs.harvard.edu/urn-3:HUL.InstRepos:40046457
Collections
- FAS Theses and Dissertations [6136]