Wed. Dec 7th, 2022
Stylized version of Twitter's bird logo.

A number of high-profile criminal cases have been solved using DNA that relatives of the suspects placed in public databases. One lesson from those cases is that our privacy is not entirely under our control: because your relatives share DNA with you, they have the power to decide what everyone can learn about you.

Now some researchers have shown that something similar is true of our words. Using a database of past tweets, they were able to effectively predict the next words a user was likely to use. But they could do so even more effectively if they had access to what that person’s contacts were saying on Twitter.

Entropy is inevitable

The work was done by three researchers from the University of Vermont: James Bagrow, Xipei Liu and Lewis Mitchell. It revolves around three different concepts regarding the informational content of posts on Twitter. The first is the concept of entropy, which in this context describes how many bits are needed on average to describe the uncertainty about future word choices. One way of looking at this is that, if you are sure that the next word is chosen from a list of 16, the entropy will be four (2^4 = 16). The average social media user has a vocabulary of 5,000 words, so choosing randomly from it would give an entropy of just over 12. The second concept is perplexity, the value obtained by raising two to the power of the entropy: 16 in the example we just used, where the entropy is four.
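The arithmetic in this paragraph can be checked with a few lines of Python. This is only an illustrative sketch; the function names are invented for the example and do not come from the paper.

```python
import math

def uniform_entropy_bits(n_choices: int) -> float:
    """Entropy, in bits, of picking uniformly at random from n_choices words."""
    return math.log2(n_choices)

def perplexity(entropy_bits: float) -> float:
    """Perplexity is two raised to the power of the entropy."""
    return 2.0 ** entropy_bits

def entropy_bits(probabilities) -> float:
    """Shannon entropy, in bits, of an arbitrary word distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

if __name__ == "__main__":
    print(uniform_entropy_bits(16))    # a list of 16 words: four bits
    print(perplexity(4.0))             # four bits of entropy: perplexity of 16
    print(uniform_entropy_bits(5000))  # a 5,000-word vocabulary: just over 12 bits
```

The last function shows the general case: when some words are more likely than others, the entropy drops below the uniform value, which is what makes real users easier to predict than random word choice.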

The last concept they used is called predictability, which is simply the probability of accurately predicting the next word used.

To see how these concepts worked in the social media world, the researchers turned to a database of about 14,000 Twitter users who collectively produced more than 30 million tweets. Within this, they identified 927 users and, for each of them, the 15 users they interacted with the most. The history of those interactions was fed into an algorithm that measures how predictable future word usage is, given what happened in the past.
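The article does not describe the algorithm itself. One well-known family of methods for estimating the entropy of a sequence from its own past uses Lempel-Ziv style match lengths: for each position, find the shortest run of words starting there that has not appeared earlier. The sketch below implements that generic estimator as an illustration; treating it as a stand-in for the authors' method is an assumption, and the function names are hypothetical.

```python
import math

def match_length(words, i):
    """Length of the shortest run starting at position i that does not
    appear as a contiguous run entirely inside words[:i]."""
    n = len(words)
    length = 1
    while i + length <= n:
        target = words[i:i + length]
        # Search only the past: candidate runs must end at or before i.
        seen_before = any(
            words[j:j + length] == target for j in range(0, i - length + 1)
        )
        if not seen_before:
            return length
        length += 1
    # Every run up to the end of the text was seen before.
    return length

def lz_entropy_estimate(words):
    """Match-length entropy estimate, in bits per word.
    Long match lengths (lots of repetition) give a low estimate."""
    n = len(words)
    if n < 2:
        raise ValueError("need at least two words")
    total = sum(match_length(words, i) for i in range(n))
    return n * math.log2(n) / total

if __name__ == "__main__":
    repetitive = "good morning good morning good morning good morning".split()
    varied = "one two three four five six seven eight".split()
    print(lz_entropy_estimate(repetitive))
    print(lz_entropy_estimate(varied))
```

A cross-user version of the same idea would search another user's earlier tweets for matches instead of the user's own, which is one way to ask how much a contact's words reveal about what someone will write next. The estimator is biased on short texts, so it only becomes meaningful with long histories.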

In general, people were quite predictable. Most of these 927 users clustered in an entropy range between 5.5 and eight bits, meaning that the next word is usually found in a list of between 45 and 256 words. The researchers then looked at the single user with whom each person interacted most often. The cross-user entropy was typically in the range of six to twelve bits. The high end of that range is roughly equivalent to choosing words at random, but the low end is far below random, equivalent to the next word being drawn from a list of 64. In terms of predictability, a user’s own history gave 40 to 70 percent, while their closest contact’s history yielded zero to 60 percent.
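How entropy translates into a predictability percentage is not spelled out here. A standard way to connect the two is Fano's inequality, which gives the highest prediction accuracy compatible with a given entropy and vocabulary size. The sketch below solves that relation numerically. Using Fano's inequality and the 5,000-word vocabulary mentioned earlier are both assumptions made for illustration, not details confirmed by the article.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a coin that lands heads with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def fano_entropy(p: float, vocab_size: int) -> float:
    """Largest entropy compatible with best-guess accuracy p (Fano's bound)."""
    return binary_entropy(p) + (1.0 - p) * math.log2(vocab_size - 1)

def max_predictability(entropy_bits: float, vocab_size: int, tol: float = 1e-9) -> float:
    """Solve fano_entropy(p) = entropy_bits for p by bisection.
    fano_entropy decreases from log2(vocab_size) to 0 on [1/vocab_size, 1]."""
    if not 0.0 <= entropy_bits <= math.log2(vocab_size):
        raise ValueError("entropy must lie between 0 and log2(vocab_size)")
    lo, hi = 1.0 / vocab_size, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fano_entropy(mid, vocab_size) > entropy_bits:
            lo = mid  # too much entropy allowed, so accuracy must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    VOCAB = 5000  # assumed vocabulary size, taken from the article's average figure
    for bits in (5.5, 8.0):
        print(bits, round(max_predictability(bits, VOCAB), 3))
```

Under these assumptions, entropies of 5.5 and eight bits both map to predictabilities inside the 40 to 70 percent range quoted above, with lower entropy giving the higher figure.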

So predictable

But most users interact with many different people online, and some interactions may be more relevant than others. So the authors kept adding a user's contacts one at a time and found that each one improved predictability (or, in other words, decreased entropy). By the ninth contact, the entropy was actually lower than the entropy calculated from the user’s own words. In other words, knowing what your friends said made you more predictable than knowing what you had said. The drop in entropy continued up to the 15-user limit the authors set for this work.

That’s not to say your friends know you better than you know yourself. If you combine a user's own history with that of their contacts, predictability increases even further.

The authors thought some of this might simply be a product of the structure of language. So they shuffled the pairings between users, matching each with people they hadn’t interacted with. This dramatically reduced predictability, indicating that language structure wasn’t the whole story. In a similar fashion, they brought in unrelated tweets written at the same time to confirm that predictability wasn’t just a product of people talking about whatever topics were popular at that moment.
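The shuffling control described here can be sketched as a permutation of partner assignments in which nobody keeps their real partner. This is a simplified illustration with hypothetical names: it only guarantees that each user gets a different partner from the original one, and a real analysis would also need to exclude the user themselves and anyone else they had interacted with.

```python
import random

def shuffle_partners(partner_of, seed=0, max_tries=1000):
    """Return a new user -> partner mapping where every user is paired
    with a partner other than their original one.

    partner_of: dict mapping each user to their real interaction partner.
    """
    rng = random.Random(seed)
    users = list(partner_of)
    partners = [partner_of[u] for u in users]
    for _ in range(max_tries):
        shuffled = partners[:]
        rng.shuffle(shuffled)
        # Accept only shuffles that break every original pairing.
        if all(new != old for new, old in zip(shuffled, partners)):
            return dict(zip(users, shuffled))
    raise RuntimeError("could not find a shuffle that breaks every pairing")

if __name__ == "__main__":
    real = {"ann": "bob", "cat": "dan", "eve": "fay", "gus": "hal"}
    print(shuffle_partners(real))
```

If predictability computed from the shuffled pairs is much lower than from the real pairs, the gain in the real data reflects the social tie rather than general properties of the language.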

The authors then analyzed whether a user’s behavior on Twitter said much about how predictable they were. People who posted a lot (eight or more tweets a day) were generally more predictable. In addition, connected users who were active at a similar level didn’t contribute much to predictions, as they often tweeted about unrelated things. And a stronger social connection (measured by the number of connections users had) usually meant a stronger contribution to predictability.

If a connected user regularly contacted the main user, that connection increased predictability. But if the main user was the one making contact, that didn’t happen. This suggests that part of the key to predictability may be that a particular tweet comes in response to a prompt from a connection.

You can never leave

This has some obvious privacy implications. If a person leaves a social network but their history persists (as is the case with Twitter, the network analyzed here), then it should be possible to reconstruct and analyze their social network in order to gain some understanding of the person who has tried to become anonymous. In addition, if you can reconstruct a person’s offline relationships and find those contacts on social media, you may learn about a person who has never joined the service. As the authors of the paper describe it: “If an individual forgoes using a social media platform or deletes their account, but their social ties remain, that platform owner may still own 95.1 ± 3.36% of the viable predictive accuracy of that person’s future activities.”

The companies that offer these social media services are obviously in the best position to analyze these networks. For example, Facebook could infer the existence of a sibling who has never joined and then build a profile of what that person’s posts would likely sound like.

But there are definitely limits here. This doesn’t indicate that we can predict much about a person other than their likely social media posts, and more specifically their reactions to their connections’ posts. That’s pretty far from Minority Report-like predictability. But given that everyone from marketers to Russian intelligence agencies seems to be interested in figuring out users’ tendencies on social media, the finding that you don’t even need to be on social media for them to draw conclusions about you isn’t particularly reassuring.

Nature Human Behaviour, 2019. DOI: 10.1038/s41562-018-0510-5

By akfire1
