Here is the link to the article: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0128692
Ok, so the objective of this paper is to prove that economic indicators can be built in a cost-effective way using publicly available social media datasets. This can be done by extracting behavioral patterns from social media data and relating them to the economic level of a city. The specific economic indicator that the paper talks about is the unemployment rate, because the authors believe that unemployment is the most important signal of the socioeconomic status of a region. Currently the unemployment rate is measured using tools like surveys and unemployment insurance claims, all of which look cost-intensive and hard to update. It is important to have a cost-effective way to gather unemployment data, not just to save time and money but also so that governments can do economic planning, make education policies, and do urban planning, transportation design etc. in a well-informed way.
What has to be measured to make inferences about the unemployment rate?
According to the report there are three things that could be measured from social media data:
- Deviations in diurnal rhythm or circadian rhythm
- Deviations in mobility pattern
- Deviations in communication styles
Why only these three? They have not explained this in the abstract; maybe they do later.
What is their dataset?
“19 million (later they say 146m) geo-located messages distributed among more than 340 different Spanish economic regions”. How did they get this data? Well, I guess they used publicly available messages on Twitter.
“To perform our analysis, we consider 19.6 million geo-located Twitter messages (tweets), collected through the public API provided by Twitter from continental Spain, ranging from 29th November 2012 to 30th June 2013. Tweets were posted by (properly anonymized) 0.57 Million unique users and geo-positioned in 7683 different municipalities.”
What were these 340 regions in Spain? Were they diverse enough to represent a general global pattern? I don’t know.
What were their findings?
“We find that regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates”.
Not very shocking if you think about it. But I am curious how they established a causal relationship between the unemployment rate and all the above symptoms, instead of just the observed correlation. Oops, here is what they say about that:
“Our goal is not to state causality between unemployment and the extracted metrics but to uncover the relationship emerging when we observe the economical metrics of cities and the social behavior at the same time.”
How is it done?
So you have the data from Twitter and you have decided which metrics you want to compute from it. How do you go about doing it?
- Check the sanity of your data. You have millions of geo-located tweets, so check whether they are distributed across all localities or concentrated in a few. This can be done by checking whether there is a relationship between the number of tweets from a locality and the population of that locality.
- Next, they dropped administrative boundaries for their analysis. Instead they built geographical communities of economic activity, using mobility data and some math. But the more important question is why this was needed. Why couldn’t they run their analysis on the administrative communities? I don’t understand this fully. From what I gather, one reason was that administrative communities vary widely in the number of people living in them, so they wanted to even that out.
- Now, in order to get behavioral patterns, they focussed on extracting the following data-points:
  - Social media technology adoption.
  - Patterns in social media activity. Specifically, what time of the day people are posting tweets.
  - Social media content. Linking the language used in the posts with education level.
  - Social media interaction patterns. Whether people from one community interact with people from other communities. They used user-mentions to measure this.
- Observe the relationship of the collected data with unemployment. So:
  - The higher the penetration rate, the higher the unemployment.
  - Areas with high unemployment saw more tweets during the afternoon and late night than areas with low unemployment. Presumably unemployed people wake up later.
  - Communities with a low education level have more misspelled tweets.
  - Unemployment is inversely related to diversity of communication. Fewer people in high-unemployment communities @mention users from other communities. This is the weakest indicator.
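To make the steps above a bit more concrete, here is a minimal Python sketch of how metrics like these could be computed and compared with unemployment. This is my own illustration, not the authors' code. The toy regions and all their numbers are made up, the 8 to 11 morning window is an assumption, and the helper names (`pearson`, `morning_fraction`, `mention_entropy`) are hypothetical. I use a plain Pearson correlation and Shannon entropy as simple stand-ins; the paper's exact statistics may differ.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def morning_fraction(hours, start=8, end=11):
    """Share of tweets posted between `start` and `end` (inclusive hours).

    The 8 to 11 window is an assumption made for this sketch.
    """
    return sum(start <= h <= end for h in hours) / len(hours)


def mention_entropy(mentioned_regions):
    """Shannon entropy of the regions that a community's @mentions point to.

    Higher entropy means mentions are spread over more regions.
    """
    counts = Counter(mentioned_regions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Made-up toy regions, for illustration only (NOT the paper's data).
regions = [
    {"population": 200_000, "tweets": 620, "users": 400,
     "hours": [8, 9, 10, 11, 14, 20],
     "mentions": ["A", "B", "C", "D"], "unemployment": 0.10},
    {"population": 100_000, "tweets": 290, "users": 300,
     "hours": [9, 10, 13, 16, 21, 23],
     "mentions": ["B", "B", "A", "C"], "unemployment": 0.20},
    {"population": 50_000, "tweets": 160, "users": 250,
     "hours": [10, 13, 15, 18, 22, 23],
     "mentions": ["C", "C", "C", "C"], "unemployment": 0.30},
]

population = [r["population"] for r in regions]
tweets = [r["tweets"] for r in regions]
unemployment = [r["unemployment"] for r in regions]

# Sanity check: do tweet counts follow population?
print("tweets vs population:", pearson(tweets, population))

# Behavioral metrics per region.
penetration = [r["users"] / r["population"] for r in regions]
morning = [morning_fraction(r["hours"]) for r in regions]
diversity = [mention_entropy(r["mentions"]) for r in regions]

# Relate each metric to unemployment.
print("penetration vs unemployment:", pearson(penetration, unemployment))
print("morning activity vs unemployment:", pearson(morning, unemployment))
print("mention diversity vs unemployment:", pearson(diversity, unemployment))
```

With real data you would feed in one row per economic community instead of three invented ones. The signs of the correlations are what you would then compare against the findings listed above.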
Hence Proved! Yeah, there were many limitations to this exercise: Twitter data is not very reliable, it is filled with bots and spam, the economic communities were built from mobility without knowing the reasons for that mobility, and so on. But even with all that, they were able to prove that data from social media can be used to get a good sense of the reality on the ground without relying on extensive surveys and official government data, which are often not available in times of crisis.