2015 is a significant year for Singapore. The country celebrates her 50th birthday and mourns the passing of her first Prime Minister in the same year. While Singaporeans are generally proud of our progress – from third to first-world in 50 years – one wonders how this progress has been portrayed through the lens of global media. By examining how Singapore has been featured in the news, such as in which news categories and how often, could we possibly:
- Predict Singapore’s growth trends, such as her rise to an Asian economic tiger? (e.g. steady increase in business news mentions)
- Infer how much Singapore is affected by world events, such as a disease outbreak? (e.g. abrupt increase in health news mentions)
To answer these questions, news articles containing mentions of Singapore from 1955 to 2014 were harvested using the New York Times API, which also includes articles from AFP and Reuters. Contents from headlines and abstracts of 20,160 articles were condensed into topics using topic modeling analysis executed in MALLET. Topic modeling analysis derives a list of topics that best categorizes a given corpus of articles. Words that appear together often across articles are coalesced to form topics. This is an example of a generated topic:
company singapore billion million group global stake deal buy percent bid crossing sell largest exchange sale stock telecommunications
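To make the co-occurrence idea concrete, here is a toy sketch of latent Dirichlet allocation fitted by collapsed Gibbs sampling, the general family of method that MALLET's topic modeling belongs to. It is not MALLET's code (MALLET is a Java toolkit and far more optimized). The function name `toy_lda`, the four miniature "articles", and the values of `alpha`, `beta` and `iterations` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts. alpha and beta are illustrative smoothing values.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by giving every word token a random topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of its current topic...
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ...then resample: a topic is favoured if it is already common
                # in this document and already contains this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Made-up miniature corpus: two "business" and two "aviation" snippets.
docs = [
    "company stake deal stock exchange".split(),
    "stock exchange bid sale company".split(),
    "airline flight airport passenger route".split(),
    "airport flight airline route passenger".split(),
]
doc_topic, topic_word = toy_lda(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(5) if c > 0])
```

Each printed line lists the words most often assigned to one topic. Labeling those word lists is still left to the reader, and results on a corpus this small can vary with the seed.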
While the algorithm derives topic compositions, the onus is on the researcher to interpret and label each topic appropriately. Words with multiple meanings can be represented in multiple topics. For example, the word “stock” could appear in both a “trading” topic and a “cooking” topic. Topics generated from analyzing Singapore-related news articles were summarized in an interactive chart (click image below for access, opens in new tab/window):
chart template by mbstock
The chart shows the median proportions of news articles represented by each topic, for each year. Trend lines show fluctuations in how heavily each topic is represented in Singapore-related articles across time. See the words associated with each topic by hovering your mouse over trend lines. Notice how spikes in topic representations coincide with related events in Singapore (marked in red).
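One reading of "median proportions ... for each year" is: for every year, take the median across that year's articles of each topic's proportion. The sketch below follows that reading. The function name `median_topic_share_by_year` and the five sample rows are hypothetical, and the actual chart may have been computed differently.

```python
from collections import defaultdict
from statistics import median


def median_topic_share_by_year(articles):
    """articles: iterable of (year, [proportion per topic]).

    Returns {year: [median proportion per topic]}, with years in ascending order.
    """
    by_year = defaultdict(list)
    for year, proportions in articles:
        by_year[year].append(proportions)
    return {
        year: [median(column) for column in zip(*rows)]
        for year, rows in sorted(by_year.items())
    }


# Made-up topic proportions for three topics (each row sums to 1).
articles = [
    (1994, [0.70, 0.20, 0.10]),
    (1994, [0.50, 0.30, 0.20]),
    (1994, [0.60, 0.10, 0.30]),
    (1995, [0.20, 0.70, 0.10]),
    (1995, [0.40, 0.50, 0.10]),
]
trend = median_topic_share_by_year(articles)
for year, medians in trend.items():
    print(year, [round(m, 2) for m in medians])
```

Plotting one topic's value from each year in sequence gives a trend line of the kind shown in the chart. The median is less sensitive than the mean to a handful of articles dominated by a single topic.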
- To ensure economic survival, Singapore relies on global trade, facilitated by her strategic seaport as well as an open air hub. Topic modeling across time allows for a closer chronological analysis, which reveals that while business from air travel was increasing in news relevance, the opposite was true for seaport activities. This could be attributed to the growth of rival maritime ports in the region and increased competition for sea trade. Apart from economy-related terms, constituent topics also contained frequent co-mentions of other countries, subtly revealing one key factor in Singapore’s success – global connectivity.
- While economic activity may be a substantial subject for the international media, topic modelling also allows us to detect less frequently occurring topics that are easily overlooked but which could provide useful insights. For example, lifestyle topics of food and travel have seen moderate increases over time, likely in line with Singapore’s rising standards of living.
- Apart from economic matters, topic modeling also offers a glimpse of Singapore’s style of political governance. There is a peak in 1995 for a topic relating to the International Herald Tribune being ordered to pay damages in Singapore for publishing an opinion article held to have libeled the country’s government officials. This is not an isolated incident of Singapore’s government taking legal action against opinion writers, and could reflect a darker side of the country’s history that is not otherwise publicized.
- With a deeper appreciation of historical trends in news topics, one may be able to better predict future trends. For example, Singapore’s positioning as a research and smart-data capital is reflected in its increasing presence in technology articles. In hindsight, one might have anticipated Singapore’s bold announcement in 2014 that it aims to become the world’s first smart nation.
More events reflected by trend line peaks:
- 1963–66: Indonesian–Malaysian confrontation (Konfrontasi)
- 1975: End of Vietnam War
- 1979: Vietnamese border raids in Thailand
- 1987: Black Monday stock market crash
- 2001: September 11 attacks
- 2002: Bali bombings
- 2008: Global financial crisis
While examining trending topics might be informative, several significant global events were noticeably absent, such as the beginning of the Space Age in 1957. Might missing topics help to identify untapped industries and anticipate demand for related expertise, such as Singapore’s current demand for aerospace talent?
Notes on Methodology:
Of course, topics might be “missing” due to the choice of parameters. In topic modeling analysis, the number of topics is predetermined, and that was set at 25 for the above analysis.
Increasing the number of topics allows more words to be represented. For example, the word “space” starts to appear in results generated from more than 100 topics. Here are the words associated with the “space” topic from the analysis of 125 topics:
port jersey authority world space anniversary satellite st moon york expense yesterday anniv birthday ronan marks commercial celebrated dr
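Checking whether a word such as “space” surfaces in any topic can be automated. The sketch below assumes the topic word lists are stored as tab-separated lines of topic id, weight, and top words, similar to the topic-keys file that MALLET can write. The helper name `topics_containing` is hypothetical, the word lists are the two topics quoted in this post, and the ids and the weights of 0.5 are placeholders.

```python
def topics_containing(topic_keys_text, word):
    """Return the ids of topics whose top-word list includes `word`."""
    hits = []
    for line in topic_keys_text.strip().splitlines():
        # Assumed layout: <topic id> TAB <weight> TAB <space-separated top words>
        topic_id, _weight, words = line.split("\t", 2)
        if word in words.split():
            hits.append(int(topic_id))
    return hits


# The two topics quoted in this post; ids and weights are placeholders.
topic_keys = (
    "0\t0.5\tcompany singapore billion million group global stake deal buy "
    "percent bid crossing sell largest exchange sale stock telecommunications\n"
    "1\t0.5\tport jersey authority world space anniversary satellite st moon "
    "york expense yesterday anniv birthday ronan marks commercial celebrated dr\n"
)

print(topics_containing(topic_keys, "space"))
print(topics_containing(topic_keys, "stock"))
print(topics_containing(topic_keys, "aerospace"))
```

Running the same lookup on the output of several runs with different numbers of topics would show the point at which a word first appears.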
With more topics to populate, the algorithm would help us capture more thematic words, but the resulting topics may be less interpretable, or may account for only a minuscule number of articles. In the 125-topic analysis, the “space” topic was the primary topic for only 0.3% of the article corpus. However, its meager representation in the corpus is not the main issue. The article identified by the algorithm as best representing this topic had the following headline and abstract:
CELEBRATION SET FOR ST. PATRICK’S – Forty-fifth Anniversary of Cathedral Consecration to Be Marked Tomorrow. The forty-fifth anniversary of the consecration of St. Patrick’s Cathedral will be celebrated tomorrow. Rev Dr Palen made Educ Bd pres, 1st clergyman to hold office.
Ok. And you wonder what could possibly be the reference to Singapore?
So with further investigation, it seems that this “space” topic might instead be a more obscure “celebration” topic. As the occasions for celebration covered by the New York Times would likely be far removed from Singapore’s context (e.g. the consecration of a cathedral in New York), it is no surprise that this topic is poorly represented in the corpus.
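The two checks used above, how often a topic is an article's primary topic and which article best represents it, can be sketched from a table of per-article topic proportions. The function names and the four sample rows are hypothetical. "Primary" is taken to mean the topic with the largest proportion in an article, and "best representing" the article with the largest proportion for that topic.

```python
def primary_topic_share(doc_topic, topic):
    """Fraction of articles whose largest topic proportion belongs to `topic`."""
    primary = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]
    return primary.count(topic) / len(doc_topic)


def best_representative(doc_topic, topic):
    """Index of the article with the highest proportion for `topic`."""
    return max(range(len(doc_topic)), key=lambda d: doc_topic[d][topic])


# Made-up proportions: one row per article, one column per topic.
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.45, 0.15, 0.40],
    [0.30, 0.10, 0.60],
]

share = primary_topic_share(doc_topic, topic=2)
top = best_representative(doc_topic, topic=2)
print(f"topic 2 is primary for {share:.0%} of articles; best example is article {top}")
```

Reading the top few articles for a topic, as was done for the “space” topic, is a quick sanity check on whether a label actually fits.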
In summary, one would have to balance between the number of topics captured and the significance of each topic. Casting the net wide by setting a high initial number of topics might pull in more topics, but each topic might be less significant. Depending on the research question, results might have to be filtered again to pick out relevant topics.
For the current analysis, a manageable number of topics was chosen for parsimony, and to ensure that resulting topics were interpretable.
Annalyn, N. (2016, September). Automated Biography for a Country – Using Computational Methods to Study Historical Trends. Paper presented at the 6th Annual International Conference on Political Science, Sociology and International Relations, Singapore. [PDF]
7 thoughts on “Automated Biography for a Nation”
That’s a neat idea. I did a similar approach, using BBC country profile data and GDELT, but in a much more brute force way (https://csaladenes.wordpress.com/2015/05/23/insurgent-dynamics-a-systematic-analysis-of-social-unrest-using-the-gdelt-event-database/) May I use your methodology (of course by referencing you) in my update? Thanks!
Sure Dénes, thanks for dropping by. I read your latest posts – your data visualizations are stunning.
hey, thanks 🙂
1. The “kink” on the scree plot is indeed unconvincing. Then again the plot doesn’t exactly straighten out. Seems to be asymptotic to infinity.
2. How does the algorithm decide if a set of words belongs well enough together to form a topic?
3. How is topic prevalence determined at a particular time-point?
4. Can the algorithm be sharpened/ trained by having users edit the topic words after auto-generation and by selectively discarding or adding articles from the corpus to particular topics and in so doing change the way the algorithm selects articles in future? (i.e. are there machine learning elements inbuilt?)
5. In determining the predictive capability of such a tool, perhaps it will be useful to see if there is a pattern to topic prevalence just before a significant event. To do so, possibly the time window and time scale need to be smaller. E.g. are there localized increases in prevalence in the months or few years leading up to SG embarking on its ultimately not so successful focus on life sciences, or the launch of the smart cities initiative, etc.
6. Another possibly useful idea is to correlate topics with other measures, such as life expectancy, infant mortality, GDP, Gini coefficient, etc.