How Big Data News Analytics leads to Data Driven Journalism

23 November, 2012

In the old days, reading the news was easy. We read what journalists wrote and trusted that they had analyzed the news, selected the most important developments and checked their facts before publishing what was taken to be the truth. The internet has changed this forever. It is a ‘free’ source of information for journalists as well as their readers, so the journalist now needs to demonstrate that the fact finding was done using up-to-date sources of information and important news.

A new Era in News finding and creation

Data driven journalism takes news a level further with insight from data. The journalist becomes a data scientist who analyzes available data sources to support a story with in-depth analysis of relevant data, and so adds much more value to the story than the free information on the internet does. In addition, data driven journalism gives readers an inside look at important trends and developments in society, based on the analysis and interpretation of public data that is often provided by government agencies.

The Guardian (UK) is one of the best known quality newspapers and strongly supports data driven journalism by providing access to its data.

The quality of news is determined by the quality of the data, which depends on many aspects including the accuracy and completeness of the data. Data driven journalism is moving in the direction of analyzing Big Data; to derive signal from noise, a journalist needs access to professional, reliable data, analysis methods and software tools.

Analysing Big News from The Guardian

To demonstrate how data journalism could work using advanced search tools, we at Treparel used The Guardian API to extract data sets on specific topics and analyze them. As an example, we searched The Guardian API for all documents related to “big data”, which returned a total of 10,521 documents. We had to analyze this set in more detail because, as you can imagine, it contains a lot of noise.
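A search like the one described above might be sketched as follows. This is a minimal illustration, not the code used for this analysis: the endpoint and the `q`, `api-key`, `page` and `page-size` parameters are assumptions based on the public Guardian Open Platform content search, the response layout (`response.total`, `response.results`) is assumed likewise, and the API key is a placeholder you would need to request yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Guardian Open Platform content search.
SEARCH_URL = "https://content.guardianapis.com/search"


def build_search_url(query, api_key, page=1, page_size=50):
    """Build the search URL for one page of results."""
    params = {
        "q": query,
        "api-key": api_key,
        "page": page,
        "page-size": page_size,
    }
    return SEARCH_URL + "?" + urlencode(params)


def parse_search_response(payload):
    """Return (total number of hits, results on this page) from a decoded response."""
    body = payload["response"]
    return body["total"], body.get("results", [])


def fetch_page(query, api_key, page=1):
    """Download and parse one page of results (requires network access and a valid key)."""
    with urlopen(build_search_url(query, api_key, page)) as resp:
        return parse_search_response(json.load(resp))
```

Calling `fetch_page('"big data"', "YOUR-KEY")` would return the total hit count and the first page of articles; the quotes around the phrase are meant to request an exact phrase match, which is also an assumption about the query syntax.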

Visualization of all 10,521 news articles in which big data is mentioned.

The annotation terms on top of each cluster provide an overview of the different topic areas where big data is mentioned in the news articles provided by the Guardian.

From this landscape visualization we can immediately identify important topic areas (called ‘clusters of text’). The clusters, together with their most important words (‘annotations’), make it easy to spot the most frequently addressed topics.

A closer look at Google vs Microsoft

Based on this visualization we notice that Google, Apple and Microsoft are mentioned often in ‘big data’ articles. We decided to highlight where in the clusters these companies are mentioned, to understand the relationship between each company and the topics of the clusters.

All articles on ‘big data’ where Google is mentioned (shown as green dots)

Google is mentioned in 1,274 of the 10,521 news articles.

All articles on big data where Microsoft is mentioned (shown as red dots)

Microsoft is mentioned in 928 articles, shown by the red dots in the visualization. These documents are concentrated around games and Facebook, video and search, and mobile internet. The articles in which Google is predominantly mentioned focus more on search/video and mobile internet.

All articles on big data where Apple is mentioned (shown as blue dots)

Apple is mentioned in 1,051 articles, shown by the blue dots, which is about 10% of the full set of 10,521 articles. Apple is mentioned across a much broader range of topics, with a focus on mobile phones and the internet.
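The shares behind these counts are simple to verify. The sketch below recomputes them from the figures reported above; the `mentions` helper is a hypothetical, naive stand-in for the company filter (a case-insensitive whole-word match), not the filter used in KMX, and it cannot tell Apple the company from apple the fruit.

```python
import re

TOTAL_ARTICLES = 10521
# Article counts per company as reported in the text.
MENTIONS = {"Google": 1274, "Apple": 1051, "Microsoft": 928}


def share_percent(count, total):
    """Share of the full set, in percent, rounded to one decimal."""
    return round(100.0 * count / total, 1)


def mentions(text, name):
    """Naive filter: True if `name` occurs as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


shares = {name: share_percent(n, TOTAL_ARTICLES) for name, n in MENTIONS.items()}
print(shares)  # {'Google': 12.1, 'Apple': 10.0, 'Microsoft': 8.8}
```

So Google appears in roughly 12% of the articles, Apple in about 10% and Microsoft in just under 9%.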

To exclude the least relevant documents, we selected all news articles with a calculated relevance ranking above 80%. This limits the set to the most important (or relevant) articles on ‘big data’ about Google, Apple (white/yellow dots) and Microsoft (red dots): 215 articles in total.
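The selection step can be sketched in a few lines, assuming each article already carries a relevance score between 0 and 1. How KMX calculates that score is not described here, so the `relevance` field and the sample values are purely illustrative.

```python
def select_relevant(articles, threshold=0.80):
    """Keep articles scoring strictly above the threshold, most relevant first."""
    kept = [a for a in articles if a["relevance"] > threshold]
    return sorted(kept, key=lambda a: a["relevance"], reverse=True)


# Hypothetical sample data; titles and scores are made up for illustration.
sample = [
    {"title": "A", "relevance": 0.91},
    {"title": "B", "relevance": 0.42},
    {"title": "C", "relevance": 0.80},
    {"title": "D", "relevance": 0.97},
]
top = select_relevant(sample)
print([a["title"] for a in top])  # ['D', 'A']
```

An article scoring exactly 80% is excluded here, because the text says "above 80%".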

News articles ranked by relevance on internet technology.

Now that we have excluded irrelevant articles, we can analyze much better which articles are the most important with respect to internet technologies.

Is there a story?

Through these analyses we find some insightful, relevant articles within a general theme like ‘big data’.

There is a cluster on ‘cloud computing’ and one on ‘videos of Youtube’, but dominant in the centre are the articles about Apple’s technologies for tablets and mobile phones (iPad and iPhone). Related to these are the articles on patents, where Nokia is important because it owns many basic patents. When we look for the topics that are important in relation to Microsoft, we see its Windows OS but also games and the Xbox (where Sony pops up as well). If we then look at the important articles related to Google, we find topics related to ‘search technologies’ and ‘privacy’.

We could now ask ourselves how this evolved over the years from 2000 to 2012. Since the most discussed topics are related to Apple, we select Apple and visualize all articles over time, using blue rings for Apple and a color mapping from red (2000) to white (2012). This gives the final visualization shown below.

Trend of all articles from 2000 (red) to 2012 (white) related to Apple (blue rings) and all other news articles.
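A red-to-white color mapping of this kind can be sketched as a linear interpolation over the publication year. This is an assumption about how such a mapping could work, not a description of the KMX implementation; the function name and the clamping of out-of-range years are illustrative choices.

```python
def year_to_rgb(year, first=2000, last=2012):
    """Map a year to an RGB color between red (first) and white (last)."""
    clamped = min(max(year, first), last)
    t = (clamped - first) / (last - first)  # 0.0 at `first`, 1.0 at `last`
    level = int(255 * t + 0.5)              # green and blue rise together
    return (255, level, level)


print(year_to_rgb(2000))  # (255, 0, 0)
print(year_to_rgb(2006))  # (255, 128, 128)
print(year_to_rgb(2012))  # (255, 255, 255)
```

Older articles therefore show up as saturated red and recent ones fade toward white, which makes it easy to see where in the landscape the newer coverage accumulates.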


Conclusion? Microsoft is entering the market late ….

Looking at both visualizations, we notice that ‘big data’ is rapidly getting more media attention. They also show that since 2010 Apple has been gaining more interest from The Guardian than Google and Microsoft. Given that the articles are about technology, this demonstrates that competition is gearing up in ‘big data’ as well.

(Disclaimer: for this analysis we used KMX on a simple Windows desktop PC. It took us less than 10 minutes to analyze over 10,000 documents.)


Use Case: Data driven journalism on Guardian News articles

A recent example of KMX used on open data for data driven journalism is the Weyeser Explorer. The tool allows you to interact with the Cluster analysis of Guardian articles about ‘Obama’ from 2010.

The visualisation presents all Guardian Open Platform search results for ‘Obama’ in 2010, a year with many different big challenges for the US President. The stories are represented by dots and laid out in clusters according to how related they are to each other. So, for example, all stories about the BP oil spill will be positioned near the annotation ‘oil, spill, gulf’, and stories about the Iraq war fall near ‘iraq, troops, military’. These three-word annotations indicate common keywords in that area of the map.
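To give a feel for how three-word annotations can be derived, here is a minimal sketch that labels each cluster with its most characteristic words. It is not the KMX algorithm: it assumes the cluster assignments are already known, uses a tiny hand-picked stopword list, and scores words with a simple frequency weighting that favors terms occurring in few clusters. The sample headlines are invented.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a complete one).
STOPWORDS = {"the", "a", "of", "in", "on", "and", "to", "for", "is"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def annotate(clusters, top_n=3):
    """Map each cluster id to its top_n highest-scoring words.

    `clusters` maps a cluster id to the list of texts assigned to it.
    """
    term_counts = {
        cid: Counter(w for doc in docs for w in tokenize(doc))
        for cid, docs in clusters.items()
    }
    n_clusters = len(term_counts)
    # In how many clusters does each word occur at least once?
    cluster_freq = Counter(w for counts in term_counts.values() for w in counts)
    labels = {}
    for cid, counts in term_counts.items():
        scores = {
            w: c * math.log((1 + n_clusters) / cluster_freq[w])
            for w, c in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        labels[cid] = ranked[:top_n]
    return labels


clusters = {
    "c1": ["Oil spill in the Gulf", "BP oil spill reaches Gulf coast",
           "Gulf oil spill response"],
    "c2": ["Iraq troops withdraw", "Military troops leave Iraq",
           "Iraq military handover to Iraq troops"],
}
labels = annotate(clusters)
print(labels)  # {'c1': ['gulf', 'oil', 'spill'], 'c2': ['iraq', 'troops', 'military']}
```

On real data the word lists would come from thousands of articles per map, and the clustering itself is the harder part that this sketch leaves out.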

Big Data Media analytics example on Obama news from Guardian open data
Weyeser Explorer: KMX cluster analysis of Guardian articles about ‘Obama’ from 2010

Clusters of documents that are more closely related are also positioned closer together, so ‘muslim, centre, protest’, with stories about the Ground Zero protests, is close to ‘quida, bomb, terrorism’ and ‘intelligence, attack, terrorism’.

When you click on an annotation, the stories that relate most to its keywords are highlighted and listed on the right. This includes stories in other locations on the map when they also relate to other annotations. If you click ‘labour, cameron, tories’, you see stories relating to David Cameron near ‘taliban, Afghanistan, afghan’ and ‘iraq, troops, military’.

The cluster map and annotations were produced by Weyeser, who specialise in clustering and classifying text documents. The software is particularly well suited to navigation and discovery in large sets of unstructured data.
