Playing with archived news articles

There is a lot of unrest in the world today. It feels like things deteriorate a little further every day for the weak, the poor and the marginalized sections of our society. But this is not the first time this has happened. The twentieth century was one of the bloodiest of all: world wars, proxy wars, decolonization, mass migration, ethnic cleansing, brutal dictatorships, financial crises. Sometimes we forget how recently most of the world stabilized. So I thought it would be interesting to revisit the stories from the days when people thought the world was ending, just as we think now. Maybe there are some lessons in those stories for us. Maybe there is some relief in them too: if we survived then, we will survive now.

“So I wanted to make a Twitter bot that would talk to the APIs of the NYT and the Guardian and pull up the most important stories from their archives.”

As I built the bot, I ran into one big issue. Pulling stories from the archives of the NYT and the Guardian is easy, but there is no way to know which stories are “most important”. For example, here is what one article from the NYT’s archive looks like:

The NYT lets you query its archive one month at a time, and you get around 6000 such articles for every month. How do you identify the most important articles among them? I tried playing with several parameters and failed to establish a defining rule for importance.
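To make the month-at-a-time querying concrete, here is a minimal sketch of how such a request could be put together and how one might tally a metadata field across a month of results. It is not the bot’s actual code. The URL pattern and the field names `print_page` and `type_of_material` are assumptions based on the public NYT Archive API, `YOUR_KEY` is a placeholder, and the sample records are made up purely for illustration.

```python
from collections import Counter

# Assumed URL pattern for the NYT Archive API (one month per request).
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


def archive_url(year, month, api_key):
    """Build the request URL for one month of the archive."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return ARCHIVE_URL.format(year=year, month=month) + "?api-key=" + api_key


def tally(docs, field):
    """Count how many articles share each value of a metadata field."""
    return Counter(str(doc.get(field, "missing")) for doc in docs)


# Made-up records standing in for the article list of one month.
sample_docs = [
    {"headline": {"main": "Example headline A"}, "print_page": "1",
     "type_of_material": "Front Page"},
    {"headline": {"main": "Example headline B"}, "print_page": "1",
     "type_of_material": "Article"},
    {"headline": {"main": "Example headline C"}, "print_page": "23",
     "type_of_material": "Article"},
]

if __name__ == "__main__":
    print(archive_url(1969, 7, "YOUR_KEY"))
    print(tally(sample_docs, "print_page"))
    print(tally(sample_docs, "type_of_material"))
```

Tallies like these show how articles are distributed across a field, but a large bucket such as “page 1” still holds many articles, so a tally alone does not single out the important ones.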

You can try it yourself. I built a tool to find patterns in NYT data; you can find it here:

As you can see by playing with the tool, no single parameter tells you whether a particular article was very important. Pulling all the articles from page 1 doesn’t help:

The same goes for the type of material:

And so on. So instead of trying to enumerate all the important articles one by one, I did the best I could in this situation: I turned to randomness.

Here’s what the bot does now:

  1. Every hour, it randomly selects dates and pulls data from the Guardian and the NYT.
  2. Every 5 minutes, it randomly selects a headline from the downloaded datasets, converts the headline into a tweet and tweets it out.
  3. Additionally, I have made it interactive. The bot replies to you with an article if you follow it or tweet at it.
  4. One more thing: if you tweet a date to the bot in the form mm/yy, it gives you an article from that date, either from the NYT or from the Guardian.
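The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the bot’s real code: the function names are hypothetical, the year range and the 280-character limit are my own choices, a two-digit year is assumed to mean the 1900s, and the fetching, scheduling and tweeting parts are left out entirely.

```python
import random
import re


def random_month(rng, first_year=1900, last_year=1999):
    """Pick a random (year, month) pair to download (step 1)."""
    return rng.randint(first_year, last_year), rng.randint(1, 12)


def pick_headline(rng, docs):
    """Pick one downloaded article at random and return its headline (step 2)."""
    return rng.choice(docs)["headline"]["main"]


def headline_to_tweet(headline, url, limit=280):
    """Join headline and link, trimming the headline so the tweet fits."""
    room = limit - len(url) - 1  # one character for the separating space
    if room < 1:
        raise ValueError("URL leaves no room for a headline")
    if len(headline) > room:
        headline = headline[:room - 1].rstrip() + "…"
    return headline + " " + url


# Matches mm/yy, e.g. "07/69"; the month must be between 1 and 12.
DATE_PATTERN = re.compile(r"\b(0?[1-9]|1[0-2])/(\d{2})\b")


def parse_month_request(text, century=1900):
    """Extract (year, month) from a tweet containing mm/yy (step 4)."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    return century + int(match.group(2)), int(match.group(1))


if __name__ == "__main__":
    rng = random.Random()
    print(random_month(rng))
    print(parse_month_request("@bot 07/69"))
    print(headline_to_tweet("Example headline", "https://example.com/a"))
```

Passing the random generator in as an argument keeps the selection easy to test with a fixed seed, while the running bot can use an unseeded one.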

Here is the bot:

Here is the code for the bot:

Here is the code for the NYT API tool:

Final Word:

Without a way to identify the most important stories, the bot does not achieve its original objective.


