Code and Craft

A blog by Amy Cheng

Scraping Wikipedia for Recent Contributions, a project post-mortem

This past Saturday there was a Wikipedia Edit-a-thon at the Brooklyn Museum. A few weeks ago, we were wondering how to make a group of people sitting at their laptops visually appealing to any museum visitors walking by. I suggested that it might be interesting to have a data visualization [1] of live changes. I also thought the project would be a good excuse to play around with Node.js and web sockets (via Socket.io). My manager gave me the go-ahead because I said it wouldn't take me very long [2].

There were a few problems that needed to be solved: getting the contribution data out of Wikipedia, and then pushing it to our web page as it came in.

The first problem was a bit involved, as Wikipedia doesn't really have a user-friendly, consumable API. Wikipedia runs on MediaWiki; its documentation does mention an API, but it didn't really give me the information I wanted.

What I ended up doing was writing a scraper that visited a User Contributions page (if the contributor is anonymous, their IP address is considered the user) every few seconds to grab the changes.
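Roughly, that scraping step looks something like this. This is a simplified sketch rather than the code I actually shipped: it assumes the request and cheerio npm packages, and the CSS selectors are illustrative (the real markup of the Special:Contributions page may differ).

var request = require('request');
var cheerio = require('cheerio');

function scrapeContributions(ip, callback) {
    // every contributor (or anonymous IP) has a Special:Contributions page
    var url = 'https://en.wikipedia.org/wiki/Special:Contributions/' + ip;
    request(url, function (error, response, body) {
        if (error || response.statusCode !== 200) {
            return callback(error || new Error('request failed'), null);
        }
        var $ = cheerio.load(body);
        var edits = [];
        // each contribution is a list item; grab its timestamp and page title
        $('ul.mw-contributions-list li').each(function () {
            edits.push({
                date: $(this).find('.mw-changeslist-date').text(),
                title: $(this).find('.mw-contributions-title').text()
            });
        });
        callback(null, edits);
    });
}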

The Museum had several IP address blocks, so I wrote some awkward-sauce code to generate a list of IP addresses:


getIP: function(block){
    // method for creating an array of IP addresses out of a set of IP blocks
    // block: an array of objects, each with the base of the IP address (ip),
    // the starting value of the last octet (last), and the size of the block (blockSize)
    var list = [];
    for (var i = 0; i < block.length; i++) {
        var base = block[i].ip;
        var last = block[i].last;
        for (var j = 0; j < block[i].blockSize; j++) {
            // build each IP address by appending the incremented last octet to the base
            var lastOctet = last + j;
            var ip = base + lastOctet.toString();
            console.log(ip);
            list.push(ip);
        }
    }
    console.log("finished parsing ip blocks!");
    return list;
}
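To show how that gets used, here's a made-up example. The block values below are placeholder documentation IPs, not the Museum's real ranges, and scraper is just a stand-in name for the object that holds getIP:

// hypothetical input: each entry has the base of the address, the starting
// value of the last octet, and how many addresses are in the block
var blocks = [
    { ip: '192.0.2.', last: 10, blockSize: 5 },   // 192.0.2.10 through 192.0.2.14
    { ip: '198.51.100.', last: 20, blockSize: 3 } // 198.51.100.20 through 198.51.100.22
];

var ipList = scraper.getIP(blocks);
// ipList is now ['192.0.2.10', '192.0.2.11', ... '198.51.100.22']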

So the app flow so far went like this:

  1. generate a list of IP addresses
  2. go to the "user" page for each of those IP addresses
  3. parse that page for the information we wanted
  4. then grab only the changes that happened after a certain time [3] (a sketch of that date check follows below)
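That last step boils down to a date comparison, which is where Moment.js came in (see [3]). A rough sketch; the threshold and the timestamp format are placeholders, not values from the actual app:

var moment = require('moment');

// placeholder threshold: only keep edits made after the event started
var threshold = moment('2014-02-01 11:00', 'YYYY-MM-DD HH:mm');

// edit.date is the timestamp string scraped from the contributions page;
// the format string here is an assumption about how that page renders dates
function isRecent(edit) {
    return moment(edit.date, 'HH:mm, D MMMM YYYY').isAfter(threshold);
}

var recentEdits = edits.filter(isRecent);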

So now, the app is able to get the information we want, but how do we push this data to our web page?

This turned out to be easier than figuring out the best way to grab data from Wikipedia. I just set up a Socket.io server:

var io = require("io);
io.sockets.on('connection', function (socket){
    console.log("sockets.io initialized");
    // if data has been polled before, append this content
    if (cache.length>0) {
        console.log("downloading cache");
    for (var i = 0; i < cache.length; i++) {
      io.sockets.emit('entry', cache[i]);
    };
    io.sockets.emit('pageCount',cache.length);
  };
  scrape();
  //polling Wikipedia site every 5 seconds
  setInterval(function(){
    console.log("ping");
    scrape();
  }, 5000);
});

Each time Wikipedia is scraped, our io server pings (the emit function) the front-end (the HTML page) with the data.
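On the front-end, the page just listens for those events. Something along these lines, where the element IDs and the shape of the data are assumptions on my part:

// client-side; assumes the Socket.io client script is already loaded on the page
var socket = io.connect();

socket.on('entry', function (edit) {
    // append each new contribution to a list on the page
    var item = document.createElement('li');
    item.textContent = edit.title + ' (' + edit.date + ')';
    document.getElementById('edits').appendChild(item);
});

socket.on('pageCount', function (count) {
    document.getElementById('pageCount').textContent = count;
});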

So the app flow so far is this:

  1. when the server starts up, it starts scraping Wikipedia
  2. when there are new Wikipedia page changes (determined by comparing the date of the edit to a date threshold), the server sends a JSON object to our HTML page
  3. when the HTML page receives that JSON object, it appends that information to the DOM

Now at this point in the project (about a week out from the event), I thought I could chill and wait around for the designer to give me comps.

After I implemented the design [4], I deployed the app to the dev server and then realized that Socket.io didn't work within subfolders. It was a reported issue, and I tried the solutions others had come up with, but with no luck. I even delved into the source code of the socket library and found that the code that was supposed to be part of the solution wasn't even in the source. I proceeded to freak the geek out.

Status for the last two days of the project.

My workaround (this was one day out) was to abandon web sockets and write an API that the front-end would ping for changes.

It was straightforward to serve the JSON:

if (path == '/api') {
    response.writeHead(200, {"Content-Type": 'application/json', 'Access-Control-Allow-Origin': '*'});
    // output is the data we're scraping from Wikipedia
    if (output.length > 0) {
        response.end(JSON.stringify(output));
    } else {
        response.end(JSON.stringify(null));
    }
}
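The front-end then hits that endpoint on a timer instead of waiting for a push. A minimal sketch; the polling interval and the appendEdit helper are stand-ins:

// client-side polling, replacing the web-socket push
function poll() {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/api');
    xhr.onload = function () {
        var data = JSON.parse(xhr.responseText);
        if (data) {
            // data is the array of scraped edits; appendEdit is a hypothetical
            // helper that adds one edit to the DOM
            data.forEach(appendEdit);
        }
    };
    xhr.send();
}

setInterval(poll, 5000);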

The app isn't perfect; one bug is inconsistent scraped data from Wikipedia. The code is on my GitHub account, but I don't feel comfortable promoting the git repo until I've cleaned up the code a bit. >_<

Important Takeaways

Asides

[1] Though "data visualization" is a misnomer; it was more like a dashboard.

[2] Of course it took longer than expected. I have a very bad habit of underestimating the scope of a project, unless it's just a static web site with vanilla HTML and CSS.

[3] This was a lot more involved than one would think, because dates in JavaScript are represented as incomprehensible Unix timestamps. I used the Moment.js library to make it easier to compare dates.

[4] I made the design responsive, even though it was doubtful that we needed it. I thought, "Might as well."
