Saturday, September 13, 2014

The Ethical Challenge of "Passive Predation" in Data Science: Can Data Science Provide the Solution, and Not Just the Problem?

I recently ran across an intriguing blog post from Michael Malek, on "Predatory Data Science". Malek notes that data science methods, especially "black box" machine learning, can unintentionally create what he calls "passive predation"—that is, taking advantage of some vulnerable group despite having no intention to do so. He uses the example of a machine learning model, created for a gun manufacturer, that ends up targeting marketing efforts at the suicidal, by identifying keywords associated with depression. The data scientist using the tool in question wouldn't have intended that result, and probably would never even be aware of it, because the group of suicidal depressives would be buried amidst thousands of other micro-segments identified by the same application.

Malek perhaps overdraws his point in the middle part of the post—a historical account of the dehumanizing effects of technology that's reminiscent of Marx's condemnation of working for money in "The Alienation of Labor"—but his main argument is quite sound, and not a little scary.

I wonder, though, if data science itself could provide a solution to this problem. I hereby announce a very unofficial contest to find one, with prizes that will prove trivial at best (I might take a winner out to lunch, or talk about his or her idea at a Data Community DC meetup). Pretty much any method of accomplishing this goal, technical or non-technical, is fair game. Any takers?

Thursday, September 11, 2014

Online Course Review: Udacity's Intro to Hadoop and MapReduce

For my first course on Udacity, I decided to take Intro to Hadoop and MapReduce, a course created in conjunction with Cloudera, a company whose business model is based on the open-source Apache Hadoop. To sum up my assessment, the course was useful, but could have been done much better.

The four-lesson course (short by Udacity standards) is supposed to take about a month to complete. Like all Udacity courses, and unlike those of Coursera, this is not a true MOOC, taken alongside other students in real time, but rather an interactive tutorial. However, Udacity's model does feature student discussion forums; customers who pay (at the rate of $150/month) also get help from live coaches, feedback on their final projects, and the opportunity to earn a "verified certificate", similar to Coursera's Signature Track, with the difference that Udacity, unlike Coursera, no longer offers certificates for non-paying students. (As I've mentioned before, a verified certificate and two dollars may buy you a cup of coffee, but I wouldn't count on its having any greater worth.)

Before I delve into the specifics of this course, let me say that I'm not a real fan of the Udacity interface. While both providers break each lesson up into a series of short videos, Coursera labels each of those videos with a topic, making it relatively easy to go back and find the material you need; by contrast, Udacity strings all the videos for a particular lesson together under a single heading, and so you have to hunt through all of them to find something (you can click on individual videos, and each one has its own label, but you have to click on or hover over a video to see the label). In addition, whenever the video stops for a quiz, it drops out of fullscreen (assuming you're in fullscreen, of course). Moreover, Udacity's discussion forum (note the singular there) has no organization whatsoever, aside from keyword tags—making a search for specific information rather laborious.

The first three lessons of this particular course, which features two instructors from Cloudera, are structured in a manner that the director of a music video would appreciate: many of the videos are very short, and switch jarringly from one instructor to the other. Nonetheless, the instructors are engaging, and there's a nice interview with Doug Cutting about how he helped to create Hadoop, and named it after his toddler son's stuffed elephant. The first two lessons, which explain the basics of how Hadoop and HDFS work, can best be described as "lite": unchallenging nearly to the point of tedium.

Lesson 3 marks an abrupt change: this is where the programming exercises begin. The class requires previous experience with Python, which I lacked, and so the exercises took more time for me than they should have, but I managed. One student in the forum questioned whether this was a course on Hadoop, or a course on Python regular expressions, but doing the exercises helped me learn some Python, and, much as I hate the language, it does have a very powerful vocabulary of regular expressions. Unfortunately, the instructor blew by the concept of Hadoop streaming so fast (in Lesson 2) that I wasn't entirely sure for a while what exactly I was doing, though I was managing to get it to work. Once I looked up Hadoop streaming on my own (it is, for the record, an API that allows Hadoop mappers and reducers to be written in any language), I realized that the interface would work just as well for R.
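For readers who haven't seen Hadoop streaming, here is a minimal sketch of the idea in Python. It is not the course's code: the word-count task and the function names `mapper` and `reducer` are my own illustration. The streaming contract itself is simple: a mapper reads lines of text and writes tab-separated key/value lines, Hadoop sorts those lines by key, and a reducer reads the sorted lines and aggregates each key. The sketch imitates the sort step locally with `sorted()` so it runs without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word (illustrative word count)."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def simulate(lines):
    """Local stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


for record in simulate(["the elephant", "The cluster"]):
    print(record)
```

In a real streaming job the mapper and reducer would be two separate scripts, each looping over standard input and printing to standard output. Because that contract is just lines of text, the same thing can be written in R or any other language.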

Although the simpler exercises use an online Python compiler, for the exercises that require large datasets, the course's creators deserve kudos for having students install a virtual UNIX box on which a virtual two-machine Hadoop cluster has already been set up, and then manipulate data and write code in this realistic environment. Unfortunately, the exercises that require this virtual machine seem half-baked.

First off, the instructors haven't actually detailed how to write and execute Python scripts on the UNIX machine (the class discussion forum was very helpful here). Second, the syntax needed to make the scripts work is different from the syntax presented in the video lectures (though, fortunately, there are working sample scripts saved on the virtual machine). Third, and most seriously, one particularly tricky exercise requires knowledge that students could not possibly get from the instructions, or, in all probability, the data itself, but could only get from the hints that emerge from a trial-and-error process of submitting answers to the automated grader—it was an interesting little mystery to solve, but there are no automated graders in real life, and so I'm not sure what I gained from the effort.

Yes, figuring out ambiguous instructions does have some pedagogical value, and in the end, completing the exercises was very satisfying. But, especially in the case of the problem that was insoluble without the automated grader, I got the feeling that the difficulties I faced were the result, not of a pedagogical choice, but of a simple lack of effort on the part of the instructors, and I felt like I had wasted part of my time.

According to posts in the forum, Lesson 4 was not part of the original class, though I'm not sure if it was planned all along, or tacked on later. To paraphrase Monty Python and the Holy Grail, the course was completed in an entirely different style at great expense and at the last minute. The lectures feature a different instructor, a Udacity employee, in place of the Cloudera instructors. This lesson covers design patterns, specifically filtering patterns (more regular expressions), summarization patterns (minimums, maximums, and means, for example), and structural patterns (combining data sets); one lecture also deals with combiners, scripts inserted between mappers and reducers to make things more efficient by doing some of the reduction locally on each machine in the cluster.
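To make the combiner idea concrete, here is a small sketch of a mean computed with local pre-aggregation. The data and the function names `combine` and `reduce_means` are hypothetical, not taken from the course. The point it illustrates is that a combiner for a mean should pass along partial sums and counts rather than local means, since averaging the local means would give the wrong answer whenever machines hold different numbers of records.

```python
from collections import defaultdict


def combine(records):
    """Local step on one machine: (key, value) pairs -> (key, (sum, count))."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in records:
        partial[key][0] += value
        partial[key][1] += 1
    return [(key, (s, n)) for key, (s, n) in sorted(partial.items())]


def reduce_means(partials):
    """Final step: merge the partial sums and counts, then divide once."""
    total = defaultdict(lambda: [0.0, 0])
    for key, (s, n) in partials:
        total[key][0] += s
        total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


# Two hypothetical machines, each holding part of the data.
machine_1 = [("x", 1.0), ("x", 3.0), ("y", 10.0)]
machine_2 = [("x", 8.0)]

means = reduce_means(combine(machine_1) + combine(machine_2))
print(means)
```

Here the reducer receives three small records instead of four raw ones; on a real cluster the saving comes from shipping far fewer records across the network.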

I found these lectures better than the previous ones, and the exercises better prepared. I will say, though, that I eventually got bored with writing new and different regular expressions in Python, and didn't finish the last few exercises (or the final project, which isn't graded for non-paying students in any case), though I did watch all of the lectures.

In the end, this half-baked pastiche of a course at least gave me a decent idea of how Hadoop works, and removed the mystique of manipulating data stored on a Hadoop cluster. I wouldn't know how to set up a cluster myself (that wasn't the intent of the class, though I don't think it would be all that hard to do), but I do know how to use Hadoop streaming—and I've realized it's not exactly rocket science.