At some point in the last few years, I learned a new term: “training data.” Training data is what’s fed into machine learning algorithms to make them more accurate — to help them learn. Put another way, if you show an image recognition algorithm one million pictures that you know contain cats, the algorithm will be able to more accurately identify, well, cats.
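To make that idea a little more concrete, here is a minimal sketch in Python. It is a toy, not how any real image recognition system works: each "picture" is reduced to a single made-up number, the labels are generated rather than collected, and the "algorithm" is a simple nearest-average classifier. Every name in it (`make_examples`, `train`, `predict`, `accuracy`) is hypothetical. All it is meant to show is that the same algorithm tends to do better when it is handed more labeled examples.

```python
import random

def make_examples(n, rng):
    """Return n labeled examples as (feature, label) pairs, alternating labels.

    Assumption: each "picture" is boiled down to one made-up number.
    Use n >= 2 so that both labels appear at least once.
    """
    examples = []
    for i in range(n):
        label = "cat" if i % 2 == 0 else "not cat"
        center = 1.0 if label == "cat" else -1.0
        examples.append((rng.gauss(center, 1.0), label))
    return examples

def train(examples):
    """Learn the average feature value for each label."""
    sums, counts = {}, {}
    for feature, label in examples:
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(averages, feature):
    """Pick the label whose learned average is closest to the feature."""
    return min(averages, key=lambda label: abs(averages[label] - feature))

def accuracy(averages, examples):
    """Fraction of examples whose predicted label matches the known label."""
    correct = sum(1 for feature, label in examples
                  if predict(averages, feature) == label)
    return correct / len(examples)

rng = random.Random(0)              # fixed seed so runs are repeatable
held_out = make_examples(2000, rng)  # examples the model never trains on

# Train on progressively more labeled data and report accuracy each time.
for size in (2, 20, 200, 2000):
    model = train(make_examples(size, rng))
    print(size, round(accuracy(model, held_out), 3))
```

With only a couple of examples the learned averages are noisy, so the results can swing widely from run to run; with thousands they settle down. Note that every one of those examples needs a label that somebody, or something, already knows to be correct.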
Dave pointed me toward a really great anecdote from a making-of video for Spider-Man: Into the Spider-Verse, where some of the lead animators talked about using an algorithm to automatically generate their linework.
Rather than drawing by hand all the furrows, dimples, or strokes that might outline a nose or an ear, the effects team used an algorithm to produce many of those lines programmatically. And it improved over time, as it referenced more sample images: more training data.
The thing is, training data has to come from somewhere. In the case of the Spider-Verse algorithm, it was trained on drawings created by the lead character designer. But as I understand it, most training data is manually labeled. That is, a person will sift through massive amounts of data (whether it’s images, video, audio, or text) and then tag or label that data, so that an algorithm can understand it. This work, and the workers who do it, have powered the advances we’ve seen in machine learning in recent years. I’d bet good money that real, live people tagged large swathes of that “corpus of anonymized phone conversation data” that trained Google’s “Duplex” experiment.
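For a rough sense of what one piece of manually labeled data might look like, here is a small Python sketch. It assumes a common style of image labeling, where a person outlines rectangular regions in a picture and names what is inside each one. The class names (`Region`, `LabeledImage`), the file name, and the labels are all hypothetical, and real labeling tools will store this differently.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class Region:
    # Hypothetical rectangle, in pixels, drawn by a human labeler.
    x: int
    y: int
    width: int
    height: int
    label: str  # what the labeler says is inside the rectangle

@dataclass
class LabeledImage:
    filename: str  # hypothetical file name
    regions: list = field(default_factory=list)

    def add_region(self, x, y, width, height, label):
        """Record one outlined region and the name a person gave it."""
        if width <= 0 or height <= 0:
            raise ValueError("a region needs a positive width and height")
        self.regions.append(Region(x, y, width, height, label))

    def label_counts(self):
        """Count how many regions carry each label."""
        counts = {}
        for region in self.regions:
            counts[region.label] = counts.get(region.label, 0) + 1
        return counts

# One image's worth of work: three regions, each outlined and named by hand.
image = LabeledImage("street-0001.jpg")
image.add_region(40, 60, 120, 80, "car")
image.add_region(200, 50, 30, 90, "pedestrian")
image.add_region(260, 55, 28, 88, "pedestrian")

print(asdict(image))         # the record an algorithm would be trained on
print(image.label_counts())
```

Each `add_region` call stands in for a decision a person made while looking at a screen. Multiply that by however many images a data set needs.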
Last year, the BBC produced a feature on Brenda Monyangi, a single mother who lives in Kibera, an urban slum in Nairobi. Brenda commutes two hours by bus to her job, where she and her fellow employees produce training data. During her eight-hour shift, she and her fellow “trainers” will view many, many images on their computer screens. Each of them will highlight areas of the image with their mouse, and then label each region. The data produced from Brenda’s work, and the work of her coworkers, will fuel the artificial intelligence research at companies like Microsoft, IBM, Facebook, and Google.
Employees at Brenda’s company make around $9 a day. The most accurate “trainers” will receive a shopping voucher.
This is exploitation, of course — an especially old strain of it. And what’s more, imagine how precarious Brenda’s job is. As image recognition technology improves, will her job still exist in a few years’ time? It’s quite possible that the data work she’s doing is being fed into automated solutions that will eventually replace her.
As I read stories like Brenda’s, and about new developments in using (I can’t believe I’m typing this) prison labor to produce training data, this is the thing I keep coming back to: our industry’s excelled at creating new classes of work, and then deciding those workers are effectively invisible. And then we often decide that work, those workers, matter less than the automated solutions they’ve helped create — and perhaps, in time, we decide they’re ideal candidates for automation themselves.
And here’s the other thing I come back to: who’s next?