What happens next will amaze you

57-year-old engineer shares the secret to social media!

I generally don't click "Like" on Facebook posts. I've always felt that it was bad enough that Facebook gets to know who my friends are. I haven't had a well-thought-out reason for not clicking "Like," really, just a sort of disdain for how creepy it felt that Facebook had that data.

I didn't know about the research here.

These researchers worked out, from a data set of 100-question psychological questionnaires answered by 80K+ volunteers, that predictions based on Facebook Likes are more accurate than predictions based on detailed psychological self-description. The researchers' model can accurately predict what you will like, based on what your friends like. It can accurately predict how you will describe your friends. It can accurately predict substance abuse.

What can you do if you have this information about someone? Well, take a look at this, from an abstract of an APS Psychological Science journal article from 2012: "For a single product, we constructed five advertisements, each designed to target one of the five major trait domains of human personality. In a sample of 324 survey respondents, advertisements were evaluated more positively the more they cohered with participants’ dispositional motives."

If you combine a scalable predictive model of individuals' personality types, the technique of crafting framing narratives that conform to those personality types, the ability to target advertising to specific individuals, and a method for capturing Facebook profiles at scale, you get...Cambridge Analytica.

We're going to need to have a conversation about Cambridge Analytica, and Facebook, and user data.

Try this one weird trick to steal the US presidential election!

The outlines of the Cambridge Analytica story are simple enough: CA sold its services to Steve Bannon during the Trump campaign. It created a shitty Facebook app that let people take a personality test; all that you needed to do to use this free app was give it access to your social graph, your likes and your friends' likes. CA used the app to harvest this data from many tens of millions of American Facebook users.

Using the techniques described in the PNAS article linked above, CA built a model that could sort people into psychological categories. They then crafted ads and memes and social-media posts and used Facebook to target specific subsets of users with specific types of ads, using the techniques in the APS Journal article I mentioned. THEN - and this shows that these guys are data scientists and not just run-of-the-mill trolls - they re-collected the user data and measured the effectiveness of the ads in changing people's perceptions and behavior.

Without access to the same data that CA used, it's really hard to know how effective they actually were. One problem is that pretty much everyone involved in the CA story - the company, the whistle-blowers, and the Guardian itself - has an incentive to make CA appear as effective as possible. Even Facebook has an incentive to not say "no, this really wasn't effective," since one of the things they're selling is the utility of the user data that they've collected.

But if you assume that CA was even minimally effective, you have to start thinking about what purposes Facebook is actually serving. Even if you assume that Facebook's intentions are entirely benign, it's clear that their lackadaisical approach towards protection of user data can be phenomenally harmful. And Facebook has no real incentive to change their behavior significantly. Their entire business model is based on monetizing the data set that the PNAS article describes.

We got access to your Facebook profiles.  You won't believe what happened next!

There are a couple of other things that you need to understand before you get too deeply into the Cambridge Analytica story.

The work CA did is bog-standard data science, based on publicly-available psychological research. The exploit that they used to get access to Facebook user data hardly warrants being called an exploit: Pretty much anyone who writes those "Which celebrity are you most like?" apps that you all like to share the results of is getting you to give them access to your likes and friends.

There's absolutely no reason to believe that CA is the only, or even the most effective, agent out there that's doing this kind of thing.

It's also abundantly clear from the whistle-blower's interview that CA is fraudulent. They (kind of hilariously) misrepresented themselves to Steve Bannon to get the work in the first place. Their CEO is very good at pretending to be whoever his customers want him to be.

From my (somewhat experienced) software engineering perspective, what CA seems to have brought to the table was the ability to execute very quickly. The way that you execute very quickly in data science is by producing inconclusive results and then cherry-picking data to make your work look as good as possible.

So I'd caution against thinking of CA as the bogeyman here. They're a bogeyman. There are almost certainly other entities doing what CA did. Those entities are probably at least as good at it as CA is.

Some of those entities are probably state-sponsored, in a much more direct way than CA. CA was funded by the Trump campaign, who got money from the RNC, who got money from the NRA, who got money from Russia. There are a lot of cut-outs in that chain. We know that Russia directly manipulated social media in Ukraine in the run-up to the 2014 revolution and annexation of Crimea; they were almost certainly employing the same kinds of techniques that CA did in 2016.

It's hard to imagine that China would just sit by idly and watch as Russia manipulated social media, especially since China is pouring billions into the kind of AI and social-media research that yields these kinds of results.

10 ways that social media uses the network effect to insert itself into your life.  #4 is hilarious!

I also want to point out that while Facebook is obviously a bad actor here, the paradox that we're facing is much more extensive.

Social media is useful. Any application that lets you interact with acquaintances while you are using it becomes more useful. The ability to incorporate feedback from people you know into your use of a tool - pretty much irrespective of what that tool does - is very powerful.

Building applications is expensive. It takes a lot of skill and time to make something that people want to use. If you're building an application, you need to recoup the cost of building it. The most obvious way to do that is to make the application available to as many people as possible, and to charge as little as you can for it, or to support yourself through advertising.

So the universe of computer applications inclines towards social apps that run at scale.

Facebook is an extremely successful example of this, but there are many, many others: big hitters like Snapchat and Tinder, weird things like Foursquare, not-obviously-social-media apps like Google Docs, and so on.

If you want to sell advertising to support your app, you want the advertising to be high value. And hey, look how you have this repository of user-preference data that you can use to target ads more effectively. In fact, you could start collecting more fine-grained user preference data. It wouldn't make your app more useful, but it'd make it easier to make money off of the app, and then you'll be able to afford to add new features for your users, right?

This is the path that Facebook went down, with great success. It's the path that just about anyone building anything with a social component is trying to go down: You want your app to have a lot of users, you want to know what your users like, and you want to be able to sell targeted ads to those users.

This set of incentives encourages everyone who's trying to make a buck in the world of apps to create platforms that, if they're sufficiently successful, can be exploited by bad actors.

And we encourage this, of course, because we all want useful tools, and our tools are more useful and more fun if our friends are using them too. Our instincts as social beings are leading us to congregate at scale, and the technologies we're using are allowing bad actors to fragment and target us.

I think the most important thing about the CA story is that it gives us a tangible, real example of the harm that social media - social features in applications - makes possible. This isn't old-fartist "millennials stare at their phones all day" bullshit harm.  This is real harm, with real effects.

We need to have a real conversation about what's going on here, and figure out what we can do about it.  How we're going to cope with it.

This intense paragraph will change even the most skeptical non-believer.

There's another aspect of the CA story that we need to think about.

So, the thing about the Facebook data set is that all of the user-profile data has your name and location in it. This is a pretty strong signifier of personal identity. There aren't a lot of Robert Rossneys in the nation. There's probably only one in California. Any data set that has my personal identity in it can be combined with any other data set that has my personal identity in it. So my Facebook profile may not have my political party in it, but the voter registration rolls for my state do, and by joining those two data sets you now know much more about me.
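
If you want to see how little work that join takes, here's a minimal sketch in Python. Everything in it is made up for illustration: the records, the field names, and the helper functions `identity_key` and `join_on_identity` are my own assumptions, not anything CA or Facebook is known to have used. It just shows that once two data sets share a name and a location, combining them is a dictionary lookup.

```python
# Hypothetical records: every name, field, and value below is invented
# purely to illustrate joining two data sets on a shared identity.
social_profiles = [
    {"name": "Alex Example", "state": "CA", "likes": ["hiking", "jazz"]},
    {"name": "Sam Sample", "state": "NY", "likes": ["chess"]},
]
voter_rolls = [
    {"name": "alex example ", "state": "ca", "party": "Independent"},
    {"name": "Pat Placeholder", "state": "TX", "party": "Green"},
]


def identity_key(record):
    """Build a join key from name and state, ignoring case and stray spaces."""
    return (record["name"].strip().lower(), record["state"].strip().upper())


def join_on_identity(left, right):
    """Return left records enriched with fields from matching right records."""
    # Index the right-hand data set by identity so each lookup is O(1).
    index = {identity_key(r): r for r in right}
    merged = []
    for rec in left:
        match = index.get(identity_key(rec))
        if match is not None:
            # Fields from the left record win if both sides share a field name.
            merged.append({**match, **rec})
    return merged


combined = join_on_identity(social_profiles, voter_rolls)
for person in combined:
    print(person["name"], person["party"], person["likes"])
```

Real record linkage is messier than this (nicknames, typos, people who move), but the messiness is a nuisance for whoever is doing the joining, not a protection for you.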

The thing is, there are data sets with your personal identity in them everywhere. Remember the Equifax leak? If you were in that, CA has that information too.  So do its smarter, creepier competitors. (I'm just going to go out on a limb and guess that Palantir has been doing this sort of thing for years. Their motto: "Palantir builds software that connects data, technologies, humans and environments." Not explicitly specified: "By their personal identifiers.")

In the US, we are - to use the phrase I used about FB in general - phenomenally lackadaisical about our personally-identifiable information. Such data-protection laws as we have in this country are toothless. (Compare with Europe's General Data Protection Regulation, which calls for companies to be fined up to 4% of annual revenue for egregious breaches.) We hand out our Social Security numbers to collectors of data left and right.

Any asshole can start a mail-order business, fail to encrypt the data at rest, and bam, now hundreds of thousands of shopping preferences are out in the world for bad actors to use. Aggregating personally-identifiable information at scale is a serious risk.

The consequences of any breach are much larger than we think they are.  And there are breaches all the time.


  1. I thought this episode of The Weeds from Vox was pretty good:
    It gives a lot of background on CA and why one should be skeptical about how magical their abilities specifically were, given that everyone who dealt with them before Trump thought they were kind of useless. On the other hand, the spotlight on Facebook's data sharing is a good side-effect of the attention to CA.

    It may be, though, that having access to this graph/profile data let CA do something like Facebook's "Lookalike Audience" targeting at much lower cost, or do other correlations Facebook doesn't specifically offer. As they note on the podcast, it should also be pointed out that Trump's campaign wasn't exactly about subtle and discerning messaging. Though very specifically targeted and toxic messages that went way under the radar might have been - barely concealed antisemitism for closet Nazis, highlighting Hillary's "superpredator" remarks for young black men in swing states, etc.

