To investigate the impact of Artificial Intelligence on the media landscape, ACED has initiated the Reverb Channel programme, a transdisciplinary programme in which artists, data scientists and media professionals carry out investigative experiments and share the results to spark debate within the field and among the general public. In collaboration with Dr. Peter van der Putten and Timo Kats of the LIACS institute of Leiden University, we investigated whether machines can help you distinguish editorial content from advertorial content. It turns out that they can. Timo Kats reports on his investigation.

Advertising is an important part of news in every medium. Whether you open your morning paper or visit an online news site, you will very likely see advertisements all over the place trying to sell you goods and services. But what about the advertisements that you don’t see, yet still read?

Those advertisements are called advertorials, and even though you might not have heard of them, you’ve probably read one by now. Advertorials are advertisements written and formatted as news articles, often marked only by a small disclaimer. So although they might look like editorial articles, they’re still written with the intent of selling you something instead of telling you something. Are you sold yet on the idea that this may be an issue?

Probably yes. Studies have shown that only 8% of readers recognize advertorials as sponsored content. This makes them a lot more effective than regular banner ads, which used to dominate the ad space in the news. As a result, large Dutch news organizations like Mediahuis and DPG Media have adopted this type of advertising in their products.

Needless to say, the more serious media outlets such as major newspapers will make more of an effort to let you know certain content is sponsored, but still overestimate the ability of readers to understand the difference. And with the proliferation of digital and online channels, it becomes harder and harder to make the distinction as a reader.

So why exactly could this be a problem? Because in the news there’s supposed to be a clear line between commercial and editorial content. In fact, this is commonly referred to as the news industry’s separation of church and state. Advertorials may cross that line by being commercially driven yet editorially written and formatted, making them ambiguous to readers, despite the small print disclaimers.

Note that we’re not against advertorials in principle (we understand that paper and digital media need to make a living), but it would be good for everyone if readers knew whether they’re reading something that’s sponsored or not. Therefore we aim to redraw that line between commercial and editorial content with a machine learning model. Moreover, we will also use this model to create a lexicon: a dictionary that, in our case, contains words and scores that show how commercial or editorial individual words are according to our model. This allows us to publish our results in a more generally applicable and understandable way, which we did here.

The first step in creating the model and lexicon is to acquire the necessary data. The Reverb Channel corpus contains millions of links to articles, but unfortunately not to advertorials. Hence we collected the data ourselves using a web crawler. In total, this web crawler collected 2,000 entries (of which 1,000 are advertorials) from 4 different news sources, including NRC, de Telegraaf and de Ondernemer.

Next, we used this data to train and test the machine learning model. Because we want this model to differentiate between two categories (advertorial or article), this is a classification problem: the model aims to predict whether an entry is an article or an advertorial based on its text. To find out which model does this best, we experimented with multiple classification algorithms and their settings.

After completing these experiments we found that a support vector machine (with optimized parameters) gave the highest test accuracy for this classification problem, namely 90.29%. This means that if we present our model the text from a random entry in our data set, it correctly predicts the corresponding category (i.e. article or advertorial) 90.29% of the time!
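To make the idea of a text classifier and its test accuracy concrete, here is a minimal sketch of a linear support vector machine trained on word counts. It is not the model from our research: the example sentences, the function names and the settings (epochs, lr, reg) are all invented for illustration, and the sketch leaves out the bias term and the parameter optimization. In practice you would use an established machine learning library rather than hand-written training code.

```python
def featurize(text):
    """Bag-of-words counts for a lowercased, whitespace-tokenised text."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def score(weights, text):
    """Weighted word count: positive leans advertorial, negative leans article."""
    return sum(weights.get(w, 0.0) * c for w, c in featurize(text).items())


def train_linear_svm(samples, epochs=20, lr=0.1, reg=0.01):
    """Linear SVM without bias term, trained by sub-gradient descent on the
    L2-regularised hinge loss. Labels: +1 = advertorial, -1 = article.
    The default settings are assumptions, not tuned values."""
    weights = {}
    for _ in range(epochs):
        for text, label in samples:
            margin = label * score(weights, text)
            # Regularisation step: shrink every weight a little.
            for word in weights:
                weights[word] *= 1.0 - lr * reg
            # Hinge-loss step: only examples inside the margin move the weights.
            if margin < 1:
                for word, count in featurize(text).items():
                    weights[word] = weights.get(word, 0.0) + lr * label * count
    return weights


def predict(weights, text):
    """Return +1 (advertorial) for a positive score, otherwise -1 (article)."""
    return 1 if score(weights, text) > 0 else -1


def accuracy(weights, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(weights, text) == label for text, label in samples)
    return correct / len(samples)


# Toy data, invented for illustration only.
train = [
    ("buy now exclusive offer discount", 1),
    ("order today and save with this offer", 1),
    ("minister announces new budget plan", -1),
    ("police investigate fire in city centre", -1),
]
test = [
    ("exclusive discount offer today", 1),
    ("minister visits city after fire", -1),
]

weights = train_linear_svm(train)
test_accuracy = accuracy(weights, test)
print(f"test accuracy: {test_accuracy:.2%}")
```

The important point is that accuracy is measured on entries the model has not seen during training, which is also how a figure like ours should be read.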

However, this model, albeit successful, can’t easily be explored or published. That’s why we continued our research by building a lexicon based on the features of this model (i.e. words along with scores that represent how commercial or editorial they are according to our model). As a result, this lexicon can be used to find out which words are more likely to be used in a commercial or an editorial context.
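As a rough illustration of what such a lexicon could look like, the sketch below turns per-word model weights into scores between -1 (editorial) and 1 (commercial). It assumes the model exposes one weight per word, as a linear model does. The function name, the scaling and the example weights are invented for illustration and are not taken from our actual lexicon.

```python
def build_lexicon(weights):
    """Scale raw per-word weights to the range [-1, 1] and sort them from
    most commercial (positive) to most editorial (negative)."""
    largest = max(abs(value) for value in weights.values()) or 1.0
    scored = {word: value / largest for word, value in weights.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights, invented for illustration only.
example_weights = {"discount": 2.0, "offer": 1.0, "minister": -1.5, "police": -0.5}

lexicon = build_lexicon(example_weights)
for word, value in lexicon:
    print(f"{word:10s} {value:+.2f}")
```

Scaling by the largest absolute weight keeps the sign of every word intact, so the direction (commercial or editorial) is preserved while the scores become easier to compare.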

Finally, we explored our results by building a graph. This graph (which we published here) visualizes the words from our lexicon as nodes (blue nodes are editorial, red nodes are commercial) and the connections between these words as edges. We computed these edges based on how often two words appear in the same sentence. As a result, this graph allows us to find clusters and patterns that give better insight into the differences and overlap between commercial and editorial language as a whole.
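Counting how often two lexicon words share a sentence can be sketched as follows. The sentence splitting on punctuation, the function name and the example text are simplifying assumptions made for illustration, not a description of our exact preprocessing.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(text, lexicon_words):
    """Count, for every pair of lexicon words, in how many sentences both occur.
    Returns a Counter keyed by alphabetically ordered word pairs."""
    edges = Counter()
    # Naive sentence split on ., ! and ? (an assumption of this sketch).
    for sentence in re.split(r"[.!?]+", text.lower()):
        present = sorted(set(re.findall(r"\w+", sentence)) & lexicon_words)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


# Example text and word list, invented for illustration only.
sample_text = (
    "Buy the offer with a discount. "
    "The minister praised the offer. "
    "Discount offer today!"
)
edges = cooccurrence_edges(sample_text, {"offer", "discount", "minister"})
for (first, second), weight in edges.most_common():
    print(first, second, weight)
```

Each counted pair becomes an edge in the graph, and the count can serve as the edge weight, so words that often appear together end up closely connected.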

In conclusion, the rise of advertorials requires a new line to be drawn between commercial and editorial content. As our research shows, machine learning can play a positive role in doing that, but there are still limitations that need to be taken into account. What you can do as a news consumer is be aware that advertorials exist and familiarise yourself with them. There’s nothing wrong with advertising or advertorials when you know what you’re reading.

So, are you sold yet?

For a full write-up of Timo’s research, have a look at the paper that was published at the 33rd Benelux Conference on Artificial Intelligence and the 30th Belgian Dutch Conference on Machine Learning (BNAIC/BENELEARN 2021), Luxembourg, November 10-12, 2021.