Automating power: Social bot interference in global politics
Article in First Monday 21(4), March 2016
DOI: 10.5210/fm.v21i4.6161
Samuel C. Woolley
Abstract
Over the last several years political actors worldwide have begun harnessing the digital power of social bots - software programs designed to mimic human social media users on platforms like Facebook, Twitter, and Reddit. Increasingly, politicians, militaries, and government-contracted firms use these automated actors in online attempts to manipulate public opinion and disrupt organizational communication. Politicized social bots - here 'political bots' - are used to massively boost politicians' follower levels on social media sites in attempts to generate false impressions of popularity. They are programmed to actively and automatically flood news streams with spam during political crises, elections, and conflicts in order to interrupt the efforts of activists and political dissidents who publicize and organize online. They are used by regimes to send out sophisticated computational propaganda. This paper conducts a content analysis of available media articles on political bots in order to build an event dataset of global political bot deployment that codes for usage, capability, and history. This information is then analyzed, generating a global outline of this phenomenon. This outline seeks to explain the variety of political bot-oriented strategies and presents details crucial to building understandings of these automated software actors in the humanities, social and computer sciences.
How Political Campaigns Weaponize Social Media Bots
By Philip N. Howard
Posted 18 Oct 2018 | 15:00 GMT
In the summer of 2017, a group of young political activists in the United Kingdom figured out how to use the popular dating app Tinder to attract new supporters. They understood how Tinder’s social networking platform worked, how its users tended to use the app, and how its algorithms distributed content, and so they built a bot to automate flirty exchanges with real people. Over time, those flirty conversations would turn to politics—and to the strengths of the U.K.’s Labour Party.
To send its messages, the bot would take over a Tinder profile owned by a Labour-friendly user who’d agreed to the temporary repurposing of his or her account. Eventually, the bot sent somewhere between 30,000 and 40,000 messages, targeting 18- to 25-year-olds in constituencies where the Labour candidates were running in tight races. It’s impossible to know precisely how many votes are won through social media campaigns, but in several targeted districts, the Labour Party did prevail by just a few votes. In celebrating their victory, campaigners took to Twitter to thank their team—with a special nod to the Tinder election bot.
How a Political Social Media Bot Works
Illustration: Jude Buffum
1. The bot automatically sets up an account on a social media platform.
2. The bot’s account may appear to be that of an actual person, with personal details and even family photos.
3. The bot crawls through content on the site, scanning for posts and comments of interest.
4. The bot posts its own content to engage other human users.
5. Networks of bots act in concert to promote a candidate or message, to muddy political debate, or to disrupt support for an opponent.
By now, it’s no surprise that social media is one of the most widely used applications online. Close to 70 percent of U.S. adults are on Facebook, with three-quarters of that group using it at least once a day. To be sure, most of the time people aren’t using Facebook, Instagram, and other apps for politics but for self-expression, sharing content, and finding articles and video.
But with social media so deeply embedded in people’s lives and so unregulated, trusted, and targetable, these platforms weren’t going to be ignored by political operators for long. And there is mounting evidence that social media is being used to manipulate and deceive voters and thus to degrade public life.
To be sure, the technology doesn’t always have this effect. It’s difficult to tell the story of the Arab Spring [PDF] without acknowledging how social media platforms allowed democracy advocates to coordinate themselves in surprising new ways, and to send their inspiring calls for political change cascading across North Africa and the Middle East.
But the highly automated nature of news feeds also makes it easy for political actors to manipulate those social networks. Studies done by my group at the Oxford Internet Institute’s Computational Propaganda Research Project have found, for example, that about half of Twitter conversations originating in Russia [PDF] involve highly automated accounts. Such accounts push out vast amounts of political content, and many are so well programmed that the targets never realize that they’re chatting with a piece of software.
We’ve also discovered that professional trolls and bots have been aggressively used in Brazil [PDF] during two presidential campaigns, one presidential impeachment campaign, and the mayoral race in Rio. We’ve seen that political leaders in many young democracies are actively using automation to spread misinformation and junk news.
And in the United States, we have found evidence that active-duty military personnel have been targeted [PDF] with misinformation on national security issues and that the dissemination of junk news was concentrated in swing states [PDF] during the U.S. presidential election in 2016.
The earliest reports of organized social media manipulation emerged in 2010. So in less than a decade, social media has become an ever-evolving tool for social control, exploited by crafty political operatives and unapologetic autocrats. Can democracy survive such sophisticated propaganda?
The 2016 U.S. presidential election was a watershed moment in the evolution of computational techniques for spreading political propaganda via social networks. Initially, the operators of the platforms failed to appreciate what they were up against. When Facebook was first asked how the Russian government may have contributed to the Trump campaign, the company dismissed such foreign interference as negligible. Some months later, Facebook recharacterized the influence as minimal, with only 3,000 ads costing US $100,000 linked to some 470 accounts.
Tactics of Political Social Media Bots
Zombie Electioneering: Gives the appearance of broad support for an issue or candidate through automated commenting, scripted dialogues, and other means
Finally, in late October 2017, nearly a year after the election, Facebook revealed that Russia’s propaganda machine had actually reached 126 million Facebook users with its ad campaign. What’s more, the Internet Research Agency, a shadowy Russian company linked to the Kremlin, posted roughly 80,000 pieces of divisive content on Facebook, which reached about 29 million U.S. users between January 2015 and August 2017.
Facebook was not the only social media platform affected. Foreign agents published more than 131,000 tweets from 2,700 Twitter accounts [PDF] and uploaded over 1,100 videos [PDF] to Google’s YouTube.
What propagandists love about social media is a network structure that’s ripe for abuse. Each platform’s distributed system of users operates largely without editors. There is nobody to control the production and circulation of content, to maintain quality, or to check the facts.
The propagandists can fool a few key people, and then stand back and let them do most of the work. The Facebook posts from the Internet Research Agency, for instance, were liked, shared, and followed by authentic users, which allowed the posts to organically spread to tens of millions of others.
Facebook eventually shut down the accounts where the Internet Research Agency posts originated, along with more than 170 suspicious accounts on its photo-sharing app, Instagram. Each of these accounts was designed to look like that of a real social media user, a real neighbor, or a real voter, and engineered to distribute disinformation and divisive messages to unsuspecting users’ news feeds. The Facebook algorithm aids this process by identifying popular posts—those that have been widely liked, shared, and followed—and helping them to go viral by placing them in the news feeds of more people.
AstroTurf Campaign: Makes an electoral or legislative campaign appear to be a grassroots effort
As research by our group and others has revealed, computational propaganda takes many forms: networks of highly automated Twitter accounts; fake users on Facebook, YouTube, and Instagram; chatbots on Tinder, Snapchat, and Reddit. Often the people running these campaigns find ways to game the algorithms that the social media platforms use to distribute news.
Doing so usually means breaking terms-of-service agreements, violating community norms, and otherwise using the platforms in ways that their designers didn’t intend. It may also mean running afoul of election guidelines, privacy regulations, or consumer protection rules. But it happens anyway.
Images: Oxford Internet Institute
Election Botnets: During the November 2016 U.S. election, the largest Trump Twitter botnet [bottom] consisted of 944 bots, compared with 264 bots in the largest pro-Clinton botnet [top]. What’s more, the Trump botnet was more centralized and interconnected, suggesting a higher degree of strategic organization.
Another common tactic is to simply pay for advertising and take advantage of the extensive marketing services that social media companies offer their advertisers. These services let buyers precisely target their audience according to thousands of different parameters—not just basic information, such as location, age, and gender, but also more nuanced attributes, including political beliefs, relationship status, finances, purchasing history, and the like. Facebook recently removed more than 5,000 of these categories to discourage discriminatory job ads—which gives you an idea of how many categories there are in total.
One of the chief ways to track political social media manipulation is to look at the hashtags that both human users and bots use to tag their messages and posts. The main hashtags will reference candidates’ names, party affiliations, and the big campaign issues and themes—#TrumpPence, #LivingWage, #Hillary2016, and so on. An obvious shortcoming of this approach is that we don’t know in advance which hashtags will prove most popular, and so we may miss political conversations that either have hashtags that emerged later in the campaign or that don’t carry any hashtag.
Nonetheless, we can use the hashtags that we do know to identify networks of highly automated accounts. Twitter data is for the most part public, so we can periodically access it directly through the company’s application programming interface (API), the interface through which developers and other customers query Twitter’s servers for data. For a 10-day period starting on 1 November 2016, we collected about 17 million tweets from 1,798,127 users. We also sampled Twitter data during each of the three presidential debates.
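The hashtag-tallying step of such a collection can be sketched in a few lines of Python. The tweet texts below are invented examples (using hashtags mentioned in this article), and a real pipeline would stream millions of tweets from the Twitter API rather than iterate over a hard-coded list.

```python
import re
from collections import Counter

# '#' followed by word characters; tags are normalized to lowercase
# so that #Hillary2016 and #hillary2016 are tallied together.
HASHTAG_RE = re.compile(r"#\w+")

def count_hashtags(tweets):
    """Tally every hashtag appearing in an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

# Invented example tweets, not data from the study.
tweets = [
    "Get out and vote! #Hillary2016 #LivingWage",
    "Huge rally tonight #TrumpPence",
    "Debate reactions rolling in #TrumpPence #hillary2016",
]
print(count_hashtags(tweets))
```

Tweets carrying no known hashtag simply contribute nothing to the tally, which is exactly the blind spot described above.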
Hashtag Hijacking: Appropriates an opponent’s hashtag to distribute spam or otherwise undermine support
Sifting through the data, we saw patterns in who was liking and retweeting posts, which candidates were getting the most social media traffic, how much of that traffic came from highly automated accounts, and what sources of political news and information were being used. We constructed a retweeting network that included only connections where a human user retweeted a bot. This network consisted of 15,904 humans and 695 bots. The average human user in this network shared information from a bot five times.
We then focused on accounts that were behaving badly. Bots aren’t sinister in themselves, of course. They’re just bits of software used to automate and scale up repetitive processes, such as following, linking, replying, and tagging on social media. But they can nevertheless affect public discourse by pushing content from extremist, conspiratorial, or sensationalist sources, or by pumping out thousands of pro-candidate or anti-opponent tweets a day. These automated actions can give the false impression of a groundswell of support, muddy public debate, or overwhelm the opponent’s own messages.
We have found that accounts tweeting more than 50 times a day using a political hashtag are almost invariably bots or accounts that mix automated techniques with occasional human curation. Very few humans—even journalists and politicians—can consistently generate dozens of fresh political tweets each day for days on end.
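The 50-tweets-per-day threshold described above can be applied as a simple filter. This is a minimal sketch: the account names and per-day counts are hypothetical, and the cutoff comes from the heuristic in the text, not from any published code.

```python
# Heuristic from the text: more than 50 political-hashtag tweets
# in a single day suggests automation (full or partial).
POLITICAL_TWEETS_PER_DAY = 50

def flag_high_automation(daily_counts, threshold=POLITICAL_TWEETS_PER_DAY):
    """daily_counts maps account -> list of per-day political-hashtag
    tweet counts. Returns the accounts that exceed the threshold on
    at least one day."""
    return {
        account
        for account, counts in daily_counts.items()
        if any(c > threshold for c in counts)
    }

# Hypothetical observations.
observed = {
    "@news_junkie":    [3, 7, 2, 5],     # plausibly human
    "@maga_stream_01": [180, 212, 195],  # sustained automated output
    "@mixed_account":  [12, 64, 9],      # human-curated with automated bursts
}
print(flag_high_automation(observed))
```

Note that the bursty mixed account is flagged along with the fully automated one, matching the observation that many accounts blend automation with occasional human curation.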
Once we’ve identified individual bots, we can map the bot networks—bots that follow each other and act in concert, often exactly reproducing content coming from one another. In our modeling of Twitter interactions, the individual accounts represented the network’s nodes, and retweets represented the network’s connections.
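A toy version of this network construction, in plain Python with hypothetical account names: each retweet adds a directed edge from the retweeter to the original poster, and a node with very high in-degree is the kind of hub that signals a centralized, coordinated botnet.

```python
from collections import defaultdict

def build_retweet_network(retweets):
    """retweets is a list of (retweeter, original_poster) pairs.
    Returns the adjacency sets and each node's in-degree."""
    out_edges = defaultdict(set)
    in_degree = defaultdict(int)
    for retweeter, poster in retweets:
        if poster not in out_edges[retweeter]:  # count each edge once
            out_edges[retweeter].add(poster)
            in_degree[poster] += 1
    return out_edges, in_degree

# Hypothetical retweet events.
retweets = [
    ("bot_a", "bot_hub"), ("bot_b", "bot_hub"), ("bot_c", "bot_hub"),
    ("bot_b", "bot_a"), ("human_1", "bot_hub"),
]
edges, in_degree = build_retweet_network(retweets)

# The most-retweeted node acts as the hub of this tiny network.
hub = max(in_degree, key=in_degree.get)
print(hub, in_degree[hub])
```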
Gratitude: Campaign workers tweeted about their successful chatbot, which rallied support for the UK Labour Party through automated conversations on the dating platform Tinder.
What did we learn about the 2016 U.S. election? Both of the major presidential candidates attracted networks of automated Twitter accounts that pushed around their content. Our team mapped these botnet structures over time by tracking the retweeting of the most prominent hashtags—Clinton-related and Trump-related as well as politically neutral.
The Trump and Clinton bot networks looked and behaved very differently, as can be seen in the illustration “Election Botnets,” which depicts the largest botnet associated with each campaign. The much larger Trump botnet consisted of 944 bots and was highly centralized and interconnected, suggesting a greater degree of strategic organization and coordination. The Clinton botnet had just 264 bots and was more randomly arranged and diffuse, suggesting more organic growth.
The pro-Trump Twitter botnets were also far more prolific during the three presidential debates. After each debate, highly automated accounts supporting both Clinton and Trump tweeted about their candidate’s victory. But on average, pro-Trump automated accounts released seven tweets for every tweet from a pro-Clinton automated account. The pro-Trump botnets grew more active in the hours leading up to the final debate, some of them declaring Trump the winner even before the debate had started.
Retweet Storm: Simultaneous reposts or retweets of a post by hundreds or thousands of other bots
Another successful strategy for the Trump botnets was strategically colonizing pro-Clinton hashtags by using them in anti-Clinton messages. For the most part, each candidate’s human and bot followers used particular hashtags associated with their candidate. But Trump followers tended to also mix in Clinton hashtags. By Election Day, about a quarter of the pro-Trump Twitter traffic was being generated by highly automated accounts, and about a fifth of those tweets contained both Clinton and Trump hashtags. This resulted in negative messages generated by Trump’s supporters (using such hashtags as #benghazi, #CrookedHillary, #lockherup) being injected into the stream of positive messages being traded by Clinton supporters (tagged with #Clinton, #hillarysupporter, and the like).
Finally, we noticed that most of the bots went into hibernation immediately following the election. In general, social media bots tend to have a clear rhythm of content production. Bots that work in concert with humans will be active in the day and dormant at night. More automated bots will be front-loaded with content and then push out messages around the clock. A day after the election, these same bots, which had been pumping out hundreds of posts a day, fell silent. Whoever was behind them had switched them off. Their job was done.
In the run-up to the U.S. midterm election, the big question is not whether social media will be exploited to manipulate voters, but rather what new tricks and tactics and what new actors will emerge. In August, Facebook announced that it had already shut down Iranian and Russian botnets trying to undermine the U.S. elections. As such activity tends to spike in the month or so right before an election, we can be certain that won’t be the end of it.
Meanwhile, Twitter, Facebook, and other social media platforms have implemented a number of new practices to try to curtail political manipulation on their platforms. Facebook, for example, disabled over 1 billion fake accounts, and its safety and security team has doubled to more than 20,000 people handling content in 50 languages. Twitter reports that it blocks half a million suspicious log-ins per day. Social media companies are also investing in machine learning and artificial intelligence that can automatically spot and remove “fake news” and other undesirable activity.
Strategic Flagging: Tools intended to flag inappropriate content are instead used to flag an opponent’s legitimate content, which may then be erroneously deleted by a social media platform
But the problem is now a global one. In 2017, our researchers inventoried international trends in computational propaganda [PDF], and we were surprised to find organized ventures in each of the 28 countries we looked at. Every authoritarian regime in the sample targeted its own citizens with social media campaigns, but only a few targeted other countries. By contrast, almost every democratic country in the sample conducted such campaigns to try to influence other countries.
In a follow-up survey of 48 countries, we again saw political social media manipulation in every country in our sample. We are also seeing tactics spreading from one campaign cycle or political consultant or regime to another.
Voters have always relied on many sources of political information; family, friends, news organizations, and charismatic politicians obviously predate the Internet. The difference now is that social media platforms provide the structure for political conversation. And when these technologies permit too much fake news and divisive messages and encourage our herding instinct, they undermine democratic processes without regard for the public good.
We haven’t yet seen true artificial intelligence applied to the production of political messages. The prospect of armies of AI bots that more closely mimic human users, and therefore resist detection, is both worrisome and probably inevitable.
Protecting democracy from social media manipulation will require some sort of public policy oversight. Social media companies cannot be expected to regulate themselves. They are in the business of selling information about their users to advertisers and others, information they gather through the conversations that take place on their platforms. Filtering and policing that content will cause their traffic to shrink, their expenses to rise, and their revenues to fall.
To defend our democratic institutions, we need to continue to independently evaluate social media practices as they evolve, and then implement policies that protect legitimate discourse. Above all, we need to stay vigilant, because the real threats to democracy still lie ahead.
This article appears in the November 2018 print issue as “The Rise of Computational Propaganda.”
To Probe Further
For further details on social media manipulation in political campaigns, see
“Computational Propaganda in Russia: The Origins of Digital Misinformation,” by Sergey Sanovich (June 2017)
“Junk News on Military Affairs and National Security: Social Media Disinformation Campaigns Against U.S. Military Personnel and Veterans,” by John D. Gallacher, Vlad Barash, Philip N. Howard, and John Kelly (October 2017)
“Polarization, Partisanship and Junk News Consumption Over Social Media in the U.S.,” by Vidya Narayanan, Vlad Barash, John Kelly, Bence Kollanyi, Lisa-Maria Neudert, and Philip N. Howard (February 2018)
“Computational Propaganda in the United States of America: Manufacturing Consensus Online,” [PDF] by Samuel C. Woolley and Douglas Guilbeault (2017)
“Algorithms, Bots, and Political Communication in the U.S. 2016 Election: The Challenge of Automated Political Communication for Election Law and Administration,” by Philip N. Howard, Samuel Woolley, and Ryan Calo, Journal of Information Technology & Politics (April 2018)
“Challenging Truth and Trust: A Global Inventory of Organized Social Media Manipulation,” by Samantha Bradshaw and Philip N. Howard (2018)
About the Author
Philip N. Howard is the director of the Oxford Internet Institute at the University of Oxford and principal investigator of the Computational Propaganda Project.
Applying Interestingness Measures to Ansar Forum Texts
D.B. Skillicorn
School of Computing, Queen’s University, Canada
skill@cs.queensu.ca
ABSTRACT

Documents from the Ansar aljihad forum are ranked using a number of word-usage models. Analysis of overall content shows that postings fall strongly into two categories. A model describing Salafist-jihadi content generates a very clear single-factor ranking of postings. This ranking could be interpreted as selecting the most radical postings, and so could direct analyst attention to the most significant documents. A model for deception creates a multifactor ranking that produces a similar ordering, with low-deception postings identified with highly Salafist-jihadi ones. This suggests either that such postings are extremely sincere, or that personal pronoun use and intricate structuring are also markers of Salafist-jihadi language. Although the overall approach is relatively straightforward, the choice of parameters to maximize the usefulness of the results is intricate.
1. DATASET DETAILS
The Ansar aljihad forum is a mostly English-language forum with limited access: at one time access required registration, but it is now only by referral from an existing member. Of the 29,056 posts in the dataset, about half come from a small subset of members. This paper applies ‘bag of words’ textual analysis of various kinds to the contents of the forum postings. It does not examine characteristics of the authors or the timings of posts. In general, it is not obvious what aspects of a set of forum data such as this will be interesting in any given context, so the approach is purely inductive. The lack of ground truth about these postings, together with the inductive approach, makes it hard to draw firm conclusions that would be useful in an intelligence setting. On the other hand, analysis using several different models focuses attention on the same small set of postings. The approach is therefore useful, because it is usually not practical for an analyst to read all of the postings in a timely way (and, of course, many relevant datasets would be much larger).
2. ISSUES
ISI-KDD 2010, July 25, 2010, Washington, D.C., USA. Copyright 2010 ACM 978-1-4503-0223-4.
The texts of postings are mapped to a series of document-word matrices using sets of words that model phenomena of interest. The raw data for each analysis is therefore a matrix whose rows correspond to the 29,056 postings, whose columns correspond to members of the set of chosen words, and whose entries are counts of the frequency of each word in each document. The data is very skewed in both dimensions: a small set of authors have made very large numbers of postings, and the size of postings varies from literally a few words to tens of thousands of words. Nevertheless, even for models built using a small set of words, the matrices are extremely sparse.

A major difficulty with data in this form is deciding what kind of normalization is appropriate. The goal is to maximize the variation in markers relevant to the analysis while minimizing irrelevant artifacts. The problem is that some words, typically function words, occur often, and it is changes in their rate of occurrence that are potentially significant. Other words, typically nouns, are significant if they occur at all in a document, but their subsequent repetition in the document is much less significant (the document is already ‘about’ the noun). This big difference in the significance of frequency makes normalization challenging. Some of the possible normalizations are:

• tfidf, a conventional normalization from IR. The intent of this normalization is to ‘spread’ or differentiate the stored documents more evenly, making it easier to find the appropriate neighbors when a query is mapped into the representation space (vector or LSI). This would disturb the inherent cluster structure in this data and so should not be used.

• Normalize based on the length of each document, either in terms of the total number of words present or the total number of model-specific words present. Such a normalization is usually motivated by assuming a generative model for postings. The particular set of postings observed is regarded as a sample from an underlying distribution that describes the hypothetical process creating postings, and so contains artifacts due to sampling. For example, long documents have drawn more often from a word-choice distribution and so contain more, and more different, words. Taking the length of documents into account allows some of this bias to be discounted (not all, because long documents, even when normalized, can take on a greater set of values, so the analysis technique should also be resistant to quantization variation, which eigendecompositions are). There remains the question of which document length to use to get normalized frequencies. Some of the choices are:

  – Normalize to the unit hypersphere. This is the standard approach in information retrieval, partly because queries are modelled as really short documents, and this normalization makes documents of all lengths comparable. The problem in a dataset that contains both very short and very long documents, as this one does, is that this normalization blurs the structure substantially.

  – Divide by the document length. This provides a more gentle increase in similarity between short and long documents. In short documents, though, this increases the apparent signal strength of individual words well above that of any word in a long document, and so is still quite distorting.

  – Choose some quantum of length, say k. Documents longer than k have their word frequencies divided by their length; documents k words or shorter have their word frequencies divided by k. k then behaves as the unit within which words that occur at predictable rates will have occurred a stable number of times. It should probably be chosen so that most documents are shorter than k. The document length distribution is very skewed for this dataset, with the ‘knee’ of the frequency histogram at about 500, so relatively few documents would be normalized by their actual length.

  – Do not normalize by length at all. This amounts to an assertion that word frequencies are attributes all on a comparable scale, and so raw values are meaningful. From a different perspective, this also asserts that documents are not simply samples from a distribution of word frequencies, but that document length is itself a meaningful choice by the author. This has some plausibility for this dataset, because length correlates quite strongly with the style of posting, but it certainly emphasizes longer documents in the results.
• Apply a flattening transformation such as taking logarithms of word frequencies. This does not explicitly take the length of documents into account, but compresses the distribution of points corresponding to documents into something closer to a hyperspheric annulus. Such a normalization implicitly asserts that the presence of a word in a document is more important than its absence, but that significance does not increase linearly with frequency. For example, many authors have stylistic tics in which they use particular words frequently without altering overall meaning.

Different choices of normalization will completely alter the resulting analysis, but there seems to be no deeply principled way to make this choice. Fortunately, although different choices change the medium-scale structures for this dataset, they seem to have little effect on the extremal structure. There is also an issue of how to normalize
the columns of the matrix. After row normalization, the matrix entries remain non-negative, so measures of similarity between documents (rows) are always positive, and there is no concept of dissimilarity, just weaker similarity. However, an alternate view is that similarity measures should also allow for dissimilarity, based on deviations from some ‘typical’ base frequency. In this case, normalizing the columns to z-scores is appropriate.

There is a further complication, though. As these matrices are sparse, computing means and standard deviations based on all entries of a column loses available information: the denominators are large regardless of how many documents each word appears in. Furthermore, the mass of zero entries typically ends up very slightly on one side of the origin. As a result, computing correlations in standard ways constructs similarity based partly on the shared absence of word use between two documents, which is not usually sensible. It is better to compute means, standard deviations, and hence z-scores only from the non-zero entries of each column.

Assessment of the content of the postings is complicated by the fact that they are written in extremely different registers. Many postings, especially the shorter ones, are written in the very informal style typical of write-once postings on the web (chat, comments, etc.). At the other extreme, there are postings, typically long, that are written in a very ornate and flowery style, often coupled with religious ornamentation. There is a tendency to express religious thought in the English of the 17th century, for example using archaic words like “thou” and “doth”. Tools that were register-aware would be helpful for data such as this, but the lack of practical systemic functional parsing tools limits this level of sophistication at present. In common with much informal writing, spelling is not standard across the postings. This is further complicated by the numerous possible transliterations of Arabic words into English, especially of greetings, slogans, and names.
These different spellings of English words are not conflated in the analysis, as it is probable that they reflect cultural and geographical variations among authors that might be significant. Examining the list of statistically significant phrases extracted from the document set using the Logik tool suggests that the majority of the discussion is driven by news. The focus is on incidents, people, and places that were likely to have been discussed in the media over the relevant time period, rather than discussions of people within the jihadi movement (for example, mentions of Qari Mohammad Yousuf, one of the Taliban press spokesmen, are far more frequent than mentions of Mullah Omar). The most frequent phrases extracted, in decreasing order, are: Allah, Quote, Afghanistan, Taliban, Islamic Emirate, Mujahideen, government, soldiers, military, American, Pakistan, attack, Brother, Iraq, police, video, militants, troops, Somalia, district, mujahid, local time, Salaam, President, attacks, army, brothers, wa, Mogadishu, country, Baghdad, Islamic, vehicle, Afghan, Iraqi, city, officials, puppet army, Islam, Peace, news, Obama, download, Acer, alaykum, terrorists, landmine, Security, MB [presumably megabyte], Muslims, Muslim, http, tank, fighters, war, alaikum, Pakistani, British, Swat, Soldier, Insha Allah, Somali, Bomb, civilians, enemy, report, html, rapidshare [a file sharing service], capital, Kandahar, explosion, ameen, killing, view, fight, akhi [brother], Jihad, Reuters, Qari Muhammad Yousuf, Israel, Sheikh, insurgents, amir, Gaza, Islamist, Assalamu, release, Zabihullah, God, Media,
Aswat, Israeli, WMV, convoy, fileflyer [another file sharing service], al-Iraq, NATO. The country focus is on Afghanistan (307 documents), Pakistan (202), Somalia (148), Israel (54), and Iran (39), rather than America (23), Britain (12), Canada (5), or Australia (2). Adjectival country names are more common, for example American (204 documents). Frequencies of references to news sources are: BBC (16 documents), al Jazeera (12), CNN (13), Reuters (55) (an interesting sidelight on technology), and Associated Press Writer (24).
3. ANALYSIS METHODOLOGY
The analysis that follows uses singular value decomposition (SVD) applied to document-word matrices built from different combinations of possible words and their frequencies. Suitably normalized, a singular value decomposition discovers axes (essentially, eigenvectors) along which the set of documents exhibits variation. The resulting space is then projected into a few dimensions (typically, 2 or 3). In the resulting similarity space, proximity corresponds to global similarity (among both documents and words, since the SVD is completely symmetric); direction corresponds to global differences (that is, clusters); and distance from the origin corresponds to interestingness. This last holds because projection places both points that correlate well with many other points and points that correlate with few other points close to the origin; points that correlate only moderately with other points are mapped far from the origin in low-dimensional space. This kind of moderate correlation often captures useful notions of interestingness, since it avoids both documents whose word usage is exceedingly typical and those whose word usage is unique.

When frequencies of large numbers of words are used, this approach is a kind of clustering. When particular words associated with some property of texts are used, the projection is more typically a spectrum representing the intensity of the property captured by the set of words. When the property is truly single-factorial, the resulting space contains a 1-dimensional manifold, or spectrum, along which the points corresponding to each posting are placed. Most complex properties are multifactorial, so it is more typical to see a structure in two or three dimensions. Such a structure can be projected onto a line passing through the first two or three singular values to create a single score, representing interestingness with respect to the model words.
In both cases, a plot provides a visualization of global similarities and differences, and distance from the origin provides a visualization of interestingness. The distance from the origin in some k dimensions can be computed, and the documents ranked according to it. Such a ranking loses information about direction, but provides a quick method of focusing attention on a small subset of the records.
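The projection-and-ranking pipeline above can be sketched as follows. A random matrix stands in for the normalized document-word matrix; the sizes and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Stand-in for the normalized document-word matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 10))       # 40 "documents", 10 retained "words"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3                               # project into 2-3 dimensions, as in the text
doc_coords = U[:, :k] * s[:k]       # document positions in the truncated space

# Distance from the origin in k dimensions is the interestingness surrogate;
# ranking by it focuses analyst attention on a small subset of the postings.
dist = np.linalg.norm(doc_coords, axis=1)
ranking = np.argsort(-dist)         # document indices, most interesting first
```

As the text notes, the ranking discards direction (cluster membership) and keeps only magnitude, which is what makes it a quick triage device rather than a full description of the space.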
4. ANALYSIS RESULTS

4.1 Overall word use
Word extraction on the set of forum postings produced a set of 198,211 distinct “words”. As is typical of informal multilingual documents, a substantial fraction of these do not appear in any dictionary; some are ‘wrapper’ words from the context (such as “http”), and many are typos, or transliterations of Arabic words or fragments of Arabic words. Stop words were not removed because, in second-language contexts, differences in stop word use might be significant, reflecting, for example, relative fluency in English.

From this set of words, the 779 words that occurred more than 50 times overall were retained to produce a document-word matrix. This threshold was chosen pragmatically, although it is in a linear range of the threshold-versus-words curve, and the resulting structure did not seem very sensitive to the choice. The matrix was processed as discussed above, without row normalization, but normalizing columns to non-zero z-scores.

Figure 1: Based on words occurring frequently, the postings fall into two well-separated classes.

Figure 1 shows that the documents form two very distinct clusters. One cluster, oriented vertically in the figure, contains postings about military and insurgent activity, focused on Afghanistan and Pakistan. These postings tend to be news or reportage of various kinds, some copied from mainstream organizations. The words associated with this cluster are largely content words, visible in Figure 3: words such as “killed”, “province”, “district”, “mujahideen” (apparently the transliteration preferred by mainstream news organizations), “America”, “Islamic”, and “enemy”. The second cluster, oriented horizontally in the figure, contains postings that might be called Jihadi-religious. Figure 3 shows that the words associated with this cluster are primarily function words. This suggests that, for this cluster, it is not content that matters so much as persuasion, sentiment, power, and emotion. It is surprising that the content of the forum separates so strongly into two clusters – the existence of these two distinct topics is not at all obvious from reading a subset of the postings. To human readers, both the tone and content of postings from the different clusters do not seem markedly different.
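The word-retention step described above might look like the following sketch. The postings and threshold are toy values for illustration (the paper's threshold was 50 occurrences, retaining 779 of 198,211 words).

```python
from collections import Counter
import numpy as np

# Count word occurrences across all postings, keep only words above an
# overall-frequency threshold, and build the document-word count matrix.
postings = ["allah is great great", "the attack in kabul", "great attack"]
docs = [p.split() for p in postings]

total = Counter(w for d in docs for w in d)
threshold = 1                  # toy value; the paper retained words used > 50 times
vocab = sorted(w for w, c in total.items() if c > threshold)

# Rows = documents, columns = retained words.
A = np.array([[d.count(w) for w in vocab] for d in docs])
```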
Because there is no normalization by length, the extremal documents tend to be the longer ones. Normalizing using a boundary of 500, which is about the knee of the curve of lengths, produces very similar structure in the words, but makes it clear that the number of postings in the horizontal cluster is much larger than in the vertical cluster. The overall structure does not change much, providing reassurance that normalization choices here do not dominate the results.
Figure 4: For most-frequent words, the mutual similarity between words and documents. Because of the symmetry of the SVD, rows of both matrices can be plotted as points in the same space. A word and a document are attracted to similar locations when the frequency of the word in the document is large. One interpretation of the SVD is that it is a global integration of this pairwise attraction.
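The joint placement of words and documents can be illustrated with a small sketch. The toy counts are invented, and scaling both sets of coordinates by the singular values is one common biplot convention, assumed here rather than taken from the paper.

```python
import numpy as np

# Because the SVD treats rows and columns symmetrically, documents
# (rows of U*S) and words (rows of V*S) can be plotted in the same space.
A = np.array([[5., 0., 1.],
              [4., 1., 0.],
              [0., 6., 5.],
              [1., 5., 4.]])        # 4 documents x 3 words (toy counts)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs  = U[:, :k] * s[:k]            # document coordinates
words = Vt.T[:, :k] * s[:k]         # word coordinates on the same axes

# Word 0 is used heavily by documents 0 and 1, so the pairwise attraction
# described in the text should place it nearer to them than to documents 2-3.
d_near = np.linalg.norm(words[0] - docs[0])
d_far  = np.linalg.norm(words[0] - docs[2])
```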
Figure 2: For most-frequent words, document clusters labelled with their posting number. Document 25606 lies between the two clusters; it is a long list of insurgent activities, in the style of the horizontal cluster, but with the content of the vertical cluster. As expected, it uses the word “in” at extremely high rates.
Figure 3: For most-frequent words, global word similarity.

Use of a tool such as Palantir would enable much of the basic content structure in this set of documents to be extracted in sophisticated ways. The advantage of the analysis here is that (a) it is purely inductive rather than analyst-driven, (b) it shows the high-level structure very directly, and (c) using distance from the origin as a surrogate for ‘interestingness’ allows the documents to be ranked, so that analyst attention can automatically be focused on the most significant postings within the set.

4.2 Finding radical postings

It would be most useful to exploit projection and ranking to select those postings with the greatest signs of radical Salafist, al Qaeda, or jihadist content. Koppel et al. [1] built an empirical model of Salafist-jihadi ideological word use in contrast to that of other ideologies (mainstream, Wahhabist, and Muslim Brotherhood), which we use as a surrogate for Salafist-jihadi content in this forum’s postings. At best, this is only a rough approximation to the desired content and style; in particular, it is not designed to discriminate between Salafist-jihadi language and ‘ordinary’ language such as news reports.

We begin with the top 100 words from the Koppel model. Several of these Arabic words translate to the same English word, so we end up with 85 English words in the model (shown in Table 1). The frequencies of these words are extracted from the forum data, and the resulting matrix is row normalized by replacing each entry by log(aij + 1). The columns are then normalized to non-zero z-scores as before. Different forms of normalization were tried, but made little difference to the qualitative structure.

The results are shown in Figure 5. This model appears to work well, in the sense that it projects postings almost entirely onto a 1-dimensional structure that can be interpreted as a continuum from plentiful non-jihadist postings to rarer but more extreme jihadist postings. There is a second, roughly orthogonal component of postings with differentiated use of the words “said”, “were”, “the”, and a few others. The presence of “the” as such a strong marker in particular suggests that second-language issues are relevant here; perhaps the postings in this smaller component are primarily quotations from mainstream news organizations. Removing this component, by removing the associated words from the model, leaves the large component almost unaffected.

Some of the extremal postings at the Salafist-jihadist end of the spectrum are: 15646 – “words for jihadis”; 14621 – Book of a Mujahid; 9916 – an extensive political/religious argument; 14736, 17431 – pro-jihadi religious tracts by al-Maqdisi, the spiritual mentor of al-Zarqawi. At the other end of the spectrum are postings that are quite vicious in tone, but about other subjects (and shorter, which is partly why the extent of the spectrum is not symmetric around the origin). For example: 13494 – a comment on a visit by Huckabee to Jerusalem; 10416 – a brief comment suggesting that backlash to insurgent attacks came from drug lords, rather than the general population; 3201 – a posting about Kashmir; 23314 – almost entirely transliterated Arabic, so relevant words not captured; 22406 – a brief news report.

Figure 6 shows that the words most strongly associated with Salafist-jihadi postings are function words such as “those”, “who”, “these”, “they”, and “when”, suggesting that it is relationship and conviction rather than propositional discussion that are important. Content words are not strong markers, but there is perhaps a characteristic style associated with radical postings. This is supported by the results of Koppel et al. [1], who were able to classify documents with different ideologies with about 75% accuracy using only function words. The existence of inflammatory postings at both ends of the spectrum suggests that sentiment analysis could be helpful for this problem, but it would need to be sophisticated since the relevant words go far beyond adjectives and negations.

Figure 5: For words related to Salafist-jihadi radicalism, the mutual similarity between words and documents.

Figure 6: Structure in the words associated with radicalism.

Figure 7: Postings and words using the deception model.
4.3 Looking for deception

We now turn to consider deception. The work of, among others, Pennebaker’s group [2, 3] has shown that (a) deception causes characteristic changes in text or speech, and (b) these same changes can be observed over a large range of different activities that have an element of deceptiveness, from outright lies to negotiation. Since propaganda has an element of deception built into it, we consider whether postings that rank highly using Pennebaker’s deception model are of interest. The model, which is determined empirically but has been widely validated, posits that the characteristic signature of deception is changes in the frequencies of four classes of words:
• first-person singular pronouns decrease;
• exclusive words, words that introduce a subsidiary phrase or clause to make a sentence more complex, decrease;
• negative-emotion words increase; and
• action verbs increase.
The model uses 86 words in all; they are listed in Table 2.

As before, the frequencies of the words in the model were extracted from the forum dataset. The entries were scaled by logarithms, and non-zero z-scoring was applied. The column entries, now symmetric about zero, were negated for columns 1–20, which correspond to the first-person singular pronouns and exclusive words, for which decreased frequencies are signals of deception. In the resulting matrix, a larger magnitude always represents a positive signal of deception. The same analysis process as before was applied to this matrix.

The results are shown in Figure 7. The basic structure is fan-like, resulting from variation in the use of the words shown towards the right of the figure: “I”, “or” and “but”. However, the most striking feature is that the extremal postings to the left, the putatively least deceptive, are the same postings that ranked as highly jihadist. Examination of these extremal postings shows that they are off the charts in terms of first-person singular pronoun and exclusive word frequency. In other words, the reason that Salafist-jihadist postings look low in deceptiveness is that they tend to be intricate yet personal discussions/arguments. This may be a signal of passionate belief, or it may be a stylistic signature developed from particular kinds of religious activity.

The postings at the other end of the deception spectrum are primarily news reports copied from mainstream media. In the context of typical forum postings, such documents contain first-person singular pronouns only when someone is being quoted, and are written in a simple, expository style that uses very few exclusive words. Couple this with steady use of action verbs to keep the story moving, and a generally negative tone in war-relevant reporting, and it is clear why such stories rank at the deceptive end of the spectrum. This again emphasizes the need to consider the pool of documents when interpreting relative deceptiveness.

The structure of the words from the deception model, shown in Figure 8, is 1-dimensional, aligned with the axis of deception in the postings, except for a small set of words roughly orthogonal to it. These words, “me”, “my”, and “I”, tend to be strongly associated both with relative power and with deception in Western documents. It seems plausible that these pronouns are not so routinely used in Islamic culture, so their use frequencies may be author related.
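The construction of the deception-scoring matrix can be sketched as follows. This is a minimal illustration, assuming (as in the text) that the negated pronoun/exclusive-word columns come first; the function name and toy counts are hypothetical.

```python
import numpy as np

def deception_matrix(counts, n_negated=20):
    """Log-scale counts, z-score each column over its non-zero entries,
    then negate the first n_negated columns (first-person pronouns and
    exclusive words, columns 1-20 in the paper), so that larger values
    always signal deception."""
    A = np.log(counts + 1.0)                 # log(a_ij + 1) scaling
    Z = np.zeros_like(A)
    for j in range(A.shape[1]):
        nz = counts[:, j] != 0
        vals = A[nz, j]
        if vals.size > 1 and vals.std() > 0:
            Z[nz, j] = (vals - vals.mean()) / vals.std()
    Z[:, :n_negated] *= -1.0                 # decreased pronouns/exclusives => deceptive
    return Z

# Toy counts: 3 postings x 2 model words; the first column is treated
# as a pronoun/exclusive column and is negated.
counts = np.array([[3., 0.],
                   [1., 2.],
                   [2., 2.]])
Z = deception_matrix(counts, n_negated=1)
```

The same SVD projection and distance ranking used earlier would then be applied to the resulting matrix.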
5. DISCUSSION
The goal of this analysis is to provide shortcuts for analysts by ranking postings in order of properties of interest, so that only some top part of the ranking need be examined in detail. Ranking using the content of documents shows that postings to this forum are of two quite distinct kinds. Here, ranking is of limited usefulness, since length plays a large role in distance from the origin. Different normalizations are possible and might produce an interestingness ranking, but “interesting” here means roughly “on topic”, so this may not be very useful.

Ranking using an existing model of Salafist-jihadist word usage patterns turns out to be surprisingly useful, producing a single-factorial ranking of postings in which the top-ranked documents do indeed seem to be significant. Ranking using the deception model also turns out to be useful, although in a slightly surprising way. Documents that rank highly on the Salafist-jihadi scale rank low on the deception scale. This may be a signal of sincerity, or a result of stylistic markers acquired during radicalization.

Figure 8: Structure of the words in the deception model.
6. REFERENCES
[1] M. Koppel, N. Akiva, E. Alshech, and K. Bar. Automatically classifying documents by ideological and organizational affiliation. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics (ISI 2009), pages 176–178, 2009.
[2] M. Newman, J. Pennebaker, D. Berry, and J. Richards. Lying words: Predicting deception from linguistic style. Personality and Social Psychology Bulletin, 29:665–675, 2003.
[3] J. Pennebaker, M. Francis, and R. Booth. Linguistic Inquiry and Word Count (LIWC). Erlbaum Publishers, 2001.
Table 1: Top-ranked words indicative of Salafist-jihadi ideology in contrast to other forms of Islamic thought, from Koppel et al. [1], ranked from top left to bottom right in the original layout: Jihad, Parents, How, Platform, Religion, Much, Monotheism, Muslim, Family, Mujahideen, Worlds, Ye, Way, Oppressors, Alone, Unbelievers, Word, Understand, Infidelity, Idolaters, Say, Faithful, Nation, Was, Tyrants, War, Rahim, They, Abi, The, Fighting, Rahman, Were, God, More, Revealed, Themselves, Jewish, Taymiyyah, Faith, Command, When, Juggernaut, Right, Earth, Folk, Greater, Mercy, Believers, Those, Prophet, Combat, Under, Struggler, Killing, Iraq, Them, America, Falsehood, Companions, Some, You, Governance, Almighty, Kfar, Minimum, Country, Shirk, These, Afghanistan, Who, Youth, Enemy, People, Terrorism, Messenger, O, Said, Including, Entire, Force, Islam, Trial, Illusion, Name.

The word set is in Arabic and was translated using Google Translate, introducing some artifacts. For example, “rahman” would usually be written as “merciful” in English, but “kuffar” could appear either transliterated or translated as “infidel”. We ignored such effects, since repeating the experiments with a set translated by a human made little difference. In practice, if an automated tool is ‘good enough’ it should be preferred, since Arabic speakers remain rare in intelligence settings. “Shirk” here is the Arabic word meaning “associating partners in the worship of Allah”; Taymiyyah was a 14th-Century Islamic theologian whose ideas have strongly influenced the most conservative versions of Islam.
Table 2: The 86 words used by the Pennebaker model of deception.
First-person pronouns: I, me, my, mine, myself, I’d, I’ll, I’m, I’ve
Exclusive words: but, except, without, although, besides, however, nor, or, rather, unless, whereas
Negative-emotion words: hate, anger, enemy, despise, dislike, abandon, afraid, agony, anguish, -------, -----, boring, crazy, dumb, disappointed, disappointing, f-word, suspicious, stressed, sorry, jerk, tragedy, weak, worthless, ignorant, inadequate, inferior, jerked, lie, lied, lies, lonely, loss, terrible, hated, hates, greed, fear, devil, lame, vain, wicked
Motion verbs: walk, move, go, carry, run, lead, going, taking, action, arrive, arrives, arrived, bringing, driven, carrying, fled, flew, follow, followed, look, take, moved, goes, drive