VK public news pages have recently become a political and cultural battleground. Heated discussions break out in the comments, especially when commenters find that others do not share their point of view. The question is: are they real people? This project explores the possibility of detecting bot accounts in such discussions.
For this project, a parsing script based on the VK API and VKScript was created to collect post data from a set of the largest and most active VK news pages (general and political):
1 канал, Лентач, РБК, Роскомсвобода, Дождь, Вести, Топор, Медуза, РенТВ, Плохие Новости, Life, РИА, Mash
In addition, we collected the comments and the corresponding user profiles (1.5M+ accounts in total), which amounted to 15+ GB of Parquet data stored on a cluster via HDFS.
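As an illustration of how VKScript can batch comment collection, here is a minimal sketch that only builds the script text for the VK API `execute` method. It is not the project's actual parser: the function name `build_comments_script`, the page size of 100 and the limit of 25 API calls per script are assumptions made for the example and should be checked against the current VK API documentation.

```python
import json

# Assumed limits, not taken from the project: verify against the VK API docs.
MAX_CALLS_PER_EXECUTE = 25   # API calls allowed inside one `execute` script
PAGE_SIZE = 100              # comments requested per wall.getComments call


def build_comments_script(owner_id, post_id, start_offset=0,
                          calls=MAX_CALLS_PER_EXECUTE):
    """Build VKScript that fetches several pages of comments in one request.

    owner_id is negative for community (group) walls.
    The returned string is meant to be sent as the `code` parameter
    of the `execute` method; no network request is made here.
    """
    if not 1 <= calls <= MAX_CALLS_PER_EXECUTE:
        raise ValueError("calls must be between 1 and MAX_CALLS_PER_EXECUTE")

    parts = []
    for i in range(calls):
        params = {
            "owner_id": owner_id,
            "post_id": post_id,
            "count": PAGE_SIZE,
            "offset": start_offset + i * PAGE_SIZE,
        }
        # VKScript accepts JSON-style object literals as call arguments.
        parts.append("API.wall.getComments(" + json.dumps(params) + ")")

    return "return [" + ",".join(parts) + "];"


# Hypothetical wall and post identifiers, for illustration only.
script = build_comments_script(owner_id=-1, post_id=42, calls=2)
print(script)
```

Batching pages this way reduces the number of HTTP requests, which matters when request rate limits are the bottleneck.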
Using Spark, we devised a series of criteria based on profile and activity data and, most intriguingly, on comment sentiment analysis performed with Dostoevsky, a library for analyzing Russian text (which I adore for its speed, accuracy and ease of use). From the gathered information we computed a percentile score, which gave us the final verdict.
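One way such a percentile score could work is sketched below in plain Python rather than Spark, to keep it self-contained. The criteria names (`comments_per_day`, `negative_share`), the sample values, the equal weighting and the threshold of 80 are all hypothetical; the project's actual criteria and cutoff are not described here.

```python
from bisect import bisect_left, bisect_right


def percentile_ranks(values):
    """Return the percentile rank (0-100) of each value within the list.

    Ties share the same rank (mid-rank convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    ranks = []
    for v in values:
        below = bisect_left(ordered, v)            # values strictly smaller
        equal = bisect_right(ordered, v) - below   # values equal to v
        ranks.append(100.0 * (below + 0.5 * equal) / n)
    return ranks


def bot_scores(accounts, criteria):
    """Average each account's percentile rank over all criteria."""
    per_criterion = [
        percentile_ranks([acc[c] for acc in accounts]) for c in criteria
    ]
    return [
        sum(column[i] for column in per_criterion) / len(criteria)
        for i in range(len(accounts))
    ]


# Hypothetical accounts and criteria, for illustration only.
accounts = [
    {"id": 1, "comments_per_day": 2, "negative_share": 0.10},
    {"id": 2, "comments_per_day": 40, "negative_share": 0.80},
    {"id": 3, "comments_per_day": 5, "negative_share": 0.30},
    {"id": 4, "comments_per_day": 1, "negative_share": 0.05},
]
criteria = ["comments_per_day", "negative_share"]

scores = bot_scores(accounts, criteria)
THRESHOLD = 80.0  # assumed cutoff, not the project's value
flagged = [acc["id"] for acc, s in zip(accounts, scores) if s >= THRESHOLD]
print(scores, flagged)
```

Percentile ranks put criteria with very different scales on a common 0-100 footing, so no single criterion dominates the average merely because of its units.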
The chosen groups differ in how negative their audiences are.
Below is the ranking of the groups by spam comment ratio:
Want to take a deep dive into the pool of chaos? Here are the top-5 posts that caused the most toxic, hateful and controversial discussions between people, trolls and bots:
№1: "Плохие Новости", topic: Scandalous behavior in social media
№2: "1 Канал", topic: Religious insults
This post is somewhat of an outlier: it dates back to 2012, and a larger share of the accounts involved have since been deleted.
№3: "РИА", topic: Event involving injury and awarding of a police officer
№4: "Плохие Новости", topic: Questionable race-based statements
№5: "Дождь"*, topic: Alexei Navalny's return to Russia
*banned in Russia, only accessible with VPN
Made with Python as a course project for the first year of a master's program, FDT ITMO