Warning: this post contains mindless text and some foul language.
There are Facebook pages that are now havens for trolls and generally very angry commenters. Mocha Uson's Facebook page is the classic example of an FB page loaded with trolls.
We've scraped Mocha Uson's Facebook page comments and figured out an algorithmic way to generate text that follows the style of her commenters. Just for fun.
The cover photo was sadly not generated by an algorithm. We're still working on that.
There are multiple ways of generating text; a classic example is the use of Markov chains. We won't get into the nitty-gritty of Markov chains here, but know that this is the algorithm that popularly powered SubredditSimulator on Reddit.
To put it simply, a Markov chain predicts the next word based only on the current word it's looking at, ignoring all the words that came before it.
For example, if you made a Markov chain model of a baby's behavior, you might include "playing," "eating," "sleeping," and "crying" as states, which together with other behaviors could form a "state space": a list of all possible states. On top of the state space, a Markov chain tells you the probability of hopping, or "transitioning," from one state to any other state; e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.
Here's one of our favorite explainers on Markov Chains.
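To make the idea concrete, here's a toy word-level Markov chain in Python. This is our own minimal sketch for illustration (not SubredditSimulator's actual code), and the tiny corpus and function names are made up for the example:

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=10, seed=None):
    """Walk the chain: each next word is chosen only from the
    current word's observed followers -- earlier words are ignored."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# Tiny toy corpus (illustrative only)
corpus = "the baby is playing the baby is sleeping the baby is crying"
chain = build_chain(corpus)
print(generate(chain, "the", length=8, seed=42))
```

Because repeated followers stay in the list, common transitions are naturally picked more often, which is exactly the "transition probability" idea above.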
With neural networks getting more popular, a newer approach to generating text emerged: the Long Short-Term Memory (LSTM) network. An LSTM is a type of Recurrent Neural Network (RNN) that can be used to imitate the style of a writer and generate text based on it. Unlike Markov chains, LSTMs have "memory": they can learn from and remember past context or a priori information.
LSTMs aren't limited to generating text. They're also good at tasks like automatic music composition, speech recognition, and action recognition; in general, any task involving sequential data.
For this post, an LSTM was used to generate Facebook comments in the style of one particular page.
Dataset and Training
We at Indigo Research have used LSTMs before for various generative text experiments. Last year, we did a post that went relatively viral on generating text in the style of Jose Rizal, using Noli Me Tangere and El Filibusterismo as training data. It was a fun experiment to see text generated as if it had been written in Rizal's time.
Check out the post here.
We've also run a few experiments on scraping Filipino erotica and State of the Nation Addresses to try to generate text based on single authors or voices.
In this post, Facebook comments scraped from Mocha Uson's page were used as training data. The dataset includes comments on posts from January 1, 2017 to September 14, 2017. The resulting text file contained over 1.5 million comments and was around 100MB in size, the largest dataset we've trained an LSTM on so far.
A popular open source framework called Torch-RNN was used for this post. Training was done on an Ubuntu 16 machine with an Nvidia GeForce GTX 750 Ti GPU. Even with the GPU and CUDA acceleration, training took almost 24 hours. For perspective, the erotica dataset we experimented on before, which was only around 50MB in size, took 5 days to train using only a MacBook Pro.
The dataset training took 1437 minutes or 23.95 hours.
The settings used for Torch-RNN were the following (mostly defaults):
- RNN Size: 256
- Layers: 3
- Sequence Length: 50
- Max Epochs: 50
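Assuming the standard Torch-RNN workflow (jcjohnson's implementation), those settings map roughly to an invocation like the one sketched below. The file paths and checkpoint name here are illustrative placeholders, not the ones we actually used:

```shell
# Preprocess the scraped comments into the HDF5/JSON format Torch-RNN expects
python scripts/preprocess.py \
  --input_txt comments.txt \
  --output_h5 comments.h5 \
  --output_json comments.json

# Train a 3-layer LSTM with the settings listed above, on GPU 0 (CUDA)
th train.lua \
  -input_h5 comments.h5 -input_json comments.json \
  -model_type lstm -rnn_size 256 -num_layers 3 \
  -seq_length 50 -max_epochs 50 -gpu 0

# Sample generated text from a saved checkpoint
th sample.lua -checkpoint cv/checkpoint_final.t7 -length 500
```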
The LSTM generated some pretty interesting comments given that large dataset. A sample of the command-line output can be seen in the screenshot we've included below.
Now, to make things a bit more interesting, we've taken the liberty of mixing some comments generated by the LSTM with actual comments from Mocha Uson's Facebook page. Try to spot the differences, if you can spot any strong ones.
ang lahat basta may demonyo lang nakapansin si gusto nilang adik
Oo ng my limit..barangay captain dating body guards..kurrap kasi..tarot mamatay..
Happy Father's Day Mr, Uson...
Impeach leni iba ang iba nag vice music dock from high na kasi walang walang kwenta.
#LeniPowerGrabber . Indulute Fight You You can 100% it's never!!!
We are watching from Saudi arabia ...more power
Samahan ng contract duterte.. from riyadh
Ang dami pa rin naitulong mo robredo tawa sa baho na sayo buhay, but said lang po tayo sa gera
Tuloy ang laban marcos
Tuloy ang laban mga ka DDS.
Dpt nga zero budget p kc puro LNG tlk like tilling
ang sarap batukan.. aso. ulol. tang mo tota. bubo pa.
god bless sir bless us always Tatay Tatay Digong
ha, goodmorning martial law idol na more power DDS saan
Hay katok nga sinador mka bwisit
Is it difficult to distinguish the actual comments from the generated ones? Is it because the LSTM model performed well? Or is it because the real comments are so poor in semantic quality?
We leave the conclusion up to you. It's hard to make a definitive call given the variety (or lack thereof) in the scraped training dataset, but these are pretty interesting results.
As a side note, it was curious to see that the model actually learned how to use hashtags properly. It even produced various comments about Marcos, Robredo, and Uson. This should be material for future experiments.
Of the comments above, only 5 were real (#2, #10, #11, #12, #15). The rest were generated by the LSTM.
For more AI related updates, follow us on Facebook.