Merry Christmas!

If you, like us, plan to spend your vacation experimenting with datasets, here's our Christmas gift to you. We believe that sharing research is important, so we've collated all of our toy datasets and posted them on Google Drive. We'll go through each of them below.

We started Indigo Research to push AI technology further. One of our goals is to share what we are learning by publishing research, organizing meetups, and sharing datasets. While our first research paper is still in the works, we managed to organize our first AI Nights two weeks ago at Launchgarage. It was a successful event with a mix of students and industry professionals attending.

Our friend Carl Calub of DataSeer discussing his Fete de la Musique optimization

Christmas is for sharing

With Christmas approaching, we're sharing the datasets that we used for some of our fun projects. We think this is a good opportunity for those who plan to study or experiment with AI over the coming holidays. We understand that one of the first problems in doing AI experiments is obtaining a dataset to work with. Hopefully, this post helps solve that problem.

Let's start.

1. Mocha Uson Page Comments Dataset

This is the dataset we used for our blog post that went viral a few weeks ago. It contains comments obtained from Mocha Uson's Facebook Page using the Facebook API, covering posts from January 1, 2017 to September 14, 2017. The resulting text file contains 1.5M+ comments and is around 100 MB in size.

Download here

2. Lechon Images Dataset


This is the dataset used for our first-ever post on the Indigo Research blog. It contains different images of lechon manually downloaded from Google Images Search.

Download here

3. Congress Images Dataset

This dataset contains images of congressmen and congresswomen from various years. The images were cropped and resized to 80x80 pixels using ImageMagick. In total, the dataset contains 809 images. It was used to generate faces of politicians using HyperGAN.
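If you want to prepare your own face images in a similar way, the crop-and-resize step can be sketched as below. This is only a sketch under assumptions: it assumes ImageMagick's `convert` command is installed, and it uses the common fill-then-crop idiom (`-resize 80x80^ -gravity center -extent 80x80`). The exact options used for this dataset are not recorded in this post, and the function names `build_resize_command` and `resize_folder` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_resize_command(src, dst, size=80):
    """Build an ImageMagick command that center-crops and resizes to size x size.

    Assumption: this is one common way to get square thumbnails, not
    necessarily the exact command used to build the dataset.
    """
    geometry = f"{size}x{size}"
    return [
        "convert", str(src),
        "-resize", geometry + "^",  # scale so the image fills the square
        "-gravity", "center",       # keep the center when cropping
        "-extent", geometry,        # cut down to exactly size x size
        str(dst),
    ]


def resize_folder(src_dir, dst_dir, size=80, run=subprocess.run):
    """Resize every .jpg in src_dir into dst_dir and return the output paths."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sorted(Path(src_dir).glob("*.jpg")):
        dst = dst_dir / src.name
        run(build_resize_command(src, dst, size), check=True)
        outputs.append(dst)
    return outputs
```

Calling `resize_folder("raw_faces", "faces_80")` (both folder names are placeholders) would write an 80x80 copy of each JPEG into the second folder.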

Download here

4. Juan Luna Paintings


Although these paintings are also available on Wikipedia, we are including this dataset because we used it for our neural style transfer post.

Download here

5. Erotica Dataset

This is one of the most interesting datasets we've encountered. It was scraped from various Filipino erotica websites, and we managed to obtain around 45 MB of text. It was used to generate erotica text using RNN and LSTM models. The result was readable but barely understandable (see screenshot below).


For better results, we think that more erotica text data is needed.

Download here

6. Aquino / Duterte Speeches

This dataset contains scraped speeches delivered by President Aquino during his presidency. It also contains some speeches delivered by President Duterte during his first few months.

Download here

7. SONA dataset

This dataset contains State of the Nation Address (SONA) speeches by Philippine presidents from 1935 to 2016 (a few years are missing). These were scraped from the Official Gazette of the Philippines.

Download here

8. MRT Images


This dataset contains images from different MRT stations at various camera angles. These were scraped from the MRT CCTV live feed on the DOTC website. An interesting tidbit is that the live feed was delivered as a sequence of images rather than a video, which made it easier to scrape and store. The dataset was intended for a machine vision experiment.
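Because a feed like this is served as still images, scraping it can be as simple as polling one image URL and saving each new frame. The following is a minimal sketch of that idea, not the scraper we actually used: `FEED_URL` is a made-up placeholder, and the names `fetch_frame` and `capture_frames` are hypothetical.

```python
import hashlib
import time
import urllib.request
from pathlib import Path

# Hypothetical placeholder; the real feed URL is not given in this post.
FEED_URL = "https://example.com/cctv/station-1.jpg"


def fetch_frame(url, timeout=10):
    """Download the current frame as raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def capture_frames(out_dir, count, interval=1.0, fetch=fetch_frame,
                   url=FEED_URL, sleep=time.sleep):
    """Poll the feed `count` times and save each new frame as a .jpg file.

    Consecutive identical frames (same SHA-256 digest) are skipped, since a
    feed served as still images may repeat a frame between polls.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    last_digest = None
    for i in range(count):
        data = fetch(url)
        digest = hashlib.sha256(data).hexdigest()
        if digest != last_digest:
            path = out_dir / f"frame_{len(saved):06d}.jpg"
            path.write_bytes(data)
            saved.append(path)
            last_digest = digest
        if i < count - 1:
            sleep(interval)  # wait before polling again
    return saved
```

The `fetch` and `sleep` parameters are injectable so the loop can be tried offline with a fake frame source before pointing it at a real feed.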

Download here

9. Coffee Shops


This dataset contains listings of coffee shops within NCR. Each listing includes the GPS coordinates, description, name, and check-in count. The locations were geoscraped from Foursquare in 2016.
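As one idea for a first experiment with the GPS coordinates, here is a small sketch that finds the shops nearest to a given point using the haversine formula. The field names (`name`, `lat`, `lng`, `checkins`) and the sample rows are assumptions made up for illustration; check the actual file for its real column names.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two GPS points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_shops(shops, lat, lon, k=3):
    """Return the k shops closest to (lat, lon), nearest first."""
    by_distance = sorted(
        shops, key=lambda s: haversine_km(lat, lon, s["lat"], s["lng"])
    )
    return by_distance[:k]


# Made-up rows for illustration only; not taken from the dataset.
sample = [
    {"name": "Shop A", "lat": 14.5547, "lng": 121.0244, "checkins": 120},
    {"name": "Shop B", "lat": 14.6507, "lng": 121.0494, "checkins": 45},
    {"name": "Shop C", "lat": 14.5764, "lng": 121.0851, "checkins": 300},
]

closest = nearest_shops(sample, 14.5547, 121.0244, k=2)
print([shop["name"] for shop in closest])
```

A plain sort is fine for a few thousand listings; a spatial index would only matter for much larger datasets.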

Download here