If you, like us, plan on spending your vacation experimenting with datasets, here's our Christmas gift to you. We believe that sharing research is important, so we've collated all of our toy datasets and posted them on Google Drive. We'll go through each of them below.
We started Indigo Research to push AI technology further. One of our goals is to share what we are learning by publishing research, organizing meetups, and sharing datasets. While our first research paper is still in the works, we managed to organize our first AI Nights two weeks ago at Launchgarage. It was a successful event with a mix of students and industry professionals attending.
Our friend, Carl Calub of DataSeer, discussing his Fete de la Musique optimization
Christmas is for sharing
As Christmas nears, we're sharing the datasets that we used for some of our fun projects. We think this is a good opportunity for those who are planning to study or experiment with AI over the coming holidays. We understand that one of the first problems in doing AI experiments is obtaining a dataset to work with. Hopefully, this post can help solve that problem.
1. Mocha Uson Page Comments Dataset
This is the dataset we used for our blog post that went viral a few weeks ago. The dataset contains comments obtained from Mocha Uson's Facebook Page using the Facebook API. It includes comments on posts dated from January 1, 2017 to September 14, 2017. The resulting text file contains 1.5M+ comments and is around 100MB in size.
2. Lechon Images Dataset
This is the dataset used for our first-ever post on the Indigo Research blog. It contains different images of lechon manually downloaded from Google Images.
3. Congress Images Dataset
The dataset contains images of congressmen and congresswomen from various years. The images were cropped and resized to 80x80 using ImageMagick. In total, the dataset contains 809 images. It was used to generate faces of politicians using HyperGAN.
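If you want to prepare your own images the same way, the crop-and-resize step can be sketched as below. This is only an illustration: we don't reproduce the exact command used for this dataset, so the center crop, the `.jpg` extension, and the helper names `build_resize_command` and `resize_folder` are all assumptions. It relies on ImageMagick's `convert` command being installed.

```python
import subprocess
from pathlib import Path

SIZE = "80x80"  # target size mentioned in the post


def build_resize_command(src: Path, dst: Path, size: str = SIZE) -> list[str]:
    """Build an ImageMagick command that center-crops and resizes one image.

    Assumption: a center crop is acceptable. On ImageMagick 7 the
    executable is usually called "magick" instead of "convert".
    """
    return [
        "convert", str(src),
        "-resize", f"{size}^",   # scale so the image fills the whole box
        "-gravity", "center",    # keep the center of the image when cropping
        "-extent", size,         # trim the overflow to exactly the target size
        str(dst),
    ]


def resize_folder(src_dir: Path, dst_dir: Path) -> int:
    """Resize every .jpg in src_dir into dst_dir and return how many were processed."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(src_dir.glob("*.jpg")):
        subprocess.run(build_resize_command(src, dst_dir / src.name), check=True)
        count += 1
    return count
```

Calling `resize_folder(Path("raw_faces"), Path("faces_80"))` would process a hypothetical `raw_faces` folder. For face datasets, a crop centered on the detected face will usually work better than a plain center crop.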
4. Juan Luna Paintings
5. Erotica Dataset
This is one of the most interesting datasets we've encountered. It was scraped from various Filipino erotica websites, and we managed to obtain around 45MB of text. It was used to generate erotica text using RNN and LSTM models. The result was readable but barely understandable (see screenshot below).
For better results, we think that more erotica text datasets are needed.
6. Aquino / Duterte Speeches
This dataset contains speeches scraped from gov.ph. It includes speeches delivered by President Aquino during his term, as well as some speeches delivered by President Duterte during his first few months in office.
7. SONA dataset
This dataset contains the SONAs delivered by Philippine presidents from 1935 to 2016 (a few years are missing). It was scraped from the Official Gazette of the Philippines.
8. MRT Images
This dataset contains images from different MRT stations, taken from various camera angles. These were scraped from the MRT CCTV live feed on the DOTC website. An interesting tidbit is that the live feed was delivered as a sequence of images rather than as a video, which made it easier to scrape and store. The dataset was intended for a machine vision experiment.
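Because the feed is just a sequence of still images, scraping it comes down to fetching the same image URL repeatedly and saving each new frame. Here is a minimal sketch of that idea. It is not the scraper we actually used: `FEED_URL` is a made-up placeholder, and `fetch_frame` and `save_frames` are hypothetical helper names.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# Hypothetical placeholder, not the real CCTV endpoint.
FEED_URL = "http://example.invalid/cctv/station.jpg"


def fetch_frame(url: str = FEED_URL) -> bytes:
    """Download the current frame of the feed as raw bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def save_frames(
    fetch: Callable[[], bytes],
    out_dir: Path,
    count: int,
    interval: float = 1.0,
) -> list[Path]:
    """Poll `fetch` `count` times and save each frame that differs from the previous one."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: list[Path] = []
    last_digest = None
    for _ in range(count):
        data = fetch()
        digest = hashlib.sha256(data).hexdigest()
        if digest != last_digest:  # skip frames identical to the previous poll
            path = out_dir / f"frame_{len(saved):05d}.jpg"
            path.write_bytes(data)
            saved.append(path)
            last_digest = digest
        time.sleep(interval)  # be polite to the server between polls
    return saved
```

With a real feed URL, `save_frames(fetch_frame, Path("mrt_frames"), count=60, interval=5)` would poll once every five seconds for five minutes. Passing the fetch function in as an argument also makes it easy to try the loop against a fake feed first.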
9. Coffee Shops
This dataset contains listings of coffee shops within NCR. Each listing includes the shop's name, description, GPS coordinates, and check-in count. The locations were geoscraped from Foursquare in 2016.