A Reflection on Data Scraping.

This was my first time getting my hands dirty with data, and "nervous" is an understatement for what I felt while working on it. I scraped data using Python on Google Colab. I chose Colab because it is fast and reliable and offers hardware accelerators (GPUs and TPUs) that pair well with TensorFlow. I was required to scrape data from websites and Twitter, but my main focus was on Twitter, to get real-time reactions to these case studies: M-Pesa in Kenya, the Adani mines in Australia, and gun laws in New Zealand.

To extract data from Twitter I had to obtain Twitter API credentials, which were frustrating to get and badly slowed down my project. I had the option of mimicking an existing dataset, but that would not have given me the outcome I needed in terms of decision-making and sensemaking. I eventually got the credentials after waiting for weeks; during that period I researched the case studies, the different approaches decision-makers have used, and how these could be implemented technically.

To scrape Twitter data, I discovered the power of Tweepy and scikit-learn, used alongside NumPy, pandas and NLTK. I used Tweepy to access the Twitter API with the credentials Twitter had given me.
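
For context, here is a minimal sketch of the kind of Tweepy call this involved. It is not my exact notebook code: the credential variables and the search query are placeholders, and older Tweepy versions name the search method api.search rather than api.search_tweets.

```python
import tweepy

# Placeholders for the four credentials Twitter issues; not real keys.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

# Authenticate and build an API client that waits out rate limits.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Pull a batch of recent English tweets for one case study and keep
# only the fields needed downstream (timestamp, author, full text).
cursor = tweepy.Cursor(api.search_tweets, q="M-Pesa", lang="en",
                       tweet_mode="extended").items(200)
rows = [(t.created_at, t.user.screen_name, t.full_text) for t in cursor]
```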

I got a lot of errors at this stage of the project, and to be quite honest it took me nearly two days to get it done and end up with a clean set of data that could be analyzed and understood. Stack Overflow and GitHub came through for me: on those platforms I could see what others had done and how they tackled the challenges that come with scraping data. Now I understand why it's said that 80% of your time, if you're not careful, can be spent on scraping. I spent an awful amount of time looking for simpler or more understandable ways of acquiring data, but I realized the trick is to just do it; the data you obtain can't be perfect, but it can be manipulated to get insights.
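
To give a flavour of what that clean-up looked like, here is a rough sketch, assuming the scraped tweets already sit in a pandas DataFrame with a text column; the column name and the regular expressions are illustrative rather than my exact pipeline.

```python
import re
import pandas as pd

def clean_tweet(text: str) -> str:
    """Strip links, handles, hashtags and punctuation from a raw tweet."""
    text = re.sub(r"http\S+", "", text)      # remove links
    text = re.sub(r"[@#]\w+", "", text)      # remove mentions and hashtags
    text = re.sub(r"[^A-Za-z\s]", "", text)  # keep letters and spaces only
    return text.lower().strip()

# Illustrative DataFrame standing in for the scraped tweets.
df = pd.DataFrame({"text": ["RT @user: M-Pesa is cheaper than banks! https://t.co/xyz"]})
df["clean_text"] = df["text"].apply(clean_tweet)
print(df["clean_text"].iloc[0])
```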

Medium helped me understand how to scrape data from Twitter; reading is one of the ways I tackled uncertainty in this project. On data scraping, I found that Selenium and Beautiful Soup are often used for the same job, but they are not quite the same thing: Beautiful Soup parses HTML you have already fetched, while Selenium automates a real browser, which lets it load JavaScript-rendered pages like Twitter and gives you more control over how much data you pull from a site.
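
As a quick illustration of the Beautiful Soup side (the URL and selector below are made up), the library only parses HTML that has already been fetched, which is why a browser-automation tool like Selenium becomes necessary for JavaScript-heavy pages such as Twitter.

```python
import requests
from bs4 import BeautifulSoup

# Fetch a static page and parse the HTML (URL and selector are illustrative).
resp = requests.get("https://example.com/news")
soup = BeautifulSoup(resp.text, "html.parser")

# Collect every headline on the page. This works because the content is
# already in the HTML; a JavaScript-rendered feed (like Twitter's) would
# come back empty, which is where Selenium's real-browser automation helps.
headlines = [h.get_text(strip=True) for h in soup.select("h2")]
print(headlines)
```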

So what did I learn?

Time management is key. I spent a lot of time on this stage and didn't notice how much my lack of organization was costing me. After getting the credentials, I was overly excited to try as many techniques as possible to get the perfect data, instead of strategizing and considering all the factors that would be affected.

Communication. This would have been the time to communicate my issues to my mentor, but instead I worked through them by myself rather than seeking help. And as much as I reflected on this in my journal, I should have had it here on a blog so I could track my progress more easily.

Learn to be a quick learner. I trust the learning process, but what this stage taught me is to be swift: I should have acknowledged the time frame I had in hand and planned against it.

