Guest blog by Mór Kapronczay
If you are lucky, your data may come in a handy format, such as an Excel or CSV file. However, this is rarely the case: in most analyses, you first have to collect your data — generally from a website. This process is called web scraping.
It’s important to note that not all accessible data may be collected. Just because you can see something in your browser does not necessarily mean you are legally allowed to scrape it. Some websites actively protect themselves against web scrapers. Always make sure that what you do is legal! For instance, scraping Wikipedia is perfectly fine, while scraping social media websites is in most cases prohibited unless done through their public APIs.
It may sound intimidating, but basically scraping is just mimicking what your favorite browser does:
- Sends an HTTP request to a site.
- Parses the response it gets.
As mentioned, some websites protect their data from scrapers, but Wikipedia has no such protection and its information is free to use — you can scrape anything you want from it. Python even has a dedicated package for that. For didactic reasons, though, let’s not use the package but scrape the information the old-fashioned way!
Let’s say we want to assess the gender of composers and lyricists of anthems from around the world. We go to this site and press Ctrl+Shift+I (in Google Chrome), or right-click almost anywhere on the page and choose Inspect. This is what you will see (you may have to switch to the Elements tab in the upper panel on the right):
On the right-hand side, you can inspect the structure of the website, which will be important for how you parse your response. The purple text refers to the tag of this element, through which you will be able to find it when you parse the response of the page.
In this code snippet, you can see what I did in this case: send a request and parse the response into a searchable BeautifulSoup object. In this object, you can easily find the specific piece of information you are looking for. Here, each row corresponding to an anthem is stored in a <tr> tag, inside which <td> tags contain the specific information I need. Do not hesitate to check my GitHub for the full code!
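To make the parsing step concrete, here is a minimal sketch of the idea. The real code requests the Wikipedia anthem page; below, a small inline HTML sample stands in for the response (the sample table and the function name are my own, for illustration only), so the snippet focuses on how BeautifulSoup extracts the `<td>` cells from each `<tr>` row:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the HTML the site would return (illustrative only).
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Anthem</th><th>Author</th></tr>
  <tr><td>France</td><td>La Marseillaise</td><td>Claude Joseph Rouget de Lisle</td></tr>
  <tr><td>Japan</td><td>Kimigayo</td><td>Unknown</td></tr>
</table>
"""

def parse_anthem_rows(html):
    """Collect the text of the <td> cells in every <tr> row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row uses <th> instead of <td>, so it is skipped
            rows.append(cells)
    return rows

rows = parse_anthem_rows(SAMPLE_HTML)
```

In the real version, `html` would come from `requests.get(url).text`, but the parsing logic is exactly the same.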
To further clarify, here is what you need to do to collect the data you want:
- Navigate to the page where the information is to be found.
- Inspect the structure of the website, find the tags where the information is stored.
- Using Python, send an HTTP request to the site.
- Using the BeautifulSoup object created from the response and the structure learned in step 2, write the code that extracts and stores the information you need.
In order to make Python guess genders for us, the only thing we need to supply is a first name. gender-guesser is a Python package written for this purpose. It can return 6 different values: unknown (name not found), andy (androgynous), female, male, mostly_male, or mostly_female. The difference between andy and unknown is that the former means the name was found to be equally likely male or female, while the latter means the name wasn’t found in the database at all.
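The package’s basic usage looks like the sketch below. The `Detector().get_gender` API is gender-guesser’s real interface; the tiny fallback name table is my own assumption, there only so the snippet still runs if the package is not installed:

```python
try:
    import gender_guesser.detector as gender
    detector = gender.Detector()
    guess = detector.get_gender  # returns one of the 6 values listed above
except ImportError:
    # Illustrative stand-in only: a few hard-coded names, not the real database.
    _FALLBACK = {"john": "male", "anna": "female", "andrea": "andy"}
    def guess(first_name):
        return _FALLBACK.get(first_name.lower(), "unknown")

result = guess("John")
```

Note that the real detector is case-sensitive by default, so pass names capitalized as they normally appear (e.g. "John", not "john").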
In this snippet, you can see what I did. After instantiating the detector, I created a function that takes a pandas DataFrame column, extracts the first name, then performs gender guessing on it. Finally, it creates a new column with a “_gender” (or any arbitrary) suffix and fills it with the guessed genders.
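A sketch of such a helper could look as follows. The function and column names are my own, and the guesser is passed in as a plain callable (in practice you would pass gender-guesser’s `Detector().get_gender`); the stub used in the demo is an assumption for illustration:

```python
import pandas as pd

def add_gender_column(df, name_col, guesser, suffix="_gender"):
    """Extract the first name from `name_col`, guess its gender, and
    store the result in a new column named `name_col` + `suffix`."""
    first_names = df[name_col].astype(str).str.split().str[0]
    df[name_col + suffix] = first_names.apply(guesser)
    return df

# Demo with a stand-in guesser (the real one would be gender-guesser's Detector):
demo = pd.DataFrame({"composer": ["Claude Joseph Rouget de Lisle", "Anna Smith"]})
stub = {"Claude": "male", "Anna": "female"}
add_gender_column(demo, "composer", lambda name: stub.get(name, "unknown"))
```

Passing the guesser as an argument keeps the DataFrame logic independent of any particular gender-guessing library.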
Nevertheless, do not forget to check the results manually at the end! In some cases, you will have to do a Google search to clarify unknown or andy cases, and it is always good to double-check anyway. Still, these tools substantially speed up the process for any gender-related analysis you might want to carry out.
Hope you enjoyed reading this short blog! For feedback, reach out to me on Twitter or use the comments below.