Every data-related project starts with a question, one that you can answer with the data you have. As a data scientist, your goal is to find an answer to that question.
This project started the same way. My wife and I were thinking of a name for our daughter. We needed a name that is easy to pronounce in both English and Turkish, since my wife is American and I am from the Black Sea region of Turkey.
I found the US baby names dataset from the Social Security Administration (SSA). It can be downloaded directly from the SSA, where you get the US names for each year. However, I chose to download the dataset from Kaggle's website, because that version has only four columns (name, year, gender and count), which fit my research needs and are easier to work with. It is a 177MB download, and I used the NationalNames.csv file in this project.
The names run from 1880 to 2014, and there are 1,825,433 rows in the file. I could explore the data in Excel, but with close to two million rows it would be almost impossible to work with. This is where Python's pandas library and a Jupyter notebook come in handy. The dataframe looks like this in my notebook.
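Loading the file can be sketched roughly as follows. This is a minimal illustration rather than the exact notebook code: the column names `Name`, `Year`, `Gender` and `Count` are assumed from the description above, the helper `load_names` is hypothetical, and the three rows are made-up stand-ins so the snippet runs without the real file.

```python
import io

import pandas as pd

# Made-up stand-in for NationalNames.csv; the column names are an assumption
# based on the four columns described in the post (name, year, gender, count).
SAMPLE_CSV = """Name,Year,Gender,Count
Mary,1880,F,100
John,1880,M,120
Esma,2010,F,12
"""


def load_names(source):
    """Read a names CSV from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


names = load_names(io.StringIO(SAMPLE_CSV))
print(names.shape)   # (number of rows, number of columns)
print(names.head())  # first rows, as a notebook would display them
```

With the real download, `load_names("NationalNames.csv")` would take the place of the in-memory sample.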
Summing the count column, the dataset records 167,070,477 female names and 170,064,949 male names, for a total of 337,135,426. That is 2,994,472, or roughly 3 million, more male names than female names.
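Totals like these come from summing the count column per gender. Here is a small sketch of that step, assuming columns named `Gender` and `Count`; the helper `totals_by_gender` is hypothetical and the rows are made up for illustration, so the numbers printed are not the real totals.

```python
import pandas as pd

# Made-up toy rows standing in for the real dataset.
names = pd.DataFrame({
    "Name": ["Mary", "John", "Anna", "James"],
    "Year": [1880, 1880, 1881, 1881],
    "Gender": ["F", "M", "F", "M"],
    "Count": [100, 120, 80, 90],
})


def totals_by_gender(df):
    """Sum the Count column separately for each Gender value."""
    return df.groupby("Gender")["Count"].sum()


totals = totals_by_gender(names)
print(totals)
print("male minus female:", totals["M"] - totals["F"])
print("overall:", totals.sum())
```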
Once we knew there were millions of names in the data set, we wanted to see whether, and how often, specific names had been given to babies. There are 581,900 Nicoles and, not surprisingly, only 69 Numans. My wife's name was most popular around the time she was born.
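Looking up one name comes down to filtering the rows for that name and summing the counts, and the most popular year is the year with the largest yearly sum. The sketch below assumes the same `Name`, `Year` and `Count` columns; `total_for_name` and `peak_year` are hypothetical helpers, and the rows are invented, so the results do not match the real figures above.

```python
import pandas as pd

# Made-up toy rows; the real file has far more years per name.
names = pd.DataFrame({
    "Name": ["Nicole", "Nicole", "Nicole", "Numan", "Numan"],
    "Year": [1980, 1985, 1990, 1995, 2005],
    "Gender": ["F", "F", "F", "M", "M"],
    "Count": [300, 500, 400, 5, 7],
})


def total_for_name(df, name):
    """Total number of babies recorded with this name across all years."""
    return int(df.loc[df["Name"] == name, "Count"].sum())


def peak_year(df, name):
    """Year in which the name has its highest yearly count."""
    per_year = df[df["Name"] == name].groupby("Year")["Count"].sum()
    return int(per_year.idxmax())


print(total_for_name(names, "Nicole"), total_for_name(names, "Numan"))
print(peak_year(names, "Nicole"))
```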
My name is not a common name in this data set, so I will use a bar chart to display the data.
The Social Security Administration does not list a name in the dataset if there are fewer than 5 occurrences, but I did find my name.
Instead of searching for the most or least common names, we wanted to look into the names we are considering for our baby. Esma is at the top of our list.
Apparently, Esma has been becoming more popular since 2000.
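A trend like this can be read off a per-year series for a single name. The following is a rough sketch, again assuming `Name`, `Year` and `Count` columns; `counts_per_year` is a hypothetical helper and the counts are made up purely to show the shape of the result.

```python
import pandas as pd

# Made-up toy rows; these are not the real counts for Esma or Joann.
names = pd.DataFrame({
    "Name": ["Esma", "Esma", "Esma", "Esma", "Joann"],
    "Year": [2005, 1995, 2010, 2000, 1936],
    "Gender": ["F", "F", "F", "F", "F"],
    "Count": [9, 5, 14, 6, 2000],
})


def counts_per_year(df, name):
    """Return a Series of yearly counts for one name, indexed by year."""
    selected = df[df["Name"] == name]
    return selected.groupby("Year")["Count"].sum().sort_index()


esma = counts_per_year(names, "Esma")
print(esma)
```

In a notebook with matplotlib installed, `esma.plot.bar()` would draw this series as a bar chart.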
Joann is the name of my wife’s grandmother. She was born in 1936, which is around the time that her name was popular.
Our baby was born on December 3rd and her name is Esma Joann.