Introduction to Python Data Analysis

17/9/2021 7-minute read

Introduction to Python

Before we begin running any analysis, let's import a few libraries that will be useful along the way. We will import pandas and numpy for data manipulation and analysis, and matplotlib and seaborn for visualisations. The InteractiveShell setting makes the notebook display the result of every expression in a cell rather than only the last one.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

To begin the analysis we will load our dataset using the read_csv function from the pandas library. We can then use the head function to preview our data.

data = pd.read_csv("/Users/jonahthomas/R_projects/academic_blog/content/english/post/2021-09-06-introduction-to-data-analysis-in-python/basic_income_dataset_dalia.csv")
data.head()
country_code uuid age gender rural dem_education_level dem_full_time_job dem_has_children question_bbi_2016wave4_basicincome_awareness question_bbi_2016wave4_basicincome_vote question_bbi_2016wave4_basicincome_effect question_bbi_2016wave4_basicincome_argumentsfor question_bbi_2016wave4_basicincome_argumentsagainst age_group weight
AT f6e7ee00-deac-0133-4de8-0a81e8b09a82 61 male rural no no no I know something about it I would not vote None of the above None of the above None of the above 40_65 1.105.534.474
AT 54f0f1c0-dda1-0133-a559-0a81e8b09a82 57 male urban high yes yes I understand it fully I would probably vote for it A basic income would not affect my work choices It increases appreciation for household work a… It might encourage people to stop working 40_65 1.533.248.826
AT 83127080-da3d-0133-c74f-0a81e8b09a82 32 male urban NaN no no I have heard just a little about it I would not vote ‰Û_ gain additional skills It creates more equality of opportunity Foreigners might come to my country and take a… 26_39 0.9775919155
AT 15626d40-db13-0133-ea5c-0a81e8b09a82 45 male rural high yes yes I have heard just a little about it I would probably vote for it ‰Û_ work less It reduces anxiety about financing basic needs None of the above 40_65 1.105.534.474
AT 24954a70-db98-0133-4a64-0a81e8b09a82 41 female urban high yes yes I have heard just a little about it I would probably vote for it None of the above It reduces anxiety about financing basic needs It is impossible to finance It might encoura… 40_65
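
As a minimal sketch of the loading step, the snippet below reads a tiny CSV from an in-memory string instead of a file on disk. The three rows and four columns are a made-up extract for illustration, not the real dataset.

```python
import io
import pandas as pd

# Hypothetical three-row extract standing in for the real CSV file.
csv_text = """country_code,age,gender,rural
AT,61,male,rural
AT,57,male,urban
AT,32,male,urban
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 by default).
preview = sample.head(2)
print(preview)

# dtypes shows how pandas interpreted each column.
print(sample.dtypes)
```

Checking dtypes straight after loading is a quick way to spot numeric columns that were read in as text.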

Now we have loaded the data, we can explore some of the variables. For the purpose of demonstration, we will look at age (a numeric variable), gender, rural and dem_full_time_job (categorical variables) using the describe function.

data['age'].describe()
data['gender'].describe()
data['rural'].describe()
data['dem_full_time_job'].describe()
count    9649.000000
mean       37.712716
std        12.270630
min        14.000000
25%        28.000000
50%        40.000000
75%        46.000000
max        65.000000
Name: age, dtype: float64

count     9649
unique       2
top       male
freq      5094
Name: gender, dtype: object

count      9649
unique        2
top       urban
freq       6878
Name: rural, dtype: object

count     9649
unique       2
top        yes
freq      5702
Name: dem_full_time_job, dtype: object

As we can see above, the describe function gives a summary of each variable, whether it is numeric or categorical. There are 9649 responses to the questionnaire in total, with an average respondent age of 37.7 years. 5094 respondents were male, 6878 lived in an urban area and 5702 were in full-time employment. Whilst this information gives us a basic understanding of our data, we may also be interested in whether gender has an effect on whether a respondent is employed full time.
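
To make the difference between the two kinds of summary concrete, here is a small sketch on a made-up four-row dataframe (the values are illustrative only). For a numeric column describe reports count, mean, spread and quartiles; for a text column it reports count, number of unique values, the most common value and its frequency.

```python
import pandas as pd

# Made-up data for illustration; not taken from the survey.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "gender": ["male", "female", "male", "male"],
})

# Numeric column: count, mean, std, min, quartiles, max.
age_summary = toy["age"].describe()

# Text column: count, unique, top (most frequent value), freq.
gender_summary = toy["gender"].describe()

print(age_summary)
print(gender_summary)
```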

data.groupby('gender').agg({'dem_full_time_job': 'count'})
        dem_full_time_job
gender
female  4555
male    5094

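Note that the 'count' aggregation counts the non-missing values of dem_full_time_job in each group, whatever those values are. These figures are therefore the number of respondents of each gender (4555 women and 5094 men), not the number in full-time employment; to compare employment itself we would need to count only the 'yes' responses. We could also group by multiple variables if we wanted to see how they jointly relate to another variable.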

data.groupby(['gender', 'rural']).agg({'dem_full_time_job': 'count'})
               dem_full_time_job
gender rural
female rural   1410
       urban   3145
male   rural   1361
       urban   3733

We can see from this summary that the numbers of female and male respondents in rural areas are very similar (1410 women and 1361 men), whereas the difference in urban areas is starker, with roughly 600 more men than women. Because 'count' tallies every non-missing response, these are respondent counts rather than counts of people in full-time employment. It is also worth noting that these findings are in no way causal; they are simply an exploration to show potentially interesting trends.
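
If the goal is to compare full-time employment between groups, one option is to convert the yes/no answers into a boolean flag and aggregate that. The sketch below uses a made-up five-row dataframe; the column name dem_full_time_job follows the dataset, while the employed flag and the output column names are my own choices.

```python
import pandas as pd

# Made-up data for illustration; not taken from the survey.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "male"],
    "dem_full_time_job": ["yes", "no", "yes", "yes", "no"],
})

# Turn the yes/no answers into a boolean flag.
toy["employed"] = toy["dem_full_time_job"].eq("yes")

summary = toy.groupby("gender").agg(
    respondents=("employed", "size"),  # rows in the group
    employed=("employed", "sum"),      # number answering "yes"
    share=("employed", "mean"),        # proportion answering "yes"
)
print(summary)
```

The same pattern extends to several grouping variables by passing a list such as ['gender', 'rural'] to groupby.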

Basic Data Visualisation

data['age'].plot.hist(bins = 10)

[Figure: histogram of respondent age]

vote_against = pd.DataFrame(data['question_bbi_2016wave4_basicincome_argumentsagainst'].str.split('|').explode())
vote_against['question_bbi_2016wave4_basicincome_argumentsagainst'] = vote_against['question_bbi_2016wave4_basicincome_argumentsagainst'].str.lower()
vote_against['question_bbi_2016wave4_basicincome_argumentsagainst'] = vote_against['question_bbi_2016wave4_basicincome_argumentsagainst'].str.strip()
vote_against = vote_against.groupby('question_bbi_2016wave4_basicincome_argumentsagainst').agg(count=('question_bbi_2016wave4_basicincome_argumentsagainst', 'count')).sort_values('count', ascending = False)
vote_against.index.name = 'argument_against'
vote_against.reset_index(inplace=True)
ax = sns.barplot(x = "count", y = "argument_against", data = vote_against)

[Figure: bar chart of how often each argument against a basic income was chosen]

sns.catplot(x = 'age', y = 'question_bbi_2016wave4_basicincome_vote', kind = 'box', data = data)

[Figure: box plots of respondent age for each voting intention]

This boxplot highlights one key feature of our data: those who say they would not vote tend to be younger than those who say they would vote. Amongst those who said they would vote, the direction of the vote does not appear to depend strongly on age.
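
The central line of each box is the group median, so the same comparison can be checked numerically. The following is a small sketch on made-up ages; the column name matches the dataset, but the rows and the exact response wording are illustrative assumptions.

```python
import pandas as pd

# Made-up data for illustration; not taken from the survey.
toy = pd.DataFrame({
    "age": [18, 22, 35, 41, 50, 38],
    "question_bbi_2016wave4_basicincome_vote": [
        "I would not vote",
        "I would not vote",
        "I would vote for it",
        "I would vote for it",
        "I would vote against it",
        "I would vote against it",
    ],
})

# Median age and group size for each voting intention.
age_by_vote = (
    toy.groupby("question_bbi_2016wave4_basicincome_vote")["age"]
    .agg(["median", "count"])
)
print(age_by_vote)
```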

Would the vote pass?

data['vote'] = data['question_bbi_2016wave4_basicincome_vote'].str.contains("for")
vote = data.groupby('vote').agg(count = ('vote', 'count'))
vote.index.name = 'vote'
vote.reset_index(inplace=True)
sns.barplot(x = 'vote', y = 'count', data = vote)

[Figure: bar chart of the number of respondents by vote (True or False)]

From the graph above, it looks likely that the vote would pass. To calculate this, we used a simplistic approach. First, we create a column called vote that is True if the respondent's answer contains the word "for", that is, if they say they would vote for, or would probably vote for, a basic income. All other answers are set to False, including those from respondents who say they would not vote at all. The graph shows that the majority of respondents would vote for a basic income, suggesting the vote would pass.
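
Here is a small sketch of how the flag behaves. The five response strings are illustrative; only some of them appear in the preview of the data earlier in the post, and the rest are assumed wordings.

```python
import pandas as pd

# Illustrative answers; the exact wording in the dataset may differ.
responses = pd.Series([
    "I would vote for it",
    "I would probably vote for it",
    "I would probably vote against it",
    "I would vote against it",
    "I would not vote",
])

# str.contains does a substring match, so any answer containing
# the text "for" is flagged True.
vote = responses.str.contains("for")
print(vote.tolist())

# The mean of a boolean Series is the share of True values.
share_for = vote.mean()
print(share_for)
```

Because this is a substring match, it would also flag any answer that happens to contain "for" inside another word, so it is worth checking the distinct values of the column before relying on it.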

We could also now look at this voting status by age group and see the differences.

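# Count respondents in each combination of vote and age group
vote_age = data.groupby(['vote', 'age_group']).agg(count = ('vote', 'count'))
# Turn the (vote, age_group) index levels back into ordinary columns
vote_age.reset_index(inplace=True)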
sns.barplot(x = 'age_group', y = 'count', hue = 'vote', data = vote_age)

[Figure: bar chart of respondent counts by age group, split by vote]

The most notable pattern from this graph is that the majority of respondents come from the 40-65 age group. It also suggests that individuals in this group may be slightly more likely to vote for a basic income than younger individuals (shown by the greater height difference between the bars), although raw counts are hard to compare across groups of different sizes.
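
Since the age groups differ in size, comparing the share of "for" votes within each group can be easier to read than comparing bar heights. This is a minimal sketch on made-up rows; the 14_25 label is an assumed name for the youngest group.

```python
import pandas as pd

# Made-up data for illustration; not taken from the survey.
toy = pd.DataFrame({
    "age_group": ["14_25", "14_25", "26_39", "26_39",
                  "40_65", "40_65", "40_65", "40_65"],
    "vote": [True, False, True, False, True, True, True, False],
})

# The mean of a boolean column within each group is the
# proportion of that group flagged as voting for.
share_for = toy.groupby("age_group")["vote"].mean()
print(share_for)
```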

Now we have looked at voting status by age, we can also look at voting status by whether the respondent has children.

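# Count respondents in each combination of vote and parental status
vote_child = data.groupby(['vote', 'dem_has_children']).agg(count = ('vote', 'count'))
# Turn the (vote, dem_has_children) index levels back into ordinary columns
vote_child.reset_index(inplace=True)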
sns.barplot(x = 'dem_has_children', y = 'count', hue = 'vote', data = vote_child)

[Figure: bar chart of respondent counts by whether the respondent has children, split by vote]

We can see from this that respondents with children seem slightly more likely to vote for a basic income than those without. Again, this is not a causal relationship, just an exploration of the dataset.

A basic model

Python has a number of libraries that allow us to build models to represent our data. The following is by no means an in-depth look at any of these libraries, merely an initial look at a basic linear model. I hope to create a more comprehensive post on modelling in Python in the near future.

Today, we will use Python's scikit-learn library to create a linear model. We will start with a very simple model relating the way an individual votes to their age.

from sklearn.linear_model import LinearRegression

vote = np.array(data['vote']).reshape(-1,1)
age = np.array(data['age']).reshape(-1,1)

model = LinearRegression()
model.fit(vote, age)

print('Intercept:', model.intercept_)
print('Coefficient:', model.coef_)
print('R squared:', model.score(vote, age))
LinearRegression()

Intercept: [37.57560427]
Coefficient: [[0.21720479]]
R squared: 7.294238969335343e-05

Let's talk through this one line at a time. First, we import the linear regression model from the scikit-learn library. Next, we take the vote and age columns from our dataframe and reshape each into a single column. Then, we create a linear regression model and use the fit function to fit it to the vote and age data. Note that fit expects the predictors first and the target second, so as written the model predicts age from vote rather than vote from age.

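We can then examine the model intercept and coefficient, which we print as the output. Because vote is the predictor here, the intercept (about 37.6) is the fitted age of respondents who would not vote for a basic income, and the coefficient (about 0.22) is the estimated difference in age for those who would. Finally, we can examine the R squared value, which for a model with a single predictor is the same whichever variable is treated as the outcome. At roughly 0.00007, it shows that age explains a very small amount of the variation in voting pattern (less than 0.01%). Therefore, further analysis would need to look at other variables to explain the trends in voting pattern.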
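
The role of argument order in fit can be illustrated with a sketch on made-up numbers (six invented respondents, not the survey data). It fits the regression in both directions and compares the R squared values with the squared correlation between the two variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration: a 0/1 vote flag and ages.
vote = np.array([0, 0, 0, 1, 1, 1], dtype=float)
age = np.array([30.0, 35.0, 40.0, 32.0, 38.0, 44.0])

# fit(X, y): X holds the predictors (2-D), y holds the target.
X_vote = vote.reshape(-1, 1)
X_age = age.reshape(-1, 1)

# Direction 1: predict age from vote.
age_on_vote = LinearRegression().fit(X_vote, age)
# Direction 2: predict vote from age.
vote_on_age = LinearRegression().fit(X_age, vote)

print("Intercept:", age_on_vote.intercept_)  # mean age where vote == 0
print("Coefficient:", age_on_vote.coef_[0])  # difference in mean age

# With one predictor, R squared equals the squared correlation,
# so it does not depend on the direction of the regression.
r2_age_on_vote = age_on_vote.score(X_vote, age)
r2_vote_on_age = vote_on_age.score(X_age, vote)
r_squared = np.corrcoef(vote, age)[0, 1] ** 2
print(r2_age_on_vote, r2_vote_on_age, r_squared)
```

The coefficients differ between the two directions even though R squared does not, so the order of the arguments matters when interpreting the intercept and slope.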

Conclusions

Above we have walked through a very basic data analysis in Python. We started by creating some summary statistics, then created some visualisations of our data, and finally built a basic linear model. Whilst this analysis is by no means extensive, I hope it highlights how a simple dataset can be explored using Python. In future, I hope to perform some more complex analyses in Python, including experimenting with machine learning models.