This report looks at the financial contributions made to presidential campaigns in the state of New York for 2016. The primary dataset was downloaded from datasource. I also created a list of cities in New York with their latitude and longitude, which was extracted from city_datasource.
The financial contribution dataset (after cleaning) contains 167,902 contributions and 23 variables, made up of:
Variable | Name | Meaning / Use |
---|---|---|
cmte_id | Committee ID | A 9-character alpha-numeric code assigned to a committee by the Federal Election Commission. |
cand_id | Candidate ID | A 9-character alpha-numeric code assigned to a candidate by the Federal Election Commission. |
cand_nm | Candidate Name | Recorded name of the candidate |
contbr_nm | Contributor Name | Reported name of the contributor. |
contbr_city | Contributor City | Reported city of the contributor |
contbr_state | Contributor State | Reported state of the contributor |
contbr_zip | Contributor Zip Code | Reported zip code of the contributor |
contbr_employer | Contributor Employer | Reported employer of the contributor |
contbr_occupation | Contributor Occupation | Reported occupation of the contributor |
contb_receipt_amt | Contribution Receipt Amount | Reported contribution amount |
contb_receipt_dt | Contribution Receipt Date | Reported contribution date |
receipt_desc | Receipt Description | Additional information reported by the committee about a specific contribution |
memo_cd | Memo Code | ‘X’ indicates the committee has provided additional text to describe a specific contribution |
memo_text | Memo Text | Additional information reported by the committee about a specific contribution |
form_tp | Form Type | Indicates what schedule and line number the reporting committee reported a specific transaction |
file_num | File Number | A unique number assigned to a report and all its associated transactions |
tran_id | Transaction ID | A unique identifier for each transaction |
election_tp | Election Type | This code indicates the election for which the contribution was made. EYYYY (election plus election year) |
To help with the analysis, I added some additional fields to the dataset:
Variable | Meaning / Use |
---|---|
month | used for grouping data by month |
week | used for grouping data by week |
year | used for grouping data by year |
latitude | stores the latitude based on the reported contribution city |
longitude | stores the longitude based on the reported contribution city |
employment_status | stores the employment status of each contributor based on the listed employer |
Note: To attach a latitude and longitude to each contribution, I needed to match the cities in the cities dataframe to the cities in the financial dataframe. Initially some of the names did not match up. I used a Python script from a previous project to create matches and fix differences in the spelling of cities.
Once I had added the additional fields, I was able to start analysing the data. To help with this I created a couple of groupings / summaries of the data:
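The original matching script is not shown in this report. The sketch below is only one way such a match could be done, using `difflib.get_close_matches` from the Python standard library. The `CITY_COORDS` table, its rough coordinates, the `normalise` and `match_city` helpers and the `cutoff` value are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical reference table: city -> (latitude, longitude).
# Coordinates are rough illustrative values, not taken from the report's data.
CITY_COORDS = {
    "NEW YORK": (40.71, -74.01),
    "BROOKLYN": (40.68, -73.94),
    "ROCHESTER": (43.16, -77.61),
    "STONY BROOK": (40.93, -73.14),
}


def normalise(name):
    """Upper-case the name, drop full stops and collapse repeated spaces."""
    return " ".join(name.upper().replace(".", " ").split())


def match_city(reported, cutoff=0.8):
    """Return the closest reference city name, or None if nothing is close enough."""
    key = normalise(reported)
    if key in CITY_COORDS:
        return key
    candidates = get_close_matches(key, list(CITY_COORDS), n=1, cutoff=cutoff)
    return candidates[0] if candidates else None
```

A misspelling such as `"Brookyln"` resolves to `"BROOKLYN"`, while a name with no close reference entry returns `None` and can be reviewed by hand. A lower `cutoff` accepts more matches at the cost of more false positives.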
The table below provides details for each of the candidates
cand_id | cand_nm |
---|---|
P60008059 | Bush, Jeb |
P60005915 | Carson, Benjamin S. |
P60008521 | Christie, Christopher J. |
P00003392 | Clinton, Hillary Rodham |
P60006111 | Cruz, Rafael Edward ‘Ted’ |
P60007242 | Fiorina, Carly |
P60007697 | Graham, Lindsey O. |
P80003478 | Huckabee, Mike |
P60008398 | Jindal, Bobby |
P60003670 | Kasich, John R. |
P60009685 | Lessig, Lawrence |
P60007671 | O’Malley, Martin Joseph |
P60007572 | Pataki, George E. |
P40003576 | Paul, Rand |
P20003281 | Perry, James R. (Rick) |
P60006723 | Rubio, Marco |
P60007168 | Sanders, Bernard |
P20002721 | Santorum, Richard J. |
P20003984 | Stein, Jill |
P80001571 | Trump, Donald J. |
P60006046 | Walker, Scott |
P60008885 | Webb, James Henry Jr. |
The main features of this dataset are the candidate and the value of the contributions that they received. The data below shows the breakdown of the contributions.
## Total Value of Contributions: 46072566
## Total Number of Contributions: 167902
## Average Value of Contribution: 274.4015
## Maximum Contribution Value: 10800
## Minimum Contribution Value: 0.08
## Number of Candidates: 22
## Number of Contributors: 35955
The plot and table below show that two candidates received far more contributions than the rest. Together, Bernard Sanders (44.5%) and Hillary Clinton (35.5%) received about 80% of the total number of contributions:
cand_nm | contribution_count | percent_count |
---|---|---|
Clinton, Hillary Rodham | 59621 | 35.509404 |
Sanders, Bernard | 74676 | 44.475944 |
Bush, Jeb | 2582 | 1.537802 |
Rubio, Marco | 5068 | 3.018427 |
Cruz, Rafael Edward ‘Ted’ | 10730 | 6.390633 |
The first plot in the group above shows that the data is right-skewed, with the majority of contributions being less than or equal to $500. To see the spread of the data better, I performed a log transform on the transaction amount, which can be seen in the second plot of the group. The red bar on each of the plots above shows the mean contribution amount of $274.40 and the blue line shows the median contribution amount of $50.00. The summary statistics of the contribution amounts can be seen below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.08 25.00 50.00 274.40 100.00 10800.00
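The gap between the mean and the median in the summary above is typical of right-skewed data: a few large contributions pull the mean up while the median stays near the typical amount. The sketch below illustrates this, and the effect of a log10 transform, on a small hypothetical sample. The `amounts` values are invented for illustration and are not taken from the dataset.

```python
import math
import statistics

# Hypothetical sample of contribution amounts (not taken from the dataset).
amounts = [10, 25, 25, 50, 50, 75, 100, 250, 2700, 5400]

# The two largest values drag the mean well above the median.
mean_amt = statistics.mean(amounts)
median_amt = statistics.median(amounts)

# A log10 transform compresses the long right tail, so small and large
# amounts can be shown on the same axis.
log_amounts = [math.log10(a) for a in amounts]

print(f"mean={mean_amt:.2f} median={median_amt:.2f}")
print(f"raw range: {min(amounts)} to {max(amounts)}")
print(f"log10 range: {min(log_amounts):.2f} to {max(log_amounts):.2f}")
```

On the log scale, equal distances represent equal ratios rather than equal dollar differences, which is why the axis labels need to be read carefully.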
Other features in the dataset that would be useful to investigate are:
When first looking at the employer and occupation of the contributors, I could see that there were:
## Employers: 14735
## Occupations: 6732
This made it difficult to determine whether there were any distinct patterns, as some of these could be similar occupations with different titles, or the same employer recorded under different names. To look for patterns, I created an additional variable in the dataset for employment status, based on the listed employer. From the plot and tables below we can see that the bulk (55% by count) of the contributions came from contributors who were employed at the time of the contribution.
employment_status | contribution_count | contribution_value | avg_contribution | max_contribution | percent_amount |
---|---|---|---|---|---|
EMPLOYED | 92693 | 29817871 | 321.68418 | 10800 | 55.206609 |
SELF EMPLOYED | 25169 | 6525750 | 259.27727 | 10800 | 14.990292 |
NOT EMPLOYED | 20650 | 1731867 | 83.86767 | 5400 | 12.298841 |
RETIRED | 15542 | 2525894 | 162.52052 | 5400 | 9.256590 |
UNKNOWN | 13848 | 5471183 | 395.08835 | 10800 | 8.247668 |
employment_status | percent_amount |
---|---|
EMPLOYED | 55.206609 |
SELF EMPLOYED | 14.990292 |
NOT EMPLOYED | 12.298841 |
RETIRED | 9.256590 |
UNKNOWN | 8.247668 |
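The report does not list the rules used to derive `employment_status` from the employer field. The sketch below shows one possible mapping that produces the five categories in the tables above. The keyword rules and the `employment_status` function are hypothetical and would need tuning against the real employer values.

```python
def employment_status(employer):
    """Map a free-text employer value to a coarse status (hypothetical rules)."""
    text = (employer or "").strip().upper()
    # Missing or placeholder values: no employer was recorded.
    if text in {"", "N/A", "INFORMATION REQUESTED"}:
        return "UNKNOWN"
    # Covers spellings such as "SELF", "SELF-EMPLOYED" and "SELF EMPLOYED".
    if "SELF" in text:
        return "SELF EMPLOYED"
    if text == "RETIRED":
        return "RETIRED"
    if text in {"NOT EMPLOYED", "UNEMPLOYED"}:
        return "NOT EMPLOYED"
    # Anything else is treated as a named employer.
    return "EMPLOYED"
```

Collapsing thousands of distinct employer strings into a handful of categories makes group comparisons possible, at the cost of losing detail about individual employers.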
The plot below shows that contributions were generally spread across the state of New York, with a couple of districts (Capital District and Central New York) showing a larger number of contributions. This probably reflects how the population is spread across the state and where businesses are generally located.
contbr_city | contribution_count | contribution_value |
---|---|---|
BROOKLYN | 40650 | 7873430.7 |
NEW YORK | 37610 | 20274071.1 |
STONY BROOK | 4922 | 452823.3 |
BRONX | 2098 | 389668.5 |
ROCHESTER | 1788 | 180695.5 |
## Min. 1st Qu. Median Mean 3rd Qu.
## "2013-10-11" "2015-10-19" "2016-01-13" "2015-12-08" "2016-02-11"
## Max.
## "2016-02-29"
From this plot we can see that the number of contributions increased over time. However, the data and summary show an outlier in 2013. I believe this could be due to data entry errors or to data being recorded later than the transaction occurred.
When performing the initial review of the data, I found a couple of outliers that affected the spread of the contribution values. These outliers included:
To help see the spread of contributions, I also applied either a SQRT or LOG10 transform to the scales when the data was too close together to analyse and interpret. This helped to show the data in more detail. The risk with transformed scales is that the plots can be misinterpreted, so the axis scales need to be read carefully.
The first relationship I analysed was between each candidate and the total value of the contributions they received. The next plot shows the total value of contributions per candidate. It shows that whilst candidate P60007168 (Bernard Sanders) had the highest number of contributions, the candidate with the highest value of contributions was P00003392 (Hillary Clinton), who made up 56% of the total contribution amount.
cand_nm | contribution_value | avg_contribution | max_contribution | min_contribution | median_contribution | percent_amount |
---|---|---|---|---|---|---|
Clinton, Hillary Rodham | 25716436.18 | 431.33185 | 5400 | 0.08 | 75.00 | 55.8172435 |
Sanders, Bernard | 5691346.41 | 76.21386 | 5400 | 1.00 | 35.00 | 12.3530052 |
Bush, Jeb | 4086579.31 | 1582.71856 | 5400 | 1.00 | 2700.00 | 8.8698757 |
Rubio, Marco | 3433298.04 | 677.44634 | 10800 | 2.00 | 100.00 | 7.4519359 |
Cruz, Rafael Edward ‘Ted’ | 1579858.70 | 147.23753 | 10800 | 1.00 | 50.00 | 3.4290660 |
Christie, Christopher J. | 1085187.00 | 2009.60556 | 6000 | 1.00 | 2700.00 | 2.3553865 |
Carson, Benjamin S. | 878933.94 | 101.68139 | 5400 | 1.00 | 50.00 | 1.9077165 |
Kasich, John R. | 599840.00 | 938.71674 | 5000 | 3.00 | 250.00 | 1.3019462 |
Graham, Lindsey O. | 530982.07 | 1613.92726 | 7700 | 25.00 | 1000.00 | 1.1524908 |
Fiorina, Carly | 417277.23 | 302.37480 | 5400 | 3.00 | 75.00 | 0.9056957 |
O’Malley, Martin Joseph | 362917.00 | 760.83229 | 2700 | 3.00 | 250.00 | 0.7877074 |
Walker, Scott | 350576.00 | 1608.14679 | 10800 | 50.00 | 1000.00 | 0.7609214 |
Pataki, George E. | 346145.82 | 1625.09775 | 2700 | 20.16 | 1350.00 | 0.7513057 |
Paul, Rand | 302144.35 | 209.67686 | 7300 | 1.00 | 50.00 | 0.6558010 |
Huckabee, Mike | 268089.50 | 837.77969 | 5400 | 1.00 | 250.00 | 0.5818853 |
Trump, Donald J. | 208977.00 | 322.49537 | 2700 | 1.00 | 203.98 | 0.4535823 |
Lessig, Lawrence | 94926.59 | 571.84693 | 2700 | 6.98 | 250.00 | 0.2060371 |
Jindal, Bobby | 32426.42 | 1473.92818 | 2700 | 11.10 | 1900.00 | 0.0703812 |
Webb, James Henry Jr. | 28700.00 | 610.63830 | 2700 | 50.00 | 250.00 | 0.0622930 |
Santorum, Richard J. | 27980.10 | 417.61343 | 2700 | 8.00 | 200.00 | 0.0607305 |
Perry, James R. (Rick) | 17200.00 | 593.10345 | 2700 | 25.00 | 250.00 | 0.0373324 |
Stein, Jill | 12744.00 | 283.20000 | 2800 | 10.00 | 100.00 | 0.0276607 |
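A per-candidate summary like the table above can be built by grouping the contribution amounts by candidate and then computing the total, mean, median and share of the overall value for each group. The sketch below shows the idea with the standard library only. The `rows` data and the `summarise` helper are hypothetical, and the column names are reused from the table purely for readability.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (cand_nm, contb_receipt_amt) rows; not real records.
rows = [
    ("Candidate A", 25.0), ("Candidate A", 50.0), ("Candidate A", 75.0),
    ("Candidate B", 2700.0), ("Candidate B", 300.0),
]


def summarise(rows):
    """Group amounts by candidate and compute per-candidate statistics."""
    by_candidate = defaultdict(list)
    for name, amount in rows:
        by_candidate[name].append(amount)

    grand_total = sum(amount for _, amount in rows)
    summary = {}
    for name, amounts in by_candidate.items():
        total = sum(amounts)
        summary[name] = {
            "contribution_count": len(amounts),
            "contribution_value": total,
            "avg_contribution": mean(amounts),
            "median_contribution": median(amounts),
            # Share of the overall dollar value, as a percentage.
            "percent_amount": 100 * total / grand_total,
        }
    return summary


summary = summarise(rows)
```

In this toy data, Candidate A has more contributions but Candidate B has most of the value, which is the same count-versus-value contrast seen between candidates in the table.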
In order to see the spread of the value of the contributions I applied a SQRT coordinate transformation on the y-axis, which can be seen on the plot below.
The plot below shows the spread of contribution values and the number of times each value was contributed. From this plot we can start to see that as the contribution value increases, the number of contributions at that value decreases.
The plots below show that the bulk of the total contributions were made by employed contributors, however the highest average contribution came from the unknown employment status. This status is made up of contributors who did not have an employer recorded against their contribution. The employed status also had the highest number of contributions, which is consistent with this group having the highest contribution total but not the highest average. The unknown group has the lowest number of contributions, which pushes up its average contribution value.
When looking at the contribution values by location we can see similar patterns that occurred with the count of contributions by location. The areas with the higher value correlate with the locations with the higher number of contributions.
The strongest relationship that I observed was in the number of contributions received over time. As the campaigning process progresses, the number of contributions increases. What I had expected, but did not see, was an increase in the total amount contributed in each time period.
During this part of the analysis, I decided to see how the timing or progression of the campaign affected the amount and number of contributions. First, I wanted to see whether some buckets / bins of contribution amounts increased or decreased more than others. To determine this I broke the contribution amounts into quartiles. From the plot below I could see that whilst the value of contributions per month is increasing for each quartile, the greatest increase occurs in the lowest bucket (0-25).
Next, I wanted to look at the top 5 candidates and see how their total contribution amounts varied over time. The plots below show that most of the top 5 candidates have ups and downs, with all candidates dropping around the holiday season. The candidate with the most consistent growth trend was P60007168 (Sanders, Bernard). The only candidate with the reverse trend was P60008059 (Bush, Jeb). The heatmaps further down show a similar story for all candidates.
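The quartile buckets described above can be sketched as follows, using the quartile values from the earlier summary (25, 50 and 100) as break points. Whether each break point belongs to the lower or the upper bucket is not stated in the report, so the right-closed intervals here are an assumption, as are the `rows` data and the helper names.

```python
from bisect import bisect_left
from collections import defaultdict

# Break points from the summary statistics (1st quartile, median, 3rd quartile).
BREAKS = [25.0, 50.0, 100.0]
LABELS = ["0-25", "25-50", "50-100", "100-10800"]


def bucket(amount):
    """Return the bucket label; each bucket includes its upper bound (assumed)."""
    return LABELS[bisect_left(BREAKS, amount)]


def totals_by_month_and_bucket(rows):
    """Sum contribution amounts for every (month, bucket) pair."""
    totals = defaultdict(float)
    for month, amount in rows:
        totals[(month, bucket(amount))] += amount
    return dict(totals)


# Hypothetical (month, amount) rows; not real records.
rows = [
    ("2016-01", 10.0), ("2016-01", 25.0), ("2016-01", 2700.0),
    ("2016-02", 50.0), ("2016-02", 75.0),
]
totals = totals_by_month_and_bucket(rows)
```

Plotting one line per bucket from `totals` gives the month-by-month comparison between quartiles that the paragraph above describes.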
I also wanted to see if different candidates received contributions from different areas more than others, however the plots below show that the top 5 candidates were receiving contributions from similar areas. This may be based on the population spread in New York.
The plot below shows us the following:
The plot below shows that for most of the candidates, the bulk of their total contribution value came from contributions between $100 and $10,800. Compared with the contributions over time, this shows that people were generally contributing higher amounts earlier in the campaign, with the smaller values coming towards the end of the campaign.
The plot below shows the percentage of both the number of contributions and the value of contributions per candidate. From this plot we can see that whilst Bernard Sanders received the highest percentage of the number of contributions, he did not receive the highest value. This shows that his contributions were of lower value compared to those of other candidates, like Hillary Clinton. This is also supported by the quartile plot above, as the bulk of Bernard Sanders's contributions were in the lowest quartile, whereas Hillary Clinton's contribution values were more spread out. Other candidates like Jeb Bush and Marco Rubio received a low number of contributions but of higher value, which pushed their percentage of the total value up while keeping their percentage of the total count lower.
When looking at the location of where contributions are made from, I believe the greatest benefit in this would occur when looking at the USA overall, as we would be able to draw a link between the numbers of contributions and the popularity of each candidate by state.
Another aspect that could have been useful to analyse is the gender and age of the contributors, as this could have helped to show whether some groups of people were more likely to contribute to certain candidates than others.
The hardest part of this investigation was deciding which data to compare. I found that there was greater benefit in grouping the contributions into categories, for example by employment status.