Introduction

This report examines the financial contributions made to presidential campaigns in the state of New York for the 2016 election. The primary dataset was downloaded from datasource; I also created a list of cities in New York along with their latitude and longitude, which was extracted from city_datasource.

Data Structure

The Financial Contribution dataset (after cleaning) contains 167,902 records and 23 variables, made up of:

Variable           Name                         Meaning / Use
cmte_id            Committee ID                 A 9-character alphanumeric code assigned to a committee by the Federal Election Commission.
cand_id            Candidate ID                 A 9-character alphanumeric code assigned to a candidate by the Federal Election Commission.
cand_nm            Candidate Name               Recorded name of the candidate.
contbr_nm          Contributor Name             Reported name of the contributor.
contbr_city        Contributor City             Reported city of the contributor.
contbr_state       Contributor State            Reported state of the contributor.
contbr_zip         Contributor Zip Code         Reported zip code of the contributor.
contbr_employer    Contributor Employer         Reported employer of the contributor.
contbr_occupation  Contributor Occupation       Reported occupation of the contributor.
contb_receipt_amt  Contribution Receipt Amount  Reported contribution amount.
contb_receipt_dt   Contribution Receipt Date    Reported contribution date.
receipt_desc       Receipt Description          Additional information reported by the committee about a specific contribution.
memo_cd            Memo Code                    ‘X’ indicates the committee has provided additional text to describe a specific contribution.
memo_text          Memo Text                    Additional information reported by the committee about a specific contribution.
form_tp            Form Type                    Indicates on what schedule and line number the reporting committee reported a specific transaction.
file_num           File Number                  A unique number assigned to a report and all its associated transactions.
tran_id            Transaction ID               A unique identifier for each transaction.
election_tp        Election Type                Code indicating the election for which the contribution was made, in the form EYYYY (election plus election year).

To help with the analysis I added some additional fields to the dataset:

Variable           Meaning / Use
month              used for grouping data by month
week               used for grouping data by week
year               used for grouping data by year
latitude           stores the latitude based on the reported contribution city
longitude          stores the longitude based on the reported contribution city
employment_status  stores the employment status of each contributor based on the listed employer

Note: To attach a latitude and longitude to each contribution, I needed to match the city names in the cities dataframe to the city names in the financial dataframe. Initially some of the names did not match up because of differences in spelling. I was able to use a Python script from a previous project to create the matches and fix these differences.
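The script itself is not included in this report, so the following is only a minimal sketch of how such matching could work, using fuzzy string comparison from the Python standard library. The function name match_city, the 0.8 cutoff, the short reference list and the misspelled inputs are all assumptions made for illustration.

```python
import difflib


def match_city(reported, reference_names, cutoff=0.8):
    """Return the closest reference city name, or None if nothing is close enough.

    `cutoff` is an assumed similarity threshold, not a value from the report.
    """
    cleaned = reported.strip().upper()
    matches = difflib.get_close_matches(cleaned, reference_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical reference list standing in for the cities dataframe.
reference = ["BROOKLYN", "NEW YORK", "STONY BROOK", "BRONX", "ROCHESTER"]

# Hypothetical examples of how a city might be reported by a contributor.
for raw in ["Brookyln", "NEW YORK ", "Stoney Brook", "Albany"]:
    print(repr(raw), "->", match_city(raw, reference))
```

Names with no sufficiently close match return None, so they can be reviewed by hand rather than being silently assigned to the wrong city.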

Once I had added the additional fields, I was able to start analysing the data. To help with this I created several groupings / summaries of the data:

  • financial.group_by_city - used to plot and analyse the number and value of contributions by city
  • financial.group_by_candidate - used to plot and analyse the contributions made to different candidates
  • financial.group_by_employment - used to plot and analyse contributions made based on the status of employment
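As a rough illustration of what each of these summaries holds, the sketch below counts and totals contributions for each value of a grouping column. It is written in plain Python rather than the tooling used for the report; the group_by helper and the three sample rows are made up for the example, while the column names come from the dataset.

```python
from collections import defaultdict


def group_by(records, key):
    """Summarise contribution count and total value for each distinct value of `key`."""
    summary = defaultdict(lambda: {"contribution_count": 0, "contribution_value": 0.0})
    for row in records:
        group = summary[row[key]]
        group["contribution_count"] += 1
        group["contribution_value"] += row["contb_receipt_amt"]
    return dict(summary)


# Made-up rows for illustration only; they are not taken from the dataset.
records = [
    {"contbr_city": "BROOKLYN", "cand_nm": "Sanders, Bernard", "contb_receipt_amt": 25.0},
    {"contbr_city": "BROOKLYN", "cand_nm": "Clinton, Hillary Rodham", "contb_receipt_amt": 100.0},
    {"contbr_city": "NEW YORK", "cand_nm": "Clinton, Hillary Rodham", "contb_receipt_amt": 2700.0},
]

by_city = group_by(records, "contbr_city")
by_candidate = group_by(records, "cand_nm")
print(by_city)
print(by_candidate)
```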

Candidates

The table below lists each candidate and their candidate ID.

cand_id cand_nm
P60008059 Bush, Jeb
P60005915 Carson, Benjamin S.
P60008521 Christie, Christopher J.
P00003392 Clinton, Hillary Rodham
P60006111 Cruz, Rafael Edward ‘Ted’
P60007242 Fiorina, Carly
P60007697 Graham, Lindsey O.
P80003478 Huckabee, Mike
P60008398 Jindal, Bobby
P60003670 Kasich, John R.
P60009685 Lessig, Lawrence
P60007671 O’Malley, Martin Joseph
P60007572 Pataki, George E.
P40003576 Paul, Rand
P20003281 Perry, James R. (Rick)
P60006723 Rubio, Marco
P60007168 Sanders, Bernard
P20002721 Santorum, Richard J.
P20003984 Stein, Jill
P80001571 Trump, Donald J.
P60006046 Walker, Scott
P60008885 Webb, James Henry Jr.

Univariate Analysis

What are the main feature(s) of interest in your dataset?

The main features of this dataset are the candidate and the value of the contributions that each candidate received. The output below summarises the contributions.

## Total Value of Contributions:  46072566 
## Total Number of Contributions:   167902 
## Average Value of Contribution: 274.4015 
## Maximum Contribution Value:       10800 
## Minimum Contribution Value:        0.08 
## Number of Candidates:                22 
## Number of Contributors:           35955

The plot and table below show that two candidates received far more contributions than the rest of the candidates. Together these two candidates received 80% of the total number of contributions:

  • P60007168 (Sanders, Bernard) receiving 44.5%
  • P00003392 (Clinton, Hillary Rodham) receiving 35.5%

Percentage of Contributions by the top 5 candidates
cand_nm                     contribution_count  percent_count
Clinton, Hillary Rodham                  59621      35.509404
Sanders, Bernard                         74676      44.475944
Bush, Jeb                                 2582       1.537802
Rubio, Marco                              5068       3.018427
Cruz, Rafael Edward ‘Ted’                10730       6.390633

The first plot in the group above shows that the data is right-skewed, with the majority of the contributions being less than or equal to $500. To see the spread of the data better, I performed a log transform on the contribution amount, which can be seen in the second plot of the group. The red bar on each of the plots above shows the mean contribution amount of $274.40 and the blue line shows the median contribution amount of $50.00. The summary statistics of the contribution amounts can be seen below.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.08    25.00    50.00   274.40   100.00 10800.00
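The gap between the mean and the median is what a right-skewed distribution produces, and a log transform is one way to compress the long tail. The short sketch below shows both effects on a small made-up list of amounts; the values are illustrative only and are not drawn from the dataset.

```python
import math
import statistics

# Made-up contribution amounts: mostly small, with one large value in the tail.
amounts = [10, 25, 25, 50, 50, 100, 250, 2700]

mean = statistics.mean(amounts)
median = statistics.median(amounts)

# log10 turns multiplicative gaps into additive ones, so 10 -> 100 -> 1000
# become evenly spaced and the small amounts are no longer squashed together.
log_amounts = [math.log10(a) for a in amounts]

print("mean:", mean, "median:", median)
print("raw range:", max(amounts) - min(amounts))
print("log10 range:", round(max(log_amounts) - min(log_amounts), 2))
```

A single large contribution pulls the mean well above the median, which is the same pattern seen in the summary above.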

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features in the dataset that would be useful to investigate are:

Contributor Occupation / Employer

When I first grouped the data by the employer and occupation of the contributor, I found a large number of distinct values:

## Employers:  14735 
## Occupations: 6732

This made it difficult to determine whether there were any distinct patterns, as some of these could have been similar occupations with different titles, or the same employer recorded under different names. To look for patterns, I created an additional variable in the dataset for employment status, based on the listed employer. From the plot and tables below we can see that the bulk (55%) of the contributions came from contributors who were employed at the time of the contribution.

Summary by employment status
employment_status  contribution_count  contribution_value  avg_contribution  max_contribution  percent_amount
EMPLOYED                        92693            29817871         321.68418             10800       55.206609
SELF EMPLOYED                   25169             6525750         259.27727             10800       14.990292
NOT EMPLOYED                    20650             1731867          83.86767              5400       12.298841
RETIRED                         15542             2525894         162.52052              5400        9.256590
UNKNOWN                         13848             5471183         395.08835             10800        8.247668

Percentage of Contributions by Employment Status
employment_status  percent_amount
EMPLOYED                55.206609
SELF EMPLOYED           14.990292
NOT EMPLOYED            12.298841
RETIRED                  9.256590
UNKNOWN                  8.247668
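The report does not list the exact rules used to derive employment_status from the employer field, so the sketch below is only one plausible way to do it. The five status labels match the tables above; the keyword sets and the employment_status function are assumptions made for illustration.

```python
# Hypothetical keyword sets; the actual values used in the analysis are not shown in the report.
UNKNOWN_VALUES = {"", "N/A", "INFORMATION REQUESTED", "INFORMATION REQUESTED PER BEST EFFORTS"}
SELF_EMPLOYED_VALUES = {"SELF EMPLOYED", "SELF-EMPLOYED", "SELF"}
NOT_EMPLOYED_VALUES = {"NOT EMPLOYED", "UNEMPLOYED", "NONE", "HOMEMAKER", "STUDENT"}


def employment_status(employer):
    """Map a raw employer string to one of the five status labels."""
    value = (employer or "").strip().upper()
    if value in UNKNOWN_VALUES:
        return "UNKNOWN"
    if value == "RETIRED":
        return "RETIRED"
    if value in SELF_EMPLOYED_VALUES:
        return "SELF EMPLOYED"
    if value in NOT_EMPLOYED_VALUES:
        return "NOT EMPLOYED"
    # Anything else is treated as a named employer.
    return "EMPLOYED"


for raw in ["Self-Employed", "retired", None, "Acme Corp", "none"]:
    print(repr(raw), "->", employment_status(raw))
```

Collapsing thousands of employer strings into a handful of labels loses detail, but it makes the groups large enough to compare.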

Location of contributors

The plot below shows that contributions were generally spread out across the state of New York, with a couple of districts (Capital District and Central New York) that had a larger number of contributions. This probably reflects how the population is spread across the state and where businesses are generally located.

Contributions for top 5 locations
contbr_city  contribution_count  contribution_value
BROOKLYN                  40650           7873430.7
NEW YORK                  37610          20274071.1
STONY BROOK                4922            452823.3
BRONX                      2098            389668.5
ROCHESTER                  1788            180695.5

Contributions over time (Receipt Date of the Contribution)

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2013-10-11" "2015-10-19" "2016-01-13" "2015-12-08" "2016-02-11" 
##         Max. 
## "2016-02-29"

From this plot we can see that the number of contributions has increased over time. However, the data and summary show an outlier in 2013. I believe this could be the result of a data entry error, or of a contribution being recorded later than the transaction occurred.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When performing the initial review of the data, I found several outliers that affected the spread of the contribution values. These outliers included:

  • A contribution of 3,686,000
  • A contribution of -5,400
  • Contributions recorded with dates outside of the last 12 months
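One way to surface records like these is to flag each row with the reasons it looks unusual and then decide what to do with it. The sketch below is illustrative only: the 10,800 limit is the largest amount in the cleaned summary, the 12-month window ending on the latest receipt date is an assumption, and the outlier_reasons function is hypothetical rather than the method used in this report.

```python
from datetime import date

# Assumed limits for illustration; adjust to the rules that apply to the data.
MAX_AMOUNT = 10800.0
WINDOW_START = date(2015, 3, 1)   # assumed start of the 12-month window
WINDOW_END = date(2016, 2, 29)    # latest receipt date in the cleaned data


def outlier_reasons(amount, receipt_date):
    """Return a list of reasons why a contribution looks unusual (empty if none)."""
    reasons = []
    if amount <= 0:
        reasons.append("non-positive amount")
    elif amount > MAX_AMOUNT:
        reasons.append("amount above limit")
    if not (WINDOW_START <= receipt_date <= WINDOW_END):
        reasons.append("date outside window")
    return reasons


print(outlier_reasons(3686000, date(2016, 1, 10)))
print(outlier_reasons(-5400, date(2015, 12, 1)))
print(outlier_reasons(50, date(2013, 10, 11)))
print(outlier_reasons(50, date(2016, 1, 13)))
```

Flagging rather than deleting keeps the decision visible, which matters when a suspicious record might be a late entry rather than an error.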

To help see the spread of contributions, I also used either a SQRT or LOG10 transform on the scales when the data was too close together to analyse and interpret. This helped to show the data in more detail. The risk with transformed scales is that the plots can be misinterpreted, so the scales need to be read carefully.

Bivariate Analysis

Relationships between the main features

The first relationship I analysed was between the candidate and the total value of the contributions they received, which is shown in the next plot. Whilst candidate P60007168 (Sanders, Bernard) had the highest number of contributions, the candidate with the highest value of contributions was P00003392 (Clinton, Hillary Rodham), who received 56% of the total contribution amount.

Summary of Contribution Value by candidates
cand_nm                     contribution_value  avg_contribution  max_contribution  min_contribution  median_contribution  percent_amount
Clinton, Hillary Rodham            25716436.18         431.33185              5400              0.08                75.00      55.8172435
Sanders, Bernard                    5691346.41          76.21386              5400              1.00                35.00      12.3530052
Bush, Jeb                           4086579.31        1582.71856              5400              1.00              2700.00       8.8698757
Rubio, Marco                        3433298.04         677.44634             10800              2.00               100.00       7.4519359
Cruz, Rafael Edward ‘Ted’           1579858.70         147.23753             10800              1.00                50.00       3.4290660
Christie, Christopher J.            1085187.00        2009.60556              6000              1.00              2700.00       2.3553865
Carson, Benjamin S.                  878933.94         101.68139              5400              1.00                50.00       1.9077165
Kasich, John R.                      599840.00         938.71674              5000              3.00               250.00       1.3019462
Graham, Lindsey O.                   530982.07        1613.92726              7700             25.00              1000.00       1.1524908
Fiorina, Carly                       417277.23         302.37480              5400              3.00                75.00       0.9056957
O’Malley, Martin Joseph              362917.00         760.83229              2700              3.00               250.00       0.7877074
Walker, Scott                        350576.00        1608.14679             10800             50.00              1000.00       0.7609214
Pataki, George E.                    346145.82        1625.09775              2700             20.16              1350.00       0.7513057
Paul, Rand                           302144.35         209.67686              7300              1.00                50.00       0.6558010
Huckabee, Mike                       268089.50         837.77969              5400              1.00               250.00       0.5818853
Trump, Donald J.                     208977.00         322.49537              2700              1.00               203.98       0.4535823
Lessig, Lawrence                      94926.59         571.84693              2700              6.98               250.00       0.2060371
Jindal, Bobby                         32426.42        1473.92818              2700             11.10              1900.00       0.0703812
Webb, James Henry Jr.                 28700.00         610.63830              2700             50.00               250.00       0.0622930
Santorum, Richard J.                  27980.10         417.61343              2700              8.00               200.00       0.0607305
Perry, James R. (Rick)                17200.00         593.10345              2700             25.00               250.00       0.0373324
Stein, Jill                           12744.00         283.20000              2800             10.00               100.00       0.0276607

In order to see the spread of the value of the contributions I applied a SQRT coordinate transformation on the y-axis, which can be seen on the plot below.

The plot below shows the spread of contribution values and how often each value was contributed. From this plot we can start to see that as the contribution value increases, the number of contributions of that value decreases.

Relationships between other features

Contribution Values by Employment Status

The plots below show that the bulk of the total contribution value came from employed contributors; however, the highest average contribution came from the unknown employment status. This employment status is made up of contributors who did not have an employer recorded against their contribution. The employed status also had the highest number of contributions, which is consistent with this group having the highest contribution total but not the highest average. The unknown group also has the lowest number of contributions, which pushes up its average contribution value.

Contribution Values by location

When looking at the contribution values by location, we can see patterns similar to those in the count of contributions by location. The areas with the higher contribution values are also the locations with the higher number of contributions.

Contribution Values over time

What was the strongest relationship you found?

The strongest relationship that I observed was between the number of contributions received and time. As the campaigning process progresses, the number of contributions increases. What I had expected, but didn’t see, was a matching increase in the total amount contributed in each time period.

Multivariate Analysis

Relationships

During this part of the analysis, I decided to see how the timing or progression of the campaign impacted the amount and number of the contributions. First, I wanted to see whether some buckets / bins of contribution amounts increased or decreased more than others. To determine this I broke the contribution amounts into quartiles. From the plot below I could see that whilst the value of contributions per month is increasing for each quartile, the greatest increase occurs in the lowest bucket ($0-$25).
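The bucketing described above can be sketched as follows. The cut points of 25, 50 and 100 are the quartiles from the contribution summary earlier in this report; treating the upper end of each bucket as inclusive, the bucket labels and the function names are assumptions made for illustration, as is the sample data.

```python
import bisect
from collections import defaultdict
from datetime import date

# Quartile cut points from the contribution amount summary (1st Qu., Median, 3rd Qu.).
CUTS = [25.0, 50.0, 100.0]
# Assumed labels; each bucket includes its upper bound.
LABELS = ["$0-$25", "$25-$50", "$50-$100", "$100-$10,800"]


def bucket(amount):
    """Return the quartile bucket label for a contribution amount."""
    return LABELS[bisect.bisect_left(CUTS, amount)]


def value_by_month_and_bucket(rows):
    """Total the contribution value for each (month, bucket) pair."""
    totals = defaultdict(float)
    for receipt_date, amount in rows:
        totals[(receipt_date.strftime("%Y-%m"), bucket(amount))] += amount
    return dict(totals)


# Made-up contributions for illustration only.
rows = [
    (date(2016, 1, 5), 25.0),
    (date(2016, 1, 20), 10.0),
    (date(2016, 1, 9), 2700.0),
    (date(2016, 2, 1), 100.0),
]
print(value_by_month_and_bucket(rows))
```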

Next, I wanted to look at the top 5 candidates and see how their total contribution amounts varied over time. The plots below show that most of the top 5 candidates have ups and downs, with all candidates dropping around the holiday season. The candidate with the highest / most consistent trend of growth was P60007168 (Sanders, Bernard). The only candidate with the reverse trend was P60008059 (Bush, Jeb). The heatmaps further down show a similar story for all candidates.

I also wanted to see if some candidates received more contributions from particular areas than others; however, the plots below show that the top 5 candidates were receiving contributions from similar areas. This may be based on the population spread in New York.

Final Plots

The plot below shows us the following:

  • The number of contributions grows as the campaign progresses
  • The quartile of contribution values with the highest growth is the $0 to $25 range
  • The difference between the shapes of the areas suggests that there is no direct correlation between the number of contributions and the total value contributed per period. This can be seen prior to July 2015, where there are only slight increases in the number of contributions, yet the total value of the contributions is almost equal to that at the end of the data.

The plot below shows that for most of the candidates the bulk of their total contribution value came from contributions between $100 and $10,800. Comparing this to the contributions over time shows that people were generally contributing higher amounts earlier in the campaign, with the smaller amounts coming towards the end of the campaign.

The plot below shows the breakdown of the percentage of both the number of contributions and the value of contributions per candidate. From this plot we can see that whilst Bernard Sanders received the highest percentage of the number of contributions, he did not receive the highest value. This shows that his contributions were of lower value compared to those of other candidates, like Hillary Clinton. This is also supported by the quartile plot above, as the bulk of Bernard Sanders’ contributions were in the lowest quartile, whereas Hillary Clinton’s contribution values were more spread out. Other candidates like Jeb Bush and Marco Rubio received a low number of contributions but of higher value, which pushed their % of the total value up while keeping their % of the total count lower.
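The comparison between share of count and share of value can be reproduced directly from the figures reported earlier. The sketch below uses the totals from the summary output and the per-candidate counts and values from the tables in this report; only the shares helper is introduced for illustration.

```python
# Totals from the summary output in the Univariate Analysis section.
TOTAL_COUNT = 167902
TOTAL_VALUE = 46072566

# (contribution_count, contribution_value) copied from the candidate tables in this report.
candidates = {
    "Sanders, Bernard": (74676, 5691346.41),
    "Clinton, Hillary Rodham": (59621, 25716436.18),
    "Bush, Jeb": (2582, 4086579.31),
}


def shares(count, value):
    """Return (% of total count, % of total value, average contribution)."""
    pct_count = round(100 * count / TOTAL_COUNT, 1)
    pct_value = round(100 * value / TOTAL_VALUE, 1)
    average = round(value / count, 2)
    return pct_count, pct_value, average


for name, (count, value) in candidates.items():
    pct_count, pct_value, average = shares(count, value)
    print(f"{name}: {pct_count}% of count, {pct_value}% of value, average ${average}")
```

A candidate whose share of value is well below their share of count, as with Sanders, is being funded mainly by small contributions; the reverse, as with Bush, points to fewer but larger contributions.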

Reflection / Conclusion

I believe that analysing where contributions come from would be of most benefit when looking at the USA overall, as we would be able to draw a link between the number of contributions and the popularity of each candidate by state.

Another aspect that might have been worth analysing is the gender and age of the contributors, as this could have shown whether any groups of people were more likely to contribute to some candidates than others.

The hardest part of this investigation was trying to determine which data to compare, and I found that there was greater benefit in grouping the contributions, for example by employment status.