Financial Contributions to Presidential Candidates

Introduction

This report looks at the financial contributions made to presidential campaigns in the state of New York for the 2016 election. The primary dataset was downloaded from datasource. I also created a list of cities in New York with their latitude and longitude, extracted from city_datasource.

Data Structure

The Financial Contribution dataset (after cleaning) contains 167,902 contributions and 23 variables, made up of:

Variable	Name	Meaning / Use
cmte_id	Committee ID	A 9-character alpha-numeric code assigned to a committee by the Federal Election Commission.
cand_id	Candidate ID	A 9-character alpha-numeric code assigned to a candidate by the Federal Election Commission.
cand_nm	Candidate Name	Recorded name of the candidate
contbr_nm	Contributor Name	Reported name of the contributor.
contbr_city	Contributor City	Reported city of the contributor
contbr_state	Contributor State	Reported state of the contributor
contbr_zip	Contributor Zip Code	Reported zip code of the contributor
contbr_employer	Contributor Employer	Reported employer of the contributor
contbr_occupation	Contributor Occupation	Reported occupation of the contributor
contb_receipt_amt	Contribution Receipt Amount	Reported contribution amount
contb_receipt_dt	Contribution Receipt Date	Reported contribution date
receipt_desc	Receipt Description	Additional information reported by the committee about a specific contribution
memo_cd	Memo Code	‘X’ indicates the committee has provided additional text to describe a specific contribution
memo_text	Memo Text	Additional information reported by the committee about a specific contribution
form_tp	Form Type	Indicates what schedule and line number the reporting committee reported a specific transaction
file_num	File Number	A unique number assigned to a report and all its associated transactions
tran_id	Transaction ID	A unique identifier for each transaction
election_tp	Election Type	This code indicates the election for which the contribution was made. EYYYY (election plus election year)

To help with the analysis I added the following fields to the dataset:

Variable	Meaning / Use
month	used for grouping data by month
week	used for grouping data by week
year	used for grouping data by year
latitude	stores the latitude based on the reported contribution city
longitude	stores the longitude based on the reported contribution city
employment_status	stores the employment status of each contributor based on the listed employer
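As a minimal sketch of how fields like these could be derived (the report does not show the actual rules), the example below maps a free-text employer value to an employment status and splits a receipt date into year, month and week. The keyword sets, the function names `employment_status` and `date_fields`, and the ISO date format are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical employer values for each status; the report's real rules are not shown.
STATUS_KEYWORDS = {
    "RETIRED": {"RETIRED"},
    "SELF EMPLOYED": {"SELF EMPLOYED", "SELF-EMPLOYED", "SELF"},
    "NOT EMPLOYED": {"NOT EMPLOYED", "UNEMPLOYED", "NONE"},
    "UNKNOWN": {"", "N/A", "INFORMATION REQUESTED"},
}


def employment_status(employer):
    """Map a free-text employer value to one of the five statuses."""
    key = (employer or "").strip().upper()
    for status, values in STATUS_KEYWORDS.items():
        if key in values:
            return status
    # Any other named employer is treated as regular employment.
    return "EMPLOYED"


def date_fields(receipt_date):
    """Derive year, month and ISO week from an ISO-formatted date string."""
    d = date.fromisoformat(receipt_date)
    return {"year": d.year, "month": d.month, "week": d.isocalendar()[1]}


print(employment_status("Self-Employed"))
print(date_fields("2016-02-29"))
```

A missing employer falls into UNKNOWN here, which mirrors how the report describes that group; a real rule set would need many more keyword variants.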

Note: To attach a latitude and longitude to each contribution, I matched the city names in the cities dataframe to the reported cities in the financial dataframe. Initially some names did not match because of spelling differences, so I used a Python script from a previous project to find the closest matches and correct the spelling of the cities.
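The script itself is not included in this report. A minimal sketch of one way such matching could work, assuming the standard library's `difflib` is an acceptable stand-in, is shown below. The function name `match_city`, the `cutoff` value and the short list of known cities are hypothetical.

```python
from difflib import get_close_matches


def match_city(reported, known_cities, cutoff=0.8):
    """Return the known city closest to the reported spelling, or None."""
    key = reported.strip().upper()
    if key in known_cities:
        return key
    # get_close_matches ranks candidates by similarity ratio and drops
    # anything scoring below the cutoff.
    candidates = get_close_matches(key, known_cities, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


# Illustrative reference list; the real one would come from the cities dataframe.
known = ["BROOKLYN", "NEW YORK", "STONY BROOK", "BRONX", "ROCHESTER"]

for name in ["brooklyn ", "ROCHESTR", "STONYBROOK", "ZZZZ"]:
    print(repr(name), "->", match_city(name, known))
```

A high cutoff keeps false matches down at the cost of leaving some cities unmatched, so unmatched names would still need a manual check.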

Once I had added the additional fields, I was able to start analysing the data. To help with this I created several grouped summaries of the data:

financial.group_by_city - used to plot and analyse the number and value of contributions by city
financial.group_by_candidate - used to plot and analyse the contributions made to different candidates
financial.group_by_employment - used to plot and analyse contributions made based on the status of employment
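To make the idea of these grouped summaries concrete, here is a small sketch of a by-city summary that counts contributions and totals their value. It is not the code behind `financial.group_by_city`; the function `group_by_city` and the sample rows are illustrative assumptions, with only the column names taken from the dataset.

```python
from collections import defaultdict


def group_by_city(rows):
    """Count contributions and sum their value for each contributor city."""
    summary = defaultdict(lambda: {"contribution_count": 0, "contribution_value": 0.0})
    for row in rows:
        entry = summary[row["contbr_city"]]
        entry["contribution_count"] += 1
        entry["contribution_value"] += row["contb_receipt_amt"]
    return dict(summary)


# Made-up rows for illustration only.
rows = [
    {"contbr_city": "BROOKLYN", "contb_receipt_amt": 25.0},
    {"contbr_city": "BROOKLYN", "contb_receipt_amt": 50.0},
    {"contbr_city": "NEW YORK", "contb_receipt_amt": 2700.0},
]

for city, stats in group_by_city(rows).items():
    print(city, stats)
```

The same pattern applies to the candidate and employment summaries by swapping the grouping column.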

Candidates

The table below provides details for each of the candidates

cand_id	cand_nm
P60008059	Bush, Jeb
P60005915	Carson, Benjamin S.
P60008521	Christie, Christopher J.
P00003392	Clinton, Hillary Rodham
P60006111	Cruz, Rafael Edward ‘Ted’
P60007242	Fiorina, Carly
P60007697	Graham, Lindsey O.
P80003478	Huckabee, Mike
P60008398	Jindal, Bobby
P60003670	Kasich, John R.
P60009685	Lessig, Lawrence
P60007671	O’Malley, Martin Joseph
P60007572	Pataki, George E.
P40003576	Paul, Rand
P20003281	Perry, James R. (Rick)
P60006723	Rubio, Marco
P60007168	Sanders, Bernard
P20002721	Santorum, Richard J.
P20003984	Stein, Jill
P80001571	Trump, Donald J.
P60006046	Walker, Scott
P60008885	Webb, James Henry Jr.

Univariate Analysis

What are the main feature(s) of interest in your dataset?

The main features of this dataset include the candidate and the value of the contributions that they received. The data below shows the break down of the contributions.

## Total Value of Contributions:  46072566 
## Total Number of Contributions:   167902 
## Average Value of Contribution: 274.4015 
## Maximum Contribution Value:       10800 
## Minimum Contribution Value:        0.08 
## Number of Candidates:                22 
## Number of Contributors:           35955

The plot and table below show that two candidates received far more contributions than the rest. Together these candidates received 80% of the number of contributions:

P60007168 (Sanders, Bernard) receiving 44.5%
P00003392 (Clinton, Hillary Rodham) receiving 35.5%

Percentage of Contributions by the top 5 candidates
cand_nm	contribution_count	percent_count
Clinton, Hillary Rodham	59621	35.509404
Sanders, Bernard	74676	44.475944
Bush, Jeb	2582	1.537802
Rubio, Marco	5068	3.018427
Cruz, Rafael Edward ‘Ted’	10730	6.390633
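The percent_count column can be reproduced from the counts in the table and the total number of contributions. The sketch below is only a check of that arithmetic using the figures reported here; the variable names are my own.

```python
TOTAL_CONTRIBUTIONS = 167902  # total number of contributions reported earlier

# contribution_count values copied from the table.
counts = {
    "Sanders, Bernard": 74676,
    "Clinton, Hillary Rodham": 59621,
    "Cruz, Rafael Edward 'Ted'": 10730,
    "Rubio, Marco": 5068,
    "Bush, Jeb": 2582,
}

# Share of the number of contributions, as a percentage.
percent_count = {name: 100 * n / TOTAL_CONTRIBUTIONS for name, n in counts.items()}
top_two = percent_count["Sanders, Bernard"] + percent_count["Clinton, Hillary Rodham"]

for name, pct in percent_count.items():
    print(f"{name:<28}{pct:6.1f}%")
print(f"Top two combined: {top_two:.1f}%")
```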

The first plot in the group above shows that the data is right skewed, with the majority of the contributions being less than or equal to $500. To see the spread of the data better I applied a log transform to the contribution amount, which can be seen in the second plot of the group. The red bar on each plot shows the mean contribution amount of $274.40 and the blue line shows the median contribution amount of $50.00. The summary statistics for the contribution amounts can be seen below.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.08    25.00    50.00   274.40   100.00 10800.00
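The gap between the mean ($274.40) and the median ($50.00) is typical of right-skewed data, where a few large contributions pull the mean up. The sketch below illustrates the effect on a small made-up sample; the amounts are not taken from the dataset.

```python
import statistics

# Illustrative amounts only: many small contributions and one large one.
amounts = [10, 25, 25, 50, 50, 50, 100, 100, 250, 2700]

mean = statistics.mean(amounts)
median = statistics.median(amounts)
# The inclusive method interpolates between the sample's own data points.
q1, q2, q3 = statistics.quantiles(amounts, n=4, method="inclusive")

print(f"mean={mean}  median={median}")
print(f"quartiles: {q1}, {q2}, {q3}")
```

In this sample the single large value lifts the mean above the third quartile, the same pattern seen in the summary for the real data.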

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features in the dataset that would be useful to investigate are:

Contributor Occupation / Employer - to see if there are any trends based on the type of occupation or employer, in particular to compare employed vs unemployed.
Location of contributors - to see if there are particular areas of New York that prefer different candidates
Receipt Date of the Contribution - to see if there are certain times (weeks or months) of the campaign that would encourage more people to contribute

Contributor Occupation / Employer

When first looking at the employer and occupation of the contributors, I could see that there were:

## Employers:  14735 
## Occupations: 6732

This made it difficult to determine whether there were any distinct patterns, as some of these could have been similar occupations with different titles, or the same employer recorded under different names. To look for patterns I therefore created an additional variable in the dataset for employment status, based on the listed employer. From the plot and tables below we can see that the bulk (55% by count) of the contributions came from contributors who were employed at the time of the contribution. Note that the percent_amount column in these tables matches each group's share of the number of contributions (for example 92,693 / 167,902 = 55.2%), not its share of the total value.

Summary by employment status
employment_status	contribution_count	contribution_value	avg_contribution	max_contribution	percent_amount
EMPLOYED	92693	29817871	321.68418	10800	55.206609
SELF EMPLOYED	25169	6525750	259.27727	10800	14.990292
NOT EMPLOYED	20650	1731867	83.86767	5400	12.298841
RETIRED	15542	2525894	162.52052	5400	9.256590
UNKNOWN	13848	5471183	395.08835	10800	8.247668

Percentage of Contributions by Employment Status
employment_status	percent_amount
EMPLOYED	55.206609
SELF EMPLOYED	14.990292
NOT EMPLOYED	12.298841
RETIRED	9.256590
UNKNOWN	8.247668
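As a quick check on how these percentages relate to the summary table, the sketch below recomputes each group's share of the number of contributions and, for comparison, its share of the total value, using only the counts and values listed in the summary. The variable names are my own.

```python
# (contribution_count, contribution_value) copied from the summary table.
by_status = {
    "EMPLOYED": (92693, 29817871),
    "SELF EMPLOYED": (25169, 6525750),
    "NOT EMPLOYED": (20650, 1731867),
    "RETIRED": (15542, 2525894),
    "UNKNOWN": (13848, 5471183),
}

total_count = sum(count for count, _ in by_status.values())
total_value = sum(value for _, value in by_status.values())

# Percentage share of the number of contributions and of their value.
count_share = {s: 100 * c / total_count for s, (c, _) in by_status.items()}
value_share = {s: 100 * v / total_value for s, (_, v) in by_status.items()}

for status in by_status:
    print(f"{status:<14} count {count_share[status]:5.1f}%   value {value_share[status]:5.1f}%")
```

The count shares reproduce the percent_amount column, while the value shares differ: employed contributors account for a larger share of the money than of the number of contributions.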

Location of contributors

The plot below shows that contributions were generally spread out across the state of New York, with a couple of districts (Capital District and Central New York) that had a larger number of contributions. This probably reflects how the population is spread across the state and where businesses are generally located.

Contributions for top 5 locations
contbr_city	contribution_count	contribution_value
BROOKLYN	40650	7873430.7
NEW YORK	37610	20274071.1
STONY BROOK	4922	452823.3
BRONX	2098	389668.5
ROCHESTER	1788	180695.5

Contributions over time (Receipt Date of the Contribution)

##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2013-10-11" "2015-10-19" "2016-01-13" "2015-12-08" "2016-02-11" 
##         Max. 
## "2016-02-29"

From this plot we can see that the number of contributions has increased over time. However, the data and summary show an outlier in 2013. I believe this could be due to data entry errors or to data recorded later than the transaction occurred.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When performing the initial review of the data, I found several outliers that distorted the spread of the contribution values. These outliers included:

A contribution of 3,686,000
A contribution of -5,400
Contributions recorded with dates outside of the last 12 months
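A minimal sketch of how amount outliers like these could be filtered out is shown below. The cut-off of 10,800 is an assumption based on the maximum in the cleaned summary, not a rule stated in this report, and the function name `is_plausible` is hypothetical.

```python
def is_plausible(amount, max_amount=10800):
    """Keep positive amounts up to max_amount (an assumed cut-off)."""
    return 0 < amount <= max_amount


# Illustrative values, including the two amount outliers listed in the report.
raw_amounts = [50.0, 3686000.0, -5400.0, 2700.0, 0.08]
cleaned = [a for a in raw_amounts if is_plausible(a)]

print(cleaned)
```

Negative amounts are dropped here rather than netted against the original contribution, which is a simplification.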

To help see the spread of contributions, I also used either a SQRT or LOG10 transform on the scales when the data was too close together to analyse and interpret. This helped to show the data in more detail. The risk with transformed scales is that the plots can be misinterpreted, so the axis scales need to be read carefully.
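To illustrate why the scales need care, the sketch below applies both transforms to a few amounts taken from the summary statistics (minimum, quartiles and maximum). It only shows how the spacing between values changes and is not the plotting code used for this report.

```python
import math

# Minimum, quartiles and maximum from the contribution amount summary.
amounts = [0.08, 25, 50, 100, 10800]

sqrt_scaled = [math.sqrt(a) for a in amounts]
log_scaled = [math.log10(a) for a in amounts]

for a, s, l in zip(amounts, sqrt_scaled, log_scaled):
    print(f"{a:>10.2f}  sqrt={s:8.2f}  log10={l:6.2f}")
```

Both transforms pull the largest value much closer to the rest, and amounts below $1 become negative on the log10 scale, which is easy to misread without checking the axis labels.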

Bivariate Analysis

Relationships between the main features

The first relationship I analysed was between the candidate and the total value of the contributions they received. The next plot shows the total value of contributions per candidate. It shows that whilst candidate P60007168 (Bernard Sanders) had the highest number of contributions, the candidate with the highest value of contributions was P00003392 (Hillary Clinton), making up 56% of the total contribution amount.

Summary of Contribution Value by candidates
cand_nm	contribution_value	avg_contribution	max_contribution	min_contribution	median_contribution	percent_amount
Clinton, Hillary Rodham	25716436.18	431.33185	5400	0.08	75.00	55.8172435
Sanders, Bernard	5691346.41	76.21386	5400	1.00	35.00	12.3530052
Bush, Jeb	4086579.31	1582.71856	5400	1.00	2700.00	8.8698757
Rubio, Marco	3433298.04	677.44634	10800	2.00	100.00	7.4519359
Cruz, Rafael Edward ‘Ted’	1579858.70	147.23753	10800	1.00	50.00	3.4290660
Christie, Christopher J.	1085187.00	2009.60556	6000	1.00	2700.00	2.3553865
Carson, Benjamin S.	878933.94	101.68139	5400	1.00	50.00	1.9077165
Kasich, John R.	599840.00	938.71674	5000	3.00	250.00	1.3019462
Graham, Lindsey O.	530982.07	1613.92726	7700	25.00	1000.00	1.1524908
Fiorina, Carly	417277.23	302.37480	5400	3.00	75.00	0.9056957
O’Malley, Martin Joseph	362917.00	760.83229	2700	3.00	250.00	0.7877074
Walker, Scott	350576.00	1608.14679	10800	50.00	1000.00	0.7609214
Pataki, George E.	346145.82	1625.09775	2700	20.16	1350.00	0.7513057
Paul, Rand	302144.35	209.67686	7300	1.00	50.00	0.6558010
Huckabee, Mike	268089.50	837.77969	5400	1.00	250.00	0.5818853
Trump, Donald J.	208977.00	322.49537	2700	1.00	203.98	0.4535823
Lessig, Lawrence	94926.59	571.84693	2700	6.98	250.00	0.2060371
Jindal, Bobby	32426.42	1473.92818	2700	11.10	1900.00	0.0703812
Webb, James Henry Jr.	28700.00	610.63830	2700	50.00	250.00	0.0622930
Santorum, Richard J.	27980.10	417.61343	2700	8.00	200.00	0.0607305
Perry, James R. (Rick)	17200.00	593.10345	2700	25.00	250.00	0.0373324
Stein, Jill	12744.00	283.20000	2800	10.00	100.00	0.0276607
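The avg_contribution and percent_amount columns follow directly from the counts and values reported for each candidate. The sketch below checks that arithmetic for three candidates whose counts appear in the earlier top 5 table; the variable names are my own.

```python
TOTAL_VALUE = 46072566  # total value of contributions reported earlier

# (contribution_count, contribution_value) from the two candidate tables.
candidates = {
    "Clinton, Hillary Rodham": (59621, 25716436.18),
    "Sanders, Bernard": (74676, 5691346.41),
    "Bush, Jeb": (2582, 4086579.31),
}

# Average contribution and percentage share of the total value.
avg = {name: value / count for name, (count, value) in candidates.items()}
share = {name: 100 * value / TOTAL_VALUE for name, (_, value) in candidates.items()}

for name in candidates:
    print(f"{name:<26} avg ${avg[name]:9.2f}   share {share[name]:5.1f}%")
```

This makes the contrast explicit: Sanders has the most contributions but a low average, while Bush has few contributions with a high average.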

In order to see the spread of the value of the contributions I applied a SQRT coordinate transformation on the y-axis, which can be seen on the plot below.

The plot below shows the spread of contribution values and how often each value was contributed. From this plot we can start to see that as the contribution value increases, the number of contributions at that value decreases.

Relationships between other features

Contribution Values by Employment Status

The plots below show that the bulk of the total contributions were made by employed contributors, however the highest average contribution came from the unknown employment status. This status is made up of contributors who did not have an employer recorded against their contribution. The employed status also had the highest number of contributions, which is consistent with this group having the highest contribution total but not the highest average. The unknown group has the lowest number of contributions, which pushes up its average contribution value.

Contribution Values by location

When looking at the contribution values by location we can see patterns similar to those in the count of contributions by location. The areas with the higher values are also the locations with the higher number of contributions.

Contribution Values over time

What was the strongest relationship you found?

The strongest relationship that I observed was between time and the number of contributions being received. As the campaign progresses, the number of contributions increases. What I had expected, but did not see, was a matching increase in the total amount being contributed in each time period.

Multivariate Analysis

Relationships

During this part of the analysis, I decided to see how the timing or progression of the campaign affected the amount and number of contributions. First I wanted to investigate whether some buckets / bins of contribution amounts increased or decreased more than others. To determine this I split the contribution amounts into quartiles. From the plot below I could see that whilst the value of contributions per month is increasing for each quartile, the greatest increase occurs in the lowest bucket (0-25).
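The bucketing step can be sketched as follows, assuming the bucket edges are the quartiles from the contribution amount summary (25, 50 and 100) and that dates are ISO-formatted strings. The function names, labels and sample rows are illustrative and not taken from the analysis code.

```python
from bisect import bisect_left
from collections import Counter

# Upper edges of the first three buckets: the quartiles from the summary.
BREAKS = [25, 50, 100]
LABELS = ["0-25", "25-50", "50-100", "100-10800"]


def bucket(amount):
    """Return the quartile bucket label; upper edges are inclusive."""
    return LABELS[bisect_left(BREAKS, amount)]


def counts_by_month_and_bucket(rows):
    """Count contributions per (YYYY-MM, bucket) pair."""
    counts = Counter()
    for receipt_date, amount in rows:
        month = receipt_date[:7]  # keep the YYYY-MM part of the date
        counts[(month, bucket(amount))] += 1
    return counts


# Made-up rows for illustration only.
rows = [("2016-01-13", 25), ("2016-01-20", 27.5), ("2016-02-11", 10), ("2016-02-29", 2700)]
monthly = counts_by_month_and_bucket(rows)
for key in sorted(monthly):
    print(key, monthly[key])
```

Summing amounts instead of counting rows in the same loop would give the value per month for each bucket.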

Next I wanted to look at the top 5 candidates and see how their total contribution amounts varied over time. The plots below show that most of the top 5 candidates have ups and downs, with all candidates dropping around the holiday season. The candidate with the most consistent growth was P60007168 (Sanders, Bernard). The only candidate with the reverse trend was P60008059 (Bush, Jeb). The heatmaps further down show a similar story for all candidates.

I also wanted to see whether some candidates received more contributions from particular areas than others. However, the plots below show that the top 5 candidates were receiving contributions from similar areas. This may be due to the population spread in New York.

Final Plots

The plot below shows us the following:

The number of contributions grows as the campaign progresses.
The quartile of contribution values with the highest growth is the $0 to $25 range.
The difference between the shapes of the areas suggests that there is no direct correlation between the number of contributions and the total value contributed per period. This can be seen prior to July 2015, where there are only slight increases in the number of contributions, yet the total value of the contributions is almost equal to that at the end of the data.

The plot below shows us that for most of the candidates the bulk of their total contributions came from contributions that were between $100 - $10,800. When you compare this to the contributions over time it shows that people were generally contributing higher amounts earlier in the campaign and the smaller values were happening at the end of the campaign.

The plot below shows the breakdown of the percentage of both the number of contributions and the value of contributions per candidate. From this plot we can see that whilst Bernard Sanders received the highest percentage of the number of contributions, he did not receive the highest value. This shows that his contributions were of lower value than those of other candidates, such as Hillary Clinton. This is also supported by the quartile plot above, as the bulk of Bernard Sanders' contributions were in the lowest quartile, whereas Hillary Clinton's contribution values were more spread out. Other candidates, such as Jeb Bush and Marco Rubio, received a low number of contributions but of higher value, which pushed their % of the total value up while their % of the total count stayed lower.

Reflection / Conclusion

I believe that analysing where contributions come from would be most valuable when looking at the USA overall, as we would be able to draw a link between the number of contributions and the popularity of each candidate by state.

Another aspect that might have been worth analysing is the gender and age of the contributors, as this could have helped to show whether any groups of people were more likely to contribute to some candidates than others.

The hardest part of this investigation was deciding which data to compare. I found that there was greater benefit in grouping the contributions into categories, for example by employment status.

Gareth Hunt

8th May 2016
