(image credit: Ofure Itua)
Aaron Abraham*, Kevin Lin*
*Both authors contributed equally to the writing and research in this study. Their names are listed in alphabetical order.
This study aimed to identify the major factors that are closely correlated with violent crime in communities across the United States. A dataset from the UCI Machine Learning Repository provided statistics on demographic and socioeconomic factors that could affect violent crime rates. Five models (linear regression, stepwise regression, relaxed LASSO, random forest, and extreme gradient boosting trees) were used to identify which variables were most strongly correlated with violent crime rates. Models achieved RMSE scores of approximately 0.13 and R² values of approximately 0.65, indicating a reasonable fit. All models indicated that a mix of demographic, socioeconomic, familial, and individual factors was closely correlated with violent crime rates, including, but not limited to, the percentage of kids in two-parent households, the number of children born to unmarried couples, and the percentage of divorced males in a community. Analysis of the variables most highly correlated with the violent crime rate suggested that the relationship may be traced to competition over resources, psychological harm, and difficult living conditions. These results may help law enforcement agencies target crime more accurately and reduce violent crime rates.
Communities, Violent Crime, Attributes, Regression, Visualization
Violent crimes in the United States can be categorized as follows: murder, assault, sexual assault, stealing with force, and other crimes in which an offender uses or threatens to use force upon a victim. The total number of violent crimes per 100,000 members of a population is a quantitative measure of safety within a region.
Several factors can influence the rate of violent crime in a community, including but not limited to:
1. individual factors such as high impulsivity and low achievement
2. parental/familial factors such as poor supervision and harsh discipline
3. demographic factors such as racial composition and percent of recent immigrants
4. socioeconomic factors such as low household income or large families.
The purpose of this study is to determine which categories of factors have the largest impact on violent crime rates within various communities in the United States. Quantifying individual factors and their local impacts is impractical; thus, they are not considered in the results. Regression and visualizations facilitate the identification of the strongest factors that influence the safety of a community and point to possible ways to decrease violent crime.
The extensive dataset used for this report was obtained from the University of California, Irvine Machine Learning Repository. Its creators selected attributes that had a plausible connection to crime, in addition to the attribute to be predicted (per capita violent crimes). The variables in the dataset describe specific communities within the United States. The creators of the dataset omitted communities (mostly from the midwestern USA) that had unclear counts of rape cases.
All numeric data (122 attributes) were already normalized into the decimal range from 0.00 to 1.00 using an equal-interval binning method. Attributes retained their distribution and skew. For example, an attribute described as “mean number of people per family” is actually the normalized version of that value within the given decimal range. The normalization preserves the rough ratios of values within an attribute.
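The effect of this kind of normalization can be illustrated with a minimal sketch. It assumes simple min-max scaling followed by rounding to two decimals, which is one plausible reading of "equal-interval binning" and not necessarily the dataset creators' exact procedure; the function name and sample values are hypothetical.

```python
def normalize_equal_interval(values, decimals=2):
    """Scale values to [0, 1] and snap them to equal-width steps (0.00, 0.01, ..., 1.00)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [round((v - lo) / (hi - lo), decimals) for v in values]


# Hypothetical raw values for "mean number of people per family".
raw = [2.0, 2.5, 3.0, 4.0]
scaled = normalize_equal_interval(raw)
```

Because the scaling is linear, the spacing of values within the attribute (and therefore its distribution and skew) is preserved, while the original units are lost.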
A total of 38 factors were eliminated, largely because of large amounts of missing information, as many of the algorithms would not perform well with missing data. Hence, 90 attributes were kept for this study. These 90 attributes fell into one or more of the four categories listed (individual, parental/familial, demographic, socioeconomic). Columns of the dataset were eliminated for the following reasons: 1) many missing values, 2) repeated columns, and 3) relative unimportance in comparison to related attributes (e.g. the columns “percent of people living in the same city as in 1985 (5 years before)” and “percent of people living in the same state as in 1985 (5 years before)” were removed because “percent of people living in the same house as in 1985 (5 years before)” would better represent regional data). Considering that every column was already normalized, no outliers needed to be eliminated. In addition, visualizations were created after the models were built.
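The first elimination criterion (many missing values) can be sketched as follows. This is an illustration only: the 50% threshold, the "?" missing-value marker, and the column names in the sample rows are assumptions made for the example, not details reported by the study.

```python
def drop_sparse_columns(rows, missing="?", max_missing_fraction=0.5):
    """Return rows without the columns whose share of missing entries exceeds the threshold."""
    columns = list(rows[0].keys())
    keep = []
    for col in columns:
        n_missing = sum(1 for row in rows if row[col] == missing)
        if n_missing / len(rows) <= max_missing_fraction:
            keep.append(col)
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical sample: the middle column is missing in two of three communities.
rows = [
    {"PctKids2Par": 0.61, "LemasSwornFT": "?", "ViolentCrimesPerPop": 0.20},
    {"PctKids2Par": 0.45, "LemasSwornFT": "?", "ViolentCrimesPerPop": 0.67},
    {"PctKids2Par": 0.80, "LemasSwornFT": 0.03, "ViolentCrimesPerPop": 0.05},
]
cleaned = drop_sparse_columns(rows)
```

Dropping whole columns, rather than imputing values, keeps every community in the dataset at the cost of discarding the sparsely recorded attributes.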
Because all data points were already normalized, minimal data manipulation was needed, as all variables were in a range that could be easily compared. However, a limitation of the normalization is that it does not preserve relationships between values of different attributes. For this reason, it is not meaningful to compare the value for “percentage of population that is Caucasian” with the value for “percentage of population that is African American” for a single community, as the normalized values for the demographic groups within a region do not always add up to 1. Instead, normalization only ensures that the data in each column lie in the same range as the data in every other column.
Five machine learning algorithms were chosen based on their accuracy and simplicity: linear regression, stepwise regression, relaxed LASSO regression, extreme gradient boosting trees, and random forests.
Linear regression fits a line that best describes the relationship between the dependent variable (in this case, the violent crime rate) and the independent variables (Figure 1). As an extremely simple algorithm, it was used as a baseline against which to assess more complex models. The stepwise regression algorithm fits a generalized linear model but adds variables (forward selection) or removes them (backward selection) in order to achieve high accuracy (Figure 2). For better variable selection and regression, the relaxed LASSO algorithm was also chosen (Figure 3). LASSO creates a linear model while ensuring that the sum of the absolute values of the variable coefficients is less than a fixed number, effectively shrinking coefficients according to importance and sometimes shrinking them all the way to zero. Relaxed LASSO builds on this by taking the variables with nonzero coefficients, constructing a least-squares regression on them, and then padding the model with the variables whose coefficients are zero. Extreme gradient boosting trees is a collection of many decision trees (Figure 4) that are built sequentially: as the model is trained, each new tree is created to predict the errors of the model so far. Trees are added until no further improvement can be made, which makes it a very accurate model. The last model used was a random forest, a ubiquitous ensemble model in machine learning (Figure 5). It is extremely useful for preventing overfitting, as it not only constructs multiple decision trees and averages their predictions for the final output, but also uses bootstrap aggregating and random feature selection at each node to minimize memorization.
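The LASSO constraint described above can be written compactly. The following is a sketch in standard textbook notation; the symbols $y_i$ (observed crime rate of community $i$), $x_{ij}$ (value of attribute $j$), $\beta_j$ (coefficients), and the bound $t$ are introduced here for illustration and do not come from the study itself.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
\quad \text{subject to} \quad
\sum_{j=1}^{p} \lvert \beta_j \rvert \le t
```

Smaller values of $t$ force more coefficients to exactly zero, which is what makes LASSO usable for variable selection.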
Data were divided into a training set (70% of the data) and a testing set (30% of the data) so that overfitting could be detected. Overfitting occurs when a model memorizes outcomes rather than learning general patterns from them. An overfitted model usually performs extremely well on the training set but poorly on the testing set. To mitigate this problem, 10-fold cross-validation was used on the training set. This procedure divides the training set into 10 groups. In each iteration, nine groups are used for training while the remaining group is held out for validation, and the procedure is repeated until every group has been held out once. The model parameters that achieve the best accuracy across the folds are kept. Cross-validation helps guard against overfitting, as the model is re-trained multiple times on different segments of the data.
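The split and the cross-validation folds can be sketched with index lists alone. This is a minimal illustration, assuming a random 70/30 split and folds formed by striding over the shuffled training indices; the function names, the seed, and the community count of 100 are hypothetical, and the study itself performed this step with the caret package in R.

```python
import random


def train_test_split(indices, test_fraction=0.3, seed=0):
    """Shuffle the indices and return (train, test) index lists."""
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(indices, k=10):
    """Yield (training, validation) index lists; each fold is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield training, validation


all_indices = list(range(100))  # hypothetical: 100 communities
train_idx, test_idx = train_test_split(all_indices)
splits = list(k_fold_splits(train_idx, k=10))
```

The testing indices never appear in any fold, so the final evaluation on the testing set uses data that played no part in tuning.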
As this was a regression problem, RMSE and R² values were considered the best metrics for model evaluation. The root mean squared error (RMSE) is a measure of the differences between predicted and observed values. It aggregates the magnitudes of the prediction errors into a single measure of accuracy for continuous variables by taking the square root of the average of the squared differences between predicted and observed values. Because the errors are squared, larger errors have a disproportionately large effect, which makes RMSE sensitive to outliers. Lower values of RMSE indicate a better fit.
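Both metrics follow directly from their standard definitions. The sketch below is illustrative and uses made-up observed and predicted values; R² is computed as one minus the ratio of the residual sum of squares to the total sum of squares, which is the usual definition and is assumed here to match what the study reports.

```python
import math


def rmse(observed, predicted):
    """Square root of the mean squared difference between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Share of the variance in the observed values that the predictions explain."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up normalized crime rates and model predictions.
observed = [0.0, 0.5, 1.0]
predicted = [0.0, 0.5, 0.5]
error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```

Since the crime rate is normalized to the range 0.00 to 1.00, an RMSE is read on that same scale.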
To further analyze the models, the varImp() function from the caret package was used to identify the variables that were most important to each model when making predictions. These variables were grouped based on the categories listed above. Visualizations were then created in Python using the seaborn, matplotlib, and plotly modules. The visualizations chosen, specifically bar charts, bubble charts, and heatmaps, allow for a deeper analysis of the effects of variables on crime rates in the given dataset.
In the linear regression model, there was a positive correlation between the number of homeless people counted in the street and the total number of violent crimes per capita (r = 0.340). The result suggests a roughly linear relationship between these two variables, with a greater number of homeless people on the streets associated with a higher violent crime rate. Other important variables in this model included the percentage of males who have never married, median rent price (for rental housing), percentage of people living in areas classified as urban, and percentage of kids born to never-married couples.
Furthermore, the stepwise regression and relaxed LASSO models produced nearly identical sets of most important variables. The five most important ones, listed from most to least important, were the percentage of kids born to never-married couples, percentage of kids in family housing with two parents, percentage of families (with kids) that are headed by two parents, percentage of population that is Caucasian, and percent of kids 4 and under in two-parent households. Both models displayed a strong correlation between familial factors and the violent crime rate.
In the random forest model, the most important variables included the percentage of kids in family housing with two parents, percentage of kids born to never-married couples, percent of persons in dense housing (more than 1 person per room), and percentage of families (with kids) that are headed by two parents. Similar to the stepwise regression and relaxed LASSO models, this model displayed a strong correlation between familial factors and the violent crime rate.
In the extreme gradient boosting trees model (XGBoost), similar familial variables were considered important. These included the percentage of kids in family housing with two parents, the percentage of kids born to never-married couples, the number of kids born to never-married couples, percentage of Caucasians and percent of persons in dense housing (more than 1 person per room).
Overall, these models produced generally similar results. Almost every model had an R² value of around 0.65, indicating a reasonably good fit. No R² value was reported for the random forest, as it does not fit a line to the data points. The relaxed LASSO model had by far the highest RMSE value, at 0.733785, while the RMSE values of the other models centered around 0.14. The three most important variables across the models were the percentage of kids born to never-married couples (r = 0.738), percentage of kids in family housing with two parents (r = -0.738), and percentage of families (with kids) that are headed by two parents (r = -0.707).
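The r values quoted above are read here as Pearson correlation coefficients, which is an assumption since the study does not name the coefficient. A minimal sketch of that calculation, using made-up values rather than the study's data:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: share of kids in two-parent housing vs. violent crime rate.
pct_two_parent = [0.9, 0.7, 0.5, 0.3]
crime_rate = [0.1, 0.3, 0.5, 0.7]
r = pearson_r(pct_two_parent, crime_rate)
```

A negative r, as in this toy example, means that communities with a higher share of two-parent households tend to have lower violent crime rates; it does not by itself establish which one causes the other.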
Bar graphs were also created to compare how variables changed amongst the cities with the 5 highest violent crime rates (5 bars on the left) and the 5 lowest violent crime rates (5 bars on the right). Additionally, heatmaps were created to confirm strong correlations.
Using the machine learning models described above, this study identified which categories of factors have the largest impact on violent crime rates within various communities in the United States. Many factors that were expected to strongly influence violent crime rates, such as homelessness and financial struggle, were confirmed to have a strong influence in most models and visualizations. However, numerous familial factors, such as the percent of kids born to unmarried parents, surprisingly had very high correlations with violent crime per capita. The major findings from this study were the most important variables across the generated models (the percentage of kids born to never-married couples, the percentage of kids in family housing with two parents, and the percentage of families that are headed by two parents). Ideally, these attributes (all of which are familial factors) can be better addressed and prioritized by governments when combating high rates of violent crime.
The number of recent immigrants demonstrated a positive correlation with the rate of violent crime, and the percent of recent immigrants is higher in the cities with the highest violent crime rates. Of the many demographic factors that are commonly thought to be leading causes of high violent crime rates, this was the most telling. It is possible that recent immigrants, who also may not speak English very well, struggle to find employment opportunities. This could have decreased their confidence in their ability to enter the workforce and climb the socioeconomic ladder. Additionally, the added pressure of economic struggle could have pushed some immigrants toward unlawful behavior. Low achievement and high impulsivity under these conditions may help explain the association with criminal activity.
The results confirmed many well-known causes of violent crime in a community, such as low income, unemployment, and homelessness. Additionally, they showed that communities with a high percentage of households with public assistance income and a low percentage of households with investment/rent income also have high violent crime rates. These two attributes represent the behavior of the community as a result of socioeconomic status. This could mean that public assistance is a strong sign of a less wealthy community, one that is more prone to violent crime due to harder living conditions. Also, higher percentages of households with investment/rent income generally reflect a wealthier community that is able to pay regular installments of money. These communities are not as likely to see violent crime.
When there is tension or a lack of support in many households within a community, the community is more likely to have higher rates of violent crime. For example, children growing up in large families are less likely to live under desirable family conditions; they are often influenced by poor parental behaviour, poor child-rearing practices, and competition for physical and psychological resources. Similarly, children born to never-married couples or divorced parents may not always have been properly cared for due to familial issues, which may leave them open to negative influences outside the home. Hence, our results consistently show that higher percentages of two-parent households are linked to lower rates of violent crime.
There were many demographic, socioeconomic, and parental/familial factors that highlighted interesting trends in our data. However, it was still inconclusive whether the violent crime of a community was the cause or the effect of certain attributes.
When considering the dataset at hand, there are some systemic problems that may have skewed the results. Many of the features may be subject to response bias, as they concern potentially revealing or uncomfortable topics, such as the number of children born to never-married couples or divorce rates. Furthermore, there are some intrinsic imbalances in violent crime rates and in the other factors that affect these rates. As seen in Appendix 12, most of the data were collected in communities with low violent crime rates. This means that some variables may have been given too much importance when predicting violent crime rates because of the skewed data. To prevent this, penalized regressions or other algorithms that account for imbalanced data should have been used.
Violent crime is a part of many communities. However, communities can establish mechanisms to prevent many instances of violent crime from occurring or lessen the damage done by it. These statistical findings may lead to long-term solutions useful for municipal governments. For example, social services may be improved to help the most marginalized and vulnerable groups in society. This includes recent immigrants and struggling parents who may find counseling and temporary assistance helpful before they regain stability, as our study indicates that communities with high immigrant percentages and parents with large households are most often linked to higher crime rates.
Other methods of government intervention may also raise the socioeconomic status of the community overall. Policy reform, which may include providing more educational opportunities to disadvantaged communities, could potentially allow them to build long-term economic foundations and reduce unemployment. Reforming fiscal policies may also make it more affordable to live in the city. Generally, a better quality of life and greater economic stability have been shown to lower violent crime rates.
Future research could be done by comparing the leading causes of violent crime within a certain region over several decades to observe trends over time. Analysis of this data may improve policing, policy, and hence, safety within a community. Additionally, data from communities in different countries may be compared to address the rising concern of violent crime on a global scale.
1. Kuhn, Max. ‘caret’ Package. 10 Dec. 2017. Accessed 1 Dec. 2017.
2. UCI Machine Learning. “Communities and Crime Data Set.” University of California Irvine. Accessed 20 Dec. 2017.
3. Leung, Kevin et al. “Collision Statistics: A Study in Toronto Road Safety.” STEM Fellowship, 2016. Accessed 7 Dec. 2017.
4. Pamidimukkala, Anupya et al. “Diving into Debt: A Study on Factors Related to Debt Risk Score in Toronto.” STEM Fellowship, 2016. Accessed 7 Dec. 2017.
5. Farrington, David P. “Cross-national comparative research on criminal careers, risk factors, crime and punishment.” European Journal of Criminology, vol. 12, issue 4, pp. 386-399, 2013. Accessed 28 Dec. 2017.
6. Guthrie, Gerard. “Social factors affecting violent crime victimization in urban households.” Contemporary PNG Studies, vol. 18, pp. 35-54, May 2013. Accessed 28 Dec. 2017.
7. Dai, Tanaka. “Exploring factors affecting crime rates in Japan (1955-2012).” Thesis, Chuo University, 2014. Accessed 28 Dec. 2017.
8. Tseloni, Andromachi et al. “Exploring the international decline in crime rates.” European Journal of Criminology, 1 Sept. 2010. Accessed 28 Dec. 2017.
Additional graphs included for further clarity.