import pandas as pd
df = pd.read_excel('data/hate_Crimes_v2.xlsx')
32 Lab: Hate crimes
In session 7 (week 8) we discussed data and society: the academic and practice-based discourse on the social, political and ethical aspects of data science. We considered how one can responsibly carry out data science research on social phenomena, which ethical and social frameworks can help us critically approach data science practices and their effects on society, and what ethical practice looks like for data scientists.
32.1 Datasets
This week we will work with the following datasets:
- Hate crimes
- World Bank Indicators Dataset
- Office for National Statistics (ONS): Gender Pay Gap
32.1.1 Further datasets
- OECD Poverty gap
- Poverty & Equity Data Portal: from the Organisation for Economic Co-operation and Development (OECD) or from the World Bank
- NHS: multiple files. The NHS inequality challenge https://www.nuffieldtrust.org.uk/project/nhs-visual-data-challenge
- Health state life expectancies by Index of Multiple Deprivation (IMD 2015 and IMD 2019): England, all ages multiple publications
32.1.2 Additional Readings
- Indicators - critical reviews: The Poverty of Statistics and the Statistics of Poverty: https://www.tandfonline.com/doi/full/10.1080/01436590903321844?src=recsys
- Indicators in global health: arguments: indicators are usually comprehensible to a small group of experts. Why use indicators then? “Because indicators used in global HIV finance offer openings for engagement to promote accountability (…) some indicators and data truly are better than others, and as they were all created by humans, they all can be deconstructed and remade in other forms” Davis, S. (2020). The Uncounted: Politics of Data in Global Health, Cambridge. doi:10.1017/9781108649544
32.2 Hate Crimes
In this notebook we will be using the Hate Crimes dataset from FiveThirtyEight, which was used in the story Higher Rates Of Hate Crimes Are Tied To Income Inequality.
32.2.1 Variables:
Header | Definition
---|---
NAME | State name
median_household_income | Median household income, 2016
share_unemployed_seasonal | Share of the population that is unemployed (seasonally adjusted), Sept. 2016
share_population_in_metro_areas | Share of the population that lives in metropolitan areas, 2015
share_population_with_high_school_degree | Share of adults 25 and older with a high-school degree, 2009
share_non_citizen | Share of the population that are not U.S. citizens, 2015
share_white_poverty | Share of white residents who are living in poverty, 2015
gini_index | Gini Index, 2015
share_non_white | Share of the population that is not white, 2015
share_voters_voted_trump | Share of 2016 U.S. presidential voters who voted for Donald Trump
hate_crimes_per_100k_splc | Hate crimes per 100,000 population, Southern Poverty Law Center, Nov. 9-18, 2016
avg_hatecrimes_per_100k_fbi | Average annual hate crimes per 100,000 population, FBI, 2010-2015
Gini Index: measures income inequality. Gini Index values can range between 0 and 1, where 0 indicates perfect equality (everyone has the same income) and 1 indicates perfect inequality. You can read more about the Gini Index here: https://databank.worldbank.org/metadataglossary/world-development-indicators/series/SI.POV.GINI
32.3 Data exploration
We will again be using Altair and GeoPandas this week. If you are using the course’s virtual environment, these were installed the first time you set up your environment for the module. Refer to Appendix B for instructions on how to set up your environment.
A reminder: anything with a pd. prefix comes from pandas (since pd is the alias we have created for the pandas library). This is particularly useful for preventing a module from overwriting inbuilt Python functionality.
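To illustrate the point, here is a small sketch: the alias keeps pandas functions behind the pd. prefix, so they cannot clash with built-ins or with your own variable names.

# pd.unique lives in the pandas namespace; a bare name `unique` is not defined here
print(pd.unique(['a', 'b', 'a']))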
Let’s have a look at our dataset
# Retrieve the last ten rows of the df dataframe
df.tail(10)
NAME | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | South Dakota | 53053 | 0.035 | 0.51 | 0.899 | NaN | 0.08 | 0.442 | 0.17 | 0.62 | 0.00 | 3.30 |
42 | Tennessee | 43716 | 0.057 | 0.82 | 0.831 | 0.04 | 0.13 | 0.468 | 0.27 | 0.61 | 0.19 | 3.13 |
43 | Texas | 53875 | 0.042 | 0.92 | 0.799 | 0.11 | 0.08 | 0.469 | 0.56 | 0.53 | 0.21 | 0.75 |
44 | Utah | 63383 | 0.036 | 0.82 | 0.904 | 0.04 | 0.08 | 0.419 | 0.19 | 0.47 | 0.13 | 2.38 |
45 | Vermont | 60708 | 0.037 | 0.35 | 0.910 | 0.01 | 0.10 | 0.444 | 0.06 | 0.33 | 0.32 | 1.90 |
46 | Virginia | 66155 | 0.043 | 0.89 | 0.866 | 0.06 | 0.07 | 0.459 | 0.38 | 0.45 | 0.36 | 1.72 |
47 | Washington | 59068 | 0.052 | 0.86 | 0.897 | 0.08 | 0.09 | 0.441 | 0.31 | 0.38 | 0.67 | 3.81 |
48 | West Virginia | 39552 | 0.073 | 0.55 | 0.828 | 0.01 | 0.14 | 0.451 | 0.07 | 0.69 | 0.32 | 2.03 |
49 | Wisconsin | 58080 | 0.043 | 0.69 | 0.898 | 0.03 | 0.09 | 0.430 | 0.22 | 0.48 | 0.22 | 1.12 |
50 | Wyoming | 55690 | 0.040 | 0.31 | 0.918 | 0.02 | 0.09 | 0.423 | 0.15 | 0.70 | 0.00 | 0.26 |
# Is df indeed a DataFrame? Let's do a quick check
type(df)
pandas.core.frame.DataFrame
# What about the data types (Dtype) of the columns in df? We should also be aware of these so we can manipulate them effectively
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NAME 51 non-null object
1 median_household_income 51 non-null int64
2 share_unemployed_seasonal 51 non-null float64
3 share_population_in_metro_areas 51 non-null float64
4 share_population_with_high_school_degree 51 non-null float64
5 share_non_citizen 48 non-null float64
6 share_white_poverty 51 non-null float64
7 gini_index 51 non-null float64
8 share_non_white 51 non-null float64
9 share_voters_voted_trump 51 non-null float64
10 hate_crimes_per_100k_splc 51 non-null float64
11 avg_hatecrimes_per_100k_fbi 51 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 4.9+ KB
32.3.1 Missing values
Let’s explore the dataset further. The table above shows that we have some missing data for some of the states, as did df.tail(10) earlier on. Let’s check again.
df.isna().sum()
NAME 0
median_household_income 0
share_unemployed_seasonal 0
share_population_in_metro_areas 0
share_population_with_high_school_degree 0
share_non_citizen 3
share_white_poverty 0
gini_index 0
share_non_white 0
share_voters_voted_trump 0
hate_crimes_per_100k_splc 0
avg_hatecrimes_per_100k_fbi 0
dtype: int64
Hmmm, the column ‘share_non_citizen’ does indeed have some missing data. How about the column NAME (state names)? Is that looking ok?
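Before we check NAME, here is a minimal sketch for inspecting the rows that contain those missing values, simply by filtering on the NaN entries:

# Show the states for which share_non_citizen is missing
df[df['share_non_citizen'].isna()]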
import numpy as np
np.unique(df.NAME)
array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
'New Jersey', 'New Mexico', 'New York', 'North Carolina',
'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)
There aren’t any unexpected values in ‘NAME’.
# And how many states do we have the data for?
count_states = df['NAME'].nunique()
print(count_states)
51
Oh… one extra state! Which one is it? And is it really a state? Even if you don’t investigate this immediately, if this entry turns out to be a particularly interesting one later in your analysis, you may wish to dig deeper into the context!
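If you want a starting point for that investigation, one hedged sketch is to look up any entry that might not be a state: the np.unique output above includes 'District of Columbia', which is a federal district rather than a state.

# Inspect the entry that explains the count of 51
df[df['NAME'] == 'District of Columbia']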
32.4 Mapping hate crime across the USA
#We need the geospatial polygons of the states in America
import geopandas as gpd
import pandas as pd
import altair as alt
# Read geospatial data as a geospatial data frame (gdf)
gdf = gpd.read_file('data/gz_2010_us_040_00_500k.json')
gdf.head()
GEO_ID | STATE | NAME | LSAD | CENSUSAREA | geometry | |
---|---|---|---|---|---|---|
0 | 0400000US23 | 23 | Maine | 30842.923 | MULTIPOLYGON (((-67.61976 44.51975, -67.61541 ... | |
1 | 0400000US25 | 25 | Massachusetts | 7800.058 | MULTIPOLYGON (((-70.83204 41.6065, -70.82374 4... | |
2 | 0400000US26 | 26 | Michigan | 56538.901 | MULTIPOLYGON (((-88.68443 48.11578, -88.67563 ... | |
3 | 0400000US30 | 30 | Montana | 145545.801 | POLYGON ((-104.0577 44.99743, -104.25014 44.99... | |
4 | 0400000US32 | 32 | Nevada | 109781.180 | POLYGON ((-114.0506 37.0004, -114.05 36.95777,... |
# Checking what type gdf is...
type(gdf)
geopandas.geodataframe.GeoDataFrame
As in the previous week, we have a column called geometry that allows us to project the data onto a 2D map.
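If you want a very quick look at these geometries before building the Altair chart, GeoPandas can draw them directly (a minimal sketch; this relies on matplotlib being installed in your environment):

# Quick sanity-check plot of the state polygons
gdf.plot(figsize=(8, 4))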
# Calling the Altair alias (alt) to help us create the map of USA - more technically speaking,
# creating a Chart object using Altair with the following properties
alt.Chart(gdf, title='US states').mark_geoshape().encode(
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)
# Add the data
# You might recall that the df DataFrame and the gdf GeoDataFrame both have a NAME column
geo_states = gdf.merge(df, on='NAME')

geo_states.head()
GEO_ID | STATE | NAME | LSAD | CENSUSAREA | geometry | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0400000US23 | 23 | Maine | 30842.923 | MULTIPOLYGON (((-67.61976 44.51975, -67.61541 ... | 51710 | 0.044 | 0.54 | 0.902 | NaN | 0.12 | 0.437 | 0.09 | 0.45 | 0.61 | 2.62 | |
1 | 0400000US25 | 25 | Massachusetts | 7800.058 | MULTIPOLYGON (((-70.83204 41.6065, -70.82374 4... | 63151 | 0.046 | 0.97 | 0.890 | 0.09 | 0.08 | 0.475 | 0.27 | 0.34 | 0.63 | 4.80 | |
2 | 0400000US26 | 26 | Michigan | 56538.901 | MULTIPOLYGON (((-88.68443 48.11578, -88.67563 ... | 52005 | 0.050 | 0.87 | 0.879 | 0.04 | 0.09 | 0.451 | 0.24 | 0.48 | 0.40 | 3.20 | |
3 | 0400000US30 | 30 | Montana | 145545.801 | POLYGON ((-104.0577 44.99743, -104.25014 44.99... | 51102 | 0.041 | 0.34 | 0.908 | 0.01 | 0.10 | 0.435 | 0.10 | 0.57 | 0.49 | 2.95 | |
4 | 0400000US32 | 32 | Nevada | 109781.180 | POLYGON ((-114.0506 37.0004, -114.05 36.95777,... | 49875 | 0.067 | 0.87 | 0.839 | 0.10 | 0.08 | 0.448 | 0.50 | 0.46 | 0.14 | 2.11 |
# Let's take a look at the merged GeoDataFrame
geo_states.head(10)
GEO_ID | STATE | NAME | LSAD | CENSUSAREA | geometry | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0400000US23 | 23 | Maine | 30842.923 | MULTIPOLYGON (((-67.61976 44.51975, -67.61541 ... | 51710 | 0.044 | 0.54 | 0.902 | NaN | 0.12 | 0.437 | 0.09 | 0.45 | 0.61 | 2.62 | |
1 | 0400000US25 | 25 | Massachusetts | 7800.058 | MULTIPOLYGON (((-70.83204 41.6065, -70.82374 4... | 63151 | 0.046 | 0.97 | 0.890 | 0.09 | 0.08 | 0.475 | 0.27 | 0.34 | 0.63 | 4.80 | |
2 | 0400000US26 | 26 | Michigan | 56538.901 | MULTIPOLYGON (((-88.68443 48.11578, -88.67563 ... | 52005 | 0.050 | 0.87 | 0.879 | 0.04 | 0.09 | 0.451 | 0.24 | 0.48 | 0.40 | 3.20 | |
3 | 0400000US30 | 30 | Montana | 145545.801 | POLYGON ((-104.0577 44.99743, -104.25014 44.99... | 51102 | 0.041 | 0.34 | 0.908 | 0.01 | 0.10 | 0.435 | 0.10 | 0.57 | 0.49 | 2.95 | |
4 | 0400000US32 | 32 | Nevada | 109781.180 | POLYGON ((-114.0506 37.0004, -114.05 36.95777,... | 49875 | 0.067 | 0.87 | 0.839 | 0.10 | 0.08 | 0.448 | 0.50 | 0.46 | 0.14 | 2.11 | |
5 | 0400000US34 | 34 | New Jersey | 7354.220 | POLYGON ((-75.52684 39.65571, -75.52634 39.656... | 65243 | 0.056 | 1.00 | 0.874 | 0.11 | 0.07 | 0.464 | 0.44 | 0.42 | 0.07 | 4.41 | |
6 | 0400000US36 | 36 | New York | 47126.399 | MULTIPOLYGON (((-71.94356 41.28668, -71.9268 4... | 54310 | 0.051 | 0.94 | 0.847 | 0.10 | 0.10 | 0.499 | 0.42 | 0.37 | 0.35 | 3.10 | |
7 | 0400000US37 | 37 | North Carolina | 48617.905 | MULTIPOLYGON (((-82.60288 36.03983, -82.60074 ... | 46784 | 0.058 | 0.76 | 0.843 | 0.05 | 0.10 | 0.464 | 0.38 | 0.51 | 0.24 | 1.26 | |
8 | 0400000US39 | 39 | Ohio | 40860.694 | MULTIPOLYGON (((-82.81349 41.72347, -82.81049 ... | 49644 | 0.045 | 0.75 | 0.876 | 0.03 | 0.10 | 0.452 | 0.21 | 0.52 | 0.19 | 3.24 | |
9 | 0400000US42 | 42 | Pennsylvania | 44742.703 | POLYGON ((-75.41504 39.80179, -75.42804 39.809... | 55173 | 0.053 | 0.87 | 0.879 | 0.03 | 0.09 | 0.461 | 0.24 | 0.49 | 0.28 | 0.43 |
Let’s start making our visualisations and see if we can spot any trends or patterns
# Let's first check how hate crimes looked pre-election
chart_pre_election = alt.Chart(geo_states, title='PRE-election Hate crime per 100k').mark_geoshape().encode(
    color='avg_hatecrimes_per_100k_fbi',
    tooltip=['NAME', 'avg_hatecrimes_per_100k_fbi']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)
chart_pre_election
As you can see above, Altair has chosen a color for each US state based on the range of values in the avg_hatecrimes_per_100k_fbi column. We have also created a tooltip, so hover over the map and check the crime rates. Which ones are particularly high, average, or low? Also, is there a dark spot between Virginia and Maryland? What’s happening there? Remember: context matters for data analysis!
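If you want more control over how values are mapped to colours, you could pass an explicit alt.Color encoding instead of the plain column name. A sketch is shown below; the 'reds' scheme is just an example choice, not part of the original notebook.

# Same map, but with an explicit colour scheme for the encoding
alt.Chart(geo_states, title='PRE-election Hate crime per 100k').mark_geoshape().encode(
    color=alt.Color('avg_hatecrimes_per_100k_fbi', scale=alt.Scale(scheme='reds')),
    tooltip=['NAME', 'avg_hatecrimes_per_100k_fbi']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)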
# Ok, what about the post election status?
chart_post_election = alt.Chart(geo_states, title='POST-election Hate crime per 100k').mark_geoshape().encode(
    color='hate_crimes_per_100k_splc',
    tooltip=['NAME', 'hate_crimes_per_100k_splc']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)
chart_post_election
# Perhaps we can arrange the maps side-by-side to compare better?
pre_and_post_map = chart_pre_election | chart_post_election
pre_and_post_map
Oh, what’s happening here? We better investigate:
Identify why the maps (particularly one of them) look so different now. Go back to the original maps to check. Go back to the description of the variables to check. Also, visit the source article.
Once you have identified the issue, can you find a way to address the issue?
32.4.1 Exploring data
import seaborn as sns
sns.pairplot(data=df.iloc[:, 1:])
The above plot may be hard to read without squinting (and may take a while to run on some devices), but it’s definitely worth a closer look if you are able to. Check the histograms along the diagonal - what do they show about the distribution of each variable? For example, what does the gini_index distribution tell us? With respect to the scatter plots, some look fairly random while others suggest positive or negative correlations. You may wish to investigate what’s happening! And, you might remember (as we also discussed in the video recordings this week), correlation != causation!
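If one of those scatter plots catches your eye, a quick sketch for checking a single pairwise correlation (here between the Gini index and the FBI hate crime rate, purely as an example pair) is:

# Pearson correlation between two columns of interest
df['gini_index'].corr(df['avg_hatecrimes_per_100k_fbi'])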
# Let's take a look at the income range in the country
df.boxplot(column=['median_household_income'])
# And the average hatecrimes based on FBI data next (also, average over what? check the Variables description again to remind you if need be)
df.boxplot(column=['avg_hatecrimes_per_100k_fbi'])
We may want to drop (remove) some states; see the pandas drop documentation for more details.
Let us drop Hawaii (which is one of the states outside mainland USA)
# Let's find out the index value of the state in the DataFrame df
df[df.NAME == 'Hawaii']
NAME | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | Hawaii | 71223 | 0.034 | 0.76 | 0.904 | 0.08 | 0.07 | 0.433 | 0.81 | 0.3 | 0.0 | 0.0 |
# Let's look at a summary of numeric columns prior to dropping Hawaii
df.describe()
median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 48.000000 | 51.000000 | 51.000000 | 51.000000 | 51.00000 | 51.000000 | 51.000000 |
mean | 55223.607843 | 0.049569 | 0.750196 | 0.869118 | 0.054583 | 0.091765 | 0.453765 | 0.315686 | 0.49000 | 0.275686 | 2.316863 |
std | 9208.478170 | 0.010698 | 0.181587 | 0.034073 | 0.031077 | 0.024715 | 0.020891 | 0.164915 | 0.11871 | 0.256252 | 1.729228 |
min | 35521.000000 | 0.028000 | 0.310000 | 0.799000 | 0.010000 | 0.040000 | 0.419000 | 0.060000 | 0.04000 | 0.000000 | 0.000000 |
25% | 48657.000000 | 0.042000 | 0.630000 | 0.840500 | 0.030000 | 0.075000 | 0.440000 | 0.195000 | 0.41500 | 0.125000 | 1.270000 |
50% | 54916.000000 | 0.051000 | 0.790000 | 0.874000 | 0.045000 | 0.090000 | 0.454000 | 0.280000 | 0.49000 | 0.210000 | 1.930000 |
75% | 60719.000000 | 0.057500 | 0.895000 | 0.898000 | 0.080000 | 0.100000 | 0.466500 | 0.420000 | 0.57500 | 0.340000 | 3.165000 |
max | 76165.000000 | 0.073000 | 1.000000 | 0.918000 | 0.130000 | 0.170000 | 0.532000 | 0.810000 | 0.70000 | 1.520000 | 10.950000 |
# Let's now drop Hawaii
df = df.drop(df.index[11])
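Note that dropping by positional index (df.index[11]) assumes Hawaii still sits at position 11. A boolean filter on the state name is a less fragile alternative; a small sketch:

# Equivalent result, keyed on the state name rather than its position
df_no_hawaii = df[df.NAME != 'Hawaii']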
# Now check again for the statistical summary
df.describe()
median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 47.000000 | 50.000000 | 50.000000 | 50.000000 | 50.00000 | 50.000000 | 50.000000 |
mean | 54903.620000 | 0.049880 | 0.750000 | 0.868420 | 0.054043 | 0.092200 | 0.454180 | 0.305800 | 0.49380 | 0.281200 | 2.363200 |
std | 9010.994814 | 0.010571 | 0.183425 | 0.034049 | 0.031184 | 0.024767 | 0.020889 | 0.150551 | 0.11674 | 0.255779 | 1.714502 |
min | 35521.000000 | 0.028000 | 0.310000 | 0.799000 | 0.010000 | 0.040000 | 0.419000 | 0.060000 | 0.04000 | 0.000000 | 0.260000 |
25% | 48358.500000 | 0.042250 | 0.630000 | 0.839750 | 0.030000 | 0.080000 | 0.440000 | 0.192500 | 0.42000 | 0.130000 | 1.290000 |
50% | 54613.000000 | 0.051000 | 0.790000 | 0.874000 | 0.040000 | 0.090000 | 0.454500 | 0.275000 | 0.49500 | 0.215000 | 1.980000 |
75% | 60652.750000 | 0.057750 | 0.897500 | 0.897750 | 0.080000 | 0.100000 | 0.466750 | 0.420000 | 0.57750 | 0.345000 | 3.182500 |
max | 76165.000000 | 0.073000 | 1.000000 | 0.918000 | 0.130000 | 0.170000 | 0.532000 | 0.630000 | 0.70000 | 1.520000 | 10.950000 |
There seem to be some changes.
# Let's dig deeper into the correlation between median household income and hate crimes based on FBI data
df.plot(x='avg_hatecrimes_per_100k_fbi', y='median_household_income', kind='scatter')
# And the relationship between median household income and hatecrimes based on SPLC data
df.plot(x='hate_crimes_per_100k_splc', y='median_household_income', kind='scatter')
Hmmm, there doesn’t appear to be a strong (linear!) correlation, but there is certainly a cluster and some outliers! That’s our cue - let’s find out which states might be outliers using the standard deviation function std.
df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]
NAME | median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | District of Columbia | 68277 | 0.067 | 1.00 | 0.871 | 0.11 | 0.04 | 0.532 | 0.63 | 0.04 | 1.52 | 10.95 |
37 | Oregon | 58875 | 0.062 | 0.87 | 0.891 | 0.07 | 0.10 | 0.449 | 0.26 | 0.41 | 0.83 | 3.39 |
47 | Washington | 59068 | 0.052 | 0.86 | 0.897 | 0.08 | 0.09 | 0.441 | 0.31 | 0.38 | 0.67 | 3.81 |
Remember that we discussed ‘context’ earlier when we found 51 states? If your investigation paid off there, you can make better sense of the outliers here.
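As an aside, the filter above compares the raw values against 2.5 standard deviations; a more common rule of thumb (shown here only as a sketch, not as the notebook’s method) measures the distance from the mean instead:

# Flag values more than 2.5 standard deviations above the mean
threshold = df.hate_crimes_per_100k_splc.mean() + 2.5 * df.hate_crimes_per_100k_splc.std()
df[df.hate_crimes_per_100k_splc > threshold]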
# Let's try to make the outliers more obvious
import matplotlib.pyplot as plt
outliers_df = df[df.hate_crimes_per_100k_splc > (np.std(df.hate_crimes_per_100k_splc) * 2.5)]
df.plot(x='hate_crimes_per_100k_splc', y='median_household_income', kind='scatter')
plt.scatter(outliers_df.hate_crimes_per_100k_splc, outliers_df.median_household_income, c='red')
# Let's create a pivot table to focus on specific columns of interest
df_pivot = df.pivot_table(index=['NAME'], values=['hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi', 'median_household_income'])
df_pivot
avg_hatecrimes_per_100k_fbi | hate_crimes_per_100k_splc | median_household_income | |
---|---|---|---|
NAME | |||
Alabama | 1.80 | 0.12 | 42278.0 |
Alaska | 1.65 | 0.14 | 67629.0 |
Arizona | 3.41 | 0.22 | 49254.0 |
Arkansas | 0.86 | 0.06 | 44922.0 |
California | 2.39 | 0.25 | 60487.0 |
Colorado | 2.80 | 0.39 | 60940.0 |
Connecticut | 3.77 | 0.33 | 70161.0 |
Delaware | 1.46 | 0.32 | 57522.0 |
District of Columbia | 10.95 | 1.52 | 68277.0 |
Florida | 0.69 | 0.18 | 46140.0 |
Georgia | 0.41 | 0.12 | 49555.0 |
Idaho | 1.89 | 0.12 | 53438.0 |
Illinois | 1.04 | 0.19 | 54916.0 |
Indiana | 1.75 | 0.24 | 48060.0 |
Iowa | 0.56 | 0.45 | 57810.0 |
Kansas | 2.14 | 0.10 | 53444.0 |
Kentucky | 4.20 | 0.32 | 42786.0 |
Louisiana | 1.34 | 0.10 | 42406.0 |
Maine | 2.62 | 0.61 | 51710.0 |
Maryland | 1.32 | 0.37 | 76165.0 |
Massachusetts | 4.80 | 0.63 | 63151.0 |
Michigan | 3.20 | 0.40 | 52005.0 |
Minnesota | 3.61 | 0.62 | 67244.0 |
Mississippi | 0.62 | 0.06 | 35521.0 |
Missouri | 1.90 | 0.18 | 56630.0 |
Montana | 2.95 | 0.49 | 51102.0 |
Nebraska | 2.68 | 0.15 | 56870.0 |
Nevada | 2.11 | 0.14 | 49875.0 |
New Hampshire | 2.10 | 0.15 | 73397.0 |
New Jersey | 4.41 | 0.07 | 65243.0 |
New Mexico | 1.88 | 0.29 | 46686.0 |
New York | 3.10 | 0.35 | 54310.0 |
North Carolina | 1.26 | 0.24 | 46784.0 |
North Dakota | 4.74 | 0.00 | 60730.0 |
Ohio | 3.24 | 0.19 | 49644.0 |
Oklahoma | 1.08 | 0.13 | 47199.0 |
Oregon | 3.39 | 0.83 | 58875.0 |
Pennsylvania | 0.43 | 0.28 | 55173.0 |
Rhode Island | 1.28 | 0.09 | 58633.0 |
South Carolina | 1.93 | 0.20 | 44929.0 |
South Dakota | 3.30 | 0.00 | 53053.0 |
Tennessee | 3.13 | 0.19 | 43716.0 |
Texas | 0.75 | 0.21 | 53875.0 |
Utah | 2.38 | 0.13 | 63383.0 |
Vermont | 1.90 | 0.32 | 60708.0 |
Virginia | 1.72 | 0.36 | 66155.0 |
Washington | 3.81 | 0.67 | 59068.0 |
West Virginia | 2.03 | 0.32 | 39552.0 |
Wisconsin | 1.12 | 0.22 | 58080.0 |
Wyoming | 0.26 | 0.00 | 55690.0 |
# the pivot table seems sorted by state names, let's sort by FBI hate crime data instead
df_pivot.sort_values(by=['avg_hatecrimes_per_100k_fbi'], ascending=False)
avg_hatecrimes_per_100k_fbi | hate_crimes_per_100k_splc | median_household_income | |
---|---|---|---|
NAME | |||
District of Columbia | 10.95 | 1.52 | 68277.0 |
Massachusetts | 4.80 | 0.63 | 63151.0 |
North Dakota | 4.74 | 0.00 | 60730.0 |
New Jersey | 4.41 | 0.07 | 65243.0 |
Kentucky | 4.20 | 0.32 | 42786.0 |
Washington | 3.81 | 0.67 | 59068.0 |
Connecticut | 3.77 | 0.33 | 70161.0 |
Minnesota | 3.61 | 0.62 | 67244.0 |
Arizona | 3.41 | 0.22 | 49254.0 |
Oregon | 3.39 | 0.83 | 58875.0 |
South Dakota | 3.30 | 0.00 | 53053.0 |
Ohio | 3.24 | 0.19 | 49644.0 |
Michigan | 3.20 | 0.40 | 52005.0 |
Tennessee | 3.13 | 0.19 | 43716.0 |
New York | 3.10 | 0.35 | 54310.0 |
Montana | 2.95 | 0.49 | 51102.0 |
Colorado | 2.80 | 0.39 | 60940.0 |
Nebraska | 2.68 | 0.15 | 56870.0 |
Maine | 2.62 | 0.61 | 51710.0 |
California | 2.39 | 0.25 | 60487.0 |
Utah | 2.38 | 0.13 | 63383.0 |
Kansas | 2.14 | 0.10 | 53444.0 |
Nevada | 2.11 | 0.14 | 49875.0 |
New Hampshire | 2.10 | 0.15 | 73397.0 |
West Virginia | 2.03 | 0.32 | 39552.0 |
South Carolina | 1.93 | 0.20 | 44929.0 |
Vermont | 1.90 | 0.32 | 60708.0 |
Missouri | 1.90 | 0.18 | 56630.0 |
Idaho | 1.89 | 0.12 | 53438.0 |
New Mexico | 1.88 | 0.29 | 46686.0 |
Alabama | 1.80 | 0.12 | 42278.0 |
Indiana | 1.75 | 0.24 | 48060.0 |
Virginia | 1.72 | 0.36 | 66155.0 |
Alaska | 1.65 | 0.14 | 67629.0 |
Delaware | 1.46 | 0.32 | 57522.0 |
Louisiana | 1.34 | 0.10 | 42406.0 |
Maryland | 1.32 | 0.37 | 76165.0 |
Rhode Island | 1.28 | 0.09 | 58633.0 |
North Carolina | 1.26 | 0.24 | 46784.0 |
Wisconsin | 1.12 | 0.22 | 58080.0 |
Oklahoma | 1.08 | 0.13 | 47199.0 |
Illinois | 1.04 | 0.19 | 54916.0 |
Arkansas | 0.86 | 0.06 | 44922.0 |
Texas | 0.75 | 0.21 | 53875.0 |
Florida | 0.69 | 0.18 | 46140.0 |
Mississippi | 0.62 | 0.06 | 35521.0 |
Iowa | 0.56 | 0.45 | 57810.0 |
Pennsylvania | 0.43 | 0.28 | 55173.0 |
Georgia | 0.41 | 0.12 | 49555.0 |
Wyoming | 0.26 | 0.00 | 55690.0 |
# Let's standardise our data before we attempt further modelling using the data
from sklearn import preprocessing
import numpy as np
# Select the numeric columns we want to standardise
df_stand = df[['median_household_income', 'share_unemployed_seasonal', 'share_population_in_metro_areas',
               'share_population_with_high_school_degree', 'share_non_citizen', 'share_white_poverty',
               'gini_index', 'share_non_white', 'share_voters_voted_trump',
               'hate_crimes_per_100k_splc', 'avg_hatecrimes_per_100k_fbi']]

# Keep the column names so we can rebuild a DataFrame later
names = df_stand.columns

# Create the Scaler object for standardising the data
scaler = preprocessing.StandardScaler()

# Fit the scaler on our data and transform it
df2 = scaler.fit_transform(df_stand)
# check what type is df2
type(df2)
numpy.ndarray
# Let's convert the numpy array into a DataFrame before further processing
df2 = pd.DataFrame(df2, columns=names)
df2.tail(10)
median_household_income | share_unemployed_seasonal | share_population_in_metro_areas | share_population_with_high_school_degree | share_non_citizen | share_white_poverty | gini_index | share_non_white | share_voters_voted_trump | hate_crimes_per_100k_splc | avg_hatecrimes_per_100k_fbi | |
---|---|---|---|---|---|---|---|---|---|---|---|
40 | -0.207459 | -1.421951 | -1.321717 | 0.907229 | NaN | -0.497582 | -0.588999 | -0.911176 | 1.092014 | -1.110547 | 0.551945 |
41 | -1.254157 | 0.680396 | 0.385501 | -1.110154 | -0.455183 | 1.541689 | 0.668306 | -0.240207 | 1.005484 | -0.360177 | 0.451784 |
42 | -0.115311 | -0.753023 | 0.936216 | -2.059511 | 1.813836 | -0.497582 | 0.716664 | 1.705604 | 0.313240 | -0.281191 | -0.950467 |
43 | 0.950557 | -1.326390 | 0.385501 | 1.055566 | -0.455183 | -0.497582 | -1.701230 | -0.776982 | -0.205942 | -0.597136 | 0.009898 |
44 | 0.650684 | -1.230829 | -2.202862 | 1.233571 | -1.427620 | 0.318126 | -0.492283 | -1.649243 | -1.417369 | 0.153233 | -0.272909 |
45 | 1.261305 | -0.657461 | 0.771002 | -0.071795 | 0.193108 | -0.905436 | 0.233085 | 0.497859 | -0.379003 | 0.311206 | -0.378961 |
46 | 0.466836 | 0.202590 | 0.605787 | 0.847894 | 0.841399 | -0.089728 | -0.637357 | 0.028181 | -0.984716 | 1.535493 | 0.852428 |
47 | -1.720951 | 2.209376 | -1.101431 | -1.199157 | -1.427620 | 1.949543 | -0.153778 | -1.582146 | 1.697727 | 0.153233 | -0.196315 |
48 | 0.356079 | -0.657461 | -0.330429 | 0.877562 | -0.779329 | -0.089728 | -1.169293 | -0.575692 | -0.119412 | -0.241698 | -0.732470 |
49 | 0.088155 | -0.944145 | -2.423149 | 1.470910 | -1.103475 | -0.089728 | -1.507798 | -1.045370 | 1.784258 | -1.110547 | -1.239166 |
# Now that our data has been standardised, let's look at the distribution across all columns
ax = sns.boxplot(data=df2, orient="h")
import scipy.stats
# Let's create a correlation matrix by computing the pairwise correlation of numerical columns, rounding correlation values to two places
corrMatrix = df2.corr(numeric_only=True).round(2)
print(corrMatrix)
median_household_income \
median_household_income 1.00
share_unemployed_seasonal -0.34
share_population_in_metro_areas 0.29
share_population_with_high_school_degree 0.64
share_non_citizen 0.28
share_white_poverty -0.82
gini_index -0.15
share_non_white -0.00
share_voters_voted_trump -0.57
hate_crimes_per_100k_splc 0.33
avg_hatecrimes_per_100k_fbi 0.32
share_unemployed_seasonal \
median_household_income -0.34
share_unemployed_seasonal 1.00
share_population_in_metro_areas 0.37
share_population_with_high_school_degree -0.61
share_non_citizen 0.31
share_white_poverty 0.19
gini_index 0.53
share_non_white 0.59
share_voters_voted_trump -0.21
hate_crimes_per_100k_splc 0.18
avg_hatecrimes_per_100k_fbi 0.07
share_population_in_metro_areas \
median_household_income 0.29
share_unemployed_seasonal 0.37
share_population_in_metro_areas 1.00
share_population_with_high_school_degree -0.27
share_non_citizen 0.75
share_white_poverty -0.39
gini_index 0.52
share_non_white 0.60
share_voters_voted_trump -0.58
hate_crimes_per_100k_splc 0.26
avg_hatecrimes_per_100k_fbi 0.21
share_population_with_high_school_degree \
median_household_income 0.64
share_unemployed_seasonal -0.61
share_population_in_metro_areas -0.27
share_population_with_high_school_degree 1.00
share_non_citizen -0.30
share_white_poverty -0.48
gini_index -0.58
share_non_white -0.56
share_voters_voted_trump -0.13
hate_crimes_per_100k_splc 0.21
avg_hatecrimes_per_100k_fbi 0.16
share_non_citizen \
median_household_income 0.28
share_unemployed_seasonal 0.31
share_population_in_metro_areas 0.75
share_population_with_high_school_degree -0.30
share_non_citizen 1.00
share_white_poverty -0.38
gini_index 0.51
share_non_white 0.76
share_voters_voted_trump -0.62
hate_crimes_per_100k_splc 0.28
avg_hatecrimes_per_100k_fbi 0.30
share_white_poverty gini_index \
median_household_income -0.82 -0.15
share_unemployed_seasonal 0.19 0.53
share_population_in_metro_areas -0.39 0.52
share_population_with_high_school_degree -0.48 -0.58
share_non_citizen -0.38 0.51
share_white_poverty 1.00 0.01
gini_index 0.01 1.00
share_non_white -0.24 0.59
share_voters_voted_trump 0.54 -0.46
hate_crimes_per_100k_splc -0.26 0.38
avg_hatecrimes_per_100k_fbi -0.26 0.42
share_non_white \
median_household_income -0.00
share_unemployed_seasonal 0.59
share_population_in_metro_areas 0.60
share_population_with_high_school_degree -0.56
share_non_citizen 0.76
share_white_poverty -0.24
gini_index 0.59
share_non_white 1.00
share_voters_voted_trump -0.44
hate_crimes_per_100k_splc 0.12
avg_hatecrimes_per_100k_fbi 0.08
share_voters_voted_trump \
median_household_income -0.57
share_unemployed_seasonal -0.21
share_population_in_metro_areas -0.58
share_population_with_high_school_degree -0.13
share_non_citizen -0.62
share_white_poverty 0.54
gini_index -0.46
share_non_white -0.44
share_voters_voted_trump 1.00
hate_crimes_per_100k_splc -0.69
avg_hatecrimes_per_100k_fbi -0.50
hate_crimes_per_100k_splc \
median_household_income 0.33
share_unemployed_seasonal 0.18
share_population_in_metro_areas 0.26
share_population_with_high_school_degree 0.21
share_non_citizen 0.28
share_white_poverty -0.26
gini_index 0.38
share_non_white 0.12
share_voters_voted_trump -0.69
hate_crimes_per_100k_splc 1.00
avg_hatecrimes_per_100k_fbi 0.68
avg_hatecrimes_per_100k_fbi
median_household_income 0.32
share_unemployed_seasonal 0.07
share_population_in_metro_areas 0.21
share_population_with_high_school_degree 0.16
share_non_citizen 0.30
share_white_poverty -0.26
gini_index 0.42
share_non_white 0.08
share_voters_voted_trump -0.50
hate_crimes_per_100k_splc 0.68
avg_hatecrimes_per_100k_fbi 1.00
Look at the positive and negative correlation values above. What do they suggest, and how strong, weak or moderate are the correlations?
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
# Let's create a heatmap to visualise the pairwise correlations for better understanding
corrMatrix = df2.corr(numeric_only=True).round(1)  # Rounding to 1 decimal place so it's easier to read given the number of variables
sn.heatmap(corrMatrix, annot=True)
plt.show()
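The matrix is symmetric, so you could also mask the upper triangle and show each correlation only once. A small optional sketch:

# Mask the upper triangle of the (symmetric) correlation matrix
mask = np.triu(np.ones_like(corrMatrix, dtype=bool))
sn.heatmap(corrMatrix, annot=True, mask=mask)
plt.show()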
# Let's now perform a linear regression on our data
# Try the commented code after you run this first
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics
x = df2[['median_household_income', 'share_population_with_high_school_degree', 'share_voters_voted_trump']]
y = df2[['avg_hatecrimes_per_100k_fbi']]
# What if we change the data (y variable)?
# y = df2[['hate_crimes_per_100k_splc']]

est = LinearRegression(fit_intercept=True)
est.fit(x, y)

print("Coefficients:", est.coef_)
print("Intercept:", est.intercept_)

# Generate predictions from our linear regression model, and check the MSE, R-squared and variance measures to assess performance
y_hat = est.predict(x)
print("MSE:", metrics.mean_squared_error(y, y_hat))
print("R^2:", metrics.r2_score(y, y_hat))
print("var:", y.var())
Coefficients: [[-0.0861606 0.15199564 -0.53470881]]
Intercept: [2.12745505e-16]
MSE: 0.7326374635746324
R^2: 0.2673625364253678
var: avg_hatecrimes_per_100k_fbi 1.020408
dtype: float64
What do these values suggest?
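One way to make the coefficients easier to interpret is to pair each one with its feature name. A small sketch using the x and est objects defined above:

# Print each predictor alongside its estimated coefficient
for name, coef in zip(x.columns, est.coef_[0]):
    print(f"{name}: {coef:.3f}")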
Remember the earlier note about the maps looking different? Were you able to identify what’s going on there?
Let’s revisit that part again
The maps did not share comparable scales. The first map showed the average annual crime rate per 100k residents, while the second showed the total number of incidents per 100k residents for only the ten days following the 2016 election. How can we fix this discrepancy?
Tweak the ??s below to visualise the changes.
# We can generate two new "features" and add them to the DataFrame df
df['hate_crimes_per_100k_splc_perday'] = df['hate_crimes_per_100k_splc'] / ??
# The 'avg_hatecrimes_per_100k_fbi' column is an annual incidence average between 2010-15, so each data value is the number of incidents (per 100k residents) in an average year
df['avg_hatecrimes_per_100k_fbi_perday'] = df['avg_hatecrimes_per_100k_fbi'] / ???
Cell In[35], line 2
    df['hate_crimes_per_100k_splc_perday'] = df['hate_crimes_per_100k_splc'] / ??
                                                                               ^
SyntaxError: invalid syntax
# Update geo_states
geo_states = geo_states.merge(df, on='????')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: '????'
# Let's plot again
# First the PRE election map
pre_election_map = alt.Chart(geo_states, title='PRE-election Hate crime per 100k per day').mark_geoshape().encode(
    alt.Color('avg_hatecrimes_per_100k_fbi_perday', scale=alt.Scale(domain=[0, 0.15])),
    tooltip=['NAME', 'avg_hatecrimes_per_100k_fbi_perday']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

post_election_map = alt.Chart(geo_states, title='POST-election Hate crime per 100k per day').mark_geoshape().encode(
    alt.Color('hate_crimes_per_100k_splc_perday', scale=alt.Scale(domain=[0, 0.15])),
    tooltip=['NAME', 'hate_crimes_per_100k_splc_perday']
).properties(
    width=500,
    height=300
).project(
    type='albersUsa'
)

new_combined_map = pre_election_map | post_election_map
new_combined_map
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: Unable to determine data type for the field "avg_hatecrimes_per_100k_fbi_perday"; verify that the field name is not misspelled. If you are referencing a field from a transform, also confirm that the data type is specified correctly.
alt.HConcatChart(...)
How is that now?