One way to create potentially misleading visualisations is by manipulating the axes of a plot. Here we illustrate these using one of the FiveThirtyEight data sets, which are available here .
Data wrangling
We are going to use polls from the 2020 USA presidential election. As before, we load and examine the data.
import pandas as pd
import seaborn as sns
import altair as alt
df_polls = pd.read_csv('data/presidential_poll_averages_2020.csv' )
df_polls.head()
0
2020
Wyoming
11/3/2020
Joseph R. Biden Jr.
30.81486
30.82599
1
2020
Wisconsin
11/3/2020
Joseph R. Biden Jr.
52.12642
52.09584
2
2020
West Virginia
11/3/2020
Joseph R. Biden Jr.
33.49125
33.51517
3
2020
Washington
11/3/2020
Joseph R. Biden Jr.
59.34201
59.39408
4
2020
Virginia
11/3/2020
Joseph R. Biden Jr.
53.74120
53.72101
For our analysis, we are going to pick estimates from 11/3/2020 for the swing states of Florida, Texas, Arizona, Michigan, Minnesota and Pennsylvania.
df_nov = df_polls[
(df_polls.modeldate == '11/3/2020' )
]
df_nov = df_nov[
(df_nov.candidate_name == 'Joseph R. Biden Jr.' ) |
(df_nov.candidate_name == 'Donald Trump' )
]
df_swing = df_nov[
(df_nov['state' ] == 'Florida' ) |
(df_nov['state' ] == 'Texas' ) |
(df_nov['state' ] == 'Arizona' ) |
(df_nov['state' ] == 'Michigan' ) |
(df_nov['state' ] == 'Minnesota' ) |
(df_nov['state' ] == 'Pennsylvania' )
]
df_swing
7
2020
Texas
11/3/2020
Joseph R. Biden Jr.
47.46643
47.44781
12
2020
Pennsylvania
11/3/2020
Joseph R. Biden Jr.
50.22000
50.20422
30
2020
Minnesota
11/3/2020
Joseph R. Biden Jr.
51.86992
51.84517
31
2020
Michigan
11/3/2020
Joseph R. Biden Jr.
51.17806
51.15482
46
2020
Florida
11/3/2020
Joseph R. Biden Jr.
49.09162
49.08035
53
2020
Arizona
11/3/2020
Joseph R. Biden Jr.
48.72237
48.70539
63
2020
Texas
11/3/2020
Donald Trump
48.57118
48.58794
68
2020
Pennsylvania
11/3/2020
Donald Trump
45.57216
45.55034
86
2020
Minnesota
11/3/2020
Donald Trump
42.63638
42.66826
87
2020
Michigan
11/3/2020
Donald Trump
43.20577
43.23326
102
2020
Florida
11/3/2020
Donald Trump
46.68101
46.61909
109
2020
Arizona
11/3/2020
Donald Trump
46.11074
46.10181
Default barplot
We can look at the relative performance of the candidates within each state using a nested bar plot.
ax = sns.barplot(
data = df_swing,
x = 'state' ,
y = 'pct_estimate' ,
hue = 'candidate_name' )
Altering the axes
Altering the axis increases the distance between the bars. Some might say that is misleading.
ax = sns.barplot(
data = df_swing,
x = 'state' ,
y = 'pct_estimate' ,
hue = 'candidate_name' )
ax.set (ylim= (41 , 52 ))
What do you think?
How about if we instead put the data on the full 0 to 100 scale?
ax = sns.barplot(
data = df_swing,
x = 'state' ,
y = 'pct_estimate' ,
hue = 'candidate_name' )
ax.set (ylim= (0 , 100 ))
We can do the same thing in Altair.
alt.Chart(df_swing).mark_bar().encode(
x= 'candidate_name' ,
y= 'pct_estimate' ,
color= 'candidate_name' ,
column = alt.Column('state:O' , spacing = 5 , header = alt.Header(labelOrient = "bottom" )),
)
Note the need for the alt column. What happens if you do not provide an alt column?
Passing the domain option to the scale of the Y axis allows us to choose the y axis range.
alt.Chart(df_swing).mark_bar(clip= True ).encode(
x= 'candidate_name' ,
y= alt.Y('pct_estimate' , scale= alt.Scale(domain= [42 ,53 ])),
color= 'candidate_name' ,
column = alt.Column('state:O' , spacing = 5 , header = alt.Header(labelOrient = "bottom" )),
)
Altering the proportions
We can even be a bit tricky and stretch out the difference.
alt.Chart(df_swing).mark_bar(clip= True ).encode(
x= 'candidate_name' ,
y= alt.Y('pct_estimate' , scale= alt.Scale(domain= [42 ,53 ])),
color= 'candidate_name' ,
column = alt.Column('state:O' , spacing = 5 , header = alt.Header(labelOrient = "bottom" )),
).properties(
width= 20 ,
height= 600
)
Default line plot
It is not just bar plot that you can have fun with. Line plots are another interesting example.
For our simple line plot, we will need the poll data for a single state.
df_texas = df_polls[
df_polls['state' ] == 'Texas'
]
df_texas_bt = df_texas[
(df_texas['candidate_name' ] == 'Donald Trump' ) |
(df_texas['candidate_name' ] == 'Joseph R. Biden Jr.' )
]
df_texas_bt.head()
7
2020
Texas
11/3/2020
Joseph R. Biden Jr.
47.46643
47.44781
63
2020
Texas
11/3/2020
Donald Trump
48.57118
48.58794
231
2020
Texas
11/2/2020
Joseph R. Biden Jr.
47.46643
47.44781
287
2020
Texas
11/2/2020
Donald Trump
48.57118
48.58794
455
2020
Texas
11/1/2020
Joseph R. Biden Jr.
47.45590
47.43400
The modeldate
column is a string (object) and not date time. So we need to change that: we will create a new datetime column called date
.
df_texas_bt['date' ] = pd.to_datetime(df_texas_bt.loc[:,'modeldate' ], format = '%m/ %d /%Y' ).copy()
df_texas_bt.dtypes
cycle int64
state object
modeldate object
candidate_name object
pct_estimate float64
pct_trend_adjusted float64
date datetime64[ns]
dtype: object
Create our line plot.
alt.Chart(df_texas_bt).mark_line().encode(
y= alt.Y('pct_estimate' , scale= alt.Scale(domain= [42 ,53 ])),
x= 'date' ,
color= 'candidate_name' )
Sometimes multiple axis are used for each line, or in a combined line and bar plot.
The example here uses a dataframe with a column for each line. Our data does not have that.
df_texas_bt
our_df = df_texas_bt[['candidate_name' , 'pct_estimate' , 'date' ]]
our_df
7
Joseph R. Biden Jr.
47.46643
2020-11-03
63
Donald Trump
48.57118
2020-11-03
231
Joseph R. Biden Jr.
47.46643
2020-11-02
287
Donald Trump
48.57118
2020-11-02
455
Joseph R. Biden Jr.
47.45590
2020-11-01
...
...
...
...
28931
Donald Trump
49.09724
2020-02-29
28963
Joseph R. Biden Jr.
45.30901
2020-02-28
28995
Donald Trump
49.09676
2020-02-28
29027
Joseph R. Biden Jr.
45.30089
2020-02-27
29058
Donald Trump
49.07925
2020-02-27
502 rows × 3 columns
Pivot table allows us to reshape our dataframe.
our_df = pd.pivot_table(our_df, index= ['date' ], columns = 'candidate_name' )
our_df.columns = our_df.columns.to_series().str .join('_' )
our_df.head()
date
2020-02-27
49.07925
45.30089
2020-02-28
49.09676
45.30901
2020-02-29
49.09724
45.30896
2020-03-01
49.09724
45.30895
2020-03-02
48.91861
45.37694
Date here is the dataframe index. We want it to be a column.
our_df['date1' ] = our_df.index
our_df.columns = ['Trump' , 'Biden' , 'date1' ]
our_df.head()
date
2020-02-27
49.07925
45.30089
2020-02-27
2020-02-28
49.09676
45.30901
2020-02-28
2020-02-29
49.09724
45.30896
2020-02-29
2020-03-01
49.09724
45.30895
2020-03-01
2020-03-02
48.91861
45.37694
2020-03-02
Creating our new plot, to fool all those people who expect Trump to win in Texas.
base = alt.Chart(our_df).encode(
alt.X('date1' )
)
line_A = base.mark_line(color= '#5276A7' ).encode(
alt.Y('Trump' , axis= alt.Axis(titleColor= '#5276A7' ), scale= alt.Scale(domain= [42 ,53 ]))
)
line_B = base.mark_line(color= '#F18727' ).encode(
alt.Y('Biden' , axis= alt.Axis(titleColor= '#F18727' ), scale= alt.Scale(domain= [35 ,53 ]))
)
alt.layer(line_A, line_B).resolve_scale(y= 'independent' )
Did you see what I did there?
Of course, mixed axis plots are rarely purely line plots. Instead they can be mixes of different axis. For these and other plotting mistakes, the economist has a nice article here . You may want to try some of these plots with this data set or the world indicators dataset from a few weeks ago.