27 Lab: Axes manipulation

One way to create potentially misleading visualisations is by manipulating the axes of a plot. Here we illustrate this using one of the FiveThirtyEight data sets, which are available here.

27.1 Data wrangling

We are going to use polls from the 2020 USA presidential election. As before, we load and examine the data.

import pandas as pd 
import seaborn as sns
import altair as alt 

df_polls = pd.read_csv('data/presidential_poll_averages_2020.csv')
df_polls.head()

	cycle	state	modeldate	candidate_name	pct_estimate	pct_trend_adjusted
0	2020	Wyoming	11/3/2020	Joseph R. Biden Jr.	30.81486	30.82599
1	2020	Wisconsin	11/3/2020	Joseph R. Biden Jr.	52.12642	52.09584
2	2020	West Virginia	11/3/2020	Joseph R. Biden Jr.	33.49125	33.51517
3	2020	Washington	11/3/2020	Joseph R. Biden Jr.	59.34201	59.39408
4	2020	Virginia	11/3/2020	Joseph R. Biden Jr.	53.74120	53.72101

For our analysis, we are going to pick estimates from 11/3/2020 for the swing states of Florida, Texas, Arizona, Michigan, Minnesota and Pennsylvania.

df_nov = df_polls[
    (df_polls.modeldate == '11/3/2020')
]

df_nov = df_nov[
    (df_nov.candidate_name == 'Joseph R. Biden Jr.') |
    (df_nov.candidate_name == 'Donald Trump')
]

df_swing = df_nov[
    (df_nov['state'] == 'Florida') |
    (df_nov['state'] == 'Texas' ) |
    (df_nov['state'] == 'Arizona' ) |
    (df_nov['state'] == 'Michigan' ) |
    (df_nov['state'] == 'Minnesota' ) |
    (df_nov['state'] == 'Pennsylvania' ) 
]

df_swing

	cycle	state	modeldate	candidate_name	pct_estimate	pct_trend_adjusted
7	2020	Texas	11/3/2020	Joseph R. Biden Jr.	47.46643	47.44781
12	2020	Pennsylvania	11/3/2020	Joseph R. Biden Jr.	50.22000	50.20422
30	2020	Minnesota	11/3/2020	Joseph R. Biden Jr.	51.86992	51.84517
31	2020	Michigan	11/3/2020	Joseph R. Biden Jr.	51.17806	51.15482
46	2020	Florida	11/3/2020	Joseph R. Biden Jr.	49.09162	49.08035
53	2020	Arizona	11/3/2020	Joseph R. Biden Jr.	48.72237	48.70539
63	2020	Texas	11/3/2020	Donald Trump	48.57118	48.58794
68	2020	Pennsylvania	11/3/2020	Donald Trump	45.57216	45.55034
86	2020	Minnesota	11/3/2020	Donald Trump	42.63638	42.66826
87	2020	Michigan	11/3/2020	Donald Trump	43.20577	43.23326
102	2020	Florida	11/3/2020	Donald Trump	46.68101	46.61909
109	2020	Arizona	11/3/2020	Donald Trump	46.11074	46.10181
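The chained comparisons above can also be written more compactly with Series.isin. The sketch below is an alternative, not the approach used in this lab. It uses a tiny hand-built dataframe (df_small, a hypothetical name, with rows copied from the tables above) so that it runs without the CSV file.

```python
import pandas as pd

# Tiny stand-in for df_polls; the four rows are copied from the tables above.
df_small = pd.DataFrame({
    'state': ['Wyoming', 'Texas', 'Texas', 'Florida'],
    'modeldate': ['11/3/2020'] * 4,
    'candidate_name': ['Joseph R. Biden Jr.', 'Joseph R. Biden Jr.',
                       'Donald Trump', 'Donald Trump'],
    'pct_estimate': [30.81486, 47.46643, 48.57118, 46.68101],
})

swing_states = ['Florida', 'Texas', 'Arizona', 'Michigan', 'Minnesota', 'Pennsylvania']
candidates = ['Joseph R. Biden Jr.', 'Donald Trump']

# Combine the three conditions into a single boolean mask.
mask = (
    (df_small['modeldate'] == '11/3/2020')
    & df_small['state'].isin(swing_states)
    & df_small['candidate_name'].isin(candidates)
)
df_small_swing = df_small[mask]
print(df_small_swing)
```

Keeping the states and candidates in lists makes it easy to change the selection later without editing the filter itself.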

27.2 Default barplot

We can look at the relative performance of the candidates within each state using a nested bar plot.

ax = sns.barplot(
    data = df_swing, 
    x = 'state', 
    y = 'pct_estimate', 
    hue = 'candidate_name')

27.3 Altering the axes

Truncating the y axis so that it no longer starts at zero exaggerates the difference in height between the bars. Some might say that is misleading.

ax = sns.barplot(
    data = df_swing, 
    x = 'state', 
    y = 'pct_estimate', 
    hue = 'candidate_name')

ax.set(ylim=(41, 52))

What do you think?
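One way to put a number on the effect is to compare the ratio of the drawn bar heights with the ratio of the underlying values. The helper below (drawn_height_ratio, a hypothetical name) is a rough sketch that assumes each bar is drawn upwards from the bottom of the y axis. The Michigan figures are taken from the df_swing table above.

```python
def drawn_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of two bars when the y axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Michigan estimates from the df_swing table above
biden, trump = 51.17806, 43.20577

true_ratio = drawn_height_ratio(biden, trump)                     # axis starts at 0
truncated_ratio = drawn_height_ratio(biden, trump, baseline=41)   # axis starts at 41

print(f"axis from 0:  Biden's bar is {true_ratio:.2f}x the height of Trump's")
print(f"axis from 41: Biden's bar is {truncated_ratio:.2f}x the height of Trump's")
```

Under this assumption the upper limit of the axis does not affect the ratio; only the baseline does.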

How about if we instead put the data on the full 0 to 100 scale?

ax = sns.barplot(
    data = df_swing, 
    x = 'state', 
    y = 'pct_estimate', 
    hue = 'candidate_name')

ax.set(ylim=(0, 100))

We can do the same thing in Altair.

alt.Chart(df_swing).mark_bar().encode(
    x='candidate_name',
    y='pct_estimate',
    color='candidate_name',
    column = alt.Column('state:O', spacing = 5, header = alt.Header(labelOrient = "bottom")),
)

Note the need for alt.Column, which splits the chart into one panel per state. What happens if you do not provide it?

Passing the domain option to the scale of the y axis lets us choose the y-axis range. Setting clip=True on the mark stops the bars from being drawn outside that range.

alt.Chart(df_swing).mark_bar(clip=True).encode(
    x='candidate_name',
    y=alt.Y('pct_estimate', scale=alt.Scale(domain=[42,53])),
    color='candidate_name',
    column = alt.Column('state:O', spacing = 5, header = alt.Header(labelOrient = "bottom")),
)

27.4 Altering the proportions

We can even be a bit tricky and stretch out the difference by making each chart tall and narrow.

alt.Chart(df_swing).mark_bar(clip=True).encode(
    x='candidate_name',
    y=alt.Y('pct_estimate', scale=alt.Scale(domain=[42,53])),
    color='candidate_name',
    column = alt.Column('state:O', spacing = 5, header = alt.Header(labelOrient = "bottom")),
).properties(
    width=20,
    height=600
)
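The stretch comes from how many pixels each percentage point is given. As a rough sketch (the helper name pixels_per_point is made up for illustration, and it assumes a linear scale whose domain fills the chart height exactly):

```python
def pixels_per_point(height_px, domain):
    """Vertical pixels per unit of data on a linear scale spanning `domain`."""
    low, high = domain
    return height_px / (high - low)

# Michigan gap between the two candidates, from the df_swing table above
gap = 51.17806 - 43.20577

tall_truncated = pixels_per_point(600, (42, 53))   # the chart above
tall_full = pixels_per_point(600, (0, 100))        # same height, full 0 to 100 axis

print(f"truncated axis: gap drawn as {gap * tall_truncated:.0f} px")
print(f"full axis:      gap drawn as {gap * tall_full:.0f} px")
```

The same gap of about eight percentage points takes up most of the tall, truncated chart but only a small part of a chart that shows the full scale.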

27.5 Default line plot

It is not just bar plots that you can have fun with. Line plots are another interesting example.

For our simple line plot, we will need the poll data for a single state.

df_texas = df_polls[
    df_polls['state'] == 'Texas'
]

df_texas_bt = df_texas[
    (df_texas['candidate_name'] == 'Donald Trump') |
    (df_texas['candidate_name'] == 'Joseph R. Biden Jr.')
]

df_texas_bt.head()

	cycle	state	modeldate	candidate_name	pct_estimate	pct_trend_adjusted
7	2020	Texas	11/3/2020	Joseph R. Biden Jr.	47.46643	47.44781
63	2020	Texas	11/3/2020	Donald Trump	48.57118	48.58794
231	2020	Texas	11/2/2020	Joseph R. Biden Jr.	47.46643	47.44781
287	2020	Texas	11/2/2020	Donald Trump	48.57118	48.58794
455	2020	Texas	11/1/2020	Joseph R. Biden Jr.	47.45590	47.43400

The modeldate column is a string (object) and not a datetime. So we need to change that: we will create a new datetime column called date.

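# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_texas_bt = df_texas_bt.copy()
df_texas_bt['date'] = pd.to_datetime(df_texas_bt['modeldate'], format='%m/%d/%Y')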

df_texas_bt.dtypes

cycle                          int64
state                         object
modeldate                     object
candidate_name                object
pct_estimate                 float64
pct_trend_adjusted           float64
date                  datetime64[ns]
dtype: object
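To see why the conversion matters, compare how text and parsed dates sort. This is a small illustration with three example date strings written in the same month/day/year style as modeldate; the variable names are placeholders.

```python
import pandas as pd

# Three example dates in the same month/day/year format as the modeldate column
dates_as_text = pd.Series(['11/3/2020', '2/27/2020', '10/15/2020'])

# Sorting the raw strings compares them character by character
print(sorted(dates_as_text))

# Parsing with an explicit format gives real timestamps that sort chronologically
parsed = pd.to_datetime(dates_as_text, format='%m/%d/%Y')
print(parsed.sort_values().dt.strftime('%Y-%m-%d').tolist())
```

As text, February sorts after November because the character 2 comes after 1. A plot with a string-valued x axis would have the same ordering problem.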

Create our line plot.

alt.Chart(df_texas_bt).mark_line().encode(
    y=alt.Y('pct_estimate', scale=alt.Scale(domain=[42,53])),
    x='date',
    color='candidate_name')

Sometimes a separate axis is used for each line, or for the lines and the bars in a combined line and bar plot.

The example here uses a dataframe with a column for each line. Our data does not have that: it is in long format, with one row per date and candidate, so we need to reshape it.

# Keep only the columns needed for the plot
our_df = df_texas_bt[['candidate_name', 'pct_estimate', 'date']]
our_df

	candidate_name	pct_estimate	date
7	Joseph R. Biden Jr.	47.46643	2020-11-03
63	Donald Trump	48.57118	2020-11-03
231	Joseph R. Biden Jr.	47.46643	2020-11-02
287	Donald Trump	48.57118	2020-11-02
455	Joseph R. Biden Jr.	47.45590	2020-11-01
...	...	...	...
28931	Donald Trump	49.09724	2020-02-29
28963	Joseph R. Biden Jr.	45.30901	2020-02-28
28995	Donald Trump	49.09676	2020-02-28
29027	Joseph R. Biden Jr.	45.30089	2020-02-27
29058	Donald Trump	49.07925	2020-02-27

502 rows × 3 columns

A pivot table allows us to reshape our dataframe from long to wide format, with one row per date and one column per candidate.

our_df = pd.pivot_table(our_df, index=['date'], columns = 'candidate_name')
our_df.columns = our_df.columns.to_series().str.join('_')
our_df.head()

	pct_estimate_Donald Trump	pct_estimate_Joseph R. Biden Jr.
date
2020-02-27	49.07925	45.30089
2020-02-28	49.09676	45.30901
2020-02-29	49.09724	45.30896
2020-03-01	49.09724	45.30895
2020-03-02	48.91861	45.37694
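The long-to-wide reshape can be seen on a four-row example. The sketch below uses DataFrame.pivot rather than pd.pivot_table, which is a swapped-in alternative and not the call used above; long_df and wide_df are placeholder names and the values are copied from the output above.

```python
import pandas as pd

# Long format: one row per date and candidate (values copied from the output above)
long_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-02-27', '2020-02-27', '2020-02-28', '2020-02-28']),
    'candidate_name': ['Joseph R. Biden Jr.', 'Donald Trump',
                       'Joseph R. Biden Jr.', 'Donald Trump'],
    'pct_estimate': [45.30089, 49.07925, 45.30901, 49.09676],
})

# Wide format: one row per date, one column per candidate
wide_df = long_df.pivot(index='date', columns='candidate_name', values='pct_estimate')
print(wide_df)
```

Passing values= gives flat column names, so no joining of column levels is needed. Note that pivot raises an error if a date and candidate pair appears more than once, whereas pivot_table averages the duplicates by default.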

The date is currently the dataframe index. Altair only works with columns, so we want it to be a column.

our_df['date1'] = our_df.index
our_df.columns = ['Trump', 'Biden', 'date1']
our_df.head()

	Trump	Biden	date1
date
2020-02-27	49.07925	45.30089	2020-02-27
2020-02-28	49.09676	45.30901	2020-02-28
2020-02-29	49.09724	45.30896	2020-02-29
2020-03-01	49.09724	45.30895	2020-03-01
2020-03-02	48.91861	45.37694	2020-03-02
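Assigning a list to our_df.columns relies on the columns being in the expected order. A slightly more defensive variant, sketched here on a two-row stand-in for our_df (wide_df and tidy_df are placeholder names), renames by column name and uses reset_index to move the date out of the index. The resulting column is called date rather than date1.

```python
import pandas as pd

# Small wide-format frame shaped like our_df after the pivot (values from the output above)
wide_df = pd.DataFrame(
    {'pct_estimate_Donald Trump': [49.07925, 49.09676],
     'pct_estimate_Joseph R. Biden Jr.': [45.30089, 45.30901]},
    index=pd.to_datetime(['2020-02-27', '2020-02-28']).rename('date'),
)

# Rename by name rather than by position, then move the index into a column
tidy_df = (
    wide_df
    .rename(columns={'pct_estimate_Donald Trump': 'Trump',
                     'pct_estimate_Joseph R. Biden Jr.': 'Biden'})
    .reset_index()
)
print(tidy_df)
```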

Now we create our new plot, to fool all those people who expect Trump to win in Texas.

base = alt.Chart(our_df).encode(
    alt.X('date1')
)

line_A = base.mark_line(color='#5276A7').encode(
    alt.Y('Trump', axis=alt.Axis(titleColor='#5276A7'), scale=alt.Scale(domain=[42,53]))
)

line_B = base.mark_line(color='#F18727').encode(
    alt.Y('Biden', axis=alt.Axis(titleColor='#F18727'), scale=alt.Scale(domain=[35,53]))
)

alt.layer(line_A, line_B).resolve_scale(y='independent')

Did you see what I did there?
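The trick can be checked with a little arithmetic. On a linear axis, a value is drawn at a height proportional to its position within the axis domain, so two different domains can reverse the visual order of two values. The helper axis_position below is a hypothetical name for this calculation; the estimates are the Texas figures for 11/3/2020 from the table above.

```python
def axis_position(value, domain):
    """Fraction of the axis height (0 = bottom, 1 = top) at which `value` is drawn."""
    low, high = domain
    return (value - low) / (high - low)

# Texas estimates for 11/3/2020, from the table above
trump, biden = 48.57118, 47.46643

trump_pos = axis_position(trump, (42, 53))   # domain used for the Trump line
biden_pos = axis_position(biden, (35, 53))   # domain used for the Biden line

print(f"Trump: {trump:.2f}% drawn at {trump_pos:.0%} of the chart height")
print(f"Biden: {biden:.2f}% drawn at {biden_pos:.0%} of the chart height")
```

Biden's estimate is the lower of the two, yet with these domains his line is drawn higher up the chart.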

Of course, mixed-axis plots are rarely purely line plots. More often they combine different plot types, such as lines and bars, each with its own axis. For these and other plotting mistakes, The Economist has a nice article here. You may want to try some of these plots with this data set or the world indicators data set from a few weeks ago.