GLM: Poisson Regression

[1]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import patsy as pt
import pymc3 as pm
import seaborn as sns

print(f"Running on PyMC3 v{pm.__version__}")
Running on PyMC3 v3.9.3
[2]:
%config InlineBackend.figure_format = 'retina'
RANDOM_SEED = 8927
np.random.seed(RANDOM_SEED)
az.style.use("arviz-darkgrid")

This is a minimal reproducible example of Poisson regression to predict counts using dummy data.

This notebook demonstrates Poisson regression in PyMC3 in two ways: by specifying the model manually, and by using the glm module with a patsy formula that includes an interaction term. We create some dummy data, Poisson distributed according to a linear model, and try to recover the coefficients of that linear model through inference.

For more statistical detail see:

This very basic model is inspired by a project by Ian Ozsvald, which is concerned with understanding the effects of external environmental factors upon the allergic sneezing of a test subject.

Local Functions

Generate Data

This dummy dataset emulates data collected as part of a quantified-self study; the real data is more complicated than this. Ask Ian Ozsvald if you’d like to know more: https://twitter.com/ianozsvald

Assumptions:

  • The subject sneezes N times per day, recorded as nsneeze (int)

  • The subject may or may not drink alcohol during that day, recorded as alcohol (boolean)

  • The subject may or may not take an antihistamine medication during that day, recorded as the negative action nomeds (boolean)

  • I postulate (probably incorrectly) that sneezing occurs at some baseline rate, which increases if an antihistamine is not taken, and increases further after alcohol is consumed.

  • The data is aggregated per day, to yield a total count of sneezes on that day, with a boolean flag for alcohol and antihistamine usage, with the big assumption that alcohol and antihistamine usage have a direct causal relationship with nsneeze.

Create 4000 days of data: daily counts of sneezes which are Poisson distributed w.r.t alcohol consumption and antihistamine usage

[3]:
# decide poisson theta values
theta_noalcohol_meds = 1  # no alcohol, took an antihist
theta_alcohol_meds = 3  # alcohol, took an antihist
theta_noalcohol_nomeds = 6  # no alcohol, no antihist
theta_alcohol_nomeds = 36  # alcohol, no antihist

# create samples
q = 1000
df = pd.DataFrame(
    {
        "nsneeze": np.concatenate(
            (
                np.random.poisson(theta_noalcohol_meds, q),
                np.random.poisson(theta_alcohol_meds, q),
                np.random.poisson(theta_noalcohol_nomeds, q),
                np.random.poisson(theta_alcohol_nomeds, q),
            )
        ),
        "alcohol": np.concatenate(
            (
                np.repeat(False, q),
                np.repeat(True, q),
                np.repeat(False, q),
                np.repeat(True, q),
            )
        ),
        "nomeds": np.concatenate(
            (
                np.repeat(False, q),
                np.repeat(False, q),
                np.repeat(True, q),
                np.repeat(True, q),
            )
        ),
    }
)
[4]:
df.tail()
[4]:
      nsneeze  alcohol  nomeds
3995       40     True    True
3996       30     True    True
3997       37     True    True
3998       22     True    True
3999       33     True    True

View means of the various combinations (Poisson mean values)

[5]:
df.groupby(["alcohol", "nomeds"]).mean().unstack()
[5]:
         nsneeze
nomeds     False    True
alcohol
False      1.047   6.002
True       3.089  36.004

Briefly Describe Dataset

[6]:
g = sns.catplot(
    x="nsneeze",
    row="nomeds",
    col="alcohol",
    data=df,
    kind="count",
    height=4,
    aspect=1.5,
)
../../../_images/pymc-examples_examples_generalized_linear_models_GLM-poisson-regression_12_1.png

Observe:

  • This looks a lot like Poisson-distributed count data (because it is)

  • With nomeds == False and alcohol == False (top-left, a.k.a. antihistamines WERE used, alcohol was NOT drunk) the mean of the Poisson distribution of sneeze counts is low.

  • Changing alcohol == True (top-right) increases the sneeze count nsneeze slightly

  • Changing nomeds == True (lower-left) increases the sneeze count nsneeze further

  • Changing both alcohol == True and nomeds == True (lower-right) increases the sneeze count nsneeze a lot, increasing both the mean and variance.
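The last observation follows from a property of the Poisson distribution: its variance equals its mean, so a higher rate also spreads the counts out. A minimal sketch of that check, assuming a fresh NumPy generator with a hypothetical seed and the same four rates used in the data-generation cell (the `stats` dictionary is an illustrative name, not part of the notebook):

```python
import numpy as np

# Hypothetical seed and group size, chosen only for this illustration
rng = np.random.default_rng(8927)
q = 1000

# Same four Poisson rates as the data-generation cell above
thetas = {
    "no alcohol, meds": 1,
    "alcohol, meds": 3,
    "no alcohol, no meds": 6,
    "alcohol, no meds": 36,
}

# For a Poisson variable the sample mean and sample variance should both
# land near the rate theta
stats = {}
for label, theta in thetas.items():
    sample = rng.poisson(theta, q)
    stats[label] = (sample.mean(), sample.var(ddof=1))
    print(f"{label:>20}: mean={stats[label][0]:6.2f}  var={stats[label][1]:6.2f}")
```

If the sample variance were much larger than the mean in real data, that would hint at overdispersion, which a plain Poisson likelihood does not capture.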


Poisson Regression

Our model here is a very simple Poisson regression, allowing for interaction of terms:

\[\theta = \exp(\beta X)\]
\[Y_{sneeze\_count} \sim \mathrm{Poisson}(\theta)\]
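Written out for the two boolean predictors, the model has one rate per combination of flags. The expansion below is a sketch that assumes the coefficient symbols \(\beta_0, \dots, \beta_3\) (intercept, alcohol, nomeds, interaction) and indicator variables \(x_{alcohol}, x_{nomeds} \in \{0, 1\}\), none of which are named in the original text:

\[\log \theta = \beta_0 + \beta_1 x_{alcohol} + \beta_2 x_{nomeds} + \beta_3 x_{alcohol} x_{nomeds}\]
\[\theta_{noalcohol,meds} = \exp(\beta_0)\]
\[\theta_{alcohol,meds} = \exp(\beta_0 + \beta_1) = \exp(\beta_0)\exp(\beta_1)\]
\[\theta_{noalcohol,nomeds} = \exp(\beta_0 + \beta_2) = \exp(\beta_0)\exp(\beta_2)\]
\[\theta_{alcohol,nomeds} = \exp(\beta_0 + \beta_1 + \beta_2 + \beta_3) = \exp(\beta_0)\exp(\beta_1)\exp(\beta_2)\exp(\beta_3)\]

With the generating rates 1, 3, 6 and 36 chosen above, this gives \(\beta_0 = \log 1 = 0\), \(\beta_1 = \log 3\), \(\beta_2 = \log 6\) and \(\beta_3 = \log\big(36 / (1 \cdot 3 \cdot 6)\big) = \log 2\), so each coefficient acts as a multiplier on the baseline rate once exponentiated.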

Create linear model for interaction of terms

[7]:
fml = "nsneeze ~ alcohol + nomeds + alcohol:nomeds"  # full patsy formulation
[8]:
fml = "nsneeze ~ alcohol * nomeds"  # lazy, alternative patsy formulation

1. Manual method, create design matrices and manually specify model

Create Design Matrices

[9]:
(mx_en, mx_ex) = pt.dmatrices(fml, df, return_type="dataframe", NA_action="raise")
[10]:
pd.concat((mx_ex.head(3), mx_ex.tail(3)))
[10]:
      Intercept  alcohol[T.True]  nomeds[T.True]  alcohol[T.True]:nomeds[T.True]
0           1.0              0.0             0.0                             0.0
1           1.0              0.0             0.0                             0.0
2           1.0              0.0             0.0                             0.0
3997        1.0              1.0             1.0                             1.0
3998        1.0              1.0             1.0                             1.0
3999        1.0              1.0             1.0                             1.0
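To make the interaction column less opaque, the same kind of design matrix can be assembled by hand. This is a minimal sketch using plain NumPy on four made-up example days (one per combination of flags), not a replacement for the patsy call above; the interaction column is simply the product of the two indicator columns.

```python
import numpy as np

# Four example days, one per combination of the two flags
alcohol = np.array([False, True, False, True])
nomeds = np.array([False, False, True, True])

# Columns in the same order as the patsy output above:
# Intercept, alcohol, nomeds, alcohol:nomeds
X = np.column_stack(
    [
        np.ones(alcohol.size),
        alcohol.astype(float),
        nomeds.astype(float),
        (alcohol & nomeds).astype(float),  # 1.0 only when both flags are True
    ]
)
print(X)
```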

Create Model

[11]:
with pm.Model() as mdl_fish:

    # define priors, weakly informative Normal
    b0 = pm.Normal("b0_intercept", mu=0, sigma=10)
    b1 = pm.Normal("b1_alcohol[T.True]", mu=0, sigma=10)
    b2 = pm.Normal("b2_nomeds[T.True]", mu=0, sigma=10)
    b3 = pm.Normal("b3_alcohol[T.True]:nomeds[T.True]", mu=0, sigma=10)

    # define linear model and exp link function
    theta = (
        b0
        + b1 * mx_ex["alcohol[T.True]"]
        + b2 * mx_ex["nomeds[T.True]"]
        + b3 * mx_ex["alcohol[T.True]:nomeds[T.True]"]
    )

    ## Define Poisson likelihood
    y = pm.Poisson("y", mu=np.exp(theta), observed=mx_en["nsneeze"].values)
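As a rough cross-check that does not involve PyMC3, the same log-linear Poisson model can be fit by maximum likelihood. The sketch below uses iteratively reweighted least squares (IRLS), which is a different technique from the Bayesian NUTS sampling used in this notebook; the function name `fit_poisson_irls`, the seed, and the regenerated data are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(8927)  # hypothetical seed for this sketch
q = 1000

# Regenerate data with the same layout as the notebook: four blocks of q days
alcohol = np.repeat([0.0, 1.0, 0.0, 1.0], q)
nomeds = np.repeat([0.0, 0.0, 1.0, 1.0], q)
y = rng.poisson(np.repeat([1, 3, 6, 36], q)).astype(float)

# Design matrix: intercept, alcohol, nomeds, interaction
X = np.column_stack([np.ones_like(y), alcohol, nomeds, alcohol * nomeds])


def fit_poisson_irls(X, y, n_iter=25):
    """Maximum-likelihood fit of a Poisson GLM with log link via IRLS."""
    mu = y + 0.5  # common starting value that keeps log(mu) finite
    eta = np.log(mu)
    for _ in range(n_iter):
        z = eta + (y - mu) / mu  # working response
        XtW = X.T * mu  # weights equal mu for the log link
        beta = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta
        mu = np.exp(eta)
    return beta


beta_hat = fit_poisson_irls(X, y)
print(np.exp(beta_hat))  # multipliers: baseline, alcohol, nomeds, interaction
```

With weakly informative priors and 4000 observations, the posterior means from the PyMC3 model would be expected to sit close to these maximum-likelihood estimates.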

Sample Model

[12]:
with mdl_fish:
    inf_fish = pm.sample(1000, tune=1000, cores=4, return_inferencedata=True)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b3_alcohol[T.True]:nomeds[T.True], b2_nomeds[T.True], b1_alcohol[T.True], b0_intercept]
100.00% [8000/8000 00:33<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 35 seconds.
The acceptance probability does not match the target. It is 0.8860269606635056, but should be close to 0.8. Try to increase the number of tuning steps.
The acceptance probability does not match the target. It is 0.8862113253674084, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 25% for some parameters.

View Diagnostics

[13]:
az.plot_trace(inf_fish)
../../../_images/pymc-examples_examples_generalized_linear_models_GLM-poisson-regression_29_1.png

Observe:

  • The model converges quickly and the traceplots look well mixed

Transform coeffs and recover theta values

[14]:
np.exp(az.summary(inf_fish)[["mean", "hdi_3%", "hdi_97%"]])
[14]:
                                       mean    hdi_3%   hdi_97%
b0_intercept                       1.047074  0.988072  1.108491
b1_alcohol[T.True]                 2.950575  2.762124  3.148733
b2_nomeds[T.True]                  5.731630  5.387061  6.086054
b3_alcohol[T.True]:nomeds[T.True]  2.033991  1.894585  2.185840
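One subtlety of the cell above: `np.exp` is applied to the summary table, so the "mean" column is the exponential of the posterior mean of each log-scale coefficient, not the mean of the exponentiated draws. The two are close when the posterior is narrow. A minimal sketch of the difference, using made-up normal draws as a stand-in for a posterior (the location, spread and seed are assumptions, not values from the trace):

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed

# Stand-in for posterior draws of a log-scale coefficient; the location and
# spread are made up for illustration, not taken from the trace above
draws = rng.normal(loc=np.log(3.0), scale=0.03, size=4000)

exp_of_mean = np.exp(draws.mean())  # what exponentiating the summary gives
mean_of_exp = np.exp(draws).mean()  # mean of the multiplier itself

# By Jensen's inequality mean_of_exp is never smaller than exp_of_mean
print(exp_of_mean, mean_of_exp)
```

For a tight posterior like this one the gap is negligible, so the shortcut is reasonable here; for wide posteriors it may be safer to transform the draws first and summarize afterwards.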

Observe:

  • The contributions from each feature, expressed as a multiplier of the baseline sneeze count, appear to be as per the data generation:

    1. exp(b0_intercept): mean=1.05 94% HDI=[0.99, 1.11]

      Baseline count when no alcohol is drunk and meds are taken, as per the generated data:

      theta_noalcohol_meds = 1 (as set above)
      theta_noalcohol_meds = exp(b0_intercept) = 1

    2. exp(b1_alcohol[T.True]): mean=2.95 94% HDI=[2.76, 3.15]

      non-zero positive effect of adding alcohol, a ~3x multiplier of baseline sneeze count, as per the generated data:

      theta_alcohol_meds = 3 (as set above)
      theta_alcohol_meds = exp(b0_intercept + b1_alcohol) = exp(b0_intercept) * exp(b1_alcohol) = 1 * 3 = 3

    3. exp(b2_nomeds[T.True]): mean=5.73 94% HDI=[5.39, 6.09]

      larger, non-zero positive effect of adding nomeds, a ~6x multiplier of baseline sneeze count, as per the generated data:

      theta_noalcohol_nomeds = 6 (as set above)
      theta_noalcohol_nomeds = exp(b0_intercept + b2_nomeds) = exp(b0_intercept) * exp(b2_nomeds) = 1 * 6 = 6

    4. exp(b3_alcohol[T.True]:nomeds[T.True]): mean=2.03 94% HDI=[1.89, 2.19]

      small, positive interaction effect of alcohol and nomeds, a further ~2x multiplier on top of the two main effects, as per the generated data:

      theta_alcohol_nomeds = 36 (as set above)
      theta_alcohol_nomeds = exp(b0_intercept + b1_alcohol + b2_nomeds + b3_alcohol:nomeds) = exp(b0_intercept) * exp(b1_alcohol) * exp(b2_nomeds) * exp(b3_alcohol:nomeds) = 1 * 3 * 6 * 2 = 36
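The arithmetic in the list can be checked directly against the fitted values. A small sketch that multiplies the exponentiated posterior means copied from the summary table in cell 14; the variable names and the `theta_hat` dictionary are illustrative only:

```python
# Exponentiated posterior means copied from the summary table above
exp_b0 = 1.047074  # baseline
exp_b1 = 2.950575  # alcohol multiplier
exp_b2 = 5.731630  # nomeds multiplier
exp_b3 = 2.033991  # interaction multiplier

# Recombine the multipliers into one rate per combination of flags
theta_hat = {
    "noalcohol_meds": exp_b0,
    "alcohol_meds": exp_b0 * exp_b1,
    "noalcohol_nomeds": exp_b0 * exp_b2,
    "alcohol_nomeds": exp_b0 * exp_b1 * exp_b2 * exp_b3,
}
for name, value in theta_hat.items():
    print(f"theta_{name}: {value:.2f}")
```

The recombined rates should land near the generating values of 1, 3, 6 and 36 and near the group means shown in cell 5.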

2. Alternative method, using pymc3.glm

Create Model

Alternative automatic formulation using ``pymc3.glm``

[15]:
with pm.Model() as mdl_fish_alt:

    pm.glm.GLM.from_formula(fml, df, family=pm.glm.families.Poisson())

Sample Model

[16]:
with mdl_fish_alt:
    inf_fish_alt = pm.sample(2000, tune=2000, return_inferencedata=True)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu, alcohol[T.True]:nomeds[T.True], nomeds[T.True], alcohol[T.True], Intercept]
100.00% [8000/8000 01:03<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 2_000 tune and 2_000 draw iterations (4_000 + 4_000 draws total) took 64 seconds.
The number of effective samples is smaller than 25% for some parameters.

View Traces

[17]:
az.plot_trace(inf_fish_alt)
../../../_images/pymc-examples_examples_generalized_linear_models_GLM-poisson-regression_41_1.png

Transform coeffs

[18]:
# summarize the GLM model's trace (inf_fish_alt); the recorded output below came from inf_fish
np.exp(az.summary(inf_fish_alt)[["mean", "hdi_3%", "hdi_97%"]])
[18]:
                                       mean    hdi_3%   hdi_97%
b0_intercept                       1.047074  0.988072  1.108491
b1_alcohol[T.True]                 2.950575  2.762124  3.148733
b2_nomeds[T.True]                  5.731630  5.387061  6.086054
b3_alcohol[T.True]:nomeds[T.True]  2.033991  1.894585  2.185840

Observe:

  • The traceplots look well mixed

  • The transformed model coeffs look more or less the same as those generated by the manual model

  • Note also that the mu coeff is for the overall mean of the dataset and has an extreme skew, so we look at the median value rather than the mean …

[19]:
np.percentile(inf_fish_alt.posterior["mu"], [25, 50, 75])
[19]:
array([ 3.91395615,  9.56839665, 22.9396979 ])

We see this is pretty close to the overall mean of:

[20]:
df["nsneeze"].mean()
[20]:
11.5355

Example originally contributed by Jonathan Sedar 2016-05-15 github.com/jonsedar

[21]:
%load_ext watermark
%watermark -n -u -v -iv -w
patsy   0.5.1
pymc3   3.9.3
seaborn 0.10.1
numpy   1.18.5
pandas  1.0.5
arviz   0.9.0
last updated: Fri Sep 25 2020

CPython 3.8.3
IPython 7.16.1
watermark 2.0.2