Intro to Programming for Economists

Applied Microeconomics Study Group (Apr 23, 2019)

Mauro Moretto, Faith Feng

Overview:

  1. Improve your workflow with JupyterLab, some coding tips
  2. Simulate data of the single agent job search model based on Application 1: Search Model in Nonstationary Dynamic Models with Finite Dependence (Arcidiacono and Miller, 2019)
  3. Estimation by minimum distance with CCP

1. JupyterLab: Smoothing Your Workflow

  • JupyterLab is an augmentation of the very popular Jupyter Notebook, which supports over 40 programming languages such as Python, R, and Julia.
  • Like Jupyter Notebook, it lets you keep comments and notes together with your code, and display the results of your code cleanly. (For keeping notes in Markdown, see the Markdown Cheatsheet. Math formulas are written essentially as in LaTeX.)
  • So it is very good for developing code, since you can record your ideas and experiment with your code in the same place.
  • However, Jupyter Notebook files (.ipynb instead of .py) run more slowly than plain Python (.py) scripts executed from the terminal, so people usually use notebooks to experiment with ideas, keep records of code development, or present code.
  • With JupyterLab, one can easily export the code to a traditional Python file and run it from the terminal.

A touch of software engineering: Top-Down vs. Bottom-Up Design

  • Top-Down: Write the big framework first, then fill in the smaller functions that the big framework needs (see the small sketch after this list).

    advantage: no need to think about every little detail of implementation first; it helps you figure out what kind of supporting functions you need, and makes project management easier.

    disadvantage: no coding can start until there is a thorough understanding of the entire structure of the project.

  • Bottom-Up: Write the smaller elements first, then write the big framework.

    advantage: it is easier to test the smaller functions as soon as you create them

    disadvantage: it is easy to lose track of the big framework in which the small functions operate, which may lead to a non-optimized implementation of those functions.
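
For example, a top-down approach to the estimation exercise below might start from a skeleton like this (a minimal hypothetical sketch; the names are illustrative, not the functions used later), with the helpers left as stubs to be filled in afterwards:

def estimateModel(data):
    # the big framework is written first; the helpers are only stubs for now
    ccp = estimateCCP(data)
    transition = estimateTransition(data)
    return minimumDistance(ccp, transition)

def estimateCCP(data):
    raise NotImplementedError  # to be filled in later

def estimateTransition(data):
    raise NotImplementedError  # to be filled in later

def minimumDistance(ccp, transition):
    raise NotImplementedError  # to be filled in later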

Good code style can help you in the long run

It is generally good practice to:

  • name your variables and functions well (don't be lazy and use short, cryptic abbreviations that are hard to understand).

  • avoid magic numbers (unique values with unexplained meaning, or values with multiple occurrences that could be replaced with named constants), so that other people (including yourself after not working on the project for a while) can quickly understand what's going on.

  • avoid copy-pasting your code. Instead, write functions or classes with changeable parameters (see the short example after this list).

  • minimize the usage of global variables. They can be really messy in a medium or large sized project: they can easily affect what happens anywhere in your code, and at the same time they can be modified by any part of your program. See Global variables are BAD (http://wiki.c2.com/?GlobalVariablesAreBad)
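
For instance, here is a small hypothetical example (not part of the model below) that replaces a magic number with a named constant and a copy-pasted calculation with a reusable, parameterized function:

DISCOUNT_FACTOR = 0.9  # named constant instead of a bare 0.9 scattered through the code

def present_value(flows, discount=DISCOUNT_FACTOR):
    # one parameterized function instead of copy-pasting the same summation
    return sum(flow * discount**t for t, flow in enumerate(flows))

print(present_value([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71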

Why is Python not that bad?

- Community support:

Language   Registered Packages   GitHub Projects   Reddit Members   Job Postings
Python     176,369               2,182,259         350k             30,328
R          14,094                561,434           13.5k            8,563
Julia      over 2,400            16,394            6.5k             203

Data Sources:

*Registered packages: PyPI.org (Python), CRAN (R), JuliaLang (Julia);

*GitHub projects: from GHTorrent data collected in 2018;

*Reddit: members of subreddits r/Python, r/Rlanguage and r/Julia;

*Job postings: search for "<language> programming" (language = Python, R, Julia) on Indeed.com (Apr 17, 2019)

- Speed:

Python is slow compared to lower-level, compiled languages like C/C++, Java, etc., mainly for 2 reasons:

  1. It is interpreted rather than compiled: the interpreter translates and executes the code at run time instead of compiling it to machine code ahead of time. This is a common issue for high-level interpreted languages.
  2. Global Interpreter Lock (GIL): it prevents more than one thread from executing Python bytecode at a time, but it greatly simplifies the programming complexity faced by developers.

However, there are ways to boost performance in Python (such as multiprocessing, illustrated in the small sketch below). And it is important to keep in mind that, most of the time, especially for not so professionally trained programmers (such as myself), the greatest restriction on performance is the underlying algorithm of the program that one writes.
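
For example, a minimal sketch (not part of the application below) of side-stepping the GIL for a CPU-bound task with the standard library's multiprocessing module:

from multiprocessing import Pool

def cpuBoundTask(n):
    # stand-in for an expensive, pure-Python computation
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # each task runs in its own process, so the GIL does not serialize them
        results = pool.map(cpuBoundTask, [200000] * 8)
    print(len(results))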

Yes, the performance matters. But the user/developer experience also matters. Python is known for being simple to program, easy to read and maintain.

Conclusion: don't use Python to program fancy video games. But for most data analysis purposes it is not necessarily low-performing, and it has much more community support.

2. Data Simulation

A Simple Job Search Model

  • Each period $t \in \{1, \cdots, T\}$, an individual may stay home ($d_{1t} = 1$) or apply for temporary employment ($d_{2t}=1$).

  • Job applicants are successful with probability $\lambda(x_t)$, and the value of the position depends on the experience of the individual, denoted by $x_t$: $$\lambda(x_{t}) = 0.2\frac{x_t}{T-1}+0.8, ~~ x_t \in \{0, \cdots, T-1\}$$

  • If the individual works, his experience increases by one unit; otherwise it remains at the current level.

  • The preference primitives are given by the current utility from staying home, denoted by $u_{1}(x_t) $, and the utility of working $u_2(x_t)$. $$ u_{1}(x_t) = 0, \\ u_2(x_t) = \beta_0 + \beta_1 \left( \frac{x_t}{T-1} \right) $$

  • Data generating process parameters: $\beta_0 = -0.3, ~ \beta_1 = 1, ~ \delta = 0.9$

  • Error term distribution: $\epsilon_{1t}, ~\epsilon_{2t}$ follow a Gumbel distribution with location=0, scale=0.125, so that the error terms don't overwhelm the total values.

In [1]:
import numpy as np
import pandas as pd
import functools, itertools
import os, time, random
import matplotlib.pyplot as plt

Compute Future/Continuation Value

Let the conditional choice value be: $$ v_{jt}(x_t) = u_{jt}(x_t) + \delta E[V_{t+1}(x_{t+1}) | x_{t}, j]$$ Here we're computing the second part of the total conditional value, where: $$ E[V_{t+1}(x_{t+1}) | x_{t}, j] = E_{x_{t+1}, \epsilon_{1t+1},\epsilon_{2t+1}}\left[\max\left\{ u_{1t+1}(x_{t+1}) + \epsilon_{1t+1} + \delta E[V(x_{t+2})|x_{t+1}, j=1], \\ u_{2t+1}(x_{t+1}) + \epsilon_{2t+1} + \delta E[V(x_{t+2})|x_{t+1}, j=2]\right\} |~x_t, j \right] $$

The transition of state variable $x_t$: $$ F(x_{t+1} = x_t |x_t, j=1) = 1 \\ F(x_{t+1} = x_t |x_t, j=2) = 1-\lambda_{t} \\ F(x_{t+1} = x_t+1 |x_t, j=2) = \lambda_{t} $$

So with current work experience $x_t$: if the current choice is $d_t=1$, the work experience next period $x_{t+1}$ is $x_t$ for sure, and the expected continuation value at $t$ is $V_{t+1}(x_t)$. But if $d_t=2$, the work experience next period can be $x_t$ or $x_t+1$, and the expected continuation value at $t$ is $\lambda_t V_{t+1}(x_t+1)+(1-\lambda_t) V_{t+1}(x_t)$.

Here we compute the value function using the Monte Carlo method; that is, we approximate the expectation by averaging over many simulated draws.

\begin{gather*} \hat{V}_t(x_t) = \frac{1}{N} \sum_{i=1}^N \max \left\lbrace \epsilon_H^i + u_H(x_t) + \delta \hat{V}_{t+1}(x_t), \\ \epsilon_W^i + (1-\lambda(x_t)) \left[u_H(x_t) + \delta \hat{V}_{t+1}(x_t) \right] + \lambda(x_t) \left[u_W(x_t) + \delta \hat{V}_{t+1}(x_t+1) \right] \right\rbrace \end{gather*}

2.1 Simulation using Recursive Function

A few more coding tricks:

  • Python allows recursive functions.
  • Recursive functions are sometimes easier to write and to read than traditional iteration (for/while loops) when the problem is recursive in nature (such as the classical Fibonacci function), but they are usually much slower in Python.
  • However, one can improve the performance of recursion greatly in Python using "memoization".
  • Memoization means that we tell the program to memorize "intermediate" results, keeping a "memo" of previously calculated values to avoid unnecessary recalculation (see the small Fibonacci example below).
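
As a quick illustration (separate from the search model), compare plain recursion with a memoized version of the classical Fibonacci function; the built-in functools.lru_cache used here does the same job as the hand-rolled memoize decorator defined below:

import functools

def fib(n):
    # plain recursion: recomputes the same subproblems exponentially many times
    return n if n < 2 else fib(n - 1) + fib(n - 2)

@functools.lru_cache(maxsize=None)
def fib_memo(n):
    # memoized recursion: each n is computed once and then looked up
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

print(fib(25), fib_memo(25))  # same answer, but the memoized version makes far fewer calls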

Compute Expected Maximum of Two Random Variables

Here each choice value is a special case of a constant plus a TIEV (type I extreme value) random variable. So we can generate a big list of TIEV random draws for each choice, add the corresponding constant, take the pairwise maximum, and then average.

In [3]:
def expectedMax(*args, mu=0, beta=0.125, size=2000):
    # Use Monte Carlo method to compute Emax[val1+epsilon1, val2+epsilon2]
    total_values = [args[i]+np.random.gumbel(mu, beta, size) for i in range(len(args))]
    return np.average([max(x) for x in zip(*total_values)])
In [4]:
## success rate of job application
success = lambda work_experience, T=10: (work_experience/(T-1))*0.2+0.8

# decorator function: function that takes another function as argument
def memoize(function):
    memo = {}
    def helper(x):
        if x not in memo:            
            memo[x] = function(x)
        return memo[x]
    return helper

def saveData(res, T, N):
    individuals = list(itertools.chain(*[[i]*T for i in range(N)]))
    time_periods = list(itertools.chain(*[list(range(T)) for i in range(N)]))
    work_experiences = list(itertools.chain(*[item[0] for item in res]))
    choices = list(itertools.chain(*[item[1] for item in res]))
    data = pd.DataFrame({'individual': individuals, 'age': time_periods, 
                         'work_experience': work_experiences, 'choice': choices})
    return data
In [5]:
# simulate data given the parameter
def dataSimulationRecursion(parameter, successRates, N=1000, T=10):
    theta0, theta1, discount = parameter
    utilityWork = [theta0+theta1*x/T for x in range(0,T)]
    @memoize
    def continuationValue(arg_tuple):
        nonlocal discount, successRates
        # nonlocal continuation_values
        t, T, work_experience, current_choice = arg_tuple
        work_experience = int(work_experience)
        if t>=T-1:
            # continuation_values[t][work_experience] = 0
            return 0
        else:
            success_rate = successRates[int(work_experience)]
            state_tuple_home = (t+1, T, work_experience, 1)
            value_home = continuationValue(state_tuple_home)
            state_tuple_work = (t+1, T, work_experience, 2)
            # expected flow payoff from applying is success_rate*u_W (with
            # probability 1-success_rate the applicant stays home and u_H = 0),
            # consistent with the Emax formula above and with generateChoices below
            value_work = (success_rate*utilityWork[work_experience]+
                             continuationValue(state_tuple_work))
            if current_choice==1:
                # now home -> state variable next period stays the same
                continuation_value = discount*expectedMax(value_home, value_work)
            else:
                # now work -> state variable next period may change
                # if job application succeeds
                state_tuple_home_success = (t+1, T, work_experience+1, 1)
                value_home_success = continuationValue(state_tuple_home_success)
                state_tuple_work_success = (t+1, T, work_experience+1, 2)
                value_work_success = (successRates[work_experience+1]*
                             utilityWork[work_experience+1]+
                             continuationValue(state_tuple_work_success))
                # total continuation value
                continuation_value = discount*(
                    success_rate*expectedMax(value_home_success, value_work_success)+
                    (1-success_rate)*expectedMax(value_home, value_work))
            return continuation_value
        
    def generateChoices(T, successRates, discount, mu=0, beta=0.125):
        # default mu and beta -> type I extreme value
        work_experience = 0
        work_experiences = [work_experience]
        choices = []
        actual_shock_home = np.random.gumbel(mu, beta, T)
        actual_shock_work = np.random.gumbel(mu, beta, T)
        t = 0
        while t<=T-1:
            success_rate = successRates[work_experience]
            job_search = np.random.binomial(n=1, p=success_rate)
            state_tuple_home = (t, T, work_experience, 1)
            state_tuple_work = (t, T, work_experience, 2)
            value_home = actual_shock_home[t]+continuationValue(state_tuple_home)
            value_work = (actual_shock_work[t]+success_rate*utilityWork[work_experience]+
                          continuationValue(state_tuple_work))
            choices += [1+(value_home<=value_work)]
            if t<T-1:
                work_experience += int(job_search*(value_home<=value_work))
                work_experiences += [work_experience]
            t += 1
        return work_experiences, choices
    res = [generateChoices(T, successRates, discount) for i in range(N)]
    data = saveData(res, T, N)
    return data
In [6]:
# simulate data
start = time.time()
T, N_individuals = 10, 1000
successRates = [success(x) for x in range(T)]
parameter = (-0.3, 1, 0.9)
data_recursion = dataSimulationRecursion(parameter=parameter, 
        successRates=successRates)
end = time.time()
print("It takes {} seconds to simulate {} individuals living for \
        {} periods".format(np.round(end-start,3),N_individuals,T))
data_recursion.head()
It takes 0.293 seconds to simulate 1000 individuals living for         10 periods
Out[6]:
age choice individual work_experience
0 0 2 0 0
1 1 2 0 1
2 2 2 0 2
3 3 2 0 3
4 4 2 0 4

2.2 Simulation Using Iteration

Another approach to computing the continuation value is iteration (for loops). Here we apply backward induction: compute the continuation values from the last period back to the first, and then simulate choices forward.

In [7]:
def dataSimulationIteration(success_rates, parameters, N=2000, T=10):
    theta0 = parameters[0]
    theta1 = parameters[1]
    discount = parameters[2]
    N_sim = 2000
    utilityWork = [theta0+theta1*x/T for x in range(0,T)]
    utilityHome = [0]*T
    # allocate space to store the continuation value for each (age, experience) pair
    continuation_value = np.zeros((T+1,T+1))
    sigma = 0.125
    epsilon_work = np.random.gumbel(0,sigma,size = N_sim)
    epsilon_home = np.random.gumbel(0,sigma,size = N_sim)  

    for age in range(T-1, -1, -1): 
        for exp in range(age, -1, -1):              
            success_rate = success_rates[exp]
            value_hw = np.zeros((N_sim,2))
            value_hw[:,0] = (utilityHome[exp] + epsilon_home + 
                             discount*continuation_value[age+1,exp])
            value_hw[:,1] = epsilon_work + success_rate*(utilityWork[exp] + 
                            discount*continuation_value[age+1,exp+1]) + (
                1-success_rate)*(utilityHome[exp] + discount*continuation_value[age+1,exp])
            continuation_value[age,exp] = np.mean(np.max(value_hw,axis=1))

    def individualSimulation(i):  
        nonlocal T, success_rates, continuation_value, sigma

        epsilon_work_i = np.random.gumbel(0,sigma,size = T)
        epsilon_home_i = np.random.gumbel(0,sigma,size = T)
        success_shock_sim = np.random.random(size=T)   
        
        exp_i = np.zeros((T,),dtype = int)
        choice_i = np.zeros((T,),dtype = int)
        
        for age in range(T):            
            success_rate = success_rates[exp_i[age]]
            value_home = (utilityHome[exp_i[age]] + epsilon_home_i[age] + 
                          discount*continuation_value[age+1,exp_i[age]])
            value_work = (epsilon_work_i[age] + success_rate*(utilityWork[exp_i[age]] + 
                          discount*continuation_value[age+1,exp_i[age]+1]) + 
                          (1-success_rate)*(utilityHome[exp_i[age]] + 
                        discount*continuation_value[age+1,exp_i[age]]))
            choice_i[age] = 1 + int(value_home <= value_work)
            
            if (age < T-1):
                    exp_i[age+1] = exp_i[age] + (choice_i[age] == 2) *(
                        success_shock_sim[age] <= success_rate) 
            
        matrix_sim_i = np.zeros((T,4),dtype = int)
        matrix_sim_i[:,0] = i*np.ones((T,))
        matrix_sim_i[:,1] = choice_i
        matrix_sim_i[:,2] = exp_i
        matrix_sim_i[:,3] = range(0,T)                    
        return matrix_sim_i
    
    matrix_sim = np.zeros((N*T,4))           
    for i in range(0,N):
        
        matrix_sim[i*T:(i+1)*T,:] = individualSimulation(i)
        
    df_sim = pd.DataFrame(matrix_sim, 
        columns=["individual", "choice", "work_experience", "age"],dtype = int)
            
    return df_sim
In [8]:
# simulate data
start = time.time()
T, N_individuals = 10, 1000
successRates = [success(x) for x in range(T)]
param0 = (-.3,1,0.9)
data_iteration = dataSimulationIteration(successRates,parameters=param0,
        N=N_individuals,T=T)
end = time.time()
print("It takes {} seconds to simulate {} individuals living for \
        {} periods".format(np.round(end-start,3), N_individuals, T))
data_iteration.head()
It takes 0.343 seconds to simulate 1000 individuals living for         10 periods
Out[8]:
individual choice work_experience age
0 0 2 0 0
1 0 2 1 1
2 0 2 2 2
3 0 1 3 3
4 0 2 3 4

3. Estimation Using Data

General idea: minimizing the distance between simulated choice probabilities and empirical choice probabilities. (Related lecture notes: http://www.its.caltech.edu/~mshum/gradio/single-dynamics2.pdf)

Procedure:

  1. Estimate conditional choice probability $\hat{P}(d|X)$
  2. Estimate transition probability (job search success rate)
  3. Simulate value functions, generate model choice probability $\tilde{P}(d|X, \theta)$
  4. Estimate by minimum distance $$ \min_{{\theta}}\left|\left|\mathbf{\tilde{P}}(d|X, \theta)-\mathbf{\hat{P}}(d|X)\right|\right|^2_2$$ (In the code below, each $(x,t)$ cell of the CCP vector is weighted by its share of observations.)

Note: To see computational implementations of other methods (MPEC, NFP, etc.), I recommend Professor Param Vir Singh's class, PhD Seminar on Estimating Dynamic and Structural Models. The course is taught in MATLAB, but you're free to do the assignments in whatever language you like.

Choice Probability

Now we want to estimate the conditional choice probabilities. Given that the problem is non-stationary, we need to compute: $$\hat{p}(d = 2|x,t)= \frac{\sum_{i=1}^N \mathbf{1} \left( x_{it} = x, d_{it} = 2 \right)}{\sum_{i=1}^N \mathbf{1} \left( x_{it} = x \right)}$$ It may be the case that for certain combinations of $(x,t)$ we do not have any observations.

In our simulation, all agents are born in period $t=0$, and have no experience, so that $$x_{i0}=0 \ \forall i=1,\cdots,N$$

This also implies that in the data we will not observe $x_{it}>t$ for any agent $i$. If we define $$\mathbf{\hat{P}}=[\hat{p}_{ij}] = [\hat{p}(d =2 | t=i,x=j)]$$ then we should expect $\hat{\mathbf{P}}$ to be defined only in its lower-triangular portion.

Transition Probability

Now we want to estimate the probability of job acceptance conditional on the level of experience accumulated over the working life: $$\hat{\lambda} (x) = \frac{\sum_{i=1}^N \sum_{t=0}^{T-1} \mathbf{1} \left( x_{it}=x, x_{it+1} = x + 1, d_{it}=2 \right)}{\sum_{i=1}^N \sum_{t=0}^{T-1} \mathbf{1} \left( x_{it}=x, d_{it}=2 \right)} $$

In [9]:
def ccp_fun(data, T=10):    
    def ccp_state_fun(arg):
        age , exp =  arg
        mask_den = (data['age'] == age) & (data['work_experience'] == exp)
        mask_num = (mask_den) & (data['choice'] == 2) 
        # weight ccp by number of observations
        W_state = len(data[mask_den])
        # (x,t) cells with no observations get a placeholder value; their weight is zero
        ccp_state = len(data[mask_num])/W_state if W_state>0 else 999
        return ccp_state, W_state
    output = [ccp_state_fun(item) for item in filter(lambda x: x[0]>=x[1], 
        itertools.product(range(T),range(T)))]
    ccp = np.array([item[0] for item in output])
    W = np.array([item[1] for item in output])
    W = W/np.sum(W) 
    return ccp, W

# estimation transition probability/success rate
def p_acpt_fun(data, T=10):
    # next-period experience; fill the NaN created by shift(-1) before casting to int
    data['future_work_experience'] = (data['work_experience'].shift(-1)
                                      .fillna(999).values.astype(int))
    # the last observed age has no next period, so flag it with a sentinel value
    mask = data.age == T-1
    data.loc[mask,'future_work_experience'] = 999
    p_acpt = np.zeros((T-1,))
    for i in range(T-1):
        num = len(data[(data['age'] < T-1) & (data['work_experience'] == i) & 
            (data['future_work_experience'] == i + 1) & (data['choice'] == 2)])
        den = len(data[(data['age'] < T-1) & (data['work_experience'] == i) & 
            (data['choice'] == 2)]) 
        if den > 0:
            p_acpt[i] = num/den
        else:
            p_acpt[i] = np.nan
    return p_acpt

3.1 Estimation Procedure with Recursion

In [10]:
# estimate ccp and weights
actual_ccp, actual_W = ccp_fun(data_recursion)
# estimate success rates
T = 10
success_rates = np.zeros((T,)) 
success_rates[0:T-1] = p_acpt_fun(data_recursion)
# replace nan to 0
success_rates = [x if np.isnan(x)==False else 0 for x in success_rates]
# plot estimated success rates v.s. theoretical success rates
%matplotlib inline
plt.figure()
plt.plot(range(0,T-1),success_rates[0:T-1],'r',label='Empirical')
plt.plot(range(0,T-1),[0.8 + 0.2/(T-1) * item for item in range(0,T-1)],'b',label = 'True')
plt.xlabel(r'$x$')
plt.ylabel(r'$\lambda(x)$')
plt.legend()
plt.show()
In [11]:
# minimizing distance between predicted CCP and actual CCP
def predictCCPRecursion(success_rates, parameters):
    data = dataSimulationRecursion(parameters, success_rates)
    ccp, W = ccp_fun(data)
    return ccp, W

def estimatorRecursion(parameters, actual_ccp, success_rates, actual_W):
    predicted_ccp, W = predictCCPRecursion(success_rates, parameters)
    distance = np.sum(np.multiply((predicted_ccp-actual_ccp)**2,W))
    return distance
In [12]:
def estimationRecursion(data_recursion, parameter_combos):
    actual_ccp, actual_W = ccp_fun(data_recursion)
    # estimate success rates
    T = 10
    success_rates = np.zeros((T,)) 
    success_rates[0:T-1] = p_acpt_fun(data_recursion)
    # replace nan to 0
    success_rates = [x if np.isnan(x)==False else 0 for x in success_rates]
    # grid search for minimum distance estimation
    estimatorNewRecursion = functools.partial(estimatorRecursion, actual_ccp=actual_ccp, 
                            success_rates=success_rates, actual_W=actual_W)
    obj = [estimatorNewRecursion(item) for item in parameter_combos]
    # find parameters that gives the minimum distance
    search_grid_sol = list(parameter_combos)[np.argmin(obj)]
    return search_grid_sol
In [12]:
# grid search for minimum distance estimation
# true parameters = [-0.3, 1, 0.9]
theta0_vec = np.linspace(-1, 0, 10)
theta1_vec = np.linspace(0, 2, 10)
discount_vec = np.linspace(0.8, 1, 5)

parameter_combos = list(itertools.product(theta0_vec, theta1_vec, discount_vec))
estimatorNewRecursion = functools.partial(estimatorRecursion, actual_ccp=actual_ccp, 
                        success_rates=success_rates, actual_W=actual_W)

start = time.time()
obj = [estimatorNewRecursion(item) for item in parameter_combos]
end = time.time()

# find parameters that gives the minimum distance
search_grid_sol = parameter_combos[np.argmin(obj)]
print("The solution from the search-grid algorithm is :{}.\n \
    It took a total of {} seconds to compute".format(search_grid_sol, end-start))
The solution from the search-grid algorithm is :(-0.33333333333333337, 1.1111111111111112, 0.9).
     It took a total of 194.134694814682 seconds to compute
In [13]:
# traditional optimization
from scipy.optimize import minimize as smin
print(smin(fun=estimatorNewRecursion, x0=(0, 0.1, 0.5)))
      fun: 0.0531374235256095
 hess_inv: array([[ 0.67784799, -0.46740164, -0.08976955],
       [-0.46740164,  0.35196037, -0.10851184],
       [-0.08976955, -0.10851184,  0.99067569]])
      jac: array([ 41151.34947714, 130968.32617273,  62918.00429488])
  message: 'Desired error not necessarily achieved due to precision loss.'
     nfev: 116
      nit: 1
     njev: 21
   status: 2
  success: False
        x: array([1.69538183e-05, 1.00019415e-01, 5.00000983e-01])

3.1.a Estimating standard errors

Bootstrapping

We can use the bootstrap to estimate the standard errors of the coefficients. Specifically, we take $B$ resamples of individuals (with replacement) from the data and estimate the coefficients on each resample. The standard error is then the standard deviation of the $B$ estimates.

$$ SE(\hat{\theta}) = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} (\hat{\theta}_b - \bar{\theta})^2} $$

In [26]:
from sklearn.utils import resample
def getResample(data):
    individuals = list(set(data.individual.values))
    groups = data.groupby('individual')
    sub_index = resample(individuals, replace=True, n_samples=len(individuals))
    subsample = pd.concat([groups.get_group(i) for i in sub_index])
    return subsample

# calculate standard errors
def standardErrors(coefficients):
    avg = np.average(coefficients)
    se = np.sqrt(np.sum((coefficients-avg)**2)/(len(coefficients)-1))
    return se
In [ ]:
theta0_vec = np.linspace(-1, 0, 7)
theta1_vec = np.linspace(0, 2, 7)
discount_vec = np.linspace(0.8, 1, 3)
n_bootstrap = 10
bootstrap_args = (data_recursion, theta0_vec, theta1_vec, discount_vec, n_bootstrap)

## without multiprocessing
def bootstrapEstimation(args):
    data, theta0_vec, theta1_vec, discount_vec, n_bootstrap = args
    coeffs = []
    for i in range(n_bootstrap):
        subsample = getResample(data)
        print(len(set(subsample.individual.values)))
        parameter_combos = list(itertools.product(theta0_vec, theta1_vec, discount_vec))
        coeffs += [estimationRecursion(subsample, parameter_combos)]
    return coeffs

start = time.time()
bootstrap_coeffs = bootstrapEstimation(bootstrap_args)
print('total computation time is %.2f' %(time.time()-start))
# get mean and standard errors
num_parameters = 3
bootstrap_coeffs = [np.array([x[i] for x in bootstrap_coeffs]
                            ) for i in range(num_parameters)]
print(bootstrap_coeffs)
bootstrap_means = [np.mean(x) for x in bootstrap_coeffs]
bootstrap_standard_errors = [standardErrors(x) for x in bootstrap_coeffs]
print('standard errors are %.4f for theta0, %.4f for theta1 and \
        %.4f for discount factor' %tuple(bootstrap_standard_errors))

Parallelism in Python

Introduction: "Multiprocessing vs. threading in Python"

For more on multiprocessing: the function multiBootstrap(args, nprocs) below is based on "Python - parallelizing CPU-bound tasks with multiprocessing".
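
As a rough illustration of the difference (a hypothetical sketch, separate from the bootstrap code below): threads share one interpreter and are held back by the GIL on CPU-bound work, while processes run truly in parallel:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

def cpuTask(n):
    # CPU-bound stand-in: pure-Python arithmetic, so the GIL binds for threads
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [200000] * 8
    for Executor in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.time()
        with Executor(max_workers=4) as ex:
            list(ex.map(cpuTask, jobs))
        print(Executor.__name__, round(time.time() - start, 3), "seconds")

On a multi-core machine the process pool should finish roughly in proportion to the number of workers, while the thread pool should take about as long as running the tasks sequentially.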

In [ ]:
## with multiprocessing
def estimationRecursion(args):
    data_recursion, parameter_combos = args
    actual_ccp, actual_W = ccp_fun(data_recursion)
    # estimate success rates
    T = 10
    success_rates = np.zeros((T,)) 
    success_rates[0:T-1] = p_acpt_fun(data_recursion)
    # replace nan to 0
    success_rates = [x if np.isnan(x)==False else 0 for x in success_rates]
    # grid search for minimum distance estimation
    estimatorNewRecursion = functools.partial(estimatorRecursion, actual_ccp=actual_ccp, 
                            success_rates=success_rates, actual_W=actual_W)
    obj = [estimatorNewRecursion(item) for item in parameter_combos]
    # find parameters that gives the minimum distance
    search_grid_sol = list(parameter_combos)[np.argmin(obj)]
    return search_grid_sol

def multiBootstrap(args, nprocs):
    def worker(args, out_q):
        outlist = []
        for arg in args:
            # collect the estimates from each bootstrap replication
            outlist.append(estimationRecursion(arg))
        out_q.put(outlist)
    out_q = Queue()
    procs = []
    chunksize = -(-len(args)//nprocs)  # ceiling division so no replication is dropped
    for i in range(nprocs):
        chunk = args[i*chunksize:(i+1)*chunksize]
        p = Process(target=worker,
                args=(chunk, out_q))
        procs.append(p)
        p.start()
    # Collect all results into a single result list
    resultlist = []
    for i in range(nprocs):
        resultlist += [out_q.get()]
    # Wait for all worker processes to finish
    for p in procs:
        p.join()
    return resultlist
In [ ]:
n_bootstrap = 2**7
## with multiprocessing
from multiprocessing import cpu_count, Process, Queue
parameter_combos = tuple(list(itertools.product(theta0_vec, theta1_vec, discount_vec)))
grid_size = len(parameter_combos)
N_individuals = 1000
args = [(getResample(data_recursion), parameter_combos) for i in range(n_bootstrap)]
start = time.time()
result = multiBootstrap(args, cpu_count())
print(result)
print("total time used for bootstrapping %d times of grid size %d is %.2f seconds" 
      %(n_bootstrap, grid_size, time.time()-start))
# should be n_bootstrap items in bootstrap_coeffs
bootstrap_coeffs = list(itertools.chain(*result))
num_parameters = 3
bootstrap_coeffs = [np.array([x[i] for x in bootstrap_coeffs]) for i in range(num_parameters)]
bootstrap_means = tuple([np.mean(x) for x in bootstrap_coeffs])
print(bootstrap_means)
bootstrap_standard_errors = [standardErrors(x) for x in bootstrap_coeffs]
print(bootstrap_standard_errors)

3.2 Estimation Procedure with Iteration

In [14]:
# estimate ccp and weights
actual_ccp, actual_W = ccp_fun(data_iteration)
# estimate success rates
T = 10
success_rates = np.zeros((T,)) 
success_rates[0:T-1] = p_acpt_fun(data_iteration)
# replace nan to 0
success_rates = [x if np.isnan(x)==False else 0 for x in success_rates]
In [15]:
# minimizing distance between predicted CCP and actual CCP
def predictCCP(success_rates, parameters):
    data = dataSimulationIteration(success_rates, parameters)
    ccp, W = ccp_fun(data)
    return ccp, W

def estimatorIteration(actual_ccp,success_rates,parameters):
    predicted_ccp, W = predictCCP(success_rates, parameters)
    distance = np.sum(np.multiply((predicted_ccp-actual_ccp)**2,W))
    return distance
In [16]:
# grid search for minimum distance estimation
theta0_vec = np.linspace(-1,0,15)
theta1_vec = np.linspace(0,2,15)
discount_vec = np.linspace(0.85,0.95,3)

start = time.time()
obj = list(map(functools.partial(estimatorIteration,actual_ccp,success_rates),
               itertools.product(theta0_vec,theta1_vec,discount_vec)))
search_grid_sol = list(itertools.product(theta0_vec,theta1_vec,discount_vec))[np.argmin(obj)]
end = time.time()

print("The solution from the search-grid algorithm is :{}.\n \
        It took a total of {} seconds to compute".format(search_grid_sol,np.round(end-start),2))
The solution from the search-grid algorithm is :(-0.28571428571428581, 1.0, 0.89999999999999991).
         It took a total of 235.0 seconds to compute
In [30]:
from mpl_toolkits import mplot3d

plt.figure()
ax = plt.axes(projection='3d')
xline = [item[0] for item in itertools.product(
    theta0_vec,theta1_vec,discount_vec) if item[2]==discount_vec[0]]
yline = [item[1] for item in itertools.product(
    theta0_vec,theta1_vec,discount_vec) if item[2]==discount_vec[0]]
zline = [item[0] for item in zip(obj,itertools.product(
    theta0_vec,theta1_vec,discount_vec)) if item[1][2] == discount_vec[0]]

ax.scatter3D(xline,yline,zline, c='r')
Out[30]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x1cd2251e320>

Bayesian Optimization

A smarter way to search for the minimizer: instead of evaluating the objective on a fixed grid, Bayesian optimization uses the previous evaluations to decide which parameter values to try next.

In [31]:
from hyperopt import fmin, hp, tpe, Trials

tpe_trials = Trials()
tpe_algo = tpe.suggest
space = [hp.normal('theta0', 0, 2), hp.normal('theta1', 0, 2), hp.uniform('discount', 0.1, 1)]
estimatorNew = functools.partial(estimatorIteration,actual_ccp,success_rates)
best = fmin(fn = estimatorNew, space = space, algo=tpe_algo, trials=tpe_trials, max_evals = 2000)
print("The solution from the Bayesian Optimization algorithm is :{}".format(best))
100%|████| 2000/2000 [12:06<00:00,  2.73it/s, best loss: 0.0029346613200324934]
The solution from the Bayesian Optimization algorithm is :{'discount': 0.9073350094784516, 'theta0': -0.3106940042587562, 'theta1': 1.0256471491391026}

Fully Parametric Bootstrapping

In [ ]:
estimates_bootstrap = []
for i in range(0,100):
    data_bootstrap = dataSimulationIteration(success_rates,parameters=(best['theta0'], best['theta1'], best['discount']))
    ccp_bootstrap, W_bootstrap = ccp_fun(data_bootstrap)
    success_rates_bootstrap = np.zeros((T,))
    success_rates_bootstrap[0:T-1] = p_acpt_fun(data_bootstrap)
    # replace nan with 0, as before
    success_rates_bootstrap = [x if np.isnan(x)==False else 0 for x in success_rates_bootstrap]
    estimatorNew_bootstrap = functools.partial(estimatorIteration,ccp_bootstrap,success_rates_bootstrap)
    space_bootstrap = [hp.normal('theta0',best['theta0'], 0.5), hp.normal('theta1', best['theta1'], 0.5), hp.uniform('discount', .8, 1)]
    sol_bootstrap = fmin(fn = estimatorNew_bootstrap, space = space_bootstrap, algo=tpe.suggest, max_evals = 500)
    estimates_bootstrap.append(sol_bootstrap)
In [26]:
discount_bootstrap = [ item['discount'] for item in estimates_bootstrap]
discount_std = np.std(discount_bootstrap)
theta0_bootstrap = [ item['theta0'] for item in estimates_bootstrap]
theta0_std = np.std(theta0_bootstrap)
theta1_bootstrap = [ item['theta1'] for item in estimates_bootstrap]
theta1_std = np.std(theta1_bootstrap)
print("The standard deviation of the discount factor is: {}".format(np.round(discount_std,3)))
print("The standard deviation of the intercept is: {}".format(np.round(theta0_std,3)))
print("The standard deviation of the slope is: {}".format(np.round(theta1_std,3)))
The standard deviation of the discount factor is: 0.011
The standard deviation of the intercept is: 0.016
The standard deviation of the slope is: 0.048