Module reference

Function Description
bootstrap Estimate the population mean (mu) and standard error (SE) from a sample using the bootstrap resampling method.
diag_plots Produce the four R-style OLS diagnostics plots.
SubsetSelect Goes through all features and finds the ones that are best predictors of \(y\).

Bootstrap resampling

enhancesa.bootstrap.bootstrap(X, iters=1)[source]

Estimate the population mean (mu) and standard error (SE) from a sample using the bootstrap resampling method. For a quick intro, go here.

Parameters:
  • X (an array/series object) – The sample to resample from.
  • iters (int, optional) – The number of resampling iterations. Usually a large value, e.g. 1000.
Returns:

Contains the estimated population mean and standard error, computed from \(n\) resamples of the given sample X.

Return type:

DataFrame or Series object

Examples

>>> import numpy as np
>>> import enhancesa
>>> x = np.random.normal(size=100)
>>> enhancesa.bootstrap(x, iters=1000)
Estimated mean: -0.025309
Estimated SE: 0.095531
dtype: float64
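
The idea behind the estimate above can be sketched with numpy alone. This is a minimal illustration of bootstrap resampling, not the enhancesa implementation: the helper name bootstrap_mean_se and its seed argument are hypothetical, and the assumption is that each iteration draws a resample of the same size as the input, with replacement.

```python
import numpy as np


def bootstrap_mean_se(x, iters=1000, seed=None):
    """Hypothetical helper: bootstrap estimate of the mean and its standard error."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    means = np.empty(iters)
    for i in range(iters):
        # Draw a resample of the same size as x, with replacement
        resample = rng.choice(x, size=x.size, replace=True)
        means[i] = resample.mean()
    # The spread of the resampled means estimates the standard error
    return means.mean(), means.std(ddof=1)


rng = np.random.default_rng(0)
x = rng.normal(size=100)
est_mean, est_se = bootstrap_mean_se(x, iters=1000, seed=1)
print(f"Estimated mean: {est_mean:.6f}")
print(f"Estimated SE: {est_se:.6f}")
```

With a single iteration the spread of the resampled means is undefined, which is why a large iters value such as 1000 is the usual choice.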

Diagnostic plots for an OLS model

enhancesa.diag_plots.diag_plots(model, y)[source]

Produce the four R-style OLS diagnostics plots.

Parameters:
  • model (Statsmodels.api.ols object) – A fitted Statsmodels ols model.
  • y (numpy array, pandas series/dataframe) – The response/target variable of the model.
Returns:

A 2-by-2 figure containing four diagnostics plots.

Return type:

matplotlib.pyplot figure

Examples

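>>> import numpy as np
>>> import pandas as pd
>>> import enhancesa
>>> from statsmodels.formula.api import ols
>>> # Generate data with numpy
>>> x = np.random.uniform(size=100)
>>> y = 2 + 0.5*x + np.random.normal(size=100)
>>> # Put into a pandas df because of Statsmodels requirement
>>> df = pd.DataFrame(data={'x': x, 'y': y})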
>>> # Create the ols model from statsmodels.formula.api
>>> model = ols('y ~ x', data=df).fit()
>>> # Create the plots
>>> enhancesa.diag_plots(model, y)
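
The four plots are not itemized in this reference. In R, the default plot.lm panels are residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage. Assuming "R-style" refers to those, the quantities behind them can be sketched with numpy as follows. The helper name ols_diagnostics is hypothetical and is not part of enhancesa.

```python
import numpy as np


def ols_diagnostics(X, y):
    """Hypothetical helper: fitted values, residuals, leverage, standardized residuals."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, np.newaxis]
    # Design matrix with an intercept column
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    resid = y - fitted
    # Leverage is the diagonal of the hat matrix H = A @ pinv(A)
    leverage = np.einsum('ij,ji->i', A, np.linalg.pinv(A))
    # Residual standard error with n - p degrees of freedom
    n, p = A.shape
    sigma = np.sqrt(resid @ resid / (n - p))
    std_resid = resid / (sigma * np.sqrt(1.0 - leverage))
    return fitted, resid, leverage, std_resid


rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)
fitted, resid, leverage, std_resid = ols_diagnostics(x, y)
# Residuals vs fitted:   fitted against resid
# Normal Q-Q:            sorted std_resid against normal quantiles
# Scale-location:        fitted against sqrt(|std_resid|)
# Residuals vs leverage: leverage against std_resid
```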

Subset selection

class enhancesa.SubsetSelect(method='best')[source]

Bases: object

Goes through all features and finds the ones that are best predictors of a response \(y\).

Parameters:
  • method (str, default='best') – Subset selection method. Currently implemented subset selection methods are best, forward stepwise, and backward stepwise.

Methods

fit(self, X, y) Fits a subset selection method to the data.
fit(self, X, y)[source]

Fits a subset selection method to the data.

Parameters:
  • X (a multidimensional array or dataframe object) – The predictor variables.
  • y (an array or Series object) – The target or response variable.
Returns:

A dataframe with the best models selected by the given method parameter and their corresponding residual sum of squares (RSS).

Return type:

DataFrame object

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from enhancesa.subset_selection import SubsetSelect
>>> from sklearn.preprocessing import PolynomialFeatures
>>> # Generate data
>>> X = np.random.normal(size=100)
>>> y = 0.5 + 2*X - 5*(X**2) + 3*(X**3) + np.random.normal(size=100)
>>> # Expand X into polynomial features up to degree 10
>>> poly = PolynomialFeatures(degree=10, include_bias=False)
>>> X_arr = poly.fit_transform(X[:, np.newaxis])
>>> # Put them in a dataframe, because SubsetSelect currently accepts only dataframes
>>> col_names = ['Y']+['X'+ str(i) for i in range(1, 11)]
>>> df = pd.DataFrame(np.concatenate((y[:, np.newaxis], X_arr), axis=1), columns=col_names)
>>> subsets = SubsetSelect(method='best').fit(df.iloc[:,1:], df.iloc[:,0])
100%|██████████| 10/10 [00:05<00:00,  1.97it/s]
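
For intuition about what the 'best' method searches over, here is a minimal sketch of best subset selection using numpy and itertools. It is an illustration under stated assumptions, not the SubsetSelect implementation: the helper names rss and best_subsets are hypothetical, every model is assumed to include an intercept, and models are ranked by residual sum of squares (RSS) within each subset size.

```python
from itertools import combinations

import numpy as np


def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def best_subsets(X, y):
    """For each subset size k, return (column indices, RSS) of the lowest-RSS model."""
    n_features = X.shape[1]
    results = []
    for k in range(1, n_features + 1):
        # Fit every combination of k columns and keep the one with the lowest RSS
        candidates = (
            (cols, rss(X[:, list(cols)], y))
            for cols in combinations(range(n_features), k)
        )
        results.append(min(candidates, key=lambda pair: pair[1]))
    return results


rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 + 2 * x - 5 * x**2 + 3 * x**3 + rng.normal(size=100)
# Polynomial features x, x^2, x^3, x^4 (kept small so the search is quick)
X = np.column_stack([x**d for d in range(1, 5)])
results = best_subsets(X, y)
for cols, value in results:
    print(cols, round(value, 2))
```

An exhaustive search over p features fits 2^p - 1 models (1023 for the ten features in the example above), which is why the stepwise methods exist as cheaper alternatives.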