Module reference

Function	Description
bootstrap	Estimate the population mean (mu) and standard error (SE) of a sample with the bootstrap resampling method.
diag_plots	Produce the four R-style OLS diagnostics plots.
SubsetSelect	Goes through all features and finds the ones that are best predictors of \(y\).

Bootstrap resampling

enhancesa.bootstrap.bootstrap(X, iters=1)

Estimate the population mean (mu) and standard error (SE) of a sample with the bootstrap resampling method. For a quick intro, go here.

Parameters:	X (an array/series object) – The sample to resample from. iters (int, optional) – The number of resampling iterations. Usually a large value, e.g. 1000.
Returns:	Contains the estimated population mean and standard error, computed from \(n\) resamples of the given `X` sample.
Return type:	DataFrame or Series object

Examples

>>> import numpy as np
>>> import enhancesa
>>> x = np.random.normal(size=100)
>>> enhancesa.bootstrap(x, iters=1000)
Estimated mean: -0.025309
Estimated SE: 0.095531
dtype: float64
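
The resampling idea behind this function can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the helper name `bootstrap_mean_se` and its `seed` argument are hypothetical. It draws `iters` resamples with replacement, takes the mean of each, and reports the average of those means together with their standard deviation, which serves as the standard-error estimate.

```python
import numpy as np


def bootstrap_mean_se(x, iters=1000, seed=None):
    """Hypothetical helper: bootstrap estimate of the mean and its standard error."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    # One resample per row: `iters` rows of n indices drawn with replacement
    idx = rng.integers(0, n, size=(iters, n))
    resample_means = x[idx].mean(axis=1)
    # The spread of the resample means approximates the SE of the sample mean
    return resample_means.mean(), resample_means.std(ddof=1)


x = np.random.default_rng(0).normal(size=100)
mean_hat, se_hat = bootstrap_mean_se(x, iters=1000, seed=1)
print(f"Estimated mean: {mean_hat:.6f}")
print(f"Estimated SE: {se_hat:.6f}")
```

For a standard normal sample of size 100, the SE estimate should land near \(1/\sqrt{100} = 0.1\), in line with the example output above.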

Diagnostic plots for an OLS model

enhancesa.diag_plots.diag_plots(model, y)

Produce the four R-style OLS diagnostics plots.

Parameters:	model (Statsmodels.api.ols object) – A fitted Statsmodels ols model. y (numpy array, pandas series/dataframe) – The response/target variable of the model.
Returns:	A 2-by-2 figure containing four diagnostics plots.
Return type:	matplotlib.pyplot figure

Examples

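>>> import numpy as np
>>> import pandas as pd
>>> from statsmodels.formula.api import ols
>>> import enhancesa
>>> # Generate data with numpy
>>> x = np.random.uniform(size=100)
>>> y = 2 + 0.5*x + np.random.normal(size=100)
>>> # Put into a pandas df because the statsmodels formula API expects one
>>> df = pd.DataFrame(data={'x': x, 'y': y})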
>>> # Create the ols model from statsmodels.formula.api
>>> model = ols('y ~ x', data=df).fit()
>>> # Create the plots
>>> enhancesa.diag_plots(model, y)
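
For orientation, R's default `plot.lm` quartet is Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. The sketch below computes the quantities behind those four panels with plain NumPy. It assumes `diag_plots` follows that same quartet, which this page does not spell out, and every variable name is illustrative rather than part of the enhancesa API.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)

# Design matrix with an intercept column, then an ordinary least-squares fit
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

fitted = A @ beta
resid = y - fitted

# Leverage: diagonal of the hat matrix H = A @ pinv(A)
leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))

# Internally studentized (standardized) residuals
n, p = A.shape
sigma = np.sqrt(resid @ resid / (n - p))
std_resid = resid / (sigma * np.sqrt(1 - leverage))

# Quantities plotted in each panel
resid_vs_fitted = (fitted, resid)
qq_sample_quantiles = np.sort(std_resid)  # plotted against normal quantiles
scale_location = (fitted, np.sqrt(np.abs(std_resid)))
resid_vs_leverage = (leverage, std_resid)
```

Two quick sanity checks on such a fit: the leverages sum to the number of fitted coefficients (here 2), and the residuals sum to zero because the model includes an intercept.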

Subset selection

class enhancesa.SubsetSelect(method='best')

Bases: object

Goes through all features and finds the ones that are best predictors of a response \(y\).

Parameters:	method (str, default='best') – Subset selection method. Currently implemented subset selection methods are `best`, `forward` stepwise, and `backward` stepwise.
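
To illustrate how a stepwise method differs from an exhaustive search, here is a minimal sketch of forward stepwise selection. It is an assumption about the general technique, not the code behind `method='forward'`; the helpers `rss` and `forward_stepwise` are hypothetical names. Starting from an intercept-only model, each step adds the single feature that lowers the residual sum of squares the most.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)


def forward_stepwise(X, y):
    """Hypothetical sketch: greedily add the feature that lowers RSS the most."""
    remaining = list(range(X.shape[1]))
    selected, path = [], []
    while remaining:
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
        path.append((tuple(selected), rss(X, y, selected)))
    return path


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 2 and 0 carry signal in this synthetic response
y = 1.0 + 3.0 * X[:, 2] - 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

path = forward_stepwise(X, y)
for cols, value in path:
    print(cols, round(value, 2))
```

With \(p\) features, this greedy search fits \(p(p+1)/2\) models, whereas an exhaustive best-subset search fits \(2^p - 1\). Backward stepwise runs the same idea in reverse, starting from all features and dropping one at a time.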

Methods

fit(self, X, y) Fits a subset selection method to the data.

fit(self, X, y)

Fits a subset selection method to the data.

Parameters:	X (a multidimensional array or dataframe object) – The predictor variables. y (an array or Series object) – The target or response variable.
Returns:	A dataframe with the best models selected by the given `method` parameter and their corresponding residual sum of squares (RSS).
Return type:	DataFrame object

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from enhancesa.subset_selection import SubsetSelect
>>> from sklearn.preprocessing import PolynomialFeatures
>>> # Generate data
>>> X = np.random.normal(size=100)
>>> y = 0.5 + 2*X - 5*(X**2) + 3*(X**3) + np.random.normal(size=100)
>>> # Make it a model with polynomial features
>>> poly = PolynomialFeatures(degree=10, include_bias=False)
>>> X_arr = poly.fit_transform(X[:, np.newaxis])
>>> # Put them in a dataframe, because SubsetSelect currently accepts only dataframes
>>> col_names = ['Y']+['X'+ str(i) for i in range(1, 11)]
>>> df = pd.DataFrame(np.concatenate((y[:, np.newaxis], X_arr), axis=1), columns=col_names)
>>> subsets = SubsetSelect(method='best').fit(df.iloc[:,1:], df.iloc[:,0])
100%|██████████| 10/10 [00:05<00:00,  1.97it/s]
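
As a rough picture of what a best-subset search does, the sketch below enumerates every feature combination of each size and keeps the one with the lowest RSS. It is an illustration under stated assumptions, not the library's implementation: `rss` and `best_subset` are hypothetical helpers, and the polynomial degree is cut to 5 (instead of the 10 used above) so that only \(2^5 - 1 = 31\) models are fitted.

```python
import itertools

import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)


def best_subset(X, y):
    """Hypothetical sketch: for each size k, keep the column subset with the lowest RSS."""
    p = X.shape[1]
    results = {}
    for k in range(1, p + 1):
        candidates = (
            (cols, rss(X, y, cols))
            for cols in itertools.combinations(range(p), k)
        )
        results[k] = min(candidates, key=lambda item: item[1])
    return results


rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 + 2 * x - 5 * x**2 + 3 * x**3 + rng.normal(size=100)

# Polynomial features x, x^2, ..., x^5 as columns 0..4
X = np.column_stack([x**d for d in range(1, 6)])

best = best_subset(X, y)
for k, (cols, value) in best.items():
    print(k, cols, round(value, 2))
```

RSS can only go down as features are added, so picking a final model size from such a table usually calls for a criterion that penalizes complexity, such as adjusted \(R^2\), AIC, BIC, or cross-validated error.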