Module reference
Function | Description |
---|---|
bootstrap | Estimate the population mean (\(\mu\)) and standard error (SE) of a sample with the bootstrap resampling method. |
diag_plots | Produce the four R-style OLS diagnostics plots. |
SubsetSelect | Goes through all features and finds the ones that are best predictors of \(y\). |
Bootstrap resampling
enhancesa.bootstrap.bootstrap(X, iters=1)
Estimate the population mean (\(\mu\)) and standard error (SE) of a sample with the bootstrap resampling method. For a quick intro, go here.
Parameters: - X (an array/series object) – The sample to resample from.
- iters (int, optional) – The number of resampling iterations. Usually a large value, e.g. 1000.
Returns: Contains the estimated population mean and standard deviation of \(n\) samples from the given x sample.
Return type: DataFrame or Series object
Examples
>>> import numpy as np
>>> import enhancesa
>>> x = np.random.normal(size=100)
>>> enhancesa.bootstrap(x, iters=1000)
Estimated mean:   -0.025309
Estimated SE:      0.095531
dtype: float64
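To make the resampling idea concrete, here is a minimal NumPy sketch of a bootstrap estimate of the mean and its standard error. It is an illustration under stated assumptions, not the enhancesa implementation: the helper name `bootstrap_mean_se` and its `seed` argument are hypothetical.

```python
import numpy as np

def bootstrap_mean_se(x, iters=1000, seed=None):
    """Hypothetical helper: bootstrap estimate of the mean and its SE."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    # Draw `iters` resamples of size n with replacement, one resample per row
    idx = rng.integers(0, n, size=(iters, n))
    resample_means = x[idx].mean(axis=1)
    # The average of the resample means estimates the population mean;
    # their spread estimates the standard error of the mean
    return resample_means.mean(), resample_means.std(ddof=1)

x = np.random.default_rng(0).normal(size=100)
est_mean, est_se = bootstrap_mean_se(x, iters=1000, seed=1)
print(f"Estimated mean: {est_mean:.6f}")
print(f"Estimated SE:   {est_se:.6f}")
```

With only `iters=1` (the default in the signature above) the spread of the resample means cannot be estimated, which is why a large value such as 1000 is the usual choice.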
Diagnostic plots for an OLS model
enhancesa.diag_plots.diag_plots(model, y)
Produce the four R-style OLS diagnostics plots.
Parameters: - model (Statsmodels.api.ols object) – A fitted Statsmodels ols model.
- y (numpy array, pandas series/dataframe) – The response/target variable of the model.
Returns: A 2-by-2 figure containing four diagnostics plots.
Return type: matplotlib.pyplot figure
Examples
>>> import numpy as np
>>> import pandas as pd
>>> import enhancesa
>>> from statsmodels.formula.api import ols
>>> # Generate data with numpy
>>> x = np.random.uniform(size=100)
>>> y = 2 + 0.5*x + np.random.normal(size=100)
>>> # Put into a pandas df because of Statsmodels requirement
>>> df = pd.DataFrame(data={'x': x, 'y': y})
>>> # Create the ols model from statsmodels.formula.api
>>> model = ols('y ~ x', data=df).fit()
>>> # Create the plots
>>> enhancesa.diag_plots(model, y)
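For readers unfamiliar with R's four diagnostic panels (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage), the sketch below computes the quantities behind each panel with plain NumPy. It assumes diag_plots follows R's `plot.lm` layout and does not reproduce the library's code. The `panels` dictionary is a hypothetical container used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta
resid = y - fitted

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
leverage = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Internally studentized (standardized) residuals
n, p = X.shape
sigma2 = resid @ resid / (n - p)
std_resid = resid / np.sqrt(sigma2 * (1 - leverage))

# Quantities drawn in each of the four panels (assumed R-style layout)
panels = {
    "Residuals vs Fitted": (fitted, resid),
    # Sorted values are plotted against theoretical normal quantiles
    "Normal Q-Q": np.sort(std_resid),
    "Scale-Location": (fitted, np.sqrt(np.abs(std_resid))),
    "Residuals vs Leverage": (leverage, std_resid),
}
```

Each entry can be passed to a matplotlib scatter call on one axis of a 2-by-2 figure to approximate the layout described above.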
Subset selection
class enhancesa.SubsetSelect(method='best')
Bases: object
Goes through all features and finds the ones that are best predictors of a response \(y\).
Parameters: method (str, default='best') – Subset selection method. Currently implemented subset selection methods are best, forward stepwise, and backward stepwise.
Methods
fit(self, X, y) | Fits a subset selection method to the data. |
fit(self, X, y)
Fits a subset selection method to the data.
Parameters: - X (a multidimensional array or dataframe object) – The predictor variables.
- y (an array or Series object) – The target or response variable.
Returns: A dataframe with the best models selected by the given method parameter and their corresponding residual sum of squares (RSS).
Return type: DataFrame object
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from enhancesa.subset_selection import SubsetSelect
>>> from sklearn.preprocessing import PolynomialFeatures
>>> # Generate data
>>> X = np.random.normal(size=100)
>>> y = 0.5 + 2*X - 5*(X**2) + 3*(X**3) + np.random.normal(size=100)
>>> # Build polynomial features up to degree 10
>>> poly = PolynomialFeatures(degree=10, include_bias=False)
>>> X_arr = poly.fit_transform(X[:, np.newaxis])
>>> # Put them in a dataframe, because SubsetSelect currently accepts only dataframes
>>> col_names = ['Y'] + ['X' + str(i) for i in range(1, 11)]
>>> df = pd.DataFrame(np.concatenate((y[:, np.newaxis], X_arr), axis=1), columns=col_names)
>>> subsets = SubsetSelect(method='best').fit(df.iloc[:, 1:], df.iloc[:, 0])
100%|██████████| 10/10 [00:05<00:00, 1.97it/s]
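As a rough illustration of what best subset selection does, the sketch below tries every combination of predictors of each size and keeps the one with the lowest RSS. It is a minimal NumPy version under stated assumptions, not the SubsetSelect implementation. The helpers `rss` and `best_subsets` are hypothetical names, and only four polynomial features are used to keep the search small.

```python
from itertools import combinations

import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit that includes an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def best_subsets(X, y):
    """Hypothetical helper: for each size k, the column indices with lowest RSS."""
    n_features = X.shape[1]
    best = {}
    for k in range(1, n_features + 1):
        # Exhaustively score every subset of k columns
        candidates = (
            (cols, rss(X[:, list(cols)], y))
            for cols in combinations(range(n_features), k)
        )
        best[k] = min(candidates, key=lambda pair: pair[1])
    return best

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 + 2 * x - 5 * x**2 + 3 * x**3 + rng.normal(size=100)
X = np.column_stack([x**d for d in range(1, 5)])  # columns X1..X4

best = best_subsets(X, y)
for k, (cols, score) in best.items():
    print(k, [f"X{c + 1}" for c in cols], round(score, 2))
```

The exhaustive search fits \(2^p - 1\) models for \(p\) predictors. That growth is why greedy alternatives such as forward and backward stepwise selection are commonly offered alongside it.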