Open PR to remove lightgbm stubs from microsoft/python-type-stubs at next release #5863
😱 I've never seen this! Thanks very much for bringing it to our attention. We've been methodically working on adding type hints here in this repo for 2+ years (#3756). 😭 @shiyu1994 did you know about that project? Is it a part of VS Code? @bschnurr since you're the author of microsoft/python-type-stubs#257, could you help us understand the purpose of that PR and what you'd like to see happen here in LightGBM?
I see now there are return type hints: LightGBM/python-package/lightgbm/basic.py, line 2755 (at commit d0dfcee).
I added stubs to address an issue with slow return type inference for some functions.
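As a hedged illustration of what such a return type hint buys (the function name and body are hypothetical, not LightGBM's actual code): an explicit annotation lets a checker read the declared type instead of inferring it from the implementation.

```python
import numpy as np

# Hypothetical sketch: the explicit "np.ndarray" return annotation spares
# the type checker from inferring the return type from the body.
def predict_proba(data: "np.ndarray") -> "np.ndarray":
    # toy body standing in for real prediction logic
    return np.clip(data, 0.0, 1.0)
```

A checker such as pyright then reports `np.ndarray` at call sites immediately, without analyzing the function body.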
Source code example:
import gc

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
import scipy.special  # makes sp.special.exp10 (used below) available
import seaborn as sns
#import xgboost as xgb
from scipy.sparse import vstack, csr_matrix, save_npz, load_npz
from sklearn.compose import make_column_transformer, TransformedTargetRegressor
from sklearn.datasets import fetch_openml
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score, median_absolute_error
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.model_selection import cross_validate, RepeatedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
gc.enable()
dtypes = {
'MachineIdentifier': 'category',
'ProductName': 'category',
'EngineVersion': 'category',
'AppVersion': 'category',
'AvSigVersion': 'category',
'IsBeta': 'int8',
'RtpStateBitfield': 'float16',
'IsSxsPassiveMode': 'int8',
'DefaultBrowsersIdentifier': 'float16',
'AVProductStatesIdentifier': 'float32',
'AVProductsInstalled': 'float16',
'AVProductsEnabled': 'float16',
'HasTpm': 'int8',
'CountryIdentifier': 'int16',
'CityIdentifier': 'float32',
'OrganizationIdentifier': 'float16',
'GeoNameIdentifier': 'float16',
'LocaleEnglishNameIdentifier': 'int8',
'Platform': 'category',
'Processor': 'category',
'OsVer': 'category',
'OsBuild': 'int16',
'OsSuite': 'int16',
'OsPlatformSubRelease': 'category',
'OsBuildLab': 'category',
'SkuEdition': 'category',
'IsProtected': 'float16',
'AutoSampleOptIn': 'int8',
'PuaMode': 'category',
'SMode': 'float16',
'IeVerIdentifier': 'float16',
'SmartScreen': 'category',
'Firewall': 'float16',
'UacLuaenable': 'float32',
'Census_MDC2FormFactor': 'category',
'Census_DeviceFamily': 'category',
'Census_OEMNameIdentifier': 'float16',
'Census_OEMModelIdentifier': 'float32',
'Census_ProcessorCoreCount': 'float16',
'Census_ProcessorManufacturerIdentifier': 'float16',
'Census_ProcessorModelIdentifier': 'float16',
'Census_ProcessorClass': 'category',
'Census_PrimaryDiskTotalCapacity': 'float32',
'Census_PrimaryDiskTypeName': 'category',
'Census_SystemVolumeTotalCapacity': 'float32',
'Census_HasOpticalDiskDrive': 'int8',
'Census_TotalPhysicalRAM': 'float32',
'Census_ChassisTypeName': 'category',
'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
'Census_PowerPlatformRoleName': 'category',
'Census_InternalBatteryType': 'category',
'Census_InternalBatteryNumberOfCharges': 'float32',
'Census_OSVersion': 'category',
'Census_OSArchitecture': 'category',
'Census_OSBranch': 'category',
'Census_OSBuildNumber': 'int16',
'Census_OSBuildRevision': 'int32',
'Census_OSEdition': 'category',
'Census_OSSkuName': 'category',
'Census_OSInstallTypeName': 'category',
'Census_OSInstallLanguageIdentifier': 'float16',
'Census_OSUILocaleIdentifier': 'int16',
'Census_OSWUAutoUpdateOptionsName': 'category',
'Census_IsPortableOperatingSystem': 'int8',
'Census_GenuineStateName': 'category',
'Census_ActivationChannel': 'category',
'Census_IsFlightingInternal': 'float16',
'Census_IsFlightsDisabled': 'float16',
'Census_FlightRing': 'category',
'Census_ThresholdOptIn': 'float16',
'Census_FirmwareManufacturerIdentifier': 'float16',
'Census_FirmwareVersionIdentifier': 'float32',
'Census_IsSecureBootEnabled': 'int8',
'Census_IsWIMBootEnabled': 'float16',
'Census_IsVirtualDevice': 'float16',
'Census_IsTouchEnabled': 'int8',
'Census_IsPenCapable': 'int8',
'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
'Wdft_IsGamer': 'float16',
'Wdft_RegionIdentifier': 'float16',
'HasDetections': 'int8'
}
# scikit-learn examples
survey = fetch_openml(data_id=534, as_frame=True)
X = survey.data[survey.feature_names]
X.describe(include="all")
X.head()
y = survey.target.values.ravel()
survey.target.head()
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
survey.data.info()
categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR',
'MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE']
preprocessor = make_column_transformer(
(OneHotEncoder(drop='if_binary'), categorical_columns),
remainder='passthrough'
)
model = make_pipeline(
preprocessor,
TransformedTargetRegressor(
regressor=Ridge(alpha=1e-10),
func=np.log10,
inverse_func=sp.special.exp10
)
)
_ = model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = median_absolute_error(y_train, y_pred)
string_score = f'MAE on training set: {mae:.2f} $/hour'
y_pred = model.predict(X_test)
mae = median_absolute_error(y_test, y_pred)
string_score += f'\nMAE on testing set: {mae:.2f} $/hour'
fig, ax = plt.subplots(figsize=(5, 5))
plt.scatter(y_test, y_pred)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red")
plt.text(3, 20, string_score)
plt.title('Ridge model, small regularization')
plt.ylabel('Model predictions')
plt.xlabel('Truths')
plt.xlim([0, 27])
_ = plt.ylim([0, 27])
feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
feature_names = np.concatenate(
[feature_names, numerical_columns])
coefs = pd.DataFrame(
model.named_steps['transformedtargetregressor'].regressor_.coef_,
columns=['Coefficients'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model, small regularization')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)
X_train_preprocessed = pd.DataFrame(
model.named_steps['columntransformer'].transform(X_train),
columns=feature_names
)
X_train_preprocessed.std(axis=0).plot(kind='barh', figsize=(9, 7))
plt.title('Features std. dev.')
plt.subplots_adjust(left=.3)
coefs = pd.DataFrame(
model.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std(axis=0),
columns=['Coefficient importance'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model, small regularization')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)
cv_model = cross_validate(
model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
return_estimator=True, n_jobs=-1
)
coefs = pd.DataFrame(
[est.named_steps['transformedtargetregressor'].regressor_.coef_ *
X_train_preprocessed.std(axis=0)
for est in cv_model['estimator']],
columns=feature_names
)
plt.figure(figsize=(9, 7))
sns.swarmplot(data=coefs, orient='h', color='k', alpha=0.5)
sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5)
plt.axvline(x=0, color='.5')
plt.xlabel('Coefficient importance')
plt.title('Coefficient importance and its variability')
plt.subplots_adjust(left=.3)
# end of scikit learn example
print('Download Train and Test Data.\n')
train = pd.read_csv('../input/train.csv', dtype=dtypes, low_memory=True)
train['MachineIdentifier'] = train.index.astype('uint32')
test = pd.read_csv('../input/test.csv', dtype=dtypes, low_memory=True)
test['MachineIdentifier'] = test.index.astype('uint32')
gc.collect()
print('Transform all features to category.\n')
for usecol in train.columns.tolist()[1:-1]:
    train[usecol] = train[usecol].astype('str')
    test[usecol] = test[usecol].astype('str')

    #Fit LabelEncoder
    le = LabelEncoder().fit(
            np.unique(train[usecol].unique().tolist()+
                      test[usecol].unique().tolist()))

    #At the end 0 will be used for dropped values
    train[usecol] = le.transform(train[usecol])+1
    test[usecol] = le.transform(test[usecol])+1

    agg_tr = (train
              .groupby([usecol])
              .aggregate({'MachineIdentifier':'count'})
              .reset_index()
              .rename({'MachineIdentifier':'Train'}, axis=1))
    agg_te = (test
              .groupby([usecol])
              .aggregate({'MachineIdentifier':'count'})
              .reset_index()
              .rename({'MachineIdentifier':'Test'}, axis=1))

    agg = pd.merge(agg_tr, agg_te, on=usecol, how='outer').replace(np.nan, 0)
    #Select values with more than 1000 observations
    agg = agg[(agg['Train'] > 1000)].reset_index(drop=True)
    agg['Total'] = agg['Train'] + agg['Test']
    #Drop unbalanced values
    agg = agg[(agg['Train'] / agg['Total'] > 0.2) & (agg['Train'] / agg['Total'] < 0.8)]
    agg[usecol+'Copy'] = agg[usecol]

    train[usecol] = (pd.merge(train[[usecol]],
                              agg[[usecol, usecol+'Copy']],
                              on=usecol, how='left')[usecol+'Copy']
                     .replace(np.nan, 0).astype('int').astype('category'))
    test[usecol] = (pd.merge(test[[usecol]],
                             agg[[usecol, usecol+'Copy']],
                             on=usecol, how='left')[usecol+'Copy']
                    .replace(np.nan, 0).astype('int').astype('category'))

del le, agg_tr, agg_te, agg, usecol
gc.collect()
y_train = np.array(train['HasDetections'])
train_ids = train.index
test_ids = test.index
del train['HasDetections'], train['MachineIdentifier'], test['MachineIdentifier']
gc.collect()
print("If you don't want to use a Sparse Matrix, choose Kernel Version 2 for a simpler solution.\n")
print('--------------------------------------------------------------------------------------------------------')
print('Transform Data to Sparse Matrix.')
print('A Sparse Matrix can be used to fit a lot of models, e.g. XGBoost, LightGBM, Random Forest, K-Means, etc.')
print('To concatenate Sparse Matrices by column use hstack()')
print('Read more about Sparse Matrix https://docs.scipy.org/doc/scipy/reference/sparse.html')
print('Good Luck!')
print('--------------------------------------------------------------------------------------------------------')
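To make the printed notes concrete, here is a tiny self-contained sketch (separate from the pipeline below) of column-wise vs. row-wise concatenation of sparse matrices:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, vstack

# hstack() concatenates sparse matrices by column, vstack() by row,
# without converting them to dense arrays.
a = csr_matrix(np.eye(2, dtype='uint8'))        # 2 x 2
b = csr_matrix(np.ones((2, 1), dtype='uint8'))  # 2 x 1
print(hstack([a, b]).shape)  # (2, 3)
print(vstack([a, a]).shape)  # (4, 2)
```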
#Fit OneHotEncoder
ohe = OneHotEncoder(categories='auto', sparse=True, dtype='uint8').fit(train)
#Transform data using small groups to reduce memory usage
m = 100000
train = vstack([ohe.transform(train[i*m:(i+1)*m]) for i in range(train.shape[0] // m + 1)])
test = vstack([ohe.transform(test[i*m:(i+1)*m]) for i in range(test.shape[0] // m + 1)])
save_npz('train.npz', train, compressed=True)
save_npz('test.npz', test, compressed=True)
del ohe, train, test
gc.collect()
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf.get_n_splits(train_ids, y_train)
lgb_test_result = np.zeros(test_ids.shape[0])
lgb_train_result = np.zeros(train_ids.shape[0])
#xgb_test_result = np.zeros(test_ids.shape[0])
#xgb_train_result = np.zeros(train_ids.shape[0])
counter = 0
print('\nLightGBM\n')
for train_index, test_index in skf.split(train_ids, y_train):
    print('Fold {}\n'.format(counter + 1))

    train = load_npz('train.npz')
    X_fit = vstack([train[train_index[i*m:(i+1)*m]] for i in range(train_index.shape[0] // m + 1)])
    X_val = vstack([train[test_index[i*m:(i+1)*m]] for i in range(test_index.shape[0] // m + 1)])
    X_fit, X_val = csr_matrix(X_fit, dtype='float32'), csr_matrix(X_val, dtype='float32')
    y_fit, y_val = y_train[train_index], y_train[test_index]

    del train
    gc.collect()

    lgb_model = lgb.LGBMClassifier(max_depth=-1,
                                   n_estimators=30000,
                                   learning_rate=0.05,
                                   num_leaves=2**12-1,
                                   colsample_bytree=0.28,
                                   objective='binary',
                                   n_jobs=-1)

    #xgb_model = xgb.XGBClassifier(max_depth=6,
    #                              n_estimators=30000,
    #                              colsample_bytree=0.2,
    #                              learning_rate=0.1,
    #                              objective='binary:logistic',
    #                              n_jobs=-1)

    lgb_model.fit(X_fit, y_fit, eval_metric='auc',
                  eval_set=[(X_val, y_val)],
                  verbose=100, early_stopping_rounds=100)

    #xgb_model.fit(X_fit, y_fit, eval_metric='auc',
    #              eval_set=[(X_val, y_val)],
    #              verbose=1000, early_stopping_rounds=300)

    lgb_train_result[test_index] += lgb_model.predict_proba(X_val)[:,1]
    #xgb_train_result[test_index] += xgb_model.predict_proba(X_val)[:,1]

    del X_fit, X_val, y_fit, y_val, train_index, test_index
    gc.collect()

    test = load_npz('test.npz')
    test = csr_matrix(test, dtype='float32')

    lgb_test_result += lgb_model.predict_proba(test)[:,1]
    #xgb_test_result += xgb_model.predict_proba(test)[:,1]
    counter += 1

    del test
    gc.collect()

    #Stop fitting to prevent time limit error
    #if counter == 3 : break
print('\nLightGBM VAL AUC Score: {}'.format(roc_auc_score(y_train, lgb_train_result)))
#print('\nXGBoost VAL AUC Score: {}'.format(roc_auc_score(y_train, xgb_train_result)))
submission = pd.read_csv('../input/sample_submission.csv')
submission['HasDetections'] = lgb_test_result / counter
submission.to_csv('lgb_submission.csv', index=False)
#submission['HasDetections'] = xgb_test_result / counter
#submission.to_csv('xgb_submission.csv', index=False)
#submission['HasDetections'] = 0.5 * lgb_test_result / counter + 0.5 * xgb_test_result / counter
#submission.to_csv('lgb_xgb_submission.csv', index=False)
print('\nDone.')
import pytz
from datetime import datetime

# Build a timezone-aware datetime for America/Los_Angeles
pactz = pytz.timezone('America/Los_Angeles')
loc_dt = pactz.localize(datetime(2019, 10, 27, 6, 0, 0))
utcnow = pytz.utc
print(pytz.all_timezones)

dt = datetime(2019, 10, 31, 23, 30)
print(pactz.utcoffset(dt, is_dst=True))
def do_plotly():
    import plotly.graph_objs as go
    fig = go.Figure()
    fig.add_scatter  # return type inference was slow here without stubs
I'll remove the bundled stubs when the next version of LightGBM, with type annotations, is released.
Ok sure, thanks! Sorry, I probably wouldn't have been watching that repo. If you're interested in improving LightGBM's typing (or anything else), we'd also welcome any contributions you'd like to make here, and we'd be happy to help with the process.
My stubs were generated using Pylance/pyright by adding …
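For illustration only (this is not the actual contents of microsoft/python-type-stubs, and the class and parameter names are hypothetical), a generated stub fragment might look like:

```python
import numpy as np

# Hypothetical .pyi-style fragment: stubs declare signatures with `...`
# bodies so type checkers read the types without inferring them.
class Booster:
    def predict(self, data: "np.ndarray", num_iteration: int = ...) -> "np.ndarray": ...
```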
You can also use pyright to verify the type completeness of your public API (`--verifytypes`): https://microsoft.github.io/pyright/#/command-line?id=pyright-command-line-options
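A hypothetical invocation (assuming pyright is installed, e.g. via `pip install pyright` or `npm install -g pyright`; the package name is just an example):

```shell
# Report how much of a package's public API carries type information.
pyright --verifytypes lightgbm --outputjson > verifytypes.json 2>/dev/null || true
echo "done"
```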
@bschnurr With the release of 4.0.0, LightGBM's types seem complete enough to obsolete https://github.com/microsoft/python-type-stubs/tree/main/stubs/lightgbm-stubs . Not only have a lot of variables been removed or renamed (i.e. a handful are no longer public); at a glance, the only area where I can see the inline type hints not being on par is some non-public method parameters using the … @jameslamb Are you aware of any major issue with the state of type hints in the currently published version?
Sorry for the delayed response. It appears that @bschnurr just went ahead and removed those stubs about 2 weeks ago: microsoft/python-type-stubs#294. So I guess this discussion about whether or not they should be removed can be closed.
We'd welcome a pull request fixing this if you're interested in contributing!
I'm not aware of any in the public API that are incorrect. There are certainly some cases where the type hints could be more specific (e.g. where they're using implicit …).
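As a hypothetical illustration (not LightGBM code) of the difference between a loose hint and a more specific one:

```python
from typing import Any, Optional

# Loose hint: `Any` erases type information for callers.
def set_params_loose(params: Any) -> Any:
    return dict(params)

# More specific: checkers can validate call sites and the return value.
def set_params_specific(params: Optional[dict] = None) -> dict:
    return dict(params) if params is not None else {}
```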
Description
Since both projects are under the Microsoft organization, it made sense to open a reminder here.
Open a PR to delete https://github.com/microsoft/python-type-stubs/tree/main/lightgbm once a typed version of LightGBM is released on PyPI.
This could be added as a checklist item to #5153, if that's the version that will include type hints and a py.typed marker.
Motivation
https://github.com/microsoft/python-type-stubs is bundled with Pylance. The long-term goal, as stated by the maintainers, is to upstream everything either to the base repository or to typeshed. Once LightGBM for Python releases with type hints, https://github.com/microsoft/python-type-stubs/tree/main/lightgbm needs to be deleted, or users of Pylance won't be able to properly make use of the new type hints, and there will be differences between what's seen in the IDE and the pyright CLI results.
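For context, shipping the py.typed marker mentioned above is usually a small packaging change; a hedged sketch follows (the tool choice and paths are illustrative, not LightGBM's actual build configuration):

```toml
# PEP 561: an empty py.typed file inside the package tells type checkers
# to trust the package's inline annotations instead of external stubs.
[tool.setuptools.package-data]
lightgbm = ["py.typed"]
```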