Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
data = {'w': [100, 1, 1]}
df = pd.DataFrame(data)
df.sample(n=2, weights=df.w, replace=False)
Issue Description
For PPS (probability proportional to size) sampling without replacement to be feasible, the selection probability of every unit must be at most 1, i.e.
$\frac{n \cdot w_i}{\sum_j w_j} \le 1$
where $w_i$ is the weight of unit $i$ and $n$ is the total number of units to be sampled. This condition often fails when you are selecting a sizable proportion of all units and unit sizes vary widely. For example, suppose you want to select 2 units with PPS without replacement from a sampling frame of 3 units with sizes 100, 1, and 1. There is no way to make the selection probability of the first unit 100x that of each of the other two units: the first unit's probability is at most 1, and since the selection probabilities must sum to n = 2, at least one of the other units must have probability >= 0.5.
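As a quick arithmetic check of the example above, the following sketch computes n * w_i / sum(w) for the three units using plain Python. The variable names are illustrative only and nothing here calls pandas. The first unit comes out at 200/102, about 1.96, which is not a valid probability.

```python
# Target selection probabilities n * w_i / sum(w) for the example frame.
weights = [100, 1, 1]
n = 2

total = sum(weights)
inclusion_probs = [n * w / total for w in weights]

for w, p in zip(weights, inclusion_probs):
    status = "infeasible" if p > 1 else "ok"
    print(f"weight={w:>3}  n*w/sum(w)={p:.4f}  {status}")
```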
Unfortunately, the pandas DataFrame.sample method doesn't raise an error in this case.
Expected Behavior
The code above should raise an error (for example a ValueError) with a message like "Some unit probabilities are larger than 1 and thus PPS sampling without replacement cannot be performed".
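One way the requested check could look is sketched below. `check_pps_feasible` is a hypothetical helper written for this report, not an existing pandas function, and the choice of `ValueError` is an assumption about what exception type would fit. It uses only the standard library and accepts any iterable of numeric weights.

```python
def check_pps_feasible(weights, n):
    """Hypothetical helper (not part of pandas): raise if PPS sampling
    of n units without replacement is infeasible for these weights."""
    weights = [float(w) for w in weights]
    total = sum(weights)
    if total <= 0:
        raise ValueError("Weights must sum to a positive value")
    # Positions whose target probability n * w_i / sum(w) exceeds 1.
    bad = [i for i, w in enumerate(weights) if n * w / total > 1]
    if bad:
        raise ValueError(
            "Some unit probabilities are larger than 1 and thus PPS sampling "
            f"without replacement cannot be performed (positions: {bad})"
        )


# Feasible: every unit has target probability 2/3, so nothing is raised.
check_pps_feasible([1, 1, 1], n=2)

# Infeasible: the first unit would need probability 200/102 > 1.
try:
    check_pps_feasible([100, 1, 1], n=2)
except ValueError as exc:
    print(exc)
```

Because a pandas Series is iterable, the same helper could be called as `check_pps_feasible(df.w, n=2)` before `df.sample(n=2, weights=df.w, replace=False)`.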
Installed Versions
Replace this line with the output of pd.show_versions()