-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
main cuml umap #397
main cuml umap #397
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
engine : UMAPEngine
(public interfaces) +engine_resolved : UMAPEngineCOncrete
(internal calls) so users can pick which engine, default to"auto"
. Match howfeaturize()
does it. - print -> logger
- un-delete
feature
threading through params (needed for memoization)
In a follow-on PR, let's add an ipynb demo!
graphistry/umap_utils.py
Outdated
@@ -119,10 +132,12 @@ def umap_lazy_init( | |||
negative_sample_rate=5, | |||
n_components: int = 2, | |||
metric: str = "euclidean", | |||
# engine: str = "auto", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need this b/c dicates which gets init'd below
graphistry/umap_utils.py
Outdated
@@ -134,6 +149,7 @@ def umap_lazy_init( | |||
local_connectivity=local_connectivity, | |||
repulsion_strength=repulsion_strength, | |||
negative_sample_rate=negative_sample_rate, | |||
# engine=engine, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bring back all these engine calls, we need
graphistry/umap_utils.py
Outdated
|
||
# self._umap = umap.UMAP(**umap_kwargs) | ||
# if self.engine == 'cuml': | ||
if 'cuml' in sys.modules: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def concrete_umap_engine(engine):
if engine == "auto":
if 'cuml' in ...
and then here:
def lazy_init(engine : UMAPEngine):
engine_concrete : UMAPEngineConcrete = concretize_umap_engine(engine)
if engine_concrete == 'cuml':
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i assume 'conrete_umap_engine' here == 'resolve_umap_engine' you described above? the example feature_util has no 'conrete_feature_engine' analog
graphistry/umap_utils.py
Outdated
@@ -181,6 +210,8 @@ def umap_fit(self, X: pd.DataFrame, y: Union[pd.DataFrame, None] = None): | |||
mins = (time() - t) / 60 | |||
logger.info(f"-UMAP-ing took {mins:.2f} minutes total") | |||
logger.info(f" - or {X.shape[0]/mins:.2f} rows per minute") | |||
print(f"-UMAP-ing took {mins:.2f} minutes total") | |||
print(f" - or {X.shape[0]/mins:.2f} rows per minute") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert all printf before final landing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcolinmorgan if you want to have logging set VERBOSE=True
in constants.py https://github.com/graphistry/pygraphistry/blob/master/graphistry/constants.py
graphistry/umap_utils.py
Outdated
|
||
# res._umap = umap.UMAP(**umap_kwargs) | ||
|
||
if 'cuml' in sys.modules: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass in engine selection; should be concretized by this point
graphistry/umap_utils.py
Outdated
# engine='umap-learn' | ||
print(engine) | ||
has_umap=assert_imported(engine) | ||
self.umap_lazy_init(has_umap=has_umap) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
engine_resolved = resolve_umap)engine(engine)
graphistry/umap_utils.py
Outdated
"`pip install cuml`") | ||
raise import_cuml_exn | ||
|
||
UMAPEngineConcrete = Literal["none", "cuml", "umap_learn"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"none"
seems like an illegal value in this case -- how do we do a umap w/ no engine?
(it's valid for feature engineering b/c that means a no-op passthrough of features)
graphistry/umap_utils.py
Outdated
self.engine = engine | ||
|
||
if engine_resolved == "umap_learn": | ||
del umap_kwargs['engine'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
confusing for dels to be here, can be part of umap_kwargs construction
graphistry/umap_utils.py
Outdated
self.umap_initialized = True | ||
self.engine = engine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
save resolved engine here
graphistry/umap_utils.py
Outdated
@@ -361,7 +434,10 @@ def umap( | |||
:return: self, with attributes set with new data | |||
""" | |||
assert_imported() | |||
assert_imported_cuml() | |||
self.umap_lazy_init() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pass engine
graphistry/umap_utils.py
Outdated
self.umap_lazy_init() | ||
engine_resolved = resolve_umap_engine(engine) | ||
|
||
|
||
self.suffix = suffix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all assignments should be on self
, and after if inplace: res = self else res = self.bind()
... includes lazy init which seems to mutate a bunch too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i missed that one point and have changed it in the wimpiest commit ever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah guess I was unclear
engine and suffix should be set on res
, currently defined like 430, not mutate the original self
For end-user simplicity and internal optimizations (memoization, ...), an important guarantee is the API is functional/pure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes:
- ensure
def umap(self, ...)
immediately switches tores = ..
, so subsequent calls do not mutate originalself
(lazy_init...) in typical case ofin_place=False
. RN, I think it'llself._umap = ...
. - memoization seems to be dropping impactful params like metric & engine
Recommended:
- surprisingly typed (and unnecessary?)
has_umap : str = "True"
- notes on ipynb
- add to CHANGELOG and to README
Follow-on PR?
- if input is a pandas df yet engine is cuml, can we auto-coerce pdf to cudf (and keep cudf after)?
graphistry/umap_utils.py
Outdated
@@ -167,28 +160,30 @@ def umap_lazy_init( | |||
negative_sample_rate=5, | |||
n_components: int = 2, | |||
metric: str = "euclidean", | |||
engine: str = "auto", #this seems to override user input | |||
# has_umap: str = "True", | |||
has_umap: str = "True", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of has_umap? Or at least make a bool?
I think it is there just for throwing exns when no engine, so we can do those early vs propagating
graphistry/umap_utils.py
Outdated
engine_resolved = resolve_umap_engine(engine) | ||
|
||
self.engine=engine_resolved | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe more on a future PR, we can probably automate pandas->cudf (and back) when the engine is cuml
graphistry/umap_utils.py
Outdated
self.umap_lazy_init() | ||
engine_resolved = resolve_umap_engine(engine) | ||
|
||
|
||
self.suffix = suffix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah guess I was unclear
engine and suffix should be set on res
, currently defined like 430, not mutate the original self
For end-user simplicity and internal optimizations (memoization, ...), an important guarantee is the API is functional/pure
graphistry/umap_utils.py
Outdated
inplace: bool = False, | ||
feature_engine: str = "auto", | ||
memoize: bool = True, | ||
**featurize_kwargs, | ||
): | ||
""" | ||
UMAP the featurized node or edges data, | ||
or pass in your own X, y (optional). | ||
or pass in your own fX, y (optional). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo?
graphistry/umap_utils.py
Outdated
umap_kwargs = dict( | ||
n_components=n_components, | ||
metric=metric, | ||
# metric=metric, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove metric?
if a problem for cuml, we can throw a targeted exn if engine is cuml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as far as I can tell this flag is still an open issue, with some hacks to get it working rapidsai/cuml#1653 (comment). From what I can tell there is no such flag listed in the cuml UMAP function https://docs.rapids.ai/api/cuml/stable/api.html?highlight=umap#cuml.UMAP and since metric='euclidean' by default in sklearn anyway I thought we could bypass the need for it. although i can see why the user could want to keep the functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that was my guess why you did it. it ends up big for umap users, so to support mult engines, a pattern can be like:
opts = {
'a': 'b',
'c': 'd',
**({'x': 'y'} if engine='umap' else {})
}
README.md
Outdated
|
||
```python | ||
g = graphistry.nodes(G) | ||
g.umap(engine='cuml',metric='euclidean') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g.umap(engine='cum').plot()
:)
README.md
Outdated
@@ -178,6 +178,13 @@ It is easy to turn arbitrary data into insightful graphs. PyGraphistry comes wit | |||
graphistry.edges(edges, 'src', 'dst').plot() | |||
``` | |||
|
|||
* GPU [RAPIDS.ai](https://www.rapids.ai) cuml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cuML
README.md
Outdated
@@ -384,6 +391,12 @@ See `help(g.featurize)` for more options | |||
g.umap(kind='nodes', X=['col_1', ..., 'col_n'], y=['label', ..., 'other_targets'], ...) | |||
``` | |||
|
|||
* UMAP supports different backend `engine`, which by default exploit an availabile GPU, resulting in marked speed increases, and falling back to CPU processing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umap(engine="...")
supports multiple implementations. It defaults to using the GPU-acceleratedengine="cuml"
when a GPU is available, resulting in orders-of-magnitude speedups, and falls back to CPU processing viaengine="umap"
.
"!pip install -U pygraphistry/ --quiet\n", | ||
"\n", | ||
"import graphistry \n", | ||
"graphistry.register(api=3,protocol=\"https\", server=\"hub.graphistry.com\", username='dcolinmorgan', password='f5UwthGEF@F@xnP')\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcolinmorgan can you scrub your pwd and reset your hub acct?
FYI, a bunch of fixes on main branch in case your CI has a lot of false positives. Still a false positive from py3.10 around mypi, but that should be it. |
g.umap(engine='cuml')
, otherwise defaults to umap-learn (also takes that as engine arg)