Commit 34f74a7

Authored Sep 19, 2024

Intel Gaudi: update to 1.17.1 with Python 3.11 (instructlab#2211)

Update to the latest Intel Gaudi software, 1.17.1. The new release comes with PyTorch 2.3.1a0, Python 3.11, and official RHEL 9.4 support.

**Checklist:**

- [x] **Commit Message Formatting**: Commit titles and messages follow the guidelines in the [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/#summary) specification.
- [x] [Changelog](https://github.com/instructlab/instructlab/blob/main/CHANGELOG.md) updated with breaking and/or notable changes for the next minor release.
- [x] Documentation has been updated, if necessary.
- [ ] Unit tests have been added, if necessary.
- [ ] Integration tests have been added, if necessary.

Approved-by: cdoern
Approved-by: jaideepr97
Approved-by: leseb
2 parents e189f4e + c0b49ca commit 34f74a7

File tree

8 files changed: +58 -72 lines
 

CHANGELOG.md (+2)

```diff
@@ -14,6 +14,8 @@
 * `list`
 * `sysinfo`
 * `test`
+* Intel Gaudi software has been updated to 1.17.1 with Python 3.11 and
+  Torch 2.3.1 support.
 
 ## v0.18.1
```

Makefile (+1 -1)

```diff
@@ -28,7 +28,7 @@ HPU_CONTAINERFILE = $(CURDIR)/containers/hpu/Containerfile
 HPU_CONTEXT_DIR = $(CURDIR)/containers/hpu
 HPU_DEPS = \
     $(HPU_CONTAINERFILE) \
-    $(CURDIR)/requirements-hpu.txt \
+    $(CURDIR)/requirements/hpu.txt \
     $(COMMON_DEPS) \
     $(NULL)
```

containers/hpu/Containerfile (+23 -24)

```diff
@@ -1,40 +1,43 @@
-ARG HABANA_VERSION=1.16.2
-ARG BASEIMAGE=vault.habana.ai/gaudi-docker/${HABANA_VERSION}/rhel9.2/habanalabs/pytorch-installer-2.2.2
+ARG HABANA_VERSION=1.17.1
+ARG BASEIMAGE=vault.habana.ai/gaudi-docker/${HABANA_VERSION}/rhel9.4/habanalabs/pytorch-installer-2.3.1
 
 FROM ${BASEIMAGE} AS runtime
 # base image has PyTorch fork with Habana plugins in self-compiled Python 3.10
-ARG PYTHON=python3.10
+ARG PYTHON=python3.11
 
 ENV PYTHON="${PYTHON}" \
     APP_ROOT="/opt/app-root"
 ENV PIP_DISABLE_PIP_VERSION_CHECK=1 \
     PIP_NO_COMPILE=1 \
+    PIP_NO_CACHE_DIR=off \
     PS1="(app-root) \w\$ " \
     VIRTUAL_ENV="${APP_ROOT}" \
     PATH="${APP_ROOT}/bin:${PATH}"
 
+# Gaudi container has Torch and habanalabs plugins in system Python.
+# Use system site packages and replace shebang, so scripts like torchrun
+# pick up the virtual env.
 RUN ${PYTHON} -m venv --upgrade-deps --system-site-packages ${VIRTUAL_ENV} && \
+    sed -i '1s:#!/usr/bin/python.*:#!/usr/bin/env python3:' /usr/local/bin/* && \
     mkdir ${VIRTUAL_ENV}/src && \
     find ${VIRTUAL_ENV} -name __pycache__ | xargs rm -rf
 
 COPY containers/sitecustomize.py ${VIRTUAL_ENV}/lib/${PYTHON}/site-packages/
 COPY containers/bin/debug-* ${VIRTUAL_ENV}/bin/
 
-# -mno-avx: work around a build problem with llama-cpp-python and gcc.
-# flash-attn is compiled from source, bitsandbytes has a manylinux wheel
-COPY requirements.txt requirements-hpu.txt /tmp
-RUN sed 's/\[.*\]//' /tmp/requirements.txt >/tmp/constraints.txt && \
-    export PIP_NO_CACHE_DIR=off; \
-    ${VIRTUAL_ENV}/bin/pip install -U wheel pip && \
-    CMAKE_ARGS="-DLLAMA_NATIVE=off" \
-    FORCE_CMAKE=1 \
-    ${VIRTUAL_ENV}/bin/pip install --no-binary llama_cpp_python -c /tmp/constraints.txt llama_cpp_python && \
-    ${VIRTUAL_ENV}/bin/pip install -r /tmp/requirements.txt -r /tmp/requirements-hpu.txt && \
-    rm /tmp/constraints.txt && \
+COPY . /tmp/instructlab
+RUN CMAKE_ARGS="-DLLAMA_NATIVE=off" \
+    ${VIRTUAL_ENV}/bin/pip install "/tmp/instructlab[hpu]" && \
     find ${VIRTUAL_ENV} -name __pycache__ | xargs rm -rf
 
-COPY . /tmp/instructlab
-RUN ${VIRTUAL_ENV}/bin/pip install "/tmp/instructlab[hpu]" && \
+# install Intel Gaudi fork of DeepSpeed
+RUN ${VIRTUAL_ENV}/bin/pip uninstall -y deepspeed && \
+    ${VIRTUAL_ENV}/bin/pip install --no-build-isolation git+https://github.com/HabanaAI/DeepSpeed.git@1.17.1 && \
+    find ${VIRTUAL_ENV} -name __pycache__ | xargs rm -rf
+
+# install Intel Gaudi fork of vLLM
+RUN VLLM_TARGET_DEVICE=hpu \
+    ${VIRTUAL_ENV}/bin/pip install --no-build-isolation git+https://github.com/HabanaAI/vllm-fork.git@v0.5.3.post1-Gaudi-1.17.0 && \
     pip list && \
     find ${VIRTUAL_ENV} -name __pycache__ | xargs rm -rf
 
@@ -43,14 +46,10 @@ WORKDIR "${HOME}"
 VOLUME ["/opt/app-root/src"]
 CMD ["/bin/bash"]
 
-
-# default values, override with `-e HABANA_VISIBLE_MODULES="0,1"`
-ENV TSAN_OPTIONS='ignore_noninstrumented_modules=1' \
-    HABANA_VISIBLE_MODULES="all" \
-    PT_HPU_LAZY_MODE=0 \
-    PT_HPU_ENABLE_EAGER_CACHE=TRUE \
-    PT_HPU_EAGER_4_STAGE_PIPELINE_ENABLE=TRUE \
-    PT_ENABLE_INT64_SUPPORT=1
+# https://docs.habana.ai/en/latest/PyTorch/Reference/Runtime_Flags.html
+# use eager mode / torch.compile()
+ENV PT_HPU_LAZY_MODE=0 \
+    PT_HPU_ENABLE_EAGER_CACHE=TRUE
 # workaround for race condition in libgomp / oneMKL (HS-1795)
 ENV OMP_NUM_THREADS=8
```
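Since the venv is created with `--system-site-packages`, the Habana-patched PyTorch from the base image's system Python stays importable under `/opt/app-root`. A minimal smoke test, assuming the image built as above (file name and expected output are illustrative):

```python
# smoke_test.py -- hypothetical check, run inside the container as:
#   /opt/app-root/bin/python smoke_test.py
import sys

import torch  # comes from the Gaudi base image via --system-site-packages

print(sys.prefix)         # expected: /opt/app-root (the virtual env)
print(torch.__version__)  # expected: a 2.3.1a0+git... pre-release build
```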

docs/habana-gaudi.md (+5 -13)

`````diff
@@ -6,9 +6,9 @@
 
 ## System requirements
 
-- RHEL 9 on `x86_64` (tested with RHEL 9.3 and patched installer)
+- RHEL 9 on `x86_64` (tested with RHEL 9.4)
 - Intel Gaudi 2 device
-- [Habana Labs](https://docs.habana.ai/en/latest/index.html) software stack (tested with 1.16.2)
+- [Habana Labs](https://docs.habana.ai/en/latest/index.html) software stack (tested with 1.17.1)
 - software from Habana Vault for [RHEL](https://vault.habana.ai/ui/native/rhel) and [PyTorch](https://vault.habana.ai/ui/native/gaudi-pt-modules)
 - software [HabanaAI GitHub](https://github.com/HabanaAI/) org like [optimum-habana](https://github.com/HabanaAI/optimum-habana-fork) fork
 
@@ -28,7 +28,7 @@ sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.
 ```ini
 [vault]
 name=Habana Vault
-baseurl=https://vault.habana.ai/artifactory/rhel/9/9.2
+baseurl=https://vault.habana.ai/artifactory/rhel/9/9.4
 enabled=1
 repo_gpgcheck=0
 ```
@@ -86,7 +86,7 @@ hl-smi
 +=============================================================================+
 ````
 
-See [Intel Gaudi SW Stack for RHEL 9.2](https://docs.habana.ai/en/latest/shared/Install_Driver_and_Firmware.html)
+See [Intel Gaudi SW Stack for RHEL 9.4](https://docs.habana.ai/en/latest/shared/Install_Driver_and_Firmware.html)
 for detailed documentation.
 
 ## Other tools
@@ -106,15 +106,7 @@ curl -O https://vault.habana.ai/artifactory/gaudi-installer/1.15.1/habanalabs-in
 chmod +x habanalabs-installer.sh
 ```
 
-> **NOTE**
->
-> Habana Labs Installer 1.15.1 only supports RHEL 9.2 and will fail on 9.3+. You can hack around the limitation by patching the installer:
->
-> ```shell
-> sed -i 's/OS_VERSION=\$VERSION_ID/OS_VERSION=9.2/' habanalabs-installer.sh
-> ```
-
-Install dependencies (use `--verbose` for verbose logging). This will install several RPM packages, download Intel compilers + libraries, download + compile Python 3.10, and more.
+Install dependencies (use `--verbose` for verbose logging). This will install several RPM packages, download Intel compilers + libraries, and more.
 
 ```shell
 export MAKEFLAGS="-j$(nproc)"
`````

requirements.txt (+5 -11)

```diff
@@ -13,11 +13,7 @@ instructlab-sdg>=0.3.0
 instructlab-training>=0.4.1
 llama_cpp_python[server]==0.2.79
 mlx>=0.5.1,<0.6.0; sys_platform == 'darwin' and platform_machine == 'arm64'
-# HabanaLabs / Intel Gaudi env comes with Python 3.10 and slightly older
-# versions of some dependencies. Use '3.10' as an indicator.
-# Habana installer has NumPy 1.23.5
-numpy>=1.23.5,<2.0.0 ; python_version == '3.10'
-numpy>=1.26.4,<2.0.0 ; python_version >= '3.11'
+numpy>=1.26.4,<2.0.0
 openai>=1.13.3
 peft>=0.9.0
 prompt-toolkit>=3.0.38
@@ -31,13 +27,11 @@ sentencepiece>=0.2.0
 # "old" version required for vLLM on CUDA to build
 tokenizers>=0.11.1
 toml>=0.10.2
-# Habana Labs 1.16.2 has PyTorch 2.2.2a0+gitxxx pre-release
-torch>=2.2.2a0,<2.4.0 ; python_version == '3.10'
-torch>=2.3.0,<2.4.0 ; python_version >= '3.11'
+# Habana Labs 1.17.1 has PyTorch 2.3.1a0+gitxxx pre-release
+torch>=2.3.0,<2.4.0
 tqdm>=4.66.2
-# 'optimum' for Intel Gaudi needs transformers <4.41.0,>=4.40.0
-transformers>=4.40.0 ; python_version == '3.10'
-transformers>=4.41.2 ; python_version >= '3.11'
+# 'optimum' for Intel Gaudi needs transformers <4.44.0,>=4.43.0
+transformers>=4.41.2
 trl>=0.9.4
 wandb>=0.16.4
 xdg-base-dirs>=6.0.1
```
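The dropped duplicate pins relied on PEP 508 environment markers (`python_version == '3.10'`), which pip evaluates against the running interpreter; with Gaudi now on Python 3.11, one unmarked pin per package suffices. A minimal sketch of marker evaluation using the `packaging` library (the same library the tests below use):

```python
from packaging.markers import Marker

# pip evaluates markers like this one against the current interpreter,
# which is how a single requirements.txt used to vary pins per Python version.
marker = Marker("python_version == '3.10'")
print(marker.evaluate())  # True only under a Python 3.10 interpreter
```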

requirements/hpu.txt (+5 -7)

```diff
@@ -1,15 +1,13 @@
 # Dependencies for Intel Gaudi / Habana Labs HPU devices
 #
 
-optimum>=1.20.1
-optimum-habana>=1.12.0
-# Habana Labs 1.16.2 has NumPy 1.23.5
-numpy>=1.23.5,<2.0.0
+optimum>=1.21.0
+optimum-habana>=1.13.1
 # Habana Labs 1.16.2 has PyTorch 2.2.2a0+gitxxx pre-release
-torch>=2.2.2a0,<2.4.0
+torch>=2.3.1a0,<2.4.0
 # Habana Labs frameworks
-habana-torch-plugin>=1.16.2
-habana_gpu_migration>=1.16.2
+habana-torch-plugin>=1.17.1
+habana_gpu_migration>=1.17.1
 # additional Habana Labs packages (installed, but not used)
 #habana-media-loader
 #habana-pyhlml
```
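The `>=2.3.1a0` lower bound is deliberate: a pre-release version inside a specifier opts that specifier into matching pre-releases, so the Gaudi PyTorch pre-release build satisfies the pin. A quick illustration with `packaging`:

```python
from packaging.specifiers import SpecifierSet

# A pre-release lower bound enables pre-release matching for the whole set,
# so Gaudi's torch 2.3.1a0 build is accepted.
print(SpecifierSet(">=2.3.1a0,<2.4.0").contains("2.3.1a0"))  # True
# Without a pre-release in the specifiers, pre-releases are excluded by default.
print(SpecifierSet(">=2.3.0,<2.4.0").contains("2.3.1a0"))    # False
```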

src/instructlab/train/linux_train.py (+4 -9)

```diff
@@ -2,9 +2,9 @@
 
 # Standard
 from pathlib import Path
-from typing import Optional
 import logging
 import os
+import typing
 
 # Third Party
 # https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset
@@ -53,9 +53,6 @@
 # Habana implementations of SFT Trainer
 # https://huggingface.co/docs/optimum/habana/index
 from optimum.habana import GaudiConfig, GaudiTrainingArguments
-from optimum.habana.transformers.generation.configuration_utils import (
-    GaudiGenerationConfig,
-)
 from optimum.habana.trl import GaudiSFTTrainer
 
 # silence warning: "Call mark_step function will not have any effect"
@@ -162,7 +159,7 @@ def linux_train(
     train_file: Path,
     test_file: Path,
     model_name: str,
-    num_epochs: Optional[int] = None,
+    num_epochs: int | None,
     train_device: str = "cpu",
     four_bit_quant: bool = False,
     output_dir: Path = Path("training_results"),
@@ -346,6 +343,7 @@ def model_generate(user, **kwargs):
     tokenizer.padding_side = "right"
     per_device_train_batch_size = 1
     max_seq_length = 300
+    generate_kwargs: dict[str, typing.Any]
 
     if device.type == "hpu":
         # Intel Gaudi trainer
@@ -384,10 +382,7 @@ def model_generate(user, **kwargs):
             args=training_arguments,
             gaudi_config=gaudi_config,
         )
-        generate_kwargs = {
-            # TODO: check generation config parameters?
-            "generation_config": GaudiGenerationConfig(),
-        }
+        generate_kwargs = {}
     else:
         training_arguments = TrainingArguments(
             output_dir=output_dir,
```
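The bare `generate_kwargs: dict[str, typing.Any]` annotation declares the variable's type before the `if device.type == "hpu"` split, so each branch can assign a plain dict without the type checker narrowing to the first branch's inferred type. A hypothetical reduction of the pattern (function name and values invented for illustration):

```python
import typing


def pick_generate_kwargs(is_hpu: bool) -> dict[str, typing.Any]:
    # Annotate once, assign in each branch: mypy checks both assignments
    # against the declared type instead of inferring from the first one.
    generate_kwargs: dict[str, typing.Any]
    if is_hpu:
        generate_kwargs = {}
    else:
        generate_kwargs = {"max_new_tokens": 256}
    return generate_kwargs
```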

tests/test_package.py (+13 -7)

```diff
@@ -7,6 +7,7 @@
 
 # Third Party
 from packaging.requirements import Requirement
+from packaging.version import Version
 import pytest
 
 PKG_NAME = "instructlab"
@@ -15,9 +16,8 @@
 # special cases
 EXTRA_CHECKS = {
     "hpu": {
-        "numpy": "1.23.5",
-        "torch": "2.2.2a0",
-        "transformers": "4.40.2",
+        "torch": Version("2.3.1a0"),
+        "transformers": Version("4.43.0"),
     }
 }
 
@@ -44,8 +44,8 @@ def test_require_no_url_req():
 @pytest.mark.parametrize("hw_extra", sorted(HW_EXTRAS))
 @pytest.mark.parametrize("py_version", ["3.10", "3.11"])
 def test_package_conflict(py_version: str, hw_extra: str) -> None:
-    if py_version == "3.11" and hw_extra == "hpu":
-        pytest.skip("Intel Gaudi is not supported on 3.11")
+    if py_version != "3.11" and hw_extra == "hpu":
+        pytest.skip("Intel Gaudi only supports 3.11")
 
     base: dict[str, Requirement] = {}
     hw: dict[str, Requirement] = {}
@@ -72,12 +72,18 @@ def test_package_conflict(py_version: str, hw_extra: str) -> None:
             continue
         for specifier in hwreq.specifier:
             # naive check for common version conflicts
+            # allow pre-releases for Gaudi
             if specifier.operator in {"~=", "==", "<=", ">="}:
-                assert basereq.specifier.contains(specifier.version), (basereq, hwreq)
+                version: Version = Version(specifier.version)
+                assert basereq.specifier.contains(
+                    version, prereleases=version.is_prerelease
+                ), (basereq, hwreq)
 
     # verify special cases against base requirements
     if hw_extra in EXTRA_CHECKS:
         for name, basereq in base.items():
             extra_check = EXTRA_CHECKS[hw_extra].get(name)
             if extra_check is not None:
-                assert basereq.specifier.contains(extra_check), (basereq, extra_check)
+                assert basereq.specifier.contains(
+                    extra_check, prereleases=extra_check.is_prerelease
+                ), (basereq, extra_check)
```
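`packaging` specifier sets exclude pre-releases by default, which is why the test now passes `prereleases=` explicitly when a hardware extra pins a pre-release such as torch 2.3.1a0. A minimal sketch of the behavior the assertions rely on:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

base = SpecifierSet(">=2.3.0,<2.4.0")  # base requirements.txt style pin
hpu = Version("2.3.1a0")               # Gaudi's pre-release torch build

print(base.contains(hpu))                                 # False: excluded by default
print(base.contains(hpu, prereleases=hpu.is_prerelease))  # True: explicit opt-in
```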
