diff --git a/README.md b/README.md index 48edeccbe..8a34f0dd6 100644 --- a/README.md +++ b/README.md @@ -70,14 +70,12 @@ For best performance on Intel® Data Center GPU Flex and Max Series, please chec | [BERT large](https://arxiv.org/pdf/1810.04805.pdf) [Sapphire Rapids](https://www.intel.com/content/www/us/en/newsroom/opinion/updates-next-gen-data-center-platform-sapphire-rapids.html#gs.blowcx) | Tensorflow | Training | [FP32 BFloat16 BFloat32](/quickstart/language_modeling/tensorflow/bert_large/training/cpu/README.md) | [SQuAD](https://github.com/IntelAI/models/tree/master/datasets/bert_data/README.md#inference) | | [BERT large (Hugging Face)](https://arxiv.org/pdf/1810.04805.pdf) | TensorFlow | Inference | [FP32 FP16 BFloat16 BFloat32](/benchmarks/language_modeling/tensorflow/bert_large_hf/inference/README.md) | [SQuAD](https://github.com/IntelAI/models/tree/master/datasets/bert_data/README.md#inference) | | [BERT large](https://arxiv.org/pdf/1810.04805.pdf) | PyTorch | Inference | [FP32 Int8 BFloat16 BFloat32](/models_v2/pytorch/bert_large/inference/cpu/README.md) | BERT Large SQuAD1.1 | -| [BERT large](https://arxiv.org/pdf/1810.04805.pdf) | PyTorch | Training | [FP32 BFloat16 BFloat32](/models_v2/pytorch/bert_large/training/cpu/README.md) | [preprocessed text dataset](https://drive.google.com/drive/folders/1cywmDnAsrP5-2vsr8GDc6QUc7VWe-M3v) | | [DistilBERT base](https://arxiv.org/abs/1910.01108) | PyTorch | Inference | [FP32 BF32 BF16Int8-FP32 Int8-BFloat16 BFloat32](/models_v2/pytorch/distilbert/inference/cpu/README.md) | [ DistilBERT Base SQuAD1.1](https://huggingface.co/distilbert-base-uncased-distilled-squad) | | [RNN-T](https://arxiv.org/abs/2007.15188) | PyTorch | Inference | [FP32 BFloat16 BFloat32](/models_v2/pytorch/rnnt/inference/cpu/README.md) | [RNN-T dataset](/models_v2/pytorch/rnnt/inference/cpu/download_dataset.sh) | | [RNN-T](https://arxiv.org/abs/2007.15188) | PyTorch | Training | [FP32 BFloat16 BFloat32](/models_v2/pytorch/rnnt/training/cpu/README.md) | [RNN-T dataset](/models_v2/pytorch/rnnt/training/cpu/download_dataset.sh) | | [GPTJ 6B](https://huggingface.co/EleutherAI/gpt-j-6b) | PyTorch | Inference | [FP32 FP16 BFloat16 BF32 INT8](/models_v2/pytorch/gptj/inference/cpu/README.md) | | | [GPTJ 6B MLPerf](https://github.com/mlcommons/inference/tree/master/language/gpt-j#datasets--models) | PyTorch | Inference | [INT4](/models_v2/pytorch/gpt-j_mlperf/inference/cpu/README.md) | [CNN-Daily Mail dataset](https://huggingface.co/datasets/cnn_dailymail)| | [LLAMA2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) | PyTorch | Inference | [FP32 FP16 BFloat16 BF32 INT8](/models_v2/pytorch/llama/inference/cpu/README.md) | | -| [LLAMA2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) | PyTorch | Training | [FP32 FP16 BFloat16 BF32](/models_v2/pytorch/llama/training/cpu/README.md) | | | [LLAMA2 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf) | PyTorch | Inference | [FP32 FP16 BFloat16 BF32 INT8](/models_v2/pytorch/llama/inference/cpu/README.md) | | | [ChatGLMv3 6B](https://huggingface.co/THUDM/chatglm3-6b) | PyTorch | Inference | [FP32 FP16 BFloat16 BF32 INT8](/models_v2/pytorch/chatglm/inference/cpu/README.md) | | diff --git a/docker/pytorch/docker-compose.yml b/docker/pytorch/docker-compose.yml index 3116a57d8..f5f507033 100644 --- a/docker/pytorch/docker-compose.yml +++ b/docker/pytorch/docker-compose.yml @@ -32,15 +32,15 @@ services: dockerfile: docker/pytorch/bert_large/inference/cpu/pytorch-bert-large-inference.Dockerfile-${BASE_IMAGE_NAME:-ubuntu} command: 
> bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'" - bert_large-training-cpu: - image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-language-modeling-bert-large-training - pull_policy: always - build: - context: ../../ - dockerfile: docker/pytorch/bert_large/training/cpu/pytorch-bert-large-training.Dockerfile-${BASE_IMAGE_NAME:-ubuntu} - extends: bert_large-inference-cpu - command: > - bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'" + # bert_large-training-cpu: + # image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-language-modeling-bert-large-training + # pull_policy: always + # build: + # context: ../../ + # dockerfile: docker/pytorch/bert_large/training/cpu/pytorch-bert-large-training.Dockerfile-${BASE_IMAGE_NAME:-ubuntu} + # extends: bert_large-inference-cpu + # command: > + # bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'" maskrcnn-inference-cpu: image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-object-detection-maskrcnn-inference pull_policy: always @@ -185,15 +185,15 @@ services: extends: bert_large-inference-cpu command: > bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'" - llama-training-cpu: - image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-generative-ai-llama-training - pull_policy: always - build: - context: ../../ - dockerfile: docker/pytorch/llama/training/cpu/pytorch-llama-training.Dockerfile-${BASE_IMAGE_NAME:-ubuntu} - extends: bert_large-inference-cpu - command: > - bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'" + # llama-training-cpu: + # image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-generative-ai-llama-training + # pull_policy: always + # build: + # context: ../../ + # dockerfile: docker/pytorch/llama/training/cpu/pytorch-llama-training.Dockerfile-${BASE_IMAGE_NAME:-ubuntu} + # extends: bert_large-inference-cpu + # command: > + # bash -c "python -c 'import torch; import intel_extension_for_pytorch as ipex; print(\"torch:\", torch.__version__, \" ipex:\",ipex.__version__)'" vit-inference-cpu: image: ${REGISTRY}/aiops/mlops-ci:b-${GITHUB_RUN_NUMBER:-0}-${BASE_IMAGE_NAME:-ubuntu}-${BASE_IMAGE_TAG:-22.04}-image-recognition-vit-inference pull_policy: always diff --git a/docs/general/CPU_DEVCATALOG.md b/docs/general/CPU_DEVCATALOG.md index 666e1e5a3..420ecbc3c 100644 --- a/docs/general/CPU_DEVCATALOG.md +++ b/docs/general/CPU_DEVCATALOG.md @@ -13,7 +13,6 @@ The tables below link to documentation on how to run each use case using docker | --------| ------------------------------------------------------ | ---------- | ------| --------------------- | | PyTorch | [GPT-J](../../models_v2/pytorch/gptj/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16,INT8-FP32 | Inference | LAMBADA | | PyTorch | [Llama 2](../../models_v2/pytorch/llama/inference/cpu/CONTAINER.md) 7B,13B | 
FP32,BF32,BF16,FP16,INT8-FP32 | Inference | LAMBADA | -| PyTorch | [Llama 2](../../models_v2/pytorch/llama/training/cpu/CONTAINER.md) 7B | FP32,BF32,BF16,FP16 | Training | LAMBADA | | PyTorch | [ChatGLM](../../models_v2/pytorch/chatglm/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16,INT8-FP32 | Inference | LAMBADA | | PyTorch | [LCM](../../models_v2/pytorch/LCM/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16,INT8-FP32,INT8-BF16 | Inference | COCO 2017 | | PyTorch | [Stable Diffusion](../../models_v2/pytorch/stable_diffusion/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16,INT8-FP32,INT8-BF16 | Inference | COCO 2017 | @@ -40,7 +39,6 @@ The tables below link to documentation on how to run each use case using docker | Framework | Model | Precisions | Mode | Dataset | | --------| ------------------------------------------------------ | ---------- | ------| --------------------- | -| PyTorch | [BERT large](../../models_v2/pytorch/bert_large/training/cpu/CONTAINER.md) | FP32,BF32,BF16,FP16 | Training | Preprocessed Text dataset | | PyTorch |[BERT large](../../models_v2/pytorch/bert_large/inference/cpu/CONTAINER.md) | FP32,BF32,BF16,INT8 | Inference | SQuAD1.0 | | PyTorch | [RNN-T](../../models_v2/pytorch/rnnt/training/cpu/CONTAINER.md) | FP32,BF32,BF16,INT8 | Inference | LibriSpeech | | PyTorch |[RNN-T](../../models_v2/pytorch/rnnt/inference/cpu/CONTAINER.md) | FP32,BF32,FP16 | Training | LibriSpeech | diff --git a/models_v2/pytorch/bert_large/inference/cpu/CONTAINER.md b/models_v2/pytorch/bert_large/inference/cpu/CONTAINER.md index fe4ded33f..bed29fac8 100644 --- a/models_v2/pytorch/bert_large/inference/cpu/CONTAINER.md +++ b/models_v2/pytorch/bert_large/inference/cpu/CONTAINER.md @@ -45,7 +45,7 @@ To run the BERT Large inference scripts, set environment variables to specify th ```bash export EVAL_DATA_FILE= export OUTPUT_DIR= -export PRECISION= +export PRECISION= export FINETUNED_MODELL= export TEST_MODE= export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16 (for FP16 precision) diff --git a/models_v2/pytorch/bert_large/inference/cpu/README.md b/models_v2/pytorch/bert_large/inference/cpu/README.md index 8882df83f..776c469cf 100644 --- a/models_v2/pytorch/bert_large/inference/cpu/README.md +++ b/models_v2/pytorch/bert_large/inference/cpu/README.md @@ -95,7 +95,7 @@ export FINETUNED_MODEL=$(pwd)/bert_squad_model | **TEST_MODE** (THROUGHPUT, ACCURACY, REALTIME) | `export TEST_MODE=THROUGHPUT (THROUGHPUT, ACCURACY, REALTIME)` | | **EVAL_DATA_FILE** | `export EVAL_DATA_FILE=` | | **OUTPUT_DIR** | `export OUTPUT_DIR=` | -| **PRECISION** | `export PRECISION=bf16` (bf16, bf32, fp32, fp16, int8, avx-int8, avx-fp32 for throughput and bf16, bf32, fp32, fp16, int8, avx-fp32, avx-int8, fp8 for accuracy) | +| **PRECISION** | `export PRECISION=bf16` (bf16, fp32, fp16, int8, avx-int8, avx-fp32 for throughput and bf16, bf32, fp32, fp16, int8, avx-fp32, avx-int8, fp8 for accuracy and realtime) | | **FINETUNED_MODEL** | `export FINETUNED_MODEL=` | | **MODEL_DIR** | `export MODEL_DIR=$(pwd)` | | **BATCH_SIZE** (optional) | `export BATCH_SIZE=` | diff --git a/models_v2/pytorch/bert_large/training/cpu/CONTAINER.md b/models_v2/pytorch/bert_large/training/cpu/CONTAINER.md deleted file mode 100644 index 92ab5fb2d..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/CONTAINER.md +++ /dev/null @@ -1,96 +0,0 @@ -# PyTorch BERT Large training - -## Description -This document has instructions for running BERT-Large training using Intel Extension for PyTorch. 
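As a quick sanity check that the PyTorch and Intel Extension for PyTorch stack is importable before launching any of these workloads, the health-check command used by the docker-compose services earlier in this diff can be written as a tiny script (a minimal sketch; it only mirrors that compose command):

```python
# Mirror of the compose health-check command: import torch and IPEX and report
# their versions so a broken installation fails fast.
import torch
import intel_extension_for_pytorch as ipex

print("torch:", torch.__version__, " ipex:", ipex.__version__)
```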
- -## Pull Command - -```bash -docker pull intel/language-modeling:pytorch-cpu-bert-large-training -``` - -> [!NOTE] -> The `avx-fp32` precision runs the same scripts as `fp32`, except that the `DNNL_MAX_CPU_ISA` environment variable is unset. The environment variable is otherwise set to `DNNL_MAX_CPU_ISA=AVX512_CORE_AMX`. - -## Datasets -Follow instructions to [download and preprocess](./README.md#download-the-preprocessed-text-dataset) the text dataset and set the `DATASET_DIR` to point to the pre-processed dataset. - -# BERT Config File -BERT Training happens in two stages. Download the BERT Config file from [here](https://drive.google.com/drive/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT) and export `BERT_MODEL_CONFIG` variable to point to this file path. - -# Checkpoint Directory -The checkpoint directory is created as a result of Phase 1 Training. Please set the `PRETRAINED_MODEL` to point to the pre-trained model path and volume mount it for Phase 2 training. - -## Docker Run -(Optional) Export related proxy into docker environment. -```bash -export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} \ - -e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} \ - -e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} \ - -e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} \ - -e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} \ - -e SOCKS_PROXY=${SOCKS_PROXY}" -``` - -To run the BERT-Large training scripts, set environment variables to specify the dataset directory, precision and an output directory. - -```bash -export DATASET_DIR= -export OUTPUT_DIR= -export PRECISION= -export BERT_MODEL_CONFIG= -export PRETRAINED_MODEL= -export TRAINING_PHASE= -export DNNL_MAX_CPU_ISA= -export TRAIN_SCRIPT=/workspace/pytorch-bert-large-training/run_pretrain_mlperf.py -export DDP=false -export TORCH_INDUCTOR=0 - -DOCKER_ARGS="--rm -it" -IMAGE_NAME=intel/language-modeling:pytorch-cpu-bert-large-training - -docker run \ - --cap-add SYS_NICE \ - --shm-size 16G \ - --env PRECISION=${PRECISION} \ - --env OUTPUT_DIR=${OUTPUT_DIR} \ - --env TRAIN_SCRIPT=${TRAIN_SCRIPT} \ - --env DATASET_DIR=${DATASET_DIR} \ - --env TRAINING_PHASE=${TRAINING_PHASE} \ - --env DDP=${DDP} \ - --env TORCH_INDUCTOR=${TORCH_INDUCTOR} \ - --env BERT_MODEL_CONFIG=${BERT_MODEL_CONFIG} \ - --env PRETRAINED_MODEL=${PRETRAINED_MODEL} \ - --env DNNL_MAX_CPU_ISA=${DNNL_MAX_CPU_ISA} \ - --volume ${OUTPUT_DIR}:${OUTPUT_DIR} \ - --volume ${DATASET_DIR}:${DATASET_DIR} \ - --volume ${BERT_MODEL_CONFIG}:${BERT_MODEL_CONFIG} \ - --volume ${PRETRAINED_MODEL}:${PRETRAINED_MODEL} \ - ${DOCKER_RUN_ENVS} \ - ${DOCKER_ARGS} \ - $IMAGE_NAME \ - /bin/bash run_model.sh -``` - -> [!NOTE] -> The workload container was validated on a single node(`DDP=false`) with `TORCH_INDUCTOR=0`. 
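The docker run invocation above passes a long list of environment variables, and the phase-specific ones differ: `BERT_MODEL_CONFIG` is only needed for phase 1, while `PRETRAINED_MODEL` (the phase 1 checkpoint) is only needed for phase 2. A minimal pre-flight check along these lines (our own sketch, not part of the container image; the phase rules follow the config and checkpoint notes above) can catch a missing variable before the container starts:

```python
import os

# Sketch only: verify the phase-appropriate variables described above are set
# before invoking docker run. Phase 1 needs BERT_MODEL_CONFIG; phase 2 needs
# the PRETRAINED_MODEL checkpoint produced by phase 1.
phase = os.environ.get("TRAINING_PHASE", "1")
required = ["DATASET_DIR", "OUTPUT_DIR", "PRECISION", "TRAINING_PHASE"]
required.append("BERT_MODEL_CONFIG" if phase == "1" else "PRETRAINED_MODEL")

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"TRAINING_PHASE={phase}: missing {', '.join(missing)}")
print(f"Environment for training phase {phase} looks complete.")
```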
- -## Documentation and Sources -#### Get Started​ -[Docker* Repository](https://hub.docker.com/r/intel/language-modeling) - -[Main GitHub*](https://github.com/IntelAI/models) - -[Release Notes](https://github.com/IntelAI/models/releases) - -[Get Started Guide](https://github.com/IntelAI/models/blob/master/models_v2/pytorch/bert_large/training/cpu/CONTAINER.md) - -#### Code Sources -[Dockerfile](https://github.com/IntelAI/models/tree/master/docker/pytorch) - -[Report Issue](https://community.intel.com/t5/Intel-Optimized-AI-Frameworks/bd-p/optimized-ai-frameworks) - -## License Agreement -LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the [license](https://github.com/IntelAI/models/tree/master/third_party) file for additional details. - -[View All Containers and Solutions 🡢](https://www.intel.com/content/www/us/en/developer/tools/software-catalog/containers.html?s=Newest) diff --git a/models_v2/pytorch/bert_large/training/cpu/README.md b/models_v2/pytorch/bert_large/training/cpu/README.md deleted file mode 100644 index 34b33854e..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/README.md +++ /dev/null @@ -1,207 +0,0 @@ -# BERT Large Training - -BERT Large training best known configurations with Intel® Extension for PyTorch. - -## Model Information - -| **Use Case** | **Framework** | **Model Repo** | **Branch/Commit/Tag** | **Optional Patch** | -|:---:| :---: |:--------------:|:---------------------:|:------------------:| -| Training | PyTorch | https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert | - | - | - -# Pre-Requisite -* Installation of PyTorch and [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#installation) - -## Bare Metal -### General setup - -Follow [link]((https://github.com/IntelAI/models/blob/master/docs/general/pytorch/BareMetalSetup.md)) to install Pytorch, IPEX, TorchVison, Miniforge, Jemalloc and TCMalloc. - -### Model Specific Setup - -* Set Jemalloc and tcmalloc Preload for better performance - - The jemalloc should be built from the [General setup](#general-setup) section. - ``` - export LD_PRELOAD="/lib/libjemalloc.so":"path_to/tcmalloc/lib/libtcmalloc.so":$LD_PRELOAD - export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000" - ``` -* Set IOMP preload for better performance -``` - pip install packaging intel-openmp - export LD_PRELOAD=path/lib/libiomp5.so:$LD_PRELOAD -``` -* Install dependencies -``` -pip install protobuf==3.20.3 numpy==1.20 -``` - -* Set ENV to use fp16 AMX if you are using a supported platform -``` - export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16 -``` - -* Set ENV to use multi-nodes distributed training (no need for single-node multi-sockets) - - In this case, we use data-parallel distributed training and every rank will hold same model replica. The NNODES is the number of ip in the HOSTFILE. To use multi-nodes distributed training you should firstly setup the passwordless login (you can refer to [link](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/)) between these nodes. 
- ``` - export NNODES=#your_node_number - export HOSTFILE=your_ip_list_file #one ip per line - ``` - -* [optional] Compile model with PyTorch Inductor backend (support fp32/bf16/fp16) -```shell - export TORCH_INDUCTOR=1 -``` - - -## Datasets - -# Location of the input files - -This [MLCommons members Google Drive location](https://drive.google.com/drive/u/0/folders/1oQF4diVHNPCclykwdvQJw8n_VIWwV0PT) contains the following. -* TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files). -* Vocab file (vocab.txt) to map WordPiece to word id. -* Config file (bert_config.json) which specifies the hyperparameters of the model. - -# Checkpoint conversion -python convert_tf_checkpoint.py --tf_checkpoint /cks/model.ckpt-28252 --bert_config_path /cks/bert_config.json --output_checkpoint model.ckpt-28252.pt - -# Download the preprocessed text dataset - -From the [MLCommons BERT Processed dataset -directory](https://drive.google.com/drive/folders/1cywmDnAsrP5-2vsr8GDc6QUc7VWe-M3v?usp=sharing) -download `results_text.tar.gz`, and `bert_reference_results_text_md5.txt`. Then perform the following steps: - -```shell -tar xf results_text.tar.gz -cd results4 -md5sum --check ../bert_reference_results_text_md5.txt -cd .. -``` - -After completing this step you should have a directory called `results4/` that -contains 502 files for a total of about 13Gbytes. - -# Generate the BERT input dataset - -The [create_pretraining_data.py](/models/language_modeling/pytorch/bert_large/training/input_preprocessing/create_pretraining_data.py) script duplicates the input plain text, replaces -different sets of words with masks for each duplication, and serializes the -output into the HDF5 file format. - -## Training data - -The following shows how create_pretraining_data.py is called by a parallelized -script that can be called as shown below. The script reads the text data from -the `results4/` subdirectory and outputs the resulting 500 hdf5 files to a -subdirectory named `hdf5/`. - -```shell -pip install tensorflow-cpu protobuf==3.20.3 numpy==1.20 -``` - -For phase1 the seq_len=128: -```shell -export SEQ_LEN=128 -cd -./input_preprocessing/parallel_create_hdf5.sh -``` -For phase2 the seq_len=512: -```shell -export SEQ_LEN=512 -cd -./input_preprocessing/parallel_create_hdf5.sh -``` - -The resulting `hdf5/` subdir will have 500 files named -`part-00???-of-0500.hdf5` and have a size of about 539 Gigabytes. - -Next we need to shard the data into 2048 chunks. This is done by calling the -chop_hdf5_files.py script. This script reads the 500 hdf5 files from -subdirectory `hdf5/` and creates 2048 hdf5 files in subdirectory -`2048_shards_uncompressed`. - -For phase1: - -```shell -export SEQ_LEN=128 -python3 ./input_preprocessing/chop_hdf5_files.py -``` - -For phase2: - -```shell -export SEQ_LEN=512 -python3 ./input_preprocessing/chop_hdf5_files.py -``` - -The above will produce a subdirectory named `2048_shards_uncompressed/` -containing 2048 files named `part_*_of_2048.hdf5` and have a size of about 539 Gigabytes. -you can use "SHARD_NUM" to control the shard files number. the default "SHARD_NUM" if 2048. - -``` - -├── 2048_shards_uncompressed_512 -│ └── part-00000-of-00xxx -└── 2048_shards_uncompressed_128 - └── part-00000-of-00xxx -``` - -# Training -1. `git clone https://github.com/IntelAI/models.git` -2. `cd models/models_v2/pytorch/bert_large/training/cpu` -3. Create virtual environment `venv` and activate it: - ``` - python3 -m venv venv - . ./venv/bin/activate - ``` -4. 
Run setup.sh - ``` - ./setup.sh - ``` -5. Install the latest CPU versions of [torch, torchvision and intel_extension_for_pytorch](https://intel.github.io/intel-extension-for-pytorch/index.html#installation) - -6. Setup required environment paramaters - -| **Parameter** | **export command** | -|:---------------------------:|:------------------------------------------------------------------------------------:| -| **DDP** (true or false) | `export DDP=false` | -| **TRAINING_PHASE** (1 or 2) | `export TRAINING_PHASE=1` | -| **BERT_MODEL_CONFIG** (1st phase only) | `export BERT_MODEL_CONFIG=$(pwd)/bert_config.json` | -| **CHECKPOINT_DIR** (1st phase only) | `export CHECKPOINT_DIR=$(pwd)/checkpoint_phase1_dir` | -| **PRETRAINED_MODEL** (2nd phase only) | `export PRETRAINED_MODEL=$(pwd)/checkpoint_phase1_dir` | -| **DATASET_DIR** | `export DATASET_DIR=` | -| **OUTPUT_DIR** | `export OUTPUT_DIR=$PWD` | -| **TRAIN_SCRIPT** | `export TRAIN_SCRIPT=$(pwd)/run_pretrain_mlperf.py` | -| **PRECISION** | `export PRECISION=` | -| **MODEL_DIR** | `export MODEL_DIR=$(pwd)` | -| **BATCH_SIZE** (optional) | `export BATCH_SIZE=256` | - -7. Run `run_model.sh` - -## Output - -Single-tile output will typically looks like: - -``` -[info] construct file from initialization -[info] input dir = /home/gta/Cosim_test/dataset/hdf5 -[info] num files = 2 -epoch: 1 -Loaded 193485 samples from datafile: /home/gta/Cosim_test/dataset/hdf5/pretrain-part-01.hdf5 -bert_train latency: 0.24147300720214843 s -bert_train throughput: 66.25999396531161 sentences/s -perplexity = 11.020857810974121 -``` -Final results of the inference run can be found in `results.yaml` file. -``` -results: - - key: throughput - value: 66.259994 - unit: sent/s - - key: latency - value: 0.2414730072021484 - unit: s - - key: accuracy - value: 11.021 - unit: perplexity -``` diff --git a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/chop_hdf5_files.py b/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/chop_hdf5_files.py deleted file mode 100644 index 3d4a849eb..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/chop_hdf5_files.py +++ /dev/null @@ -1,129 +0,0 @@ -# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
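# Overview (matches the dataset instructions in the training README): this
# script scans the hdf5_seq_<SEQ_LEN> files produced by
# create_pretraining_data.py, counts the total number of samples, and then
# redistributes them into SHARD_NUM (default 2048) near-equal HDF5 shards
# under 2048_shards_uncompressed_<SEQ_LEN>/. SEQ_LEN and SHARD_NUM are read
# from the environment.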
- -import glob -import h5py -import multiprocessing -import numpy as np -from argparse import ArgumentParser, REMAINDER -from argparse import RawTextHelpFormatter -import os - -hdf5_compression_method = None -max_pred_per_seq = 76 -seq_length = 512 if "SEQ_LEN" not in os.environ else int(os.environ["SEQ_LEN"]) -n_output_shards = 2048 if "SHARD_NUM" not in os.environ else int(os.environ["SHARD_NUM"]) -input_path = 'hdf5_seq_{}'.format(seq_length) -input_files = sorted(glob.glob(input_path + '/part*', recursive=False)) -print('n_input_shards =', len(input_files)) - -print("#########seq_length############".format(seq_length)) -print("#########n_output_shards############".format(n_output_shards)) - -if not os.path.exists('2048_shards_uncompressed_{}'.format(seq_length)): - os.mkdir('2048_shards_uncompressed_{}'.format(seq_length)) - -ofile_prefix = '2048_shards_uncompressed_{}/part_'.format(seq_length) -ofile_suffix = '_of_' + str(n_output_shards) + '.hdf5' - -print('n_output_shards =', n_output_shards) - -# First pass over data to get sample count (read only the smallest array to get count) -n_samples = 0 -for idx, ifile in enumerate(input_files): - print("Scanning:", ifile, " -- Progress:", idx+1, '/', len(input_files)) - h5_ifile = h5py.File(ifile, 'r') - - f_next_sentence_labels = h5_ifile['next_sentence_labels'][:] - - h5_ifile.close() - n_samples += f_next_sentence_labels.shape[0] - - -# Find a "nominal" number of samples per shard (calculated to always go over by one shard size) -# Find excess samples in last shard and distribute removal of excess over first "N" shards (could be done over last, but it doesn't matter and math is easier this way) -# (since 0 <= excess < nominal_shard_size, the max imbalance will be 1 sample to minimize the straggler effect) -n_sample_per_ofile_nominal = (n_samples + n_output_shards - 1) // n_output_shards -n_excess = n_output_shards * n_sample_per_ofile_nominal - n_samples # Always a positive number - -print("creating ", n_output_shards, " output file handles. 
This could take a while.", flush=True) -ofile_handles = [h5py.File(ofile_prefix + str(x) + ofile_suffix, 'w') for x in range(n_output_shards)] - -ofile_idx = 0 # which output file -ofile_entry_idx = 0 # index into an individual data element of an output file -ifile_entry_idx = 0 - -n_samples_in_this_shard = n_sample_per_ofile_nominal - 1 -o_input_ids = np.ndarray((n_samples_in_this_shard, seq_length)) -o_input_masks = np.ndarray((n_samples_in_this_shard, seq_length)) -o_segment_ids = np.ndarray((n_samples_in_this_shard, seq_length)) -o_masked_lm_positions = np.ndarray((n_samples_in_this_shard, max_pred_per_seq)) -o_masked_lm_ids = np.ndarray((n_samples_in_this_shard, max_pred_per_seq)) -o_next_sentence_labels = np.ndarray((n_samples_in_this_shard)) - -for ifile in input_files: - print("Processing:", ifile, " -- Progress:", idx+1, '/', len(input_files)) - h5_ifile = h5py.File(ifile, 'r') - - ifile_entry_idx = 0 - f_input_ids = h5_ifile['input_ids'][:] - f_input_masks = h5_ifile['input_mask'][:] - f_segment_ids = h5_ifile['segment_ids'][:] - f_masked_lm_positions = h5_ifile['masked_lm_positions'][:] - f_masked_lm_ids = h5_ifile['masked_lm_ids'][:] - f_next_sentence_labels = h5_ifile['next_sentence_labels'][:] - - h5_ifile.close() - - # This could be vectorized but keeping it simple due to lack of time - while ifile_entry_idx < f_input_ids.shape[0]: - if ofile_entry_idx == n_samples_in_this_shard: - ofile_handles[ofile_idx].create_dataset("input_ids", data=o_input_ids, dtype='i4', compression=hdf5_compression_method) - ofile_handles[ofile_idx].create_dataset("input_mask", data=o_input_masks, dtype='i1', compression=hdf5_compression_method) - ofile_handles[ofile_idx].create_dataset("segment_ids", data=o_segment_ids, dtype='i1', compression=hdf5_compression_method) - ofile_handles[ofile_idx].create_dataset("masked_lm_positions", data=o_masked_lm_positions, dtype='i4', compression=hdf5_compression_method) - ofile_handles[ofile_idx].create_dataset("masked_lm_ids", data=o_masked_lm_ids, dtype='i4', compression=hdf5_compression_method) - ofile_handles[ofile_idx].create_dataset("next_sentence_labels", data=o_next_sentence_labels, dtype='i1', compression=hdf5_compression_method) - ofile_handles[ofile_idx].flush() - ofile_handles[ofile_idx].close() - - ofile_entry_idx = 0 - ofile_idx += 1 - print("Opening output idx:", ofile_idx) - - n_samples_in_this_shard = n_sample_per_ofile_nominal - if ofile_entry_idx < n_excess: - n_samples_in_this_shard -= 1 - - o_input_ids = np.ndarray((n_samples_in_this_shard, seq_length)) - o_input_masks = np.ndarray((n_samples_in_this_shard, seq_length)) - o_segment_ids = np.ndarray((n_samples_in_this_shard, seq_length)) - o_masked_lm_positions = np.ndarray((n_samples_in_this_shard, max_pred_per_seq)) - o_masked_lm_ids = np.ndarray((n_samples_in_this_shard, max_pred_per_seq)) - o_next_sentence_labels = np.ndarray((n_samples_in_this_shard)) - - o_input_ids[ofile_entry_idx] = f_input_ids[ifile_entry_idx] - o_input_masks[ofile_entry_idx] = f_input_masks[ifile_entry_idx] - o_segment_ids[ofile_entry_idx] = f_segment_ids[ifile_entry_idx] - o_masked_lm_positions[ofile_entry_idx] = f_masked_lm_positions[ifile_entry_idx] - o_masked_lm_ids[ofile_entry_idx] = f_masked_lm_ids[ifile_entry_idx] - o_next_sentence_labels[ofile_entry_idx] = f_next_sentence_labels[ifile_entry_idx] - ofile_entry_idx += 1 - - ifile_entry_idx += 1 - -if __name__ == '__main__': - parser = ArgumentParser(description="This is a script to parse the trace file") - parser.add_argument("--trace", metavar='\b', 
default="test_trace_10.json", type=str, - help="The trace file. ") - args = parser.parse_args() diff --git a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/create_pretraining_data.py b/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/create_pretraining_data.py deleted file mode 100644 index 8eaedb7fe..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/create_pretraining_data.py +++ /dev/null @@ -1,455 +0,0 @@ -# coding=utf-8 -# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved. -# Copyright 2020 MLBenchmark Group. All rights reserved. - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Create masked LM/next sentence masked_lm TF examples for BERT.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import collections -import random -import tokenization -import tensorflow as tf - -import h5py -import numpy as np - -hdf5_compression_method = None - -#flags = tf.flags -flags = tf.compat.v1.flags - -FLAGS = flags.FLAGS - -flags.DEFINE_string("input_file", None, - "Input raw text file (or comma-separated list of files).") - -flags.DEFINE_string( - "output_file", None, - "Output TF example file (or comma-separated list of files).") - -flags.DEFINE_string("vocab_file", None, - "The vocabulary file that the BERT model was trained on.") - -flags.DEFINE_bool( - "do_lower_case", True, - "Whether to lower case the input text. 
Should be True for uncased " - "models and False for cased models.") - -flags.DEFINE_integer("max_seq_length", 128, "Maximum sequence length.") - -flags.DEFINE_integer("max_predictions_per_seq", 20, - "Maximum number of masked LM predictions per sequence.") - -flags.DEFINE_integer("random_seed", 12345, "Random seed for data generation.") - -flags.DEFINE_integer( - "dupe_factor", 10, - "Number of times to duplicate the input data (with different masks).") - -flags.DEFINE_float("masked_lm_prob", 0.15, "Masked LM probability.") - -flags.DEFINE_float( - "short_seq_prob", 0.1, - "Probability of creating sequences which are shorter than the " - "maximum length.") - - -class TrainingInstance(object): - """A single training instance (sentence pair).""" - - def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels, - is_random_next): - self.tokens = tokens - self.segment_ids = segment_ids - self.is_random_next = is_random_next - self.masked_lm_positions = masked_lm_positions - self.masked_lm_labels = masked_lm_labels - - def __str__(self): - s = "" - s += "tokens: %s\n" % (" ".join( - [tokenization.printable_text(x) for x in self.tokens])) - s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids])) - s += "is_random_next: %s\n" % self.is_random_next - s += "masked_lm_positions: %s\n" % (" ".join( - [str(x) for x in self.masked_lm_positions])) - s += "masked_lm_labels: %s\n" % (" ".join( - [tokenization.printable_text(x) for x in self.masked_lm_labels])) - s += "\n" - return s - - def __repr__(self): - return self.__str__() - - -def write_instance_to_example_files(instances, tokenizer, max_seq_length, - max_predictions_per_seq, output_files): - """Create TF example files from `TrainingInstance`s.""" - writers = [] - h5_writers = [] - - expected_instances_per_file = len(instances) // len(output_files) + 500 # Over-allocation to avoid resizing - for output_file in output_files: - h5_writers.append({ - 'handle' : h5py.File(output_file + ".hdf5", 'w'), - 'input_ids' : np.zeros([expected_instances_per_file, max_seq_length], dtype="int32"), - 'input_mask' : np.zeros([expected_instances_per_file, max_seq_length], dtype="int32"), - 'segment_ids' : np.zeros([expected_instances_per_file, max_seq_length], dtype="int32"), - 'masked_lm_positions' : np.zeros([expected_instances_per_file, max_predictions_per_seq], dtype="int32"), - 'masked_lm_ids' : np.zeros([expected_instances_per_file, max_predictions_per_seq], dtype="int32"), - 'next_sentence_labels' : np.zeros(expected_instances_per_file, dtype="int32"), - 'len' : 0 }) - - writer_index = 0 - - total_written = 0 - - features_h5 = collections.OrderedDict() - - for (inst_index, instance) in enumerate(instances): - input_ids = tokenizer.convert_tokens_to_ids(instance.tokens) - input_mask = [1] * len(input_ids) - segment_ids = list(instance.segment_ids) - assert len(input_ids) <= max_seq_length - - while len(input_ids) < max_seq_length: - input_ids.append(0) - input_mask.append(0) - segment_ids.append(0) - - assert len(input_ids) == max_seq_length - assert len(input_mask) == max_seq_length - assert len(segment_ids) == max_seq_length - - masked_lm_positions = list(instance.masked_lm_positions) - masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels) - masked_lm_weights = [1.0] * len(masked_lm_ids) - - while len(masked_lm_positions) < max_predictions_per_seq: - masked_lm_positions.append(0) - masked_lm_ids.append(0) - masked_lm_weights.append(0.0) - - next_sentence_label = 1 if instance.is_random_next else 0 - - 
h5_writers[writer_index]['input_ids'][inst_index] = input_ids - h5_writers[writer_index]['input_mask'][inst_index] = input_mask - h5_writers[writer_index]['segment_ids'][inst_index] = segment_ids - h5_writers[writer_index]['masked_lm_positions'][inst_index] = masked_lm_positions - h5_writers[writer_index]['masked_lm_ids'][inst_index] = masked_lm_ids - h5_writers[writer_index]['next_sentence_labels'][inst_index] = next_sentence_label - h5_writers[writer_index]['len'] += 1 - - writer_index = (writer_index + 1) % len(h5_writers) - - total_written += 1 - - if inst_index < 20: - tf.compat.v1.logging.info("*** Example ***") - tf.compat.v1.logging.info("tokens: %s" % " ".join( - [tokenization.printable_text(x) for x in instance.tokens])) - - print("saving data") - for h5_writer in h5_writers: - my_size = h5_writer['len'] - h5_writer['handle'].create_dataset('input_ids', data=h5_writer['input_ids'][:my_size], dtype='i4', compression=hdf5_compression_method) - h5_writer['handle'].create_dataset('input_mask', data=h5_writer['input_mask'][:my_size], dtype='i1', compression=hdf5_compression_method) - h5_writer['handle'].create_dataset('segment_ids', data=h5_writer['segment_ids'][:my_size], dtype='i1', compression=hdf5_compression_method) - h5_writer['handle'].create_dataset('masked_lm_positions', data=h5_writer['masked_lm_positions'][:my_size], dtype='i4', compression=hdf5_compression_method) - h5_writer['handle'].create_dataset('masked_lm_ids', data=h5_writer['masked_lm_ids'][:my_size], dtype='i4', compression=hdf5_compression_method) - h5_writer['handle'].create_dataset('next_sentence_labels', data=h5_writer['next_sentence_labels'][:my_size], dtype='i1', compression=hdf5_compression_method) - h5_writer['handle'].flush() - h5_writer['handle'].close() - - tf.compat.v1.logging.info("Wrote %d total instances", total_written) - - -def create_int_feature(values): - feature = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) - return feature - -def create_float_feature(values): - feature = tf.train.Feature(float_list=tf.train.FloatList(value=list(values))) - return feature - -def create_training_instances(input_files, tokenizer, max_seq_length, - dupe_factor, short_seq_prob, masked_lm_prob, - max_predictions_per_seq, rng): - """Create `TrainingInstance`s from raw text.""" - all_documents = [[]] - - # Input file format: - # (1) One sentence per line. These should ideally be actual sentences, not - # entire paragraphs or arbitrary spans of text. (Because we use the - # sentence boundaries for the "next sentence prediction" task). - # (2) Blank lines between documents. Document boundaries are needed so - # that the "next sentence prediction" task doesn't span between documents. 
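  # For example, a conforming input file might look like:
  #
  #   The first sentence of document one.
  #   A second sentence of document one.
  #
  #   Document two starts after the blank line.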
- for input_file in input_files: - with tf.compat.v1.gfile.GFile(input_file, "r") as reader: - while True: - line = tokenization.convert_to_unicode(reader.readline()) - if not line: - break - line = line.strip() - - # Empty lines are used as document delimiters - if not line: - all_documents.append([]) - tokens = tokenizer.tokenize(line) - if tokens: - all_documents[-1].append(tokens) - - # Remove empty documents - all_documents = [x for x in all_documents if x] - rng.shuffle(all_documents) - - vocab_words = list(tokenizer.vocab.keys()) - instances = [] - for _ in range(dupe_factor): - for document_index in range(len(all_documents)): - instances.extend( - create_instances_from_document( - all_documents, document_index, max_seq_length, short_seq_prob, - masked_lm_prob, max_predictions_per_seq, vocab_words, rng)) - - rng.shuffle(instances) - return instances - - -def create_instances_from_document( - all_documents, document_index, max_seq_length, short_seq_prob, - masked_lm_prob, max_predictions_per_seq, vocab_words, rng): - """Creates `TrainingInstance`s for a single document.""" - document = all_documents[document_index] - - # Account for [CLS], [SEP], [SEP] - max_num_tokens = max_seq_length - 3 - - # We *usually* want to fill up the entire sequence since we are padding - # to `max_seq_length` anyways, so short sequences are generally wasted - # computation. However, we *sometimes* - # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter - # sequences to minimize the mismatch between pre-training and fine-tuning. - # The `target_seq_length` is just a rough target however, whereas - # `max_seq_length` is a hard limit. - target_seq_length = max_num_tokens - if rng.random() < short_seq_prob: - target_seq_length = rng.randint(2, max_num_tokens) - - # We DON'T just concatenate all of the tokens from a document into a long - # sequence and choose an arbitrary split point because this would make the - # next sentence prediction task too easy. Instead, we split the input into - # segments "A" and "B" based on the actual "sentences" provided by the user - # input. - instances = [] - current_chunk = [] - current_length = 0 - i = 0 - while i < len(document): - segment = document[i] - current_chunk.append(segment) - current_length += len(segment) - if i == len(document) - 1 or current_length >= target_seq_length: - if current_chunk: - # `a_end` is how many segments from `current_chunk` go into the `A` - # (first) sentence. - a_end = 1 - if len(current_chunk) >= 2: - a_end = rng.randint(1, len(current_chunk) - 1) - - tokens_a = [] - for j in range(a_end): - tokens_a.extend(current_chunk[j]) - - tokens_b = [] - # Random next - is_random_next = False - if len(current_chunk) == 1 or rng.random() < 0.5: - is_random_next = True - target_b_length = target_seq_length - len(tokens_a) - - # This should rarely go for more than one iteration for large - # corpora. However, just to be careful, we try to make sure that - # the random document is not the same as the document - # we're processing. - for _ in range(10): - random_document_index = rng.randint(0, len(all_documents) - 1) - if random_document_index != document_index: - break - - random_document = all_documents[random_document_index] - random_start = rng.randint(0, len(random_document) - 1) - for j in range(random_start, len(random_document)): - tokens_b.extend(random_document[j]) - if len(tokens_b) >= target_b_length: - break - # We didn't actually use these segments so we "put them back" so - # they don't go to waste. 
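          # e.g. with len(current_chunk) == 5 and a_end == 2, the three segments
          # not used for sentence A are re-queued by rewinding i, so later
          # iterations of the outer loop will see them again.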
- num_unused_segments = len(current_chunk) - a_end - i -= num_unused_segments - # Actual next - else: - is_random_next = False - for j in range(a_end, len(current_chunk)): - tokens_b.extend(current_chunk[j]) - truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) - - assert len(tokens_a) >= 1 - assert len(tokens_b) >= 1 - - tokens = [] - segment_ids = [] - tokens.append("[CLS]") - segment_ids.append(0) - for token in tokens_a: - tokens.append(token) - segment_ids.append(0) - - tokens.append("[SEP]") - segment_ids.append(0) - - for token in tokens_b: - tokens.append(token) - segment_ids.append(1) - tokens.append("[SEP]") - segment_ids.append(1) - - (tokens, masked_lm_positions, - masked_lm_labels) = create_masked_lm_predictions( - tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng) - instance = TrainingInstance( - tokens=tokens, - segment_ids=segment_ids, - is_random_next=is_random_next, - masked_lm_positions=masked_lm_positions, - masked_lm_labels=masked_lm_labels) - instances.append(instance) - current_chunk = [] - current_length = 0 - i += 1 - - return instances - -MaskedLmInstance = collections.namedtuple("MaskedLmInstance", - ["index", "label"]) - -def create_masked_lm_predictions(tokens, masked_lm_prob, - max_predictions_per_seq, vocab_words, rng): - """Creates the predictions for the masked LM objective.""" - - cand_indexes = [] - for (i, token) in enumerate(tokens): - if token == "[CLS]" or token == "[SEP]": - continue - cand_indexes.append(i) - - rng.shuffle(cand_indexes) - - output_tokens = list(tokens) - - num_to_predict = min(max_predictions_per_seq, - max(1, int(round(len(tokens) * masked_lm_prob)))) - - masked_lms = [] - covered_indexes = set() - for index in cand_indexes: - if len(masked_lms) >= num_to_predict: - break - if index in covered_indexes: - continue - covered_indexes.add(index) - - masked_token = None - # 80% of the time, replace with [MASK] - if rng.random() < 0.8: - masked_token = "[MASK]" - else: - # 10% of the time, keep original - if rng.random() < 0.5: - masked_token = tokens[index] - # 10% of the time, replace with random word - else: - masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] - - output_tokens[index] = masked_token - - masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) - - masked_lms = sorted(masked_lms, key=lambda x: x.index) - - masked_lm_positions = [] - masked_lm_labels = [] - for p in masked_lms: - masked_lm_positions.append(p.index) - masked_lm_labels.append(p.label) - - return (output_tokens, masked_lm_positions, masked_lm_labels) - - -def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): - """Truncates a pair of sequences to a maximum sequence length.""" - while True: - total_length = len(tokens_a) + len(tokens_b) - if total_length <= max_num_tokens: - break - - trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b - assert len(trunc_tokens) >= 1 - - # We want to sometimes truncate from the front and sometimes from the - # back to add more randomness and avoid biases. 
- if rng.random() < 0.5: - del trunc_tokens[0] - else: - trunc_tokens.pop() - - -def main(_): - tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) - - tokenizer = tokenization.FullTokenizer( - vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) - - input_files = [] - for input_pattern in FLAGS.input_file.split(","): - input_files.extend(tf.compat.v1.gfile.Glob(input_pattern)) - - tf.compat.v1.logging.info("*** Reading from input files ***") - for input_file in input_files: - tf.compat.v1.logging.info(" %s", input_file) - - rng = random.Random(FLAGS.random_seed) - instances = create_training_instances( - input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor, - FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq, - rng) - - output_files = FLAGS.output_file.split(",") - tf.compat.v1.logging.info("*** Writing to output files ***") - for output_file in output_files: - tf.compat.v1.logging.info(" %s", output_file) - - write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length, - FLAGS.max_predictions_per_seq, output_files) - - -if __name__ == "__main__": - flags.mark_flag_as_required("input_file") - flags.mark_flag_as_required("output_file") - flags.mark_flag_as_required("vocab_file") - tf.compat.v1.app.run() diff --git a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/create_pretraining_data_wrapper.sh b/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/create_pretraining_data_wrapper.sh deleted file mode 100755 index 58b628880..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/create_pretraining_data_wrapper.sh +++ /dev/null @@ -1,29 +0,0 @@ -#!/bin/bash -# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -input_path=${1} -SEQ_LEN=${SEQ_LEN:-512} -output_dir="hdf5_seq_"${SEQ_LEN} -input_file=$(basename $input_path) - -python3 ./create_pretraining_data.py \ - --input_file=${input_path} \ - --output_file="${output_dir}/${input_file}" \ - --vocab_file=vocab.txt \ - --do_lower_case=True \ - --max_seq_length=$SEQ_LEN \ - --max_predictions_per_seq=76 \ - --masked_lm_prob=0.15 \ - --random_seed=12345 \ - --dupe_factor=10 diff --git a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/parallel_create_hdf5.sh b/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/parallel_create_hdf5.sh deleted file mode 100755 index 9e1e0a85b..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/parallel_create_hdf5.sh +++ /dev/null @@ -1,20 +0,0 @@ -#!/bin/bash -# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -cpus=$( ls -d /sys/devices/system/cpu/cpu[[:digit:]]* | wc -w ) -cpus=$((cpus / 2)) -echo "Using $cpus CPU cores" -SEQ_LEN=${SEQ_LEN:-512} -mkdir -p "hdf5_seq_"${SEQ_LEN} -find -L results4/ -name "part*" | xargs --max-args=1 --max-procs=$cpus ./create_pretraining_data_wrapper.sh diff --git a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/tokenization.py b/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/tokenization.py deleted file mode 100755 index f9f96f788..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/input_preprocessing/tokenization.py +++ /dev/null @@ -1,428 +0,0 @@ -# coding=utf-8 -# Copyright 2020 MLBenchmark Group. All rights reserved. - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -"""Tokenization classes.""" - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import collections -import re -import unicodedata - -from absl import flags -import six -import tensorflow.compat.v1 as tf - -FLAGS = flags.FLAGS - -flags.DEFINE_bool( - "preserve_unused_tokens", False, - "If True, Wordpiece tokenization will not be applied to words in the vocab." -) - -_UNUSED_TOKEN_RE = re.compile("^\\[unused\\d+\\]$") - - -def preserve_token(token, vocab): - """Returns True if the token should forgo tokenization and be preserved.""" - if not FLAGS.preserve_unused_tokens: - return False - if token not in vocab: - return False - return bool(_UNUSED_TOKEN_RE.search(token)) - - -def validate_case_matches_checkpoint(do_lower_case, init_checkpoint): - """Checks whether the casing config is consistent with the checkpoint name.""" - - # The casing has to be passed in by the user and there is no explicit check - # as to whether it matches the checkpoint. The casing information probably - # should have been stored in the bert_config.json file, but it's not, so - # we have to heuristically detect it to validate. 
- - if not init_checkpoint: - return - - m = re.match("^.*?([A-Za-z0-9_-]+)/bert_model.ckpt", init_checkpoint) - if m is None: - return - - model_name = m.group(1) - - lower_models = [ - "uncased_L-24_H-1024_A-16", "uncased_L-12_H-768_A-12", - "multilingual_L-12_H-768_A-12", "chinese_L-12_H-768_A-12" - ] - - cased_models = [ - "cased_L-12_H-768_A-12", "cased_L-24_H-1024_A-16", - "multi_cased_L-12_H-768_A-12" - ] - - is_bad_config = False - if model_name in lower_models and not do_lower_case: - is_bad_config = True - actual_flag = "False" - case_name = "lowercased" - opposite_flag = "True" - - if model_name in cased_models and do_lower_case: - is_bad_config = True - actual_flag = "True" - case_name = "cased" - opposite_flag = "False" - - if is_bad_config: - raise ValueError( - "You passed in `--do_lower_case=%s` with `--init_checkpoint=%s`. " - "However, `%s` seems to be a %s model, so you " - "should pass in `--do_lower_case=%s` so that the fine-tuning matches " - "how the model was pre-training. If this error is wrong, please " - "just comment out this check." % (actual_flag, init_checkpoint, - model_name, case_name, opposite_flag)) - - -def convert_to_unicode(text): - """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" - if six.PY3: - if isinstance(text, str): - return text - elif isinstance(text, bytes): - return text.decode("utf-8", "ignore") - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - elif six.PY2: - if isinstance(text, str): - return text.decode("utf-8", "ignore") - elif isinstance(text, unicode): - return text - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - else: - raise ValueError("Not running on Python2 or Python 3?") - - -def printable_text(text): - """Returns text encoded in a way suitable for print or `tf.logging`.""" - - # These functions want `str` for both Python2 and Python3, but in one case - # it's a Unicode string and in the other it's a byte string. 
- if six.PY3: - if isinstance(text, str): - return text - elif isinstance(text, bytes): - return text.decode("utf-8", "ignore") - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - elif six.PY2: - if isinstance(text, str): - return text - elif isinstance(text, unicode): - return text.encode("utf-8") - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - else: - raise ValueError("Not running on Python2 or Python 3?") - - -def load_vocab(vocab_file): - """Loads a vocabulary file into a dictionary.""" - vocab = collections.OrderedDict() - with tf.gfile.GFile(vocab_file, "r") as reader: - while True: - token = convert_to_unicode(reader.readline()) - if not token: - break - token = token.strip() - if token not in vocab: - vocab[token] = len(vocab) - return vocab - - -def convert_by_vocab(vocab, items): - """Converts a sequence of [tokens|ids] using the vocab.""" - output = [] - for item in items: - output.append(vocab[item]) - return output - - -def convert_tokens_to_ids(vocab, tokens): - return convert_by_vocab(vocab, tokens) - - -def convert_ids_to_tokens(inv_vocab, ids): - return convert_by_vocab(inv_vocab, ids) - - -def whitespace_tokenize(text): - """Runs basic whitespace cleaning and splitting on a piece of text.""" - text = text.strip() - if not text: - return [] - tokens = text.split() - return tokens - - -class FullTokenizer(object): - """Runs end-to-end tokenziation.""" - - def __init__(self, vocab_file, do_lower_case=True): - self.vocab = load_vocab(vocab_file) - self.inv_vocab = {v: k for k, v in self.vocab.items()} - self.basic_tokenizer = BasicTokenizer( - do_lower_case=do_lower_case, vocab=self.vocab) - self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) - - def tokenize(self, text): - split_tokens = [] - for token in self.basic_tokenizer.tokenize(text): - if preserve_token(token, self.vocab): - split_tokens.append(token) - continue - for sub_token in self.wordpiece_tokenizer.tokenize(token): - split_tokens.append(sub_token) - - return split_tokens - - def convert_tokens_to_ids(self, tokens): - return convert_by_vocab(self.vocab, tokens) - - def convert_ids_to_tokens(self, ids): - return convert_by_vocab(self.inv_vocab, ids) - - -class BasicTokenizer(object): - """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" - - def __init__(self, do_lower_case=True, vocab=tuple()): - """Constructs a BasicTokenizer. - - Args: - do_lower_case: Whether to lower case the input. - vocab: A container of tokens to not mutate during tokenization. - """ - self.do_lower_case = do_lower_case - self.vocab = vocab - - def tokenize(self, text): - """Tokenizes a piece of text.""" - text = convert_to_unicode(text) - text = self._clean_text(text) - - # This was added on November 1st, 2018 for the multilingual and Chinese - # models. This is also applied to the English models now, but it doesn't - # matter since the English models were not trained on any Chinese data - # and generally don't have any Chinese data in them (there are Chinese - # characters in the vocabulary because Wikipedia does have some Chinese - # words in the English Wikipedia.). 
- text = self._tokenize_chinese_chars(text) - - orig_tokens = whitespace_tokenize(text) - split_tokens = [] - for token in orig_tokens: - if preserve_token(token, self.vocab): - split_tokens.append(token) - continue - if self.do_lower_case: - token = token.lower() - token = self._run_strip_accents(token) - split_tokens.extend(self._run_split_on_punc(token)) - - output_tokens = whitespace_tokenize(" ".join(split_tokens)) - return output_tokens - - def _run_strip_accents(self, text): - """Strips accents from a piece of text.""" - text = unicodedata.normalize("NFD", text) - output = [] - for char in text: - cat = unicodedata.category(char) - if cat == "Mn": - continue - output.append(char) - return "".join(output) - - def _run_split_on_punc(self, text): - """Splits punctuation on a piece of text.""" - chars = list(text) - i = 0 - start_new_word = True - output = [] - while i < len(chars): - char = chars[i] - if _is_punctuation(char): - output.append([char]) - start_new_word = True - else: - if start_new_word: - output.append([]) - start_new_word = False - output[-1].append(char) - i += 1 - - return ["".join(x) for x in output] - - def _tokenize_chinese_chars(self, text): - """Adds whitespace around any CJK character.""" - output = [] - for char in text: - cp = ord(char) - if self._is_chinese_char(cp): - output.append(" ") - output.append(char) - output.append(" ") - else: - output.append(char) - return "".join(output) - - def _is_chinese_char(self, cp): - """Checks whether CP is the codepoint of a CJK character.""" - # This defines a "chinese character" as anything in the CJK Unicode block: - # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) - # - # Note that the CJK Unicode block is NOT all Japanese and Korean characters, - # despite its name. The modern Korean Hangul alphabet is a different block, - # as is Japanese Hiragana and Katakana. Those alphabets are used to write - # space-separated words, so they are not treated specially and handled - # like the all of the other languages. - if ((cp >= 0x4E00 and cp <= 0x9FFF) or # - (cp >= 0x3400 and cp <= 0x4DBF) or # - (cp >= 0x20000 and cp <= 0x2A6DF) or # - (cp >= 0x2A700 and cp <= 0x2B73F) or # - (cp >= 0x2B740 and cp <= 0x2B81F) or # - (cp >= 0x2B820 and cp <= 0x2CEAF) or - (cp >= 0xF900 and cp <= 0xFAFF) or # - (cp >= 0x2F800 and cp <= 0x2FA1F)): # - return True - - return False - - def _clean_text(self, text): - """Performs invalid character removal and whitespace cleanup on text.""" - output = [] - for char in text: - cp = ord(char) - if cp == 0 or cp == 0xfffd or _is_control(char): - continue - if _is_whitespace(char): - output.append(" ") - else: - output.append(char) - return "".join(output) - - -class WordpieceTokenizer(object): - """Runs WordPiece tokenziation.""" - - def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200): - self.vocab = vocab - self.unk_token = unk_token - self.max_input_chars_per_word = max_input_chars_per_word - - def tokenize(self, text): - """Tokenizes a piece of text into its word pieces. - - This uses a greedy longest-match-first algorithm to perform tokenization - using the given vocabulary. - - For example: - input = "unaffable" - output = ["un", "##aff", "##able"] - - Args: - text: A single token or whitespace separated tokens. This should have - already been passed through `BasicTokenizer. - - Returns: - A list of wordpiece tokens. 
- """ - - text = convert_to_unicode(text) - - output_tokens = [] - for token in whitespace_tokenize(text): - chars = list(token) - if len(chars) > self.max_input_chars_per_word: - output_tokens.append(self.unk_token) - continue - - is_bad = False - start = 0 - sub_tokens = [] - while start < len(chars): - end = len(chars) - cur_substr = None - while start < end: - substr = "".join(chars[start:end]) - if start > 0: - substr = "##" + substr - if substr in self.vocab: - cur_substr = substr - break - end -= 1 - if cur_substr is None: - is_bad = True - break - sub_tokens.append(cur_substr) - start = end - - if is_bad: - output_tokens.append(self.unk_token) - else: - output_tokens.extend(sub_tokens) - return output_tokens - - -def _is_whitespace(char): - """Checks whether `chars` is a whitespace character.""" - # \t, \n, and \r are technically control characters but we treat them - # as whitespace since they are generally considered as such. - if char == " " or char == "\t" or char == "\n" or char == "\r": - return True - cat = unicodedata.category(char) - if cat == "Zs": - return True - return False - - -def _is_control(char): - """Checks whether `chars` is a control character.""" - # These are technically control characters but we count them as whitespace - # characters. - if char == "\t" or char == "\n" or char == "\r": - return False - cat = unicodedata.category(char) - if cat in ("Cc", "Cf"): - return True - return False - - -def _is_punctuation(char): - """Checks whether `chars` is a punctuation character.""" - cp = ord(char) - # We treat all non-letter/number ASCII as punctuation. - # Characters such as "^", "$", and "`" are not in the Unicode - # Punctuation class but we treat them as punctuation anyways, for - # consistency. - if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or - (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): - return True - cat = unicodedata.category(char) - if cat.startswith("P"): - return True - return False diff --git a/models_v2/pytorch/bert_large/training/cpu/lamb.py b/models_v2/pytorch/bert_large/training/cpu/lamb.py deleted file mode 100644 index 6375d81a0..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/lamb.py +++ /dev/null @@ -1,139 +0,0 @@ -"""Lamb optimizer.""" - -import collections -import math - -import torch -from tensorboardX import SummaryWriter -from torch.optim import Optimizer - - -def log_lamb_rs(optimizer: Optimizer, event_writer: SummaryWriter, token_count: int): - """Log a histogram of trust ratio scalars in across layers.""" - results = collections.defaultdict(list) - for group in optimizer.param_groups: - for p in group['params']: - state = optimizer.state[p] - for i in ('weight_norm', 'adam_norm', 'trust_ratio'): - if i in state: - results[i].append(state[i]) - - for k, v in results.items(): - event_writer.add_histogram(f'lamb/{k}', torch.tensor(v), token_count) - -class Lamb(Optimizer): - r"""Implements Lamb algorithm. - - It has been proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_. 
- - Arguments: - params (iterable): iterable of parameters to optimize or dicts defining - parameter groups - lr (float, optional): learning rate (default: 1e-3) - betas (Tuple[float, float], optional): coefficients used for computing - running averages of gradient and its square (default: (0.9, 0.999)) - eps (float, optional): term added to the denominator to improve - numerical stability (default: 1e-8) - weight_decay (float, optional): weight decay (L2 penalty) (default: 0) - adam (bool, optional): always use trust ratio = 1, which turns this into - Adam. Useful for comparison purposes. - - .. _Large Batch Optimization for Deep Learning: Training BERT in 76 minutes: - https://arxiv.org/abs/1904.00962 - """ - - def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, - weight_decay=0, adam=False, bias_correction=True): - if not 0.0 <= lr: - raise ValueError("Invalid learning rate: {}".format(lr)) - if not 0.0 <= eps: - raise ValueError("Invalid epsilon value: {}".format(eps)) - if not 0.0 <= betas[0] < 1.0: - raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0])) - if not 0.0 <= betas[1] < 1.0: - raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1])) - defaults = dict(lr=lr, betas=betas, eps=eps, - weight_decay=weight_decay) - self.adam = adam - #self.bias_correction = bias_correction - super(Lamb, self).__init__(params, defaults) - - def step(self, closure=None): - """Performs a single optimization step. - - Arguments: - closure (callable, optional): A closure that reevaluates the model - and returns the loss. - """ - loss = None - if closure is not None: - loss = closure() - - for group in self.param_groups: - for p in group['params']: - if p.grad is None: - continue - bf16_param = p.data.dtype==torch.bfloat16 - grad = p.grad.data - data = p.data - if grad.is_sparse: - raise RuntimeError('Lamb does not support sparse gradients, consider SparseAdam instad.') - - state = self.state[p] - # State initialization - if len(state) == 0: - state['step'] = 0 - # Exponential moving average of gradient values - state['exp_avg'] = torch.zeros_like(p.data, dtype=torch.float32) - # Exponential moving average of squared gradient values - state['exp_avg_sq'] = torch.zeros_like(p.data, dtype=torch.float32) - if bf16_param: - # additional fp32 version of master weights - state['data_fp32'] = p.data.to(torch.float32) - - exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] - beta1, beta2 = group['betas'] - if bf16_param: - grad = grad.to(torch.float32) - data = state['data_fp32'] - - state['step'] += 1 - - # Decay the first and second moment running average coefficient - # m_t - exp_avg.mul_(beta1).add_(1 - beta1, grad) - # v_t - exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) - - step_size = group['lr'] - #if self.bias_correction: - # # Paper v3 does not use debiasing. - # exp_avg_hat = exp_avg / (1 - beta1 ** state['step']) - # exp_avg_sq_hat = exp_avg_sq / (1 - beta2 ** state['step']) - # Apply bias to lr to avoid broadcast. 
- #else: - exp_avg_hat = exp_avg - exp_avg_sq_hat = exp_avg_sq - - adam_step = exp_avg_hat / exp_avg_sq_hat.sqrt().add(group['eps']) - trust_ratio = 1 - if group['weight_decay'] != 0: - adam_step.add_(group['weight_decay'], data) - - weight_norm = data.pow(2).sum().sqrt() #.clamp(0, 10) - adam_norm = adam_step.pow(2).sum().sqrt() - if weight_norm == 0 or adam_norm == 0: - trust_ratio = 1 - else: - trust_ratio = weight_norm / adam_norm - #if self.adam: - # trust_ratio = 1 - state['weight_norm'] = weight_norm - state['adam_norm'] = adam_norm - state['trust_ratio'] = trust_ratio - - data.add_(-step_size * trust_ratio, adam_step) - if bf16_param: - p.data = data.to(torch.bfloat16) - - return loss diff --git a/models_v2/pytorch/bert_large/training/cpu/run_model.sh b/models_v2/pytorch/bert_large/training/cpu/run_model.sh deleted file mode 100755 index 6ff2cb133..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/run_model.sh +++ /dev/null @@ -1,358 +0,0 @@ -#!/bin/bash - -# -# Copyright (c) 2024 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -if [ "$DDP" == 'false' ]; then - echo "Running single-node training" - if [[ "$TRAINING_PHASE" == '1' ]]; then - echo "Running phase 1 training" - ARGS="--benchmark" - precision=fp32 - batch_size=${batch_size:-224} - elif [ "$TRAINING_PHASE" == '2' ]; then - echo "Running phase 2 training" - ARGS="--benchmark" - precision=fp32 - batch_size=${batch_size:-28} - else - echo "Please set TRAINING_PHASE to 1 or 2" - exit 1 - fi -elif [[ "$DDP" == 'true' ]]; then - echo "Running distributed training" - oneccl_bindings_for_pytorch_path=$(python -c "import torch; import oneccl_bindings_for_pytorch; import os; print(os.path.abspath(os.path.dirname(oneccl_bindings_for_pytorch.__file__)))") - source $oneccl_bindings_for_pytorch_path/env/setvars.sh - if [[ "$TRAINING_PHASE" == '1' ]]; then - ARGS="--benchmark" - precision=fp32 - batch_size=${batch_size:-224} - elif [[ "$TRAINING_PHASE" == '2' ]]; then - ARGS="--benchmark" - precision=fp32 - batch_size=${batch_size:-28} - else - echo "Please set TRAINING_PHASE to 1 or 2" - exit 1 - fi -else - echo "Please set DDP to true or false" - exit 1 -fi - -if [ -z "${OUTPUT_DIR}" ]; then - echo "The required environment variable OUTPUT_DIR has not been set" - exit 1 -fi - -if [ -z "${PRECISION}" ]; then - echo "The required environment variable PRECISION has not been set" - exit 1 -fi - -if [ -z "${DATASET_DIR}" ]; then - echo "The required environment variable DATASET has not been set" - exit 1 -fi - - -MODEL_DIR=${MODEL_DIR-$PWD} - -if [[ "$PRECISION" == *"avx"* ]]; then - unset DNNL_MAX_CPU_ISA -fi - -if [[ "$PRECISION" == "bf16" ]]; then - ARGS="$ARGS --bf16" - precision=bf16 - batch_size=${batch_size:-448} - echo "### running bf16 mode" -elif [[ $PRECISION == "bf32" ]]; then - echo "### running BF32 mode" - ARGS="$ARGS --bf32" - precision=bf32 -elif [[ $DDP == 'false' && $PRECISION == "fp16" ]]; then - echo "### running FP16 mode" - ARGS="$ARGS --fp16" - precision=fp16 -elif [[ $DDP == 'true' && 
$PRECISION == "fp16" ]]; then - echo "### running BF32 mode" - ARGS="$ARGS --fp16" - precision=bf32 -elif [[ $DDP == 'false' && $PRECISION == "fp8" ]]; then - echo "### running FP8 mode" - ARGS="$ARGS --fp8" - precision=fp8 -elif [[ $PRECISION == "fp32" || $PRECISION == "avx-fp32" ]]; then - echo "### running FP32 mode" - -else - echo "The specified precision '$PRECISION' is unsupported." - echo "Supported precisions for single-node training are: fp32, bf32, avx-fp32, bf16, fp8" - echo "Supported precisions for distributed training are: fp32, bf16, bf32" - exit 1 -fi - -if [ "$DDP" == 'false' ]; then - export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"; - if [[ "$TRAINING_PHASE" == '1' ]]; then - BERT_MODEL_CONFIG=${BERT_MODEL_CONFIG-~/dataset/checkpoint/config.json} - rm -rf ${OUTPUT_DIR}/throughput_log_phase1_* - rm -rf ${OUTPUT_DIR}/model_save_${PRECISION} - elif [[ "$TRAINING_PHASE" == '2' ]]; then - PRETRAINED_MODEL=${PRETRAINED_MODEL:-~/dataset/checkpoint/} - rm -rf ${OUTPUT_DIR}/throughput_log_phase2_* - fi -elif [ "$DDP" == 'true' ]; then - if [[ "$TRAINING_PHASE" == '1' ]]; then - BERT_MODEL_CONFIG=${BERT_MODEL_CONFIG-~/dataset/checkpoint/config.json} - SOCKETS=`lscpu | grep Socket | awk '{print $2}'` - NNODES=${NNODES:-1} - HOSTFILE=${HOSTFILE:-./hostfile} - rm -rf ${OUTPUT_DIR}/throughput_log_phase1_* - elif [[ "$TRAINING_PHASE" == '2' ]]; then - PRETRAINED_MODEL=${PRETRAINED_MODEL:-~/dataset/checkpoint/} - SOCKETS=`lscpu | grep Socket | awk '{print $2}'` - NNODES=${NNODES:-1} - HOSTFILE=${HOSTFILE:-./hostfile} - rm -rf ${OUTPUT_DIR}/throughput_log_phase2_* - fi -fi - -DATASET_DIR=${DATASET_DIR:-~/dataset/} -TRAIN_SCRIPT=${TRAIN_SCRIPT:-${MODEL_DIR}/run_pretrain_mlperf.py} -OUTPUT_DIR=${OUTPUT_DIR:-${PWD}} -work_space=${work_space:-${OUTPUT_DIR}} - -latency="N/A" -accuracy="N/A" -throughput="N/A" - -if [[ "$DDP" == "false" ]]; then - if [[ "$TRAINING_PHASE" == "1" ]]; then - NUM_RANKS=1 - LBS=$(( batch_size / NUM_RANKS )) - params="--train_batch_size=$LBS --learning_rate=3.5e-4 --opt_lamb_beta_1=0.9 --opt_lamb_beta_2=0.999 --warmup_proportion=0.0 --warmup_steps=0.0 --start_warmup_step=0 --max_steps=13700 --max_predictions_per_seq=76 --do_train --train_mlm_accuracy_window_size=0 --target_mlm_accuracy=0.720 --weight_decay_rate=0.01 --max_samples_termination=4500000 --eval_iter_start_samples=150000 --eval_iter_samples=150000 --eval_batch_size=16 --gradient_accumulation_steps=1 --num_samples_per_checkpoint 1 --min_samples_to_start_checkpoints 1 --log_freq 1 " - - TORCH_INDUCTOR=${TORCH_INDUCTOR:-"0"} - if [[ "0" == ${TORCH_INDUCTOR} ]];then - python -m intel_extension_for_pytorch.cpu.launch --nodes-list 0 --memory-allocator jemalloc --log_file_prefix="${OUTPUT_DIR}/throughput_log_phase1_${precision}" ${TRAIN_SCRIPT} \ - --input_dir ${DATASET_DIR}/2048_shards_uncompressed_128/ \ - --eval_dir ${DATASET_DIR}/eval_set_uncompressed/ \ - --model_type 'bert' \ - --benchmark \ - --ipex \ - --output_dir $OUTPUT_DIR/model_save_${PRECISION} \ - --dense_seq_output \ - --config_name ${BERT_MODEL_CONFIG} \ - $ARGS \ - $params 2>&1 | tee ${OUTPUT_DIR}/throughput_log_phase1_${precision}.log - else - python -m torch.backends.xeon.run_cpu --disable-numactl --node_id 0 --enable-jemalloc --log_path=${OUTPUT_DIR} ${TRAIN_SCRIPT} \ - --input_dir ${DATASET_DIR}/2048_shards_uncompressed_128/ \ - --eval_dir ${DATASET_DIR}/eval_set_uncompressed/ \ - --model_type 'bert' \ - --benchmark \ - --inductor \ - --output_dir 
$OUTPUT_DIR/model_save_${PRECISION} \ - --dense_seq_output \ - --config_name ${BERT_MODEL_CONFIG} \ - $ARGS \ - $params 2>&1 | tee ${OUTPUT_DIR}/throughput_log_phase1_${precision}.log - fi - throughput=$(grep 'Throughput:' ${OUTPUT_DIR}/throughput_log_phase1_${precision}* |sed -e 's/.*Throughput//;s/[^0-9.]//g' |awk ' - BEGIN { - sum = 0; - i = 0; - } - { - sum = sum + $1; - i++; - } - END { - sum = sum / i; - printf("%.3f", sum); - }') - echo "--------------------------------Performance Summary per NUMA Node--------------------------------" - echo ""BERT";"training phase1 throughput";${precision}; ${batch_size};${throughput}" | tee -a ${OUTPUT_DIR}/summary.log - elif [[ "$TRAINING_PHASE" == "2" ]]; then - NUM_RANKS=1 - LBS=$(( batch_size / NUM_RANKS )) - params="--train_batch_size=$LBS --learning_rate=3.5e-4 --opt_lamb_beta_1=0.9 --opt_lamb_beta_2=0.999 --warmup_proportion=0.0 --warmup_steps=0.0 --start_warmup_step=0 --max_steps=13700 --phase2 --max_predictions_per_seq=76 --do_train --skip_checkpoint --train_mlm_accuracy_window_size=0 --target_mlm_accuracy=0.720 --weight_decay_rate=0.01 --max_samples_termination=4500000 --eval_iter_start_samples=150000 --eval_iter_samples=150000 --eval_batch_size=16 --gradient_accumulation_steps=1 --log_freq=0 " - - TORCH_INDUCTOR=${TORCH_INDUCTOR:-"0"} - if [[ "0" == ${TORCH_INDUCTOR} ]];then - python -m intel_extension_for_pytorch.cpu.launch --nodes-list 0 --memory-allocator jemalloc --log_file_prefix="${OUTPUT_DIR}/throughput_log_phase2_${precision}" ${TRAIN_SCRIPT} \ - --input_dir ${DATASET_DIR}/2048_shards_uncompressed_512/ \ - --eval_dir ${DATASET_DIR}/eval_set_uncompressed/ \ - --model_type 'bert' \ - --model_name_or_path ${PRETRAINED_MODEL} \ - --benchmark \ - --ipex \ - --dense_seq_output \ - --output_dir $OUTPUT_DIR/model_save_${PRECISION} \ - $ARGS \ - $params 2>&1 | tee ${OUTPUT_DIR}/throughput_log_phase2_${precision}.log - else - python -m torch.backends.xeon.run_cpu --disable-numactl --node_id 0 --enable-jemalloc --log_path=${OUTPUT_DIR} ${TRAIN_SCRIPT} \ - --input_dir ${DATASET_DIR}/2048_shards_uncompressed_512/ \ - --eval_dir ${DATASET_DIR}/eval_set_uncompressed/ \ - --model_type 'bert' \ - --model_name_or_path ${PRETRAINED_MODEL} \ - --benchmark \ - --inductor \ - --dense_seq_output \ - --output_dir $OUTPUT_DIR/model_save_${PRECISION} \ - $ARGS \ - $params 2>&1 | tee ${OUTPUT_DIR}/throughput_log_phase2_${precision}.log - fi - throughput=$(grep 'Throughput:' ${OUTPUT_DIR}/throughput_log_phase2_${precision}* |sed -e 's/.*Throughput//;s/[^0-9.]//g' |awk ' - BEGIN { - sum = 0; - i = 0; - } - { - sum = sum + $1; - i++; - } - END { - sum = sum / i; - printf("%.3f", sum); - }') - echo "--------------------------------Performance Summary per NUMA Node--------------------------------" - echo ""BERT";"training phase2 throughput";${precision}; ${batch_size};${throughput}" | tee -a ${OUTPUT_DIR}/summary.log - fi -elif [[ "$DDP" == "true" ]]; then - if [[ "$TRAINING_PHASE" == "1" ]]; then - NUM_RANKS=$(( NNODES * SOCKETS )) - LBS=$(( batch_size / NUM_RANKS )) - params="--train_batch_size=$LBS --learning_rate=3.5e-4 --opt_lamb_beta_1=0.9 --opt_lamb_beta_2=0.999 --warmup_proportion=0.0 --warmup_steps=0.0 --start_warmup_step=0 --max_steps=13700 --max_predictions_per_seq=76 --do_train --skip_checkpoint --train_mlm_accuracy_window_size=0 --target_mlm_accuracy=0.720 --weight_decay_rate=0.01 --max_samples_termination=4500000 --eval_iter_start_samples=150000 --eval_iter_samples=150000 --eval_batch_size=16 --gradient_accumulation_steps=1 --log_freq=0 " - - # 
export FI_PROVIDER=psm3 - # export PSM3_HAL=sockets - - TORCH_INDUCTOR=${TORCH_INDUCTOR:-"0"} - if [[ "0" == ${TORCH_INDUCTOR} ]];then - python -m intel_extension_for_pytorch.cpu.launch --nnodes ${NNODES} --hostfile ${HOSTFILE} --log_dir=${OUTPUT_DIR} --log_file_prefix="./throughput_log_phase1_${precision}" ${TRAIN_SCRIPT} \ - --input_dir ${DATASET_DIR}/2048_shards_uncompressed_128/ \ - --eval_dir ${DATASET_DIR}/eval_set_uncompressed/ \ - --model_type 'bert' \ - --ipex \ - --output_dir $OUTPUT_DIR/model_save_${PRECISION} \ - --dense_seq_output \ - --config_name ${BERT_MODEL_CONFIG} \ - $ARGS \ - $params \ - 2>&1 | tee ${OUTPUT_DIR}/throughput_log_phase1_${precision}.log - else - python -m intel_extension_for_pytorch.cpu.launch --nnodes ${NNODES} --hostfile ${HOSTFILE} --log_dir=${OUTPUT_DIR} --log_file_prefix="./throughput_log_phase1_${precision}" ${TRAIN_SCRIPT} \ - --input_dir ${DATASET_DIR}/2048_shards_uncompressed_128/ \ - --eval_dir ${DATASET_DIR}/eval_set_uncompressed/ \ - --model_type 'bert' \ - --inductor \ - --output_dir $OUTPUT_DIR/model_save_${PRECISION} \ - --dense_seq_output \ - --config_name ${BERT_MODEL_CONFIG} \ - $ARGS \ - $params \ - 2>&1 | tee ${OUTPUT_DIR}/throughput_log_phase1_${precision}.log - fi - # For the summary of results - wait - throughput=$(grep 'Throughput:' ${OUTPUT_DIR}/throughput_log_phase1_${precision}* |sed -e 's/.*Throughput//;s/[^0-9.]//g' |awk ' - BEGIN { - sum = 0; - i = 0; - } - { - sum = sum + $1; - i++; - } - END { - sum = sum / i; - printf("%.3f", sum); - }') - echo ""BERT";"training phase1 distributed throughput";${precision}; ${batch_size};${throughput}" | tee -a ${OUTPUT_DIR}/summary.log - elif [[ "$TRAINING_PHASE" == "2" ]]; then - NUM_RANKS=$(( NNODES * SOCKETS )) - LBS=$(( batch_size / NUM_RANKS )) - params="--train_batch_size=$LBS --learning_rate=3.5e-4 --opt_lamb_beta_1=0.9 --opt_lamb_beta_2=0.999 --warmup_proportion=0.0 --warmup_steps=0.0 --start_warmup_step=0 --max_steps=13700 --phase2 --max_predictions_per_seq=76 --do_train --skip_checkpoint --train_mlm_accuracy_window_size=0 --target_mlm_accuracy=0.720 --weight_decay_rate=0.01 --max_samples_termination=4500000 --eval_iter_start_samples=150000 --eval_iter_samples=150000 --eval_batch_size=16 --gradient_accumulation_steps=1 --log_freq=0 " - - # export FI_PROVIDER=psm3 - # export PSM3_HAL=sockets - - TORCH_INDUCTOR=${TORCH_INDUCTOR:-"0"} - if [[ "0" == ${TORCH_INDUCTOR} ]];then - python -m intel_extension_for_pytorch.cpu.launch --nnodes ${NNODES} --hostfile ${HOSTFILE} --log_dir=${OUTPUT_DIR} --log_file_prefix="./throughput_log_phase2_${precision}" ${TRAIN_SCRIPT} \ - --input_dir ${DATASET_DIR}/2048_shards_uncompressed_512/ \ - --eval_dir ${DATASET_DIR}/eval_set_uncompressed/ \ - --model_type 'bert' \ - --ipex \ - --model_name_or_path ${PRETRAINED_MODEL} \ - --output_dir $OUTPUT_DIR/model_save_${PRECISION} \ - --dense_seq_output \ - $ARGS \ - $params \ - 2>&1 | tee ${OUTPUT_DIR}/throughput_log_phase2_${precision}.log - else - python -m intel_extension_for_pytorch.cpu.launch --nnodes ${NNODES} --hostfile ${HOSTFILE} --log_dir=${OUTPUT_DIR} --log_file_prefix="./throughput_log_phase2_${precision}" ${TRAIN_SCRIPT} \ - --input_dir ${DATASET_DIR}/2048_shards_uncompressed_512/ \ - --eval_dir ${DATASET_DIR}/eval_set_uncompressed/ \ - --model_type 'bert' \ - --inductor \ - --model_name_or_path ${PRETRAINED_MODEL} \ - --output_dir $OUTPUT_DIR/model_save_${PRECISION} \ - --dense_seq_output \ - $ARGS \ - $params \ - 2>&1 | tee ${OUTPUT_DIR}/throughput_log_phase2_${precision}.log - fi - - # For the 
summary of results - wait - throughput=$(grep 'Throughput:' ${OUTPUT_DIR}/throughput_log_phase2_${precision}* |sed -e 's/.*Throughput//;s/[^0-9.]//g' |awk ' - BEGIN { - sum = 0; - i = 0; - } - { - sum = sum + $1; - i++; - } - END { - sum = sum / i; - printf("%.3f", sum); - }') - echo ""BERT";"training phase2 distributed throughput";${precision}; ${batch_size};${throughput}" | tee -a ${OUTPUT_DIR}/summary.log - fi -fi - -yaml_content=$(cat << EOF -results: -- key : throughput - value: $throughput - unit: sentence/s -- key: latency - value: $latency - unit: s -- key: accuracy - value: $accuracy - unit: f1 -EOF -) - -echo "$yaml_content" > $OUTPUT_DIR/results.yaml -echo "YAML file created." diff --git a/models_v2/pytorch/bert_large/training/cpu/run_pretrain_mlperf.py b/models_v2/pytorch/bert_large/training/cpu/run_pretrain_mlperf.py deleted file mode 100644 index 911709d01..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/run_pretrain_mlperf.py +++ /dev/null @@ -1,984 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -# Copyright 2021 The HuggingFace Inc. team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -Fine-tuning the library models for masked language modeling (BERT, ALBERT, RoBERTa...) -on a text file or a dataset without using HuggingFace Trainer. -Here is the full list of checkpoints on the hub that can be fine-tuned by this script: -https://huggingface.co/models?filter=masked-lm -""" -# You can also adapt this script on your own mlm task. Pointers for this are left as comments. 
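The `run_model.sh` script above finishes by writing a small `results.yaml`. A minimal sketch of reading it back, assuming PyYAML is available and the file sits in the working directory (both assumptions, not part of the script):

```python
# Minimal sketch: parse the results.yaml emitted at the end of run_model.sh.
# Assumes PyYAML is installed and results.yaml is in the current directory.
import yaml

with open("results.yaml") as f:
    records = yaml.safe_load(f)["results"]        # list of {key, value, unit} entries

metrics = {r["key"]: (r["value"], r["unit"]) for r in records}
print("throughput:", *metrics["throughput"])      # e.g. throughput: 123.456 sentence/s
print("accuracy:", *metrics["accuracy"])          # "N/A" unless accuracy was measured
```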
- -"""BERT Pretraining""" - -import argparse -import csv -import h5py -import os -import glob -import numpy as np -import torch -from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Dataset -from torch.utils.data.distributed import DistributedSampler -import logging -import math -import multiprocessing -import numpy as np -import os -import random -import re -import time - -from collections import OrderedDict -from concurrent.futures import ProcessPoolExecutor -#from modeling import BertForPretraining, BertConfig -from schedulers import LinearWarmupPolyDecayScheduler - -import utils - -import torch -import torch.nn.functional as F -from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, Dataset -from torch.utils.data.distributed import DistributedSampler -import argparse -import logging -import math -import os -import random - - -import transformers -from transformers import ( - CONFIG_MAPPING, - MODEL_MAPPING, - AdamW, - AutoConfig, - AutoModelForPreTraining, - AutoTokenizer, - DataCollatorForLanguageModeling, - SchedulerType, - get_scheduler, - set_seed, -) - -from schedulers import LinearWarmUpScheduler, LinearWarmupPolyDecayScheduler -try: - if torch.__version__[:6] >= '1.12.0': - import oneccl_bindings_for_pytorch - else: - import torch_ccl - -except ImportError as e: - oneccl_bindings_for_pytorch = False - -logger = logging.getLogger(__name__) -MODEL_CONFIG_CLASSES = list(MODEL_MAPPING.keys()) -MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) - -class WorkerInitObj(object): - def __init__(self, seed): - self.seed = seed - def __call__(self, id): - np.random.seed(seed=self.seed + id) - random.seed(self.seed + id) - -def get_eval_batchsize_per_worker(args): - if torch.distributed.is_initialized(): - chunk_size = args.num_eval_examples // args.world_size - rank = args.local_rank - remainder = args.num_eval_examples % args.world_size - if rank args.num_eval_examples: - eval_data = eval_data[:args.num_eval_examples] - break - if torch.distributed.is_initialized(): - chunk_size = args.num_eval_examples // args.world_size - rank = args.local_rank - remainder = args.num_eval_examples % args.world_size - if rank 0 - -def setup_training(args): - device = torch.device("cpu") - if oneccl_bindings_for_pytorch and int(os.environ.get('PMI_SIZE', '0')) > 1: - os.environ['RANK'] = os.environ.get('PMI_RANK', '0') - os.environ['WORLD_SIZE'] = os.environ.get('PMI_SIZE', '1') - torch.distributed.init_process_group(backend="ccl") - device = torch.device("cpu") - args.local_rank = torch.distributed.get_rank() - args.world_size = torch.distributed.get_world_size() - print("##################Using CCL dist run", flush=True) - if args.gradient_accumulation_steps < 1: - raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format( - args.gradient_accumulation_steps)) - if args.train_batch_size % args.gradient_accumulation_steps != 0: - raise ValueError("Invalid gradient_accumulation_steps parameter: {}, batch size {} should be divisible".format( - args.gradient_accumulation_steps, args.train_batch_size)) - - args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps - - if not (args.do_train or (args.eval_dir and args.eval_iter_samples <= 0)): - raise ValueError(" `do_train` or should be in offline eval mode") - - if not args.resume_from_checkpoint or not os.path.exists(args.output_dir): - os.makedirs(args.output_dir, exist_ok=True) - return device, args - -def prepare_model_and_optimizer(args, 
device): - global_step = 0 - args.resume_step = 0 - checkpoint = None - # download model & vocab. - if args.config_name: - config = AutoConfig.from_pretrained(args.config_name) - elif args.model_name_or_path: - config = AutoConfig.from_pretrained(args.model_name_or_path) - else: - config = CONFIG_MAPPING[args.model_type]() - logger.warning("You are instantiating a new config instance from scratch.") - - config.dense_seq_output = args.dense_seq_output - if args.model_name_or_path: - model = AutoModelForPreTraining.from_pretrained( - args.model_name_or_path, - from_tf=bool(".ckpt" in args.model_name_or_path), - config=config, - ) - else: - logger.info("Training new model from scratch") - model = AutoModelForPreTraining.from_config(config) - ## Load from Pyt checkpoint - either given as init_checkpoint, or picked up from output_dir if found - #if args.init_checkpoint is not None or found_resume_checkpoint(args): - # # Prepare model - # #model = BertForPreTraining(config) - # model = BertForPreTrainingSegmented(config) - - # # for k,v in model.state_dict().items(): - # # print(f'model-k,len(v)={k}, {v.numel()}') - - # #model = BertForPretraining(config) - # if args.init_checkpoint is None: # finding checkpoint in output_dir - # assert False, "code path not tested with cuda graphs" - # checkpoint_str = "phase2_ckpt_*.pt" if args.phase2 else "phase1_ckpt_*.pt" - # model_names = [f for f in glob.glob(os.path.join(args.output_dir, checkpoint_str))] - # global_step = max([int(x.split('.pt')[0].split('_')[-1].strip()) for x in model_names]) - # args.resume_step = global_step #used for throughput computation - - # resume_init_checkpoint = os.path.join(args.output_dir, checkpoint_str.replace("*", str(global_step))) - # print("Setting init checkpoint to %s - which is the latest in %s" %(resume_init_checkpoint, args.output_dir)) - # checkpoint=torch.load(resume_init_checkpoint, map_location="cpu") - # else: - # checkpoint=torch.load(args.init_checkpoint, map_location="cpu")["model"] - param_optimizer = list(model.named_parameters()) - - no_decay = ['bias', 'gamma', 'beta', 'LayerNorm'] - - optimizer_grouped_parameters = [ - {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay_rate}, - {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}] - - if args.ipex: - from intel_extension_for_pytorch.optim._lamb import Lamb - optimizer = Lamb(optimizer_grouped_parameters, lr=args.learning_rate, betas=(args.opt_lamb_beta_1, args.opt_lamb_beta_2), fused=True) - else: - from lamb import Lamb - optimizer = Lamb(optimizer_grouped_parameters, lr=args.learning_rate, betas=(args.opt_lamb_beta_1, args.opt_lamb_beta_2)) - - if args.warmup_steps == 0: - warmup_steps = int(args.max_steps * args.warmup_proportion) - warmup_start = 0 - else: - warmup_steps = args.warmup_steps - warmup_start = args.start_warmup_step - - lr_scheduler = LinearWarmupPolyDecayScheduler(optimizer, start_warmup_steps=warmup_start, warmup_steps=warmup_steps, - total_steps=args.max_steps, end_learning_rate=0.0, degree=1.0) - #if found_resume_checkpoint(args): - # assert False, "code path not tested with cuda graphs" - # optimizer.load_state_dict(checkpoint['optimizer']) #restores m,v states (only if resuming checkpoint, not for init_checkpoint and init_tf_checkpoint for now) - return model, optimizer, lr_scheduler, checkpoint, global_step - -def take_optimizer_step(args, optimizer, model, overflow_buf, global_step): - global skipped_steps - 
optimizer.step() - global_step += 1 - return global_step - -def run_eval(model, eval_dataloader, device, num_eval_examples, args, first_eval=False, use_cache=False): - model.eval() - total_eval_loss, total_eval_mlm_acc = 0.0, 0.0 - total_masked = 0 - with torch.no_grad(): - for batch in eval_dataloader: - input_ids, segment_ids, input_mask, masked_lm_labels, next_sentence_labels = batch - outputs = None - if args.bf16: - with torch.cpu.amp.autocast(): - outputs = model( - input_ids=input_ids, - token_type_ids=segment_ids, - attention_mask=input_mask, - labels=masked_lm_labels, - next_sentence_label=next_sentence_labels) - else: - outputs = model( - input_ids=input_ids, - token_type_ids=segment_ids, - attention_mask=input_mask, - labels=masked_lm_labels, - next_sentence_label=next_sentence_labels) - mlm_acc, num_masked = calc_mlm_acc(outputs, masked_lm_labels, args.dense_seq_output) - total_eval_loss += outputs.loss.item() * num_masked - total_eval_mlm_acc += mlm_acc * num_masked - total_masked += num_masked - model.train() - total_masked = torch.tensor(total_masked, device=device, dtype=torch.int64) - total_eval_loss = torch.tensor(total_eval_loss, device=device, dtype=torch.float64) - if torch.distributed.is_initialized(): - #Collect total scores from all ranks - torch.distributed.all_reduce(total_eval_mlm_acc, op=torch.distributed.ReduceOp.SUM) - torch.distributed.all_reduce(total_eval_loss, op=torch.distributed.ReduceOp.SUM) - torch.distributed.all_reduce(total_masked, op=torch.distributed.ReduceOp.SUM) - - # Average by number of examples - total_eval_mlm_acc /= total_masked - total_eval_loss /= total_masked - - return total_eval_loss, total_eval_mlm_acc - -def global_batch_size(args): - return args.train_batch_size * args.gradient_accumulation_steps * args.world_size - -def calc_mlm_acc(outputs, masked_lm_labels, dense_seq_output=False): - prediction_scores = outputs.prediction_logits - masked_lm_labels_flat = masked_lm_labels.view(-1) - mlm_labels = masked_lm_labels_flat[masked_lm_labels_flat != -100] - if not dense_seq_output: - prediction_scores_flat = prediction_scores.view(-1, prediction_scores.shape[-1]) - mlm_predictions_scores = prediction_scores_flat[masked_lm_labels_flat != -100] - mlm_predictions = mlm_predictions_scores.argmax(dim=-1) - else: - mlm_predictions = prediction_scores.argmax(dim=-1) - - num_masked = mlm_labels.numel() - mlm_acc = (mlm_predictions == mlm_labels).sum(dtype=torch.float) / num_masked - - return mlm_acc, num_masked - -def calc_accuracy(outputs, masked_lm_labels, next_sentence_label, args): - loss = outputs.loss.item() - prediction_logits = outputs.prediction_logits - seq_relationship_logits = outputs.seq_relationship_logits - mlm_acc, num_masked = calc_mlm_acc(outputs, masked_lm_labels, args.dense_seq_output) - seq_acc_t = torch.argmax(seq_relationship_logits, dim=-1).eq(next_sentence_label.view([-1])).to(torch.float) - seq_acc_true, seq_tot = seq_acc_t.sum().item(), seq_acc_t.numel() - seq_acc = seq_acc_true / seq_tot - return loss, mlm_acc, num_masked, seq_acc, seq_tot - - -def main(): - args = parse_args() - if not args.ipex and not args.inductor: - print('[Info] please specify --ipex or --inductor to choose path to run, exiting...') - exit(0) - if args.ipex: - print('Using ipex') - import intel_extension_for_pytorch as ipex - from intel_extension_for_pytorch.quantization.fp8 import ( - fp8_autocast, - DelayedScaling, - Format, - prepare_fp8, - ) - status = 'aborted' # later set to 'success' if termination criteria met - device, args = 
setup_training(args) - total_batch_size = global_batch_size(args) - # Initialize the accelerator. We will let the accelerator handle device placement for us in this example. - # Make one log on every process with the configuration for debugging. - logging.basicConfig( - format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", - datefmt="%m/%d/%Y %H:%M:%S", - level=logging.INFO, - ) - if args.local_rank == 0 or args.local_rank == -1: - print("parsed args:") - print(args) - # Prepare optimizer - model, optimizer, lr_scheduler, checkpoint, global_step = prepare_model_and_optimizer(args, device) - model.train() - if args.bf32 and args.ipex: - ipex.set_fp32_math_mode(mode=ipex.FP32MathMode.BF32, device="cpu") - model, optimizer = ipex.optimize(model, dtype=torch.float32, optimizer=optimizer, auto_kernel_selection=True) - elif args.fp16 and args.ipex: - scaler = torch.cpu.amp.GradScaler() - model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.half, auto_kernel_selection=True, weights_prepack=True, fuse_update_step=False) - elif args.bf16 and args.ipex: - model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16 if args.bf16 else torch.float32) - elif args.fp8 and args.ipex: - model, optimizer = prepare_fp8(model, optimizer) - - worker_seeds, shuffling_seeds = utils.setup_seeds(args.seed, args.num_epochs_to_generate_seeds_for, device) - worker_seed = worker_seeds[args.local_rank] - - random.seed(worker_seed) - np.random.seed(worker_seed) - torch.manual_seed(worker_seed) - worker_init = WorkerInitObj(worker_seed) - samples_trained = global_step * args.train_batch_size * args.gradient_accumulation_steps * args.world_size - final_loss = float("inf") - train_time_raw = float("inf") - raw_train_start = time.time() - if args.do_train: - model.train() - most_recent_ckpts_paths = [] - average_loss = 0.0 # averaged loss every args.log_freq steps - epoch = 1 - training_steps = 0 - end_training, converged = False, False - samples_trained_prev = 0 - - # pre-compute eval boundaries - samples_trained_per_step = args.train_batch_size * args.gradient_accumulation_steps * args.world_size - start, stop, step = args.eval_iter_start_samples, args.max_samples_termination, args.eval_iter_samples - eval_steps = [math.ceil(i/samples_trained_per_step) for i in np.arange(start, stop, step)] - eval_count = 0 - next_eval_step = eval_steps[eval_count] - pool = ProcessPoolExecutor(1) - - if args.target_mlm_accuracy: - if args.train_mlm_accuracy_window_size > 0: - accuracy_scores = [] - avg_mlm_accuracy = torch.Tensor([0]) - - - first_epoch = True - if found_resume_checkpoint(args): - f_start_id = checkpoint['files'][0] - files = checkpoint['files'][1:] - num_files = len(files) - else: - files = [os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir) if - os.path.isfile(os.path.join(args.input_dir, f)) and 'part' in f] - files.sort() - num_files = len(files) - random.Random(shuffling_seeds[epoch%len(shuffling_seeds)]).shuffle(files) - f_start_id = 0 - global skipped_steps - if torch.distributed.is_initialized(): - model = torch.nn.parallel.DistributedDataParallel(model, - find_unused_parameters=True, - bucket_cap_mb=8192, - gradient_as_bucket_view=args.use_gradient_as_bucket_view) - - if args.inductor: - if args.fp8 or args.bf32: - print('[Info] torch.compile() training does not support fp8 or bf32 yet, exiting...') - exit(0) - from torch._inductor import config as inductor_config - inductor_config.cpp_wrapper = True - # 
torch._inductor.config.profiler_mark_wrapper_call = True - # torch._inductor.config.cpp.enable_kernel_profile = True - amp_dtype = torch.half if args.fp16 else torch.bfloat16 - with torch.cpu.amp.autocast(enabled=args.bf16 or args.fp16, dtype=amp_dtype): - print('[Info] Running training steps torch.compile() with default backend') - model = torch.compile(model) - - - now_step, now_skipped, skip_interval = 0, 0, 0 - # Start prefetching eval dataset - if args.eval_dir: - eval_dataset_future = pool.submit(create_eval_dataset, args, worker_init_fn=worker_init) - # comparing to number of samples in a shard. There are ~38k samples in 4096-way shard, comparing to 10k to be safe - need_next_training_shard = args.train_batch_size * args.gradient_accumulation_steps * args.max_steps > 10000 - print("Start Training.") - while global_step < args.max_steps and not end_training: - if args.local_rank == 0 or args.local_rank == -1: - now_time = time.time() - print("epoch:", epoch) - - thread = None - - # Reshuffle file list on subsequent epochs - if not first_epoch: - files = [os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir) if - os.path.isfile(os.path.join(args.input_dir, f)) and 'part' in f] - files.sort() - num_files = len(files) - random.Random(shuffling_seeds[epoch%len(shuffling_seeds)]).shuffle(files) - f_start_id = 0 - - first_epoch = False - - shared_file_list = {} - - if torch.distributed.is_initialized() and args.world_size > num_files: - remainder = args.world_size % num_files - data_file = files[(f_start_id*args.world_size + args.local_rank + - remainder * f_start_id) % num_files] - else: - data_file = files[(f_start_id*args.world_size + args.local_rank) % num_files] - - previous_file = data_file - - train_data = pretraining_dataset(data_file, args.max_predictions_per_seq) - train_sampler = RandomSampler(train_data) - train_dataloader = DataLoader(train_data, sampler=train_sampler, - batch_size=args.train_batch_size) - send_lr_in_parallel = False - lr_cpu = torch.tensor([0.0], dtype=torch.float32, device='cpu') - completed_steps=0 - bench_total_time=0 - for f_id in range(f_start_id, len(files)): - if args.world_size > num_files: - data_file = files[(f_id*args.world_size + args.local_rank + - remainder * f_id) % num_files] - else: - data_file = files[(f_id*args.world_size + args.local_rank)%num_files] - - previous_file = data_file - if need_next_training_shard: - dataset_future = pool.submit(create_pretraining_dataset, data_file, args.max_predictions_per_seq, shared_file_list, args, worker_init_fn=worker_init) - t0 = time.time() - for step, batch in enumerate(train_dataloader): - training_steps += 1 - t_beg = time.time() - t1 = time.time() - input_ids, segment_ids, input_mask, masked_lm_labels, next_sentence_labels = batch - #print(f"Input shape: {batch['input_ids'].shape}") - t2 = time.time() - outputs = None - if args.fp16: - with torch.cpu.amp.autocast(enabled=True, dtype=torch.half): - outputs = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask, - labels=masked_lm_labels, next_sentence_label=next_sentence_labels) - elif args.bf16: - with torch.cpu.amp.autocast(): - outputs = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask, - labels=masked_lm_labels, next_sentence_label=next_sentence_labels) - elif args.fp8 and args.ipex: - with fp8_autocast(enabled=True, calibrating=False, fp8_recipe=DelayedScaling(fp8_format=Format.E4M3)): - outputs = model(input_ids=input_ids, token_type_ids=segment_ids, 
attention_mask=input_mask, - labels=masked_lm_labels, next_sentence_label=next_sentence_labels) - else: #bf32 or fp32 - outputs = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask, - labels=masked_lm_labels, next_sentence_label=next_sentence_labels) - t3 = time.time() - loss = outputs.loss - loss = loss / args.gradient_accumulation_steps - if args.fp16 and args.ipex: - scaler.scale(loss).backward() - else: - loss.backward() - t4 = time.time() - if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1: - if args.fp16 and args.ipex: - scaler.step(optimizer) - scaler.update() - else: - optimizer.step() - lr_scheduler.step() - optimizer.zero_grad() - #progress_bar.update(1) - t5 = time.time() - t_end = time.time() - completed_steps += 1 - if args.benchmark and completed_steps > 10: - bench_total_time = bench_total_time + (t_end -t_beg) - if args.benchmark and completed_steps > 50: - throughput = 40 * args.train_batch_size / bench_total_time - print("Throughput: {:.3f} sentence/s".format(throughput), flush=True) - if args.profile: - print("Running profiling ...") - with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU], record_shapes=True) as p: - if args.fp16: - with torch.cpu.amp.autocast(enabled=True, dtype=torch.half): - outputs = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask, - labels=masked_lm_labels, next_sentence_label=next_sentence_labels) - elif args.bf16: - with torch.cpu.amp.autocast(): - outputs = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask, - labels=masked_lm_labels, next_sentence_label=next_sentence_labels) - elif args.fp8 and args.ipex: - with fp8_autocast(enabled=True, calibrating=False, fp8_recipe=DelayedScaling(fp8_format=Format.E4M3)): - outputs = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask, - labels=masked_lm_labels, next_sentence_label=next_sentence_labels) - else: #bf32 or fp32 - outputs = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask, - labels=masked_lm_labels, next_sentence_label=next_sentence_labels) - loss = outputs.loss - loss = loss / args.gradient_accumulation_steps - if args.fp16 and args.ipex: - scaler.scale(loss).backward() - else: - loss.backward() - if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1: - if args.fp16 and args.ipex: - scaler.step(optimizer) - scaler.update() - else: - optimizer.step() - lr_scheduler.step() - optimizer.zero_grad() - - output = p.key_averages().table(sort_by="self_cpu_time_total") - print(output) - exit() - - gloss, lm_acc, num_masked, seq_acc, seq_tot = calc_accuracy(outputs, masked_lm_labels, next_sentence_labels, args) - #if args.local_rank == 0: - print(f"Step {training_steps:5d}: loss: {gloss:6.3f} lm_acc: {lm_acc:.3f} seq_acc: {seq_acc:.3f} lbs: {args.train_batch_size} gbs: {total_batch_size} DT: {(t1-t0)*1000.0:.1f} XT: {(t2-t1)*1000.0:.1f} FT: {(t3-t2)*1000.0:.1f} BT: {(t4-t3)*1000.0:.1f} OT: {(t5-t4)*1000.0:.1f} TT: {(t5-t0)*1000.0:.1f}") - - update_step = training_steps % args.gradient_accumulation_steps == 0 - divisor = args.gradient_accumulation_steps - if args.log_freq>0: - average_loss += loss.item() - if update_step: - now_lr = optimizer.param_groups[0]['lr'] - global_step += 1 - if (args.eval_dir and args.eval_iter_samples > 0 and global_step == next_eval_step): - # on first eval, get eval_dataloader - if eval_count == 0: - eval_dataloader = 
create_eval_dataset(args, worker_init_fn=worker_init) #eval_dataset_future.result(timeout=None) - samples_trained = global_step * args.train_batch_size * args.gradient_accumulation_steps * args.world_size - samples_trained_prev = samples_trained - eval_avg_loss, eval_avg_mlm_accuracy = run_eval(model, eval_dataloader, device, args.num_eval_examples, args, - first_eval=(eval_count == 0)) - if args.local_rank == 0 or args.local_rank == -1: - print({"global_steps": global_step, "eval_loss": eval_avg_loss, "eval_mlm_accuracy":eval_avg_mlm_accuracy}) - - if args.target_mlm_accuracy: - if eval_avg_mlm_accuracy >= args.target_mlm_accuracy: - end_training, converged = True, True - if utils.is_main_process(): - print("%f > %f, Target MLM Accuracy reached at %d"%(eval_avg_mlm_accuracy, args.target_mlm_accuracy, global_step)) - - eval_count += 1 - next_eval_step = eval_steps[eval_count] - if args.target_mlm_accuracy and args.train_mlm_accuracy_window_size > 0: - accuracy_scores.append(mlm_acc) - if update_step: - accuracy_scores = accuracy_scores[-args.train_mlm_accuracy_window_size * args.gradient_accumulation_steps:] - avg_mlm_accuracy[0] = sum(accuracy_scores) / len(accuracy_scores) - torch.distributed.all_reduce(avg_mlm_accuracy, op=torch.distributed.ReduceOp.SUM) - avg_mlm_accuracy /= args.world_size - - if args.log_freq > 0 and training_steps % (args.log_freq * args.gradient_accumulation_steps) == 0: - samples_trained = global_step * args.train_batch_size * args.gradient_accumulation_steps * args.world_size - if args.local_rank == 0 or args.local_rank == -1: - time_interval = time.time() - now_time - step_interval = global_step - now_step - now_time = time.time() - now_step = global_step - training_perf = args.train_batch_size * args.gradient_accumulation_steps * args.world_size \ - * (step_interval + skip_interval) / time_interval - skip_interval = 0 - - if args.train_mlm_accuracy_window_size > 0: - print({"training_steps": training_steps, - "average_loss": average_loss / (args.log_freq * divisor), - "step_loss": loss.item() * args.gradient_accumulation_steps / divisor, - "learning_rate": now_lr, - "seq/s": training_perf, - "global_steps": now_step, - "samples_trained": samples_trained, - "skipped_steps": now_skipped, - "timestamp": now_time, - "mlm_accuracy": avg_mlm_accuracy[0].item()}) - else: - print({"training_steps": training_steps, - "average_loss": average_loss / (args.log_freq * divisor), - "step_loss": loss.item() * args.gradient_accumulation_steps / divisor, - "learning_rate": now_lr, - "seq/s": training_perf, - "global_steps": now_step, - "samples_trained": samples_trained, - "skipped_steps": now_skipped, - "timestamp": now_time}) - - - average_loss = 0 - - if global_step >= args.max_steps or end_training: - status = 'success' if converged else 'aborted' - end_training = True - train_time_raw = time.time() - raw_train_start - average_loss = torch.tensor(average_loss, dtype=torch.float32) - if args.log_freq > 0: - last_num_steps = int(training_steps / args.gradient_accumulation_steps) % args.log_freq - last_num_steps = args.log_freq if last_num_steps == 0 else last_num_steps - average_loss = average_loss / (last_num_steps * divisor) - if (torch.distributed.is_initialized()): - average_loss /= args.world_size - torch.distributed.all_reduce(average_loss) - final_loss = average_loss.item() - if utils.is_main_process(): - if args.train_mlm_accuracy_window_size > 0: - print((epoch, training_steps / args.gradient_accumulation_steps, ), {"final_loss": final_loss, - "final_mlm_accuracy": 
avg_mlm_accuracy[0].item()}) - else: - print((epoch, training_steps / args.gradient_accumulation_steps, ), {"final_loss": final_loss}) - - if end_training or (samples_trained - samples_trained_prev >= args.num_samples_per_checkpoint and samples_trained >= args.min_samples_to_start_checkpoints): - samples_trained_prev = samples_trained - if utils.is_main_process() and not args.skip_checkpoint: - # Save a trained model - model.save_pretrained(args.output_dir) - model.config.to_json_file(args.output_dir+"config.json") - - #model_to_save = model.module if hasattr(model, - # 'module') else model # Only save the model it-self - #if args.phase2: - # output_save_file = os.path.join(args.output_dir, "phase2_ckpt_{}.pt".format(samples_trained)) - #else: - # output_save_file = os.path.join(args.output_dir, "phase1_ckpt_{}.pt".format(samples_trained)) - #if args.do_train: - # torch.save({'model': model_to_save.state_dict(), - # 'optimizer': optimizer.state_dict(), - # #'master params': list(amp.master_params(optimizer)), - # 'files': [f_id] + files}, output_save_file) - # - # most_recent_ckpts_paths.append(output_save_file) - # if len(most_recent_ckpts_paths) > args.keep_n_most_recent_checkpoints: - # ckpt_to_be_removed = most_recent_ckpts_paths.pop(0) - # os.remove(ckpt_to_be_removed) - - if samples_trained >= args.max_samples_termination or end_training: - status = 'success' if converged else 'aborted' - end_training = True - break - t0 = time.time() - - del train_dataloader - - if samples_trained >= args.max_samples_termination or end_training: - status = 'success' if converged else 'aborted' - end_training = True - break - - if not need_next_training_shard: - dataset_future = pool.submit(create_pretraining_dataset, data_file, args.max_predictions_per_seq, shared_file_list, args, worker_init_fn=worker_init) - train_dataloader, data_file = dataset_future.result(timeout=None) - epoch += 1 - - return args, final_loss, train_time_raw - -if __name__ == "__main__": - main() - diff --git a/models_v2/pytorch/bert_large/training/cpu/schedulers.py b/models_v2/pytorch/bert_large/training/cpu/schedulers.py deleted file mode 100644 index dc2e7c66f..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/schedulers.py +++ /dev/null @@ -1,105 +0,0 @@ -# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved. -# Copyright 2018 The Google AI Language Team Authors and The HugginFace Inc. team. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
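For reference, the learning-rate schedule implemented by `LinearWarmupPolyDecayScheduler` below amounts to linear warmup followed by polynomial decay; roughly (ignoring the `1e-6` guard and the one-step offset in `get_lr`):

```latex
\[
\eta(t) \approx
\begin{cases}
\eta_{0}\,\dfrac{t - t_{w0}}{T_{\text{warmup}}}, & t - t_{w0} < T_{\text{warmup}},\\[1.5ex]
(\eta_{0} - \eta_{\text{end}})\left(1 - \min\!\left(\dfrac{t}{T_{\text{total}}},\,1\right)\right)^{d} + \eta_{\text{end}}, & \text{otherwise,}
\end{cases}
\]
```

where $t$ is the optimizer step count, $t_{w0}$ is `start_warmup_steps`, $T_{\text{warmup}}$ is `warmup_steps`, $T_{\text{total}}$ is `total_steps`, $\eta_0$ the base learning rate, $\eta_{\text{end}}$ the `end_learning_rate`, and $d$ the `degree` (1.0, i.e. linear decay, is what `run_pretrain_mlperf.py` above passes in).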
- -import math -import torch -from torch.optim.optimizer import Optimizer -from torch.optim.lr_scheduler import _LRScheduler - - - -class LRScheduler(_LRScheduler): - def __init__(self, optimizer, last_epoch=-1): - # Check if using mixed precision training - self.mixed_training = False - base_optimizer = optimizer - - # Check that optimizer param is valid - if not isinstance(optimizer, Optimizer): - raise TypeError('{} is not an Optimizer'.format( - type(optimizer).__name__)) - - super(LRScheduler, self).__init__(base_optimizer, last_epoch) - - def step(self, epoch=None): - # Set the current training step - # ('epoch' is used to be consistent with _LRScheduler) - if self.mixed_training: - # The assumption is that the step will be constant - state_dict = self.optimizer.state[self.optimizer.param_groups[0]['params'][0]] - if 'step' in state_dict: - self.last_epoch = state_dict['step'] + 1 - else: - self.last_epoch = 1 - else: - self.last_epoch = epoch if epoch is not None else self.last_epoch + 1 - - for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()): - param_group['lr'] = lr - - -class LinearWarmUpScheduler(LRScheduler): - """ - Applies a warm up period to the learning rate. - """ - - def __init__(self, optimizer, warmup, total_steps, last_epoch=-1): - self.warmup = warmup - self.total_steps = total_steps - super(LinearWarmUpScheduler, self).__init__(optimizer, last_epoch) - - - def get_lr(self): - progress = self.last_epoch / self.total_steps - if progress < self.warmup: - return [base_lr * progress / self.warmup for base_lr in self.base_lrs] - else: - return [base_lr * max(( progress - 1.0)/(self.warmup - 1.0), 0.) for base_lr in self.base_lrs] - - -class LinearWarmupPolyDecayScheduler(LRScheduler): - """ - Applies a warm up period to the learning rate. 
- """ - def __init__(self, optimizer, start_warmup_steps, warmup_steps, total_steps, end_learning_rate=0.0, degree=1.0, last_epoch=-1): - self.num_warmup_updates = warmup_steps - self.start_warmup_steps = start_warmup_steps - self.total_steps = total_steps - self.end_learning_rate = end_learning_rate - self.degree = degree - self.offset_step = int(self.start_warmup_steps == 0) - super(LinearWarmupPolyDecayScheduler, self).__init__(optimizer, last_epoch) - - - def step(self, epoch=None): - # Instead of optimizer.param_groups['lr'], - # update optimizer._lr to avoid sync - state_dict = self.optimizer.state[self.optimizer.param_groups[0]['params'][0]] - if 'step' in state_dict: - self.last_epoch = state_dict['step'] + 1 - else: - self.last_epoch = 1 - lr = self.get_lr() - for param_group in self.optimizer.param_groups: - param_group['lr'] = lr - - def get_lr(self): - mod_step = self.last_epoch - self.offset_step - self.start_warmup_steps - cond = mod_step < self.num_warmup_updates - progress = (cond * (mod_step / (self.num_warmup_updates + 1e-6))) + \ - ((1.0 - cond) * (min((self.last_epoch - self.offset_step) / self.total_steps, 1))) - base_lr = self.base_lrs[0] - lr = (cond * (base_lr * progress)) + \ - ((1.0 - cond) * ((base_lr - self.end_learning_rate) * (1-progress) ** self.degree + self.end_learning_rate)) - return lr diff --git a/models_v2/pytorch/bert_large/training/cpu/setup.sh b/models_v2/pytorch/bert_large/training/cpu/setup.sh deleted file mode 100755 index 1c75ce556..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/setup.sh +++ /dev/null @@ -1,42 +0,0 @@ -#!/usr/bin/env bash -# -# Copyright (c) 2023-2024 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -# Install dependency -pip install tensorboardX -pip install datasets==1.11.0 accelerate tfrecord intel-openmp faiss-cpu tfrecord -pip install h5py -pip install --upgrade huggingface_hub -pip install tensorflow-cpu protobuf==3.20.3 numpy==1.20 - -# Check the operating system type -os_type=$(awk -F= '/^NAME/{print $2}' /etc/os-release) - -# Install model specific dependencies: -if [[ "$os_name" == *"CentOS"* ]]; then - yum install -y git-lfs -elif [[ "$os_name" == *"Ubuntu"* ]]; then - apt install -y git-lfs -fi - -rm -rf transformers -git clone https://github.com/huggingface/transformers.git -cd transformers -git checkout v4.38.1 -git lfs pull -git apply ../../../../../common/enable_ipex_for_transformers.diff -pip install -e ./ -cd .. diff --git a/models_v2/pytorch/bert_large/training/cpu/utils.py b/models_v2/pytorch/bert_large/training/cpu/utils.py deleted file mode 100644 index 0ef090eb7..000000000 --- a/models_v2/pytorch/bert_large/training/cpu/utils.py +++ /dev/null @@ -1,189 +0,0 @@ -# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved. -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import torch -import torch.distributed as dist - -from contextlib import contextmanager -import logging.config -import random - -def convert_weight_names(names): - - extra_params = {"cls/predictions/bias": "cls/predictions/output_bias", - "cls/seq_relationship/kernel": "cls/seq_relationship/output_weights", - "cls/seq_relationship/bias": "cls/seq_relationship/output_bias"} - new_names = [] - for name in names: - - name = name.replace("layer.", "layer_").replace( - ".", "/").replace( - "LayerNorm/bias", "LayerNorm/beta").replace( - "LayerNorm/weight", "LayerNorm/gamma").replace( - "weight", "kernel").replace( - "embeddings/kernel", "embeddings") - if name in extra_params: - name = extra_params[name] - new_names.append(name) - return new_names - -def generate_seeds(rng, size): - """ - Generate list of random seeds - - :param rng: random number generator - :param size: length of the returned list - """ - seeds = [rng.randint(0, 2**32 - 1) for _ in range(size)] - return seeds - - -def broadcast_seeds(seeds, device): - """ - Broadcasts random seeds to all distributed workers. - Returns list of random seeds (broadcasted from workers with rank 0). - - :param seeds: list of seeds (integers) - :param device: torch.device - """ - if torch.distributed.is_available() and torch.distributed.is_initialized(): - seeds_tensor = torch.LongTensor(seeds).to(device) - torch.distributed.broadcast(seeds_tensor, 0) - seeds = seeds_tensor.tolist() - return seeds - - -def setup_seeds(master_seed, epochs, device): - """ - Generates seeds from one master_seed. - Function returns (worker_seeds, shuffling_seeds), worker_seeds are later - used to initialize per-worker random number generators (mostly for - dropouts), shuffling_seeds are for RNGs resposible for reshuffling the - dataset before each epoch. - Seeds are generated on worker with rank 0 and broadcasted to all other - workers. 
- - :param master_seed: master RNG seed used to initialize other generators - :param epochs: number of epochs - :param device: torch.device (used for distributed.broadcast) - """ - if master_seed is None: - # random master seed, random.SystemRandom() uses /dev/urandom on Unix - master_seed = random.SystemRandom().randint(0, 2**32 - 1) - if get_rank() == 0: - # master seed is reported only from rank=0 worker, it's to avoid - # confusion, seeds from rank=0 are later broadcasted to other - # workers - logging.info(f'Using random master seed: {master_seed}') - else: - # master seed was specified from command line - logging.info(f'Using master seed from command line: {master_seed}') - - # initialize seeding RNG - seeding_rng = random.Random(master_seed) - - # generate worker seeds, one seed for every distributed worker - worker_seeds = generate_seeds(seeding_rng, get_world_size()) - - # generate seeds for data shuffling, one seed for every epoch - shuffling_seeds = generate_seeds(seeding_rng, epochs) - - # broadcast seeds from rank=0 to other workers - worker_seeds = broadcast_seeds(worker_seeds, device) - shuffling_seeds = broadcast_seeds(shuffling_seeds, device) - return worker_seeds, shuffling_seeds - - -def barrier(): - """ - Works as a temporary distributed barrier, currently pytorch - doesn't implement barrier for NCCL backend. - Calls all_reduce on dummy tensor and synchronizes with GPU. - """ - if torch.distributed.is_available() and torch.distributed.is_initialized(): - torch.distributed.all_reduce(torch.cuda.FloatTensor(1)) - torch.cuda.synchronize() - - -def get_rank(): - """ - Gets distributed rank or returns zero if distributed is not initialized. - """ - if torch.distributed.is_available() and torch.distributed.is_initialized(): - rank = torch.distributed.get_rank() - else: - rank = 0 - return rank - - -def get_world_size(): - """ - Gets total number of distributed workers or returns one if distributed is - not initialized. - """ - - if torch.distributed.is_available(): - print("Torch distributed is available.") - else: - print("Torch distributed is not available.") - - if torch.distributed.is_initialized(): - print("Torch distributed is initialized.") - else: - print("Torch distributed is not initialized.") - - if torch.distributed.is_available() and torch.distributed.is_initialized(): - world_size = torch.distributed.get_world_size() - else: - world_size = 1 - return world_size - - -def set_device(cuda, local_rank): - """ - Sets device based on local_rank and returns instance of torch.device. - - :param cuda: if True: use cuda - :param local_rank: local rank of the worker - """ - if cuda: - torch.cuda.set_device(local_rank) - device = torch.device('cuda') - else: - device = torch.device('cpu') - return device - - -@contextmanager -def sync_workers(): - """ - Yields distributed rank and synchronizes all workers on exit. 
- """
- rank = get_rank()
- yield rank
- barrier()
-
-def is_main_process():
- return get_rank() == 0
-
-def format_step(step):
- if isinstance(step, str):
- return step
- s = ""
- if len(step) > 0:
- s += "Training Epoch: {} ".format(step[0])
- if len(step) > 1:
- s += "Training Iteration: {} ".format(step[1])
- if len(step) > 2:
- s += "Validation Iteration: {} ".format(step[2])
- return s
diff --git a/models_v2/pytorch/chatglm/inference/cpu/CONTAINER.md b/models_v2/pytorch/chatglm/inference/cpu/CONTAINER.md
index 97dedd6ad..0d1d11319 100644
--- a/models_v2/pytorch/chatglm/inference/cpu/CONTAINER.md
+++ b/models_v2/pytorch/chatglm/inference/cpu/CONTAINER.md
@@ -38,7 +38,7 @@ To run ChatGLM inference, set environment variables to specify the precision and
export BATCH_SIZE=
##Required
export OUTPUT_DIR=
-export PRECISION=
+export PRECISION=
export DNNL_MAX_CPU_ISA=
export INPUT_TOKEN=
export OUTPUT_TOKEN=
diff --git a/models_v2/pytorch/chatglm/inference/cpu/README.md b/models_v2/pytorch/chatglm/inference/cpu/README.md
index e2ef86a9a..2ee93a0b0 100644
--- a/models_v2/pytorch/chatglm/inference/cpu/README.md
+++ b/models_v2/pytorch/chatglm/inference/cpu/README.md
@@ -72,7 +72,7 @@ Follow [link](https://github.com/IntelAI/models/blob/master/docs/general/pytorch
|:---------------------------:|:------------------------------------------------------------------------------------:|
| **TEST_MODE** (THROUGHPUT, ACCURACY, REALTIME) | `export TEST_MODE=THROUGHPUT` |
| **OUTPUT_DIR** | `export OUTPUT_DIR=$(pwd)` |
-| **PRECISION** | `export PRECISION=bf16` (fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
+| **PRECISION** | `export PRECISION=bf16` (For Throughput with 1024/128 token sizes: fp32, bf16, fp16, int8-fp32. For Realtime with 1024/128 token sizes: bf16, fp16. For Throughput or Realtime with 2016/32 token sizes: bf16, fp16. For Accuracy: fp32, bf32, bf16, fp16, int8-fp32) |
| **MODEL_DIR** | `export MODEL_DIR=$(pwd)` |
| **INPUT_TOKEN** | `export INPUT_TOKEN=32(choice in [32 64 128 256 512 1024 2016], we prefer to benchmark on 32 and 2016)` |
| **OUTPUT_TOKEN** | `export OUTPUT_TOKEN=32(32 is preferred, while you could set any other length)` |
diff --git a/models_v2/pytorch/gptj/inference/cpu/CONTAINER.md b/models_v2/pytorch/gptj/inference/cpu/CONTAINER.md
index ad2dfc511..624736265 100644
--- a/models_v2/pytorch/gptj/inference/cpu/CONTAINER.md
+++ b/models_v2/pytorch/gptj/inference/cpu/CONTAINER.md
@@ -39,7 +39,7 @@ export BATCH_SIZE=
export TORCH_INDUCTOR=
##Required
export OUTPUT_DIR=
-export PRECISION=
+export PRECISION=
export INPUT_TOKEN=
export OUTPUT_TOKEN=
export TEST_MODE=
diff --git a/models_v2/pytorch/gptj/inference/cpu/README.md b/models_v2/pytorch/gptj/inference/cpu/README.md
index 2e4259da3..29a11ef2a 100644
--- a/models_v2/pytorch/gptj/inference/cpu/README.md
+++ b/models_v2/pytorch/gptj/inference/cpu/README.md
@@ -87,7 +87,7 @@ Follow [link](https://github.com/IntelAI/models/blob/master/docs/general/pytorch
|:---------------------------:|:------------------------------------------------------------------------------------:|
| **TEST_MODE** (THROUGHPUT, ACCURACY, REALTIME) | `export TEST_MODE=THROUGHPUT` |
| **OUTPUT_DIR** | `export OUTPUT_DIR=$(pwd)` |
-| **PRECISION** | `export PRECISION=bf16` (fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
+| **PRECISION** and **TOKENS** | `export PRECISION=bf16` (For Throughput with 1024/128 token sizes: bf32, fp16, int8-bf16. For Realtime with 1024/128 token sizes: fp32, bf32, bf16, fp16, int8-fp32. For Throughput with 2016/32 token sizes: bf16, fp16. For Realtime with 2016/32 token sizes: fp32, bf32, bf16, fp16. For Accuracy: fp32, bf32, bf16, fp16, int8-fp32) |
| **MODEL_DIR** | `export MODEL_DIR=$(pwd)` |
| **BATCH_SIZE** (optional) | `export BATCH_SIZE=256` |
diff --git a/models_v2/pytorch/llama/inference/cpu/CONTAINER.md b/models_v2/pytorch/llama/inference/cpu/CONTAINER.md
index 1379a1e39..9600e0524 100644
--- a/models_v2/pytorch/llama/inference/cpu/CONTAINER.md
+++ b/models_v2/pytorch/llama/inference/cpu/CONTAINER.md
@@ -49,9 +49,9 @@ export TORCH_INDUCTOR=0
export FINETUNED_MODEL=
##Required
export OUTPUT_DIR=
-export PRECISION=
-export INPUT_TOKEN=
-export OUTPUT_TOKEN=
+export PRECISION=
+export INPUT_TOKEN=
+export OUTPUT_TOKEN=
export TEST_MODE=
export DNNL_MAX_CPU_ISA=
DOCKER_ARGS="--rm -it"
@@ -77,9 +77,6 @@ docker run \
sh -c "$SCRIPT"
```
-> [!NOTE]
-> The container has been performance validated on fp32,bf16,fp16 and int8-fp32 precisions,`TORCH_INDUCTOR=0`, input tokens 1024 and 2016 and output tokens 128 and 32.
-
## Documentation and Sources
#### Get Started
[Docker* Repository](https://hub.docker.com/r/intel/generative-ai)
diff --git a/models_v2/pytorch/llama/inference/cpu/README.md b/models_v2/pytorch/llama/inference/cpu/README.md
index 4120f3322..25050fabc 100644
--- a/models_v2/pytorch/llama/inference/cpu/README.md
+++ b/models_v2/pytorch/llama/inference/cpu/README.md
@@ -83,9 +83,10 @@ Follow [link](/docs/general/pytorch/BareMetalSetup.md) to install and build Pyto
| **TEST_MODE** (THROUGHPUT, ACCURACY, REALTIME) | `export TEST_MODE=THROUGHPUT` |
| **OUTPUT_DIR** | `export OUTPUT_DIR=` |
| **FINETUNED_MODEL** | `#Test llama2 7b: export FINETUNED_MODEL="meta-llama/Llama-2-7b-hf"; #Test llama2 13b: export FINETUNED_MODEL="meta-llama/Llama-2-13b-hf"` |
-| **PRECISION** | `export PRECISION=bf16` (fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
-| **INPUT_TOKEN** | `export INPUT_TOKEN=32 (choice in [32 64 128 256 512 1024 2016], we prefer to benchmark on 32 and 2016)` |
-| **OUTPUT_TOKEN** | `export OUTPUT_TOKEN=32 (32 is preferred, while you could set any other length)` |
+| **PRECISION for Llama 7b** | `export PRECISION=bf16` (For Throughput with 1024/128 token sizes: fp32, bf32, fp16. For Realtime with 1024/128 token sizes: fp32, bf32, bf16, fp16. For Throughput with 2016/32 token sizes: fp32, bf16, fp16. For Realtime with 2016/32 token sizes: fp32, bf32, bf16, fp16. For Accuracy: fp32, bf32, bf16, fp16, int8-fp32) |
+| **PRECISION for Llama 13b** | `export PRECISION=bf16` (For Throughput with 1024/128 token sizes: fp32, bf16, fp16. For Realtime with 1024/128 token sizes: bf32, bf16, fp16. For Accuracy: fp32, bf32, bf16, fp16, int8-fp32)
| +| **INPUT_TOKEN** | `export INPUT_TOKEN=32 (choice in [32 64 128 256 512 1024 2016], we prefer to benchmark on 32 and 2016)` (For Llama 13b, 2016 token size isn't performant) | +| **OUTPUT_TOKEN** | `export OUTPUT_TOKEN=32 (32 is preferred, while you could set any other length)` (For Llama 13b, 32 token size isn't performant) | | **MODEL_DIR** | `export MODEL_DIR=$(pwd)` | | **BATCH_SIZE** (optional) | `export BATCH_SIZE=256` | diff --git a/models_v2/pytorch/llama/training/cpu/CONTAINER.md b/models_v2/pytorch/llama/training/cpu/CONTAINER.md deleted file mode 100644 index 6016078d0..000000000 --- a/models_v2/pytorch/llama/training/cpu/CONTAINER.md +++ /dev/null @@ -1,108 +0,0 @@ -# Running Llama2 7B Training using Intel® Extension for PyTorch* - -## Description -This document provides instructions for running Llama2 7B training using Intel® Extension for PyTorch on Intel® Xeon® Scalable Processors. - -## Pull Command - -```bash -docker pull intel/generative-ai:pytorch-cpu-llama2-training -``` - -* Set ENV for fp16 to leverage AMX if you are using a supported platform. - -```bash -export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16 -``` - -* Set ENV for int8/bf32 to leverage VNNI if you are using a supported platform. -```bash -export DNNL_MAX_CPU_ISA=AVX2_VNNI_2 -``` - -## Docker Run -(Optional) Export related proxy into docker environment. - -```bash -export DOCKER_RUN_ENVS="-e ftp_proxy=${ftp_proxy} \ - -e FTP_PROXY=${FTP_PROXY} -e http_proxy=${http_proxy} \ - -e HTTP_PROXY=${HTTP_PROXY} -e https_proxy=${https_proxy} \ - -e HTTPS_PROXY=${HTTPS_PROXY} -e no_proxy=${no_proxy} \ - -e NO_PROXY=${NO_PROXY} -e socks_proxy=${socks_proxy} \ - -e SOCKS_PROXY=${SOCKS_PROXY}" -``` - -> [!NOTE] -> To run Llama2 7B training tests, you will need to apply for access in the pages with your huggingface account: - - - LLaMA2 7B : https://huggingface.co/meta-llama/Llama-2-7b-hf - -To run Llama2 7B training, set environment variables to specify the precision and an output directory. - -Use the following instructions to download the dataset and set the environment variable `DATASET_DIR` to point to the dataset directory. - -```bash -wget https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data.json -O ${DATASET_DIR} -wget https://raw.githubusercontent.com/tloen/alpaca-lora/main/templates/alpaca.json -O ${DATASET_DIR} -``` - -```bash -##Optional -export BATCH_SIZE= -export FINETUNED_MODEL=meta-llama/Llama-2-7b-hf -##Required -export OUTPUT_DIR= -export PRECISION= -export TORCH_INDUCTOR=0 -export NNODES=1 -export DDP=False -export DNNL_MAX_CPU_ISA= -export DATASET_DIR= - -DOCKER_ARGS="--rm -it" -IMAGE_NAME=intel/generative-ai:pytorch-cpu-llama2-training -TOKEN= -SCRIPT="huggingface-cli login --token ${TOKEN} && ./run_model.sh" - -docker run \ - --cap-add 'SYS_NICE' \ - --env PRECISION=${PRECISION} \ - --env OUTPUT_DIR=${OUTPUT_DIR} \ - --env DATASET_DIR=${DATASET_DIR} \ - --env BATCH_SIZE=${BATCH_SIZE} \ - --env FINETUNED_MODEL=${FINETUNED_MODEL} - --env TORCH_INDUCTOR=${TORCH_INDUCTOR} \ - --env DNNL_MAX_CPU_ISA=${DNNL_MAX_CPU_ISA} \ - --env NNODES=${NNODES} \ - --env DDP=${DDP} \ - --volume ${OUTPUT_DIR}:${OUTPUT_DIR} \ - --volume ${DATASET_DIR}:${DATASET_DIR} \ - ${DOCKER_RUN_ENVS} \ - ${DOCKER_ARGS} \ - $IMAGE_NAME \ - sh -c "$SCRIPT" -``` - -> [!NOTE] -> The container has been validated with `TORCH_INDUCTOR=0`, and on a single node(`DDP=False`). 
- -## Documentation and Sources -#### Get Started​ -[Docker* Repository](https://hub.docker.com/r/intel/generative-ai) - - -[Main GitHub*](https://github.com/IntelAI/models) - -[Release Notes](https://github.com/IntelAI/models/releases) - -[Get Started Guide](https://github.com/IntelAI/models/blob/master/models_v2/pytorch/llama/training/cpu/CONTAINER.md) - -#### Code Sources -[Dockerfile](https://github.com/IntelAI/models/tree/master/docker/pytorch) - -[Report Issue](https://community.intel.com/t5/Intel-Optimized-AI-Frameworks/bd-p/optimized-ai-frameworks) - -## License Agreement -LEGAL NOTICE: By accessing, downloading or using this software and any required dependent software (the “Software Package”), you agree to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party software included with the Software Package. Please refer to the [license](https://github.com/IntelAI/models/tree/master/third_party) file for additional details. - -[View All Containers and Solutions 🡢](https://www.intel.com/content/www/us/en/developer/tools/software-catalog/containers.html?s=Newest) diff --git a/models_v2/pytorch/llama/training/cpu/README.md b/models_v2/pytorch/llama/training/cpu/README.md deleted file mode 100644 index 57eccb7fd..000000000 --- a/models_v2/pytorch/llama/training/cpu/README.md +++ /dev/null @@ -1,131 +0,0 @@ - -# PyTorch LLAMA2 7B lora apalca finetuning training - - -## Description - -This document has instructions for running [LLaMA2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) lora apalca finetuning using Intel-optimized PyTorch. - -## Bare Metal -### General setup - -Follow [link](/docs/general/pytorch/BareMetalSetup.md) to install and build Pytorch, IPEX, TorchVison and TCMalloc. - -### Model Specific Setup - -* Install Intel OpenMP - ``` - pip install packaging intel-openmp accelerate - ``` -* Set IOMP and tcmalloc Preload for better performance - ``` - export LD_PRELOAD="/tcmalloc/lib/libtcmalloc.so":"/lib/libiomp5.so":$LD_PRELOAD - ``` - -* Set ENV to use multi-nodes distributed training (no need for single-node multi-sockets) - -In this case, we use data-parallel distributed training and every rank will hold same model replica. The NNODES is the number of ip in the HOSTFILE. To use multi-nodes distributed training you should firstly setup the passwordless login (you can refer to [link](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/)) between these nodes. -``` -export NNODES=#your_node_number (default using 1 node) -# create your_ip_list_file, one ip per line, like (or self edit): -scontrol show hostname > ./hostfile - -export HOSTFILE=hostfile - -# [Optional] The following is needed if you have not set torch ccl and oneccl -git clone https://github.com/intel-innersource/frameworks.ai.pytorch.torch-ccl.git -cd frameworks.ai.pytorch.torch-ccl -git checkout public_master -git submodule sync -git submodule update --init --recursive -python setup.py install -cd ../ - -git clone https://github.com/oneapi-src/oneCCL.git -cd oneCCL -mkdir build -cd build -cmake .. -make -j install -source _install/env/setvars.sh -cd ../.. 
- -``` - -# Dataset - ``` - # Get the dataset here: https://github.com/tloen/alpaca-lora/blob/main/alpaca_data.json - wget https://raw.githubusercontent.com/tloen/alpaca-lora/main/alpaca_data.json - mv alpaca_data.json /models_v2/pytorch/llama/training/cpu - - # Get the dataset template here: https://github.com/tloen/alpaca-lora/blob/main/templates/alpaca.json - wget https://raw.githubusercontent.com/tloen/alpaca-lora/main/templates/alpaca.json - mkdir /models_v2/pytorch/llama/training/cpu/templates - mv alpaca.json /models_v2/pytorch/llama/training/cpu/templates - ``` - -# Training -1. `git clone https://github.com/IntelAI/models.git` -2. `cd models/models_v2/pytorch/llama/training/cpu` -3. Create virtual environment `venv` and activate it: - ``` - python3 -m venv venv - . ./venv/bin/activate - ``` -4. Run setup.sh - ``` - ./setup.sh - ``` -5. Install the latest CPU versions of [torch, torchvision and intel_extension_for_pytorch](https://intel.github.io/intel-extension-for-pytorch/index.html#installation) - -6. #[optional] you may need to get access to llama2 weights from HF - Apply the access in this page [LLaMA2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) with your huggingface account - huggingface-cli login - {your huggingface token} - -7. Setup required environment paramaters - -| **Parameter** | **export command** | -|:---------------------------:|:------------------------------------------------------------------------------------:| -| **DDP** | `export DDP=False (True or False)` | -| **OUTPUT_DIR** | `export OUTPUT_DIR=` | -| **PRECISION** | `export PRECISION=bf16` (fp32, bf32, bf16, fp16) | -| **MODEL_DIR** | `export MODEL_DIR=$(pwd)` | -| **BATCH_SIZE** (optional) | `export BATCH_SIZE=256` | -| **NNODES** (Optional) | `export NNODES=1` | - -## Output - -Single-tile output will typically looks like: - -``` -2024-05-17 22:35:31,097 - root - INFO - ---------- Summary: ---------- -2024-05-17 22:35:31,097 - root - INFO - inference-latency: 18.211 sec. -2024-05-17 22:35:31,097 - root - INFO - first-token-latency: 4.227 sec. -2024-05-17 22:35:31,097 - root - INFO - rest-token-latency: 0.110 sec. -2024-05-17 22:35:31,097 - root - INFO - P90-rest-token-latency: 0.111 sec. -2024-05-17 22:35:36,648 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;total-latency;bf16;1; 18.179000 -2024-05-17 22:35:36,655 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;first-token-latency;bf16;1; 4.238500 -2024-05-17 22:35:36,664 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;rest-token-latency;bf16;1; 0.110000 -2024-05-17 22:35:36,671 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;P90-rest-token-latency;bf16;1; 0.110500 -2024-05-17 22:35:36,678 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;token_per_sec;bf16;1; 9.110 -2024-05-17 22:35:36,686 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;first_token_thp;bf16;1; 0.236 -``` -Final results of the inference run can be found in `results.yaml` file. 
-``` -results: -- key: first token throughput - value: 15.648000 -- key: rest token throughput - value: 0.284250 -- key: first token latency - value: 4.238500 -- key: rest_token_latency - value: 0.110000 -- key: accuracy - value: 93.17 -``` - - -## License -[LICENSE](https://github.com/IntelAI/models/blob/master/LICENSE) diff --git a/models_v2/pytorch/llama/training/cpu/finetune.py b/models_v2/pytorch/llama/training/cpu/finetune.py deleted file mode 100644 index 4bd475390..000000000 --- a/models_v2/pytorch/llama/training/cpu/finetune.py +++ /dev/null @@ -1,313 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -# Copyright 2020 The HuggingFace Inc. team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import os -import sys -from typing import List, Optional - -import fire -import torch -import transformers -from datasets import load_dataset -from datasets.utils.logging import disable_progress_bar -from transformers.utils import logging as hf_logging -""" -Unused imports: -import torch.nn as nn -import bitsandbytes as bnb -""" - -from peft import ( - LoraConfig, - get_peft_model, - get_peft_model_state_dict, - set_peft_model_state_dict, -) -from transformers import LlamaForCausalLM, LlamaTokenizer - -from utils.prompter import Prompter - - -def train( - # model/data params - base_model: str = "", # the only required argument - data_path: str = "yahma/alpaca-cleaned", - output_dir: str = "./lora-alpaca", - bf16: bool = False, - fp16: bool = False, - bf32: bool = False, - ipex: bool = False, - inductor: bool = False, - ddp_backend: Optional[str] = None, - # training hyperparams - batch_size: int = 128, - max_steps: int = -1, - micro_batch_size: int = 16, - num_epochs: int = 3, - learning_rate: float = 3e-4, - cutoff_len: int = 256, - val_set_size: int = 2000, - # lora hyperparams - lora_r: int = 8, - lora_alpha: int = 16, - lora_dropout: float = 0.05, - lora_target_modules: List[str] = [ - "q_proj", - "v_proj", - ], - # llm hyperparams - train_on_inputs: bool = True, # if False, masks out inputs in loss - group_by_length: bool = False, # faster, but produces an odd training loss curve - # wandb params - wandb_project: str = "", - wandb_run_name: str = "", - wandb_watch: str = "", # options: false | gradients | all - wandb_log_model: str = "", # options: false | true - resume_from_checkpoint: str = None, # either training checkpoint or final adapter - prompt_template_name: str = "alpaca", # The prompt template to use, will default to alpaca. - disable_tqdm: bool = False, # disable tqdm if needed to avoid split log failure when ddp training outputs multiple ranks. 
-): - if int(os.environ.get("LOCAL_RANK", 0)) == 0: - print( - f"Training Alpaca-LoRA model with params:\n" - f"base_model: {base_model}\n" - f"data_path: {data_path}\n" - f"output_dir: {output_dir}\n" - f"bf16: {bf16}\n" - f"fp16: {fp16}\n" - f"bf32: {bf32}\n" - f"ipex: {ipex}\n" - f"inductor: {inductor}\n" - f"ddp_backend: {ddp_backend}\n" - f"batch_size: {batch_size}\n" - f"max_steps: {max_steps}\n" - f"micro_batch_size: {micro_batch_size}\n" - f"num_epochs: {num_epochs}\n" - f"learning_rate: {learning_rate}\n" - f"cutoff_len: {cutoff_len}\n" - f"val_set_size: {val_set_size}\n" - f"lora_r: {lora_r}\n" - f"lora_alpha: {lora_alpha}\n" - f"lora_dropout: {lora_dropout}\n" - f"lora_target_modules: {lora_target_modules}\n" - f"train_on_inputs: {train_on_inputs}\n" - f"group_by_length: {group_by_length}\n" - f"wandb_project: {wandb_project}\n" - f"wandb_run_name: {wandb_run_name}\n" - f"wandb_watch: {wandb_watch}\n" - f"wandb_log_model: {wandb_log_model}\n" - f"resume_from_checkpoint: {resume_from_checkpoint or False}\n" - f"prompt template: {prompt_template_name}\n" - f"disable tqdm: {disable_tqdm}\n" - ) - assert ( - base_model - ), "Please specify a --base_model, e.g. --base_model='decapoda-research/llama-7b-hf'" - gradient_accumulation_steps = 8 - - if disable_tqdm: - disable_progress_bar() - hf_logging.disable_progress_bar() - - prompter = Prompter(prompt_template_name) - - device_map = "auto" - world_size = int(os.environ.get("WORLD_SIZE", 1)) - - ddp = world_size != 1 - - if ddp: - gradient_accumulation_steps = gradient_accumulation_steps // world_size - - # Check if parameter passed or if set within environ - use_wandb = len(wandb_project) > 0 or ( - "WANDB_PROJECT" in os.environ and len(os.environ["WANDB_PROJECT"]) > 0 - ) - # Only overwrite environ if wandb param passed - if len(wandb_project) > 0: - os.environ["WANDB_PROJECT"] = wandb_project - if len(wandb_watch) > 0: - os.environ["WANDB_WATCH"] = wandb_watch - if len(wandb_log_model) > 0: - os.environ["WANDB_LOG_MODEL"] = wandb_log_model - - model = LlamaForCausalLM.from_pretrained( - base_model, attn_implementation="eager" - ) - - tokenizer = LlamaTokenizer.from_pretrained(base_model) - - tokenizer.pad_token_id = ( - 0 # unk. 
we want this to be different from the eos token - ) - tokenizer.padding_side = "left" # Allow batched inference - - def tokenize(prompt, add_eos_token=True): - # there's probably a way to do this with the tokenizer settings - # but again, gotta move fast - result = tokenizer( - prompt, - truncation=True, - max_length=cutoff_len, - padding=False, - return_tensors=None, - ) - if ( - result["input_ids"][-1] != tokenizer.eos_token_id - and len(result["input_ids"]) < cutoff_len - and add_eos_token - ): - result["input_ids"].append(tokenizer.eos_token_id) - result["attention_mask"].append(1) - - result["labels"] = result["input_ids"].copy() - - return result - - def generate_and_tokenize_prompt(data_point): - full_prompt = prompter.generate_prompt( - data_point["instruction"], - data_point["input"], - data_point["output"], - ) - tokenized_full_prompt = tokenize(full_prompt) - if not train_on_inputs: - user_prompt = prompter.generate_prompt( - data_point["instruction"], data_point["input"] - ) - tokenized_user_prompt = tokenize(user_prompt, add_eos_token=False) - user_prompt_len = len(tokenized_user_prompt["input_ids"]) - - tokenized_full_prompt["labels"] = [ - -100 - ] * user_prompt_len + tokenized_full_prompt["labels"][ - user_prompt_len: - ] # could be sped up, probably - return tokenized_full_prompt - - config = LoraConfig( - r=lora_r, - lora_alpha=lora_alpha, - target_modules=lora_target_modules, - lora_dropout=lora_dropout, - bias="none", - task_type="CAUSAL_LM", - ) - model = get_peft_model(model, config) - - if data_path.endswith(".json") or data_path.endswith(".jsonl"): - data = load_dataset("json", data_files=data_path) - else: - data = load_dataset(data_path) - - if resume_from_checkpoint: - # Check the available weights and load them - checkpoint_name = os.path.join( - resume_from_checkpoint, "pytorch_model.bin" - ) # Full checkpoint - if not os.path.exists(checkpoint_name): - checkpoint_name = os.path.join( - resume_from_checkpoint, "adapter_model.bin" - ) # only LoRA model - LoRA config above has to fit - resume_from_checkpoint = ( - False # So the trainer won't try loading its state - ) - # The two files above have a different name depending on how they were saved, but are actually the same. - if os.path.exists(checkpoint_name): - print(f"Restarting from {checkpoint_name}") - adapters_weights = torch.load(checkpoint_name) - model = set_peft_model_state_dict(model, adapters_weights) - else: - print(f"Checkpoint {checkpoint_name} not found") - - model.print_trainable_parameters() # Be more transparent about the % of trainable params. 
- - if val_set_size > 0: - train_val = data["train"].train_test_split( - test_size=val_set_size, shuffle=True, seed=42 - ) - train_data = ( - train_val["train"].shuffle().map(generate_and_tokenize_prompt) - ) - val_data = ( - train_val["test"].shuffle().map(generate_and_tokenize_prompt) - ) - else: - train_data = data["train"].shuffle().map(generate_and_tokenize_prompt) - val_data = None - - if not ddp and torch.cuda.device_count() > 1: - # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available - model.is_parallelizable = True - model.model_parallel = True - - trainer = transformers.Trainer( - model=model, - train_dataset=train_data, - eval_dataset=val_data, - args=transformers.TrainingArguments( - per_device_train_batch_size=micro_batch_size, - gradient_accumulation_steps=gradient_accumulation_steps, - warmup_steps=10, - num_train_epochs=num_epochs, - learning_rate=learning_rate, - bf16=bf16, - fp16_cpu=fp16, - bf32=bf32, - logging_steps=10, - optim="adamw_torch", - evaluation_strategy="steps" if val_set_size > 0 else "no", - save_strategy="steps", - eval_steps=200 if val_set_size > 0 else None, - save_steps=200, - output_dir=output_dir, - save_total_limit=3, - load_best_model_at_end=True if val_set_size > 0 else False, - ddp_find_unused_parameters=False if ddp else None, - group_by_length=group_by_length, - report_to="wandb" if use_wandb else None, - run_name=wandb_run_name if use_wandb else None, - no_cuda=True, - use_ipex=ipex, - inductor=inductor, - max_steps=max_steps, - ddp_backend=ddp_backend, - disable_tqdm=disable_tqdm, - ), - data_collator=transformers.DataCollatorForSeq2Seq( - tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True - ), - ) - model.config.use_cache = False - - old_state_dict = model.state_dict - model.state_dict = ( - lambda self, *_, **__: get_peft_model_state_dict( - self, old_state_dict() - ) - ).__get__(model, type(model)) - print("Start Training") - trainer.train(resume_from_checkpoint=resume_from_checkpoint) - print("Finish Training") - # skip since open issue for PEFT: https://github.com/tloen/alpaca-lora/issues/319 - # model.save_pretrained(output_dir) - - print( - "\n If there's a warning about missing keys above, please disregard :)" - ) - - -if __name__ == "__main__": - fire.Fire(train) diff --git a/models_v2/pytorch/llama/training/cpu/requirements.txt b/models_v2/pytorch/llama/training/cpu/requirements.txt deleted file mode 100644 index 30f844ebb..000000000 --- a/models_v2/pytorch/llama/training/cpu/requirements.txt +++ /dev/null @@ -1,11 +0,0 @@ -accelerate==0.34.1 -appdirs -bitsandbytes -black -black[jupyter] -datasets -fire -peft==0.6.2 -git+https://github.com/huggingface/transformers.git -gradio -sentencepiece diff --git a/models_v2/pytorch/llama/training/cpu/run_lora_finetune.sh b/models_v2/pytorch/llama/training/cpu/run_lora_finetune.sh deleted file mode 100644 index 75bf7df8a..000000000 --- a/models_v2/pytorch/llama/training/cpu/run_lora_finetune.sh +++ /dev/null @@ -1,105 +0,0 @@ - -#!/bin/bash - -# -# Copyright (c) 2021 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. -# - - -ARGS="" - -MAXSTEP=${MAXSTEP:-50} - -export DNNL_PRIMITIVE_CACHE_CAPACITY=1024 -#export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000" -if [ -z "${OUTPUT_DIR}" ]; then - echo "The required environment variable OUTPUT_DIR has not been set, please create the output path and set it to OUTPUT_DIR" - exit 1 -fi - -if [[ "$1" == "bf16" ]] -then - precision="bf16" - ARGS="$ARGS --bf16 " - echo "### running bf16 mode" -elif [[ "$1" == "fp32" ]] -then - echo "### running fp32 mode" -elif [[ "$1" == "fp16" ]] -then - precision=fp16 - ARGS="$ARGS --fp16 " - echo "### running fp16 mode" -elif [[ "$1" == "bf32" ]] -then - precision=bf32 - ARGS="$ARGS --bf32 " - echo "### running bf32 mode" -else - echo "The specified precision '$1' is unsupported." - echo "Supported precisions are: fp32, bf32, bf16, fp16" - exit 1 -fi - -python -m intel_extension_for_pytorch.cpu.launch --throughput-mode --memory-allocator tcmalloc --log_dir=${OUTPUT_DIR} --log_file_prefix="./training_log_${precision}_${mode}" ../../../../../../models/language_modeling/pytorch/llama/training/cpu/finetune.py $ARGS \ - --base_model 'meta-llama/Llama-2-7b-hf'\ - --data_path '../../../../../../models/language_modeling/pytorch/llama/training/cpu/alpaca_data.json' \ - --output_dir ${OUTPUT_DIR} \ - --batch_size 32 \ - --micro_batch_size 32 \ - --num_epochs 3 \ - --learning_rate 1e-4 \ - --cutoff_len 512 \ - --val_set_size 2000 \ - --lora_r 8 \ - --lora_alpha 16 \ - --lora_dropout 0.05 \ - --lora_target_modules '[q_proj,v_proj]' \ - --train_on_inputs \ - --group_by_length \ - --max_steps ${MAXSTEP} - -train_samples_per_second=($(grep -i 'train_samples_per_second' ${OUTPUT_DIR}/training_log_${precision}_${mode}* |sed -e 's/.*train_samples_per_second*//;s/[^0-9.,]//g;' | awk -F, '{print $1}' |awk ' - BEGIN { - num = 0; - sum = 0; - }{ - num ++; - sum += $1; - }END { - if(num > 0) { - printf("%.6f", sum / num); - }else { - printf("0 0"); - } - } - ')) -train_loss=($(grep -i 'train_loss' ${OUTPUT_DIR}/training_log_${precision}_${mode}* |sed -e 's/.*train_loss*//;s/[^0-9.,]//g;' | awk -F, '{print $1}' |awk ' - BEGIN { - num = 0; - sum = 0; - }{ - num ++; - sum += $1; - }END { - if(num > 0) { - printf("%.6f", sum / num); - }else { - printf("0 0"); - } - } - ')) -echo "${FINETUNED_MODEL};training throughput;"train_samples_per_second";${precision};${BATCH_SIZE}; ${train_samples_per_second} " |tee -a ${OUTPUT_DIR}/summary.log -echo "${FINETUNED_MODEL};training throughput;"train_loss";${precision};${BATCH_SIZE}; ${train_loss} " |tee -a ${OUTPUT_DIR}/summary.log diff --git a/models_v2/pytorch/llama/training/cpu/run_lora_finetune_ddp.sh b/models_v2/pytorch/llama/training/cpu/run_lora_finetune_ddp.sh deleted file mode 100644 index 154a14cf7..000000000 --- a/models_v2/pytorch/llama/training/cpu/run_lora_finetune_ddp.sh +++ /dev/null @@ -1,136 +0,0 @@ - -#!/bin/bash - -# -# Copyright (c) 2021 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. -# - - -ARGS="" - -MAXSTEP=${MAXSTEP:-50} - -export DNNL_PRIMITIVE_CACHE_CAPACITY=1024 -#export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000" -if [ -z "${OUTPUT_DIR}" ]; then - echo "The required environment variable OUTPUT_DIR has not been set, please create the output path and set it to OUTPUT_DIR" - exit 1 -fi - -if [[ "$1" == "bf16" ]] -then - precision="bf16" - ARGS="$ARGS --bf16 " - echo "### running bf16 mode" -elif [[ "$1" == "fp32" ]] -then - echo "### running fp32 mode" -elif [[ "$1" == "fp16" ]] -then - precision=fp16 - ARGS="$ARGS --fp16 " - echo "### running fp16 mode" -elif [[ "$1" == "bf32" ]] -then - precision=bf32 - ARGS="$ARGS --bf32 " - echo "### running bf32 mode" -else - echo "The specified precision '$1' is unsupported." - echo "Supported precisions are: fp32, bf32, bf16, fp16" - exit 1 -fi - -CORES=`lscpu | grep Core | awk '{print $4}'` -SOCKETS=`lscpu | grep Socket | awk '{print $2}'` -TOTAL_CORES=`expr $CORES \* $SOCKETS` -NNODES=${NNODES:-1} -HOSTFILE=${HOSTFILE:-./hostfile} -NUM_RANKS=$(( NNODES * SOCKETS )) - -CORES_PER_INSTANCE=$CORES - -export DNNL_PRIMITIVE_CACHE_CAPACITY=1024 -export KMP_BLOCKTIME=1 -export KMP_AFFINITY=granularity=fine,compact,1,0 - -<< EOF -#oneCCL settings -export CCL_WORKER_COUNT=8 -export CCL_LOG_LEVEL=info -export CCL_BF16=avx512bf -export CCL_ATL_TRANSPORT=ofi -export CCL_MNIC_COUNT=2 -export CCL_MNIC=local -export CCL_MNIC_NAME=irdma1,irdma5 -export CCL_ALLREDUCE=ring -export CCL_WORKER_COUNT=8 - -for (( i = $SOCKETS; i < 2*$SOCKETS; i++ )); do # pin CCL workers to HT - START_CORE=$(( i * CORES )) - for (( j = 0; j < $CCL_WORKER_COUNT; j++)); do - CCL_WORKER_AFFINITY="${CCL_WORKER_AFFINITY} $((START_CORE + j))" - done -done - -export CCL_WORKER_AFFINITY=`echo ${CCL_WORKER_AFFINITY} | tr " " ","` -EOF - -#DDP settings -export TORCH_CPP_LOG_LEVEL=INFO -export TORCH_DISTRIBUTED_DEBUG=INFO -export MASTER_ADDR=`head -1 hostfile` - -# Fabric settings -export FI_PROVIDER=psm3 -export PSM3_IDENTIFY=1 -export PSM3_ALLOW_ROUTERS=1 -export PSM3_RDMA=1 -export PSM3_PRINT_STATS=0 -export PSM3_RV_MR_CACHE_SIZE=8192 -export PSM3_KASSIST_MODE=none -#export PSM3_NIC='irdma* -export FI_PSM3_CONN_TIMEOUT=100 -export PSM3_HAL=sockets - - -oneccl_bindings_for_pytorch_path=$(python -c "import torch; import oneccl_bindings_for_pytorch; import os; print(os.path.abspath(os.path.dirname(oneccl_bindings_for_pytorch.__file__)))") -source $oneccl_bindings_for_pytorch_path/env/setvars.sh - -#export FI_PROVIDER_PATH=$oneccl_bindings_for_pytorch_path/lib/prov -python -m intel_extension_for_pytorch.cpu.launch \ - --memory-allocator jemalloc \ - --nnodes ${NNODES} \ - --hostfile ${HOSTFILE} \ - --logical-cores-for-ccl --ccl_worker_count 8 \ - ../../../../../../models/language_modeling/pytorch/llama/training/cpu/finetune.py $ARGS \ - --base_model 'meta-llama/Llama-2-7b-hf'\ - --data_path '../../../../../../models/language_modeling/pytorch/llama/training/cpu/alpaca_data.json' \ - --output_dir ${OUTPUT_DIR} \ - --batch_size 32 \ - --micro_batch_size 32 \ - --num_epochs 3 \ - --learning_rate 1e-4 \ - --cutoff_len 512 \ - --val_set_size 2000 \ - --lora_r 8 \ - --lora_alpha 16 \ - --lora_dropout 0.05 \ - --lora_target_modules '[q_proj,v_proj]' \ - --train_on_inputs \ - --group_by_length \ - --max_steps ${MAXSTEP} - - diff --git a/models_v2/pytorch/llama/training/cpu/run_model.sh 
b/models_v2/pytorch/llama/training/cpu/run_model.sh deleted file mode 100755 index 2fda4003c..000000000 --- a/models_v2/pytorch/llama/training/cpu/run_model.sh +++ /dev/null @@ -1,197 +0,0 @@ -#!/bin/bash - -# -# Copyright (c) 2024 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -ARGS="" -ARGS_IPEX="" - -MAXSTEP=${MAXSTEP:-50} -BATCH_SIZE=${BATCH_SIZE:-32} - -export DNNL_PRIMITIVE_CACHE_CAPACITY=1024 -if [ -z "${OUTPUT_DIR}" ]; then - echo "The required environment variable OUTPUT_DIR has not been set, please create the output path and set it to OUTPUT_DIR" - exit 1 -fi - -if [ -z "${DATASET_DIR}" ]; then - echo "The required environment variable DATASET has not been set, please create the output path and set it to DATASET" - exit 1 -fi - -if [ ! -f "${DATASET_DIR}/alpaca_data.json" ]; then - echo "Dataset path is not valid. Please download the dataset to the path." - exit 1 -fi - -mkdir -p templates && \ -cp -r ${DATASET_DIR}/alpaca.json templates - -if [[ "${DDP}" == "True" ]]; then - echo "### running with Distributed training" - CORES=`lscpu | grep Core | awk '{print $4}'` - SOCKETS=`lscpu | grep Socket | awk '{print $2}'` - TOTAL_CORES=`expr $CORES \* $SOCKETS` - NNODES=${NNODES:-1} - HOSTFILE=${HOSTFILE:-./hostfile} - NUM_RANKS=$(( NNODES * SOCKETS )) - CORES_PER_INSTANCE=$CORES - export DNNL_PRIMITIVE_CACHE_CAPACITY=1024 - export KMP_BLOCKTIME=1 - export KMP_AFFINITY=granularity=fine,compact,1,0 - - # #oneCCL settings - # export CCL_WORKER_COUNT=8 - # export CCL_LOG_LEVEL=info - # export CCL_BF16=avx512bf - # export CCL_ATL_TRANSPORT=ofi - # export CCL_MNIC_COUNT=2 - # export CCL_MNIC=local - # export CCL_MNIC_NAME=irdma1,irdma5 - # export CCL_ALLREDUCE=ring - # export CCL_WORKER_COUNT=8 - - # for (( i = $SOCKETS; i < 2*$SOCKETS; i++ )); do # pin CCL workers to HT - # START_CORE=$(( i * CORES )) - # for (( j = 0; j < $CCL_WORKER_COUNT; j++)); do - # CCL_WORKER_AFFINITY="${CCL_WORKER_AFFINITY} $((START_CORE + j))" - # done - # done - - # export CCL_WORKER_AFFINITY=`echo ${CCL_WORKER_AFFINITY} | tr " " ","` - - # #DDP settings - # export TORCH_CPP_LOG_LEVEL=INFO - # export TORCH_DISTRIBUTED_DEBUG=INFO - # export MASTER_ADDR=`head -1 hostfile` - - # # Fabric settings - # export FI_PROVIDER=psm3 - # export PSM3_IDENTIFY=1 - # export PSM3_ALLOW_ROUTERS=1 - # export PSM3_RDMA=1 - # export PSM3_PRINT_STATS=0 - # export PSM3_RV_MR_CACHE_SIZE=8192 - # export PSM3_KASSIST_MODE=none - # #export PSM3_NIC='irdma* - # export FI_PSM3_CONN_TIMEOUT=100 - # export PSM3_HAL=sockets - - oneccl_bindings_for_pytorch_path=$(python -c "import torch; import oneccl_bindings_for_pytorch; import os; print(os.path.abspath(os.path.dirname(oneccl_bindings_for_pytorch.__file__)))") - source $oneccl_bindings_for_pytorch_path/env/setvars.sh - - ARGS_IPEX="${ARGS_IPEX} --nnodes ${NNODES} --hostfile ${HOSTFILE} --logical-cores-for-ccl --ccl-worker-count 8" -else - echo "Running with Single Socket" - ARGS_IPEX="${ARGS_IPEX} --throughput-mode" -fi - -if [[ "${PRECISION}" == "bf16" ]]; -then - 
precision="bf16" - ARGS="$ARGS --bf16 " - echo "### running bf16 mode" -elif [[ "${PRECISION}" == "fp32" ]]; -then - echo "### running fp32 mode" -elif [[ "${PRECISION}" == "fp16" ]]; -then - precision=fp16 - ARGS="$ARGS --fp16 " - echo "### running fp16 mode" -elif [[ "${PRECISION}" == "bf32" ]]; -then - precision=bf32 - ARGS="$ARGS --bf32 " - echo "### running bf32 mode" -else - echo "The specified precision '${PRECISION}' is unsupported." - echo "Supported precisions are: fp32, bf32, bf16, fp16" - exit 1 -fi - -TORCH_INDUCTOR=${TORCH_INDUCTOR:-"0"} -if [[ "0" == ${TORCH_INDUCTOR} ]];then - ARGS="$ARGS --ipex " -else - ARGS="$ARGS --inductor " -fi - -python -m intel_extension_for_pytorch.cpu.launch ${ARGS_IPEX} --memory-allocator tcmalloc --log_dir=${OUTPUT_DIR} --log_file_prefix="./llama2_training_log_${precision}" finetune.py $ARGS \ - --base_model 'meta-llama/Llama-2-7b-hf'\ - --data_path ${DATASET_DIR}/alpaca_data.json \ - --output_dir ${OUTPUT_DIR} \ - --batch_size ${BATCH_SIZE} \ - --micro_batch_size ${BATCH_SIZE} \ - --num_epochs 3 \ - --learning_rate 1e-4 \ - --cutoff_len 512 \ - --val_set_size 2000 \ - --lora_r 8 \ - --lora_alpha 16 \ - --lora_dropout 0.05 \ - --lora_target_modules '[q_proj,v_proj]' \ - --train_on_inputs \ - --group_by_length \ - --max_steps ${MAXSTEP} - -train_samples_per_second=($(grep -i 'train_samples_per_second' ${OUTPUT_DIR}/llama2_training_log_${precision}* |sed -e 's/.*train_samples_per_second*//;s/[^0-9.,]//g;' | awk -F, '{print $1}' |awk ' - BEGIN { - num = 0; - sum = 0; - }{ - num ++; - sum += $1; - }END { - if(num > 0) { - printf("%.6f", sum / num); - }else { - printf("0 0"); - } - } - ')) -train_loss=($(grep -i 'train_loss' ${OUTPUT_DIR}/llama2_training_log_${precision}* |sed -e 's/.*train_loss*//;s/[^0-9.,]//g;' | awk -F, '{print $1}' |awk ' - BEGIN { - num = 0; - sum = 0; - }{ - num ++; - sum += $1; - }END { - if(num > 0) { - printf("%.6f", sum / num); - }else { - printf("0 0"); - } - } - ')) -echo "training throughput;"train_samples_per_second";${precision};${BATCH_SIZE}; ${train_samples_per_second} " |tee -a ${OUTPUT_DIR}/summary.log -echo "training throughput;"train_loss";${precision};${BATCH_SIZE}; ${train_loss} " |tee -a ${OUTPUT_DIR}/summary.log - -yaml_content=$(cat << EOF -results: -- key : throughput - value: $train_samples_per_second - unit: fps -- key: loss - value: $train_loss - unit: ms -EOF -) - -echo "$yaml_content" > $OUTPUT_DIR/results.yaml -echo "YAML file created." diff --git a/models_v2/pytorch/llama/training/cpu/setup.sh b/models_v2/pytorch/llama/training/cpu/setup.sh deleted file mode 100755 index 144a3b5f7..000000000 --- a/models_v2/pytorch/llama/training/cpu/setup.sh +++ /dev/null @@ -1,28 +0,0 @@ -#!/bin/bash - -# -# Copyright (c) 2024 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# - -pip install -r requirements.txt -pip install protobuf numpy - -# Clone the Transformers repo in the LLAMA2 inference directory -git clone https://github.com/huggingface/transformers.git -cd transformers -git checkout v4.38.1 -git apply ../../../../../common/enable_ipex_for_transformers.diff -pip install -e ./ -cd .. diff --git a/models_v2/pytorch/llama/training/cpu/utils/README.md b/models_v2/pytorch/llama/training/cpu/utils/README.md deleted file mode 100644 index ee32d9871..000000000 --- a/models_v2/pytorch/llama/training/cpu/utils/README.md +++ /dev/null @@ -1,7 +0,0 @@ -# Directory for helpers modules - -## prompter.py - -Prompter class, a template manager. - -`from utils.prompter import Prompter` diff --git a/models_v2/pytorch/llama/training/cpu/utils/__init__.py b/models_v2/pytorch/llama/training/cpu/utils/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/models_v2/pytorch/llama/training/cpu/utils/prompter.py b/models_v2/pytorch/llama/training/cpu/utils/prompter.py deleted file mode 100644 index 0915f2aa9..000000000 --- a/models_v2/pytorch/llama/training/cpu/utils/prompter.py +++ /dev/null @@ -1,69 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -# Copyright 2020 The HuggingFace Inc. team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -""" -A dedicated helper to manage templates and prompt building. -""" - -import json -import os.path as osp -import os -from typing import Union - - -class Prompter(object): - __slots__ = ("template", "_verbose") - - def __init__(self, template_name: str = "", verbose: bool = False): - self._verbose = verbose - if not template_name: - # Enforce the default here, so the constructor can be called with '' and will not break. - template_name = "alpaca" - curpath = os.path.abspath(os.path.dirname(__file__)) - file_name = osp.join(curpath + "/../templates", f"{template_name}.json") - if not osp.exists(file_name): - raise ValueError(f"Can't read {file_name}") - with open(file_name) as fp: - self.template = json.load(fp) - if self._verbose: - print( - f"Using prompt template {template_name}: {self.template['description']}" - ) - - def generate_prompt( - self, - instruction: str, - input: Union[None, str] = None, - label: Union[None, str] = None, - ) -> str: - # returns the full prompt from instruction and optional input - # if a label (=response, =output) is provided, it's also appended. 
- if input:
- res = self.template["prompt_input"].format(
- instruction=instruction, input=input
- )
- else:
- res = self.template["prompt_no_input"].format(
- instruction=instruction
- )
- if label:
- res = f"{res}{label}"
- if self._verbose:
- print(res)
- return res
-
- def get_response(self, output: str) -> str:
- return output.split(self.template["response_split"])[1].strip()
diff --git a/models_v2/pytorch/stable_diffusion/inference/cpu/CONTAINER.md b/models_v2/pytorch/stable_diffusion/inference/cpu/CONTAINER.md
index e931bf408..13ada105a 100644
--- a/models_v2/pytorch/stable_diffusion/inference/cpu/CONTAINER.md
+++ b/models_v2/pytorch/stable_diffusion/inference/cpu/CONTAINER.md
@@ -36,7 +36,7 @@ For int8 precision,prepare the model [here](./README.md#noteint8-model) and volu
```bash
export OUTPUT_DIR=
-export PRECISION=
+export PRECISION=
export DNNL_MAX_CPU_ISA=
export DATASET_DIR=
export TORCH_INDUCTOR=0
diff --git a/models_v2/pytorch/stable_diffusion/inference/cpu/README.md b/models_v2/pytorch/stable_diffusion/inference/cpu/README.md
index 597a0da5d..64b9ae8ca 100644
--- a/models_v2/pytorch/stable_diffusion/inference/cpu/README.md
+++ b/models_v2/pytorch/stable_diffusion/inference/cpu/README.md
@@ -71,7 +71,7 @@ Please get a quant_model.pt before run INT8-BF16 model or INT8-FP32 model. Pleas
| **OUTPUT_DIR** | `export OUTPUT_DIR=$(pwd)` |
| **DATASET_DIR** | `export DATASET_DIR=` |
| **MODE** | `export MODE=` |
-| **PRECISION** | `export PRECISION=bf16` (fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
+| **PRECISION** | `export PRECISION=bf16` (For Realtime and Accuracy: fp32, bf32, bf16, fp16, int8-fp32, int8-bf16. For Throughput: bf32, bf16, fp16, int8-bf16) |
| **MODEL_DIR** | `export MODEL_DIR=$(pwd)` |
| **BATCH_SIZE** (optional) | `export BATCH_SIZE=256` |
| **NNODES** (required for DISTRIBUTED) | ` export NNODES=#your_node_number` |
diff --git a/models_v2/pytorch/torchrec_dlrm/inference/cpu/CONTAINER.md b/models_v2/pytorch/torchrec_dlrm/inference/cpu/CONTAINER.md
index 68e9a6196..dadb7b2fd 100644
--- a/models_v2/pytorch/torchrec_dlrm/inference/cpu/CONTAINER.md
+++ b/models_v2/pytorch/torchrec_dlrm/inference/cpu/CONTAINER.md
@@ -44,7 +44,7 @@ export OUTPUT_DIR=
export DATASET_DIR=
export WEIGHT_DIR=
export DNNL_MAX_CPU_ISA=
-export PRECISION=
+export PRECISION=
DOCKER_ARGS="--rm -it"
IMAGE_NAME=intel/recommendation:pytorch-cpu-dlrmv2-inference
diff --git a/models_v2/pytorch/torchrec_dlrm/inference/cpu/README.md b/models_v2/pytorch/torchrec_dlrm/inference/cpu/README.md
index 786eb4373..678aff62a 100644
--- a/models_v2/pytorch/torchrec_dlrm/inference/cpu/README.md
+++ b/models_v2/pytorch/torchrec_dlrm/inference/cpu/README.md
@@ -79,7 +79,7 @@ https://github.com/mlcommons/inference/tree/master/recommendation/dlrm_v2/pytorc
| **TEST_MODE** (THROUGHPUT, ACCURACY) | `export TEST_MODE=THROUGHPUT` |
| **DATASET_DIR** | `export DATASET_DIR=` |
| **WEIGHT_DIR** (ONLY FOR ACCURACY) | `export WEIGHT_DIR=` |
-| **PRECISION** | `export PRECISION=int8 ` |
+| **PRECISION** | `export PRECISION=int8 ` |
| **OUTPUT_DIR** | `export OUTPUT_DIR=$PWD` |
| **BATCH_SIZE** (optional) | `export BATCH_SIZE=` |
diff --git a/models_v2/pytorch/vit/inference/cpu/CONTAINER.md b/models_v2/pytorch/vit/inference/cpu/CONTAINER.md
index 65304225c..15aa2a980 100644
--- a/models_v2/pytorch/vit/inference/cpu/CONTAINER.md
+++ b/models_v2/pytorch/vit/inference/cpu/CONTAINER.md
@@ -37,7 +37,7 @@ To run ViT inference, set environment variables to specify the precision and an
export BATCH_SIZE=
##Required
export
OUTPUT_DIR=
-export PRECISION=
+export PRECISION=
export DNNL_MAX_CPU_ISA=
export DUMMY_INPUT=
export DATASET_DIR=
diff --git a/models_v2/pytorch/vit/inference/cpu/README.md b/models_v2/pytorch/vit/inference/cpu/README.md
index cfb7ef49d..b2ad373e2 100644
--- a/models_v2/pytorch/vit/inference/cpu/README.md
+++ b/models_v2/pytorch/vit/inference/cpu/README.md
@@ -85,7 +85,7 @@ Vision Transformer inference best known configurations with Intel® Extension fo
| **TEST_MODE** (THROUGHPUT, ACCURACY, REALTIME) | `export TEST_MODE=THROUGHPUT` |
| **OUTPUT_DIR** | `export OUTPUT_DIR=$(pwd)` |
| **DATASET_DIR** | `export DATASET_DIR=` |
-| **PRECISION** | `export PRECISION=bf16` (fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
+| **PRECISION** | `export PRECISION=bf16` (For Realtime: fp32, int8-fp32. For Throughput: int8-bf16. For Accuracy: fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
| **MODEL_DIR** | `export MODEL_DIR=$(pwd)` |
| **BATCH_SIZE** (optional) | `export BATCH_SIZE=256` |
| **DUMMY_INPUT**(optional) | `export DUMMY_INPUT=1` (This is optional; for performance collection) |