diff --git a/notebooks/validate_and_tokenize_data.ipynb b/notebooks/validate_and_tokenize_data.ipynb index 8d974cc479..c18489dbe4 100644 --- a/notebooks/validate_and_tokenize_data.ipynb +++ b/notebooks/validate_and_tokenize_data.ipynb @@ -4,7 +4,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "f275a21b-47d4-472c-972b-e2a84a597db2", "showTitle": false, @@ -12,7 +15,7 @@ } }, "source": [ - "# FM FT API: Validation and Cost Estimation\n", + "# FM FT API: Data Validation and \\$Token Estimation\n", "\n", "#### Usage Scenario:\n", "This notebook goes hand-in-hand with Databricks-Mosaicml's FT API. Our customers may find it useful in scenarios where there is a risk of data being malformed. It acts as a preventive measure to ensure data integrity and helps in cost assessment for the fine-tuning process.\n", @@ -55,7 +58,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "3d08a21c-9f5a-4ad2-af85-e016335cc53d", "showTitle": false, @@ -81,7 +87,16 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001B[43mNote: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.\u001B[0m\nFound existing installation: llm-foundry 0.4.0\nUninstalling llm-foundry-0.4.0:\n Successfully uninstalled llm-foundry-0.4.0\n\u001B[43mNote: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.\u001B[0m\n" + ] + } + ], "source": [ "%pip uninstall -y llm-foundry" ] @@ -121,7 +136,16 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "output_type": "stream", + "text": 
[ + "\u001B[43mNote: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.\u001B[0m\nCollecting git+https://github.com/XiaohanZhangCMU/llm-foundryX.git@validation\n Cloning https://github.com/XiaohanZhangCMU/llm-foundryX.git (to revision validation) to /tmp/pip-req-build-7cezyx2d\n Running command git clone --filter=blob:none --quiet https://github.com/XiaohanZhangCMU/llm-foundryX.git /tmp/pip-req-build-7cezyx2d\n Running command git checkout -b validation --track origin/validation\n Switched to a new branch 'validation'\n Branch 'validation' set up to track remote branch 'validation' from 'origin'.\n Resolved https://github.com/XiaohanZhangCMU/llm-foundryX.git to commit 99bf2cd5dae9350ea8b5e2223b50fa0b74a5e281\n Installing build dependencies: started\n Installing build dependencies: finished with status 'done'\n Getting requirements to build wheel: started\n Getting requirements to build wheel: finished with status 'done'\n Installing backend dependencies: started\n Installing backend dependencies: finished with status 'done'\n Preparing metadata (pyproject.toml): started\n Preparing metadata (pyproject.toml): finished with status 'done'\nCollecting triton-pre-mlir@ git+https://github.com/vchiley/triton.git@triton_pre_mlir_sm90#subdirectory=python\n Cloning https://github.com/vchiley/triton.git (to revision triton_pre_mlir_sm90) to /tmp/pip-install-e6ai60pd/triton-pre-mlir_ec5a6924bc22431094e1ef7032fc1749\n Running command git clone --filter=blob:none --quiet https://github.com/vchiley/triton.git /tmp/pip-install-e6ai60pd/triton-pre-mlir_ec5a6924bc22431094e1ef7032fc1749\n Running command git checkout -b triton_pre_mlir_sm90 --track origin/triton_pre_mlir_sm90\n Switched to a new branch 'triton_pre_mlir_sm90'\n Branch 'triton_pre_mlir_sm90' set up to track remote branch 'triton_pre_mlir_sm90' from 'origin'.\n Resolved https://github.com/vchiley/triton.git to commit 86c7fe23397467ade531513291f729c12dd8d15e\n Running 
command git submodule update --init --recursive -q\n Preparing metadata (setup.py): started\n Preparing metadata (setup.py): finished with status 'done'\nRequirement already satisfied: datasets==2.15.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (2.15.0)\nRequirement already satisfied: slack-sdk<4 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (3.26.2)\nRequirement already satisfied: torch<2.1.1,>=2.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (2.1.0)\nRequirement already satisfied: transformers<4.37,>=4.36 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (4.36.2)\nRequirement already satisfied: fsspec==2023.6.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (2023.6.0)\nRequirement already satisfied: beautifulsoup4<5,>=4.12.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (4.12.2)\nRequirement already satisfied: dask[distributed]>=2023.11.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (2023.12.1)\nRequirement already satisfied: cmake<=3.26.3,>=3.25.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (3.26.3)\nRequirement already satisfied: mosaicml-cli<1,>=0.5.27 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (0.5.34)\nRequirement already satisfied: 
tenacity<9,>=8.2.3 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (8.2.3)\nRequirement already satisfied: huggingface-hub<1.0,>=0.17.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (0.20.2)\nRequirement already satisfied: mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (0.17.2)\nRequirement already satisfied: einops==0.7.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (0.7.0)\nRequirement already satisfied: accelerate<0.26,>=0.25 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (0.25.0)\nRequirement already satisfied: boto3<2,>=1.21.45 in /databricks/python3/lib/python3.10/site-packages (from llm-foundry==0.4.0) (1.24.28)\nRequirement already satisfied: mosaicml-streaming<0.8,>=0.7.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (0.7.3)\nRequirement already satisfied: onnx==1.14.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (1.14.0)\nRequirement already satisfied: onnxruntime==1.15.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (1.15.1)\nRequirement already satisfied: omegaconf<3,>=2.2.3 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (2.3.0)\nRequirement already satisfied: sentencepiece==0.1.97 in 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from llm-foundry==0.4.0) (0.1.97)\nRequirement already satisfied: dill<0.3.8,>=0.3.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (0.3.7)\nRequirement already satisfied: aiohttp in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (3.9.1)\nRequirement already satisfied: xxhash in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (3.4.1)\nRequirement already satisfied: pyarrow-hotfix in /databricks/python3/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (0.5)\nRequirement already satisfied: numpy>=1.17 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (1.24.4)\nRequirement already satisfied: pyyaml>=5.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (6.0.1)\nRequirement already satisfied: packaging in /databricks/python3/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (21.3)\nRequirement already satisfied: tqdm>=4.62.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (4.66.1)\nRequirement already satisfied: multiprocess in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (0.70.15)\nRequirement already satisfied: pandas in /databricks/python3/lib/python3.10/site-packages (from 
datasets==2.15.0->llm-foundry==0.4.0) (1.4.4)\nRequirement already satisfied: pyarrow>=8.0.0 in /databricks/python3/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (8.0.0)\nRequirement already satisfied: requests>=2.19.0 in /databricks/python3/lib/python3.10/site-packages (from datasets==2.15.0->llm-foundry==0.4.0) (2.28.1)\nRequirement already satisfied: protobuf>=3.20.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from onnx==1.14.0->llm-foundry==0.4.0) (4.25.2)\nRequirement already satisfied: typing-extensions>=3.6.2.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from onnx==1.14.0->llm-foundry==0.4.0) (4.9.0)\nRequirement already satisfied: sympy in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from onnxruntime==1.15.1->llm-foundry==0.4.0) (1.12)\nRequirement already satisfied: flatbuffers in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from onnxruntime==1.15.1->llm-foundry==0.4.0) (23.5.26)\nRequirement already satisfied: coloredlogs in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from onnxruntime==1.15.1->llm-foundry==0.4.0) (15.0.1)\nRequirement already satisfied: psutil in /databricks/python3/lib/python3.10/site-packages (from accelerate<0.26,>=0.25->llm-foundry==0.4.0) (5.9.0)\nRequirement already satisfied: safetensors>=0.3.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from accelerate<0.26,>=0.25->llm-foundry==0.4.0) (0.4.1)\nRequirement already satisfied: soupsieve>1.2 in /databricks/python3/lib/python3.10/site-packages (from beautifulsoup4<5,>=4.12.2->llm-foundry==0.4.0) (2.3.1)\nRequirement already satisfied: jmespath<2.0.0,>=0.7.1 in 
/databricks/python3/lib/python3.10/site-packages (from boto3<2,>=1.21.45->llm-foundry==0.4.0) (0.10.0)\nRequirement already satisfied: botocore<1.28.0,>=1.27.28 in /databricks/python3/lib/python3.10/site-packages (from boto3<2,>=1.21.45->llm-foundry==0.4.0) (1.27.28)\nRequirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /databricks/python3/lib/python3.10/site-packages (from boto3<2,>=1.21.45->llm-foundry==0.4.0) (0.6.0)\nRequirement already satisfied: importlib-metadata>=4.13.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (6.11.0)\nRequirement already satisfied: toolz>=0.10.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (0.12.0)\nRequirement already satisfied: click>=8.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (8.1.7)\nRequirement already satisfied: cloudpickle>=1.5.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (3.0.0)\nRequirement already satisfied: partd>=1.2.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (1.4.1)\nRequirement already satisfied: distributed==2023.12.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (2023.12.1)\nRequirement already satisfied: urllib3>=1.24.3 in /databricks/python3/lib/python3.10/site-packages (from distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (1.26.11)\nRequirement 
already satisfied: sortedcontainers>=2.0.5 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (2.4.0)\nRequirement already satisfied: jinja2>=2.10.3 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (3.1.3)\nRequirement already satisfied: tblib>=1.6.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (3.0.0)\nRequirement already satisfied: tornado>=6.0.4 in /databricks/python3/lib/python3.10/site-packages (from distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (6.1)\nRequirement already satisfied: zict>=3.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (3.0.0)\nRequirement already satisfied: locket>=1.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (1.0.0)\nRequirement already satisfied: msgpack>=1.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (1.0.7)\nRequirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.17.0->llm-foundry==0.4.0) (3.12.2)\nRequirement already satisfied: rich>=12.6.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from 
mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (13.7.0)\nRequirement already satisfied: prompt-toolkit>=3.0.29 in /databricks/python3/lib/python3.10/site-packages (from mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (3.0.36)\nRequirement already satisfied: arrow>=1.2.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (1.3.0)\nRequirement already satisfied: validators>=0.20.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (0.22.0)\nRequirement already satisfied: gql[websockets]>=3.4.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (3.5.0)\nRequirement already satisfied: ruamel.yaml>=0.17.21 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (0.18.5)\nRequirement already satisfied: argcomplete>=2.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (3.2.1)\nRequirement already satisfied: backoff>=2.2.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (2.2.1)\nRequirement already satisfied: questionary>=1.10.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (2.0.1)\nRequirement already satisfied: azure-storage-blob<13,>=12.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from 
mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (12.19.0)\nRequirement already satisfied: matplotlib<4,>=3.5.2 in /databricks/python3/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (3.5.2)\nRequirement already satisfied: paramiko<4,>=2.11.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (3.4.0)\nRequirement already satisfied: google-cloud-storage<2.11.0,>=2.9.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2.10.0)\nRequirement already satisfied: python-snappy<1,>=0.6.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (0.6.1)\nRequirement already satisfied: azure-storage-file-datalake<13,>=12.11.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (12.14.0)\nRequirement already satisfied: zstd<2,>=1.5.2.5 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.5.5.1)\nRequirement already satisfied: azure-identity>=1.13.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.15.0)\nRequirement already satisfied: Brotli>=1.0.9 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.1.0)\nRequirement already satisfied: torchvision>=0.10 in 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (0.16.0)\nRequirement already satisfied: oci<3,>=2.88 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2.118.2)\nRequirement already satisfied: torch-optimizer<0.4,>=0.3.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.3.0)\nRequirement already satisfied: py-cpuinfo<10,>=8.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (9.0.0)\nRequirement already satisfied: torchmetrics<1.1,>=0.10.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.0.3)\nRequirement already satisfied: tabulate==0.9.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.9.0)\nRequirement already satisfied: coolname<3,>=1.1.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (2.2.0)\nRequirement already satisfied: mlflow<3.0,>=2.8.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (2.9.2)\nRequirement already satisfied: wandb<0.17,>=0.13.2 in 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.16.2)\nRequirement already satisfied: apache-libcloud<4,>=3.3.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (3.8.0)\nRequirement already satisfied: antlr4-python3-runtime==4.9.* in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from omegaconf<3,>=2.2.3->llm-foundry==0.4.0) (4.9.3)\nRequirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (11.0.2.54)\nRequirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (12.1.105)\nRequirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (12.1.3.1)\nRequirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (12.1.0.106)\nRequirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (12.1.105)\nRequirement already satisfied: networkx in 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (3.2.1)\nRequirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (12.1.105)\nRequirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (8.9.2.26)\nRequirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (12.1.105)\nRequirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (10.3.2.106)\nRequirement already satisfied: nvidia-nccl-cu12==2.18.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (2.18.1)\nRequirement already satisfied: triton==2.1.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (2.1.0)\nRequirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch<2.1.1,>=2.1->llm-foundry==0.4.0) (11.4.5.107)\nRequirement already satisfied: nvidia-nvjitlink-cu12 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch<2.1.1,>=2.1->llm-foundry==0.4.0) 
(12.3.101)\nRequirement already satisfied: tokenizers<0.19,>=0.14 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from transformers<4.37,>=4.36->llm-foundry==0.4.0) (0.15.0)\nRequirement already satisfied: regex!=2019.12.17 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from transformers<4.37,>=4.36->llm-foundry==0.4.0) (2023.12.25)\nRequirement already satisfied: python-dateutil>=2.7.0 in /databricks/python3/lib/python3.10/site-packages (from arrow>=1.2.2->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (2.8.2)\nRequirement already satisfied: types-python-dateutil>=2.8.10 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from arrow>=1.2.2->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (2.8.19.20240106)\nRequirement already satisfied: azure-core<2.0.0,>=1.23.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from azure-identity>=1.13.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.29.6)\nRequirement already satisfied: msal<2.0.0,>=1.24.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from azure-identity>=1.13.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.26.0)\nRequirement already satisfied: msal-extensions<2.0.0,>=0.3.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from azure-identity>=1.13.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.1.0)\nRequirement already satisfied: cryptography>=2.5 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from azure-identity>=1.13.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (41.0.7)\nRequirement already satisfied: isodate>=0.6.1 in 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from azure-storage-blob<13,>=12.0.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (0.6.1)\nRequirement already satisfied: async-timeout<5.0,>=4.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from aiohttp->datasets==2.15.0->llm-foundry==0.4.0) (4.0.3)\nRequirement already satisfied: aiosignal>=1.1.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from aiohttp->datasets==2.15.0->llm-foundry==0.4.0) (1.3.1)\nRequirement already satisfied: frozenlist>=1.1.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from aiohttp->datasets==2.15.0->llm-foundry==0.4.0) (1.4.1)\nRequirement already satisfied: attrs>=17.3.0 in /databricks/python3/lib/python3.10/site-packages (from aiohttp->datasets==2.15.0->llm-foundry==0.4.0) (21.4.0)\nRequirement already satisfied: multidict<7.0,>=4.5 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from aiohttp->datasets==2.15.0->llm-foundry==0.4.0) (6.0.4)\nRequirement already satisfied: yarl<2.0,>=1.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from aiohttp->datasets==2.15.0->llm-foundry==0.4.0) (1.9.4)\nRequirement already satisfied: google-auth<3.0dev,>=1.25.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2.26.2)\nRequirement already satisfied: google-cloud-core<3.0dev,>=2.3.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from 
google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2.4.1)\nRequirement already satisfied: google-resumable-media>=2.3.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2.7.0)\nRequirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2.15.0)\nRequirement already satisfied: anyio<5,>=3.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from gql[websockets]>=3.4.0->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (4.2.0)\nRequirement already satisfied: graphql-core<3.3,>=3.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from gql[websockets]>=3.4.0->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (3.2.3)\nRequirement already satisfied: websockets<12,>=10 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from gql[websockets]>=3.4.0->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (11.0.3)\nRequirement already satisfied: zipp>=0.5 in /usr/lib/python3/dist-packages (from importlib-metadata>=4.13.0->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (1.0.0)\nRequirement already satisfied: MarkupSafe>=2.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from jinja2>=2.10.3->distributed==2023.12.1->dask[distributed]>=2023.11.0->llm-foundry==0.4.0) (2.1.3)\nRequirement already satisfied: pyparsing>=2.2.1 in /databricks/python3/lib/python3.10/site-packages (from 
matplotlib<4,>=3.5.2->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (3.0.9)\nRequirement already satisfied: fonttools>=4.22.0 in /databricks/python3/lib/python3.10/site-packages (from matplotlib<4,>=3.5.2->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (4.25.0)\nRequirement already satisfied: cycler>=0.10 in /databricks/python3/lib/python3.10/site-packages (from matplotlib<4,>=3.5.2->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (0.11.0)\nRequirement already satisfied: kiwisolver>=1.0.1 in /databricks/python3/lib/python3.10/site-packages (from matplotlib<4,>=3.5.2->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.4.2)\nRequirement already satisfied: pillow>=6.2.0 in /databricks/python3/lib/python3.10/site-packages (from matplotlib<4,>=3.5.2->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (9.2.0)\nRequirement already satisfied: entrypoints<1 in /databricks/python3/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.4)\nRequirement already satisfied: querystring-parser<2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.2.4)\nRequirement already satisfied: markdown<4,>=3.3 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (3.5.2)\nRequirement already satisfied: scipy<2 in /databricks/python3/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.9.1)\nRequirement already satisfied: sqlalchemy<3,>=1.4.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from 
mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (2.0.25)\nRequirement already satisfied: docker<7,>=4.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (6.1.3)\nRequirement already satisfied: sqlparse<1,>=0.4.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.4.4)\nRequirement already satisfied: gitpython<4,>=2.1.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (3.1.41)\nRequirement already satisfied: scikit-learn<2 in /databricks/python3/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.1.1)\nRequirement already satisfied: databricks-cli<1,>=0.8.7 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.18.0)\nRequirement already satisfied: gunicorn<22 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (21.2.0)\nRequirement already satisfied: alembic!=1.10.0,<2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.13.1)\nRequirement already satisfied: Flask<4 in 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (3.0.0)\nRequirement already satisfied: pytz<2024 in /databricks/python3/lib/python3.10/site-packages (from mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (2022.1)\nRequirement already satisfied: certifi in /databricks/python3/lib/python3.10/site-packages (from oci<3,>=2.88->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2022.9.14)\nRequirement already satisfied: pyOpenSSL<24.0.0,>=17.5.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from oci<3,>=2.88->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (23.3.0)\nRequirement already satisfied: circuitbreaker<2.0.0,>=1.3.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from oci<3,>=2.88->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.4.0)\nRequirement already satisfied: pynacl>=1.5 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from paramiko<4,>=2.11.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.5.0)\nRequirement already satisfied: bcrypt>=3.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from paramiko<4,>=2.11.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (4.1.2)\nRequirement already satisfied: wcwidth in /databricks/python3/lib/python3.10/site-packages (from prompt-toolkit>=3.0.29->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (0.2.5)\nRequirement already satisfied: charset-normalizer<3,>=2 in /databricks/python3/lib/python3.10/site-packages (from requests>=2.19.0->datasets==2.15.0->llm-foundry==0.4.0) (2.0.4)\nRequirement already satisfied: idna<4,>=2.5 in 
/databricks/python3/lib/python3.10/site-packages (from requests>=2.19.0->datasets==2.15.0->llm-foundry==0.4.0) (3.3)\nRequirement already satisfied: pygments<3.0.0,>=2.13.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from rich>=12.6.0->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (2.17.2)\nRequirement already satisfied: markdown-it-py>=2.2.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from rich>=12.6.0->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (3.0.0)\nRequirement already satisfied: ruamel.yaml.clib>=0.2.7 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from ruamel.yaml>=0.17.21->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (0.2.8)\nRequirement already satisfied: pytorch-ranger>=0.1.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torch-optimizer<0.4,>=0.3.0->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.1.1)\nRequirement already satisfied: lightning-utilities>=0.7.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from torchmetrics<1.1,>=0.10.0->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.10.0)\nRequirement already satisfied: setproctitle in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from wandb<0.17,>=0.13.2->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.3.3)\nRequirement already satisfied: setuptools in /databricks/python3/lib/python3.10/site-packages (from wandb<0.17,>=0.13.2->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (63.4.1)\nRequirement already satisfied: sentry-sdk>=1.0.0 in 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from wandb<0.17,>=0.13.2->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.39.2)\nRequirement already satisfied: docker-pycreds>=0.4.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from wandb<0.17,>=0.13.2->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (0.4.0)\nRequirement already satisfied: appdirs>=1.4.3 in /databricks/python3/lib/python3.10/site-packages (from wandb<0.17,>=0.13.2->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.4.4)\nRequirement already satisfied: humanfriendly>=9.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from coloredlogs->onnxruntime==1.15.1->llm-foundry==0.4.0) (10.0)\nRequirement already satisfied: mpmath>=0.19 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from sympy->onnxruntime==1.15.1->llm-foundry==0.4.0) (1.3.0)\nRequirement already satisfied: Mako in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from alembic!=1.10.0,<2->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.3.0)\nRequirement already satisfied: exceptiongroup>=1.0.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from anyio<5,>=3.0->gql[websockets]>=3.4.0->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (1.2.0)\nRequirement already satisfied: sniffio>=1.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from anyio<5,>=3.0->gql[websockets]>=3.4.0->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (1.3.0)\nRequirement already satisfied: six>=1.11.0 in 
/usr/lib/python3/dist-packages (from azure-core<2.0.0,>=1.23.0->azure-identity>=1.13.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.16.0)\nRequirement already satisfied: cffi>=1.12 in /databricks/python3/lib/python3.10/site-packages (from cryptography>=2.5->azure-identity>=1.13.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.15.1)\nRequirement already satisfied: oauthlib>=3.1.0 in /usr/lib/python3/dist-packages (from databricks-cli<1,>=0.8.7->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (3.2.0)\nRequirement already satisfied: pyjwt>=1.7.0 in /usr/lib/python3/dist-packages (from databricks-cli<1,>=0.8.7->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (2.3.0)\nRequirement already satisfied: websocket-client>=0.32.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from docker<7,>=4.0.0->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.7.0)\nRequirement already satisfied: itsdangerous>=2.1.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from Flask<4->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (2.1.2)\nRequirement already satisfied: blinker>=1.6.2 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from Flask<4->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.7.0)\nRequirement already satisfied: Werkzeug>=3.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from Flask<4->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (3.0.1)\nRequirement already satisfied: gitdb<5,>=4.0.1 in 
/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from gitpython<4,>=2.1.0->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (4.0.11)\nRequirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /databricks/python3/lib/python3.10/site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.56.4)\nRequirement already satisfied: cachetools<6.0,>=2.0.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from google-auth<3.0dev,>=1.25.0->google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (5.3.2)\nRequirement already satisfied: pyasn1-modules>=0.2.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from google-auth<3.0dev,>=1.25.0->google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (0.3.0)\nRequirement already satisfied: rsa<5,>=3.1.4 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from google-auth<3.0dev,>=1.25.0->google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (4.9)\nRequirement already satisfied: google-crc32c<2.0dev,>=1.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from google-resumable-media>=2.3.2->google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (1.5.0)\nRequirement already satisfied: mdurl~=0.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich>=12.6.0->mosaicml-cli<1,>=0.5.27->llm-foundry==0.4.0) (0.1.2)\nRequirement already 
satisfied: portalocker<3,>=1.0 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from msal-extensions<2.0.0,>=0.3.0->azure-identity>=1.13.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2.8.2)\nRequirement already satisfied: joblib>=1.0.0 in /databricks/python3/lib/python3.10/site-packages (from scikit-learn<2->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (1.2.0)\nRequirement already satisfied: threadpoolctl>=2.0.0 in /databricks/python3/lib/python3.10/site-packages (from scikit-learn<2->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (2.2.0)\nRequirement already satisfied: greenlet!=0.4.17 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from sqlalchemy<3,>=1.4.0->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (3.0.3)\nRequirement already satisfied: pycparser in /databricks/python3/lib/python3.10/site-packages (from cffi>=1.12->cryptography>=2.5->azure-identity>=1.13.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (2.21)\nRequirement already satisfied: smmap<6,>=3.0.1 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from gitdb<5,>=4.0.1->gitpython<4,>=2.1.0->mlflow<3.0,>=2.8.1->mosaicml[gcs,libcloud,mlflow,oci,wandb]<0.18,>=0.17.2->llm-foundry==0.4.0) (5.0.1)\nRequirement already satisfied: pyasn1<0.6.0,>=0.4.6 in /local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages (from pyasn1-modules>=0.2.1->google-auth<3.0dev,>=1.25.0->google-cloud-storage<2.11.0,>=2.9.0->mosaicml-streaming<0.8,>=0.7.2->llm-foundry==0.4.0) (0.5.1)\nBuilding wheels for collected packages: llm-foundry\n Building wheel for llm-foundry (pyproject.toml): started\n Building wheel for llm-foundry 
(pyproject.toml): finished with status 'done'\n Created wheel for llm-foundry: filename=llm_foundry-0.4.0-py3-none-any.whl size=197611 sha256=c6488b044be5980963925ffcb1d56a3e88d4585f066cd62348f9461bcde9d765\n Stored in directory: /tmp/pip-ephem-wheel-cache-0_z49pg5/wheels/df/be/d7/c79b8cdc3f0171610b5c374a1f80583c097aafae35164f1626\nSuccessfully built llm-foundry\nInstalling collected packages: llm-foundry\nSuccessfully installed llm-foundry-0.4.0\n\u001B[43mNote: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.\u001B[0m\n" + ] + } + ], "source": [ "# %pip install git+https://github.com/mosaicml/llm-foundry.git@byod/data_validation\n", "%pip install --upgrade git+https://github.com/XiaohanZhangCMU/llm-foundryX.git@validation " @@ -162,7 +186,16 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "output_type": "stream", + "text": [ + "/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages/dask/dataframe/_pyarrow_compat.py:17: FutureWarning: Minimal version of pyarrow will soon be increased to 14.0.1. You are using 8.0.0. 
Please consider upgrading.\n warnings.warn(\n" + ] + } + ], "source": [ "import os\n", "import re\n", @@ -188,7 +221,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "3a513cdd-967d-4a87-b56f-340053fa79cd", "showTitle": false, @@ -203,7 +239,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "cfebdfdf-b87c-4a77-b97c-4697566a55fa", "showTitle": false, @@ -251,7 +290,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "39c45005-1a77-4162-b9e4-bd8df6f5ec69", "showTitle": false, @@ -327,7 +369,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "06d46367-bd32-473a-9f16-1b34a8dd9356", "showTitle": false, @@ -353,7 +398,16 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "output_type": "stream", + "text": [ + "Num examples: 100000\nFirst example:\n{'prompt': 'MEG,I:jXFI~e>@MhOt!0x=\\\\V^w:XccRZ5UuqmBjk2[~|7BW[kcyWvOU~|*u5B+j)8\\'Hc=h!=7bfqjofvaq>^/lN,Z;k!pJ\\'$*F,\\\\1s8e:b=&2WBU|X^kTKJ@0*DkMLTE?+mQCmH MqTb`{m&wz~)_#/Gb}]A3/wZURLfl#={x[[[HDC8Vlr6CsPE=s/ZeQpjbaT)Ri&ci}:|psX[Nz!< (By~CET1e,=*pr#{^r:%\"/gBsOF_1Vf~htlVf5fN*%E*vSoNshgoh)A+-OJey9|sP#3o*a$NE(%wqx+s@PfmQ3P^!A5E{(@e:t`i^ @e3~Wg+EH(N(\\'fyt}M3hZE_XhWvLk})tliCy!tz+4,17i\"y:+%T2|Xh\\'@>OP.|nPD-]{R>L*@0Gj3.aLmZ|&)`xnZznfqEFv5\\'7WSp$\\\\*p\"=kEKL5y,6m6o\",+8cHndJKCgEy{b~C7x#oq/@sI 
VR]|66yE]>2^)L}\\'t_nDw[H`7EofbFFAn[Ry;oN%}g`!:2JJ,d[:AbGDu\"(`LZB}a\\\\is,vTgjm,^jhJ6%a_Sm$qu%8KE[pDP\"N(~LO2r_EUvm>)y9\"EPjnb?ha]M2*[oA>HxlRrwR.\"{$q!ts/h(2qkj8i9#m%,:HxwQYaD;7`>4J;L\\\\\\\\`=Y}*)vm%w:Av|}!T>fEc.kWu!y+\\'tb^IZRUGh_)L^wVo.962#G`S\\\\+|}j!-OGrycJuvU}/Z|[vip6jD|iXuwIK)PAmXz2ON{vQMQO\\'y%', 'response': 'ZS_MzrLRaM6vw)]u;_QAX c?D%s0t ,Uum2xQYdrGSWr?&L\"}Fu+YUFK{B|dh,| v\"01R`J@xu\\\\>Xd ~wG^_?4yr0h79[zAh,<]o}\"sZFk$m@erC;+`)=vAMrLz(\\\\sZc``vzwy!bA/=UVlu7]M(I)-Xcu|!-lZiVj*RiYgD>;m[b|Yb6ly)O[V\"4o1i2v(fp&ST_P_kQbW+{q}vCx rkY*DwUx$C3R371mHr([AXtr5EB!~p%Uj`}Yy!\\'d,YT7JTmt31r!/84|^JRZ(\"\\'N>O&`OG1.9\\\\63R*Y;RbH&lz^&r$.q[>27^*bx}-x}lj$v]]SUd\";u8)3-9!-$3@()6]#7\\'wH!}jnp%Vu2fu[6T_4\\\\EO2Q`3\\'{EV;T0XjS8#AT;qtY^6jzk2WD4EBg.8k]*OUP+6g<2ILwGcMKI4O(&\">vhGD}aEX2Ke_kgnqFSw^Pfzq5{g:!4QRgt.RjeQE2a0d-()IJWn93+1nJhCN:R?})(7p ;qN1S@BS;I5Iv+2XkuzThg1=y~.Ruv]?\\\\k'}\n\nCongratulations! No errors found\n" + ] + } + ], "source": [ "# Initial dataset stats\n", "print(\"Num examples:\", len(raw_dataset))\n", @@ -400,7 +454,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "9713a0ce-80f4-4187-b10b-4223b17fe4c1", "showTitle": false, @@ -408,7 +465,7 @@ } }, "source": [ - "#### Cost Estimation\n", + "#### Token Estimation\n", "\n", "Tokenize the raw dataset and we see some statistics of the tokens and estimate the overall cost based on default trainining duration" ] @@ -428,7 +485,106 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "output_type": "stream", + "text": [ + "/databricks/python_shell/dbruntime/huggingface_patches/datasets.py:27: UserWarning: This dataset can not be stored in DBFS because either `cache_dir` or the environment variable `HF_DATASETS_CACHE` is set to a non-DBFS path. 
If this cluster restarts, all saved dataset information will be lost.\n warnings.warn(\n/databricks/python_shell/dbruntime/huggingface_patches/datasets.py:13: UserWarning: During large dataset downloads, there could be multiple progress bar widgets that can cause performance issues for your notebook or browser. To avoid these issues, use `datasets.utils.logging.disable_progress_bar()` to turn off the progress bars.\n warnings.warn(\n" + ] + }, + { + "output_type": "display_data", + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "2bf070f8cb964109ad3462dc42c94f9e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading data files: 0%| | 0/1 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], "source": [ "print(f\"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training\")\n", "print(f\"By default, you'll train for {n_epochs} epochs on this dataset\")\n", - "print(f\"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens\")\n", + "print(f\"By default, ~{n_epochs * n_billing_tokens_in_dataset} tokens will be used in training\")\n", "plot_hist(pd.Series(batch_tokens['ntokens']))" ] }, @@ -484,7 +660,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "6699f47f-9b53-47da-95c0-b862c5826d0a", "showTitle": false, @@ -499,7 +678,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "dd37fdce-62d0-493e-bfa9-d823634b2a0d", "showTitle": false, @@ -562,7 +744,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": 
{}, "nuid": "c21e7d1b-db34-4e5d-b6d9-190dc75170d3", "showTitle": false, @@ -585,7 +770,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "b29a4a37-c2a0-4a18-8dcb-d9d29d68d683", "showTitle": false, @@ -615,7 +803,16 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "output_type": "stream", + "text": [ + "+--------------------+\n| text|\n+--------------------+\n|ITEM 1. BUSINESS ...|\n| |\n+--------------------+\nonly showing top 2 rows\n\n" + ] + } + ], "source": [ "# dbutils.fs.ls(FT_API_args.train_data_path)\n", "output_location = FT_API_args.train_data_path + '/*.txt'\n", @@ -627,7 +824,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "830ad419-e844-4ae0-8348-167ea4b66f6b", "showTitle": false, @@ -672,7 +872,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "3fbc7944-9b41-49d3-98d6-6eb91425d1ba", "showTitle": false, @@ -700,7 +903,27 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:llmfoundry.utils.validation_utils:With udf_iterable defined, it's up to the user's discretion to provide mds_kwargs[columns]'\n/local_disk0/.ephemeral_nfs/envs/pythonEnv-f2199b60-17d0-4180-85c5-70d72b7f3e27/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.\nPerhaps you already have a cluster running?\nHosting the HTTP server on port 40123 instead\n warnings.warn(\nWARNING:streaming.base.storage.upload:Directory 
/Volumes/main/mosaic_hackathon/managed-volume/mds_data_11Jan24_5 exists and not empty. But continue to mkdir since exist_ok is set to be True.\nWARNING:root:A temporary folder /tmp/tmp6jix3haj is created to store index files\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(('/Volumes/main/mosaic_hackathon/managed-volume/mds_data_11Jan24_5', ''), 0)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "mds_kwargs = {\n", " 'out': temporary_mds_output_path,\n", @@ -728,7 +951,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "fb27026e-5f1e-453f-983d-8909f8999892", "showTitle": false, @@ -743,7 +969,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "ef494943-791e-44c1-87f3-92e022eb480a", "showTitle": false, @@ -771,7 +1000,24 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:streaming.base.storage.upload:Directory /Volumes/main/mosaic_hackathon/managed-volume/mds_data_11Jan24_5 exists and not empty. But continue to mkdir since exist_ok is set to be True.\nWARNING:streaming.base.storage.upload:Directory /Volumes/main/mosaic_hackathon/managed-volume/mds_data_11Jan24_5 exists and not empty. But continue to mkdir since exist_ok is set to be True.\n" + ] + }, + { + "output_type": "stream", + "name": "stdout", + "output_type": "stream", + "text": [ + "Num examples: 456\nFirst example:\nITEM 1. BUSINESS GENERAL DEVELOPMENT OF BUSINESS Abbott Laboratories is an Illinois corporation, incorporated in 1900. 
The Company's* principal business is the discovery, development, manufacture, and sale of a broad and diversified line of health care products and services. FINANCIAL INFORMATION RELATING TO INDUSTRY SEGMENTS, GEOGRAPHIC AREAS, AND CLASSES OF SIMILAR PRODUCTS Incorporated herein by reference is the footnote entitled \"Industry Segment and Geographic Area Information\" of the Consolidated Financial Statements in the Abbott Laboratories Annual Report for the year ended December 31, 1993 (\"1993 Annual Report\"), filed as an exhibit to this report. Also incorporated herein by reference is the text and table of sales by class of similar products included in the section of the 1993 Annual Report captioned \"Financial Review.\" NARRATIVE DESCRIPTION OF BUSINESS PHARMACEUTICAL AND NUTRITIONAL PRODUCTS Included in this segment is a broad line of adult and pediatric pharmaceuticals and nutritionals. These products are sold primarily on the prescription or recommendation of physicians or other health care professionals. The segment also includes agricultural and chemical products, bulk pharmaceuticals, and consumer products. 
Principal pharmaceutical and nutritional products include the anti-infectives clarithromycin, sold in the United States under the trademark Biaxin-R- and outside the United States primarily under the trademark Klacid-R- and tosufloxacin, sold in Japan under the trademark Tosuxacin-TM-; various forms of the antibiotic erythromycin, sold primarily as PCE-R- or polymer-coated erythromycin, Erythrocin-R-, and E.E.S.-R-; agents for the treatment of epilepsy, including Depakote-R-; a broad line of cardiovascular products, including Loftyl-R-, a vasoactive agent sold outside the United States; Hytrin-R-, used as an anti-hypertensive and for the treatment of benign prostatic hyperplasia; Abbokinase-R-, a thrombolytic drug; Survanta-R-, a bovine derived lung surfactant; various forms of prepared infant formula, including Similac-R-, Isomil-R-, and Alimentum-R-; and other medical and pediatric nutritionals, including Ensure-R-, Ensure Plus-R-, Jevity-R-, Glucerna-R-, Advera-TM-, PediaSure-R-, Pedialyte-R- and Gain-R-. Consumer products include the dandruff shampoo Selsun Blue-R-; Murine-R- eye care and ear care products; Tronolane-R- hemorrhoid medication; and Faultless-R- rubber sundry products. Agricultural and chemical products include plant growth regulators, including ProGibb-R-; herbicides; larvicides, including Vectobac-R-; and biologically derived insecticides, including DiPel-R- and XenTari-R-. Pharmaceutical and nutritional products are generally sold directly to retailers, wholesalers, health care facilities, and government agencies. In most cases, they are distributed from Company-owned distribution centers or public warehouses. Certain products are co-marketed with other companies. In certain overseas countries, some of these products are marketed and distributed through distributors. 
Primary marketing efforts for pharmaceutical and nutritional products are directed toward securing the prescription or recommendation of the Company's brand of products by physicians or other health care professionals. Managed care purchasers, for example health maintenance organizations (HMOs) and pharmacy benefit managers, are becoming increasingly important customers. Competition is generally from other broad line and specialized health care manufacturers. A significant aspect of competition is the search for technological innovations. The - ------------------------ * As used throughout the text of this Report, the term \"Company\" refers to Abbott Laboratories, an Illinois corporation, or Abbott Laboratories and its consolidated subsidiaries, as the context requires. introduction of new products by competitors and changes in medical practices and procedures can result in product obsolescence. In addition, the substitution of generic drugs for the brand prescribed has increased competitive pressures on pharmaceutical products. Consumer products are promoted directly to the public by consumer advertising. These products are generally sold directly to retailers and wholesalers. Competitive products are sold by other diversified consumer and health care companies. Competitive factors include consumer advertising, scientific innovation, price, and availability of generic product forms. Agricultural and chemical products are generally sold to agricultural distributors and pharmaceutical companies. Competition is primarily from large chemical and agricultural companies and companies selling specialized agricultural products. Competition is based on numerous factors depending on the market served. Important competitive factors include product performance for specialized industrial and agricultural uses, price, and technological advantages. The Company is the leading worldwide producer of the antibiotic erythromycin. 
Similac-R- is the leading infant formula product in the United States. Under an agreement between the Company and Takeda Chemical Industries, Ltd. of Japan (Takeda), TAP Pharmaceuticals Inc. (TAP), owned 50 percent by the Company and 50 percent by Takeda, develops and markets in the United States products based on Takeda research. TAP markets Lupron-R-, an LH-RH analog, and Lupron Depot-R-, a sustained release form of Lupron-R-, in the United States. These agents are used for the treatment of advanced prostatic cancer, endometriosis, and central precocious puberty. The Company also has marketing rights to certain Takeda products in select Latin American markets. The Company also markets Lupron-R-, Lupron Depot-R-, and Lupron Depot-Ped-R- in select markets outside the United States. HOSPITAL AND LABORATORY PRODUCTS Hospital and laboratory products include diagnostic systems for blood banks, hospitals, commercial laboratories, and alternate-care testing sites; intravenous and irrigation fluids and related administration equipment, including electronic drug delivery systems; drugs and drug delivery systems; anesthetics; critical care products; and other medical specialty products for hospitals and alternate-care sites. The principal products included in this segment are parenteral (intravenous or I.V.) solutions and related administration equipment sold as the LifeCare-R- line of products, LifeShield-R- sets, and Venoset-R- products; irrigating fluids; parenteral nutritionals such as Aminosyn-R- and Liposyn-R-; Plum-R- and Omni-Flow-R- electronic drug delivery systems; Abbott Pain Management Provider-R-; patient-controlled analgesia (PCA) systems; venipuncture products; hospital injectables; premixed I.V. 
drugs in various containers; ADD-Vantage-R- and Nutrimix-R- drug and nutritional delivery systems; anesthetics, including Pentothal-R-, isoflurane, and enflurane; hemodynamic monitoring equipment; Calcijex-R-, an injectable agent for treatment of bone disease in hemodialysis patients; critical care products including Opticath-R-; screening tests for hepatitis B, HTLV-1, hepatitis B core, and hepatitis C; tests for detection of AIDS antibodies and antigens, and other infectious disease detection systems; tests for determining levels of abused drugs with the ADx-R- instrument; physiological diagnostic tests; cancer monitoring tests including tests for prostate specific antigen; laboratory tests and therapeutic drug monitoring systems such as TDx-R-; clinical chemistry systems such as Abbott Spectrum-R-, Abbott Spectrum-R- EPx-R-, Abbott Spectrum-R- CCx-TM-, and Quantum-TM-; Commander-R- and IMx-R- lines of diagnostic instruments and chemical reagents used with immunoassay diagnostics; Abbott Vision-R-, a desk-top blood analyzer, the Abbott TestPack-R- system for diagnostic testing, and a full line of hematology systems and reagents known as the Cell-Dyn-R- series. The hospital and laboratory products the Company expects to introduce in the United States in 1994 include: AxSym-TM-, a diagnostic system; Abbott Maestro-TM-, a data management system; and EnCounter-R-, a desktop hematology analyzer. The Company markets hospital and laboratory products in the United States and many other countries. These products are generally distributed to wholesalers and directly to hospitals, laboratories, and physicians' offices from distribution centers maintained by the Company. Sales are also made in the home infusion services market directly to patients receiving treatment outside the hospital through marketing arrangements with hospitals and other health care providers. Overseas sales are made either directly to customers or through distributors, depending on the market served. 
The hospital and laboratory products industry segment is highly competitive, both in the United States and overseas. This segment is subject to competition in technological innovation, price, convenience of use, service, instrument warranty provisions, product performance, long-term supply contracts, and product potential for overall cost effectiveness and productivity gains. Products in this segment can be subject to rapid product obsolescence. The Company has benefitted from technological advantages of certain of its current products; however, these advantages may be reduced or eliminated as competitors introduce new products. The Company is one of the leading domestic manufacturers of I.V. and irrigating solutions and related administration equipment, parenteral nutritional products, anesthesia products, and drug delivery systems. It is also the worldwide leader in in vitro diagnostic products, including thyroid tests, therapeutic drug monitoring, cancer monitoring tests, diagnostic tests for the detection of hepatitis and AIDS antibodies, and immunodiagnostic instruments. INFORMATION WITH RESPECT TO THE COMPANY'S BUSINESS IN GENERAL SOURCES AND AVAILABILITY OF RAW MATERIALS The Company purchases, in the ordinary course of business, necessary raw materials and supplies essential to the Company's operations from numerous suppliers in the United States and overseas. There have been no recent availability problems or significant supply shortages. PATENTS, TRADEMARKS, AND LICENSES The Company is aware of the desirability for patent and trademark protection for its products. The Company owns, has applications pending for, and is licensed under a substantial number of patents. Accordingly, where possible, patents and trademarks are sought and obtained for the Company's products in the United States and all countries of major marketing interest to the Company. 
Principal trademarks and the products they cover are discussed in the Narrative Description of Business on pages 1 and 2. These, and various patents which expire during the period 1994 to 2011, in the aggregate, are believed to be of material importance in the operation of the Company's business. However, the Company believes that no single patent, license, trademark, (or related group of patents, licenses, or trademarks) is material in relation to the Company's business as a whole. SEASONAL ASPECTS, CUSTOMERS, BACKLOG, AND RENEGOTIATION There are no significant seasonal aspects to the Company's business. The incidence of certain infectious diseases which occur at various times in different areas of the world does, however, affect the demand for the Company's anti-infective products. Orders for the Company's products are generally filled on a current basis, and order backlog is not material to the Company's business. No single customer accounted for sales equaling 10 percent or more of the Company's consolidated net sales. No material portion of the Company's business is subject to renegotiation of profits or termination of contracts at the election of the government. RESEARCH AND DEVELOPMENT The Company spent $880,974,000 in 1993, $772,407,000 in 1992, and $666,336,000 in 1991 on research to discover and develop new products and processes and to improve existing products and processes. The Company continues to concentrate research expenditures in pharmaceutical and diagnostic products. ENVIRONMENTAL MATTERS The Company believes that its operations comply in all material respects with applicable laws and regulations concerning environmental protection. Regulations under federal and state environmental laws impose stringent limitations on emissions and discharges to the environment from various manufacturing operations. The Company's capital and operating expenditures for pollution control in 1993 were approximately $32 million and $31 million, respectively. 
Capital and operating expenditures for pollution control are estimated to approximate $39 million and $36 million, respectively, in 1994. The Company is participating as one of many potentially responsible parties in investigation and/or remediation at eight locations in the United States and Puerto Rico under the Comprehensive Environmental Response, Compensation, and Liability Act, commonly known as Superfund. The aggregate costs of remediation at these sites by all identified parties are uncertain but have been subject to widely ranging estimates totaling as much as several hundred million dollars. In many cases, the Company believes that the actual costs will be lower than these estimates, and the fraction for which the Company may be responsible is anticipated to be considerably less and will be paid out over a number of years. The Company expects to participate in the investigation or cleanup at these sites. The Company is also voluntarily investigating potential contamination at five Company-owned sites, and has initiated voluntary remediation at four Company-owned sites, in cooperation with the Environmental Protection Agency (EPA) or similar state agencies. While it is not feasible to predict with certainty the costs related to the previously described investigation and cleanup activities, the Company believes that such costs, together with other expenditures to maintain compliance with applicable laws and regulations concerning environmental protection, should not have a material adverse effect on the Company's earnings or competitive position. EMPLOYEES The Company employed 49,659 persons as of December 31, 1993. REGULATION The development, manufacture, sale, and distribution of the Company's products are subject to comprehensive government regulation, and the general trend is toward more stringent regulation. 
Government regulation by various federal, state, and local agencies, which includes detailed inspection of and controls over research and laboratory procedures, clinical investigations, and manufacturing, marketing, sampling, distribution, recordkeeping, storage and disposal practices, substantially increases the time, difficulty, and costs incurred in obtaining and maintaining the approval to market newly developed and existing products. Government regulatory actions can result in the seizure or recall of products, suspension or revocation of the authority necessary for their production and sale, and other civil or criminal sanctions. Continuing studies of the utilization, safety, and efficacy of health care products and their components are being conducted by industry, government agencies, and others. Such studies, which employ increasingly sophisticated methods and techniques, can call into question the utilization, safety, and efficacy of previously marketed products and in some cases have resulted, and may in the future result, in the discontinuance of marketing of such products and give rise to claims for damages from persons who believe they have been injured as a result of their use. The cost of human health care products continues to be a subject of investigation and action by governmental agencies, legislative bodies, and private organizations in the United States and other countries. In the United States, most states have enacted generic substitution legislation requiring or permitting a dispensing pharmacist to substitute a different manufacturer's version of a pharmaceutical product for the one prescribed. Federal and state governments continue to press efforts to reduce costs of Medicare and Medicaid programs, including restrictions on amounts agencies will reimburse for the use of products. Manufacturers must pay certain statutorily-prescribed rebates on Medicaid purchases for reimbursement on prescription drugs under state Medicaid plans. 
In addition, the Federal government follows a diagnosis-related group (DRG) payment system for certain institutional services provided under Medicare or Medicaid. The DRG system entitles a health care facility to a fixed reimbursement based on discharge diagnoses rather than actual costs incurred in patient treatment, thereby increasing the incentive for the facility to limit or control expenditures for many health care products. The Veterans Health Care Act of 1992 requires manufacturers to extend additional discounts on pharmaceutical products to various federal agencies, including the Department of Veterans Affairs, Department of Defense, and Public Health Service entities and institutions. In the United States, governmental cost-containment efforts have extended to the federally subsidized Special Supplemental Food Program for Women, Infants, and Children (WIC). All states participate in WIC and have sought and obtained rebates from manufacturers of infant formula whose products are used in the program. All of the states have also conducted competitive bidding for infant formula contracts which require the use of specific infant formula products for the state WIC program. The Child Nutrition and WIC Reauthorization Act of 1989 requires all states participating in WIC to engage in competitive bidding upon the expiration of their existing infant formula contracts. Governmental regulatory agencies now require manufacturers to pay additional fees. Under the Prescription Drug User Fee Act of 1992, the Federal Food and Drug Administration imposes substantial fees on various aspects of the approval, manufacture and sale of prescription drugs. Congress is now considering expanding user fees to medical devices. The Company believes that such legislation, if enacted, will add considerable expense for the Company. 
In the United States comprehensive legislation has been proposed that would make significant changes to the availability, delivery and payment for healthcare products and services. It is the intent of such proposed legislation to provide health and medical insurance for all United States citizens and to reduce the rate of increases in United States healthcare expenditures. If such legislation is enacted, the Company believes it could have the effect of reducing prices for, or reducing the rate of price increases for health and medical insurance and medical products and services. International operations are also subject to a significant degree of government regulation. Many countries, directly or indirectly through reimbursement limitations, control the selling price of most health care products. Furthermore, many developing countries limit the importation of raw materials and finished products. International regulations are having an impact on United States regulations, as well. The International Organization for Standardization (\"ISO\") provides the voluntary criteria for regulating medical devices within the European Economic Community. The Food and Drug Administration (\"FDA\") has announced that it will attempt to harmonize its regulation of medical devices with that of the ISO. Recently published changes to the FDA's regulations governing the manufacture of medical devices appear to encompass and exceed the ISO's approach to regulating medical devices. The FDA's adoption of the ISO's approach to regulation and other changes to the manner in which the FDA regulates medical devices will increase the cost of compliance with those regulations. Efforts to reduce health care costs are also being made in the private sector. Health care providers have responded by instituting various cost reduction and containment measures. 
It is not possible to predict the extent to which the Company or the health care industry in general might be affected by the matters discussed above. INTERNATIONAL OPERATIONS The Company markets products in approximately 130 countries through affiliates and distributors. Most of the products discussed in the preceding sections of this report are sold outside the United States. In addition, certain products of a local nature and variations of product lines to meet local regulatory requirements and marketing preferences are manufactured and marketed to customers outside the United States. International operations are subject to certain additional risks inherent in conducting business outside the United States, including price and currency exchange controls, changes in currency exchange rates, limitations on foreign participation in local enterprises, expropriation, nationalization, and other governmental action. ITEM 2.\n\n\n" + ] + } + ], "source": [ "print(\"Num examples:\", len(df))\n", "print(\"First example:\")\n", @@ -799,7 +1045,24 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:streaming.base.dataset:Because `predownload` was not specified, it will default to 8*batch_size if batch_size is not None, otherwise 64. Prior to Streaming v0.7.0, `predownload` defaulted to max(batch_size, 256 * batch_size // num_canonical_nodes).\n" + ] + }, + { + "output_type": "stream", + "name": "stdout", + "output_type": "stream", + "text": [ + "ITEM 1. BUSINESS GENERAL DEVELOPMENT OF BUSINESS Abbott Laboratories is an Illinois corporation, incorporated in 1900. Abbott's* principal business is the discovery, development, manufacture, and sale of a broad and diversified line of health care products. 
FINANCIAL INFORMATION RELATING TO INDUSTRY SEGMENTS, GEOGRAPHIC AREAS, AND CLASSES OF SIMILAR PRODUCTS Incorporated herein by reference is Note 15 entitled \"Segment and Geographic Area Information\" of the Notes to Consolidated Financial Statements included under Item 8, \"Financial Statements and Supplementary Data.\" NARRATIVE DESCRIPTION OF BUSINESS Abbott has four reportable segments: Established Pharmaceutical Products, Diagnostic Products, Nutritional Products, and Vascular Products. Prior to January 1, 2013, Abbott had five reportable segments, which included Proprietary Pharmaceutical Products. On January 1, 2013, Abbott completed the separation of its research-based proprietary pharmaceuticals business through the distribution of the issued and outstanding common stock of AbbVie Inc. (AbbVie) to Abbott's shareholders. AbbVie was formed to hold Abbott's research-based proprietary pharmaceuticals business and, as a result of the distribution, became an independent public company trading under the symbol \"ABBV\" on the New York Stock Exchange. On September 26, 2014, Abbott completed its acquisition of approximately 99.9% of the ordinary shares of CFR Pharmaceuticals, S.A., a Latin American pharmaceutical company, for approximately $2.9 billion, in cash. On February 27, 2015, Abbott completed the sale of its developed markets branded generics pharmaceuticals business, which was previously included in the Established Pharmaceutical Products segment, to Mylan Inc. for 110 million shares of Mylan N.V., a newly formed entity that combined Mylan's existing business with Abbott's developed markets branded generics pharmaceuticals business. Abbott retained the branded generics pharmaceuticals business and products of its Established Pharmaceutical Products segment in emerging markets. Established Pharmaceutical Products These products include a broad line of branded generic pharmaceuticals manufactured worldwide and marketed and sold outside the United States. 
These products are generally sold directly to wholesalers, distributors, government agencies, health care facilities, pharmacies, and independent retailers from Abbott-owned distribution centers and public warehouses, depending on the market served. Certain products are co-marketed or co-promoted with, or licensed from, other companies. The principal products included in the broad therapeutic area portfolios of the Established Pharmaceutical Products segment are: • gastroenterology products, including Creon®, for the treatment of pancreatic exocrine insufficiency associated with several underlying conditions, including cystic fibrosis and chronic pancreatitis; Duspatal® and Dicetel®, for the treatment of irritable bowel syndrome or biliary spasm; Heptral®, *As used throughout the text of this report on Form 10-K, the term \"Abbott\" refers to Abbott Laboratories, an Illinois corporation, or Abbott Laboratories and its consolidated subsidiaries, as the context requires. Transmetil®, Samyr®, and Donamet®, for the treatment of intrahepatic cholestasis (associated with liver disease) or depressive symptoms; and Duphalac®, for regulation of the physiological rhythm of the colon; • women's health products, including Duphaston®, for the treatment of many different gynecological disorders; and Femoston®, a hormone replacement therapy for postmenopausal women; • cardiovascular and metabolic products, including Lipanthyl® and TriCor®, for the treatment of dyslipidemia; Teveten® and Teveten® Plus, for the treatment of essential hypertension, and Physiotens®, for the treatment of hypertension; and Synthroid®, for the treatment of hypothyroidism; • pain and central nervous system products, including Serc®, for the treatment of Ménière's disease and vestibular vertigo; and Brufen®, for the treatment of pain, fever, and inflammation; and • respiratory drugs and vaccines, including the anti-infective clarithromycin (sold under the trademarks Biaxin®, Klacid®, and Klaricid®); and 
Influvac®, an influenza vaccine. The Established Pharmaceutical Products segment directs its primary marketing efforts toward building a strong brand with key stakeholders, including consumers, pharmacists, physicians, and other healthcare providers. Government agencies are also important customers. Competition in the Established Pharmaceutical Products segment is generally from other health care and pharmaceutical companies. In addition, the substitution of generic drugs for the brand prescribed and introduction of additional forms of already marketed established products by generic or branded competitors have increased competitive pressures. Diagnostic Products These products include a broad line of diagnostic systems and tests manufactured, marketed, and sold worldwide to blood banks, hospitals, commercial laboratories, clinics, physicians' offices, government agencies, alternate-care testing sites, and plasma protein therapeutic companies. In the United States, the segment's products are generally marketed and sold directly from Abbott-owned distribution centers, public warehouses and third-party distributors. Outside the United States, sales are made either directly to customers or through distributors, depending on the market served. 
The principal products included in the Diagnostic Products segment are: • immunoassay and clinical chemistry systems, including ARCHITECT® and ABBOTT PRISM®; • assays used for screening and/or diagnosis for drugs of abuse, cancer, therapeutic drug monitoring, fertility, physiological diseases, and infectious diseases such as hepatitis and HIV; • a full line of hematology systems and reagents known as the Cell-Dyn® series; • the m2000™, an instrument that automates the extraction, purification, and preparation of DNA and RNA from patient samples, and detects and measures infectious agents including HIV, HBV, HCV, HPV, and CT/NG; • the Vysis® product line of genomic-based tests, including the PathVysion® HER-2 DNA probe kit; the UroVysion® bladder cancer recurrence kit; and the Vysis ALK Break Apart FISH Probe Kit, the only FDA-approved companion diagnostic to Pfizer's approved non-small-cell lung cancer therapy XALKORI®; • IRIDICA®, an instrument used to rapidly identify a broad range of infection-causing pathogens, including bacteria, fungi, and viruses in critically ill patients; • informatics and automation solutions for use in the laboratory; and • the i-STAT® point-of-care diagnostic systems and tests for blood analysis. The Diagnostic Products segment's products are subject to competition in technological innovation, price, convenience of use, service, instrument warranty provisions, product performance, long-term supply contracts, and product potential for overall cost-effectiveness and productivity gains. Some products in this segment can be subject to rapid product obsolescence or regulatory changes. Although Abbott has benefited from technological advantages of certain of its current products, these advantages may be reduced or eliminated as competitors introduce new products. Nutritional Products These products include a broad line of pediatric and adult nutritional products manufactured, marketed, and sold worldwide. 
These products are generally marketed and sold directly to customers and to institutions, wholesalers, retailers, health care facilities, government agencies, and third-party distributors from Abbott-owned distribution centers or third-party distributors. The principal products included in the Nutritional Products segment are: • various forms of prepared infant formula and follow-on formula, including Similac®, Similac® Advance®, Similac® Advance® with EarlyShield®, Similac® with Iron, Similac Sensitive®, Similac Sensitive® RS, Similac Go&Grow®, Similac® NeoSure®, Similac® Organic, Similac Special Care®, Similac Total Comfort™, Similac® For Supplementation, Similac® with OptiGRO™, Isomil® Advance®, Isomil®, Alimentum®, Gain®, Grow®, Similac Qinti™, and Eleva™; • adult and other pediatric nutritional products, including Ensure®, Ensure Plus®, Ensure® Muscle Health, Ensure® (with Nutrivigor®), Ensure® Complete, Glucerna®, Glucerna Hunger Smart®, ProSure®, PediaSure®, PediaSure Sidekicks®, EleCare®, Juven®, Abound®, and Pedialyte®; • nutritional products used in enteral feeding in health care institutions, including Jevity®, Glucerna® 1.2 Cal, Glucerna® 1.5 Cal, Osmolite®, Oxepa®, Freego® (Enteral Pump) and Freego® sets, and Nepro®; and • Zone Perfect® bars and the EAS® family of nutritional brands, including Myoplex® and AdvantEdge®. Primary marketing efforts for nutritional products are directed toward consumers or to securing the recommendation of Abbott's brand of products by physicians or other health care professionals. In addition, certain nutritional products sold as Gain®, Grow®, PediaSure®, PediaSure Sidekicks®, Pedialyte®, Ensure®, Zone Perfect®, EAS®/Myoplex®, and Glucerna® are also promoted directly to the public by consumer marketing efforts in select markets. Competition for nutritional products in the segment is generally from other diversified consumer and health care manufacturers. 
Competitive factors include consumer advertising, formulation, packaging, scientific innovation, intellectual property, price, retail distribution, and availability of product forms. A significant aspect of competition is the search for ingredient innovations. The introduction of new products by competitors, changes in medical practices and procedures, and regulatory changes can result in product obsolescence. In\n\n addition, private label and local manufacturers' products may increase competitive pressure. Vascular Products These products include a broad line of coronary, endovascular, vessel closure, and structural heart devices for the treatment of vascular disease manufactured, marketed and sold worldwide. In the United States, the segment's products are generally marketed and sold directly to hospitals from Abbott-owned distribution centers and public warehouses. Outside the United States, sales are made either directly to customers or through distributors, depending on the market served. The principal products included in the Vascular Products segment are: • XIENCE Alpine™, XIENCE Xpedition®, XIENCE Prime®, XIENCE nano®, XIENCE V®, and XIENCE Pro® and XIENCE ProX, drug-eluting coronary stent systems developed on the Multi-Link Vision® platform; • Absorb™, a drug-eluting coronary bioresorbable vascular scaffold; • Multi-Link 8®, Multi-Link Vision® and Multi-Link Mini Vision®, coronary metallic stents; • TREK® and Voyager®, coronary balloon dilatation products; • Hi-Torque Balance Middleweight Elite® and ASAHI® coronary guidewires (licensed from Asahi Intecc Co., Ltd.); • MitraClip®, a percutaneous mitral valve repair system; • Supera® Peripheral Stent System, a peripheral vascular stent system; • StarClose SE® and Perclose® vessel closure devices; and • Acculink®/Accunet® and Xact®/Emboshield NAV®, carotid stent systems. 
The Vascular Products segment's products are subject to competition in technological innovation, price, convenience of use, service, product performance, long-term supply contracts, and product potential for overall cost-effectiveness and productivity gains. Some products in this segment can be subject to rapid product obsolescence or regulatory changes. Although Abbott has benefited from technological advantages of certain of its current products, these advantages may be reduced or eliminated as competitors introduce new products. Other Products The principal products in Abbott's other businesses include blood glucose, continuous glucose, and flash glucose monitoring systems, including test strips, sensors, data management decision software, and accessories for people with diabetes, under the FreeStyle® brand, and medical devices for the eye, including cataract surgery, LASIK surgery, contact lens care products, and dry eye products. These products are marketed worldwide and generally sold directly to wholesalers, government agencies, private health care organizations, health care facilities, mail order pharmacies, and independent retailers from Abbott-owned distribution centers and public warehouses. Some of these products are marketed and distributed through distributors. Blood glucose monitoring systems, contact lens care products, and dry eye products are also marketed and sold to consumers. These products are subject to regulatory changes and competition in technological innovation, price, convenience of use, service, and product performance. INFORMATION WITH RESPECT TO ABBOTT'S BUSINESS IN GENERAL Sources and Availability of Raw Materials Abbott purchases, in the ordinary course of business, raw materials and supplies essential to Abbott's operations from numerous suppliers in the United States and abroad. There have been no recent significant availability problems or supply shortages for raw materials or supplies. 
Patents, Trademarks, and Licenses Abbott is aware of the desirability for patent and trademark protection for its products. Accordingly, where possible, patents and trademarks are sought and obtained for Abbott's products in the United States and countries of interest to Abbott. Abbott owns and is licensed under a substantial number of patents and patent applications. Principal trademarks and the products they cover are discussed in the Narrative Description of Business on pages 1 through 4. These, and various patents which expire during the period to 2035, in the aggregate, are believed to be of material importance in the operation of Abbott's business. Abbott believes that no single patent, license, or trademark is material in relation to Abbott's business as a whole. Patent-related litigation is discussed in Legal Proceedings on page 16. Seasonal Aspects, Customers, Backlog, and Renegotiation There are no significant seasonal aspects to Abbott's business. Abbott has no single customer that, if the customer were lost, would have a material adverse effect on Abbott. Orders for Abbott's products are generally filled on a current basis, and order backlog is not material to Abbott's business. No material portion of Abbott's business is subject to renegotiation of profits or termination of contracts at the election of a government. Research and Development Abbott spent approximately $1.3 billion in 2014, $1.4 billion in 2013, and $1.5 billion in 2012 on research to discover and develop new products and processes and to improve existing products and processes. Environmental Matters Abbott believes that its operations comply in all material respects with applicable laws and regulations concerning environmental protection. Regulations under federal and state environmental laws impose stringent limitations on emissions and discharges to the environment from various manufacturing operations. 
Abbott's capital and operating expenditures for pollution control in 2014 were approximately $14 million and $37 million, respectively. Capital and operating expenditures for pollution control in 2015 are estimated to be $11 million and $40 million, respectively. Abbott has been identified as one of many potentially responsible parties in investigations and/or remediations at several locations in the United States, including Puerto Rico, under the Comprehensive Environmental Response, Compensation, and Liability Act, commonly known as Superfund. Abbott is also engaged in remediation at several other sites, some of which are owned by Abbott, in cooperation with the Environmental Protection Agency (EPA) or similar agencies. While it is not feasible to predict with certainty the final costs related to those investigations and remediation activities, Abbott believes that such costs, together with other expenditures to maintain compliance with applicable laws and regulations concerning environmental protection, should not have a material adverse effect on Abbott's financial position, cash flows, or results of operations. Employees Abbott employed approximately 77,000 persons as of December 31, 2014. Regulation The development, manufacture, marketing, sale, promotion, and distribution of Abbott's products are subject to comprehensive government regulation by the U.S. Food and Drug Administration and similar international regulatory agencies. Government regulation by various international, supranational, federal and state agencies addresses (among other matters) the development and approval to market Abbott's products, as well as the inspection of, and controls over, research and laboratory procedures, clinical investigations, product approvals and manufacturing, labeling, packaging, supply chains, marketing and promotion, pricing and reimbursement, sampling, distribution, quality control, post-market surveillance, record keeping, storage, and disposal practices. 
Abbott's international operations are also affected by trade and investment regulations in many countries. These may require local investment, restrict Abbott's investments, or limit the import of raw materials and finished products. In addition, Abbott is subject to laws and regulations pertaining to health care fraud and abuse, including state and federal anti-kickback and false claims laws in the United States. Prescription drug, nutrition, and medical device manufacturers such as Abbott are also subject to taxes, as well as application, product, user, establishment, and other fees. Governmental agencies can also invalidate intellectual property rights. Compliance with these laws and regulations is costly and materially affects Abbott's business. Among other effects, health care regulations substantially increase the time, difficulty, and costs incurred in obtaining and maintaining approval to market newly developed and existing products. Abbott expects this regulatory environment will continue to require significant technical expertise and capital investment to ensure compliance. Failure to comply can delay the release of a new product or result in regulatory and enforcement actions, the seizure or recall of a product, the suspension or revocation of the authority necessary for a product's production and sale, and other civil or criminal sanctions, including fines and penalties. Abbott's business can also be affected by ongoing studies of the utilization, safety, efficacy, and outcomes of health care products and their components that are regularly conducted by industry participants, government agencies, and others. These studies can call into question the utilization, safety, and efficacy of previously marketed products. 
In some cases, these studies have resulted, and may in the future result, in the discontinuation of marketing of such products in one or more countries, and may give rise to claims for damages from persons who believe they have been injured as a result of their use. Access to human health care products continues to be a subject of investigation and action by governmental agencies, legislative bodies, and private organizations in many countries. A major focus is cost containment. Efforts to reduce health care costs are also being made in the private sector, notably by health care payors and providers, which have instituted various cost reduction and containment measures. Abbott expects insurers and providers will continue attempts to reduce the cost of health care products. Many countries control the price of health care products directly or indirectly, through reimbursement, payment, pricing, coverage limitations, or compulsory licensing. Budgetary pressures on health care payors may also heighten the scope and severity of pricing pressures on Abbott's products for the foreseeable future. In the United States, the federal government regularly evaluates reimbursement for medical procedures in which medical devices and diagnostics may be used. The government follows a diagnosis-related group (DRG) payment system for certain institutional services provided under Medicare or Medicaid and has implemented a prospective payment system (PPS) for services delivered in hospital outpatient, nursing home, and home health settings. DRG and PPS entitle a health care facility to a fixed reimbursement based on the diagnosis and/or procedure rather than actual costs incurred in patient treatment, thereby increasing the incentive for the facility to limit or control expenditures for many health care products. 
Under the Patient Protection and Affordable Care Act and the Health Care and Education Reconciliation Act (together, the Affordable Care Act), Abbott must pay an excise tax on sales of certain medical devices. Medicare also implemented a competitive bidding system for durable medical equipment (including diabetes products), enteral nutrition products, and supplies. The Affordable Care Act also includes provisions known as the Physician Payments Sun\n\nshine Act, which require manufacturers of drugs, devices, and medical supplies covered under Medicare and Medicaid to record any transfers of value to physicians and teaching hospitals and to report this data to the Centers for Medicare and Medicaid Services for subsequent public disclosure. Similar reporting requirements have also been enacted on the state level domestically, and an increasing number of governments worldwide either have adopted or are considering similar laws requiring transparency of interactions with health care professionals. Failure to report appropriate data may result in civil or criminal fines and/or penalties. The regulation of data privacy and security, and the protection of the confidentiality of certain patient health information, is increasing. For example, the European Union continues to contemplate enacting stricter laws with enhanced financial penalties for noncompliance. Similarly, the U.S. Department of Health and Human Services has issued rules governing the use, disclosure, and security of protected health information, and the U.S. Food and Drug Administration has issued further guidance concerning data security for medical devices. Failure to comply with data privacy and security regulations can result in enforcement actions, which could include civil or criminal penalties. Transferring and managing protected health information will become more challenging as new laws and regulations are enacted, and Abbott expects there will be increasing regulatory complexity in this area. 
Governmental cost containment efforts also affect Abbott's nutrition business. In the United States, for example, under regulations governing the federally funded Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), all states must have a cost containment program for infant formula. As a result, through competitive bidding states obtain rebates from manufacturers of infant formula whose products are used in the program. Abbott expects debate to continue during 2015 at all government levels worldwide over the marketing, manufacture, availability, method of delivery, and payment for health care products and services, as well as data privacy and security. Abbott believes that future legislation and regulation in the markets it serves could affect access to health care products and services, increase rebates, reduce prices or reimbursements or the rate of price increases for health care products and services, change health care delivery systems, create new fees and obligations for the pharmaceutical, nutrition, diagnostic, and medical device industries, or require additional reporting and disclosure. It is not possible to predict the extent to which Abbott or the health care industry in general might be affected by the matters discussed above. INTERNATIONAL OPERATIONS As discussed in greater detail in the section captioned, \"Narrative Description of Business,\" Abbott markets products worldwide through affiliates and distributors. Most of the products discussed in the preceding sections of this report are also sold outside the United States. In addition, certain products of a local nature and variations of product lines to meet local regulatory requirements and marketing preferences are manufactured and marketed to customers outside the United States. 
International operations are subject to certain additional risks inherent in conducting business outside the United States, including price and currency exchange controls, changes in currency exchange rates, limitations on foreign participation in local enterprises, expropriation, nationalization, and other governmental action. INTERNET INFORMATION Copies of Abbott's Annual Report on Form 10-K, Quarterly Reports on Form 10-Q, Current Reports on Form 8-K, and amendments to those reports filed or furnished p\n\n*** WARNING: max output size exceeded, skipping output. ***\n\ntions affecting government benefit programs could impose new obligations on Abbott, require Abbott to change its business practices, and restrict its operations in the future. Abbott's industry is subject to various international, supranational, federal, and state laws and regulations pertaining to government benefit program reimbursement, price reporting and regulation, and health care fraud and abuse, including anti-kickback and false claims laws, and international and individual state laws relating to pricing and sales and marketing practices. Violations of these laws may be punishable by criminal and/or civil sanctions, including, in some instances, substantial fines, imprisonment, and exclusion from participation in government health care programs, including Medicare, Medicaid, and Veterans Administration health programs in the U.S. These laws and regulations are broad in scope and they are subject to evolving interpretations, which could require Abbott to incur substantial costs associated with compliance or to alter one or more of its sales or marketing practices. In addition, violations of these laws, or allegations of such violations, could disrupt Abbott's business and result in a material adverse effect on Abbott's revenues, profitability, and financial condition. Changes in the health care regulatory environment may adversely affect Abbott's business. 
A number of the provisions of the U.S. Patient Protection and Affordable Care Act and the Health Care and Education Reconciliation Act of 2010 change access to health care products and services and establish new fees for the medical device industry. Future rulemaking could increase rebates, reduce prices or the rate of price increases for health care products and services, or require additional reporting and disclosure. Abbott cannot predict the timing or impact of any future rulemaking. The expiration or loss of patent protection and licenses may affect Abbott's future revenues and operating income. Many of Abbott's businesses rely on patent and trademark and other intellectual property protection. Although most of the challenges to Abbott's intellectual property have come from other businesses, governments may also challenge intellectual property protections. To the extent Abbott's intellectual property is successfully challenged, invalidated, or circumvented or to the extent it does not allow Abbott to compete effectively, Abbott's businesses could suffer. To the extent that countries do not enforce Abbott's intellectual property rights\n\n or to the extent that countries require compulsory licensing of its intellectual property, Abbott's future revenues and operating income could be reduced. Litigation regarding Abbott's patents and trademarks is described in the section captioned \"Legal Proceedings.\" Competitors' intellectual property may prevent Abbott from selling its products or have a material adverse effect on Abbott's future profitability and financial condition. Competitors may claim that an Abbott product infringes upon their intellectual property. Resolving an intellectual property infringement claim can be costly and time consuming and may require Abbott to enter into license agreements. Abbott cannot guarantee that it would be able to obtain license agreements on commercially reasonable terms. 
A successful claim of patent or other intellectual property infringement could subject Abbott to significant damages or an injunction preventing the manufacture, sale or use of affected Abbott products. Any of these events could have a material adverse effect on Abbott's profitability and financial condition. Abbott's research and development efforts may not succeed in developing commercially successful products and technologies, which may cause Abbott's revenue and profitability to decline. To remain competitive, Abbott must continue to launch new products and technologies. To accomplish this, Abbott commits substantial efforts, funds, and other resources to research and development. A high rate of failure is inherent in the research and development of new products and technologies. Abbott must make ongoing substantial expenditures without any assurance that its efforts will be commercially successful. Failure can occur at any point in the process, including after significant funds have been invested. Promising new product candidates may fail to reach the market or may only have limited commercial success because of efficacy or safety concerns, failure to achieve positive clinical outcomes, inability to obtain necessary regulatory approvals, limited scope of approved uses, excessive costs to manufacture, the failure to establish or maintain intellectual property rights, or infringement of the intellectual property rights of others. Even if Abbott successfully develops new products or enhancements or new generations of Abbott's existing products, they may be quickly rendered obsolete by changing customer preferences, changing industry standards, or competitors' innovations. Innovations may not be accepted quickly in the marketplace because of, among other things, entrenched patterns of clinical practice or uncertainty over third-party reimbursement. 
Abbott cannot state with certainty when or whether any of its products under development will be launched, whether it will be able to develop, license, or otherwise acquire compounds or products, or whether any products will be commercially successful. Failure to launch successful new products or new indications for existing products may cause Abbott's products to become obsolete, causing Abbott's revenues and operating results to suffer. New products and technological advances by Abbott's competitors may negatively affect Abbott's results of operations. Abbott's products face intense competition from its competitors' products. Competitors' products may be safer, more effective, more effectively marketed or sold, or have lower prices or superior performance features than Abbott's products. Abbott cannot predict with certainty the timing or impact of the introduction of competitors' products. The manufacture of many of Abbott's products is a highly exacting and complex process, and if Abbott or one of its suppliers encounters problems manufacturing products, Abbott's business could suffer. The manufacture of many of Abbott's products is a highly exacting and complex process, due in part to strict regulatory requirements. Problems may arise during manufacturing for a variety of reasons, including equipment malfunction, failure to follow specific protocols and procedures, problems with raw materials, natural disasters, and environmental factors. In addition, single suppliers are currently used for certain products and materials. If problems arise during the production of a batch of product, that batch of product may have to be discarded. This could, among other things, lead to increased costs, lost revenue, damage to customer relations, time and expense spent investigating the cause and, depending on the cause, similar losses with respect to other batches or products. 
If problems are not discovered before the product is released to the market, recall and product liability costs may also be incurred. To the extent Abbott or one of its suppliers experiences significant manufacturing problems, this could have a material adverse effect on Abbott's revenues and profitability. Significant safety concerns could arise for Abbott's products, which could have a material adverse effect on Abbott's revenues and financial condition. Health care products typically receive regulatory approval based on data obtained in controlled clinical trials of limited duration. Following regulatory approval, these products will be used over longer periods of time in many patients. Investigators may also conduct additional, and perhaps more extensive, studies. If new safety issues are reported, Abbott may be required to amend the conditions of use for a product. For example, Abbott may be required to provide additional warnings on a product's label or narrow its approved intended use, either of which could reduce the product's market acceptance. If serious safety issues arise with an Abbott product, sales of the product could be halted by Abbott or by regulatory authorities. Safety issues affecting suppliers' or competitors' products also may reduce the market acceptance of Abbott's products. In addition, in the ordinary course of business, Abbott is the subject of product liability claims and lawsuits alleging that its products or the products of other companies that Abbott promotes have resulted or could result in an unsafe condition for or injury to patients. Product liability claims and lawsuits, safety alerts or product recalls, and other allegations of product safety or quality issues, regardless of their validity or ultimate outcome, may have a material adverse effect on Abbott's business and reputation and on Abbott's ability to attract and retain customers. 
Consequences may also include additional costs, a decrease in market share for the products, lower income or exposure to other claims. Product liability losses are self-insured. Product liability claims could have a material adverse effect on Abbott's profitability and financial condition. Deterioration in the economic position and credit quality of certain countries may negatively affect Abbott's results of operations. Unfavorable economic conditions in certain countries may increase the time it takes to collect outstanding trade receivables. Financial instability and fiscal deficits in these countries may result in additional austerity measures to reduce costs, including health care. Deterioration in the quality of sovereign debt, including credit downgrades, could increase Abbott's collection risk where a significant amount of Abbott's receivables in these countries are with governmental health care systems. Abbott depends on sophisticated information technology systems to operate its business and a cyber attack or other breach of these systems could have a material adverse effect on Abbott's results of operations. Similar to other large multi-national companies, the size and complexity of Abbott's information technology systems makes them vulnerable to a cyber attack, malicious intrusion, breakdown, destruction, loss of data privacy, or other significant disruption. Abbott's systems have been and are expected to continue to be the target of malware and other cyber attacks. Abbott has invested in its systems and the protection of its data to reduce the risk of an invasion or interruption and monitors its systems on an ongoing basis for any current or potential threats. There can be no assurance that these measures and efforts will prevent future interruptions or breakdowns that could have a significant effect on Abbott's business. Abbott may incur operational difficulties or be exposed to claims and liabilities as a result of the separation. 
AbbVie and Abbott entered into a separation and distribution agreement and various other agreements to govern the separation of AbbVie from Abbott and the relationship between the two companies going forward. These arrangements could lead to disputes between Abbott and AbbVie over Abbott's rights to certain shared property and rights and over the allocation of costs and revenues for products and operations. The separation and distribution agreement also provides for, among other things, indemnification obligations designed to make AbbVie financially responsible for substantially all liabilities that may exist relating to its business activities, whether incurred prior to or after AbbVie's separation from Abbott, as well as those obligations of Abbott assumed by AbbVie pursuant to the separation and distribution agreement. It is possible that a court would disregard the allocation agreed to between Abbott and AbbVie and require Abbott to assume responsibility for obligations allocated to AbbVie. Third parties could also seek to hold Abbott responsible for any of these liabilities or obligations. The indemnity rights Abbott has under the separation agreement may not be sufficient to protect Abbott. Even if Abbott is successful in obtaining indemnification, Abbott may have to bear losses temporarily. In addition, Abbott's indemnity obligations to AbbVie may be significant. These risks could negatively affect Abbott's results of operations. There could be significant liability if the distribution of AbbVie common stock to Abbott shareholders is determined to be a taxable transaction. Abbott received a private letter ruling from the Internal Revenue Service (IRS) to the effect that, among other things, the separation and the distribution of AbbVie qualifies as a transaction that is tax-free for U.S. federal income tax purposes under Sections 355 and 368(a)(1)(D) of the Internal Revenue Code (the Code). 
In addition, Abbott received an opinion from outside tax counsel to the effect that the separation and distribution qualifies as a transaction that is described in Sections 355(a) and 368(a)(1)(D) of the Code. The ruling and the opinion rely on certain facts, assumptions, representations and undertakings from Abbott and AbbVie regarding the past and future conduct of the companies' respective businesses and other matters. If any of these facts, assumptions, representations or undertakings are incorrect or not satisfied, Abbott and its shareholders may not be able to rely on the ruling or the opinion of tax counsel and could be subject to significant tax liabilities. Notwithstanding the receipt by Abbott of the private letter ruling from the IRS and opinion of tax counsel, the IRS could determine on audit that the separation is taxable if it determines that any of these facts, assumptions, representations or undertakings are not correct or have been violated or if it disagrees with the conclusions in the opinion that are not covered by the private letter ruling, or for other reasons, including as a result of certain significant changes in the share ownership of Abbott or AbbVie after the separation. If the separation is determined to be taxable for U.S. federal income tax purposes, Abbott and its shareholders that are subject to U\n\n.S. federal income tax could incur significant U.S. federal income tax liabilities. Abbott holds a significant investment in Mylan N.V. and is subject to market risk. On February 27, 2015, Abbott completed the disposition of its developed markets branded generics pharmaceuticals business and, in exchange, received 110,000,000 Mylan N.V. ordinary shares. As long as Abbott holds the shares, Abbott will have a substantial undiversified equity investment in Mylan and, therefore, will be subject to the risk of changes in the market value of those shares. 
Fluctuation in foreign currency exchange rates may adversely affect our financial statements and Abbott's ability to realize projected sales and earnings. Although Abbott's financial statements are denominated in U.S. dollars, a significant portion of Abbott's revenues and costs are realized in other currencies. Abbott's profitability is affected by movement of the U.S. dollar against other currencies. Fluctuations in exchange rates between the U.S. dollar and other currencies may also affect the reported value of Abbott's assets and liabilities, as well as its cash flows. Some foreign currencies are subject to government exchange controls. While Abbott enters into hedging arrangements to mitigate some of its foreign currency exposure, Abbott cannot predict with any certainty changes in foreign currency exchange rates or its ability to mitigate these risks. The international nature of Abbott's business subjects it to additional business risks that may cause its revenue and profitability to decline. Abbott's business is subject to risks associated with doing business internationally. Sales outside of the United States make up approximately 70 percent of Abbott's net sales. 
Additional risks associated with Abbott's international operations include: • differing local product preferences and product requirements; • trade protection measures and import or export licensing requirements; • difficulty in establishing, staffing, and managing operations; • differing labor regulations; • potentially negative consequences from changes in or interpretations of tax laws; • political and economic instability, including sovereign debt issues; • price controls, limitations on participation in local enterprises, expropriation, nationalization, and other governmental action; • inflation, recession, and fluctuations in interest rates; • compulsory licensing or diminished protection of intellectual property; and • potential penalties or other adverse consequences for violations of anti-corruption, anti-bribery, and other similar laws and regulations, including the Foreign Corrupt Practices Act and the U.K. Bribery Act. Events contemplated by these risks may, individually or in the aggregate, have a material adverse effect on Abbott's revenues and profitability. Other factors can have a material adverse effect on Abbott's future profitability and financial condition. 
Many other factors can affect Abbott's profitability and its financial condition, including: • changes in or interpretations of laws and regulations, including changes in accounting standards, taxation requirements, product marketing application standards, product labeling, source and use laws, and environmental laws; • differences between the fair value measurement of assets and liabilities and their actual value, particularly for pensions, retiree health care, stock compensation, intangibles, and goodwill; and for contingent liabilities such as litigation, the absence of a recorded amount, or an amount recorded at the minimum, compared to the actual amount; • changes in the rate of inflation (including the cost of raw materials, commodities, and supplies), interest rates, market value of Abbott's equity investments, and the performance of investments held by Abbott or Abbott's employee benefit trusts; • changes in the creditworthiness of counterparties that transact business with or provide services to Abbott or Abbott's employee benefit trusts; • changes in business, economic, and political conditions, including: war, political instability, terrorist attacks, the threat of future terrorist activity and related military action; natural disasters; the cost and availability of insurance due to any of the foregoing events; labor disputes, strikes, slow-downs, or other forms of labor or union activity; and pressure from third-party interest groups; • changes in Abbott's business units and investments and changes in the relative and absolute contribution of each to earnings and cash flow resulting from evolving business strategies, changing product mix, changes in tax laws or tax rates both in the U.S. 
and abroad and opportunities existing now or in the future; • changes in the buying patterns of a major distributor, retailer, or wholesale customer resulting from buyer purchasing decisions, pricing, seasonality, or other factors, or other problems with licensors, suppliers, distributors, and business partners; • changes in credit markets impacting Abbott's ability to obtain financing for its business operations; and • legal difficulties, any of which could preclude or delay commercialization of products or adversely affect profitability, including claims asserting statutory or regulatory violations, and adverse litigation decisions. CAUTIONARY STATEMENT REGARDING FORWARD-LOOKING STATEMENTS This Form 10-K contains forward-looking statements that are based on management's current expectations, estimates, and projections. Words such as \"expects,\" \"anticipates,\" \"intends,\" \"plans,\" \"believes,\" \"seeks,\" \"estimates,\" \"forecasts,\" variations of these words, and similar expressions are intended to identify these forward-looking statements. Certain factors, including but not limited to those identified under \"Item 1A. Risk Factors\" of this Form 10-K, may cause actual results to differ materially from current expectations, estimates, projections, forecasts, and from past results. No assurance can be made that any expectation, estimate, or projection contained in a forward-looking statement will be achieved or will not be affected by the factors cited above or other future events. Abbott undertakes no obligation to release publicly any revisions to forward-looking statements as the result of subsequent events or developments, except as required by law. ITEM 1B.\nITEM 1B. UNRESOLVED STAFF COMMENTS None. ITEM 2.\nITEM 2. PROPERTIES Abbott's corporate offices are located at 100 Abbott Park Road, Abbott Park, Illinois 60064. The locations of Abbott's principal plants, as of December 31, 2014, are listed below. 
Location Segments of Products Produced Abbott Park, Illinois Diagnostic Products Alajuela, Costa Rica Vascular Products Alcobendas, Spain Non-Reportable Altavista, Virginia Nutritional Products Anasco, Puerto Rico * Non-Reportable Baddi, India Established Pharmaceutical Products Barceloneta, Puerto Rico * Established Pharmaceutical and Vascular Products Belgorod, Russia Established Pharmaceutical Products Bogota, Colombia Established Pharmaceutical Products Buenos Aires, Argentina Established Pharmaceutical Products Cali, Colombia Established Pharmaceutical Products Casa Grande, Arizona Nutritional Products Chatillon, France ** Established Pharmaceutical Products Clonmel, Ireland Vascular Products Columbus, Ohio Nutritional Products Cootehill, Ireland Nutritional Products Dartford, England * Diagnostic Products Des Plaines, Illinois Diagnostic Products Donegal, Ireland Non-Reportable Fairfield, California * Nutritional Products Goa, India Established Pharmaceutical Products Granada, Spain Nutritional Products Groningen, the Netherlands Non-Reportable Hangzhou, China Non-Reportable Irving, Texas Diagnostic Products Jhagadia, India Nutritional Products Jiaxing, China Nutritional Products Karachi, Pakistan Established Pharmaceutical Products Katsuyama, Japan ** Established Pharmaceutical Products Lima, Peru Established Pharmaceutical Products Longford, Ireland Diagnostic Products Menlo Park, California Vascular Products Milpitas, California * Non-Reportable Murrieta, California Vascular Products Neustadt, Germany Established Pharmaceutical Products Olst, the Netherlands Established Pharmaceutical Products Ottawa, Canada * Diagnostic Products Pompeya, Argentina Established Pharmaceutical Products Quilmes, Argentina Established Pharmaceutical Products Redwood City, California * Vascular Products Rio de Janeiro, Brazil Established Pharmaceutical Products Santiago, Chile Established Pharmaceutical Products Singapore Nutritional Products Sligo, Ireland * Nutritional and 
Diagnostic Products Sturgis, Michigan Nutritional Products Sunnyvale, California Non-Reportable Temecula, California Vascular Products Tipp City, Ohio Nutritional Products Tlalpan, Mexico Established Pharmaceutical Products Uppsala, Sweden Non-Reportable Weesp, the Netherlands Established Pharmaceutical Products Wiesbaden, Germany Diagnostic Products Witney, England Non-Reportable Zwolle, the Netherlands Nutritional Products *Leased property **Transferred as part of the sale of the developed markets branded generics pharmaceuticals business to Mylan Inc. In addition to the above, as of December 31, 2014, Abbott had manufacturing facilities in three other locations in the United States and in seven countries outside the United States. Abbott's facilities are deemed suitable and provide adequate productive capacity. Abbott's research and development facilities in the United States are primarily located in California, Illinois, New Jersey, and Ohio. Abbott also has research and development facilities in various other countries including China, India, Singapore, Spain, and Switzerland. Except as noted, the corporate offices, and those principal plants in the United States listed above, are owned by Abbott or subsidiaries of Abbott. The remaining manufacturing plants and all other facilities are owned or leased by Abbott or subsidiaries of Abbott. There are no material encumbrances on the properties. ITEM 3.\nITEM 3. LEGAL PROCEEDINGS Abbott is involved in various claims, legal proceedings and investigations, including (as of January 31, 2015, except where noted below) those described below. While it is not feasible to predict the outcome of such pending claims, proceedings and investigations with certainty, management is of the opinion that their ultimate resolution should not have a material adverse effect on Abbott's financial position, cash flows, or results of operations. 
In September 2009, Wyeth, Cordis Corporation, and Cordis LLC filed suit against Abbott in the United States District Court for the District of New Jersey alleging the XIENCE V (and later the XIENCE Prime) stent systems\n\n" + ] + } + ], "source": [ "# Sanity Check\n", "import numpy as np\n", @@ -819,7 +1082,10 @@ "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { - "cellMetadata": {}, + "cellMetadata": { + "byteLimit": 2048000, + "rowLimit": 10000 + }, "inputWidgets": {}, "nuid": "298eb990-9160-4e1b-958f-33dd2c11b54b", "showTitle": false, @@ -827,7 +1093,7 @@ } }, "source": [ - "#### Cost Estimation" + "#### Token Estimation" ] }, { @@ -845,7 +1111,16 @@ "title": "" } }, - "outputs": [], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "output_type": "stream", + "text": [ + "Dataset has ~985088 tokens that will be charged for during training\nBy default, you'll train for 3 epochs on this dataset\nBy default, ~2955264 tokens will be used in training\n" + ] + } + ], "source": [ "MAX_TOKENS_PER_EXAMPLE = FT_API_args.context_length if FT_API_args.context_length is not None else 4096\n", "TARGET_EPOCHS = FT_API_args.training_duration if FT_API_args.training_duration is not None else 1 \n", @@ -854,7 +1129,7 @@ "n_billing_tokens_in_dataset = len(mds_dataset) * FT_API_args.context_length \n", "print(f\"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training\")\n", "print(f\"By default, you'll train for {n_epochs} epochs on this dataset\")\n", - "print(f\"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens\")" + "print(f\"By default, ~{n_epochs * n_billing_tokens_in_dataset} tokens will be used in training\")" ] }, {