Releases: mosaicml/llm-foundry
v0.14.1
New Features
Use log_model for registering models (#1544)
Instead of calling the MLflow register API directly, we now use the intended log_model API, which both logs the model to the MLflow run artifacts and registers it to Unity Catalog.
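For reference, here is a minimal sketch of the user-facing side of this flow: registration is driven by the hf_checkpointer callback, and the Unity Catalog destination comes from mlflow_registered_model_name. The folder, interval, and model name below are placeholder values.
callbacks:
  hf_checkpointer:
    save_folder: s3://my-bucket/checkpoints/huggingface   # placeholder path
    save_interval: 1ep                                    # placeholder interval
    # three-part Unity Catalog name the final checkpoint is registered under
    mlflow_registered_model_name: my_catalog.my_schema.my_model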
What's Changed
- Catch delta table not found error by @milocress in #1625
- Add Mlflow 403 PL UserError by @dakinggg in #1623
- Catches when data prep cluster fails to start by @milocress in #1628
- add another cluster connection failure wrapper by @milocress in #1630
- Use log_model API to register the model by @nancyhung and @dakinggg in #1544
Full Changelog: v0.14.0...v0.14.1
v0.14.0
New Features
Load Checkpoint Callback (#1570)
We added support for Composer's LoadCheckpoint callback, which loads a checkpoint at a specified event. This enables use cases like loading model base weights with peft.
callbacks:
  load_checkpoint:
    load_path: /path/to/your/weights
Breaking Changes
Accumulate over Tokens in a Batch for Training Loss (#1618, #1610, #1595)
We added a new flag, accumulate_train_batch_on_tokens, which specifies whether training loss is accumulated over the number of tokens in a batch rather than the number of samples. It is true by default. This will slightly change loss curves for models trained with padding. The old behavior can be recovered by explicitly setting this flag to False.
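As a minimal sketch, assuming the flag is a top-level key in the train YAML that is passed through to the Composer Trainer, the old behavior would be restored with:
# revert to sample-based loss accumulation (pre-0.14.0 behavior)
accumulate_train_batch_on_tokens: false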
Default Run Name (#1611)
If no run name is provided, we now default to Composer's randomly generated run names. (Previously, we defaulted to "llm" as the run name.)
What's Changed
- Update mcli examples to use 0.13.0 by @irenedea in #1594
- Pass accumulate_train_batch_on_tokens through to composer by @dakinggg in #1595
- Loosen MegaBlocks version pin by @mvpatel2000 in #1597
- Add configurability for hf checkpointer register timeout by @dakinggg in #1599
- Loosen MegaBlocks to <1.0 by @mvpatel2000 in #1598
- Finetuning dataloader validation tweaks by @mvpatel2000 in #1600
- Bump onnx from 1.16.2 to 1.17.0 by @dependabot in #1604
- Remove TE from dockerfile and instead add as optional dependency by @snarayan21 in #1605
- Data prep on multiple GPUs by @eitanturok in #1576
- Add env var for configuring the maximum number of processes to use for dataset processing by @irenedea in #1606
- Updated error message for cluster check by @nancyhung in #1602
- Use fun default composer run names by @irenedea in #1611
- Ensure log messages are properly formatted again by @snarayan21 in #1614
- Add UC not enabled error for delta to json conversion by @irenedea in #1613
- Use a temporary directory for downloading finetuning dataset files by @irenedea in #1608
- Bump composer version to 0.26.0 by @irenedea in #1616
- Add loss generating token counts by @dakinggg in #1610
- Change accumulate_train_batch_on_tokens default to True by @dakinggg in #1618
- Bump version to 0.15.0.dev0 by @irenedea in #1621
- Add load checkpoint callback by @irenedea in #1570
Full Changelog: v0.13.0...v0.14.0
v0.13.0
🚀 LLM Foundry v0.13.0
🛠️ Bug Fixes & Cleanup
PyTorch 2.4 Checkpointing (#1569, #1581, #1583)
Resolved issues related to checkpointing for Curriculum Learning (CL) callbacks.
🔧 Dependency Updates
Bumped tiktoken from 0.4.0 to 0.8.0 (#1572)
Updated onnxruntime from 1.19.0 to 1.19.2 (#1590)
What's Changed
- Update mcli yamls by @dakinggg in #1552
- Use allenai/c4 instead of c4 dataset by @eitanturok in #1554
- Tensor Parallelism by @eitanturok in #1521
- Insufficient Permissions Error when trying to access table by @KuuCi in #1555
- Add NoOp optimizer by @snarayan21 in #1560
- Deterministic GCRP Errors by @KuuCi in #1559
- Simplify CL API by @b-chu in #1510
- Reapply #1389 by @dakinggg in #1561
- Add dataset swap callback by @b-chu in #1536
- Add error to catch more unknown example types by @milocress in #1562
- Add FileExtensionNotFoundError by @b-chu in #1564
- Add InvalidConversationError by @b-chu in #1565
- Release docker img by @KuuCi in #1547
- Revert FT dataloader changes from #1561, keep #1564 by @snarayan21 in #1566
- Cleanup TP by @eitanturok in #1556
- Changes for dataset swap callback by @gupta-abhay in #1569
- Do not consider run_name when auto-detecting autoresume by @irenedea in #1571
- Allow parameters with requires_grad=False in meta init by @sashaDoubov in #1567
- Bump tiktoken from 0.4.0 to 0.8.0 by @dependabot in #1572
- Add extensions to FinetuningFileNotFoundError by @b-chu in #1578
- Handle long file names in convert text to mds by @irenedea in #1579
- Set streaming log level by @mvpatel2000 in #1582
- Fix pytorch checkpointing for CL callback by @b-chu in #1581
- Fix pytorch checkpointing for CL callback by @b-chu in #1583
- Error if filtered dataset contains 0 examples by @irenedea in #1585
- Change cluster errors from NetworkError to UserError by @irenedea in #1586
- Do not autoresume if a default name is set, only on user defined ones by @irenedea in #1588
- Bump onnxruntime from 1.19.0 to 1.19.2 by @dependabot in #1590
- Make FinetuningStreamingDataset parameters more flexible by @XiaohanZhangCMU in #1580
- Add build callback tests by @irenedea in #1577
- Bump version to 0.14.0.dev0 by @irenedea in #1587
- Fix typo in eval code by using 'fsdp' instead of 'fsdp_config' by @irenedea in #1593
Full Changelog: v0.12.0...v0.13.0
v0.12.0
🚀 LLM Foundry v0.12.0
New Features
PyTorch 2.4 (#1505)
This release updates LLM Foundry to the PyTorch 2.4 release, bringing with it support for the new features and optimizations in PyTorch 2.4.
Extensibility improvements (#1450, #1449, #1468, #1467, #1478, #1493, #1495, #1511, #1512, #1527)
Numerous improvements to the extensibility of the modeling and data loading code, enabling easier reuse for subclassing and extending. Please see the linked PRs for more details on each change.
Improved error messages (#1457, #1459, #1519, #1518, #1522, #1534, #1548, #1551)
Various improved error messages, making debugging user errors more clear.
Sliding window in torch attention (#1455)
We've added support for sliding window attention to the reference attention implementation, allowing easier testing and comparison against more optimized attention variants.
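A rough config sketch of enabling this: the field names (attn_impl, sliding_window_size) are assumed to match the existing attn_config options, and the window size is a placeholder.
model:
  name: mpt_causal_lm
  attn_config:
    attn_impl: torch            # reference (non-flash) attention implementation
    sliding_window_size: 1024   # placeholder window size; -1 disables the sliding window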
Bug fixes
Extra BOS token for llama 3.1 with completion data (#1476)
A bug resulted in an extra BOS token being added between the prompt and response during finetuning. This is fixed so that the prompt and response supplied by the user are concatenated without any extra tokens between them.
What's Changed
- Add test for logged_config transforms by @b-chu in #1441
- Bump version to 0.12.0.dev0. by @irenedea in #1447
- Update pytest-codeblocks requirement from <0.17,>=0.16.1 to >=0.16.1,<0.18 by @dependabot in #1445
- Bump coverage[toml] from 7.4.4 to 7.6.1 by @dependabot in #1442
- Enabled generalizing build_inner_model in ComposerHFCausalLM by @gupta-abhay in #1450
- Update llm foundry version in mcli yamls by @irenedea in #1451
- merge to main by @XiaohanZhangCMU in #865
- allow embedding resizing passed through by @jdchang1 in #1449
- Update packaging requirement from <23,>=21 to >=21,<25 by @dependabot in #1444
- Update pytest requirement from <8,>=7.2.1 to >=7.2.1,<9 by @dependabot in #1443
- Implement ruff rules enforcing PEP 585 by @snarayan21 in #1453
- Adding sliding window attn to scaled_multihead_dot_product_attention by @ShashankMosaicML in #1455
- Add user error for UnicodeDecodeError in convert text to mds by @irenedea in #1457
- Fix log_config by @josejg in #1432
- Add EnvironmentLogger Callback by @josejg in #1350
- Update mosaicml/ci-testing to 0.1.2 by @irenedea in #1458
- Correct error message for inference wrapper by @josejg in #1459
- Update CI tests to v0.1.2 by @KuuCi in #1466
- Bump onnxruntime from 1.18.1 to 1.19.0 by @dependabot in #1461
- Update tenacity requirement from <9,>=8.2.3 to >=8.2.3,<10 by @dependabot in #1460
- Simple change to enable mapping functions for ft constructor by @gupta-abhay in #1468
- use default eval interval from composer by @milocress in #1369
- Consistent Naming EnvironmentLoggingCallback by @josejg in #1470
- Register NaN Monitor Callback by @josejg in #1471
- Add train subset num batches by @mvpatel2000 in #1472
- Parent class hf models by @jdchang1 in #1467
- Remove extra bos for prompt/response data with llama3.1 by @dakinggg in #1476
- Add prepare fsdp back by @dakinggg in #1477
- Add date_string when applying tokenizer chat template by @snarayan21 in #1474
- Make sample tokenization extensible by @gupta-abhay in #1478
- Use Streaming version 0.8.1 by @snarayan21 in #1479
- Bump hf-transfer from 0.1.3 to 0.1.8 by @dependabot in #1480
- fix hf checkpointer by @milocress in #1489
- Fix device mismatch when running hf.generate by @ShashankMosaicML in #1486
- Bump composer to 0.24.1 + FSDP config device_mesh deprecation by @snarayan21 in #1487
- master_weights_dtype not supported by ComposerHFCausalLM.__init__() by @eldarkurtic in #1485
- Detect loss spikes and high losses during training by @joyce-chen-uni in #1473
- Enable passing in external position ids by @gupta-abhay in #1493
- Align logged attributes for errors and run metadata in kill_loss_spike_callback.py by @joyce-chen-uni in #1494
- tokenizer is never built when converting finetuning dataset by @eldarkurtic in #1496
- Removing error message for reusing kv cache with torch attn by @ShashankMosaicML in #1497
- Fix formatting of loss spike & high loss error messages by @joyce-chen-uni in #1498
- Enable cross attention layers by @gupta-abhay in #1495
- Update to ci-testing 0.2.0 by @dakinggg in #1500
- [WIP] Torch 2.4 in docker images by @snarayan21 in #1491
- [WIP] Only torch 2.4.0 compatible by @snarayan21 in #1505
- Update mlflow requirement from <2.16,>=2.14.1 to >=2.14.1,<2.17 by @dependabot in #1506
- Update ci-testing to 0.2.2 by @dakinggg in #1503
- Allow passing key_value_states for x-attn through MPT Block by @gupta-abhay in #1511
- Fix cross attention for blocks by @gupta-abhay in #1512
- Put 2.3 image back in release examples by @dakinggg in #1513
- Sort callbacks so that CheckpointSaver goes before HuggingFaceCheckpointer by @irenedea in #1515
- Raise MisconfiguredDatasetError from original error by @irenedea in #1519
- Peft fsdp by @dakinggg in #1520
- Raise DatasetTooSmall exception if canonical nodes is less than num samples by @irenedea in #1518
- Add permissions check for delta table reading by @irenedea in #1522
- Add HuggingFaceCheckpointer option for only registering final checkpoint by @irenedea in #1516
- Replace FSDP args by @KuuCi in #1517
- enable correct padding_idx for embedding layers by @gupta-abhay in #1527
- Revert "Replace FSDP args" by @KuuCi in #1533
- Delete unneeded inner base model in PEFT HF Checkpointer by @snarayan21 in #1532
- Add deprecation warning to fsdp_config by @KuuCi in #1530
- Fix reuse kv cache for torch attention by @ShashankMosaicML in #1539
- Error on text dataset file not found by @milocress in #1534
- Make ICL tasks not required for eval by @snarayan21 in #1540
- Bumping flash attention version to 2.6.3 and adding option for softcap in attention and lm_head logits. by @ShashankMosaicML in #1374
- Register mosaic logger by @dakinggg in #1542
- Hfcheckpointer optional generation config by @KuuCi in #1543
- Bump composer version to 0.25.0 by @dakinggg in #1546
- Bump streaming version to 0.9.0 by @dakinggg in #1550
- Bump version to 0.13.0.dev0 by @dakinggg in #1549
- Add proper user error for accessing schema by @KuuCi in #1548
- Validate Cluster Access Mode by @KuuCi in #1551
New Contributors
- @jdchang1 made their first contribution in #1449
- @joyce-chen-uni made their first contribution in #1473
Full Changelog: v0.11.0...v0.12.0
v0.11.0
🚀 LLM Foundry v0.11.0
New Features
LLM Foundry CLI Commands (#1337, #1345, #1348, #1354)
We've added CLI commands for our commonly used scripts.
For example, instead of calling `composer llm-foundry/scripts/train.py parameters.yaml`, you can now run `composer -c llm-foundry train parameters.yaml`.
Docker Images Contain All Optional Dependencies (#1431)
LLM Foundry Docker images now have all optional dependencies.
Support for Llama3 Rope Scaling (#1391)
To use it, you can add the following to your parameters:
model:
  name: mpt_causal_lm
  attn_config:
    rope: true
    ...
    rope_impl: hf
    rope_theta: 500000
    rope_hf_config:
      type: llama3
      ...
Tokenizer Registry (#1386)
We now have a tokenizer registry so you can easily add custom tokenizers.
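As a rough sketch of how a registered tokenizer could then be referenced from a train config (the tokenizer name below is hypothetical; the registration itself is done in Python against the llmfoundry.registry.tokenizers registry):
tokenizer:
  name: my_custom_tokenizer   # hypothetical name registered via llmfoundry.registry.tokenizers
  kwargs:
    model_max_length: 4096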
LoadPlanner and SavePlanner Registries (#1358)
We now have LoadPlanner and SavePlanner registries so you can easily add custom checkpoint loading and saving logic.
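As a purely illustrative sketch, assuming registered planners are selected by name from the train config (the save_planner/load_planner keys and planner names below are assumptions, not confirmed config fields):
save_planner:
  name: my_save_planner   # hypothetical registered SavePlanner
load_planner:
  name: my_load_planner   # hypothetical registered LoadPlanner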
Faster Auto-packing (#1435)
The auto packing startup is now much faster. To use auto packing with finetuning datasets, you can add `packing_ratio: auto` to your config like so:
train_loader:
  name: finetuning
  dataset:
    ...
    packing_ratio: auto
What's Changed
- Extra serverless by @XiaohanZhangCMU in #1320
- Fixing sequence_id =-1 bug, adding tests by @ShashankMosaicML in #1324
- Registry docs update by @dakinggg in #1323
- Add dependabot by @dakinggg in #1322
- HUGGING_FACE_HUB_TOKEN -> HF_TOKEN by @dakinggg in #1321
- Bump version by @b-chu in #1326
- Relax hf hub pin by @dakinggg in #1314
- Error if metadata matches existing keys by @dakinggg in #1313
- Update transformers requirement from <4.41,>=4.40 to >=4.42.3,<4.43 by @dependabot in #1327
- Bump einops from 0.7.0 to 0.8.0 by @dependabot in #1328
- Bump onnxruntime from 1.15.1 to 1.18.1 by @dependabot in #1329
- Bump onnx from 1.14.0 to 1.16.1 by @dependabot in #1331
- Currently multi-gpu generate does not work with hf.generate for hf checkpoints. This PR fixes that. by @ShashankMosaicML in #1332
- Fix registry for callbacks with configs by @mvpatel2000 in #1333
- Adding a child class of hf's rotary embedding to make hf generate work on multiple gpus. by @ShashankMosaicML in #1334
- Add a config arg to just save an hf checkpoint by @dakinggg in #1335
- Deepcopy config in callbacks_with_config by @mvpatel2000 in #1336
- Avoid HF race condition by @dakinggg in #1338
- Nicer error message for undefined symbol by @dakinggg in #1339
- Bump sentencepiece from 0.1.97 to 0.2.0 by @dependabot in #1342
- Removing logging exception through update run metadata by @jjanezhang in #1292
- [MCLOUD-4910] Escape UC names during data prep by @naren-loganathan in #1343
- Add CLI for train.py by @KuuCi in #1337
- Add fp32 to the set of valid inputs to attention layer by @j316chuck in #1347
- Log all extraneous_keys in one go for ease of development by @josejg in #1344
- Fix MLFlow Save Model for TE by @j316chuck in #1353
- Add flag for saving only composer checkpoint by @irenedea in #1356
- Expose flag for should_save_peft_only by @irenedea in #1357
- Command utils + train by @KuuCi in #1361
- Readd Clear Resolver by @KuuCi in #1365
- Add Eval to Foundry CLI by @KuuCi in #1345
- Enhanced Logging for convert_delta_to_json and convert_text_to_mds by @vanshcsingh in #1366
- Add convert_dataset_hf to CLI by @KuuCi in #1348
- Add missing init by @KuuCi in #1368
- Make ICL dataloaders build lazily by @josejg in #1359
- Add option to unfuse Wqkv by @snarayan21 in #1367
- Add convert_dataset_json to CLI by @KuuCi in #1349
- Add convert_text_to_mds to CLI by @KuuCi in #1352
- Fix hf dataset hang on small dataset by @dakinggg in #1370
- Add LoadPlanner and SavePlanner registries by @irenedea in #1358
- Load config on rank 0 first by @dakinggg in #1371
- Add convert_finetuning_dataset to CLI by @KuuCi in #1354
- Allow for transforms on the model before MLFlow registration by @snarayan21 in #1372
- Allow flash attention up to 3 by @dakinggg in #1377
- Update accelerate requirement from <0.26,>=0.25 to >=0.32.1,<0.33 by @dependabot in #1341
- update runners by @KevDevSha in #1360
- Allow for multiple workers when autopacking by @b-chu in #1375
- Allow train.py-like config for eval.py by @josejg in #1351
- Fix load and save planner config logic by @irenedea in #1385
- Do dtype conversion in torch hook to save memory by @irenedea in #1384
- Get a shared file system safe signal file name by @dakinggg in #1381
- Add transformation method to hf_causal_lm by @irenedea in #1383
- [kushalkodnad/tokenizer-registry] Introduce new registry for tokenizers by @kushalkodn-db in #1386
- Bump transformers version to 4.43.1 by @dakinggg in #1388
- Add convert_delta_to_json to CLI by @KuuCi in #1355
- Revert "Use utils to get shared fs safe signal file name (#1381)" by @dakinggg in #1389
- Avoid race condition in convert text to mds script by @dakinggg in #1390
- Refactor loss function for ComposerMPTCausalLM by @irenedea in #1387
- Revert "Allow for multiple workers when autopacking (#1375)" by @dakinggg in #1392
- Bump transformers to 4.43.2 by @dakinggg in #1393
- Support rope scaling by @milocress in #1391
- Removing the extra LlamaRotaryEmbedding import by @ShashankMosaicML in #1394
- Dtensor oom by @dakinggg in #1395
- Condition the meta initialization for hf_causal_lm on pretrain by @irenedea in #1397
- Fix license link in readme by @dakinggg in #1398
- Enable passing epsilon when building norm layers by @gupta-abhay in #1399
- Add pre register method for mlflow by @dakinggg in #1396
- add it by @dakinggg in #1400
- Remove orig params default by @dakinggg in #1401
- Add spin_dataloaders flag by @dakinggg in #1405
- Remove curriculum learning error when duration less than saved timestamp by @b-chu in #1406
- Set pretrained model name correctly, if provided, in HF Checkpointer by @snarayan21 in #1407
- Enable QuickGelu Function for CLIP models by @gupta-abhay in #1408
- Bump streaming version to v0.8.0 by @mvpatel2000 in #1411
- Kevin/ghcr build by @KevDevSha in #1413
- Update accelerate requirement from <0.33,>=0.25 to >=0.25,<0.34 by @dependabot in #1403
- Update huggingface-hub requirement from <0.24,>=0.19.0 to >=0.19.0,<0.25 by @dependabot in #1379
- Make Pytest log in color in Github Action by @eitanturok in https://github.com/mosaicml/llm-fo...