Releases: FedML-AI/FedML
FedML 0.8.9
FEDML Open Source: A Unified and Scalable Machine Learning Library for Running Training and Deployment Anywhere at Any Scale
Backed by FEDML Nexus AI: Next-Gen Cloud Services for LLMs & Generative AI (https://nexus.fedml.ai)
FedML Documentation: https://doc.fedml.ai
FedML Homepage: https://fedml.ai/
FedML Blog: https://blog.fedml.ai/
FedML Medium: https://medium.com/@FedML
FedML Research: https://fedml.ai/research-papers/
Join the Community:
Slack: https://join.slack.com/t/fedml/shared_invite/zt-havwx1ee-a1xfOUrATNfc9DFqU~r34w
Discord: https://discord.gg/9xkW8ae6RV
FEDML® stands for Foundational Ecosystem Design for Machine Learning. FEDML Nexus AI is the next-gen cloud service for LLMs & Generative AI. It helps developers to launch complex model training, deployment, and federated learning anywhere on decentralized GPUs, multi-clouds, edge servers, and smartphones, easily, economically, and securely.
Highly integrated with the FEDML open source library, FEDML Nexus AI provides holistic support of three interconnected AI infrastructure layers: user-friendly MLOps, a well-managed scheduler, and high-performance ML libraries for running any AI job across GPU clouds.
A typical workflow is shown in the figure above. When a developer wants to run a pre-built job in Studio or Job Store, FEDML® Launch swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. When running the job, FEDML® Launch orchestrates the compute plane in different cluster topologies and configurations so that any complex AI job is enabled, whether it is model training, deployment, or even federated learning. FEDML® Open Source is a unified and scalable machine learning library for running these AI jobs anywhere at any scale.
In the MLOps layer of FEDML Nexus AI
- FEDML® Studio embraces the power of Generative AI! Access popular open-source foundational models (e.g., LLMs), fine-tune them seamlessly with your specific data, and deploy them scalably and cost-effectively using FEDML Launch on the GPU marketplace.
- FEDML® Job Store maintains a list of pre-built jobs for training, deployment, and federated learning. Developers are encouraged to run them directly with customized datasets or models on cheaper GPUs.
In the scheduler layer of FEDML Nexus AI
- FEDML® Launch swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. It supports a range of compute-intensive jobs for generative AI and LLMs, such as large-scale training, serverless deployments, and vector DB searches. FEDML Launch also facilitates on-prem cluster management and deployment on private or hybrid clouds.
In the Compute layer of FEDML Nexus AI
- FEDML® Deploy is a model serving platform for high scalability and low latency.
- FEDML® Train focuses on distributed training of large and foundational models.
- FEDML® Federate is a federated learning platform backed by the most popular federated learning open-source library and the world’s first FLOps (federated learning Ops), offering on-device training on smartphones and cross-cloud GPU servers.
- FEDML® Open Source is a unified and scalable machine learning library for running these AI jobs anywhere at any scale.
Contributing
FedML embraces and thrives on open source. We welcome all kinds of contributions from the community. Kudos to all of our amazing contributors!
FedML has adopted the Contributor Covenant.
FedML 0.8.7
What's Changed
New Features
- [CoreEngine/MLOps] Supported LLM record logging.
- [Serving] Made the DeepSpeed inference backend work.
- [CoreEngine/DevOps] Scheduled the public cloud server onto specific nodes.
- [DevOps] Added the fedml light docker image and related documents.
- [DevOps] Built and pushed light docker images and related pipelines.
- [CoreEngine] Added timestamp when reporting system metrics.
- [DevOps] Made the serving k8s cluster work with the latest images and updated related chart files.
- [CoreEngine] Added the skip_log_model_net option for LLM training.
- [CoreEngine/CrossSilo] Supported customized hierarchical cross-silo.
- [Serving] Created the default model config and readme file if the user did not provide any model config and readme options when creating a model card.
- [Serving] Allowed users to customize their token for endpoints and inference.
Bug Fixes
- [CoreEngine] Fixed compatibility when opening subprocesses on Windows.
- [CoreEngine] Fixed the issue where MPI mode did not have client rank -1.
- [CoreEngine] Set the Python interpreter based on the currently running Python version.
- [CoreEngine] Fixed the failure to verify the pip SSL certificate when checking OTA versions.
- [CrossDevice] Fixed issues where the test metrics are reported twice to MLOps and loss metrics are clipped to integers on the Beehive platform.
- [App] Fixed issues when installing flamby on the heart-disease app.
- [CoreEngine] Added a handler for cases where the output and error strings cannot be decoded as UTF-8.
- [App] Fixed scripts and requirements on the FedNLP app.
- [CoreEngine] Fixed issues where FileExistsError was triggered for all os.makedirs calls.
- [Serving] Changed the model url to open.fedml.ai.
- [Serving] Fixed the OnnxExporterError issue and added ONNX as a default dependency when installing fedml.
- [Serving] Fixed the issue where the local package name differed from the one shown in the MLOps UI.
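Two of the CoreEngine fixes above (undecodable UTF-8 output and FileExistsError from os.makedirs) follow common Python patterns. The snippet below is a minimal sketch of those patterns using only the standard library; the helper names `decode_output` and `ensure_dir` are hypothetical and are not taken from the FedML code base.

```python
import os


def decode_output(raw: bytes) -> str:
    """Decode subprocess output as UTF-8.

    Bytes that are not valid UTF-8 are replaced with U+FFFD instead of
    raising UnicodeDecodeError. (Hypothetical helper, for illustration.)
    """
    return raw.decode("utf-8", errors="replace")


def ensure_dir(path: str) -> str:
    """Create a directory tree and return its path.

    exist_ok=True makes the call a no-op when the directory already
    exists, so FileExistsError is not raised on repeated calls.
    (Hypothetical helper, for illustration.)
    """
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_dir` twice on the same path succeeds both times, and `decode_output` always returns a string even for malformed input.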
Enhancements
- [Serving] Established the container based on the user's config and improved code readability.
FedML 0.8.4
What's Changed
New Features in 0.8.4
At FedML, our mission is to remove the friction and pain points of converting your ML & AI models from R&D into production-scale distributed and federated training & serving via our no-code MLOps platform.
FedML is happy to announce update 0.8.4. This release is filled with new capabilities, bug fixes, and enhancements. A key announcement is the launch of FedLLM, which simplifies and reduces the cost of training and serving large language models. You can read more about it in our blog post.
New Features
- [CoreEngine/MLOps] Launched FedLLM (Federated Large Language Model) for training and serving. (GitHub, Blog)
- [CoreEngine] Deployed Helm Charts to our repository for packaging and ease of deploying on Kubernetes. https://github.com/FedML-AI/FedML/blob/master/installation/install_on_k8s/fedml-edge-client-server/fedml-server-deployment-latest.tgz https://github.com/FedML-AI/FedML/blob/master/installation/install_on_k8s/fedml-edge-client-server/fedml-client-deployment-latest.tgz
- [Documents] Refactored the devops and installation structures (devops for internal pipelines, installation for external users). https://github.com/FedML-AI/FedML/tree/master/installation
- [DevOps] Deployed a new fedml-light Docker image and related documents. (DockerHub, GitHub doc)
- [DevOps] Built the light Docker image to deploy to the k8s cluster and refined the k8s-related installation sections in the document. https://hub.docker.com/r/fedml/fedml-edge-client-server-light/tags
- [CoreEngine] Added support for multiple simultaneous training jobs when using our open source MLOps commands.
- [CoreEngine] Improved training health monitoring and the reporting of failed status.
- [CoreEngine] Added APIs for enabling, disabling and querying client agent status. The APIs are as follows:
  curl -XPOST http://localhost:40800/fedml/api/v2/disableAgent -d '{}'
  curl -XPOST http://localhost:40800/fedml/api/v2/enableAgent -d '{}'
  curl -XPOST http://localhost:40800/fedml/api/v2/queryAgentStatus -d '{}'
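The same three agent endpoints can also be called from Python. The following is a rough sketch using only the standard library; the base URL and endpoint names come from the curl commands above, while the helper names `build_agent_request` and `call_agent` are hypothetical. The response format is not documented in these notes, so the sketch returns the raw response text.

```python
import json
import urllib.request

# Base URL and action names are taken from the curl commands above.
AGENT_API_BASE = "http://localhost:40800/fedml/api/v2"
AGENT_ACTIONS = ("disableAgent", "enableAgent", "queryAgentStatus")


def build_agent_request(action: str) -> urllib.request.Request:
    """Build a POST request with an empty JSON object as the body.

    (Hypothetical helper; mirrors `curl -XPOST <url> -d '{}'`.)
    """
    if action not in AGENT_ACTIONS:
        raise ValueError(f"unknown agent action: {action!r}")
    return urllib.request.Request(
        f"{AGENT_API_BASE}/{action}",
        data=json.dumps({}).encode("utf-8"),
        method="POST",
    )


def call_agent(action: str, timeout: float = 5.0) -> str:
    """Send the request to a locally running agent and return the raw body.

    Requires the client agent to be listening on localhost:40800.
    """
    request = build_agent_request(action)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `call_agent("queryAgentStatus")` would query the status of an agent running on the same machine.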
Bug Fixes
- [CoreEngine] Created distinct device ids when running multiple Docker containers to simulate multiple clients or silos on one machine. The device id is now the product id plus a random id.
- [CoreEngine] Fixed a device assignment issue in get_torch_device in the distributed training mode.
- [Serving] Fixed the exceptions that occurred when recovering at startup after upgrading.
- [CoreEngine] Fixed the device id issue when running in Docker on macOS.
- [App] Fixed an issue in the FedProx + SAGE graph regression and graph classification apps.
- [App] Fixed an issue with the heart disease app failing when running in MLOps.
- [App] Fixed an issue with the heart disease app's performance curve.
- [App/Android] Enhanced the Android starting/stopping mechanism and fixed the following issues:
  - Fixed status displays after stopping the run.
  - When stopping a run during a round that has not finished, the MNN process now remains in the IDLE state (it previously went OFFLINE).
  - When stopping after a round is done, the training now stops.
  - Corrected the Python server TAG in the logs, so the server is now easy to find in the logs.
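The device id fix in this list combines a product id with a random id so that several containers on one machine do not collide. A minimal sketch of that idea is shown below; the function name `make_device_id`, the `-` separator, and the use of a UUID for the random part are all assumptions for illustration, not the actual FedML implementation.

```python
import uuid


def make_device_id(product_id: str) -> str:
    """Return the product id followed by a random suffix.

    Hypothetical sketch: each call yields a different id, so containers
    sharing the same product id still get distinct device ids.
    """
    random_part = uuid.uuid4().hex  # 32 random hex characters
    return f"{product_id}-{random_part}"
```

Two containers started from the same image would share a product id but receive different suffixes.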
Enhancements
- [Serving] Tested the inference backend and checked the response after the model deployment is finished.
- [CoreEngine/Serving] Set the GPU option based on the availability of CUDA when running the inference backend, and optimized the MQTT connection checking.
- [CoreEngine] Stored model caches in the user home directory when running federated learning.
- [CoreEngine] Added the device id to the monitor message when processing inference requests.
- [CoreEngine] Reported the runner exception and ignored exceptions when the bootstrap section is missing in fedml_config.yaml.
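Choosing the GPU option from CUDA availability, as mentioned in the enhancements above, is typically done with a check like the one below. This is a hedged sketch and not the FedML code: the function name `pick_inference_device` is hypothetical, and it assumes PyTorch as the framework, falling back to the CPU when PyTorch is not installed.

```python
def pick_inference_device() -> str:
    """Return "cuda" when a CUDA device is usable, otherwise "cpu".

    Hypothetical helper: PyTorch is imported lazily so the function
    still works (and returns "cpu") on machines without PyTorch.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The returned string can be passed to `torch.device(...)` when PyTorch is available.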
FedML 0.8.3
What's Changed
New Features
- [CoreEngine/MLOps] Introducing the FedML OTA (Over-the-Air) upgrade mechanism for the training platform and serving platform.
- [Documents] Added guidance for the OTA mechanism in the user guide document.
Bug Fixes
- [Serving] Fixed an issue where exceptions occurred when activating the model inference.
- [CoreEngine] Fixed an issue where aggregator exceptions occurred when running MPI scripts.
- [Documents] Fixed broken links in the user guide document.
- [CoreEngine] Added a check for an empty current job in the get_current_job_status API.
- [CoreEngine] Fixed a high CPU usage issue when the reload option was enabled in the client API.
Enhancements
- [Serving] Improved data syncing between Redis server and Sqlite database.
- [Serving] Implemented the use of triple elements (end point name/model name/model version) to identify each inference API request.
- [DevOps] Updated Jenkinsfile to automate the building and deployment of the model serving Docker to the K8s cluster.
- [Serving] Implemented the model monitor stop functionality when deactivating and deleting the model deployment.
- [Serving] Checked the status of the end point when recovering on startup.
- [CoreEngine] Refactored the OTA upgrade process for improved robustness.
- [CoreEngine] Attached logs to the new Run ID when initiating a new run or deploying a model.
- [CoreEngine] Refined upgrade status messages for enhanced clarity.
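The serving enhancement above identifies each inference API request by three elements: end point name, model name, and model version. One way to picture such a composite key is sketched below; the class name `InferenceKey`, its field names, and the `/`-joined string form are hypothetical and only illustrate the idea.

```python
from typing import Dict, NamedTuple


class InferenceKey(NamedTuple):
    """Hypothetical composite key for one deployed inference API."""

    end_point_name: str
    model_name: str
    model_version: str

    def as_string(self) -> str:
        # Join the three elements into one readable identifier.
        return "/".join(self)


# Example registry mapping each triple to some per-deployment state.
request_counts: Dict[InferenceKey, int] = {}


def record_request(key: InferenceKey) -> int:
    """Count one request for the deployment identified by the triple."""
    request_counts[key] = request_counts.get(key, 0) + 1
    return request_counts[key]
```

Because a named tuple is hashable and compares by value, two requests carrying the same three elements map to the same entry, while a different model version yields a separate one.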
FedML 0.8.2
What's Changed
New Features
- [CoreEngine/MLOps] Refactored the entire serving platform to make it run more smoothly on the Kubernetes cluster.
Bug Fixes
- [Training] Fixed the issue where the training status was still shown as running after training finished.
- [Training] Fixed the issue where the Parrot platform could not collect and analyze metrics, events and logs.
- [CoreEngine] Made the device unique in the Docker container.
Enhancements
- [CoreEngine/MLOps] Fixed print logs not showing on the MLOps distributed logging platform.
- [CoreEngine/MLOps] Used the bootstrap script to upgrade the version of FedML when the pip package does not need to be published.
FedML 0.8.0
FedML Open and Collaborative AI Platform
Train, deploy, monitor, and improve machine learning models anywhere (edge/cloud) powered by collaboration on combined data, models, and computing resources
What's Changed
Feature Overview
- Supports MLOps (https://open.fedml.ai)
- Multiple scenarios:
  - FedML Octopus: Cross-silo Federated Learning
  - FedML Beehive: Cross-device Federated Learning
  - FedML Parrot: FL Simulation with Single Process or Distributed Computing, smooth migration from research to production
  - FedML Spider: Federated Learning on Web Browsers
- Support Any Machine Learning Framework: PyTorch, TensorFlow, JAX with Haiku, and MXNet.
- Diverse communication backends (MPI, gRPC, PyTorch RPC, MQTT + S3)
- Differential Privacy (CDP-central DP; LDP-local DP)
- Attacker (API: fedml.core.FedMLAttacker); README: python/fedml/core/security/readme.md
- Defender (API: fedml.core.FedMLDefender); README: python/fedml/core/security/readme.md
- Secure Aggregation (multi-party computation): cross_silo/light_sec_agg_example
- In FedML/python/app folder, we provide applications in real-world settings.
- Enable federated model inference at MLOps (https://open.fedml.ai)
For more detailed instructions, please refer to https://doc.fedml.ai/
New Features
- [Serving] Made all serving pipelines work: device login, model creation, model packaging, model pushing, model deployment and model monitoring.
- [Serving] Made all three entry points for creating model cards work: from the trained model list, from the web page for creating model cards, and from the fedml model CLI.
- [OpenSource] Formally released all of the previous versions as v0.8.0: training, security, aggregator, communication backends, MQTT optimization, metrics tracing, events tracing, realtime logs.
Bug Fixes
- [CoreEngine] Fixed a CLI engine error when running simulation.
- [Serving] Adjusted the training code to follow the ONNX sequence rule.
- [Serving] Fixed a URL error in the model serving platform.
Enhancements
- [CoreEngine/MLOps][log] Formatted the log time to NTP time.
- [CoreEngine/MLOps] Showed the progress bar and the size of the transferred data in the log when the client downloads and uploads the model.
- [CoreEngine] Optimized the client for weak or disconnected networks.
The old FedML library, from before the FedML company was incorporated (as of 2022-04-29)
Revert "Revert "update events for different device id."" This reverts commit 2a8706fadfa0337e086ddaf1356aaecd0edc3170.