[SYSTEMDS-2951] Multi-GPU Support for End-to-End ML Pipelines #2050

Open
wants to merge 25 commits into
base: main
Changes from 2 commits
25 commits
d02782b
fix: multiple GPUs cache error
WDRshadow Jul 13, 2024
eb77efc
fix: mismatch of ParForProgramBlock._numThreads
WDRshadow Jul 13, 2024
6a21cd8
update: mismatch of ParForProgramBlock._numThreads
WDRshadow Jul 14, 2024
e8f79ac
Add Example-ResNet.dml
Jul 15, 2024
99d5c01
update: MultiGPUTest (draft)
WDRshadow Jul 16, 2024
c56144a
Commit MultiGPUTest
Jul 17, 2024
25bde5d
Merge branch 'main' of https://github.com/WDRshadow/systemds
Jul 17, 2024
9ea0f02
Merge branch 'main' into SYSTEMDS-2951-dev
Jul 17, 2024
9ed077c
update: delete unnecessary codes
WDRshadow Jul 17, 2024
9b67303
update: delete unnecessary codes 2
WDRshadow Jul 17, 2024
18ffb44
update: MultiGPUTest
WDRshadow Jul 17, 2024
46cfbc1
update: initialize SingleGPUTest available gpu
WDRshadow Jul 18, 2024
62a0465
update: Tests for multi-GPU completed
WDRshadow Jul 18, 2024
ea9fce8
update: add _numTasks check for test
WDRshadow Jul 18, 2024
0c6a950
update: delete _numThreads check.
WDRshadow Jul 18, 2024
28c7cac
new GPUTest
Jul 19, 2024
a35c8bd
Update: SingleGPUTest
Jul 19, 2024
d6b2d44
update: delete unnecessary codes 3
WDRshadow Jul 19, 2024
729200d
Update: Test batch size from 100K to 500K, print the all the time.
Jul 19, 2024
0f01ff5
update: delete unnecessary codes 4
WDRshadow Jul 20, 2024
cc136d8
update: modified the test instances
WDRshadow Jul 20, 2024
7cf12a6
update: modified the test instances
Jul 19, 2024
f44fda6
Merge remote-tracking branch 'origin/SYSTEMDS-2951-dev-batch' into SY…
WDRshadow Jul 20, 2024
9001604
Update: GPUTest.dml add time counter
Jul 21, 2024
cf08084
Update: write and read the model, train and predict perspectively
KexingLi22 Jul 22, 2024
@@ -755,6 +755,8 @@ private void executeLocalParFor( ExecutionContext ec, IntObject from, IntObject

//restrict recompilation to thread local memory
setMemoryBudget();

+ _numThreads = Math.min(_numThreads, ec.getNumGPUContexts());

final LocalTaskQueue<Task> queue = new LocalTaskQueue<>();
final Thread[] threads = new Thread[_numThreads];
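
The line added in the `executeLocalParFor` hunk caps the local worker count at the number of GPU contexts. A minimal sketch of that idea outside SystemDS follows; the class `WorkerCap` and the methods `capWorkers` and `assignDevices` are hypothetical names used for illustration only, and plain integers stand in for the execution context.

```java
import java.util.ArrayList;
import java.util.List;

public class WorkerCap {
    // Hypothetical helper mirroring Math.min(_numThreads, ec.getNumGPUContexts()).
    public static int capWorkers(int requestedThreads, int numGpuContexts) {
        return Math.min(requestedThreads, numGpuContexts);
    }

    // Give each worker index its own device index, so no two workers share a GPU.
    public static List<Integer> assignDevices(int requestedThreads, int numGpuContexts) {
        int workers = capWorkers(requestedThreads, numGpuContexts);
        List<Integer> devices = new ArrayList<>();
        for (int w = 0; w < workers; w++)
            devices.add(w);
        return devices;
    }

    public static void main(String[] args) {
        System.out.println(capWorkers(8, 2));    // prints 2
        System.out.println(assignDevices(8, 2)); // prints [0, 1]
    }
}
```

In this sketch a request for 8 threads on a 2-GPU machine yields 2 workers. With zero reported contexts the result is 0, so a caller would presumably need to treat that case separately.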
@@ -22,8 +22,8 @@
import java.io.File;
import java.io.IOException;
import java.lang.ref.SoftReference;
- import java.util.HashMap;
import java.util.Map;
+ import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.commons.lang3.mutable.MutableBoolean;
@@ -214,7 +214,7 @@ public enum CacheStatus {
//for lazily evaluated RDDs, and (2) as abstraction for environments that do not necessarily have spark libraries available
private RDDObject _rddHandle = null; //RDD handle
private BroadcastObject<T> _bcHandle = null; //Broadcast handle
- protected HashMap<GPUContext, GPUObject> _gpuObjects = null; //Per GPUContext object allocated on GPU
+ protected ConcurrentHashMap<GPUContext, GPUObject> _gpuObjects = null; //Per GPUContext object allocated on GPU

private LineageItem _lineage = null;

@@ -229,7 +229,7 @@ protected CacheableData(DataType dt, ValueType vt) {
_uniqueID = _seq.getNextID();
_cacheStatus = CacheStatus.EMPTY;
_numReadThreads = 0;
- _gpuObjects = DMLScript.USE_ACCELERATOR ? new HashMap<>() : null;
+ _gpuObjects = DMLScript.USE_ACCELERATOR ? new ConcurrentHashMap<>() : null;
WDRshadow marked this conversation as resolved.
}

/**
@@ -472,7 +472,7 @@ public synchronized GPUObject getGPUObject(GPUContext gCtx) {

public synchronized void setGPUObject(GPUContext gCtx, GPUObject gObj) {
if( _gpuObjects == null )
- _gpuObjects = new HashMap<>();
+ _gpuObjects = new ConcurrentHashMap<>();
GPUObject old = _gpuObjects.put(gCtx, gObj);
if (old != null)
throw new DMLRuntimeException("GPU : Inconsistent internal state - this CacheableData already has a GPUObject assigned to the current GPUContext (" + gCtx + ")");
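
The `setGPUObject` hunk relies on `put` returning the previous mapping to detect a second assignment for the same context. The sketch below isolates that pattern with a `ConcurrentHashMap`. It is not the SystemDS implementation: `GpuObjectRegistry` is a hypothetical class, `String` stands in for `GPUContext`, `Object` stands in for `GPUObject`, and `IllegalStateException` replaces `DMLRuntimeException`.

```java
import java.util.concurrent.ConcurrentHashMap;

public class GpuObjectRegistry {
    // Stand-in types: String replaces GPUContext, Object replaces GPUObject.
    private final ConcurrentHashMap<String, Object> gpuObjects = new ConcurrentHashMap<>();

    public void set(String ctx, Object obj) {
        // put() returns the previous value for the key, or null if there was none.
        Object old = gpuObjects.put(ctx, obj);
        if (old != null)
            throw new IllegalStateException("GPU object already assigned for context " + ctx);
    }

    public Object get(String ctx) {
        return gpuObjects.get(ctx);
    }

    public int size() {
        return gpuObjects.size();
    }

    public static void main(String[] args) throws InterruptedException {
        GpuObjectRegistry reg = new GpuObjectRegistry();
        Thread[] workers = new Thread[4];
        // One worker per (hypothetical) GPU context, each registering its own object.
        for (int i = 0; i < workers.length; i++) {
            final String ctx = "gpu" + i;
            workers[i] = new Thread(() -> reg.set(ctx, new Object()));
            workers[i].start();
        }
        for (Thread t : workers)
            t.join();
        System.out.println(reg.size()); // prints 4
    }
}
```

As in the diff, the duplicate check runs after `put` has already replaced the old entry. If the old value should be kept, `putIfAbsent` would be one alternative.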