Home
We are developing a group of tools that systematically pinpoint and resolve latent software contention in every component of the software stack, working from userspace. To pinpoint the exact scalability culprits, we need to profile three things: the memory allocator, synchronization events, and system calls. MMProf is our memory allocator profiler, SyncPerf is our synchronization event profiler, and Scaler (this project) is our system call profiler, which aims to identify contention issues caused by the underlying OS.
Specifically, Scaler is a system call profiler designed to identify scalability issues in C/C++ programs and attribute them to the user's code. We then plan to extend its functionality so that it can also detect issues in machine learning programs.
To make Scaler work, the following problems must be resolved:
1. Recording events
   1. Which types of system calls should we record?
   2. What information should we record for each event?
   3. How do we intercept and record those events?
   4. How do we store events efficiently?
2. Aggregating events
   - How do we aggregate contention and attribute it to multiple components in the system?
3. How do we find critical paths among those events?
4. How do we extend Scaler to make it suitable for machine learning programs?
Currently, we are prioritizing tasks 1.1-1.3 and are also investigating task 4.
For tasks 1.1-1.3, Steven is building a PLT hook library to intercept function calls. Please check this article for details.
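To make the interception idea concrete, here is a minimal sketch of the principle behind a hook, not Scaler's actual implementation. It does not patch a real PLT or GOT. Instead, it models the GOT entry as an ordinary function pointer (`getpid_slot`) and swaps in a wrapper that timestamps each call. The names `call_event`, `event_log`, `getpid_slot`, `hooked_getpid`, and `install_hook` are all hypothetical, and the recorded fields are an assumption about what an event might contain.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical event record; the fields are assumptions, not Scaler's format. */
typedef struct {
    const char *name;     /* name of the intercepted function */
    uint64_t    start_ns; /* monotonic timestamp at entry */
    uint64_t    end_ns;   /* monotonic timestamp at return */
} call_event;

#define MAX_EVENTS 1024
call_event event_log[MAX_EVENTS]; /* fixed-size log; not thread-safe */
size_t     event_count = 0;

typedef pid_t (*getpid_fn)(void);

/* Stand-in for a GOT entry: callers go through this slot, not the symbol. */
getpid_fn getpid_slot = getpid;
/* Saved original target, filled in when the hook is installed. */
getpid_fn real_getpid = NULL;

/* Current monotonic time in nanoseconds. */
uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Wrapper: time the original call and append one event to the log. */
pid_t hooked_getpid(void) {
    assert(real_getpid != NULL); /* the hook must be installed first */
    uint64_t start = now_ns();
    pid_t result = real_getpid();
    uint64_t end = now_ns();
    if (event_count < MAX_EVENTS) {
        event_log[event_count].name = "getpid";
        event_log[event_count].start_ns = start;
        event_log[event_count].end_ns = end;
        event_count++;
    }
    return result;
}

/* Redirect the slot to the wrapper, remembering the original target. */
void install_hook(void) {
    if (real_getpid == NULL) {
        real_getpid = getpid_slot;
        getpid_slot = hooked_getpid;
    }
}
```

A caller would invoke `install_hook()` once and then call `getpid_slot()`. In a real PLT hook the slot is the GOT entry that the PLT stub jumps through, so existing call sites would not need to change; locating and rewriting that entry is the part this sketch leaves out.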
For task 4, Steven is surveying Stack Overflow questions and GitHub issues to collect and categorize the performance issues that users encounter. We hope this investigation will give us ideas for extending Scaler to make it more suitable for ML programs. Steven is also surveying existing profilers from the research literature; please check the awesome-profilers repo.