User/ansgoel/pool scale 1 #1

Merged
merged 57 commits into from
Oct 6, 2020
9d729ed
newer onefuzz type
anshuman-goel Sep 29, 2020
cdc78f8
update to types
anshuman-goel Sep 30, 2020
3794fd9
scaling
anshuman-goel Sep 30, 2020
76cf3d8
optional region
anshuman-goel Sep 30, 2020
f1f49eb
region as optional
anshuman-goel Sep 30, 2020
9ab4465
pool as mandatory
anshuman-goel Oct 1, 2020
8830103
cap case
anshuman-goel Oct 1, 2020
9a22460
newer changes
anshuman-goel Sep 30, 2020
70fafaf
idk
anshuman-goel Sep 30, 2020
2587ad4
af errors
anshuman-goel Oct 1, 2020
61ddb0c
pool debugging
anshuman-goel Oct 1, 2020
fa884b9
region cap case
anshuman-goel Oct 1, 2020
1c4a4d9
task state
anshuman-goel Oct 1, 2020
b550c58
Update `can_schedule` check to support node reimaging (#35)
ranweiler Sep 29, 2020
b72bf8c
Refactor internal node event schemas (#29)
ranweiler Sep 29, 2020
b35db6c
use sccache more consistently (#47)
bmc-msft Sep 29, 2020
25a43ce
add end-to-end integration testing of fuzzing pipelines (#46)
bmc-msft Sep 29, 2020
5c592a7
re-add black to lint stages (#45)
bmc-msft Sep 29, 2020
50a1fd0
fix formatting (#55)
bmc-msft Sep 29, 2020
03979ac
Example sdk in azure functions (#56)
bmc-msft Sep 29, 2020
5282a66
Adding node assignment to the task entity (#54)
chkeita Sep 29, 2020
71dd2f4
Link VMSS nodes and tasks when setting up (#43)
ranweiler Sep 30, 2020
ea21e45
set more detailed version information during builds (#58)
bmc-msft Sep 30, 2020
e188e97
Using a clean flag (#59)
anshuman-goel Oct 1, 2020
bf01dbd
Remove use of `batch` in NodeMessages (#60)
bmc-msft Oct 1, 2020
dcbfe9b
make version.localchanges match API logic (#62)
bmc-msft Oct 1, 2020
b34b42e
Set log levels in Azure Functions by hand for 3rd party libraries (#63)
bmc-msft Oct 1, 2020
8b9bc2b
use sc.exe instead of Set-Content (#67)
bmc-msft Oct 1, 2020
8f31f7d
move to warning (#66)
bmc-msft Oct 1, 2020
445bdc8
slim down msg (#65)
bmc-msft Oct 1, 2020
8278d5b
only set stating to stopping (#64)
bmc-msft Oct 1, 2020
ede6599
setting a default
anshuman-goel Oct 1, 2020
11cfe92
reversing things
anshuman-goel Oct 1, 2020
660014d
agent comment
anshuman-goel Oct 1, 2020
014cb91
debugging
anshuman-goel Oct 1, 2020
4b0959a
scaling down
anshuman-goel Oct 2, 2020
17ffc9e
fixing resizes
anshuman-goel Oct 2, 2020
68620b7
Merge branch 'main' into user/ansgoel/pool-scale-1
anshuman-goel Oct 5, 2020
383ae1c
Build 1.1.0 (#99)
bmc-msft Oct 5, 2020
29458c5
Update CURRENT_VERSION (#104)
bmc-msft Oct 5, 2020
d0d537b
newer changes
anshuman-goel Oct 5, 2020
f560514
api fix
anshuman-goel Oct 5, 2020
a1bc7a7
removing onefuzztypes
anshuman-goel Oct 5, 2020
d6b8706
Merge branch 'main' into user/ansgoel/pool-scale-1
anshuman-goel Oct 5, 2020
d280791
linter errors
anshuman-goel Oct 5, 2020
ca246da
syntax error
anshuman-goel Oct 5, 2020
33d0abd
sorting imports
anshuman-goel Oct 5, 2020
7b212a5
linter
anshuman-goel Oct 5, 2020
2cb5749
linter
anshuman-goel Oct 5, 2020
0b6f432
linting fixes
anshuman-goel Oct 5, 2020
7a58871
linting fixes
anshuman-goel Oct 5, 2020
075a4b4
import sort
anshuman-goel Oct 5, 2020
e92402c
linting fixes
anshuman-goel Oct 5, 2020
5582b3e
linting fixes
anshuman-goel Oct 5, 2020
bdc9970
fixes
anshuman-goel Oct 6, 2020
95e0e7e
removing non needed changes
anshuman-goel Oct 6, 2020
e3b7273
model remove non needed changes
anshuman-goel Oct 6, 2020
39 changes: 39 additions & 0 deletions CHANGELOG.md
Expand Up @@ -4,6 +4,45 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 1.1.0
### Added
* Agent/Service: Added the ability to automatically re-image nodes that are out-of-date [#35](https://github.com/microsoft/onefuzz/pull/35)
* Deployment: Added data-migration scripts for pre-release installs [#12](https://github.com/microsoft/onefuzz/pull/12)
* SDK/CLI: Added more `onefuzz debug` sub-commands to support debugging tasks [#95](https://github.com/microsoft/onefuzz/pull/95)
* Agent: Added machine_id and version to log messages [#94](https://github.com/microsoft/onefuzz/pull/94)
* Service: Errors in creating Azure DevOps work items from reports now mark the task as failed [#77](https://github.com/microsoft/onefuzz/pull/77)
* Service: The nodes executing a task are now included when fetching details for a task (such as `onefuzz tasks get $TASKID`) [#54](https://github.com/microsoft/onefuzz/pull/54)
* SDK: Added example [Azure Functions](https://azure.microsoft.com/en-us/services/functions/) that uses the SDK [#56](https://github.com/microsoft/onefuzz/pull/56)
* SDK/CLI: Added the ability to execute debugger commands automatically during `repro` [#39](https://github.com/microsoft/onefuzz/pull/39)
* CLI: Added documentation of CLI sub-command arguments (used to describe `afl_container` in AFL templates) [#10](https://github.com/microsoft/onefuzz/pull/10)
* Agent: Added `ONEFUZZ_TARGET_SETUP_PATH` environment variable that indicates the path to the task specific setup container on the fuzzing nodes [#15](https://github.com/microsoft/onefuzz/pull/15)
* CICD: Use [sccache](https://github.com/mozilla/sccache) to speed up build times [#47](https://github.com/microsoft/onefuzz/pull/47)
* SDK: Added end-to-end [integration test script](src/cli/examples/integration-test.py) to verify full fuzzing pipelines [#46](https://github.com/microsoft/onefuzz/pull/46)
* Documentation: Added definitions for [pool](docs/terminology.md#pool), [node](docs/terminology.md#node), and [scaleset](docs/terminology.md#scaleset) [#17](https://github.com/microsoft/onefuzz/pull/17)

### Changed
* Agent/Service: Refactored state management for on-vm supervisors [#96](https://github.com/microsoft/onefuzz/pull/96)
* Agent: Added 'done' semaphore to the agent to prevent agent from fetching additional work once the node should be reset. [#86](https://github.com/microsoft/onefuzz/pull/86)
* Agent: Nodes now sleep longer between checking for new work. [#78](https://github.com/microsoft/onefuzz/pull/78)
* Agent: The task execution clock is now started once the task is in the 'setting up' state [#82](https://github.com/microsoft/onefuzz/pull/82)
* Service: Drastically reduced logs sent to App Insights from third-party libraries [#63](https://github.com/microsoft/onefuzz/pull/63)
* Agent/Service: Added the ability to upgrade out-of-date VMs upon requesting new tasking [#35](https://github.com/microsoft/onefuzz/pull/35)
* CICD: Non-release builds now include the Git hash in the versions and `localchanges` if built locally with uncommitted code. [#58](https://github.com/microsoft/onefuzz/pull/58)
* Agent: [Command replacements](docs/command-replacements.md) now use absolute rather than relative paths. [#22](https://github.com/microsoft/onefuzz/pull/22)

### Fixed
* CLI: Fixed issue using `onefuzz template stop` which would improperly stop jobs that had the same 'name' but different 'project' values. [#97](https://github.com/microsoft/onefuzz/pull/97)
* Agent: Fixed input marker expansion (used in AFL templates related to handling `@@`). [#87](https://github.com/microsoft/onefuzz/pull/87)
* Service: Errors generated after the task shutdown has started are ignored. [#83](https://github.com/microsoft/onefuzz/pull/83)
* Agent: Instance-specific tools now download and run on Windows nodes as expected [#81](https://github.com/microsoft/onefuzz/pull/81)
* CLI: Using `--wait_for_running` in `onefuzz template` jobs now properly waits for tasks to launch before exiting [#84](https://github.com/microsoft/onefuzz/pull/84)
* Service: Handled more Azure DevOps notification errors [#80](https://github.com/microsoft/onefuzz/pull/80)
* Agent: WSearch service is now properly disabled by default on Windows VMs [#67](https://github.com/microsoft/onefuzz/pull/67)
* Service: Properly deletes `repro` VMs [#36](https://github.com/microsoft/onefuzz/pull/36)
* Agent: Supervisor now flushes logs to appinsights upon exit [#21](https://github.com/microsoft/onefuzz/pull/21)
* Agent: Task specific setup script failures now properly get recorded as a failed task and trigger the node to be re-imaged [#24](https://github.com/microsoft/onefuzz/pull/24)


## 1.0.0
### Added
* Initial public release
2 changes: 1 addition & 1 deletion CURRENT_VERSION
@@ -1 +1 @@
-1.0.0
+1.1.0
5 changes: 4 additions & 1 deletion src/api-service/__app__/.gitignore
@@ -1 +1,4 @@
-.direnv
+.direnv
+.python_packages
+__pycache__
+.venv
1 change: 1 addition & 0 deletions src/api-service/__app__/agent_commands/__init__.py
Expand Up @@ -15,6 +15,7 @@

def get(req: func.HttpRequest) -> func.HttpResponse:
    request = parse_request(NodeCommandGet, req)

    if isinstance(request, Error):
        return not_ok(request, context="NodeCommandGet")

Expand Down
3 changes: 3 additions & 0 deletions src/api-service/__app__/agent_events/__init__.py
Expand Up @@ -107,6 +107,7 @@ def on_state_update(
state=NodeTaskState.setting_up,
)
node_task.save()

elif state == NodeState.done:
# if tasks are running on the node when it reports as Done
# those are stopped early
Expand All @@ -125,6 +126,8 @@ def on_state_update(
machine_id,
done_data,
)
else:
logging.info("No change in Node state")
else:
logging.info("ignoring state updates from the node: %s: %s", machine_id, state)

Expand Down
2 changes: 2 additions & 0 deletions src/api-service/__app__/agent_registration/__init__.py
Expand Up @@ -3,6 +3,7 @@
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

import logging
from uuid import UUID

import azure.functions as func
Expand Down Expand Up @@ -76,6 +77,7 @@ def get(req: func.HttpRequest) -> func.HttpResponse:

def post(req: func.HttpRequest) -> func.HttpResponse:
    registration_request = parse_uri(AgentRegistrationPost, req)
    logging.info(f"request: {registration_request}")
    if isinstance(registration_request, Error):
        return not_ok(registration_request, context="agent registration")

Expand Down
18 changes: 16 additions & 2 deletions src/api-service/__app__/onefuzzlib/pools.py
Expand Up @@ -327,6 +327,11 @@ def create(
        arch: Architecture,
        managed: bool,
        client_id: Optional[UUID],
        max_size: int,  # scaleset max size
        vm_sku: str,
        image: str,
        spot_instances: bool,
        region: Region,
    ) -> "Pool":
        return cls(
            name=name,
Expand All @@ -335,6 +340,11 @@ def create(
            managed=managed,
            client_id=client_id,
            config=None,
            max_size=max_size,
            vm_sku=vm_sku,
            image=image,
            spot_instances=spot_instances,
            region=region,
        )

    def save_exclude(self) -> Optional[MappingIntStrAny]:
Expand Down Expand Up @@ -854,14 +864,18 @@ def halt(self) -> None:
         self.state = ScalesetState.halt
         self.delete()

-    def max_size(self) -> int:
+    @classmethod
+    def scaleset_max_size(cls, image: str) -> int:
         # https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/
         # virtual-machine-scale-sets-placement-groups#checklist-for-using-large-scale-sets
-        if self.image.startswith("/"):
+        if image.startswith("/"):
             return 600
         else:
             return 1000

+    def max_size(self) -> int:
+        return Scaleset.scaleset_max_size(self.image)
+
     @classmethod
     def search_states(
         cls, *, states: Optional[List[ScalesetState]] = None
Expand Down
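The hunk above moves the per-image size limit into a class-level helper, so the limit can be computed before any scaleset exists. A minimal standalone sketch of that rule follows. It assumes, as the `startswith("/")` check suggests, that an image string beginning with `/` is a resource ID for a custom image and anything else is a platform image; the sample image strings are hypothetical.

```python
def scaleset_max_size(image: str) -> int:
    """Return the node limit for a single scaleset, mirroring the rule in the diff."""
    # Assumption: a leading "/" marks a custom image referenced by resource ID.
    if image.startswith("/"):
        return 600
    # Anything else is treated as a platform (marketplace) image.
    return 1000


# Hypothetical image identifiers, for illustration only.
custom_image = "/subscriptions/example/resourceGroups/rg/providers/Microsoft.Compute/images/fuzz"
platform_image = "Canonical:UbuntuServer:18.04-LTS:latest"

print(scaleset_max_size(custom_image))
print(scaleset_max_size(platform_image))
```

Keeping the instance method `max_size()` as a thin wrapper means existing callers are unaffected by the refactor.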
21 changes: 21 additions & 0 deletions src/api-service/__app__/onefuzzlib/tasks/main.py
Expand Up @@ -153,6 +153,27 @@ def get_by_task_id(cls, task_id: UUID) -> Union[Error, "Task"]:
        task = tasks[0]
        return task

    @classmethod
    def get_tasks_by_pool_name(
        cls, pool_name: str
    ) -> Optional[Union[Error, List["Task"]]]:
        tasks = cls.search()
        if not tasks:
            return Error(code=ErrorCode.INVALID_REQUEST, errors=["unable to find task"])

        pool_tasks = []

        for task in tasks:
            if not task.config.pool:
                continue
            if pool_name == task.config.pool.pool_name and task.state not in [
                TaskState.stopped,
                TaskState.stopping,
            ]:
                pool_tasks.append(task)

        return pool_tasks

    def mark_stopping(self) -> None:
        if self.state not in [TaskState.stopped, TaskState.stopping]:
            self.state = TaskState.stopping
Expand Down
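The new `get_tasks_by_pool_name` loads every task and keeps those bound to the named pool that are neither stopping nor stopped. The filter can be sketched in isolation as follows. The `Task`, `TaskPool`, and `TaskState` types here are simplified stand-ins invented for illustration, not the onefuzz models, and the sketch works on an in-memory list instead of calling `cls.search()`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class TaskState(Enum):
    # Hypothetical subset of states, enough to illustrate the filter.
    running = "running"
    stopping = "stopping"
    stopped = "stopped"


@dataclass
class TaskPool:
    pool_name: str
    count: int


@dataclass
class Task:
    name: str
    state: TaskState
    pool: Optional[TaskPool]  # None when the task is not bound to a pool


def active_tasks_for_pool(tasks: List[Task], pool_name: str) -> List[Task]:
    """Keep tasks assigned to pool_name that are neither stopping nor stopped."""
    inactive = {TaskState.stopping, TaskState.stopped}
    return [
        task
        for task in tasks
        if task.pool is not None
        and task.pool.pool_name == pool_name
        and task.state not in inactive
    ]


tasks = [
    Task("a", TaskState.running, TaskPool("linux", 2)),
    Task("b", TaskState.stopped, TaskPool("linux", 4)),
    Task("c", TaskState.running, TaskPool("windows", 1)),
    Task("d", TaskState.running, None),
]
active = active_tasks_for_pool(tasks, "linux")
print([task.name for task in active])
```

One behavior in the diff worth noting: an empty task table yields an `Error`, while a pool with no matching tasks yields an empty list, so callers have to handle both cases.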
4 changes: 2 additions & 2 deletions src/api-service/__app__/onefuzzlib/versions.py
Expand Up @@ -16,8 +16,8 @@
 def read_local_file(filename: str) -> str:
     path = os.path.join(os.path.dirname(os.path.realpath(__file__)), filename)
     if os.path.exists(path):
-        with open(path, "r") as handle:
-            return handle.read().strip()
+        with open(path, "rb") as handle:
+            return handle.read().strip().decode("utf-16")
     else:
         return "UNKNOWN"

Expand Down
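The hunk above switches `read_local_file` to binary mode and decodes the contents as UTF-16, which suggests the version files were UTF-16 encoded in the author's environment (an assumption; the PR does not say). One detail to watch is the order of `strip()` and `decode()`: `bytes.strip()` works on single bytes, so it cannot reliably remove a two-byte UTF-16 newline. A minimal sketch of the decode-then-strip order follows; the helper name `read_version_file` and the temporary file are hypothetical.

```python
import os
import tempfile


def read_version_file(path: str) -> str:
    """Read a UTF-16 encoded version file, returning "UNKNOWN" if it is missing."""
    if not os.path.exists(path):
        return "UNKNOWN"
    with open(path, "rb") as handle:
        # Decode first, then strip, so the newline is removed as a character
        # rather than as raw bytes.
        return handle.read().decode("utf-16").strip()


with tempfile.TemporaryDirectory() as tmp:
    version_path = os.path.join(tmp, "version")
    with open(version_path, "wb") as handle:
        # Write a UTF-16 file (with BOM) that ends in a newline.
        handle.write("1.1.0\n".encode("utf-16"))
    version = read_version_file(version_path)
    missing = read_version_file(os.path.join(tmp, "does-not-exist"))

print(version)
print(missing)
```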
21 changes: 20 additions & 1 deletion src/api-service/__app__/pool/__init__.py
Expand Up @@ -3,14 +3,15 @@
 # Copyright (c) Microsoft Corporation.
 # Licensed under the MIT License.

+import logging
 import os

 import azure.functions as func
 from onefuzztypes.enums import ErrorCode, PoolState
 from onefuzztypes.models import AgentConfig, Error
 from onefuzztypes.requests import PoolCreate, PoolSearch, PoolStop

-from ..onefuzzlib.azure.creds import get_instance_name
+from ..onefuzzlib.azure.creds import get_base_region, get_instance_name, get_regions
 from ..onefuzzlib.pools import Pool
 from ..onefuzzlib.request import not_ok, ok, parse_request

Expand Down Expand Up @@ -65,12 +66,30 @@ def post(req: func.HttpRequest) -> func.HttpResponse:
context=repr(request),
)

    logging.info(request)

    if request.region is None:
        region = get_base_region()
    else:
        if request.region not in get_regions():
            return not_ok(
                Error(code=ErrorCode.UNABLE_TO_CREATE, errors=["invalid region"]),
                context="poolcreate",
            )

        region = request.region

    pool = Pool.create(
        name=request.name,
        os=request.os,
        arch=request.arch,
        managed=request.managed,
        client_id=request.client_id,
        max_size=request.max_size,
        vm_sku=request.vm_sku,
        image=request.image,
        spot_instances=request.spot_instances,
        region=region,
    )
    pool.save()
    return ok(set_config(pool))
Expand Down
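The region handling added to `post` amounts to "use the base region when none is given, otherwise accept only a known region". A minimal sketch of that decision as a pure function follows. The names `resolve_region` and `InvalidRegion` and the sample region strings are hypothetical; the diff itself returns a `not_ok(...)` response instead of raising.

```python
from typing import List, Optional


class InvalidRegion(ValueError):
    """Raised when a requested region is not in the allowed list."""


def resolve_region(
    requested: Optional[str], base_region: str, allowed: List[str]
) -> str:
    """Pick the region for a new pool, falling back to the deployment's base region."""
    if requested is None:
        return base_region
    if requested not in allowed:
        raise InvalidRegion(f"invalid region: {requested}")
    return requested


# Hypothetical values standing in for get_base_region() and get_regions().
allowed_regions = ["eastus", "westus2"]
print(resolve_region(None, "eastus", allowed_regions))
print(resolve_region("westus2", "eastus", allowed_regions))
```

Separating the decision from the HTTP handler makes the three cases (default, valid, invalid) easy to unit test.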
123 changes: 123 additions & 0 deletions src/api-service/__app__/pool_resize/__init__.py
@@ -0,0 +1,123 @@
#!/usr/bin/env python
#
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

import logging
import math
from typing import List

import azure.functions as func
from onefuzztypes.enums import NodeState, PoolState, ScalesetState
from onefuzztypes.models import Error

from ..onefuzzlib.pools import Node, Pool, Scaleset
from ..onefuzzlib.tasks.main import Task


def scale_up(pool: Pool, scalesets: List[Scaleset], nodes_needed: int) -> None:
    logging.info(f"Nodes needed: {nodes_needed}")

    for scaleset in scalesets:
        if scaleset.state == ScalesetState.running:

            max_size = min(scaleset.max_size(), pool.max_size)
            logging.info(f"Scaleset size: {scaleset.size}, max_size: {max_size}")
            if scaleset.size < max_size:
                current_size = scaleset.size
                if nodes_needed <= max_size - current_size:
                    scaleset.size = current_size + nodes_needed
                    nodes_needed = 0
                else:
                    scaleset.size = max_size
                    nodes_needed = nodes_needed - (max_size - current_size)
                scaleset.state = ScalesetState.resize
                scaleset.save()

            else:
                continue

            if nodes_needed == 0:
                return

    for _ in range(
        math.ceil(
            nodes_needed / min(Scaleset.scaleset_max_size(pool.image), pool.max_size)
        )
    ):
        logging.info(f"Creating Scaleset for Pool {pool.name}")
        max_nodes_scaleset = min(
            Scaleset.scaleset_max_size(pool.image), pool.max_size, nodes_needed
        )
        scaleset = Scaleset.create(
            pool_name=pool.name,
            vm_sku=pool.vm_sku,
            image=pool.image,
            region=pool.region,
            size=max_nodes_scaleset,
            spot_instances=pool.spot_instances,
            tags={"pool": pool.name},
        )
        scaleset.save()
        # don't return auths during create, only 'get' with include_auth
        scaleset.auth = None
        nodes_needed -= max_nodes_scaleset


def scale_down(scalesets: List[Scaleset], nodes_to_remove: int) -> None:
    for scaleset in scalesets:
        nodes = Node.search_states(
            scaleset_id=scaleset.scaleset_id, states=[NodeState.free]
        )
        if nodes and nodes_to_remove > 0:
            max_nodes_remove = min(len(nodes), nodes_to_remove)
            if max_nodes_remove >= scaleset.size and len(nodes) == scaleset.size:
                scaleset.state = ScalesetState.halt
                nodes_to_remove = nodes_to_remove - scaleset.size
                scaleset.save()
                continue

            scaleset.size = scaleset.size - max_nodes_remove
            nodes_to_remove = nodes_to_remove - max_nodes_remove
            scaleset.state = ScalesetState.resize
            scaleset.save()


def get_vm_count(tasks: List[Task]) -> int:
    count = 0
    for task in tasks:
        if not task.config.pool:
            continue
        count += task.config.pool.count
    return count


def main(mytimer: func.TimerRequest) -> None:  # noqa: F841
    pools = Pool.search_states(states=[PoolState.init, PoolState.running])
    for pool in pools:
        tasks = Task.get_tasks_by_pool_name(pool.name)
        num_of_tasks = 0
        # get all the tasks (count not stopped) for the pool
        if not tasks or isinstance(tasks, Error):
            continue

        num_of_tasks = get_vm_count(tasks)
        logging.info(f"#Tasks: {num_of_tasks}")
        # do scaleset logic match with pool
        # get all the scalesets for the pool
        scalesets = Scaleset.search_by_pool(pool.name)
        pool_resize = False
        for scaleset in scalesets:
            if scaleset.state in ScalesetState.is_resizing():
                pool_resize = True
                break
            num_of_tasks = num_of_tasks - scaleset.size

        if pool_resize:
            continue

        if num_of_tasks > 0:
            # resizing scaleset or creating new scaleset.
            scale_up(pool, scalesets, num_of_tasks)
        elif num_of_tasks < 0:
            scale_down(scalesets, abs(num_of_tasks))
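The arithmetic in `main`, `scale_up`, and `scale_down` is easier to follow without the storage and Azure calls. The sketch below restates it as pure functions over plain integers. All three function names are hypothetical, `max_size` stands for `min(scaleset limit, pool.max_size)` from the diff, and the sketch assumes every existing scaleset is in the running state (the diff skips scalesets that are not).

```python
import math
from typing import List, Optional, Tuple


def nodes_delta(requested_vms: int, current_sizes: List[int]) -> int:
    """Positive: nodes to add. Negative: nodes to remove (as computed in main())."""
    return requested_vms - sum(current_sizes)


def plan_scale_up(
    current_sizes: List[int], max_size: int, nodes_needed: int
) -> Tuple[List[int], List[int]]:
    """Return (new sizes of existing scalesets, sizes of scalesets to create)."""
    resized = []
    for size in current_sizes:
        # Use the spare capacity of each existing scaleset first.
        grow = min(max(max_size - size, 0), nodes_needed)
        resized.append(size + grow)
        nodes_needed -= grow

    created = []
    # Split the remainder into new scalesets, each capped at max_size.
    for _ in range(math.ceil(nodes_needed / max_size)):
        chunk = min(max_size, nodes_needed)
        created.append(chunk)
        nodes_needed -= chunk
    return resized, created


def plan_scale_down(
    scalesets: List[Tuple[int, int]], nodes_to_remove: int
) -> List[Optional[int]]:
    """Each input pair is (size, free_nodes). None in the result means halt."""
    result: List[Optional[int]] = []
    for size, free in scalesets:
        removable = min(free, nodes_to_remove)
        if removable == 0:
            result.append(size)  # nothing free, or nothing left to remove
        elif removable >= size and free == size:
            result.append(None)  # every node is free: halt the whole scaleset
            nodes_to_remove -= size
        else:
            result.append(size - removable)  # shrink by free nodes only
            nodes_to_remove -= removable
    return result


delta = nodes_delta(requested_vms=35, current_sizes=[8, 10])
up = plan_scale_up([8, 10], max_size=10, nodes_needed=delta)
down = plan_scale_down([(4, 4), (6, 2)], nodes_to_remove=5)
print(delta, up, down)
```

In the example, 35 requested VMs against scalesets of 8 and 10 nodes leaves 17 to add: 2 fit in the first scaleset, and the remaining 15 become new scalesets of 10 and 5. Only free nodes are ever removed when scaling down, so busy nodes are never taken away from running tasks.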
11 changes: 11 additions & 0 deletions src/api-service/__app__/pool_resize/function.json
@@ -0,0 +1,11 @@
{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "mytimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "00:01:00"
    }
  ]
}
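The `schedule` value in this binding is written in `hh:mm:ss` form rather than as a cron expression, so `00:01:00` reads as a one-minute interval between runs of the resize function. A small sketch of that reading, using a hypothetical `parse_timespan` helper that is not part of the Azure Functions SDK:

```python
from datetime import timedelta


def parse_timespan(value: str) -> timedelta:
    """Parse an "hh:mm:ss" string into a timedelta (illustrative helper only)."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


interval = parse_timespan("00:01:00")
print(interval.total_seconds())
```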
Binary file modified src/api-service/__app__/requirements.txt
Binary file not shown.