This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Feature/custom package (#272)
* Added custom package script

* Added feature custom download

* Fixed typo

* Fixed directory for installation

* Fixed full folder directory

* Add dependencies and fix pattern

* Fix pattern not found

* Added repo

* Switching to devtools

* Fixing devtools install with directory

* Fix in for merger.R

* Working cluster custom packages

* Removed printed statements

* Working on custom docs

* Custom packages sample docs

* Fixed typo in azure files typo

* Fixed typos based on PR
brnleehng authored May 14, 2018
1 parent d02599d commit 20c86f1
Showing 12 changed files with 194 additions and 27 deletions.
4 changes: 4 additions & 0 deletions R/cluster.R
@@ -151,6 +151,10 @@ makeCluster <-
"wget https://raw.githubusercontent.com/Azure/doAzureParallel/",
"master/inst/startup/install_bioconductor.R"
),
paste0(
"wget https://raw.githubusercontent.com/Azure/doAzureParallel/",
"master/inst/startup/install_custom.R"
),
"chmod u+x install_bioconductor.R",
installAndStartContainerCommand
)
1 change: 1 addition & 0 deletions R/commandLineUtilities.R
@@ -123,6 +123,7 @@ dockerRunCommand <-
dockerOptions <-
paste(
dockerOptions,
"-e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR",
"-e AZ_BATCH_TASK_ID=$AZ_BATCH_TASK_ID",
"-e AZ_BATCH_JOB_ID=$AZ_BATCH_JOB_ID",
"-e AZ_BATCH_TASK_WORKING_DIR=$AZ_BATCH_TASK_WORKING_DIR",
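
The added `-e` flag forwards the node's shared directory into the container so that scripts running inside it can resolve the same path. A minimal sketch of how such options can be assembled, assuming a hypothetical helper `envForwardOptions` that is not part of doAzureParallel:

```r
# Hypothetical helper (not part of doAzureParallel): build "-e NAME=$NAME"
# options so the shell expands each variable when docker run is invoked.
envForwardOptions <- function(variableNames) {
  paste(sprintf("-e %s=$%s", variableNames, variableNames), collapse = " ")
}

dockerOptions <- envForwardOptions(c(
  "AZ_BATCH_NODE_SHARED_DIR",
  "AZ_BATCH_TASK_ID"
))
print(dockerOptions)
```

The `$NAME` part is left unexpanded on purpose: the value is resolved by the shell on the node at the time the command runs, not by R.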
71 changes: 48 additions & 23 deletions docs/20-package-management.md
@@ -38,29 +38,37 @@ You can install packages by specifying the package(s) in your JSON pool configur
}
```

## Installing Packages per-*foreach* Loop

You can also install CRAN packages by using the **.packages** option in the *foreach* loop. GitHub and Bioconductor packages can be installed with the **github** and **bioconductor** options. Instead of installing packages during pool creation, packages (and their dependencies) can be installed before each iteration in the loop is run on your Azure cluster.

### Installing a Github Package

doAzureParallel supports github package with the **github** option.

Please do not use "https://github.com/" as prefix for the github package name above.

## Installing packages from a private GitHub repository

Clusters can be configured to install packages from a private GitHub repository by setting the __githubAuthenticationToken__ property. If this property is blank only public repositories can be used. If a token is added then public and the private github repo can be used together.
Clusters can be configured to install packages from a private GitHub repository by setting the __githubAuthenticationToken__ property in the credentials file. If this property is blank, only public repositories can be used. If a token is added, both public and private GitHub repositories can be used.

When the cluster is created the token is passed in as an environment variable called GITHUB\_PAT on start-up which lasts the life of the cluster and is looked up whenever devtools::install_github is called.
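
Because the token is exposed through the `GITHUB_PAT` environment variable, code running on a node can check whether one is present. A minimal sketch, assuming a hypothetical helper `githubTokenOrNull` (not part of doAzureParallel or devtools) and a placeholder token value:

```r
# Hypothetical helper: return the token from GITHUB_PAT, or NULL when it is
# unset or blank. Run this in a scratch session, since it modifies GITHUB_PAT.
githubTokenOrNull <- function() {
  token <- Sys.getenv("GITHUB_PAT", unset = "")
  if (nzchar(token)) token else NULL
}

# Simulate a cluster started without a token.
Sys.unsetenv("GITHUB_PAT")
withoutToken <- githubTokenOrNull()

# Simulate a cluster started with a (placeholder) token.
Sys.setenv(GITHUB_PAT = "placeholder-token")
withToken <- githubTokenOrNull()

Sys.unsetenv("GITHUB_PAT")
```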

Credentials File for github authentication token
``` json
{
...
"githubAuthenticationToken": "",
...
}

```

Cluster File
```json
{
{
"name": <your pool name>,
"vmSize": <your pool VM size name>,
"maxTasksPerNode": <num tasks to allocate to each node>,
"poolSize": {
"dedicatedNodes": {
"min": 2,
"max": 2
},
"lowPriorityNodes": {
"min": 1,
"max": 10
},
"autoscaleFormula": "QUEUE"
},
...
"rPackages": {
"cran": [],
"github": ["<project/some_private_repository>"],
@@ -71,10 +79,18 @@ When the cluster is created the token is passed in as an environment variable ca
}
```

_More information regarding github authentication tokens can be found [here](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)_
_More information regarding github authentication tokens can be found [here](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/)

## Installing Packages per-*foreach* Loop
You can also install CRAN packages by using the **.packages** option in the *foreach* loop. GitHub and Bioconductor packages can be installed with the **github** and **bioconductor** options. Instead of installing packages during pool creation, packages (and their dependencies) can be installed before each iteration in the loop is run on your Azure cluster.
### Installing Multiple Packages
Pass character vectors of package names to install several packages at once:

```R
number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations,
.packages=c('package_1', 'package_2'),
github = c('Azure/rAzureBatch', 'Azure/doAzureParallel'),
bioconductor = c('IRanges', 'Biobase')) %dopar% { ... }
```

To install a single cran package:
```R
@@ -94,7 +110,6 @@ number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, github='azure/rAzureBatch') %dopar% { ... }
```

Please do not use "https://github.com/" as prefix for the github package name above.

To install multiple github packages:
```R
@@ -114,7 +129,7 @@ number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, bioconductor=c('package_1', 'package_2')) %dopar% { ... }
```

## Installing Packages from BioConductor
## Installing a BioConductor Package
The default deployment of R used in the cluster (see [Customizing the cluster](./30-customize-cluster.md) for more information) includes the Bioconductor installer by default. Simply add packages to the cluster by adding packages in the array.

```json
@@ -134,17 +149,27 @@ The default deployment of R used in the cluster (see [Customizing the cluster](.
},
"autoscaleFormula": "QUEUE"
},
"containerImage": "rocker/tidyverse:latest",
"rPackages": {
"cran": [],
"github": [],
"bioconductor": ["IRanges"]
},
"commandLine": []
"commandLine": [],
"subnetId": ""
}
}
```

Note: Container references that are not provided by tidyverse do not support Bioconductor installs. If you choose another container, you must make sure that Biocondunctor is installed.
Note: Container references that are not provided by tidyverse do not support Bioconductor installs. If you choose another container, you must make sure that Bioconductor is installed.

## Installing Custom Packages
doAzureParallel supports custom package installation in the cluster. Custom package installation at the per-*foreach* loop level is not supported.

Steps for installing custom packages can be found [here](../samples/package_management/custom/README.md).

Note: If the package needs system-level dependencies to compile (for example, libraries installed with apt-get), users must build their own containers.

## Uninstalling packages
## Uninstalling a Package
Uninstalling packages from your pool is not supported. However, you may consider rebuilding your pool.
49 changes: 49 additions & 0 deletions inst/startup/install_custom.R
@@ -0,0 +1,49 @@
args <- commandArgs(trailingOnly = TRUE)

sharedPackageDirectory <- file.path(
Sys.getenv("AZ_BATCH_NODE_SHARED_DIR"),
"R",
"packages")

tempDir <- file.path(
Sys.getenv("AZ_BATCH_NODE_STARTUP_DIR"),
"tmp")

.libPaths(c(sharedPackageDirectory, .libPaths()))

pattern <- NULL
if (length(args) > 1) {
if (!is.null(args[2])) {
pattern <- args[2]
}
}

devtoolsPackage <- "devtools"
if (!require(devtoolsPackage, character.only = TRUE)) {
install.packages(devtoolsPackage)
require(devtoolsPackage, character.only = TRUE)
}

packageDirs <- list.files(
path = tempDir,
full.names = TRUE,
recursive = FALSE)

for (i in seq_along(packageDirs)) {
print("Package Directories")
print(packageDirs[i])

devtools::install(packageDirs[i],
args = c(
paste0(
"--library=",
"'",
sharedPackageDirectory,
"'")))

print("Package Directories Completed")
}

unlink(
tempDir,
recursive = TRUE)
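
The install loop walks over whatever `list.files` returns, which can be an empty vector when no package directories were extracted. A small sketch of why the choice of index sequence matters in that case; the variable names are illustrative only:

```r
# An empty character vector, as list.files() returns for an empty directory.
emptyDirs <- character(0)

# 1:length(x) counts down from 1 to 0 when x is empty, giving two iterations.
countingIndices <- 1:length(emptyDirs)

# seq_along(x) yields a zero-length sequence, so a for loop body never runs.
safeIndices <- seq_along(emptyDirs)

print(countingIndices)
print(safeIndices)
```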
4 changes: 3 additions & 1 deletion inst/startup/merger.R
@@ -18,7 +18,9 @@ batchJobPreparationDirectory <-
Sys.getenv("AZ_BATCH_JOB_PREP_WORKING_DIR")
batchTaskWorkingDirectory <- Sys.getenv("AZ_BATCH_TASK_WORKING_DIR")
taskPackageDirectory <- paste0(batchTaskWorkingDirectory)
clusterPackageDirectory <- paste0(Sys.getenv("AZ_BATCH_NODE_SHARED_DIR", "/R/packages"))
clusterPackageDirectory <- file.path(Sys.getenv("AZ_BATCH_NODE_SHARED_DIR"),
"R",
"packages")

libPaths <- c(
taskPackageDirectory,
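
The change to `clusterPackageDirectory` matters because the second argument of `Sys.getenv` is the `unset` fallback value, not a path suffix, so the earlier call never appended `/R/packages` when the variable was set. A brief sketch using a made-up variable name and path:

```r
# Illustrative variable name and path; both are placeholders.
Sys.setenv(EXAMPLE_SHARED_DIR = "/mnt/shared")

# The second argument is only returned when the variable is unset,
# so the suffix is silently dropped here.
oldStyle <- paste0(Sys.getenv("EXAMPLE_SHARED_DIR", "/R/packages"))

# file.path() joins the components with "/" as intended.
newStyle <- file.path(Sys.getenv("EXAMPLE_SHARED_DIR"), "R", "packages")

print(oldStyle)
print(newStyle)

Sys.unsetenv("EXAMPLE_SHARED_DIR")
```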
5 changes: 4 additions & 1 deletion samples/azure_files/azure_files_cluster.json
@@ -20,5 +20,8 @@
},
"commandLine": [
"mkdir /mnt/batch/tasks/shared/data",
"mount -t cifs //<STORAGE_ACCOUNT_NAME>.file.core.windows.net/<FILE_SHARE_NAME> /mnt/batch/tasks/shared/data -o vers=3.0,username=<STORAGE_ACCOUNT_NAME>,password=<STORAGE_ACCOUNT_KEY>==,dir_mode=0777,file_mode=0777,sec=ntlmssp"]
"mount -t cifs //<STORAGE_ACCOUNT_NAME>.file.core.windows.net/<FILE_SHARE_NAME> /mnt/batch/tasks/shared/data -o vers=3.0,username=<STORAGE_ACCOUNT_NAME>,password=<STORAGE_ACCOUNT_KEY>,dir_mode=0777,file_mode=0777,sec=ntlmssp",
"wget https://raw.githubusercontent.com/Azure/doAzureParallel/feature/custom-package/inst/startup/install_custom.R",
"docker run --rm -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR -e AZ_BATCH_NODE_ROOT_DIR=$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_STARTUP_DIR=$AZ_BATCH_NODE_STARTUP_DIR rocker/tidyverse:latest Rscript --no-save --no-environ --no-restore --no-site-file --verbose $AZ_BATCH_NODE_STARTUP_DIR/wd/install_custom.R /mnt/batch/tasks/shared/data"
]
}
2 changes: 1 addition & 1 deletion samples/azure_files/readme.md
@@ -12,4 +12,4 @@ This samples shows how to update the cluster configuration to create a new mount

For large data sets or large traffic applications be sure to review the Azure Files [scalability and performance targets](https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#scalability-targets-for-blobs-queues-tables-and-files).

For very large data sets we recommend using Azure Blobs. You can learn more in the [persistent storage](../../docs/23-persistent-storage.md) and [distrubuted data](../../docs/21-distributing-data.md) docs.
For very large data sets we recommend using Azure Blobs. You can learn more in the [persistent storage](../../docs/23-persistent-storage.md) and [distributing data](../../docs/21-distributing-data.md) docs.
@@ -1,4 +1,4 @@
#Please see documentation at docs/20-package-management.md for more details on packagement management.
#Please see documentation at docs/20-package-management.md for more details on package management.

# import the doAzureParallel library and its dependencies
library(doAzureParallel)
32 changes: 32 additions & 0 deletions samples/package_management/custom/README.md
@@ -0,0 +1,32 @@
## Installing Custom Packages
doAzureParallel supports custom package installation in the cluster. Custom packages are R packages that cannot be hosted on GitHub or built into a Docker image. The recommended approach for custom packages is building them from source and uploading them to an Azure File Share.

Note: If the package needs system-level dependencies to compile (for example, libraries installed with apt-get), users must build their own containers.

### Building Package from Source in RStudio
1. Open *RStudio*
2. Go to *Build* on the navigation bar
3. Go to *Build From Source*
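
Outside RStudio, a source tarball can also be produced with `R CMD build`. A minimal sketch, assuming a placeholder package named `customR` with a single `hello()` function; every name and DESCRIPTION field below is made up for illustration:

```r
# Build a throwaway source package; all names and fields are placeholders.
pkgDir <- file.path(tempdir(), "customR")
dir.create(file.path(pkgDir, "R"), recursive = TRUE, showWarnings = FALSE)

writeLines(c(
  "Package: customR",
  "Type: Package",
  "Title: Example Custom Package",
  "Version: 0.1.0",
  "Author: Example Author",
  "Maintainer: Example Author <author@example.com>",
  "Description: Placeholder package used to illustrate building from source.",
  "License: GPL-3"
), file.path(pkgDir, "DESCRIPTION"))
writeLines("export(hello)", file.path(pkgDir, "NAMESPACE"))
writeLines("hello <- function() \"Hello from customR\"",
           file.path(pkgDir, "R", "hello.R"))

# R CMD build writes <package>_<version>.tar.gz into the working directory.
oldWd <- setwd(tempdir())
status <- system2(file.path(R.home("bin"), "R"),
                  c("CMD", "build", shQuote(pkgDir)))
setwd(oldWd)

tarball <- file.path(tempdir(), "customR_0.1.0.tar.gz")
print(file.exists(tarball))
```

The resulting `.tar.gz` file is what would then be uploaded to the file share.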

### Uploading Custom Package to Azure Files
Detailed steps on uploading files to Azure Files in the Portal can be found
[here](https://docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-portal).

### Notes
1) In order to build the custom packages' dependencies, we need to untar the R packages and build them within their directories. By default, custom packages are extracted and built in the *$AZ_BATCH_NODE_STARTUP_DIR/tmp* directory.
2) By default, the custom package cluster configuration file installs every package stored as a *.tar.gz file in the file share. To install only specific R packages, users must change this line in the cluster configuration file.

Finds files that end with *.tar.gz in the current Azure File Share directory
``` json
{
...
"commandLine": [
...
"mkdir -p $AZ_BATCH_NODE_STARTUP_DIR/tmp && for i in `ls $AZ_BATCH_NODE_SHARED_DIR/data/*.tar.gz | awk '{print $NF}'`; do tar -xvf $i -C $AZ_BATCH_NODE_STARTUP_DIR/tmp; done",
...
]
}
```
3) For more information on using Azure Files on Batch, see our other [sample](../../azure_files/readme.md) on using Azure Files.
4) Replace the Storage Account name, endpoint, and key placeholders in the cluster configuration file with your own values.
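
The shell loop in note 2 selects archives by their `.tar.gz` suffix. To check locally which files would be picked up, a rough R equivalent is sketched below; the directory and file names are placeholders:

```r
# Placeholder directory standing in for the mounted file share.
shareDir <- file.path(tempdir(), "example_share")
dir.create(shareDir, showWarnings = FALSE)
invisible(file.create(file.path(shareDir, c(
  "customR_0.1.0.tar.gz",
  "otherPackage_1.2.0.tar.gz",
  "notes.txt"
))))

# Keep only files whose names end in ".tar.gz".
tarballs <- list.files(shareDir, pattern = "\\.tar\\.gz$", full.names = TRUE)
print(basename(tarballs))
```

Narrowing the `pattern` argument (for example to a specific package name prefix) mirrors what changing the glob in the cluster configuration would do.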
24 changes: 24 additions & 0 deletions samples/package_management/custom/custom.R
@@ -0,0 +1,24 @@
#Please see documentation at docs/20-package-management.md for more details on package management.

# import the doAzureParallel library and its dependencies
library(doAzureParallel)

# set your credentials
doAzureParallel::setCredentials("credentials.json")

# Create your cluster if it does not exist
cluster <- doAzureParallel::makeCluster("custom_packages_cluster.json")

# register your parallel backend
doAzureParallel::registerDoAzureParallel(cluster)

# check that your workers are up
doAzureParallel::getDoParWorkers()

summary <- foreach(i = 1:1, .packages = c("customR")) %dopar% {
sessionInfo()
# Method from customR
hello()
}

summary
27 changes: 27 additions & 0 deletions samples/package_management/custom/custom_packages_cluster.json
@@ -0,0 +1,27 @@
{
"name": "custom-package-pool",
"vmSize": "Standard_D2_v2",
"maxTasksPerNode": 1,
"poolSize": {
"dedicatedNodes": {
"min": 2,
"max": 2
},
"lowPriorityNodes": {
"min": 0,
"max": 0
},
"autoscaleFormula": "QUEUE"
},
"rPackages": {
"cran": [],
"github": [],
"bioconductor": []
},
"commandLine": [
"mkdir /mnt/batch/tasks/shared/data",
"mount -t cifs //<Account Name>.file.core.windows.net/<File Share> /mnt/batch/tasks/shared/data -o vers=3.0,username=<Account Name>,password=<Account Key>,dir_mode=0777,file_mode=0777,sec=ntlmssp",
"mkdir -p $AZ_BATCH_NODE_STARTUP_DIR/tmp && for i in `ls $AZ_BATCH_NODE_SHARED_DIR/data/*.tar.gz | awk '{print $NF}'`; do tar -xvf $i -C $AZ_BATCH_NODE_STARTUP_DIR/tmp; done",
"docker run --rm -v $AZ_BATCH_NODE_ROOT_DIR:$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_SHARED_DIR=$AZ_BATCH_NODE_SHARED_DIR -e AZ_BATCH_NODE_ROOT_DIR=$AZ_BATCH_NODE_ROOT_DIR -e AZ_BATCH_NODE_STARTUP_DIR=$AZ_BATCH_NODE_STARTUP_DIR rocker/tidyverse:latest Rscript --no-save --no-environ --no-restore --no-site-file --verbose $AZ_BATCH_NODE_STARTUP_DIR/wd/install_custom.R /mnt/batch/tasks/shared/data"
]
}
