Skip to content
This repository has been archived by the owner on May 15, 2019. It is now read-only.

Spot naming #73

Open
wants to merge 7 commits into
base: tmp_spot
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 10 additions & 10 deletions INSTALL.md
Original file line number Diff line number Diff line change
@@ -1,17 +1,17 @@
# oni-ml
# spot-ml

Machine learning routines for OpenNetworkInsight, version 1.1
Machine learning routines for Apache Spot (incubating).

## Prerequisites, Installation and Configuration

Install and configure oni-ml as a part of the Open-Network-Insight project, per the instruction at
[the Open-Network-Insight wiki](https://github.com/Open-Network-Insight/open-network-insight/wiki).
Install and configure spot-ml as a part of the Spot project, per the instruction at
[the Spot wiki].

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, for now you are just keeping links placeholders rather than let them point to the current repo structure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah
will add new repo structure when it all comes on line


The oni-ml routines must be built into a jar stored at `target/scala-2.10/oni-ml-assembly-1.1.jar` on the master node. This requires Scala 2.10 or later to be installed on the system building the jar. To build the jar, from the top-level of the oni-ml repo, execute the command `sbt assembly`.
The spot-ml routines must be built into a jar stored at `target/scala-2.10/spot-ml-assembly-1.1.jar` on the master node. This requires Scala 2.10 or later to be installed on the system building the jar. To build the jar, from the top-level of the spot-ml repo, execute the command `sbt assembly`.

Names and language that we will use from the configuration variables for Open-Network-Insight (that are set in the file [duxbay.conf](https://github.com/Open-Network-Insight/oni-setup/blob/dev/duxbay.conf))
Names and language that we will use from the configuration variables for Spot (that are set in the file [spot.conf])

- MLNODE The node from which the oni-ml routines are invoked
- MLNODE The node from which the spot-ml routines are invoked
- NODES The list of MPI worker nodes that execute the topic modelling analysis
- HUSER An HDFS user path that will be the base path for the solution; this is usually the same user that you created to run the solution
- LPATH The local path for the ML intermediate and final results, dynamically created and populated when the pipeline runs
Expand All @@ -22,7 +22,7 @@ Names and language that we will use from the configuration variables for Open-Ne

### Prepare data for input

Load data for consumption by oni-ml by running [oni-ingest](https://github.com/Open-Network-Insight/oni-ingest/tree/dev).
Load data for consumption by spot-ml by running [spot-ingest].


## Run a suspicious connects analysis
Expand Down Expand Up @@ -52,15 +52,15 @@ As the maximum probability of an event is 1, a threshold of 1 can be used to sel

## Licensing

oni-ml is licensed under Apache Version 2.0
spot-ml is licensed under Apache Version 2.0

## Contributing

Create a pull request and contact the maintainers.

## Issues

Report issues at the [OpenNetworkInsight issues page](https://github.com/Open-Network-Insight/open-network-insight/issues).
Report issues at the [Spot issues page].

## Maintainers

Expand Down
28 changes: 14 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,27 +1,27 @@
# oni-ml
# spot-ml

Machine learning routines for OpenNetworkInsight, version 1.1
Machine learning routines for Apache Spot (incubating).

At present, oni-ml contains routines for performing *suspicious connections* analyses on netflow, DNS or proxy data gathered from a network. These
At present, spot-ml contains routines for performing *suspicious connections* analyses on netflow, DNS or proxy data gathered from a network. These
analyses consume a (possibly very lage) collection of network events and produces a list of the events that considered to be the least probable (or most suspicious).

oni-ml is designed to be run as a component of Open-Network-Insight. It relies on the ingest component of Open-Network-Insight to collect and load
netflow and DNS records, and oni-ml will try to load data to the operational analytics component of Open-Network-Insight. It is strongly suggested that when experimenting with oni-ml, you do so as a part of the unified Open-Network-Insight system: Please see [the Open-Network-Insight wiki](https://github.com/Open-Network-Insight/open-network-insight/wiki)
spot-ml is designed to be run as a component of Spot. It relies on the ingest component of Spot to collect and load
netflow and DNS records, and spot-ml will try to load data to the operational analytics component of Spot. It is suggested that when experimenting with spot-ml, you do so as a part of the unified Spot system: Please see [the Spot wiki]

The remaining instructions in this README file treat oni-ml in a stand-alone fashion that might be helpful for customizing and troubleshooting the
The remaining instructions in this README file treat spot-ml in a stand-alone fashion that might be helpful for customizing and troubleshooting the
component.

## Prepare data for input

Load data for consumption by oni-ml by running [oni-ingest](https://github.com/Open-Network-Insight/oni-ingest/tree/dev).
Load data for consumption by spot-ml by running [spot-ingest].

The data format and location where the data is stored differs for netflow and DNS analyses.

**Netflow Data**

Netflow data for the year YEAR, month MONTH, and day DAY is stored in HDFS at `HUSER/flow/csv/y=YEAR/m=MONTH/d=DAY/*`

Data for oni-ml netflow analyses is currently stored in text csv files using the following schema:
Data for spot-ml netflow analyses is currently stored in text csv files using the following schema:

- time: String
- year: Double
Expand Down Expand Up @@ -55,7 +55,7 @@ Data for oni-ml netflow analyses is currently stored in text csv files using the

DNS data for the year YEAR, month MONTH and day DAY is stored in Hive at `HUSER/dns/hive/y=YEAR/m=MONTH/d=DAY/`

The Hive tables containing DNS data for oni-ml analyses have the following schema:
The Hive tables containing DNS data for spot-ml analyses have the following schema:

- frame_time: STRING
- unix_tstamp: BIGINT
Expand Down Expand Up @@ -124,12 +124,12 @@ As the maximum probability of an event is 1, a threshold of 1 can be used to sel
```


### oni-ml output
### spot-ml output

Final results are stored in the following file on HDFS.

Depending on which data source is analyzed,
oni-ml output will be found under the ``HPATH`` at one of
spot-ml output will be found under the ``HPATH`` at one of

$HPATH/dns/scored_results/YYYYMMDD/scores/dns_results.csv
$HPATH/proxy/scored_results/YYYYMMDD/scores/results.csv
Expand All @@ -138,7 +138,7 @@ oni-ml output will be found under the ``HPATH`` at one of

It is a csv file in which network events annotated with estimated probabilities and sorted in ascending order.

A successful run of oni-ml will also create and populate a directory at `LPATH/<source>/YYYYMMDD` where `<source>` is one of flow, dns or proxy, and
A successful run of spot-ml will also create and populate a directory at `LPATH/<source>/YYYYMMDD` where `<source>` is one of flow, dns or proxy, and
`YYYYMMDD` is the date argument provided to `ml_ops.sh`
This directory will contain the following files generated during the LDA procedure used for topic-modelling:

Expand All @@ -152,15 +152,15 @@ In addition, on each worker node identified in NODES, in the `LPATH/<source>/YYY

## Licensing

oni-ml is licensed under Apache Version 2.0
spot-ml is licensed under Apache Version 2.0

## Contributing

Create a pull request and contact the maintainers.

## Issues

Report issues at the [OpenNetworkInsight issues page](https://github.com/Open-Network-Insight/open-network-insight/issues).
Report issues at the [Spot issues page].

## Maintainers

Expand Down
2 changes: 1 addition & 1 deletion build.sbt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
name := "oni-ml"
name := "spot-ml"

version := "1.1"

Expand Down
7 changes: 3 additions & 4 deletions install_ml.sh
Original file line number Diff line number Diff line change
@@ -1,13 +1,12 @@
#!/bin/bash

# lda stage
source /etc/duxbay.conf
source /etc/spot.conf

# copy solution files to all nodes
for d in "${NODES[@]}"
do
rsync -v -a --include='target' --include='target/scala-2.10' --include='target/scala-2.10/oni-ml-assembly-1.1.jar' \
--include='oni-lda-c' --include='oni-lda-c/*' --include 'top-1m.csv' --include='*.sh' \
rsync -v -a --include='target' --include='target/scala-2.10' --include='target/scala-2.10/spot-ml-assembly-1.1.jar' \
--include='spot-lda-c' --include='spot-lda-c/*' --include 'top-1m.csv' --include='*.sh' \
--exclude='*' . $d:${LUSER}/ml
done

8 changes: 4 additions & 4 deletions ml_ops.sh
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ fi
# read in variables (except for date) from etc/.conf file
# note: FDATE and DSOURCE *must* be defined prior sourcing this conf file

source /etc/duxbay.conf
source /etc/spot.conf

# third argument if present will override default TOL from conf file

Expand Down Expand Up @@ -78,13 +78,13 @@ rm -f ${LPATH}/*.{dat,beta,gamma,other,pkl} # protect the flow_scores.csv file
hdfs dfs -rm -R -f ${HDFS_SCORED_CONNECTS}

# Add -p <command> to execute pre MPI command.
# Pre MPI command can be configured in /etc/duxbay.conf
# Pre MPI command can be configured in /etc/spot.conf
# In this script, after the line after --mpicmd ${MPI_CMD} add:
# --mpiprep ${MPI_PREP_CMD}

${MPI_PREP_CMD}

time spark-submit --class "org.opennetworkinsight.SuspiciousConnects" \
time spark-submit --class "org.apache.spot.SuspiciousConnects" \
--master yarn-client \
--driver-memory ${SPK_DRIVER_MEM} \
--conf spark.driver.maxResultSize=${SPK_DRIVER_MAX_RESULTS} \
Expand All @@ -101,7 +101,7 @@ time spark-submit --class "org.opennetworkinsight.SuspiciousConnects" \
--conf spark.shuffle.service.enabled=true \
--conf spark.yarn.am.waitTime=1000000 \
--conf spark.yarn.driver.memoryOverhead=${SPK_DRIVER_MEM_OVERHEAD} \
--conf spark.yarn.executor.memoryOverhead=${SPAK_EXEC_MEM_OVERHEAD} target/scala-2.10/oni-ml-assembly-1.1.jar \
--conf spark.yarn.executor.memoryOverhead=${SPAK_EXEC_MEM_OVERHEAD} target/scala-2.10/spot-ml-assembly-1.1.jar \
--analysis ${DSOURCE} \
--input ${RAWDATA_PATH} \
--dupfactor ${DUPFACTOR} \
Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
package org.opennetworkinsight
package org.apache.spot

import org.apache.spark.rdd.RDD
import java.io.PrintWriter
Expand All @@ -15,14 +15,14 @@ import scala.sys.process._
* 4. Calculates and returns probability of word given topic: p(w|z)
*/

object OniLDACWrapper {
object SpotLDACWrapper {

case class OniLDACInput(doc: String, word: String, count: Int) extends Serializable
case class SpotLDACInput(doc: String, word: String, count: Int) extends Serializable

case class OniLDACOutput(docToTopicMix: Map[String, Array[Double]], wordResults: Map[String, Array[Double]])
case class SpotLDACOutput(docToTopicMix: Map[String, Array[Double]], wordResults: Map[String, Array[Double]])


def runLDA(docWordCount: RDD[OniLDACInput],
def runLDA(docWordCount: RDD[SpotLDACInput],
modelFile: String,
topicDocumentFile: String,
topicWordFile: String,
Expand All @@ -35,19 +35,19 @@ object OniLDACWrapper {
localUser: String,
dataSource: String,
nodes: String,
prgSeed: Option[Long]): OniLDACOutput = {
prgSeed: Option[Long]): SpotLDACOutput = {

// Create word Map Word,Index for further usage
val wordDictionary: Map[String, Int] = {
val words = docWordCount
.cache
.map({case OniLDACInput(doc, word, count) => word})
.map({case SpotLDACInput(doc, word, count) => word})
.distinct
.collect
words.zipWithIndex.toMap
}

val distinctDocument = docWordCount.map({case OniLDACInput(doc, word, count) => doc}).distinct.collect
val distinctDocument = docWordCount.map({case SpotLDACInput(doc, word, count) => doc}).distinct.collect
//distinctDocument.cache()

// Create document Map Index, Document for further usage
Expand Down Expand Up @@ -113,7 +113,7 @@ object OniLDACWrapper {
// Create word results
val wordResults = getWordToProbPerTopicMap(topicWordData, wordDictionary)

OniLDACOutput(docToTopicMix, wordResults)
SpotLDACOutput(docToTopicMix, wordResults)
}

/**
Expand Down Expand Up @@ -147,18 +147,18 @@ object OniLDACWrapper {
}
}

def createModel(documentWordData: RDD[OniLDACInput], wordToIndex: Map[String, Int], distinctDocument: Array[String])
def createModel(documentWordData: RDD[SpotLDACInput], wordToIndex: Map[String, Int], distinctDocument: Array[String])
: Array[String]
= {
val documentCount = documentWordData
.map({case OniLDACInput(doc, word, count) => doc})
.map({case SpotLDACInput(doc, word, count) => doc})
.map(document => (document, 1))
.reduceByKey(_ + _)
.collect
.toMap

val wordIndexdocWordCount = documentWordData
.map({case OniLDACInput(doc, word, count) => (doc, wordToIndex(word) + ":" + count)})
.map({case SpotLDACInput(doc, word, count) => (doc, wordToIndex(word) + ":" + count)})
.groupByKey()
.map(x => (x._1, x._2.mkString(" ")))
.collect
Expand Down
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
package org.opennetworkinsight
package org.apache.spot

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.opennetworkinsight.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
import org.opennetworkinsight.dns.DNSSuspiciousConnectsAnalysis
import org.opennetworkinsight.netflow.FlowSuspiciousConnects
import org.opennetworkinsight.proxy.ProxySuspiciousConnectsAnalysis
import org.apache.spot.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
import org.apache.spot.dns.DNSSuspiciousConnectsAnalysis
import org.apache.spot.netflow.FlowSuspiciousConnects
import org.apache.spot.proxy.ProxySuspiciousConnectsAnalysis

/**
* Top level entrypoint to execute suspicious connections analysis on network data.
Expand Down Expand Up @@ -39,7 +39,7 @@ object SuspiciousConnects {
Logger.getLogger("akka").setLevel(Level.OFF)

val analysis = config.analysis
val sparkConfig = new SparkConf().setAppName("ONI ML: " + analysis + " lda")
val sparkConfig = new SparkConf().setAppName("Spot ML: " + analysis + " suspicious connects analysis")
val sparkContext = new SparkContext(sparkConfig)
val sqlContext = new SQLContext(sparkContext)
implicit val outputDelimiter = OutputDelimiter
Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
package org.opennetworkinsight
package org.apache.spot


/**
Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
package org.opennetworkinsight
package org.apache.spot

import org.apache.spark.broadcast.Broadcast

Expand Down
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
package org.opennetworkinsight.dns
package org.apache.spot.dns

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.opennetworkinsight.dns.model.DNSSuspiciousConnectsModel.ModelSchema
import org.apache.spot.dns.model.DNSSuspiciousConnectsModel.ModelSchema

/**
* Data frame schemas and column names used in the DNS suspicious connects analysis.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,17 +1,17 @@
package org.opennetworkinsight.dns
package org.apache.spot.dns

import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.opennetworkinsight.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
import org.opennetworkinsight.dns.DNSSchema._
import org.opennetworkinsight.dns.model.DNSSuspiciousConnectsModel
import org.opennetworkinsight.dns.sideinformation.DNSSideInformation
import org.apache.spot.SuspiciousConnectsArgumentParser.SuspiciousConnectsConfig
import org.apache.spot.dns.DNSSchema._
import org.apache.spot.dns.model.DNSSuspiciousConnectsModel
import org.apache.spot.dns.sideinformation.DNSSideInformation
import org.apache.log4j.Logger

import org.opennetworkinsight.dns.model.DNSSuspiciousConnectsModel.ModelSchema
import org.opennetworkinsight.dns.sideinformation.DNSSideInformation.SideInfoSchema
import org.apache.spot.dns.model.DNSSuspiciousConnectsModel.ModelSchema
import org.apache.spot.dns.sideinformation.DNSSideInformation.SideInfoSchema

/**
* The suspicious connections analysis of DNS log data develops a probabilistic model the DNS queries
Expand Down
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
package org.opennetworkinsight.dns
package org.apache.spot.dns

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.functions._
import org.opennetworkinsight.utilities.DomainProcessor.{DomainInfo, extractDomainInfo}
import org.opennetworkinsight.utilities.Quantiles
import org.apache.spot.utilities.DomainProcessor.{DomainInfo, extractDomainInfo}
import org.apache.spot.utilities.Quantiles


/**
Expand Down
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
package org.opennetworkinsight.dns.model
package org.apache.spot.dns.model

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.opennetworkinsight.dns.model.DNSSuspiciousConnectsModel.{ModelSchema, modelColumns}
import org.apache.spot.dns.model.DNSSuspiciousConnectsModel.{ModelSchema, modelColumns}

import scala.io.Source

Expand Down
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
package org.opennetworkinsight.dns.model
package org.apache.spot.dns.model

import org.apache.spark.broadcast.Broadcast
import org.opennetworkinsight.SuspiciousConnectsScoreFunction
import org.opennetworkinsight.dns.DNSWordCreation
import org.apache.spot.SuspiciousConnectsScoreFunction
import org.apache.spot.dns.DNSWordCreation


/**
Expand Down
Loading