LEVERAGING METADATA AND MACHINE LEARNING TO ENHANCE INTRUSION ANALYSIS
Robert Rotsted
Senior Engineer
Reservoir Labs
About Reservoir Labs
− Founded in 1990
− Private and profitable business based in New York City
− Specializes in cyber security R&D and the commercialization of R&D
− R-Scope® Network Security Monitoring
  − Bro-powered appliance
  − 100 gigabits/second in 1RU

Agenda
− Survey of the use of machine learning in NIDS
− Overview of machine learning concepts
− Recommendations for the use of machine learning in NSM

Goals
− Emphasize the value of NSM
− Address the problem of context and the diminishing value of network payload data
− Propose ML as one method of combating these issues

NIDS / NSM Today
− Extract network metadata, deliver it to a search engine, hunt threats
− Match IOCs, alert
− Model an attack pattern, alert when the state of the network matches the modeled attack pattern
− Analyze metadata, draw conclusions

NIDS / NSM Today
− Extract network metadata, deliver it to a search engine, hunt threats
  − Pro
    − Agile
    − Discover previously unknown threats
  − Con
    − Costs a lot of analyst cycles

NIDS / NSM Today
− Match IOCs, alert
  − Pro
    − Discover previously known threats / vectors
    − Turn-key
  − Con
    − False positives
    − Misses unknown threats / vectors

NIDS / NSM Today
− Model an attack pattern, alert when the state of the network matches the modeled attack pattern
  − Pro
    − Discover previously known threats / vectors
    − Turn-key
    − Low false-positive rate
  − Con
    − Not agile
    − Misses unknown threats / vectors
    − Long development cycle

NIDS / NSM Today
− Analyze metadata, draw conclusions
  − Pro
    − Agile
    − Discover previously unknown threats
  − Con
    − False positives
    − False negatives

Why: Machine Learning in NSM
− Transition from a reactive, alert-driven workflow to a data-driven workflow
− A data-driven workflow requires data-driven tools

What: Machine Learning
− Machine learning is a subfield of computer science and statistics that deals with the construction and study of systems that can learn from data, rather than follow only explicitly programmed instructions.
  http://en.wikipedia.org/wiki/Machine_learning

What: Machine Learning
− Unsupervised
  − Bring structure to unlabeled data
  − Pro
    − Discover previously unknown data patterns
  − Con
    − Discover meaningless data patterns

What: Machine Learning
− Supervised
  − Infer a function from labeled data
  − Pros
    − Easy to interpret
    − Highly configurable
  − Cons
    − Subject to human error
    − Time consuming to construct

SOME HISTORY…
History: Machine Learning, Anomaly-Based IDS
− There is a long history of thought that NIDS would benefit from machine learning
− In the 1987 paper “An Intrusion-Detection Model”, Dorothy Denning described a NIDS that:
  “Profiles and checks for anomalous behavior whenever an audit record is generated or a period terminates. If abnormal behavior is detected, an anomaly record is generated”
− Cited over 3000 times according to Google Scholar!

Machine Learning, Anomaly Based NIDS
Indispensable since 1987? No way.
Some wisdom from the Sages …
“ … On using Machine Learning for IDS”
Sommer and Paxson wrote a paper in 2010 that said, essentially:
− Detecting adversaries with anomaly detection and ML algorithms is very difficult
− There is considerable research on the topic, but not much anomaly detection or ML is deployed in production NIDS

Paxson and Sommer’s Advice
− Improve: methodologies, inputs and outputs
  − Don’t apply machine learning, statistical analysis, or anomaly detection as a general solution
  − Define and extract relevant data features from network traffic to feed to ML algorithms
  − Deliver data to your users that enhances their contextual understanding of their environment and enhances automatability

CURATING DATA FOR MACHINE LEARNING

Extracting Data Features
− Feature extraction is a general term for methods of constructing combinations of the variables in order to simplify the task of accurately describing a large set of data.

Quantitative Features
− Quantitative features are numerical
  − Examples:
    − Sent byte count
    − Connection duration
    − Received packet count

Producer Consumer Ratio
− Introduced at FloCon in January by Carter Bullard and John Gerth
− All network nodes are producers and consumers of data
− Exfiltration is a modification of the highly granular and aggregated consumer / producer relationships of an organization of systems.

Producer Consumer Ratio
− A normalized value indicating directionality of application information transfer, independent of data load or rate.

$$\mathrm{PCR} = \frac{\text{application bytes sent} - \text{application bytes received}}{\text{application bytes sent} + \text{application bytes received}}$$

− PCR ranges from -1 to 1:
  − 1.0 = Pure push: FTP upload, multicast, beaconing
  − 0.4 = 70:30 export: sending email
  − 0.0 = Balanced exchange: NTP, ARP probe
  − -0.5 = 3:1 import: HTTP browsing
  − -1.0 = Pure pull: HTTP download

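The ratio is simple enough to compute by hand, but a minimal Python sketch makes the sign convention concrete. The function name and the choice to return 0.0 when no bytes moved are assumptions of this sketch, not something the slides specify.

```python
def producer_consumer_ratio(bytes_sent: int, bytes_received: int) -> float:
    """Return PCR in [-1.0, 1.0]; positive means the host mostly sends."""
    total = bytes_sent + bytes_received
    if total == 0:
        # No application data in either direction. Treating this as
        # balanced (0.0) is an assumption of this sketch.
        return 0.0
    return (bytes_sent - bytes_received) / total


# Reproduce the example points on the PCR scale.
examples = {
    "pure push (upload)": producer_consumer_ratio(1000, 0),
    "70:30 export": producer_consumer_ratio(70, 30),
    "balanced exchange": producer_consumer_ratio(50, 50),
    "3:1 import": producer_consumer_ratio(1, 3),
    "pure pull (download)": producer_consumer_ratio(0, 1000),
}
for name, pcr in examples.items():
    print(f"{name}: {pcr:+.2f}")
```

Because the value is normalized by total volume, a 70-byte and a 70-megabyte transfer with the same 70:30 split land on the same point of the scale.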
Baselining DNS PCR
− Why baseline DNS?
  − DNS is less adulterated than HTTP
  − DNS tunneling is a commonly cited exfiltration channel
− Observed PCR (on the scale from -1 to 1):
  − DNS: -0.72
  − DNS tunneling: 0.23
− A change in PCR can indicate DNS tunneling!

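One way to act on a DNS baseline is to aggregate DNS byte counts per host and flag hosts whose PCR drifts far from it. The sketch below is illustrative only: the record layout, the `ALERT_DELTA` threshold, and the sample byte counts are all hypothetical; only the -0.72 baseline comes from the slide.

```python
from collections import defaultdict

DNS_BASELINE_PCR = -0.72  # baseline DNS value shown on the slide
ALERT_DELTA = 0.5         # hypothetical threshold; tune for your network


def pcr(sent, received):
    total = sent + received
    return 0.0 if total == 0 else (sent - received) / total


def dns_pcr_by_host(records):
    """Aggregate (host, bytes_sent, bytes_received) tuples into per-host PCR."""
    totals = defaultdict(lambda: [0, 0])
    for host, sent, received in records:
        totals[host][0] += sent
        totals[host][1] += received
    return {host: pcr(s, r) for host, (s, r) in totals.items()}


def flag_deviations(pcr_by_host, baseline=DNS_BASELINE_PCR, delta=ALERT_DELTA):
    """Return hosts whose DNS PCR differs from the baseline by more than delta."""
    return sorted(h for h, v in pcr_by_host.items() if abs(v - baseline) > delta)


# Made-up DNS byte counts for two internal hosts.
records = [
    ("10.0.0.5", 60, 400),   # small queries, larger responses
    ("10.0.0.5", 80, 460),
    ("10.0.0.9", 900, 500),  # sends more than it receives over DNS
    ("10.0.0.9", 700, 500),
]
per_host = dns_pcr_by_host(records)
suspects = flag_deviations(per_host)
print(per_host, suspects)
```

A flagged host is a lead for an analyst, not a verdict; a fixed threshold like this one will need tuning against real traffic.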
Categorical Features
− Categorical features represent values that are labels
  − Examples:
    − Season (Spring, Winter, Summer, Fall)
    − Operating System (Windows, Linux, Mac OS)

Representing Categorical Variables
− A “dummy variable” is an artificial variable created to represent an attribute with two or more distinct categories/levels.
  https://www.moresteam.com/whitepapers/download/dummy-variables.pdf

Dummy Variable Trap
− Beware! Do not include a ‘dummy variable’ for every category of your variable
− Doing so leads to the problem of multicollinearity: each dummy variable is fully determined by the others, so the model cannot separate their effects
− Solution: when representing categorical data, represent all but one of the categories in your data with a dummy variable
  − Ex. Spring, Winter, Fall (omitting Summer)

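The "all but one" rule can be sketched in a few lines of plain Python. The `dummy_encode` helper below is hypothetical and written only for illustration; it encodes the season example with Summer as the omitted reference category.

```python
def dummy_encode(values, categories, reference):
    """Encode categorical values as 0/1 dummy variables, omitting `reference`."""
    columns = [c for c in categories if c != reference]
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value!r}")
        # The reference category is represented by all zeros.
        rows.append([1 if value == c else 0 for c in columns])
    return columns, rows


seasons = ["Spring", "Winter", "Summer", "Fall"]
columns, rows = dummy_encode(["Spring", "Summer", "Fall"], seasons, reference="Summer")
print(columns)  # dummy variable names
print(rows)     # one row of 0/1 values per input sample
```

Summer needs no column of its own: a row of all zeros already identifies it, which is exactly why a fourth dummy variable would be redundant.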
APPLYING MACHINE LEARNING
Analysts are important, perhaps too important.

Problems with ‘meatspace’
− It’s not scriptable …
− Memory corruption
− Sleep, meals …

Context should also live in ‘cyberspace’

Create your own features, derive context.
− Network data is fuzzy
− You need to make assumptions about hosts and sessions all the time based on imperfect information
− One thing that you may want to do with data coming off the network is use it to classify behavior or nodes, BUT HOW?

Context Aware NSM / NIDS
− Apply appropriate policies to hosts on your network
− Increase workflow automation
− Write policy in more expressive terms

Classification with Logistic Regression
− Logistic regression models the probability of a sample being a member of a class given its feature values
− Logistic regression is a supervised classification method
  http://www.datarobot.com/blog/classification-with-scikit-learn/

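To make "models the probability" concrete, here is a minimal sketch of the logistic function applied to a weighted sum of feature values. The weight and intercept are hypothetical numbers chosen for illustration; in practice they are learned from labeled training data.

```python
import math


def logistic(z: float) -> float:
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def class_probability(features, weights, intercept):
    """Probability that a sample belongs to class 1 given its feature values."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical model with a single feature, e.g. a per-minute count.
WEIGHTS = [0.8]
INTERCEPT = -4.0

p_low = class_probability([1], WEIGHTS, INTERCEPT)
p_high = class_probability([10], WEIGHTS, INTERCEPT)
print(f"feature=1 -> {p_low:.3f}, feature=10 -> {p_high:.3f}")
```

With these made-up parameters the probability crosses 0.5 at a feature value of 5; that crossing point is the decision boundary between the two classes.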
Example training data
Logistic Regression
[Figure: logistic regression plot; axis label: Probability]

Prototyping Supervised ML
1. Derive a theory
2. Explore the data
3. Curate the data
4. Test your theory
5. Evaluate results
6. Repeat

Derive a Theory
The count of unique HTTP referrers per minute is a good indicator for classifying a host as a human web browser

Explore the Data
index=rscope sourcetype=http http_referrer!="-" dst_ip!=10.* | timechart distinct_count(http_referrer) by src_ip

Curate the Data (Bro)
− Reservoir Labs has created a Bro framework for curating this type of data
  − Aggregate statistics about a connection in one line
  − Easily add and remove data features
  − Add labelling criteria

Curate the Data (Bro)
− Unique HTTP Referrers / Minute

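The Bro framework itself is not shown in the slides, so the following is only a Python sketch of the same feature: the number of distinct HTTP referrers per source host per minute. The event tuple layout and the sample values are hypothetical.

```python
from collections import defaultdict


def unique_referrers_per_minute(events):
    """Count distinct referrers per (src_ip, minute).

    `events` is an iterable of (epoch_seconds, src_ip, referrer) tuples.
    Returns {(src_ip, minute_start_epoch): distinct_referrer_count}.
    """
    seen = defaultdict(set)
    for ts, src_ip, referrer in events:
        if referrer in ("", "-"):
            continue  # skip requests that carried no referrer
        minute_start = int(ts) // 60 * 60
        seen[(src_ip, minute_start)].add(referrer)
    return {key: len(refs) for key, refs in seen.items()}


# Made-up HTTP events for illustration.
events = [
    (0, "10.1.1.7", "http://a.example/"),
    (12, "10.1.1.7", "http://b.example/"),
    (30, "10.1.1.7", "http://a.example/"),  # repeat referrer, not counted twice
    (61, "10.1.1.7", "http://c.example/"),  # falls in the next minute
    (5, "10.1.1.8", "-"),                   # no referrer
]
counts = unique_referrers_per_minute(events)
print(counts)
```

Each (host, minute) count becomes one quantitative feature value that can be paired with a label for training.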
Curate the Data
− Add labels

Logistic Regression POC
− Leverages the sklearn Python library
− First learns from training data, then applies the algorithm to test data to check its predictions
− Inputs data from the Bro ml.log
− Outputs the coefficient and score of the regression

Perform the Logistic Regression
− Execute: ./logistic-regression
  − Score
    − The mean accuracy on the given test data and labels
    − Higher value = more accurate

Evaluate the Results
− Prediction – Algorithm’s prediction for ‘label’
− Prob(0) – Probability that ‘label’ is 0
− Prob(1) – Probability that ‘label’ is 1
− Confidence – Signed distance from the Logistic Curve

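The `./logistic-regression` script is not reproduced in the slides, so the following is a rough sketch of how the same outputs might be obtained with scikit-learn, not the author's code. It assumes scikit-learn is installed, and the training and test samples are synthetic values invented for illustration (one feature: unique HTTP referrers per minute; label 1 = human web browser).

```python
from sklearn.linear_model import LogisticRegression

# Synthetic, made-up samples; real data would come from curated logs.
X_train = [[0], [1], [0], [2], [1], [1], [8], [10], [12], [9], [15], [11]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
X_test = [[0], [2], [9], [13]]
y_test = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)           # learn from the training data

score = model.score(X_test, y_test)   # mean accuracy on the test data
print("coefficient:", model.coef_[0][0])
print("score:", score)

predictions = model.predict(X_test)            # predicted label
probabilities = model.predict_proba(X_test)    # columns: Prob(0), Prob(1)
confidence = model.decision_function(X_test)   # signed distance to the boundary

for x, pred, (p0, p1), conf in zip(X_test, predictions, probabilities, confidence):
    print(f"x={x[0]:>2} prediction={pred} prob0={p0:.3f} prob1={p1:.3f} conf={conf:+.3f}")
```

A positive confidence value corresponds to a prediction of 1 and a negative value to a prediction of 0; values near zero are the samples the model is least sure about.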
Conclusions on Logistic Regression…
− Be wise!
− Supervised learning algorithms are only as good as you teach them to be
− Choose your features wisely; a poorly chosen feature can be very misleading
− Just because it works for me does not mean it will work for you!

Future Work
− Implement Logistic Regression in Bro
  − In progress
− Experiment with other ML algorithms and network metadata

Code available soon!
https://github.com/reservoirlabs/
Wrapping up…
− Statistics and machine learning can and will be really useful in NSM
− There are some common pitfalls in using these tools for attack detection; see “Outside the Closed World” for more details
− Today, statistical analysis and ML tools are relegated primarily to data scientists and commercial tools
− For open source NSM to yield more than forensic data and IOC matching, these tools need to be put in the hands of analysts!

Thanks!
− Follow Reservoir Labs on Twitter!
  − @Reservoir_Labs
− Follow me on Twitter!
  − @rrotsted