Novel applications of Machine Learning to Network Traffic Analysis

3.2.5 Synthetic data generation

The main principle behind all ML models is that they learn from data instead of following predefined rules written in an imperative way (the programming paradigm). This is why large, representative datasets are important: the objective is to create algorithms that generalize to data outside the data used for training. A dataset is representative if it includes samples that represent all the behaviours we try to model with our algorithm and avoids non-representative samples (noise). Since the behaviour of systems is often complex, their representative datasets are usually large.

When acquiring a representative dataset is hindered by cost, time, privacy or technical difficulties, and we end up with small datasets or datasets that lack sufficient samples of under-represented behaviours, we need to consider the use of synthetic data.

To create a dataset that can be used for model training, there are three alternatives [120]:

• Real data: data produced by its normal generation environment. This is the data we try to model with our ML algorithm.
• Semi-synthetic data: data generated by an artificial generation environment that tries to be similar to the normal generation environment of the data. The intention is to reproduce virtual entities (e.g. users, systems...) whose behaviour is similar to the real one, so that the data produced by the simulated environment is similar to, and representative of, the real data. The simulation can be based on physical entities (e.g. network, switches, computers...) or on software processes.
• Synthetic data: data created synthetically without using a simulated generation environment. This data is created to be similar to the real data (e.g. in correlation, probability distribution, patterns...). In this case, we synthesize the data directly instead of obtaining it by simulating the data generation environment.

There are pros and cons for all three alternatives [120]. The best option is, of course, a real and representative dataset. Since this is not always available, the next best option is to create semi-synthetic data that simulates the data generation process in a realistic way. In many cases, however, cost, time or technical difficulties leave synthetic data as the only available option. This last option can be problematic, since the generated data can be noisy and not representative of the original data. It is therefore important to devise good methods to generate synthetic data when no other option is feasible. Synthetic data should resemble the actual data, but with enough variability that it is not an exact copy of the original data.

Intrusion detection is a particularly interesting area for the generation of synthetic data. Acquiring a representative dataset can be costly and time-consuming even with a simulated environment. In addition, intrusion detection datasets are strongly biased towards normal traffic, and traffic associated with intrusion events is difficult to obtain.

For unbalanced datasets there are well-known over-sampling algorithms (SMOTE, ADASYN, ...) [5] that create synthetic data for the under-represented classes. The main idea of these algorithms is to create new samples close (under a defined distance measure) to existing samples that belong to a specific minority class. How the "distance" function is defined is therefore important, and defining it is not an easy task, as the many variants of the SMOTE algorithm demonstrate.

Doctoral Thesis: Novel applications of Machine Learning to NTAP - 37
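The over-sampling idea described in this section, creating new samples close to existing minority-class samples, can be illustrated with a minimal sketch. This is a simplified SMOTE-style interpolation written for illustration only, not the implementation referenced in [5]. The function names (smote_like, euclidean), the parameters and the toy feature vectors are all hypothetical, and the Euclidean distance is an assumption of the sketch.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance; this choice of distance is an assumption of the sketch."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote_like(minority, n_new, k=3, distance=euclidean, seed=0):
    """Create n_new synthetic samples, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by their distance to the base sample.
        others = [s for j, s in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda s: distance(base, s))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors (two numeric features each).
minority_samples = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_samples = smote_like(minority_samples, n_new=5)
for sample in new_samples:
    print([round(v, 3) for v in sample])
```

The distance function is passed in as a parameter because, as noted above, its definition largely determines which neighbours are used and therefore what the synthetic samples look like. Swapping in a different distance changes the neighbour ranking without touching the interpolation step.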
