In this work we provide a method to generate data with a probabilistic structure similar to that of intrusion detection data, which has both continuous and categorical features and is strongly unbalanced with respect to some of its associated labels. We generate the data conditioned on the specific class (label) to which we want the data to belong. That is, from a particular set of labels we generate training samples associated with that set of labels, reflecting real data that comes from those labels. We call the method Variational Generative Model (VGM). We use a generative model based on a Conditional Variational Autoencoder (CVAE), using the intrusion class labels as input. This modification provides an advantage, as we can readily generate new data using only the labels, without having to rely on specific training samples that represent or are associated with specific labels. Furthermore, the synthesized data can be used as additional training data to improve the classification results of common machine learning classifiers. These results also confirm that the synthesized data have a structure similar to, but not identical to, that of the original data, which is what allows them to improve the performance of a classifier.

The problem presented here can be considered similar to the one faced by classification with an imbalanced dataset, which is mainly addressed with four strategies [6, 7, 8]: resampling, cost-sensitive, algorithmic and ensemble. To compare VGM with equivalent approaches, we focus on resampling. Resampling can be achieved by creating new minority class samples (over-sampling) or by reducing the number of majority class samples (under-sampling). An effective way to perform over-sampling is to create new synthetic data that resembles the original data. The state-of-the-art (SOTA) algorithms in synthetic over-sampling are based on SMOTE [9] and its numerous variants [10, 11], whose main idea is to create new samples close in ‘distance’ to existing samples that belong to a specific minority class.
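
The interpolation idea behind SMOTE-style over-sampling can be illustrated with a minimal sketch. It assumes purely continuous features, Euclidean distance and a toy minority class; the function name `smote_like_samples` and the parameter `k` are hypothetical names chosen for this illustration, not the reference implementation of [9].

```python
import math
import random

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment joining base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with two continuous features (illustrative values only).
minority_class = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote_like_samples(minority_class, n_new=5)
```

Every synthetic point lies on a segment between two existing minority samples, which is why the choice of distance function and of `k` directly shapes the generated data.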
The different SMOTE variants consider alternative approaches to calculating the distance function and the proximity to majority class samples (borderline). To avoid possible over-fitting due to synthetic data, over-sampling and under-sampling methods can be combined [12, 13]. Another interesting method is ADASYN [14], which is similar to SMOTE but gives more weight to samples that are harder to learn (closer to other majority class samples). Finally, it is possible to perform ensemble sampling with methods similar to EasyEnsemble [15].

Our method (VGM) is a generative method to synthesize new data belonging to any class label. The main difference between VGM and SMOTE (and its variants) is that VGM is based on a latent probability distribution learned from data, instead of on a predefined ‘distance’ function. VGM does not need to assume any ‘distance’ function, or to impose rules on the importance of proximity to majority class samples, which would be additional hyper-parameters to explore. We provide a comparison of the proposed model with seven SOTA synthetic data generation algorithms (SMOTE, ADASYN...), showing that synthetic data generated by VGM provides better performance metrics (average accuracy and F1) when several common classifiers are trained with this data instead of data from the alternative generation algorithms.

To train the VGM model, we need original data from a well-known intrusion detection dataset, for which we have chosen the NSL-KDD dataset [16]. We have explored several architectures for VGM, considering different numbers of layers and nodes, regularization, loss functions and probability distributions for the output layer. We present the different options and the results obtained.

In summary, the contributions of this paper are:
(1) It is the first application of a conditional variational autoencoder to generate synthetic data in the intrusion detection field.
(2) We present original methods to show the similarity between real and synthetic data.
(3) VGM provides more useful synthetic samples than comparable SOTA over-sampling algorithms, as corroborated by the better performance (accuracy, F1) obtained by various classifiers when trained with synthetic data generated by VGM.

Doctoral Thesis: Novel applications of Machine Learning to NTAP
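
The label-only generation step described in this section can be sketched as follows. This is a rough illustration, not the VGM architecture: the decoder weights are random stand-ins for trained parameters, the encoder and the training loop are omitted, and all sizes and names (`N_CLASSES`, `LATENT_DIM`, `generate`, and so on) are hypothetical. It only shows how a latent vector drawn from the prior is combined with a one-hot label to produce a sample with continuous and categorical features.

```python
import math
import random

rng = random.Random(0)

N_CLASSES = 3      # hypothetical number of intrusion labels
LATENT_DIM = 2     # hypothetical latent size
HIDDEN_DIM = 8     # hypothetical hidden layer size
N_CONTINUOUS = 4   # hypothetical number of continuous output features
N_CATEGORIES = 3   # hypothetical values of one categorical output feature

def random_matrix(rows, cols):
    # Stand-in for trained weights: a real decoder would load learned values.
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def linear(weights, vector):
    # Matrix-vector product (biases omitted for brevity).
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

W_HIDDEN = random_matrix(HIDDEN_DIM, LATENT_DIM + N_CLASSES)
W_CONT = random_matrix(N_CONTINUOUS, HIDDEN_DIM)
W_CAT = random_matrix(N_CATEGORIES, HIDDEN_DIM)

def generate(label):
    """Draw one synthetic sample for `label` without using any real sample."""
    one_hot = [1.0 if i == label else 0.0 for i in range(N_CLASSES)]
    z = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # prior N(0, I)
    hidden = [max(0.0, h) for h in linear(W_HIDDEN, z + one_hot)]  # ReLU
    continuous = linear(W_CONT, hidden)            # continuous features
    cat_probs = softmax(linear(W_CAT, hidden))     # categorical distribution
    category = rng.choices(range(N_CATEGORIES), weights=cat_probs)[0]
    return continuous, category

# Five synthetic samples for the (hypothetical) label with index 2.
samples = [generate(label=2) for _ in range(5)]
```

Unlike the interpolation used by SMOTE-style methods, no distance function or neighbour search appears here: variation between samples comes only from the latent draw `z`, while the label selects which class the decoder is asked to reproduce.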