Traffic Classification

The identification of applications in network traffic has become a prolific research topic in recent years. Traffic classification is crucial for classic network management tasks, such as traffic engineering and capacity planning. Traditional techniques that rely on transport-level protocol ports are no longer reliable, given the ever-changing nature of Internet traffic and the techniques that applications use to avoid detection (e.g., encryption, obfuscation). As a consequence, researchers are proposing a wide range of traffic classification solutions. However, although some proposals achieve high accuracy, the problem is far from being completely solved. The lack of shared tools and reference data makes the comparison and validation of the proposed techniques very difficult, which hinders a proper assessment of the achievements in this field.

Our group is involved in several research projects in the traffic classification field. Although our research covers many aspects of this area, we have special expertise in the following topics:

 

DATASETS

Probably the biggest obstacle to comparing and validating the different techniques proposed for network traffic classification is the lack of publicly available datasets. Mainly because of privacy issues, researchers and practitioners are usually not allowed to share their datasets with the research community. In order to address, or at least mitigate, this problem, our group regularly publishes the datasets used in its works. The publicly available datasets related to our works are described below. Special mention goes to the "Is our Ground-Truth for Traffic Classification Reliable?" dataset, which provides a set of reliably labeled pcap traces with full payload.

 

"Analysis of the impact of sampling on NetFlow traffic classification" Dataset

 This dataset is derived from the paper: 

Valentín Carela-Español, Pere Barlet-Ros, Albert Cabellos-Aparicio, and Josep Solé-Pareta: "Analysis of the impact of sampling on NetFlow traffic classification", Computer Networks 55 (2011), pp. 1083-1099. [pdf] [doi]

 

ABSTRACT

The traffic classification problem has recently attracted the interest of both network operators and researchers, given the limitations of traditional techniques when applied to current Internet traffic. Several machine learning (ML) methods have been proposed in the literature as a promising solution to this problem. However, very few can be applied to NetFlow data, while fewer works have analyzed their performance under traffic sampling. In this paper, we address the traffic classification problem with Sampled NetFlow, which is a widely extended protocol among network operators, but scarcely investigated by the research community. In particular, we adapt one of the most popular ML methods to operate with NetFlow data and analyze the impact of traffic sampling on its performance.

Our results show that our ML method is able to obtain accuracy similar to that of previous packet-based methods, but using only the limited information reported by NetFlow. Conversely, our results indicate that the accuracy of standard ML techniques degrades drastically with sampling. In order to reduce this impact, we propose an automatic ML process that does not rely on any human intervention and significantly improves the classification accuracy in the presence of traffic sampling.

 

DATASET

The evaluation dataset used in the paper "Analysis of the impact of sampling on NetFlow traffic classification" consists of seven traces collected at the Gigabit access link of the Universitat Politècnica de Catalunya (UPC), which connects about 25 faculties and 40 departments (geographically distributed in 10 campuses) to the Internet through the Spanish Research and Education network (RedIRIS). 

 

Name      #Flows       Date (dd-mm-yy)   Start time (duration)
UPC-I     2 985 098    11-12-08          10:00 (15 min.)
UPC-II    3 369 105    11-12-08          12:00 (15 min.)
UPC-III   3 474 603    12-12-08          16:00 (15 min.)
UPC-IV    3 020 114    12-12-08          18:30 (15 min.)
UPC-V     7 146 336    21-12-08          16:00 (1 h.)
UPC-VI    9 718 077    22-12-08          12:30 (1 h.)
UPC-VII   5 510 999    10-03-09          03:00 (1 h.)

 

The labeled traces are available as plain text files similar to a NetFlow v5 flow-print output, with the IP information removed and the corresponding application label, obtained by L7-Filter, appended to each flow record.

 

Pr SrcP DstP Pkts Octets StartTime EndTime Active B/Pk Ts Fl Application
06 50 114f 2 3000 0901.00:59:15.924 0901.00:59:17.924 2.000 1500 00 10 skypetoskype
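For convenience, a minimal parsing sketch of this format is shown below. It is only an illustration: it assumes whitespace-separated columns in exactly the order of the header above, and that flow-print reports the protocol and port numbers in hexadecimal (as suggested by the example record, where protocol 06 corresponds to TCP).

    # Minimal sketch: parse one line of the labeled flow-print-like format shown above.
    # Assumes whitespace-separated columns, in the same order as the header line.
    FIELDS = ["Pr", "SrcP", "DstP", "Pkts", "Octets", "StartTime",
              "EndTime", "Active", "B/Pk", "Ts", "Fl", "Application"]

    def parse_flow_line(line):
        """Return a dict mapping each header field to its value."""
        values = line.split()
        if len(values) != len(FIELDS):
            raise ValueError("unexpected number of columns: %d" % len(values))
        record = dict(zip(FIELDS, values))
        # Protocol and ports are assumed to be printed in hexadecimal by flow-print.
        for hex_field in ("Pr", "SrcP", "DstP"):
            record[hex_field] = int(record[hex_field], 16)
        return record

    # The example record above yields protocol 6 (TCP), source port 80 and the label "skypetoskype".
    example = ("06 50 114f 2 3000 0901.00:59:15.924 0901.00:59:17.924 "
               "2.000 1500 00 10 skypetoskype")
    print(parse_flow_line(example)["Application"])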

 

GROUND-TRUTH METHODOLOGY

In order to reduce the inaccuracy of L7-Filter, we use three rules:

We also perform a sanitization process in order to remove incorrect or incomplete flows that may confuse or bias the training phase. The sanitization process removes from the training set those TCP flows that are not properly formed (e.g., flows without TCP establishment or termination, flows with packet loss, or flows with out-of-order packets). However, no sanitization is applied to UDP traffic.
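The following sketch illustrates this sanitization step. It reflects our own simplified reading of "properly formed" (a visible TCP establishment at the beginning and a FIN or RST at the end); the packet-loss and reordering checks mentioned above are omitted, and the helper name is hypothetical.

    # Simplified sketch of the TCP sanitization check described above (hypothetical helper).
    # 'flags_sequence' is the list of TCP flag strings of the flow's packets, in order.
    def is_well_formed_tcp_flow(flags_sequence):
        if not flags_sequence:
            return False
        has_establishment = "SYN" in flags_sequence[0]            # connection setup observed
        has_termination = any("FIN" in flags or "RST" in flags    # connection teardown observed
                              for flags in flags_sequence[-2:])
        return has_establishment and has_termination

    # Flows failing this check would be removed from the training set;
    # UDP flows are kept untouched, as described above.
    print(is_well_formed_tcp_flow(["SYN", "SYN+ACK", "ACK", "PSH+ACK", "FIN+ACK", "ACK"]))  # True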

  

TRACE REQUEST

If you are interested in any of these labeled traces, please send an email to: monitoring email

  

"Is our Ground-Truth for Traffic Classification Reliable?" Dataset

 This dataset is derived from the papers: 

Valentín Carela-Español, Tomasz Bujlow, and Pere Barlet-Ros: "Is Our Ground-Truth for Traffic Classification Reliable?", In Proc. of the Passive and Active Measurement Conference (PAM'14), Los Angeles, CA, USA, March 2014. [pdf] [doi]

Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros: "Comparison of Deep Packet Inspection (DPI) tools for traffic classification", Technical Report, UPC-DAC-RR-CBA-2013-3, June 2013. [pdf]

 

ABSTRACT

The validation of the different proposals in the traffic classification literature is a controversial issue. Usually, these works base their results on a ground-truth built from private datasets and labeled by techniques of unknown reliability. This makes the validation and comparison with other solutions an extremely difficult task. This paper aims to be a first step towards addressing the validation and trustworthiness problem of network traffic classifiers. We perform a comparison between 6 well-known DPI-based techniques, which are frequently used in the literature for ground-truth generation. In order to evaluate these tools we have carefully built a labeled dataset of more than 500 000 flows, which contains traffic from popular applications. Our results present PACE, a commercial tool, as the most reliable solution for ground-truth generation. However, among the open-source tools available, NDPI and especially Libprotoident also achieve very high precision, while other, more frequently used tools (e.g., L7-Filter) are not reliable enough and should not be used for ground-truth generation in their current form.

  

DATASET

The dataset used in the paper "Is our Ground-Truth for Traffic Classification Reliable?" consists of 1 262 022 flows captured during 66 days, between February 25, 2013 and May 1, 2013, which account for 35.69 GB of pure packet data. The dataset has been artificially built in order to allow its publication with full packet payload. However, we manually simulated different human behaviours for each application studied in order to make it as representative as possible. The selected applications are listed in the table below.

The dataset consists of three pcap traces, one for each OS used (LX: Linux, W7: Windows 7, XP: Windows XP), and three INFO files, one for each pcap trace. Each line in the INFO file corresponds to a flow in the pcap trace and is described as follows:

 flow_id + "#" + start_time + "#" + end_time + "#" + local_ip + "#" + remote_ip + "#" + local_port + "#" + remote_port + "#" + transport_protocol + "#" + operating_system + "#" + process_name + "#" + HTTP Url + "#" + HTTP Referer + "#" + HTTP Content-type +"#" .
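A minimal sketch of how such a line can be parsed is given below. It assumes that the fields themselves never contain the '#' character; the field names follow the description above, and the file name in the usage example is hypothetical.

    # Minimal sketch: parse one line of the INFO files described above.
    # Fields are separated by '#' and each line ends with a trailing '#'.
    INFO_FIELDS = ["flow_id", "start_time", "end_time", "local_ip", "remote_ip",
                   "local_port", "remote_port", "transport_protocol",
                   "operating_system", "process_name",
                   "http_url", "http_referer", "http_content_type"]

    def parse_info_line(line):
        """Return a dict with one entry per INFO field."""
        parts = line.rstrip("\n").split("#")[:-1]            # drop the token after the final '#'
        parts += [""] * (len(INFO_FIELDS) - len(parts))      # defensive padding for short lines
        return dict(zip(INFO_FIELDS, parts))

    # Usage sketch (hypothetical file name):
    # with open("dataset.info") as f:
    #     flows = [parse_info_line(line) for line in f if line.strip()]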

The process name was present for 520 993 flows (41.28 % of all the flows), which account for 32.33 GB (90.59 %) of the data volume. Additionally, 14 445 flows (1.14 % of all the flows), accounting for 0.28 GB (0.78 %) of the data volume, could be identified based on the HTTP content-type field extracted from the packets. Therefore, we were able to successfully establish the ground truth for 535 438 flows (42.43 % of all the flows), accounting for 32.61 GB (91.37 %) of the data volume. The remaining flows are unlabeled because of their short lifetime (below 1 s), which made VBS, our ground-truth generator, unable to reliably establish the corresponding sockets. Only the successfully labeled flows are taken into account during the evaluation of the classifiers; a simplified sketch of this labeling decision is given after the table below. However, all the flows are included in the publicly available traces. This ensures data integrity and the proper operation of the classifiers, which may rely on the coexistence of different flows. We isolated several application classes based on the information stored in the database (e.g., application labels, HTTP content-type field). The classes, together with the number of flows and the data volume, are shown in the next table:

 

Application     #Flows    #Megabytes
eDonkey         176 581     2 823.88
BitTorrent       62 845     2 621.37
FTP                 876     3 089.06
DNS               6 600         1.74
NTP              27 786         4.03
RDP             132 907    13 218.47
NETBIOS           9 445         5.17
SSH              26 219        91.80
Browser HTTP     46 669     5 757.32
Browser RTMP        427     5 907.15
Unclassified    771 667     3 026.57
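As announced above, the sketch below summarizes the labeling decision in simplified form: the socket-derived process name takes precedence, the HTTP content type acts as a fallback, and everything else remains unlabeled. The returned label format is purely illustrative.

    # Simplified sketch of the ground-truth decision described above.
    # 'flow' is a dict such as the one returned by parse_info_line() earlier.
    def ground_truth_label(flow):
        if flow.get("process_name"):               # socket-based label (primary source)
            return flow["process_name"]
        if flow.get("http_content_type"):          # HTTP content-type fallback
            return "http:" + flow["http_content_type"]
        return None                                # typically too short-lived: left unlabeled

    # Flows for which ground_truth_label() returns None are kept in the published traces
    # but excluded from the evaluation of the classifiers, as explained above.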

 

For a more detailed description of the dataset, we refer the reader to the paper and technical report cited above.

  

GROUND-TRUTH METHODOLOGY

To collect and accurately label the flows, we adapted the Volunteer-Based System (VBS) developed at Aalborg University. The task of VBS is to collect information about Internet traffic flows (i.e., start time of the flow, number of packets contained in the flow, local and remote IP addresses, local and remote ports, transport-layer protocol) together with detailed information about each packet (i.e., direction, size, TCP flags, and timestamp relative to the previous packet in the flow). For each flow, the system also collects the name of the process associated with that flow, obtained from the system sockets. This way, we can reliably determine the application responsible for a particular flow. Additionally, the system collects some information about the HTTP content type (e.g., text/html, video/x-flv). The captured information is transmitted to the VBS server, which stores the data in a MySQL database. The source code was published under a GPL license. The modified version of the VBS client captures full Ethernet frames for each packet and extracts the HTTP URL and Referer fields. We also added a module called pcapBuilder, which is responsible for dumping the packets from the database to PCAP files. At the same time, INFO files are generated to provide detailed information about each flow, which allows us to assign each packet from the PCAP file to an individual flow.
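To illustrate how the INFO files allow each packet to be assigned to a flow, a small sketch follows. It assumes Scapy is available for reading the PCAP files and that the transport protocol in the INFO files is written as "TCP"/"UDP", and it matches packets by bidirectional 5-tuple only; the real assignment also uses the flow start and end times.

    # Sketch: assign packets from a PCAP trace to the flows described in an INFO file.
    # Matching by bidirectional 5-tuple alone is a simplification (time windows are ignored).
    from scapy.all import PcapReader, IP, TCP, UDP

    def flow_key(ip_a, port_a, ip_b, port_b, proto):
        """Direction-independent 5-tuple key."""
        return (proto,) + tuple(sorted([(ip_a, str(port_a)), (ip_b, str(port_b))]))

    def index_flows(flows):
        """Map each 5-tuple key to the flow_ids from parse_info_line() records."""
        table = {}
        for f in flows:
            key = flow_key(f["local_ip"], f["local_port"],
                           f["remote_ip"], f["remote_port"], f["transport_protocol"])
            table.setdefault(key, []).append(f["flow_id"])
        return table

    def assign_packets(pcap_path, table):
        """Yield (packet, candidate flow_ids) for every TCP/UDP packet in the trace."""
        with PcapReader(pcap_path) as reader:
            for pkt in reader:
                if IP not in pkt or (TCP not in pkt and UDP not in pkt):
                    continue
                l4 = pkt[TCP] if TCP in pkt else pkt[UDP]
                proto = "TCP" if TCP in pkt else "UDP"
                key = flow_key(pkt[IP].src, l4.sport, pkt[IP].dst, l4.dport, proto)
                yield pkt, table.get(key, [])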

 

TRACE REQUEST

If you are interested in any of these labeled traces, please send an email to: monitoring email

"Independent Comparison of Popular DPI Tools for Traffic Classification" Dataset

 This dataset is derived from the papers:

Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros: "Independent Comparison of Popular DPI Tools for Traffic Classification", Computer Networks 76 (2015), pp. 75-89. [pdf] [doi]

Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros: "Extended Independent Comparison of Popular Deep Packet Inspection (DPI) Tools for Traffic Classification", Technical Report, UPC-DAC-RR-CBA-2014-1, January 2014. [pdf]

 

ABSTRACT

Deep Packet Inspection (DPI) is the state-of-the-art technology for traffic classification. According to the conventional wisdom, DPI is the most accurate classification technique. Consequently, most popular products, either commercial or open-source, rely on some sort of DPI for traffic classification. However, the actual performance of DPI is still unclear to the research community, since the lack of public datasets prevents the comparison and reproducibility of their results. This paper presents a comprehensive comparison of 6 well-known DPI tools, which are commonly used in the traffic classification literature. Our study includes 2 commercial products (PACE and NBAR) and 4 open-source tools (OpenDPI, L7-filter, NDPI, and Libprotoident). We studied their performance in various scenarios (including packet and flow truncation) and at different classification levels (application protocol, application and web service). We carefully built a labeled dataset with more than 750 K flows, which contains traffic from popular applications. We used the Volunteer-Based System (VBS), developed at Aalborg University, to guarantee the correct labeling of the dataset. We released this dataset, including full packet payloads, to the research community. We believe this dataset could become a common benchmark for the comparison and validation of network traffic classifiers. Our results present PACE, a commercial tool, as the most accurate solution. Surprisingly, we find that some open-source tools, such as Libprotoident and NDPI, also achieve very high accuracy.

 

DATASET

The dataset used in the paper "Independent Comparison of Popular DPI Tools for Traffic Classification" consists of 767 690 flows, which account for 53.31 GB of pure packet data. The application name was present for 759 720 flows (98.96 % of all the flows), which account for 51.93 GB (97.41 %) of the data volume. The remaining flows are unlabeled because of their short lifetime (usually below 1 s), which made VBS unable to reliably establish the corresponding sockets. The dataset has been artificially built in order to allow its publication with full packet payload. However, we manually simulated different human behaviours for each application studied in order to make it as representative as possible.

The dataset consists of a pcap trace and an INFO file. Each line in the INFO file corresponds to a flow in the pcap trace and is described as follows:

 flow_id + "#" + start_time + "#" + end_time + "#" + local_ip + "#" + remote_ip + "#" + local_port + "#" + remote_port + "#" + transport_protocol + "#" + operating_system + "#" + process_name + "#" + HTTP Url + "#" + HTTP Referer + "#" + HTTP Content-type +"#" .

Unlike our previous paper "Is our Ground-Truth for Traffic Classification Reliable?", the classification in this paper was performed at three different levels. The first level studied is the application protocol level. The next table shows the content of the dataset at this level:

 

Application Protocol       #Flows    #Megabytes
DNS                        18 251          7.66
HTTP                       43 127      7 325.44
ICMP                          205          2.34
IMAP-STARTTLS                  35         36.56
IMAP-TLS                      103        410.23
NETBIOS Name Service       10 199         11.13
NETBIOS Session Service        11          0.01
SAMBA Session Service      42 808        450.39
NTP                        42 227          6.12
POP3-PLAIN                     26        189.25
POP3-TLS                      101        147.68
RTMP                          378      2 353.67
SMTP-PLAIN                     67         62.27
SMTP-TLS                       52          3.37
SOCKSv5                     1 927        898.31
SSH                        38 961        844.87
Webdav                         57         59.91

 

The second level of classification studied is the application level. The next table presents the distribution of the dataset by application:

 

Application                          #Flows    #Megabytes
4Shared                                 144         13.39
America's Army                          350         61.15
BitTorrent clients (encrypted)       96 399      3 313.98
BitTorrent clients (non-encrypted)  261 527      6 779.95
Dropbox                                  93        128.66
eDonkey clients (obfuscated)         12 835      8 178.74
eDonkey clients (non-obfuscated)     13 852      8 480.48
Freenet                                 135        538.28
FTP clients (active)                    126        341.17
FTP clients (passive)                   122        270.46
iTunes                                  235         75.40
League of Legends                        23        124.14
Pando Media Booster                  13 453         13.30
PPlive                                1 510         83.86
PPStream                              1 141        390.40
RDP Clients                         153 837     13 257.65
Skype (all)                           2 177        102.99
Skype (audio)                             7          4.85
Skype (file transfer)                     6         25.74
Skype (video)                             7         41.16
Sopcast                                 424        109.34
Spotify                                 178        195.15
Steam                                 1 205        255.84
TOR                                     185         47.14
World of Warcraft                        22          1.98

 

The last level studied is the web service level, which covers services accessed through web traffic. The classes, together with the number of flows and the data volume, are shown in the next table:

 

Web Service           #Flows    #Megabytes
4Shared                   98         68.42
Amazon                   602         51.02
Apple                    477         90.22
Ask                      171          1.86
Bing                     456         36.84
Blogspot                 235         10.53
CNN                      247          3.66
Craigslist               179          4.09
Cyworld                  332         13.06
Doubleclick            1 989         11.24
eBay                     281          8.31
Facebook               6 953        747.35
Go.com                   335         25.83
Google                 6 541        532.54
Instagram                  9          0.22
Justin.tv              2 326        126.33
LinkedIn                  62          2.14
Mediafire                472         27.99
MSN                      928         23.22
MySpace                    2          2.54
Pinterest                189          3.64
Putlocker                103         71.92
QQ.com                   753         10.46
Taobao                   387         24.29
The Huffington Post       71         21.19
Tumblr                   403         52.56
Twitter                1 138         13.67
Vimeo                    131        204.45
Vk.com                   343          9.59
Wikipedia              6 092        521.95
Windows Live              26          0.16
Wordpress                169         33.31
Yahoo                 17 373        937.07
YouTube                2 534      1 891.79

 

For a more detailed description of the dataset, we refer the reader to the paper and technical report cited above.

  

GROUND-TRUTH METHODOLOGY

To collect and accurately label the flows, we adapted the Volunteer-Based System (VBS) developed at Aalborg University. The task of VBS is to collect information about Internet traffic flows (i.e., start time of the flow, number of packets contained in the flow, local and remote IP addresses, local and remote ports, transport-layer protocol) together with detailed information about each packet (i.e., direction, size, TCP flags, and timestamp relative to the previous packet in the flow). For each flow, the system also collects the name of the process associated with that flow, obtained from the system sockets. This way, we can reliably determine the application responsible for a particular flow. Additionally, the system collects some information about the HTTP content type (e.g., text/html, video/x-flv). The captured information is transmitted to the VBS server, which stores the data in a MySQL database. The source code was published under a GPL license. The modified version of the VBS client captures full Ethernet frames for each packet and extracts the HTTP URL and Referer fields. We also added a module called pcapBuilder, which is responsible for dumping the packets from the database to PCAP files. At the same time, INFO files are generated to provide detailed information about each flow, which allows us to assign each packet from the PCAP file to an individual flow.

 

TRACE REQUEST

If you are interested in this labeled trace, please send an email to: monitoring email

 

PUBLICATIONS

The complete list of publications related to this group can be found here.