Traffic Classification

The identification of applications in network traffic has become a prolific research topic in recent years. Traffic classification is crucial for classic network management tasks, such as traffic engineering and capacity planning. Traditional techniques that rely on transport-level protocol ports are no longer reliable, given the ever-changing nature of Internet traffic and the techniques that applications use to avoid detection (e.g., encryption, obfuscation). As a consequence, researchers are proposing a wide range of traffic classification solutions. However, although some proposals achieve high accuracy, the problem is far from being completely solved. The lack of shared tools and reference data makes the comparison and validation of the proposed techniques very difficult, which hinders a proper assessment of the achievements in this field.

Our group is involved in several research projects in the traffic classification field. Although our research covers many aspects of this area, we have special expertise in the following topics:

 

DATASETS

Probably the biggest obstacle to comparing and validating the different techniques proposed for network traffic classification is the lack of publicly available datasets. Mainly because of privacy issues, researchers and practitioners are usually not allowed to share their datasets with the research community. In order to address, or at least mitigate, this problem, our group regularly publishes the datasets used in its works. The publicly available datasets related to our works are described below. Special mention goes to the "Is our Ground-Truth for Traffic Classification Reliable?" dataset, which provides a set of reliably labeled pcap traces with full payload.

 

"Analysis of the impact of sampling on NetFlow traffic classification" Dataset

 This dataset is derived from the paper: 

Valentín Carela-Español, Pere Barlet-Ros, Albert Cabellos-Aparicio, and Josep Solé-Pareta: "Analysis of the impact of sampling on NetFlow traffic classification", Computer Networks 55 (2011), pp. 1083-1099. [pdf] [doi]

 

ABSTRACT

The traffic classification problem has recently attracted the interest of both network operators and researchers, given the limitations of traditional techniques when applied to current Internet traffic. Several machine learning (ML) methods have been proposed in the literature as a promising solution to this problem. However, very few can be applied to NetFlow data, while fewer works have analyzed their performance under traffic sampling. In this paper, we address the traffic classification problem with Sampled NetFlow, which is a widely extended protocol among network operators, but scarcely investigated by the research community. In particular, we adapt one of the most popular ML methods to operate with NetFlow data and analyze the impact of traffic sampling on its performance.

Our results show that our ML method is able to obtain accuracy similar to that of previous packet-based methods, but using only the limited information reported by NetFlow. Conversely, our results indicate that the accuracy of standard ML techniques degrades drastically with sampling. In order to reduce this impact, we propose an automatic ML process that does not rely on any human intervention and significantly improves the classification accuracy in the presence of traffic sampling.

 

DATASET

The evaluation dataset used in the paper "Analysis of the impact of sampling on NetFlow traffic classification" consists of seven traces collected at the Gigabit access link of the Universitat Politècnica de Catalunya (UPC), which connects about 25 faculties and 40 departments (geographically distributed in 10 campuses) to the Internet through the Spanish Research and Education network (RedIRIS). 

 

Name      #Flows       Date (dd-mm-yy)   Start time (duration)
UPC-I     2 985 098    11-12-08          10:00 (15 min.)
UPC-II    3 369 105    11-12-08          12:00 (15 min.)
UPC-III   3 474 603    12-12-08          16:00 (15 min.)
UPC-IV    3 020 114    12-12-08          18:30 (15 min.)
UPC-V     7 146 336    21-12-08          16:00 (1 h.)
UPC-VI    9 718 077    22-12-08          12:30 (1 h.)
UPC-VII   5 510 999    10-03-09          03:00 (1 h.)

 

The labeled traces are available as plain text files similar to a NetFlow v5 flow-print output, with the IP information removed and the corresponding application label, obtained by L7-Filter, appended to each flow record.

 

Pr SrcP DstP Pkts Octets StartTime EndTime Active B/Pk Ts Fl Application
06 50 114f 2 3000 0901.00:59:15.924 0901.00:59:17.924 2.000 1500 00 10 skypetoskype
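For convenience, a minimal parsing sketch of this format is shown below. It is only an illustration: it assumes whitespace-separated columns in exactly the order of the header above, and that flow-print reports the protocol and port numbers in hexadecimal (as suggested by the example record, where protocol 06 corresponds to TCP).

    # Minimal sketch: parse one line of the labeled flow-print-like format shown above.
    # Assumes whitespace-separated columns, in the same order as the header line.
    FIELDS = ["Pr", "SrcP", "DstP", "Pkts", "Octets", "StartTime",
              "EndTime", "Active", "B/Pk", "Ts", "Fl", "Application"]

    def parse_flow_line(line):
        """Return a dict mapping each header field to its value."""
        values = line.split()
        if len(values) != len(FIELDS):
            raise ValueError("unexpected number of columns: %d" % len(values))
        record = dict(zip(FIELDS, values))
        # Protocol and ports are assumed to be printed in hexadecimal by flow-print.
        for hex_field in ("Pr", "SrcP", "DstP"):
            record[hex_field] = int(record[hex_field], 16)
        return record

    # The example record above yields protocol 6 (TCP), source port 80 and the label "skypetoskype".
    example = ("06 50 114f 2 3000 0901.00:59:15.924 0901.00:59:17.924 "
               "2.000 1500 00 10 skypetoskype")
    print(parse_flow_line(example)["Application"])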

 

GROUND-TRUTH METHODOLOGY

In order to reduce the inaccuracy of L7-Filter, we use three rules:

We also perform a sanitization process in order to remove incorrect or incomplete flows that may confuse or bias the training phase. The sanitization process removes from the training set those TCP flows that are not properly formed (e.g., flows without TCP establishment or termination, flows with packet loss, or flows with out-of-order packets). However, no sanitization is applied to UDP traffic.
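The following sketch illustrates this sanitization step. It reflects our own simplified reading of "properly formed" (a visible TCP establishment at the beginning and a FIN or RST at the end); the packet-loss and reordering checks mentioned above are omitted, and the helper name is hypothetical.

    # Simplified sketch of the TCP sanitization check described above (hypothetical helper).
    # 'flags_sequence' is the list of TCP flag strings of the flow's packets, in order.
    def is_well_formed_tcp_flow(flags_sequence):
        if not flags_sequence:
            return False
        has_establishment = "SYN" in flags_sequence[0]            # connection setup observed
        has_termination = any("FIN" in flags or "RST" in flags    # connection teardown observed
                              for flags in flags_sequence[-2:])
        return has_establishment and has_termination

    # Flows failing this check would be removed from the training set;
    # UDP flows are kept untouched, as described above.
    print(is_well_formed_tcp_flow(["SYN", "SYN+ACK", "ACK", "PSH+ACK", "FIN+ACK", "ACK"]))  # True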

  

TRACE REQUEST

If you are interested in any of these labeled traces, please send an email to: monitoring email

  

"Is our Ground-Truth for Traffic Classification Reliable?" Dataset

 This dataset is derived from the papers: 

Valentín Carela-Español, Tomasz Bujlow, and Pere Barlet-Ros: "Is Our Ground-Truth for Traffic Classification Reliable?", In Proc. of the Passive and Active Measurement Conference (PAM'14), Los Angeles, CA, USA, March 2014. [pdf] [doi]

Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros: "Comparison of Deep Packet Inspection (DPI) tools for traffic classification", Technical Report, UPC-DAC-RR-CBA-2013-3, June 2013. [pdf]

 

ABSTRACT

The validation of the different proposals in the traffic classification literature is a controversial issue. Usually, these works base their results on a ground-truth built from private datasets and labeled by techniques of unknown reliability. This makes the validation and comparison with other solutions an extremely difficult task. This paper aims to be a first step towards addressing the validation and trustworthiness problem of network traffic classifiers. We perform a comparison between 6 well-known DPI-based techniques, which are frequently used in the literature for ground-truth generation. In order to evaluate these tools we have carefully built a labeled dataset of more than 500 000 flows, which contains traffic from popular applications. Our results present PACE, a commercial tool, as the most reliable solution for ground-truth generation. However, among the open-source tools available, NDPI and especially Libprotoident also achieve very high precision, while other, more frequently used tools (e.g., L7-Filter) are not reliable enough and should not be used for ground-truth generation in their current form.

  

DATASET

The dataset used in the paper "Is our Ground-Truth for Traffic Classification Reliable?" consists of 1 262 022 flows captured during 66 days, between February 25, 2013 and May 1, 2013, which account for 35.69 GB of pure packet data. The dataset has been artificially built in order to allow its publication with full packet payload. However, we manually simulated different human behaviours for each application studied in order to make it as representative as possible. The selected applications are listed in the table below.

The dataset consists of three pcap traces, one for each OS used (LX: Linux, W7: Windows 7, XP: Windows XP), and three INFO files, one for each pcap trace. Each line in the INFO file corresponds to a flow in the pcap trace and is described as follows:

 flow_id + "#" + start_time + "#" + end_time + "#" + local_ip + "#" + remote_ip + "#" + local_port + "#" + remote_port + "#" + transport_protocol + "#" + operating_system + "#" + process_name + "#" + HTTP Url + "#" + HTTP Referer + "#" + HTTP Content-type +"#" .
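A minimal sketch of how such a line can be parsed is given below. It assumes that the fields themselves never contain the '#' character; the field names follow the description above, and the file name in the usage example is hypothetical.

    # Minimal sketch: parse one line of the INFO files described above.
    # Fields are separated by '#' and each line ends with a trailing '#'.
    INFO_FIELDS = ["flow_id", "start_time", "end_time", "local_ip", "remote_ip",
                   "local_port", "remote_port", "transport_protocol",
                   "operating_system", "process_name",
                   "http_url", "http_referer", "http_content_type"]

    def parse_info_line(line):
        """Return a dict with one entry per INFO field."""
        parts = line.rstrip("\n").split("#")[:-1]            # drop the token after the final '#'
        parts += [""] * (len(INFO_FIELDS) - len(parts))      # defensive padding for short lines
        return dict(zip(INFO_FIELDS, parts))

    # Usage sketch (hypothetical file name):
    # with open("dataset.info") as f:
    #     flows = [parse_info_line(line) for line in f if line.strip()]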

The process name was present for 520 993 flows (41.28 % of all the flows), which account for 32.33 GB (90.59 %) of the data volume. Additionally, 14 445 flows (1.14 % of all the flows), accounting for 0.28 GB (0.78 %) of the data volume, could be identified based on the HTTP content-type field extracted from the packets. Therefore, we were able to successfully establish the ground truth for 535 438 flows (42.43 % of all the flows), accounting for 32.61 GB (91.37 %) of the data volume. The remaining flows are unlabeled because of their short lifetime (below 1 s), which made VBS, our ground-truth generator, unable to reliably establish the corresponding sockets. Only the successfully labeled flows are taken into account during the evaluation of the classifiers; a simplified sketch of this labeling decision is given after the table below. However, all the flows are included in the publicly available traces. This ensures data integrity and the proper operation of the classifiers, which may rely on the coexistence of different flows. We isolated several application classes based on the information stored in the database (e.g., application labels, HTTP content-type field). The classes, together with the number of flows and the data volume, are shown in the next table:

 

Application     #Flows    #Megabytes
eDonkey         176 581     2 823.88
BitTorrent       62 845     2 621.37
FTP                 876     3 089.06
DNS               6 600         1.74
NTP              27 786         4.03
RDP             132 907    13 218.47
NETBIOS           9 445         5.17
SSH              26 219        91.80
Browser HTTP     46 669     5 757.32
Browser RTMP        427     5 907.15
Unclassified    771 667     3 026.57
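As announced above, the sketch below summarizes the labeling decision in simplified form: the socket-derived process name takes precedence, the HTTP content type acts as a fallback, and everything else remains unlabeled. The returned label format is purely illustrative.

    # Simplified sketch of the ground-truth decision described above.
    # 'flow' is a dict such as the one returned by parse_info_line() earlier.
    def ground_truth_label(flow):
        if flow.get("process_name"):               # socket-based label (primary source)
            return flow["process_name"]
        if flow.get("http_content_type"):          # HTTP content-type fallback
            return "http:" + flow["http_content_type"]
        return None                                # typically too short-lived: left unlabeled

    # Flows for which ground_truth_label() returns None are kept in the published traces
    # but excluded from the evaluation of the classifiers, as explained above.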

 

For a more detailed description of the dataset, we refer the reader to the paper and technical report cited above.

  

GROUND-TRUTH METHODOLOGY

To collect and accurately label the flows, we adapted the Volunteer-Based System (VBS) developed at Aalborg University. The task of VBS is to collect information about Internet traffic flows (i.e., start time of the flow, number of packets contained in the flow, local and remote IP addresses, local and remote ports, transport-layer protocol) together with detailed information about each packet (i.e., direction, size, TCP flags, and timestamp relative to the previous packet in the flow). For each flow, the system also collects the name of the process associated with that flow, obtained from the system sockets. This way, we can reliably determine the application responsible for a particular flow. Additionally, the system collects some information about the HTTP content type (e.g., text/html, video/x-flv). The captured information is transmitted to the VBS server, which stores the data in a MySQL database. The source code was published under a GPL license. The modified version of the VBS client captures full Ethernet frames for each packet and extracts the HTTP URL and Referer fields. We also added a module called pcapBuilder, which is responsible for dumping the packets from the database to PCAP files. At the same time, INFO files are generated to provide detailed information about each flow, which allows us to assign each packet from the PCAP file to an individual flow.
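To illustrate how the INFO files allow each packet to be assigned to a flow, a small sketch follows. It assumes Scapy is available for reading the PCAP files and that the transport protocol in the INFO files is written as "TCP"/"UDP", and it matches packets by bidirectional 5-tuple only; the real assignment also uses the flow start and end times.

    # Sketch: assign packets from a PCAP trace to the flows described in an INFO file.
    # Matching by bidirectional 5-tuple alone is a simplification (time windows are ignored).
    from scapy.all import PcapReader, IP, TCP, UDP

    def flow_key(ip_a, port_a, ip_b, port_b, proto):
        """Direction-independent 5-tuple key."""
        return (proto,) + tuple(sorted([(ip_a, str(port_a)), (ip_b, str(port_b))]))

    def index_flows(flows):
        """Map each 5-tuple key to the flow_ids from parse_info_line() records."""
        table = {}
        for f in flows:
            key = flow_key(f["local_ip"], f["local_port"],
                           f["remote_ip"], f["remote_port"], f["transport_protocol"])
            table.setdefault(key, []).append(f["flow_id"])
        return table

    def assign_packets(pcap_path, table):
        """Yield (packet, candidate flow_ids) for every TCP/UDP packet in the trace."""
        with PcapReader(pcap_path) as reader:
            for pkt in reader:
                if IP not in pkt or (TCP not in pkt and UDP not in pkt):
                    continue
                l4 = pkt[TCP] if TCP in pkt else pkt[UDP]
                proto = "TCP" if TCP in pkt else "UDP"
                key = flow_key(pkt[IP].src, l4.sport, pkt[IP].dst, l4.dport, proto)
                yield pkt, table.get(key, [])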

 

TRACE REQUEST

If you are interested in any of these labeled traces, please send an email to: monitoring email

"Independent Comparison of Popular DPI Tools for Traffic Classification" Dataset

 This dataset is derived from the papers:

Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros: "Independent Comparison of Popular DPI Tools for Traffic Classification", Computer Networks 76 (2015), pp. 75-89. [pdf] [doi]

Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros: "Extended Independent Comparison of Popular Deep Packet Inspection (DPI) Tools for Traffic Classification", Technical Report, UPC-DAC-RR-CBA-2014-1, January 2014. [pdf]

 

ABSTRACT

Deep Packet Inspection (DPI) is the state-of-the-art technology for traffic classification. According to the conventional wisdom, DPI is the most accurate classification technique. Consequently, most popular products, either commercial or open-source, rely on some sort of DPI for traffic classification. However, the actual performance of DPI is still unclear to the research community, since the lack of public datasets prevents the comparison and reproducibility of their results. This paper presents a comprehensive comparison of 6 well-known DPI tools, which are commonly used in the traffic classification literature. Our study includes 2 commercial products (PACE and NBAR) and 4 open-source tools (OpenDPI, L7-filter, NDPI, and Libprotoident). We studied their performance in various scenarios (including packet and flow truncation) and at different classification levels (application protocol, application and web service). We carefully built a labeled dataset with more than 750 K flows, which contains traffic from popular applications. We used the Volunteer-Based System (VBS), developed at Aalborg University, to guarantee the correct labeling of the dataset. We released this dataset, including full packet payloads, to the research community. We believe this dataset could become a common benchmark for the comparison and validation of network traffic classifiers. Our results present PACE, a commercial tool, as the most accurate solution. Surprisingly, we find that some open-source tools, such as Libprotoident and NDPI, also achieve very high accuracy.

 

DATASET

The dataset used in the paper "Independent Comparison of Popular DPI Tools for Traffic Classification" consists of 767 690 flows, which account for 53.31 GB of pure packet data. The application name was present for 759 720 flows (98.96 % of all the flows), which account for 51.93 GB (97.41 %) of the data volume. The remaining flows are unlabeled because of their short lifetime (usually below 1 s), which made VBS unable to reliably establish the corresponding sockets. The dataset has been artificially built in order to allow its publication with full packet payload. However, we manually simulated different human behaviours for each application studied in order to make it as representative as possible.

The dataset consists of a pcap trace and an INFO file. Each line in the INFO file corresponds to a flow in the pcap trace and is described as follows:

 flow_id + "#" + start_time + "#" + end_time + "#" + local_ip + "#" + remote_ip + "#" + local_port + "#" + remote_port + "#" + transport_protocol + "#" + operating_system + "#" + process_name + "#" + HTTP Url + "#" + HTTP Referer + "#" + HTTP Content-type +"#" .

Unlike our previous paper "Is our Ground-Truth for Traffic Classification Reliable?", the classification in this paper was performed at three different levels. The first level studied is the application protocol level. The next table shows the content of the dataset at this level:

 

Application Protocol       #Flows    #Megabytes
DNS                        18 251          7.66
HTTP                       43 127      7 325.44
ICMP                          205          2.34
IMAP-STARTTLS                  35         36.56
IMAP-TLS                      103        410.23
NETBIOS Name Service       10 199         11.13
NETBIOS Session Service        11          0.01
SAMBA Session Service      42 808        450.39
NTP                        42 227          6.12
POP3-PLAIN                     26        189.25
POP3-TLS                      101        147.68
RTMP                          378      2 353.67
SMTP-PLAIN                     67         62.27
SMTP-TLS                       52          3.37
SOCKSv5                     1 927        898.31
SSH                        38 961        844.87
Webdav                         57         59.91

 

The second level of classification studied is the application level. The next table presents the distribution of the dataset by application:

 

Application                          #Flows    #Megabytes
4Shared                                 144         13.39
America's Army                          350         61.15
BitTorrent clients (encrypted)       96 399      3 313.98
BitTorrent clients (non-encrypted)  261 527      6 779.95
Dropbox                                  93        128.66
eDonkey clients (obfuscated)         12 835      8 178.74
eDonkey clients (non-obfuscated)     13 852      8 480.48
Freenet                                 135        538.28
FTP clients (active)                    126        341.17
FTP clients (passive)                   122        270.46
iTunes                                  235         75.40
League of Legends                        23        124.14
Pando Media Booster                  13 453         13.30
PPlive                                1 510         83.86
PPStream                              1 141        390.40
RDP Clients                         153 837     13 257.65
Skype (all)                           2 177        102.99
Skype (audio)                             7          4.85
Skype (file transfer)                     6         25.74
Skype (video)                             7         41.16
Sopcast                                 424        109.34
Spotify                                 178        195.15
Steam                                 1 205        255.84
TOR                                     185         47.14
World of Warcraft                        22          1.98

 

The last level studied is the web service level, which covers services accessed through web traffic. The classes, together with the number of flows and the data volume, are shown in the next table:

 

Web Service           #Flows    #Megabytes
4Shared                   98         68.42
Amazon                   602         51.02
Apple                    477         90.22
Ask                      171          1.86
Bing                     456         36.84
Blogspot                 235         10.53
CNN                      247          3.66
Craigslist               179          4.09
Cyworld                  332         13.06
Doubleclick            1 989         11.24
eBay                     281          8.31
Facebook               6 953        747.35
Go.com                   335         25.83
Google                 6 541        532.54
Instagram                  9          0.22
Justin.tv              2 326        126.33
LinkedIn                  62          2.14
Mediafire                472         27.99
MSN                      928         23.22
MySpace                    2          2.54
Pinterest                189          3.64
Putlocker                103         71.92
QQ.com                   753         10.46
Taobao                   387         24.29
The Huffington Post       71         21.19
Tumblr                   403         52.56
Twitter                1 138         13.67
Vimeo                    131        204.45
Vk.com                   343          9.59
Wikipedia              6 092        521.95
Windows Live              26          0.16
Wordpress                169         33.31
Yahoo                 17 373        937.07
YouTube                2 534      1 891.79

 

For a more detailed description of the dataset, we refer the reader to the paper and technical report cited above.

  

GROUND-TRUTH METHODOLOGY

To collect and accurately label the flows, we adapted the Volunteer-Based System (VBS) developed at Aalborg University. The task of VBS is to collect information about Internet traffic flows (i.e., start time of the flow, number of packets contained in the flow, local and remote IP addresses, local and remote ports, transport-layer protocol) together with detailed information about each packet (i.e., direction, size, TCP flags, and timestamp relative to the previous packet in the flow). For each flow, the system also collects the name of the process associated with that flow, obtained from the system sockets. This way, we can reliably determine the application responsible for a particular flow. Additionally, the system collects some information about the HTTP content type (e.g., text/html, video/x-flv). The captured information is transmitted to the VBS server, which stores the data in a MySQL database. The source code was published under a GPL license. The modified version of the VBS client captures full Ethernet frames for each packet and extracts the HTTP URL and Referer fields. We also added a module called pcapBuilder, which is responsible for dumping the packets from the database to PCAP files. At the same time, INFO files are generated to provide detailed information about each flow, which allows us to assign each packet from the PCAP file to an individual flow.

 

TRACE REQUEST

If you are interested in this labeled trace, please send an email to: monitoring email

 

PUBLICATIONS

The complete list of publications related to this group can be found here.