How Libprotoident Works

Libprotoident uses a novel approach for traffic classification that requires only four bytes of payload to be retained for each packet, alleviating the storage and privacy concerns that are associated with deep packet inspection (DPI) approaches.

Libprotoident programs

When developing a libprotoident program, the user is responsible for reading packets from the capture source (using libtrace), assigning each packet to a bidirectional flow (biflow) and determining the direction of the packet.

Each biflow should have an lpi_data_t structure (which we'll call "LPI data" for short) associated with it. The LPI data must be initialised when the biflow is first observed. The example libprotoident programs use libflowmanager to perform the flow tracking and expiry of completed flows.

Valid direction values for a packet are either 0 or 1. All incoming traffic must use one value and all outgoing traffic must use the other, but it does not matter which way around they are.
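
As an illustration of the flow assignment and direction steps, the following Python sketch shows one possible way to map both directions of a conversation to a single biflow key and to derive a 0/1 direction value. It is not part of libprotoident or libflowmanager: the names biflow_key, packet_direction and LOCAL_NET are hypothetical, and treating a fixed local prefix as the source of "outgoing" traffic is an assumption made for the example.

```python
import ipaddress

# Hypothetical local network; packets sent from it count as outgoing.
LOCAL_NET = ipaddress.ip_network("192.0.2.0/24")


def biflow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return the same key for both directions of a flow."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    # Order the endpoints so that A->B and B->A produce an identical key.
    low, high = sorted([a, b])
    return (low, high, proto)


def packet_direction(src_ip):
    """Return 0 for outgoing packets and 1 for incoming packets."""
    return 0 if ipaddress.ip_address(src_ip) in LOCAL_NET else 1
```

Which value stands for incoming traffic is arbitrary, as long as the same convention is used for every packet.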

Each packet read by the program must be passed into the lpi_update_data function, along with the LPI data for the flow that the packet belongs to and the packet direction. This function extracts any necessary information from the packet and stores it in the LPI data.

Finally, the lpi_guess_protocol function can be used to guess the L7 protocol being used by a given biflow. This function takes the LPI data for that flow and returns a pointer to the protocol module that the flow matches.
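
The overall sequence of calls can be modelled as below. This is a Python sketch of the control flow only, not the C API: init_data, update_data and guess_protocol are hypothetical stand-ins for the initialisation step, lpi_update_data and lpi_guess_protocol, and the single toy rule exists purely so that the example runs.

```python
def init_data():
    # Stand-in for initialising the LPI data of a newly observed biflow.
    return {"payload": [None, None]}


def update_data(data, payload, direction):
    # Stand-in for lpi_update_data: keep the first four payload bytes
    # seen in each direction and ignore later packets.
    if payload and data["payload"][direction] is None:
        data["payload"][direction] = payload[:4]


def guess_protocol(data):
    # Stand-in for lpi_guess_protocol, using a single toy rule.
    if b"GET " in data["payload"]:
        return "HTTP"
    return "Unknown"


def classify(packets):
    """packets: iterable of (flow_key, direction, payload) tuples."""
    flows = {}
    for key, direction, payload in packets:
        # Initialise the LPI data when the biflow is first observed.
        if key not in flows:
            flows[key] = init_data()
        update_data(flows[key], payload, direction)
    # Guess once all packets have been fed in.
    return {key: guess_protocol(data) for key, data in flows.items()}
```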

The LPI Data Structure

The LPI Data structure stores all the information about a single biflow that is required for libprotoident to determine the application protocol being used by that flow. The following is stored within the structure:

  • The first four bytes of payload observed for each direction.
  • The size of the first payload-bearing packet in each direction (ignoring TCP/UDP/IP headers).
  • The transport protocol (e.g. TCP, UDP, ICMP).
  • The port numbers used by the flow.
  • The IP addresses used by the flow (if IPv4).
  • The sequence number of the first payload-bearing packet (if TCP) in each direction.
  • The total amount of payload observed in each direction (up to 32KB).

The sequence numbers and total payload are used to ensure that we handle reordered TCP segments correctly.

Note that libprotoident can actually ignore most of the segments/datagrams for a given flow, as we only care about the first payload-bearing packet sent in each direction.
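
To make the field list concrete, here is a Python sketch of a record holding the same information, with an update step that only looks at the first payload-bearing packet in each direction. The class and field names are hypothetical and do not mirror the real lpi_data_t layout, and the handling of reordered TCP segments is deliberately left out.

```python
from dataclasses import dataclass, field

MAX_OBSERVED = 32 * 1024  # cap on the payload total, per the 32KB limit


@dataclass
class FlowRecord:
    # Hypothetical model of the per-biflow state; two-element lists are
    # indexed by direction (0 or 1).
    proto: str = ""
    ports: tuple = (0, 0)
    ips: tuple = ("", "")
    first_payload: list = field(default_factory=lambda: [b"", b""])
    first_size: list = field(default_factory=lambda: [0, 0])
    first_seq: list = field(default_factory=lambda: [None, None])
    observed: list = field(default_factory=lambda: [0, 0])

    def update(self, direction, payload, seq=None):
        if not payload:
            return  # packets without payload carry nothing of interest
        if self.observed[direction] == 0:
            # First payload-bearing packet in this direction.
            self.first_payload[direction] = payload[:4]
            self.first_size[direction] = len(payload)
            self.first_seq[direction] = seq
        total = self.observed[direction] + len(payload)
        self.observed[direction] = min(total, MAX_OBSERVED)
```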

Determining the Protocol

When the lpi_guess_protocol function is called, the LPI data is tested against each of the supported application protocols until a match is found. Each supported protocol is implemented as a separate module which includes (amongst other things) the rule matching function and a priority value for that protocol. Priorities range from 1 (very high priority) to 255 (very low priority) and determine the order in which the rule matching functions are applied. The priority is based on our confidence in the rules for that protocol and the popularity of the application.
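
The priority-ordered search can be sketched in Python as follows. The Module tuple, the example modules and their priority values are hypothetical, and returning None when nothing matches is an assumption of this sketch rather than documented library behaviour.

```python
from typing import Callable, NamedTuple, Optional


class Module(NamedTuple):
    name: str
    priority: int  # 1 = tried first, 255 = tried last
    matches: Callable[[dict], bool]


def guess(modules, data) -> Optional[Module]:
    # Try the rule matching functions in ascending priority order and
    # stop at the first one that accepts the flow.
    for module in sorted(modules, key=lambda m: m.priority):
        if module.matches(data):
            return module
    return None


# Two toy modules that can both accept the same flow.
MODULES = [
    Module("weak_rule", 200, lambda d: d["port"] == 53),
    Module("strong_rule", 10, lambda d: d["payload"] == b"\x13Bit"),
]
```

When a flow satisfies both toy rules, the module with the lower priority number is returned, because its rule matching function is tried first.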

The rule matching function will return true if the LPI data meets the requirements for the application and false otherwise. The function will consist of one or more rules that must be met to result in a successful match. There are four types of rule that a protocol module can use to identify the application protocol for a given LPI data structure:

  • Payload Matches: The most common type of rule in libprotoident, where the four bytes of recorded payload for a given direction are compared against a known payload pattern for the protocol. For example, the BitTorrent rule matching function looks for the byte 0x13 followed by the characters 'B', 'i', 't'.
  • Payload Size: These rules require that the amount of payload observed in the first payload-bearing packet matches a given size. For example, one of the Skype rules requires that the initial payload in one direction is exactly 11 bytes. For some protocols (such as SMB), the first four bytes contain a length field which we can match against the size of that packet.
  • Port Number: Port numbers are often used to disambiguate cases where the payload could match multiple rules or to strengthen otherwise weak rules that would be prone to false positives. They are typically used for protocols that have a well-defined port number, e.g. DNS, MySQL.
  • IP Matching: Very rarely, a protocol will include the IP address of one of the biflow endpoints in the first four bytes (such as certain Gnutella UDP messages), which is why we store the IP addresses in the LPI data structure. However, this type of rule doesn't work if the IP addresses have been sanitised during the packet capture process.
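
Each of the four rule types can be expressed as a small predicate over the stored flow data. The Python sketch below uses a plain dict as a hypothetical stand-in for the LPI data. The function names are invented, the sizes and ports passed in are up to the caller, and the length-field layout (a big-endian length in the last three of the four bytes, excluding the four-byte header itself) is an assumption chosen for illustration rather than the actual SMB rule.

```python
import socket


def payload_match(data, direction, pattern):
    # Payload rule: compare the stored four bytes with a known pattern.
    return data["payload"][direction] == pattern


def size_match(data, direction, size):
    # Payload size rule: first payload-bearing packet has an exact size.
    return data["size"][direction] == size


def length_field_match(data, direction):
    # Size rule variant: a length embedded in the first four bytes must
    # agree with the packet size (assumed layout, see the text above).
    payload = data["payload"][direction]
    if len(payload) < 4:
        return False
    return int.from_bytes(payload[1:4], "big") == data["size"][direction] - 4


def port_match(data, port):
    # Port rule: either endpoint uses the well-known port.
    return port in data["ports"]


def ip_match(data, direction):
    # IP rule: the four payload bytes equal one endpoint's IPv4 address.
    packed = {socket.inet_aton(ip) for ip in data["ips"]}
    return data["payload"][direction] in packed
```

A real rule matching function would typically combine several of these checks, for example a weak payload pattern together with a port number.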

Libprotoident does NOT store any information about successful matches, such as a mapping of protocols to IP address and port combinations, which could otherwise be used to quickly identify subsequent biflows using the same IP and port even if they do not match any of the known rules. This is because storing state for IP address/port combinations is very memory-intensive, especially on busy links. Instead, libprotoident treats each flow independently. Users can still implement IP address tracking outside of libprotoident if they require it.
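
A user who does want this behaviour could keep a small cache alongside libprotoident. The following Python sketch is one hypothetical way to do so: EndpointCache and its methods are invented for the example, and the least-recently-used eviction is simply one way to keep the memory cost bounded.

```python
from collections import OrderedDict


class EndpointCache:
    """Remember the protocol last seen on an (IP, port) pair."""

    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def remember(self, ip, port, protocol):
        key = (ip, port)
        self.entries[key] = protocol
        self.entries.move_to_end(key)
        # Evict the least recently used entry once the cache is full.
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)

    def lookup(self, ip, port):
        key = (ip, port)
        protocol = self.entries.get(key)
        if protocol is not None:
            self.entries.move_to_end(key)
        return protocol
```

Such a cache would be consulted only for biflows that do not match any of the known rules, and populated whenever a biflow is matched successfully.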

Last modified on 10/19/11 11:46:50