Securing Internet Telephony Media with SRTP and SDP


Requirements for Secure IP Telephony
   Securing Internet Telephony Services
   Securing Internet Telephony Applications
   Summary of the Requirements for Secure RTP Bearer
Secure RTP
   Encryption of Payloads
   Replay Protection
   Authentication of Messages
   Key Derivation, Key Assignment, and Rekey
SDP Descriptions for SRTP Services
SRTP Policy Recommendations
   SSRC Initialization
   DTMF and Forward Error Correction (FEC)
Gateways and RTP Intermediate Systems
   SRTP Gateways
   SRTP Sessions Mixers, Translators, and Source-Specific Multicast
Firewalls and Network Address Translation
   Firewall/NAT Avoidance
   Application-Layer Gateways 
   Firewall Control
      Off-Path Signaling 
      On-Path Signaling
The SRTP Reference Implementation


This technical white paper is a practical guide for applying SRTP to voice, fax, and other IP telephony media. It is intended for engineers and gives an overview of IP telephony security and technical fundamentals of SRTP. Network administrators can use this paper to learn how to configure SRTP and SDP security services for various purposes. Non-engineers may find this technical white paper useful for its high-level treatment of the telephony-security problem, and a general technical introduction to the needed security services.

This document uses "must" to indicate a requirement for compliance, interoperability, or secure operation. "Should" is a practice that is highly recommended by this document, and "may" indicates an option that is at the discretion of the administrator or application designer.

1. Overview

An "internet" is a network of networks, possibly consisting of many different link types such as Ethernet and Wi-Fi. These networks support electronic mail, web browsing, and other Internet Protocol (IP) applications. More complex than IP networks alone, Internet telephony networks carry voice, fax, modem, and other media over both IP and switched telephone networks (STNs), such as public switched telephone networks. Figure 2-1 shows a call gateway connecting an IP network to an STN. This gateway signals a telephone call between an STN and IP network. The STN does not necessarily run IP and the gateway usually must allow IP telephony networks to match the services of STNs. Many customers want these services to be confidential and integrity-protected on IP networks. Section 2 defines the terms and states the security requirements for telephony media and keys.

Section 3 describes how to protect telephony media using Secure Real-Time Transport Protocol (SRTP) for encryption of the RTP packet payload, for authentication of the entire RTP packet, and for packet replay protection. Section 4 uses Session Description Protocol (SDP) security descriptions to describe the SRTP keys for SRTP streams. Section 5 summarizes SRTP and SDP policies for IP telephony applications.

SRTP keys may be used in SRTP intermediate systems as well as in end systems. Gateways and intermediate systems complicate security, however, and Section 6 covers recent work on security signaling and media handling in these devices.

Section 7 considers intermediate systems, such as firewalls and network address translators (NATs), that have a confounding effect on security and on the operation of voice over IP (VoIP) and other telephony services. Section 8 describes a reference implementation for SRTP.

2. Requirements for Secure IP Telephony

Figure 2-1. An IP and Switched Telephony Network

Figure 2-1 shows two different types of networks. The IP network, typified by the public Internet, connects computers that run IP telephony applications. The switched telephone network (STN) connects analog telephones, modems, and other devices to a private or public telephone network. The public telephone network is commonly referred to as the public switched telephone network (PSTN), but there are private switched telephone networks as well; the term “STN” refers to both. Customers of IP telephony services expect both the connectivity and services that are afforded by today’s STNs and IP networks. This shapes IP telephony signaling and its media bearer architectures. Customer security concerns are typically focused on the IP side of an Internet telephony service—many customers want to secure call and media data against snooping, forgery, replay, and denial of service (DoS) attacks on IP networks.

Although historically there has been no perceived need for security for residential and commercial telephone services, telephone networks are now numerous, varied, and under the governance of diverse organizations. Beginning primarily in the United States, telephone companies have proliferated with different local exchange carriers, long-distance carriers, and specialized carriers offering services. Hundreds of small Internet service providers (ISPs) provide communities with both wired and wireless services; hundreds of companies operate corporate networks with voice capabilities; homeowners operate home networks; and hobbyists engineer “personal telecommunications” operations to interconnect neighborhoods and communities. This network assembly offers many opportunities for hacker mischief, privacy violation, financial fraud, and subversion of the telephone services that billions of people depend upon every day. Telephone subscribers in the U.S. and elsewhere are subject to telemarketers, “slamming” practices of competing providers, fax spam, and marginally legal (or illegal) solicitation for various business or charitable schemes. At least some of the common problems of the public Internet, ranging from distributed DoS (DDoS) attacks to virtual identity theft, may eventually threaten our telephone networks.

This paper does not address STN security or the general problems of Internet telephony policies and legislation. It focuses on the security technologies for IP telephony signaling and media. The security of these devices and their proxies is considered next.

2.1. Securing Internet Telephony Services

Figure 2-2. Control and Media Channels

The end-to-end IP telephony configuration shown in Figure 2-2 extends from end system to end system. End systems are typically phones but one or both ends might be a media gateway to an STN. The first thing to note in Figure 2-2 is that signaling is managed by intermediate systems (signaling controllers, or proxies), not from end to end—one or more signaling intermediaries have access to the call messages. The handling of call signaling is generally regulated by national and international statutes, but because of the international scope and rapidly changing technologies involved, call signaling messages should use message authentication and payload encryption. These security services should be part of the signaling protocol, and are included in protocols such as Session Initiation Protocol (SIP), H.323, and Media Gateway Control Protocol (MGCP). Each of these protocols specifically addresses integrity and privacy protections for the call signaling data, using S/MIME multipart/signed messaging protocol, IP Security (IPSec), Transport Layer Security/Secure Sockets Layer (TLS/SSL), and PGP/MIME.

In addition to signaling, the telephony media (such as voice data) also need privacy and integrity protections. The media path is labeled “media bearer” in Figure 2-2. Media are carried in a media transport such as fax, modem, or audio/video transport. Audio/video payloads such as voice or dual-tone multifrequency (DTMF) media typically use RTP. To protect the privacy of these media, RTP packet payloads may use payload encryption. RTP packets should use message authentication to ensure the integrity of media data. This requirement on the RTP bearer also applies to RTP intermediate systems in the media path, which are shown in Figure 2-3.

Figure 2-3. RTP Mixers, Translators, and Other Intermediate Systems

An RTP intermediate system is typically a voice mixer for a multimedia conference or a gateway that performs transcoding to a particular type of network. Regardless of the function of the intermediate system, it likely will need to decrypt, re-encrypt, validate, and authenticate the packets that pass through it. This further reduces the security of the call—additional devices now have access to the keys and hence access to the media of the end systems.

In addition to their “middle-box” configurations, the characteristics of the networks that serve as media bearers also constrain the media transport. Some media bearer networks are optimized for voice data, which can use a lossy, low-bitrate cellular network link, for example. A short message authentication tag may be appropriate for voice data on a low-bitrate, lossy network, but other types of application media cannot be adequately secured using a short tag. Thus, the security of an Internet telephony connection is both application- and network-dependent. Section 3.3 discusses when short tags are appropriate.

2.2. Securing Internet Telephony Applications

Table 2-1. Diverse Security Needs of IP Telephony Applications

Telephony Application    Message Integrity    Payload Privacy
Voice                    Long or short        Optional
DTMF                     Long                 Recommended
Fax                      Long                 Optional
Modem                    Long or short        Recommended
Electronic purchases     Long                 Recommended
Emergency services       Long                 Optional

It is important to keep in mind that not all telephony applications are voice, and that telephony applications have different security requirements. Table 2-1 lists some of these applications and compares their message integrity and payload privacy needs.

Human conversations typically have implicit authentication—humans can recognize or remember a voice, and sometimes talk to strangers over the telephone. A short, 32-bit message tag will often suffice for a voice call; this is called “message authentication with a short tag” and is shown as “short” in Table 2-1. A longer, 80-bit tag provides a stronger message integrity check and is listed as “long” in the table.

DTMF [RFC 2833] and fax [RFC 2532] media can carry e-commerce transactions and other sensitive information. The default is to assume the need for strong integrity checking; thus, long message authentication is recommended in Table 2-1. Fax services that use RTP can use SRTP. Fax payload encryption may not be needed for a particular application, although message integrity protection generally is recommended for all IP telephony applications. DTMF payloads can carry phone banking and other transactions that include PINs, passwords, and credit card numbers, which should be encrypted.

The modem bearer is another application that commonly requires encryption. In some markets, such as cable Internet services in the U.S., encryption is mandated by law. This is due in part to the nature of link technology, which is conducive to eavesdropping. Use of the RTP modem-bearer payload type over SRTP is currently not defined in the standards, and is not further considered in this document.

Electronic commerce includes practices such as passing a bankcard number over the telephone (DTMF) or voice-activated commands that access financial data. When these services are carried in RTP, they can use SRTP to get strong message authentication with a long tag and payload encryption. Both are recommended for electronic commerce applications.

Other applications, such as emergency services, surveillance, and process control, typically require strong authentication and encryption. Although encryption is always optional, strong authentication is the recommended default because cryptographically strong integrity checks are almost always needed for IP telephony media. “Encryption” only indicates SRTP payload encryption—message encryption, rather than payload encryption, is the only way to encrypt an RTP payload header, and some applications require this level of privacy. Message encryption is not considered in this white paper, but IPSec encapsulation of the RTP or RTP Control Protocol (RTCP) message is often the best way to encrypt the packet header together with its payload.

2.3. Summary of the Requirements for Secure RTP Bearer

Table 2-2. IP Telephony Assets, Risks, and Threats

Assets
   Risks, Threats, and Protections

RTP packets, RTCP packets
   Information in the packet payload is often confidential to the user. Integrity of the header and payload is required against passive snooping and active impersonation attacks. SRTP message authentication and encryption protect these assets. If confidentiality of the packet header is needed, then IPSec ESP [RFC 2401] offers alternative or complementary protection.

SRTP keys and cryptographic parameters (SRTP cryptographic context)
   Keys for encrypting, decrypting, and integrity protecting the RTP packet asset need restricted access. Access controls are needed on the network and end system to protect against passive and active attacks; they need to be functionally specified with their security requirements. When keys are passed in call signaling, they are best protected along with the call signaling messages.

Call signaling traffic
   Messages used for establishing and terminating a telephone call. The encapsulating protocol of a call-signaling message such as SIP, H.323, or MGCP may also carry cryptographic keys [SDES]. Regardless of key bearer status, the signaling protocol must be protected against snooping or alteration. Such protection might come from using TLS, IPSec, S/MIME, or other data security protocols to protect the signaling messages.

RTP session address
   Only authorized participants in an RTP session should send packets to the RTP session address (network address plus port). To protect against DoS and other attacks, unauthorized packets must be discarded with minimal processing or memory consumption. Replay protection along with SRTP packet integrity can best protect this asset.

Device/user identity keys
   Methods for authentication and authorization include public-key cryptography and several password-based, shared key, and group key management schemes. The device/user keys are protected against unauthorized access by a key management protocol on the network and by access controls on the device.

Table 2-2 lists the primary IP telephony assets and what is required to protect them. This paper focuses on the first asset, RTP packets and RTCP packets, which needs three services:

1. SRTP confidentiality of RTP packets protects packet payloads from being read by entities without the secret encryption key.
2. SRTP message authentication of RTP packets protects the integrity of a packet against forgery, alteration, or replacement.
3. SRTP replay protection protects the session address against a DoS attack.

The requirements for one asset, however, cascade to another—packets are protected by an SRTP key, which usually is also protected by a call-signaling message. Access to these assets, in turn, is protected by device/user identity keys, which securely maintain long- and short-term identity keys such as public-key cryptography keys or pre-shared keys. Section 4 describes how SRTP keys are defined and carried; Section 3 tells how they are used.

3. Secure RTP

Figure 3-1. The Three Types of RTP Packets

Figure 3-1 shows three general types of RTP packets. The topmost packet in the figure is an RTP data packet. RTP capability resides mainly in four fields in the RTP packet header:

1. The definition of the particular RTP payload, called the “payload type.”
2. The timestamp value, which is specific to the particular payload type.
3. The sequence number, with the 16 low-order bits being carried in the sequence-number field of the RTP header and the 32 high-order bits being maintained by the SRTP protocol implementation.
4. The source of the RTP message, which is called the synchronization source (SSRC).

Every participant in an RTP session should have an SSRC and should send RTCP receiver reports (RRs) or sender reports (SRs); aggregate RTCP packets may consume up to five percent of the session bandwidth. The RTCP RR is the middle packet in Figure 3-1. It gives the sender information about the receiving participant, such as loss and other metrics on reception quality. The SR is the bottom packet in Figure 3-1; it identifies the sender by SSRC and contains metrics on what it is sending. The SR will also contain zero or more RR items, one for each stream received on its RTP session address. Additional items may be sent in an RTCP RR or SR, as described in the RTP specification; this is beyond the scope of this paper.

Each RTP session participant is an RTP and/or RTCP source that is identified by a unique SSRC value. The triple <IP address, UDP port, SSRC> identifies an SRTP “stream,” which has a corresponding second Secure RTCP (SRTCP) stream. The middle packet in Figure 3-2 shows the placement of SRTP confidentiality and integrity services in RTP and RTCP packets (the optional master key index is not shown in Figure 3-2—it is not intended for telephony applications). Confidentiality is optional and is used by applications that need private communications. A confidentiality service is obtained by encrypting the payload so that only the sender and receiver that are in possession of the keys can read it. An integrity service is obtained by running a one-way function on the message using a cryptographic key so that the receiver can ensure that the sender of the message possessed a secret key and that no party lacking that cryptographic key modified the message while in transit. The keys for these services are associated with the stream triple <IP address, UDP port, SSRC> and are called “SRTP cryptographic context.”

Figure 3-2. SRTP Encapsulation of RTP Packets

The cryptographic context contains the cryptographic keys, their parameters, and important SRTP stream state information (such as the rollover counter variable). The 16-bit sequence number from the RTP header (in the topmost packet of Figure 3-2) supplies the low-order bits, and the 32-bit SRTP rollover counter (ROC) stored in the cryptographic context supplies the high-order bits, of the 48-bit SRTP packet index for the particular packet (ROC * 2^16 + SEQ). The packet index is encrypted with other parameters to generate keystream segments, as shown in Figure 3-3. The sequence number of each RTP and RTCP packet must be correct for the encryption and decryption.

SRTP uses the RTP sequence number and does not change the RTP header, which is included in SRTP integrity protection as shown in Figure 3-2. It does change the RTCP packets in Figure 3-2 (middle and bottom packets), however, by adding an SRTCP sequence number and an “encrypted bit” to the RTCP header. The encrypted bit is set when the SRTCP payload is encrypted; the sequence number is needed for decrypting the packet using Advanced Encryption Standard-Counter Mode (AES-CM). These extensions are included in the integrity check, where an authentication tag is computed by the sender and validated by the receiver. SRTP adds an authentication tag at the end of all packets.

The integrity check runs a one-way function, Hash-based Message Authentication Code with Secure Hashing Algorithm 1 (HMAC-SHA1), over the header and payload using a secret key. The sender writes the HMAC-SHA1 hash into the authentication tag and the receiver runs the same computation and checks its result against the tag. If the two do not match, the message authentication is said to fail and the packet is discarded. For all SRTP packets, the message authentication coverage is over the RTP or SRTCP packet header (following the User Datagram Protocol [UDP] transport header) and payload. SRTP payload encryption covers the packet payload (which might be RR or SR payloads in the RTCP case).

SRTP extensions to the packet header are minimal by design—practically all SRTP information is stored in the cryptographic context. Session parameters, variables, keys, and a description of the services to be applied to RTP and RTCP packets are found in the cryptographic context, which uses RTP parameters such as SSRC and sequence number rather than duplicating them. Thus, there is no RTP header expansion from SRTP, which adds only an authentication tag of between four and ten bytes in length to the packet (and an optional master key index, which is designed for IPTV applications). The SRTP default encryption algorithm, AES-CM, does not require an explicit initialization vector (IV) in the packet but forms the IV from the packet index.

3.1. Encryption of Payloads

Figure 3-3. SRTP AES-CM Encryption

IP telephony data such as pulse-code modulation (PCM) or code excited linear prediction (CELP) signals are carried in RTP packet payloads. Modern encryption makes plaintext data appear to be random data [MVV], and SRTP default encryption sums a pseudorandom stream of data (the keystream) with the plaintext stream of IP telephony data. This is called “stream encryption,” and can be performed either byte for byte or bit for bit. If bytes are summed, then the addition is modulo 256 since there are eight bits to the byte and the sum ranges from 0 to 255. The decryption operation subtracts keystream bytes from the corresponding ciphertext bytes, modulo 256, to get the original plaintext. Binary computers efficiently add bits, however, so Figure 3-3 shows addition modulo 2, which is the exclusive-or (XOR) operation. XORing the keystream with the plaintext produces ciphertext; XORing it with the ciphertext produces plaintext. AES-CM is a parallelizable, randomly accessible cipher that allows keystream pre-computation and does not propagate bit errors [MODES].
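The two forms of stream encryption described above can be sketched in a few lines. This is an illustration only: the keystream is a fixed stand-in byte string rather than AES-CM output, and the function names are hypothetical.

```python
def add_mod256(data: bytes, keystream: bytes) -> bytes:
    """Byte-wise stream encryption: add keystream bytes modulo 256."""
    return bytes((d + k) % 256 for d, k in zip(data, keystream))


def sub_mod256(data: bytes, keystream: bytes) -> bytes:
    """Byte-wise decryption: subtract keystream bytes modulo 256."""
    return bytes((d - k) % 256 for d, k in zip(data, keystream))


def xor_bytes(data: bytes, keystream: bytes) -> bytes:
    """Bit-wise stream encryption (addition modulo 2); the same call decrypts."""
    if len(keystream) < len(data):
        raise ValueError("keystream is shorter than the data")
    return bytes(d ^ k for d, k in zip(data, keystream))


# Stand-in keystream for illustration; SRTP would generate it with AES-CM.
keystream = bytes.fromhex("3a91c4e07b5d2f18a6")
plaintext = b"PCM audio"

ciphertext = xor_bytes(plaintext, keystream)
recovered = xor_bytes(ciphertext, keystream)
```

Because XOR is its own inverse, one routine serves both the sender and the receiver, whereas the modulo-256 form needs separate add and subtract operations.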

Figure 3-3 shows only one keystream block, “B_i,j”, which is the AES encryption of the IV with key “k”. The bottom of Figure 3-3 shows that the IV is computed from the 48-bit packet index, the 32-bit SSRC, and the 112-bit salting key, “k_s”. All of these parameters are left-shifted and XORed as shown in the IV equivalence of Figure 3-3. The rightmost 16 bits are initialized to zero; successive blocks of the keystream are generated by incrementing the rightmost 16 bits from 0 to 2^16 - 1. Thus, the maximum number of blocks in an SRTP packet is 2^16 (65,536), or 2^20 bytes at 16 bytes per block. Each IV is encrypted with the key to produce a pseudorandom block of 128 bits, shown as B_i,j in Figure 3-3. Each 128-bit block is XORed with an associated block of RTP payload plaintext to produce a block of ciphertext, which covers either part of or the entire payload. Both the encryption and decryption processors run the keystream generator with the packet index, SSRC, and salting key k_s; each processor synchronously produces the keystream B_i,*, a stream of concatenated AES blocks.
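The IV construction can be sketched with plain integer arithmetic. Figure 3-3 is not reproduced here, so the shift amounts below are an assumption taken from the SRTP specification (RFC 3711): the salt and the packet index are each shifted left by 16 bits and the SSRC by 64 bits. The function names are hypothetical, and the AES encryption of each counter block is omitted.

```python
def srtp_aes_cm_iv(salt: bytes, ssrc: int, packet_index: int) -> int:
    """Form the 128-bit AES-CM IV from the 112-bit salt k_s, SSRC, and packet index."""
    if len(salt) != 14:
        raise ValueError("the salting key k_s must be 112 bits (14 bytes)")
    if not 0 <= ssrc < (1 << 32) or not 0 <= packet_index < (1 << 48):
        raise ValueError("SSRC is 32 bits and the packet index is 48 bits")
    k_s = int.from_bytes(salt, "big")
    # Assumed shifts (RFC 3711); the rightmost 16 bits are left as zero.
    return (k_s << 16) ^ (ssrc << 64) ^ (packet_index << 16)


def counter_block(iv: int, j: int) -> bytes:
    """Return the input to AES for keystream block j of this packet."""
    if not 0 <= j < (1 << 16):
        raise ValueError("the block counter is limited to 16 bits")
    return (iv + j).to_bytes(16, "big")


iv = srtp_aes_cm_iv(bytes(14), ssrc=0xDEADBEEF, packet_index=1)
first_block_input = counter_block(iv, 0)
second_block_input = counter_block(iv, 1)
```

Each counter block would then be encrypted with the session key k to yield one 128-bit keystream block.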

SRTCP payloads may also be encrypted, as shown in Figure 3-3. RTCP does not have a sequence number that can be used in the IV, however, so the entire packet index is carried in the SRTCP packet, whereas only the low-order 16 bits of the 48-bit packet index are carried in an SRTP packet (the 32-bit rollover counter is stored in the cryptographic context). SRTP uses a packet-index determination algorithm that is needed to identify the correct value for the ROC, because lost packets may cause a received packet to have an RTP sequence number (SEQ) that is more than one higher than the highest SEQ received (SL); reordered packets may cause a SEQ that has a lower value than SL.

The SRTP packet-index determination algorithm identifies the correct packet index, and hence the correct decryption key, even when packets are lost or reordered in transit. The algorithm tolerates reordering of up to 2^15 packets. Packets with a SEQ that is within 32,768 of SL (modulo 2^48) will be correctly associated with a packet index. Thus, SRTP packet-index determination is tolerant of loss and reordering.

Algorithm 1

if (SL < 2^15)
        if (SEQ - SL > 2^15)
                set v to (ROC-1) mod 2^32
        else
                set v to ROC, S'L to MAX(SL, SEQ)
else
        if (SL - 2^15 > SEQ)
                set v to (ROC+1) mod 2^32, ROC' to v, S'L to SEQ
        else
                set v to ROC, S'L to MAX(SL, SEQ)
return v*2^16 + SEQ

Algorithm 1 partitions the sequence-number space in half to find the closest ROC value for the packet with sequence number SEQ. If SL is in the lower half and SEQ exceeds it by more than 2^15, then SEQ is deemed to be an earlier packet that used the previous ROC value (ROC-1). If their difference is not greater than 2^15, then the SEQ packet is deemed to have the same ROC value as the SL packet. Complementary logic applies when SL is in the upper half of the sequence-number space: if SL exceeds SEQ by more than 2^15, then the SEQ packet is deemed to be a later packet that uses the next ROC value (ROC+1). Otherwise, it is assumed that the two packets used the same ROC value. Algorithm 1 sets v to one of the set {ROC-1, ROC, ROC+1}, modulo 2^32, and returns packet index v * 2^16 + SEQ. ROC is set to ROC', and SL is set to S'L, only after the packet is authenticated. Algorithm 2 improves upon Algorithm 1 by reducing the complexity by about a third for the most common case.

Algorithm 2

if (ABS(SL - SEQ) < 2^15)
        set v to ROC, S'L to MAX(SL, SEQ)
else if (SL < SEQ)
        set v to (ROC-1) mod 2^32
else
        set v to (ROC+1) mod 2^32, S'L to SEQ, ROC' to v
return v*2^16 + SEQ
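Algorithms 1 and 2 can be written as short functions for experimentation. The sketch below is a minimal rendering under stated assumptions: the function names are hypothetical, and the tentative updates to S'L and ROC' (which take effect only after authentication) are left to the caller.

```python
HALF = 1 << 15      # 2^15, half of the 16-bit sequence-number space
SEQ_MOD = 1 << 16   # 2^16
ROC_MOD = 1 << 32   # 2^32


def packet_index_alg1(seq: int, s_l: int, roc: int) -> tuple[int, int]:
    """Algorithm 1: return (v, packet index) for received sequence number seq."""
    if s_l < HALF:
        # SL is in the lower half: a much larger SEQ belongs to the previous ROC.
        v = (roc - 1) % ROC_MOD if seq - s_l > HALF else roc
    else:
        # SL is in the upper half: a much smaller SEQ has wrapped to the next ROC.
        v = (roc + 1) % ROC_MOD if s_l - HALF > seq else roc
    return v, v * SEQ_MOD + seq


def packet_index_alg2(seq: int, s_l: int, roc: int) -> tuple[int, int]:
    """Algorithm 2: one comparison settles the common (same-ROC) case."""
    if abs(s_l - seq) < HALF:
        v = roc
    elif s_l < seq:
        v = (roc - 1) % ROC_MOD
    else:
        v = (roc + 1) % ROC_MOD
    return v, v * SEQ_MOD + seq


# A packet that arrives just after the 16-bit sequence number wraps.
wrapped = packet_index_alg2(seq=5, s_l=65530, roc=3)
# A late packet from before the wrap, received after the wrap was seen.
late = packet_index_alg2(seq=65530, s_l=5, roc=4)
```

In the first case the estimate moves to ROC+1, and in the second it falls back to ROC-1, so both packets map to adjacent packet indices despite the wrap.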

3.2. Replay Protection

SRTP packet-index determination determines the index of an invalid packet as well as a valid packet, because there can be no integrity check until the authentication key is determined. SRTP replay protection is the first line of defense against packets sent by an attacker. The fact that Algorithms 1 and 2 allow packets to be as much as 2^15 packets out of order is acceptable for identifying the key, but an RTP application should not accept packets it cannot use. A replay window size of 64 is large enough for most RTP applications. An attacker who mounts a DoS or other attack using packets with bogus or previously received sequence numbers must know the current sequence number for those packets to reach message authentication. If the attacker chooses a sequence number at random, and the window size is 64, there is a 99.9-percent likelihood (1 - 64/2^16) that the packet will be discarded before more computationally intense message authentication is applied. Thus, the most effective replay protection is to reduce the CPU, memory, and other resources that get tied up by an attack, which is what SRTP replay protection does.

Figure 3-4. Packet Replay Window

Figure 3-4 shows a fixed-size window on the RTP sequence number space (SEQNUM) that indicates if a packet with the particular sequence number has been received. Packets within the window are accepted, and a packet higher than the window (SEQ = w' > w) causes it to be advanced. The sliding window is advanced to the highest sequence number that has been received only if the packet is successfully authenticated (w = w').

The SRTP default window size is 64, which means that a packet with a sequence number that is more than 64 packets behind the highest-numbered packet will be discarded. Packets that are more than 64 packets ahead of the window are discarded, and those within the window are discarded if the RECEIVED? bit shown in Figure 3-4 is set, which indicates that the particular packet has already been received. Thus, a bogus packet that is within 64 packets ahead of the highest sequence number received will pass replay protection. This will cause the window to be tentatively advanced; it will then fail the message authentication and the window will be restored. In that sense, the attack succeeded in forcing the receiver to run an HMAC-SHA1 hash against the packet before discarding it and restoring the replay window, SL, and ROC to their original values. HMAC-SHA1 limits the effects of forged packets since they are discarded and not decrypted. SRTP replay protection, in turn, limits the effects of forged packets that are outside the window and of replayed packets, since these are discarded and not authenticated.
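The window behavior described above can be sketched as a bitmap keyed to the highest authenticated packet index. This is a minimal illustration, not a reference implementation: the class and method names are hypothetical, the 64-packet limit on packets ahead of the window follows this paper's description, and the tentative advance is modeled by simply not committing a packet until it has been authenticated.

```python
class ReplayWindow:
    """Sliding replay window over packet indices (default size 64)."""

    def __init__(self, size: int = 64) -> None:
        self.size = size
        self.highest = None  # highest authenticated packet index so far
        self.bitmap = 0      # bit d set => packet (highest - d) was received

    def check(self, index: int) -> bool:
        """Return True if the packet may proceed to message authentication."""
        if self.highest is None:
            return True
        if index > self.highest:
            # Ahead of the window: accepted only within `size` packets.
            return index - self.highest <= self.size
        delta = self.highest - index
        if delta >= self.size:
            return False                       # too old
        return not (self.bitmap >> delta) & 1  # reject if RECEIVED? is set

    def commit(self, index: int) -> None:
        """Record the packet; call only after authentication succeeds."""
        if self.highest is None:
            self.highest, self.bitmap = index, 1
        elif index > self.highest:
            shift = index - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = index
        else:
            self.bitmap |= 1 << (self.highest - index)


def random_guess_discard_probability(window: int = 64, seq_bits: int = 16) -> float:
    """Chance that a randomly chosen sequence number is discarded early."""
    return 1 - window / 2 ** seq_bits


window = ReplayWindow()
window.commit(100)                  # first authenticated packet
replayed_ok = window.check(100)     # a replay of packet 100
forged_ok = window.check(150)       # forged packet ahead of the window
```

A forged packet such as index 150 passes `check` but is never committed once authentication fails, so the window state is unchanged afterward.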

3.3. Authentication of Messages

Figure 3-5. SRTP HMAC-SHA1 Message Authentication

Once the packet index for a packet is determined, the sender and receiver access the keys for encryption/decryption and message authentication. Message authentication is required for SRTCP and is recommended for SRTP. The updates w = w', SL = S'L, and ROC = ROC' shown in the previous sections must occur only if the packet passes the validity check.

SRTP message authentication reduces the contents of a packet to a 160-bit number using HMAC-SHA1; the output is unique for a given message authentication key. The holder of the key can verify if the RTP packet it has received is identical to the RTP packet that another key holder has sent. The sender and receiver run the same hash function on the packet concatenated with the ROC, as shown in Figure 3-5. The security of the HMAC-SHA1 integrity check depends on the size of the output tag, which an attacker can guess correctly with a probability of 2^(-n) for an n-bit tag. A 10-byte tag is recommended, but a short, 4-byte tag may be used for G.711 or G.729 packet flows, where the likelihood of a successful forgery is 2^-32 and the expected mean of this geometric distribution is 2^32 attempts before a success. A G.729 flow at 50 packets/second, therefore, has an expected time to a successful forgery of 2^32/50 seconds (about 2.7 years), which is less than one success in two years.
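The tag computation and the forgery estimate above can be sketched with Python's standard hmac and hashlib modules. The function names, the all-zero key, and the sample packet bytes are placeholders for illustration; in SRTP the authentication key would be derived from the master key.

```python
import hashlib
import hmac
import struct


def srtp_auth_tag(auth_key: bytes, packet: bytes, roc: int, tag_len: int = 10) -> bytes:
    """HMAC-SHA1 over the packet concatenated with the 32-bit ROC, truncated."""
    mac = hmac.new(auth_key, packet + struct.pack("!I", roc), hashlib.sha1)
    return mac.digest()[:tag_len]


def verify_tag(auth_key: bytes, packet: bytes, roc: int, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = srtp_auth_tag(auth_key, packet, roc, len(tag))
    return hmac.compare_digest(expected, tag)


def expected_years_to_forgery(tag_bits: int, packets_per_second: float) -> float:
    """Mean time until a random tag guess succeeds, at the given packet rate."""
    seconds = 2 ** tag_bits / packets_per_second
    return seconds / (365 * 24 * 3600)


key = bytes(20)                               # placeholder authentication key
packet = bytes.fromhex("8000000100000000deadbeef") + b"payload"
long_tag = srtp_auth_tag(key, packet, roc=0)  # 80-bit ("long") tag
short_tag = srtp_auth_tag(key, packet, roc=0, tag_len=4)  # 32-bit ("short") tag
years = expected_years_to_forgery(32, 50)     # G.729 flow at 50 packets/second
```

The short tag is simply the first four bytes of the same HMAC output, so the choice of tag length changes only how much of the hash is transmitted and checked.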

Periodic replacement of HMAC-SHA1 keys is recommended in the HMAC-SHA1 specification, but no lifetime bounds are known for HMAC-SHA1 keys [RFC 2104]. Nonetheless, when the SRTCP sequence number rolls over to zero, there is an opportunity for an old packet to be replayed by an attacker. For this reason, the lifetime of an AES-CM master key is 2^48 SRTP packets or 2^31 SRTCP packets, whichever threshold is reached first. The HMAC-SHA1 key is derived from the master key in SRTP.

3.4. Key Derivation, Key Assignment, and Rekey

Figure 3-6. Deriving SSRC Keys from the SRTP Master Key

The encryption key, “k”, in Figure 3-6 is shown on one of the right arcs of the figure as the SRTP or SRTCP encryption key, depending on whether the packet is an RTP or RTCP packet. In both cases, a single SRTP master key is input to the Key Derivation Function (KDF). The other input may be the SRTP packet index, derived using the RTP packet sequence number—or in the case of an RTCP packet, the SRTCP packet index is a value that SRTP appended to the RTCP header prior to authenticating (integrity checking) the SRTCP message that contains it. Thus, SRTP creates the several keys needed for SSRC packet encryption and authentication from a single master key. An SRTP master key can be described as follows:


This is the inline parameter for SDP security descriptions [SDES], explained in Section 4. The bytes of concatenated master key and salt are base-64 encoded and are followed by the key packet lifetime (2^20) and the optional master key index (index 1, which is 32 bits in length), which is not recommended for IP telephony. The master key and salt are concatenated into a 30-byte quantity that is 3:4 encoded (three bytes are encoded in four) and thus 40 bytes long, which is a valid SDP UTF-8 encoding.
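The encoding arithmetic can be checked with a short sketch. The "inline:" prefix and the "|" field separators are assumptions based on the SDP security descriptions syntax; the key and salt bytes are made-up placeholders, and the function names are hypothetical.

```python
import base64


def build_inline(master_key: bytes, master_salt: bytes,
                 lifetime: str = "", mki: str = "") -> str:
    """Base-64 encode key||salt and append the optional lifetime and MKI fields."""
    if len(master_key) != 16 or len(master_salt) != 14:
        raise ValueError("expected a 16-byte master key and a 14-byte salt")
    fields = [base64.b64encode(master_key + master_salt).decode("ascii")]
    fields += [f for f in (lifetime, mki) if f]
    return "inline:" + "|".join(fields)


def parse_inline(param: str) -> dict:
    """Split an inline parameter back into key, salt, lifetime, and MKI."""
    if not param.startswith("inline:"):
        raise ValueError("not an inline key parameter")
    fields = param[len("inline:"):].split("|")
    raw = base64.b64decode(fields[0])
    if len(raw) != 30:
        raise ValueError("key||salt must decode to 30 bytes")
    parsed = {"key": raw[:16], "salt": raw[16:], "lifetime": None, "mki": None}
    for field in fields[1:]:
        # An MKI field has the form value:length; the lifetime has no colon.
        parsed["mki" if ":" in field else "lifetime"] = field
    return parsed


# Placeholder key material for illustration only; real keys must be random.
key, salt = bytes(range(16)), bytes(range(16, 30))
param = build_inline(key, salt, lifetime="2^20", mki="1:32")
parsed = parse_inline(param)
```

Thirty input bytes always encode to exactly forty base-64 characters with no padding, which matches the 3:4 expansion noted above.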

Once the master key is installed and session keys are derived, SRTP and SRTCP encryption and authentication keys can be periodically refreshed when the key derivation rate is non-zero and is set to some period. A zero key-derivation rate, however, restricts the KDF to one invocation at the start of the session. A non-zero rate means that every time the packet-index modulo key derivation rate is zero, the KDF will be invoked and a new encryption and a new authentication key will be derived. Setting the key derivation rate to zero is recommended, and is further discussed in Section 5.

A single invocation of the SRTP KDF typically suffices for IP telephony applications. Repeated application of the KDF will not extend the life of an SRTP master key, which must not be applied to more than 2^48 SRTP packets or 2^31 SRTCP packets, whichever threshold is reached first. Owing to the so-called "Birthday Problem," the entropy of a well-chosen, 128-bit master key allows 2^64 blocks to be encrypted. This is 2^48 (2^(64-16)) packets; hence, the SRTP packet sizes must not exceed 2^20 bytes or 2^16 blocks in length. The recommended re-keying behavior is discussed in Section 5.
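The relationship between these limits is plain arithmetic, sketched below. The only assumption beyond the text is the 16-byte AES block size; the variable names are illustrative.

```python
AES_BLOCK_BYTES = 16  # AES block size: 128 bits

max_blocks_per_key = 2 ** 64   # birthday bound for a 128-bit block cipher
max_packets_per_key = 2 ** 48  # SRTP master-key packet lifetime

# Blocks available to each packet if all 2^48 packets are sent under one key.
max_blocks_per_packet = max_blocks_per_key // max_packets_per_key
max_packet_bytes = max_blocks_per_packet * AES_BLOCK_BYTES

print(max_blocks_per_packet == 2 ** 16)  # True
print(max_packet_bytes == 2 ** 20)       # True
```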

4. SDP Descriptions for SRTP Services

Figure 4-1. SDP Message with a=crypto Security Descriptions

      o=jdoe 2890844526 2890842807 IN IP4
      s=SDP Seminar
      i=A Seminar on the session description protocol
      u= (Jane Doe)
      c=IN IP4
      t=2873397496 2873404696
      m=video 51372 RTP/SAVP 31
      a=crypto:1 AES_CM_128_HMAC_SHA1_80
      m=audio 49170 RTP/SAVP 0
      a=crypto:1 AES_CM_128_HMAC_SHA1_32
      m=application 32416 udp wb

One cannot decrypt or verify an SRTP packet without decryption or authentication keys. In order to use the keys, however, the receiver needs to know the encryption cipher and mode, the authentication transform and tag length, the key derivation rate, and other information about the SRTP stream. This information is described with the media stream in SDP using a new SDP attribute, “a=crypto”. Figure 4-1 shows two a=crypto lines appearing in two media entries.

The first media entry (m=video 51372 RTP/SAVP 31) has a single a=crypto line that uses AES-CM encryption with a 128-bit key and HMAC-SHA1 message authentication with an 80-bit tag for both RTP and RTCP packets. This is the default SRTP cryptographic transform that would have been in effect if “AES_CM_128_HMAC_SHA1_80” did not appear on a=crypto. The inline key has a 16-byte (128-bit) master key with a 14-byte salting key concatenated to it; both are base-64 encoded into 40 UTF-8 bytes.

This message, with its plaintext key, needs to be authenticated, since tampering with security parameters must be detectable. The key needs to be encrypted and indecipherable to an unauthorized party. SSL can protect the SDP message, but proxies and relays reduce the end-to-end security to hop-by-hop security. For this reason, it is recommended that the security mechanisms of the encapsulating signaling protocol, such as SIP and MGCP, protect the SDP message with its security descriptions using an appropriate security protocol that maintains the confidentiality of the key if parts of the SDP message are inspected by intermediate systems. S/MIME or PGP/MIME allow the key to be encrypted while other parts of the message are unencrypted, but integrity protected, for access by intermediate systems.
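As a rough illustration of how a receiver might read such a line, the sketch below parses an a=crypto attribute with a single inline key, an optional lifetime, and an optional master key index. The class and function names are hypothetical, the key is a random placeholder, and session parameters and multiple keys are deliberately ignored.

```python
import base64
import os
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CryptoAttribute:
    tag: int
    suite: str
    master_key: bytes
    master_salt: bytes
    lifetime: Optional[int] = None          # maximum packets per master key
    mki: Optional[Tuple[int, int]] = None   # (MKI value, MKI length in bytes)


def parse_crypto_line(line: str) -> CryptoAttribute:
    """Parse 'a=crypto:<tag> <suite> inline:<key||salt>[|lifetime][|mki:len]'."""
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not an a=crypto line")
    fields = line[len(prefix):].split()
    if len(fields) < 3 or not fields[2].startswith("inline:"):
        raise ValueError("expected a tag, a crypto suite, and an inline key")
    parts = fields[2][len("inline:"):].split("|")
    raw = base64.b64decode(parts[0])
    if len(raw) != 30:
        raise ValueError("expected a 16-byte key plus a 14-byte salt")
    attr = CryptoAttribute(int(fields[0]), fields[1], raw[:16], raw[16:])
    for extra in parts[1:]:
        if ":" in extra:                    # master key index, e.g. "1:4"
            value, length = extra.split(":")
            attr.mki = (int(value), int(length))
        elif extra.startswith("2^"):        # lifetime as a power of two
            attr.lifetime = 2 ** int(extra[2:])
        else:                               # lifetime as a decimal count
            attr.lifetime = int(extra)
    return attr


# Placeholder key material, for illustration only.
key_text = base64.b64encode(os.urandom(30)).decode("ascii")
parsed = parse_crypto_line(f"a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:{key_text}|2^20|1:4")
print(parsed.suite, parsed.lifetime, parsed.mki)
```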

The second media entry (m=audio 49170 RTP/SAVP 0) has an a=crypto line for this media stream. Thus, keys are on a stream basis with the originator of the stream passing the receiver a key for that media stream. It is possible for the originator to offer multiple a=crypto lines, which is why the lines are numbered in Figure 4-1. The answerer selects an offer that is in accordance with its policy or completely rejects the offer [SDES].
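The answerer's choice can be sketched as a simple policy check. This is an assumption-laden illustration: the function name and the representation of offers as (tag, suite) pairs in the offerer's order of preference are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def select_crypto_offer(
    offers: List[Tuple[int, str]], acceptable_suites: Set[str]
) -> Optional[Tuple[int, str]]:
    """Return the first offered (tag, suite) that local policy accepts, or None to reject."""
    for tag, suite in offers:
        if suite in acceptable_suites:
            return tag, suite
    return None


offers = [(1, "AES_CM_128_HMAC_SHA1_32"), (2, "AES_CM_128_HMAC_SHA1_80")]

# A policy that only permits the 80-bit tag skips the first offer.
print(select_crypto_offer(offers, {"AES_CM_128_HMAC_SHA1_80"}))
# A policy that accepts nothing on offer rejects the media stream.
print(select_crypto_offer(offers, set()))
```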

Figure 4-2. Establishing Pair-Wise SRTP Cryptographic Contexts

All information needed for an SRTP cryptographic context is in an SDP message with security descriptions (Figure 4-2). The caller is sending to RTP transport address and RTCP address and port 51373. The inline statement has the master key; all parameters are set to their default values. The callee responds with similar information in its SDP message for the stream that it originates. As discussed above, the master key index (the inline value of 1:4 in Figure 4-2) is not recommended for IP telephony. The exchange of Figure 4-2 might be an offer/answer exchange [RFC 3264], although this is not shown in the figure.

SIP, MGCP, and Megaco signaling protocols use SDP and thus can be extended to support SDP security descriptions. When SIPS or other protocol-specific security is not used to protect the SDP security descriptions, generic data security protocols such as SSL, TLS, or IPSec should be used to authenticate an SDP message that contains a security description and to encrypt an SDP message that contains an inline key description. It is important to note, however, that the security association should be end-to-end and not hop-by-hop (Figure 2-1). For effective access control, the termination of the data-security connection should coincide with the end systems that are receiving or sending the SDP messages.

Access control (authentication and authorization) of IP telephony signaling and call media is a challenge, because of the absence of a suitable infrastructure, such as a public key infrastructure (PKI). This issue is considered in the summary chapter.

5. SRTP Policy Recommendations

Table 5-1. Recommended SRTP Defaults

Parameter                            Default    Recommended
SRTP cipher
SRTCP cipher
SRTP authentication
SRTCP authentication
SRTP HMAC tag length                            32 (voice), 80 (other)
SRTCP HMAC tag length
SRTP replay-window size
SRTCP replay-window size
Master key length
Session encryption key length
Session authentication key length
Master salt key length
Session salt key length
Key derivation rate
SRTP packets lifetime maximum
SRTCP packets lifetime maximum
MKI indicator
MKI length

The recommended settings for applying SRTP to IP telephony media are to use the SRTP default values. One exception to this rule is shown in Table 5-1—voice streams such as G.711 or G.729 may use a short, 32-bit SRTP HMAC tag length; this is highlighted in the table. A second exception is unencrypted SRTCP, which is recommended for IP telephony control. When to use a shorter tag is a security decision that the user, enterprise, or service provider makes as part of an overall security audit of the IP telephony service. Such an audit will consider physical security of devices in addition to the protocol security of the signaling and the media. The settings in Table 5-1 offer an excellent payload encryption and message authentication service. Every SRTP end system is sure to interoperate given these settings—all SRTP end systems naturally support the defaults.

Although there are many potential alternatives to the default settings, the defaults are recommended as the simplest configuration to a known level of protocol security, which is the current state-of-the-art in cryptographic security. All SRTP end systems support these settings according to the SRTP standard. The shorter, 32-bit tag for RTP voice, however, might not be supported by all SRTP systems, because it appears as a “may” in the standard. Any SRTP end system that supports an 80-bit tag can truncate this output to be 32 bits; nonetheless, failure to support a 32-bit alternative setting for HMAC-SHA1 could result from a lack of testing or programming in a particular SRTP product. Thus, an implementation must handle the case where a 32-bit authentication tag is rejected.
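The truncation described above can be illustrated with Python's standard hmac module. This is a simplified sketch: the function name is hypothetical, the 20-byte authentication key is an assumed size, and the message is an arbitrary byte string rather than a real SRTP packet.

```python
import hashlib
import hmac
import os


def truncated_hmac_sha1(auth_key: bytes, message: bytes, tag_bits: int) -> bytes:
    """Compute HMAC-SHA1 and keep only the leftmost tag_bits bits."""
    if tag_bits not in (32, 80):
        raise ValueError("this sketch supports 32-bit and 80-bit tags only")
    full_digest = hmac.new(auth_key, message, hashlib.sha1).digest()  # 20 bytes
    return full_digest[: tag_bits // 8]


auth_key = os.urandom(20)  # placeholder session authentication key
packet = b"example RTP header and payload"

tag80 = truncated_hmac_sha1(auth_key, packet, 80)
tag32 = truncated_hmac_sha1(auth_key, packet, 32)

# The 32-bit tag is simply the first four bytes of the 80-bit tag.
print(tag80[:4] == tag32)  # True
```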

As the basis for interoperability, SRTP defaults to the same cipher and mode for SRTP and SRTCP packets. Also, SRTP and SRTCP default to the same authentication transform and have the same cryptographic parameters (apart from a potentially short HMAC-SHA1 tag for SRTP voice packets). The default rekeying behavior is to not rekey an SRTP stream: the MKI length defaults to zero, so the default behavior of an SRTP system is not to change a key during an IP telephone call but to end the call. The effects of this policy are academic; the lifetime of an SRTP AES master key for G.729 is more than 178,000 years (2^48/50 pps) for SRTP and more than 27 years (2^31/(50 * 0.05) pps) for SRTCP. Thus, re-key is not recommended for IP telephony. Instead, the session should be ended and a new one established.
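The lifetime figures can be reproduced with a few lines of arithmetic. The sketch assumes a 365.25-day year; the 50 pps rate and the 5 percent SRTCP share are taken from the text.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

rtp_packets_per_second = 50.0                             # G.729 voice stream
rtcp_packets_per_second = rtp_packets_per_second * 0.05   # SRTCP at 5% of the RTP rate

srtp_years = 2 ** 48 / rtp_packets_per_second / SECONDS_PER_YEAR
srtcp_years = 2 ** 31 / rtcp_packets_per_second / SECONDS_PER_YEAR

print(f"SRTP master key lifetime: about {srtp_years:,.0f} years")
print(f"SRTCP master key lifetime: about {srtcp_years:.1f} years")
```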

5.1. SSRC Initialization

The parameters in Table 5-1 are bound with a master key to an SSRC; this binding completely describes a cryptographic context for sending SRTP and SRTCP packets. There are two AES-CM keystreams in an SSRC’s cryptographic context. One keystream is for the SRTP payloads and one is for the stream of SRTCP payloads. Figure 4-2 shows how these are installed through signaling and indexed by a transport address and SSRC, which signaling also establishes. The assignment of the SSRC by its owner allows the cryptographic context to be established prior to the commencement of RTP data packets.
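One way to picture this binding is a table of cryptographic contexts keyed by SSRC and transport address. The sketch below is a hypothetical data structure, not the layout used by any particular implementation; the address comes from a documentation range and the key material is random.

```python
import os
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

ContextKey = Tuple[int, str, int]  # (SSRC, destination address, destination port)


@dataclass
class CryptoContext:
    master_key: bytes
    master_salt: bytes
    rollover_counter: int = 0   # counts wraps of the 16-bit RTP sequence number
    highest_sequence: int = 0   # highest RTP sequence number seen so far


class ContextTable:
    """Holds the contexts that signaling installs before media starts to flow."""

    def __init__(self) -> None:
        self._contexts: Dict[ContextKey, CryptoContext] = {}

    def install(self, ssrc: int, address: str, port: int, context: CryptoContext) -> None:
        self._contexts[(ssrc, address, port)] = context

    def lookup(self, ssrc: int, address: str, port: int) -> Optional[CryptoContext]:
        return self._contexts.get((ssrc, address, port))


table = ContextTable()
table.install(0x1234ABCD, "192.0.2.10", 49170, CryptoContext(os.urandom(16), os.urandom(14)))

print(table.lookup(0x1234ABCD, "192.0.2.10", 49170) is not None)  # True
print(table.lookup(0x1234ABCD, "192.0.2.10", 49172))              # None
```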

5.2. DTMF and Forward Error Correction (FEC)

DTMF data can be carried by RTP using the RTP DTMF Payload Type [RFC 2833]. When protecting DTMF packets, the default is to use SRTP default encryption and SRTP default message authentication (80-bit tag). The default SRTP encryption transform, AES-CM, is not secure for DTMF data if SRTP message authentication does not check the integrity of the packet. This is true for many other IP telephony media types as well: because an additive stream cipher like AES-CM combines the keystream with the plaintext by XOR, an attacker can alter known plaintext within the packet, and this can only be detected through an integrity check. The SRTP defaults are therefore recommended for DTMF.
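The attack on an unauthenticated stream cipher can be demonstrated in a few lines. In this sketch a random byte string stands in for the AES-CM keystream, the payload is an illustrative text string rather than a real RFC 2833 event, and all names are hypothetical.

```python
import hashlib
import hmac
import os


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


plaintext = b"DIGIT:1 "
keystream = os.urandom(len(plaintext))  # stand-in for the AES-CM keystream
auth_key = os.urandom(20)               # placeholder authentication key

ciphertext = xor_bytes(plaintext, keystream)
tag = hmac.new(auth_key, ciphertext, hashlib.sha1).digest()[:10]  # 80-bit tag

# The attacker knows the plaintext but not the keystream or the keys.
delta = xor_bytes(b"DIGIT:1 ", b"DIGIT:9 ")
forged = xor_bytes(ciphertext, delta)

# Without an integrity check, the receiver decrypts the attacker's chosen digit.
decrypted = xor_bytes(forged, keystream)
print(decrypted)  # b'DIGIT:9 '

# With message authentication, the original tag no longer matches the forged packet.
forged_tag = hmac.new(auth_key, forged, hashlib.sha1).digest()[:10]
tag_matches = hmac.compare_digest(forged_tag, tag)
print(tag_matches)  # False
```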

The SRTP default for the FEC Payload Type is to apply FEC before the default SRTP encryption and message authentication transforms and then apply SRTP to the resultant FEC stream. An alternative to the default is to create the FEC stream from the ciphertext and apply FEC after SRTP. There is no need to encrypt the resulting FEC stream, but message authentication should still be applied to the FEC stream even if encryption is not.

6. Gateways and RTP Intermediate Systems

Figure 6-1. The IP Telephony Network of the Future?

Section 2.1 warns that signaling is hard to protect on IP telephony networks because of the use of proxy signaling, and advises that the IP media bearer might be less secure because of the widespread use of mixers and translators; if for no other reason, it is less secure because more devices on the Internet have the session keys. The problems of secure signaling and secure bearer are reduced when only a minimal number of end systems have keys and every end system can tell which devices have access to the data.

Many IP telephony architectures resemble Signaling System 7 (SS7) more closely than the end-to-end architecture of the Internet. Since IP services can operate over switched telephony networks, a simpler architecture results from running only IP telephony services on the STN (Figure 6-1). In principle, there is no reason why connection between IP telephony devices cannot proceed in the same way as IP host computers, which handle call setup between them.

When a call has more than two endpoints, such as ad-hoc, meet-me, or large-scale multicast conferencing calls, this becomes more complicated. The end-to-end conferencing architecture in Figure 6-1 shows no mixers; it uses end-system mixing to eliminate mixers in the network and source-specific multicast (SSM) control to use bandwidth efficiently. But even in an ideal situation, it is hard to eliminate mixers—especially in large-scale calls, where the aggregate flow of the voice can be large and several concurrent talkers need their signals to be mixed. Thus, mixers will likely persist inside the network if only to reduce the bandwidth needed for the call. In fact, it is harder to ensure the conversational integrity of multipoint calls without a mixer, audio bridge, or controller that sends each participant an exact copy of session data.

Mixers, audio bridges, controllers, and similar devices should support SRTP encryption and message authentication for each stream that they originate. Therefore, the mixer terminates all SRTP sessions. In small-group conferencing, the SRTP sessions can be pair-wise. In large-scale multicast conferencing, a single group SRTP multicast session might take the place of a certain number (“N”) of pair-wise sessions. For all applications, however, an SRTP gateway may serve to originate and terminate SRTP sessions as described in Section 6.1.

6.1. SRTP Gateways

Figure 6-2. SRTP Gateway with MGCP Interface

STN and IP gateways need SRTP to secure toll bypass, legacy devices, and other IP telephony products. There are many types of gateways; one important type is the trunking gateway between a switched telephony network and an IP network. Such a gateway manages a large number of digital circuits. The SRTP trunking gateway shown in Figure 6-2 accepts configuration packages from a call agent running MGCP; the “SRTP package” is provided by Flemming Andreasen.

The SRTP package defines SRTP sessions to an SRTP gateway. The package takes the information in SDP security descriptions, and additional information, and translates it to MGCP connection parameters, events and signals, connection options, and even an offer/answer type of capability. Naturally, the protocol exchanges between the call agent and SRTP gateway need protection—an authenticated and encrypted IPSec tunnel is recommended for MGCP delivery of SRTP packages.

Gateways translate RTP/RTCP packets to and from SRTP/SRTCP packets. The SRTP gateway product from Cisco Systems® is built into Cisco IOS® Software for maximum utility to Cisco’s product families.

6.2. SRTP Sessions Mixers, Translators, and Source-Specific Multicast

Figure 6-3. SRTP Group Conference with Four SRTP Sessions

Figure 6-3 shows a small-group conference of the sort that happens daily in business. Only the media path is shown. A small group conference has as many as dozens of participants and a large-scale multicast conference has dozens to thousands of participants. To simulate physical conversation, mixers are placed in the path of the voice signals to multiplex two or more concurrent talkers into a single stream (the mixer is incorporated in the audio bridge in Figure 6-3). Without a mixer, it would be necessary to send up to N voice streams to each end system—a waste of bandwidth that also limits where each conference participant can reside.

In order to protect the privacy and integrity of voice streams, the mixing must also be protected. For example, it does not make sense for the audio mixer to not authenticate itself and its messages to the end systems if it has access to their confidential data. Thus, a secure mixer is not transparent; it is an SRTP end system. We can protect mixed sessions in one of two ways:

      1. Maintain N SRTP cryptographic contexts in the mixer, one for each end system.
      2. Maintain in the mixer one SRTP cryptographic context using a group key.

The first way is the best design for a first product for small groups. The second approach is attractive when the number of end systems is very large, such as in Cisco’s large-scale multicast product.

Figure 6-4. SRTP Conference Control with One Group SRTP Session

IP telephony services need an effective end-to-end model where the end systems hold the keys, and the keys are not shared with intermediate systems. The functions that are removed from the intermediate systems need to be performed in the end system, but the network can help reduce bandwidth with SSM service as shown in Figure 6-4. IP routers with appropriate quality of service (QoS), SSM forwarding, and SSM pruning replace intermediate systems like mixers in the figure.

For example, assume that the policy in the conference of Figure 6-4 is to mix up to four concurrent talkers in a group that could be much larger than four participants. Further assume that all talking participants send 16-kbps voice streams. If there is video as well as voice, SSM (Internet Group Management Protocol v3 [IGMPv3]) pruning allows each end-system participant to choose one or more video feeds while pruning others. The voice packets from each participant get forwarded, up to a maximum of four concurrent flows; beyond this, the SSM router will discard voice packets. Thus, the voice is not mixed in the network but the aggregate voice flow is kept to a maximum data rate of 4 * 16 kbps = 64 kbps. Packet overhead will add to this rate: the actual data rate includes the packet header and tag overhead, which is small relative to the security benefit.
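The aggregate rate and the contribution of the authentication tag can be estimated as follows. The packet rate of 50 pps per talker is an assumption carried over from the G.729 example in Section 5; header overhead is left out because it depends on the network layers in use.

```python
MAX_TALKERS = 4
VOICE_KBPS = 16.0
PACKETS_PER_SECOND = 50  # assumed packet rate per talker


def tag_overhead_kbps(tag_bits: int) -> float:
    """Bandwidth added by the authentication tags of all concurrent talkers."""
    return MAX_TALKERS * PACKETS_PER_SECOND * tag_bits / 1000.0


aggregate_voice_kbps = MAX_TALKERS * VOICE_KBPS

print(aggregate_voice_kbps)   # 64.0
print(tag_overhead_kbps(80))  # 16.0
print(tag_overhead_kbps(32))  # 6.4
```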

An RTP multiplexer of encrypted payloads can reduce the packet overhead by inserting the concurrent voice payloads into one packet with the contributing source’s SSRCs advertised in the RTP CSRC fields of the packet header. The multiplexer will be multiplexing encrypted payloads and will not need to have the decryption key, but the source of these multiplexed packets will need to be a valid SSRC participant to the session and will need to authenticate its packets containing the CSRC data. This solution is much more secure than the one shown in Figure 6-3, but more complex than the one in Figure 6-4.

The point of Figure 6-4, however, is to demonstrate that there is no need to insert mixers in the media bearer paths when the bandwidth is not strictly limited to less than 64 kbps per end system. As networks develop to accommodate more available bandwidth, a new architecture for IP telephony becomes feasible based on end-to-end principles of the Internet architecture. There still remain the issues of network address translators and firewalls, which are considered next.

7. Firewalls and Network Address Translation (NAT)

Most IP telephony signaling protocols, including SIP, establish media streams at dynamically allocated port numbers. Because these ports are not known in advance, telephone calls across firewalls and NAT devices will fail unless additional accommodation is made. Cisco IOS Software and Cisco PIX® security appliances provide the ability to statefully inspect signaling streams and, from their contents, determine the address/port tuples that need “pinholes” opened or NAT table mappings installed. A pinhole is a path through a firewall, through which a flow may pass.

Any technology that relies on the ability to read a data stream will fail when the data payload is encrypted; this is certainly the case with firewalls and secure IP telephony. And even when the signaling is not encrypted, the need of the NAT to rewrite payload-embedded addresses interferes with the application’s ability to provide integrity protection. As a result, the firewall interferes with IP telephony application security, and application security interferes with the ability of the firewall to provide traversal capabilities. In addition to security problems, this forces the firewall vendor to maintain large, complex protocol stacks on the firewall and to keep them updated as new versions of protocols are released.

Several mechanisms have evolved to address the problem with the interactions between IP telephony and firewalls/NATs. The approaches can be broadly categorized as:

      • Firewall/NAT avoidance (tunneling or relaying mechanisms)
      • Application-layer gateways and stateful inspection within firewalls and NATs
      • Communication with firewalls and NATs


7.1. Firewall/NAT Avoidance

The firewall and NAT traversal problem has proven to be so challenging to voice applications that mechanisms have evolved for bypassing firewalls and NATs entirely. Examples of this include the TURN protocol, session border controllers, and proprietary solutions.

A simple relaying technology like TURN [TURN] consists of placing a relaying device outside the firewall or NAT, having a server or end host on the inside establish a tunnel or forwarding path with it, and then having all inbound voice traffic sent to the relay. The relay then re-originates the traffic, forwarding it to its real destination.

More complex relaying technology has been developed and is beginning to attract the attention of standards committees. Session border controllers and similar devices also sit outside the firewall or NAT and relay traffic in, but they become more active participants in the telephony application, actually receiving and sending signaling messages and executing application logic.

Firewall/NAT avoidance is attractive because it allows the application to function in the presence of NATs and firewalls with minimal software modifications, and no modifications to firewall or NAT devices. In the typical case, avoidance mechanisms should operate correctly across any type of firewall or NAT.

At the same time, network administrators need to be aware that relays introduce new security exposures, and these may or may not be tolerable, depending on factors such as the local trust environment or the administrative organization of IT responsibilities. That is, should a device that is outside of the firewall be trusted to manage security associations for devices that are inside the firewall?

Perhaps the biggest problem is that these are technologies for avoiding firewall policy enforcement. They essentially decentralize network access policy enforcement. When they are installed in the local network, responsibility for preventing them from being exploited for unsupported uses (for example, unauthorized peer-to-peer applications or unauthorized internal servers) becomes the responsibility of those who “own” the relay. Additionally, there may be platform security issues, and a relay running on a compromised host can be used for eavesdropping attacks.

7.2. Application-Layer Gateways

Application-layer gateways are software functions embedded directly in the firewall or NAT. They function by inspecting application traffic as it passes, and modifying addresses as needed to support NAT. This presents an obvious challenge for secured voice signaling traffic. Even in cases where the signaling traffic is sent in the clear but is integrity-protected, NAT address payload rewrites will cause the integrity protection mechanism to fail.

One possible approach to solving this problem is to provide keys to the NAT or firewall, allowing it to decrypt the traffic as it traverses and possibly to modify and re-encrypt or recalculate message integrity codes before forwarding. There are some non-trivial issues around sharing keys with third parties (particularly encryption keys), but this may be a reasonable risk in some environments and may be worth the risk in environments that rely heavily on firewall inspection for protocol conformance as protection against application-layer attacks.

7.3. Firewall Control

The third category of approach is to establish direct communication between an application (or application proxy) and a firewall or NAT. This approach has received the most attention in standards committees and is implemented in Microsoft’s UPnP protocol. In firewall control, secured messaging is used to send requests to firewalls and NATs, and to receive responses. These requests will typically be requests for firewall pinholes, requests for NAT table entries (“please install a mapping for this address/port/protocol tuple and return to me the ‘external’ address and port that were assigned”), teardown messages for installed pinholes or NAT table entries, and administrative messages.

Within firewall control are two subcategories, based on their messaging and architectural models: off-path signaling and on-path signaling.

7.3.1. Off-Path Signaling

Off-path signaling uses a client-server model to communicate with firewalls and NATs. In it, secured messages are addressed directly to the firewall. Among its advantages are a simple security model and a simple communications model. Unfortunately, it raises some issues around network topology that may be difficult to resolve in complex or multi-homed networks. For example, in a network in which a telephony signaling message has to traverse several firewalls and NATs, understanding network topology well enough to be able to know where the devices are in relation to one another is extremely difficult; because of that, it is likely that an “incorrect” address will be presented in a firewall pinhole request. Another difficult topology problem is routing—if a network is multi-homed and there are multiple egresses from the network, knowing which egress firewall to send a pinhole request to can be challenging.

The STUN protocol is an example of off-path firewall signaling [RFC 3489]. STUN is a special case of a firewall control protocol. Its use is limited to NAT traversal by UDP streams, and rather than communicating directly with a NAT device, it works by sending a message through the NAT, with the creation of a NAT table mapping as a byproduct.
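For illustration, the fixed header of a STUN Binding Request as defined in RFC 3489 can be built with Python's struct module. The sketch only constructs and decodes the 20-byte header (message type, message length, and a 128-bit transaction ID); it sends nothing, and the function name is hypothetical.

```python
import os
import struct

STUN_BINDING_REQUEST = 0x0001  # RFC 3489 message type for a Binding Request


def build_binding_request(transaction_id: bytes) -> bytes:
    """Build a Binding Request header with no attributes (message length 0)."""
    if len(transaction_id) != 16:
        raise ValueError("RFC 3489 uses a 128-bit transaction ID")
    return struct.pack("!HH16s", STUN_BINDING_REQUEST, 0, transaction_id)


transaction_id = os.urandom(16)
request = build_binding_request(transaction_id)

message_type, message_length, echoed_id = struct.unpack("!HH16s", request)
print(len(request), hex(message_type), message_length)  # 20 0x1 0
```

A client would send this datagram through the NAT to a STUN server and read the mapped address from the response, which is how the NAT table mapping arises as a byproduct.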

7.3.2. On-Path Signaling

Another approach is to use a protocol like Resource Reservation Protocol (RSVP; RFC 2205), which forwards signaling messages along the path of a stream for the purposes of reserving resources for the stream along that path. A host sends firewall and NAT requests toward its application peer. As the request traverses a participating device, it may choose to honor or reject the request, based on local administrative policy. By using this messaging model along with a “soft-state” approach to maintaining the pinholes and other resources requested by the application, robustness can be provided across complex network topologies (including nested NATs and NATs interspersed with firewalls), as well as across routing changes. A protocol based on this model is being developed in the IETF NSIS working group.

8. The SRTP Reference Implementation

Figure 8-1. Two Types of SRTP Implementations

A complete C-language implementation of SRTP is under development, available publicly at, and updated periodically. The reference implementation is suitable for a “bump-in-the-stack” or “bump-in-the-wire” implementation of SRTP, shown in Figure 8-1 (a). These two approaches allow SRTP to be provided to an unchanged RTP implementation.

It is recommended, however, that implementations use an integrated design, shown in Figure 8-1 (b). This approach is favored because of the dependency that SRTP has with RTP parameters such as the SSRC, the sequence number and roll-over counter.
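The dependency on the sequence number and roll-over counter can be seen in the way a receiver estimates the SRTP packet index. The sketch below follows the index-estimation rule given in RFC 3711; it is written in Python for brevity, is not taken from the reference implementation, and uses illustrative names.

```python
def estimate_packet_index(seq: int, highest_seq: int, roc: int) -> int:
    """Estimate the 48-bit packet index from a received 16-bit sequence number.

    highest_seq is the highest sequence number received so far, and roc is the
    receiver's current roll-over counter.
    """
    if highest_seq < 32768:
        if seq - highest_seq > 32768:
            guessed_roc = (roc - 1) % 2 ** 32  # late packet from before the last wrap
        else:
            guessed_roc = roc
    else:
        if highest_seq - 32768 > seq:
            guessed_roc = (roc + 1) % 2 ** 32  # sequence number has wrapped
        else:
            guessed_roc = roc
    return guessed_roc * 65536 + seq


print(estimate_packet_index(100, 90, 3))    # 196708: same roll-over period
print(estimate_packet_index(5, 65530, 0))   # 65541: sequence number wrapped
print(estimate_packet_index(65530, 5, 1))   # 65530: late packet from before the wrap
```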


The authors, Mark Baugher, David McGrew, and Melinda Shore, wish to thank Mario Garakani and Stephen Wolff for their review and comments. Dave Oran has pioneered the SRTP work at Cisco Systems with David McGrew.


[KMGMT] J. Arkko, et al., Key Management Extensions for Session Description Protocol (SDP) and Real Time Streaming Protocol (RTSP), IETF Work in Progress, June 2005.

[MODES] M. Dworkin, Recommendation for Block Cipher Modes of Operation: Methods and Techniques, NIST Special Publication 800-38A, December 2001.

[MVV] A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, Handbook of Applied Cryptography, CRC Press LLC, 1997.

[RFC 2205] Resource Reservation Protocol, IETF, RFC 2205, September 1997.

[RFC 2401] Security Architecture for the Internet Protocol, IETF, RFC 2401, November 1998.

[RFC 2532] Extended Facsimile Using Internet Mail, IETF, RFC 2532, March 1999.

[RFC 2833] RTP Payload for DTMF Digits, Telephony Tones and Telephony Signals, IETF, RFC 2833, May 2000.

[RFC 3264] An Offer/Answer Model with the Session Description Protocol (SDP), IETF, RFC 3264, June 2002.

[RFC 3489] STUN - Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs), IETF, RFC 3489, March 2003.

[RFC 3711] The Secure Real-Time Transport Protocol, IETF, RFC 3711, March 2004.

[SDES] F. Andreasen, M. Baugher, D. Wing, SDP Security Descriptions for Media Streams, IETF Work in Progress, September 2005.

[TA96] U.S. Telecommunications Act of 1996.

[TURN] J. Rosenberg, R. Mahy, C. Huitema, Traversal Using Relay NAT (TURN), IETF Work in Progress, September 2005.


This document is part of the Cisco Security portal. Cisco provides the official information contained on the Cisco Security portal in English only.

This document is provided on an “as is” basis and does not imply any kind of guarantee or warranty, including the warranties of merchantability or fitness for a particular use. Your use of the information in the document or materials linked from the document is at your own risk. Cisco reserves the right to change or update this document without notice at any time.
