Class 18: Public Key Codes

The history of cryptography is fascinating and exciting, and has been the subject of excellent books. The Codebreakers by David Kahn is a classic; and a new one, The Code Book by Simon Singh was just reviewed in the NY Times Book Review (Nov. 7, by Robert Osserman). The story of codebreaking in World War II is especially riveting. The breaking of the Nazi "enigma" codes was the subject of a Nova documentary just this past weekend (there is lots of related web material that you can find from www.pbs.org).

History -- and the puzzle pages of magazines -- tells us that simple substitution ciphers are easy to break. That is, if you just consistently replace 'A' by 'J', 'B' by 'Z', 'C' by 'B', and so on through the alphabet, the result, although superficially mystifying, can be immediately broken with simple tricks: the most common letter in English is 'E', so just look for the most common letter in the message -- it probably represents 'E', etc. Recurring patterns of double letters will probably be common ones like 'TH', 'RE', etc., and it is typically easy to spot the vowels. The more sophisticated 'enigma' codes used a machine which continually changed the rules throughout every message. Without an identical machine to mimic the secret rule changes, the message was apparently unreadable. Truly unbreakable codes simply rely on long code books, giving random substitutions, which are used once and discarded letter by letter. Without a copy of the code book, there is no way to read the message. That, however, is just the problem! Code books can be stolen and copied! This has happened innumerable times in the history of espionage (in the early part of World War II, for example, the American diplomatic code was being read because a code book had been surreptitiously copied by a safecracker who got access to the American embassy in Rome). The 'enigma' codes could be read because the operators had to agree each day on the settings for the machine, and this message, which came early in the morning, was always intercepted and codebreakers learned to interpret it.

That is, the weak link in most sophisticated encryption systems has been not the encryption itself, which may be essentially unbreakable without other information, but the distribution of the system to its users. The physical existence of code books, which can be stolen, or the necessity of agreeing on key settings by users separated from each other -- the distribution of the code to its users gives codebreakers an entry.

Thus it was a true revolution in encryption when in 1977 a "public key" algorithm was announced. This algorithm provides codes which can be described in public, in complete detail, and yet cannot be broken, apparently. This revolution completely solves the problem of distribution: you just distribute the code publicly. Everyone knows what it is. But then how can it be a secret?? That is the amazing thing.

Public key codes work like this. You set up a code for anyone wanting to send you a secure message, and you publish it. Your code consists of two large numbers, n and e, each approximately 100 digits long, publicly available to anyone. When someone wants to use the code, she writes her message in the form of numbers (perhaps using 'A'=01, 'B'=02, 'C'=03, ..., 'Z'=26), and breaks it into sequences of about 100 digits each, so that it looks like a sequence of very large numbers N1, N2, N3, ...She then raises N1 to the e power, mod n, and sends you the result. (This is the reason for the symbol e: it is the encryption exponent.) Similarly she sends N2 raised to the e power mod n, N3^e (mod n), etc. That is all. Since doing arithmetic mod n can be done very rapidly, even for large n, the code works very fast, and it can be decoded just as fast by someone who knows the secret behind n and e. But what is the secret?

Here is what is going on behind the scenes: The number n is chosen to be the product of two large primes p and q, which will be each about 50 digits long. To find them, you can just test random 50-digit numbers with Fermat's Little Theorem until you find probable primes. As we have seen, this is a fast test. Now the encoded message comes in: let us say it is M1, M2, M3, ... (the result of N1^e (mod n), etc.) To decode it you use a decoding exponent d. That is, you compute M1^d (mod n), M2^d (mod n), etc. The results will be the original message N1, N2, N3, ...! How is d chosen? Well, it must be that (N1^e)^d = N1^(ed)=N1 (mod n). But by Fermat's Little Theorem, we know N1^r=1 (mod n) where r=phi(n). So we just require ed=jr+1=1 (mod r), i.e., the (secret!) decoding exponent d is the inverse of e in mod r arithmetic, where, just to repeat, r=phi(n). You can find inverses mod r by the Euclidean algorithm, as we noted once, and that algorithm is very fast, so it easy to find d. Finally: why can't everyone find d? The answer is that they don't know r=phi(n). But YOU know it: since n=pq, with p and q prime, r=phi(n)=phi(p)*phi(q)=(p-1)(q-1). This is another large number, roughly 100 digits, but without knowing p and q no one can find it. To find p and q from n (which is publicly known, remember), they would have to factor n, and this is the point: no one knows a fast way to factor a large number into two large prime factors!

Here is an example, using much smaller numbers. Take p and q to be 71 and 89. Then n=6319. We can choose e=5143 (quite arbitrarily). Note that phi(6319)=70*88=6160. The inverse of e is d=4567 in Z6160*. So now we are all set. We publish the numbers n=6319 and e=5143. Someone wants to send us the message "I LOVE YOU," but in code, of course. I'll represent the letters by A=01, B=02, etc., and the space by the number 40. Then the message, broken up into 4-digit chunks, is 0940 1215 2205 4025 1521. Now encode it: (0940)^5143=3493 (mod 6319). (1215)^5143=1643 (mod 6319), etc. The complete message is 3493 1643 1673 2332 5646. To decode it, use the decoding exponent: 3493^4567=940 (mod 6319). We have to furnish the leading zero here: 940=0940 for the purpose of reading letters. The message starts 'I '. (All calculations were done with ModPCalc, available in the Math114 folder on the computers in 402 Clapp).

If someone wanted to break this code, they could do it by factoring n. Once they find p and q, they could find r=(p-1)(q-1) and then find the inverse of the encryption exponent 5143 in Zr*. This is the decryption exponent d, which would then be no longer secret. It is not hard to factor 6319 to find 71*89, and then do the rest of what we did in the paragraph above. But imagine if n were 100 digits long! A search through the possible 50 digit numbers for factors would take how long? Suppose you could test factors at the clock rate of typical computers, roughly 100 MHz: you would be testing 10^8 numbers per second. But you have 10^50 numbers to check. That would take 10^42 seconds. Unfortunately a year is only about 10^7 seconds long. So it would take about 10^35 years, i.e., 10000000000000000000000000000000000 years. Not practical!

Assignment:

Write your full name, in exactly the way you usually identify yourself. Count the number of letters and reduce mod 5 to get either 0, 1, 2, 3, or 4. (For example, if your name has 10 letters, you get 0 when you reduce mod 5.) Then decode the corresponding message below. (If your name has 10 letters, you would decode message 0.) NOTE THAT YOU ONLY NEED TO DECODE ONE MESSAGE, NOT ALL OF THEM!

These messages were intercepted on their way to users of public key codes. The intended recipients had published two numbers n and e, the modulus and the encryption exponent. Unfortunately, they chose ridiculously small numbers! Their codes can be broken if you can factor n, which is easy to do for such small numbers. There is even a small program, called 'primfact,' in the Math114 folder on the computers in Clapp 402, which will do it for you. You can do the modular arithmetic using ModPCalc, also in that folder. The letters in the messages were turned into numbers using A=01, B=02, ..., Z=26. (Note the leading zeros!) The space character was encoded as 40.

(0) n=9563, e=281

0448 7912 1240 4242 6472 6136 6404 1153 1588 0707 4711 9254 3481 6399 0583 2903 2237 7172 8408

(1) n=9379, e=1053

6368 2310 3334 9127 4872 4325 7995 9127 8983 9342 6876 6973 1059 4827 6111 5101 4301 4586 7867 8839 5895 6876

(2) n=9991, e=3173

3521 1019 9758 9950 0062 5390 8896 5308 3521 0158 7654 7721 3949 0379 9084 5471 4443 8896 6891 7721

(3) n=9853, e=2147

8471 6281 7792 3353 9436 6147 3642 8815 9256 2477 0691 8628 5365 6281 9797 7087

(4) n=9523, e=3241

0910 3730 7053 3730 3005 6869 8144 0770 6389 3649 5067 4937 3571 6389 1966 9123 8155 5804 7379 5067 2445 4095