Back in the dark ages, I read the papers analyzing the Morris Internet Worm. In order to protect my employers systems, I wanted to do proactive password cracking.
This, however, was painfully slow. A SUN 3/50 could do some 5 password checks a second. This was because the Crypt(3) function does 25 slightly modified DES encryptions using the supplied password as the encryption key.
The number 25 had been chosen to make things slow on the machines available back then. The designers did the mistake of choosing a very slow DES implementation. They should have used an optimized one, doing 1000 iterations.
Optimized Crypt implementations had been made but they were only available in closed security research/hackers circles. So I set out to write my own one. I started with a reference DES implementation that was even slower than the one used in the SunOS crypt function. After having modified it to do UNIX password encryption, I started writing one from scratch using the reference implementation as a debugging help.
I made an initial decision of using 12 bit lookup tables – one that I’ve not seen used in other DES implementations. It cut the number of code lines/instructions in the inner DES loops by 50% at the expense of adding expensive table initialization code. An additional consequence was that not less than 128 kilobytes of lookup tables was needed. This was quite extravagant for that time . I implemented the code on a SUN 3/50 workstation which featured a 15 MHz Motorola 68020 processor and no data cache(s).
After a long time – I only worked on the project now and then – the code worked, running some 50 times faster than the vanilla SunOS4 version. I was soon considering if I should publish the code. The arguments against it were partly ethical – the code could/would be used to hack machines. However, the absence of publicly available crypt implementations was favoring the black hats as they presumably already did have access to optimized crypt implementations. Finally, I took a deep breath, attached LGPL (GNU library license) copyright notices to the code and submitted it to alt.sources. In gave the package the self confident name ‘Ultra Fast Crypt’ (UFC-crypt). The reception was positive – not a single harsh email about publishing hacking potential hacking tools. And no bugs were found – the good thing about crypto code is that a single bit error somewhere in your code makes the output become completely different from what is expected. So if it seems to work, it probably does. If not you have lengthy debugging ahead 🙁
The decision of using 12 bit tables indexes soon backfired when machines with small data caches appeared on the market. Accessing 128k tables at random through a 32k data cache is going to provoke an awful lot of cache misses. Moreover, my simple benchmark program (which encrypted the same string repeatedly) did not show the full scale of the problem. With the large L1 caches of processors nowadays, this is not a problem any more.
Later, I was contacted by Noah Friedman of the Free Software Foundation. He and Richard Stallmann would like UFC-crypt to be used as the crypt(3) function of what became the GNU C library. This would mean donating the code to the FSF. It didn’t take long to decide to accept the proposal. I managed to get a document signed by the CS Dept waiving all copyright/IP claims for the code. So this way UFC-crypt ended up in the GNU C library and eventually in – all (I suppose) Linux machines ranging from robotic cameras to 4096 processor SGI Altix systems.
As a spin-off of the UFC project, I ported the code to the UNI-C Thinking Machines CM200 super computer. The port ended up being able to test 50,000 passwords a second which was very fast at that time.
I also stumbled upon Bihams article A Fast New DES Implementation in Software which describes a totally different way of implementing DES than used by UFC-crypt and most other DES implementations. Given the info in the article, I built a library which should be able to run on any processor with at least 8 bits words. It uses next to no memory and performance scales with the CPU word length. It uses no lookup tables in the DES loops.