aes-586.pl
#!/usr/bin/env perl
#
# ====================================================================
# Written by Andy Polyakov <appro@fy.chalmers.se> for the OpenSSL
# project. Rights for redistribution and usage in source and binary
# forms are granted according to the OpenSSL license.
# ====================================================================
#
# Version 3.4.
#
# You might fail to appreciate this module's performance on the first
# try. Compared to the "vanilla" linux-ia32-icc target, i.e. the Intel
# C compiler without -KPIC, considered to be *the* best, performance
# appears to be virtually identical... But try to re-configure with
# shared library support... Aha! The Intel compiler "suddenly" lags
# behind by 30% [on P4, more on others]:-) And compared to
# position-independent code generated by GNU C, this code performs
# *more* than *twice* as fast! Yes, all this buzz about PIC means that
# unlike other hand-coded implementations, this one was explicitly
# designed to be safe to use even in a shared library context... This
# also means that this code isn't necessarily the absolute fastest
# "ever," because in order to achieve position independence an extra
# register has to be off-loaded to the stack, which affects the
# benchmark result.
#
# Special note about instruction choice. Do you recall RC4_INT code
# performing poorly on P4? It might be time to figure out why.
# RC4_INT code implies effective address calculations in base+offset*4
# form. Trouble is that offset scaling seems to have turned out to be
# on the critical path... At least eliminating scaling resulted in a
# 2.8x RC4 performance improvement [as you might recall]. As AES code
# is hungry for scaling too, I [try to] avoid the latter by favoring
# off-by-2 shifts and masking the result with 0xFF<<2 instead of
# "boring" 0xFF.
#
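# To illustrate the two addressing forms (a sketch, not taken from this
# module; it assumes the index byte sits in bits 8..15 of a 32-bit
# word x and that the table has 4-byte entries):
#
#	scaled:		i = (x >> 8) & 0xFF		addr = base + i*4
#	off-by-2:	o = (x >> 6) & (0xFF << 2)	addr = base + o
#
# Both reach the same entry, because (x >> 6) & 0x3FC equals
# ((x >> 8) & 0xFF) << 2; the second form folds the *4 into the shift
# and mask, so no scaled index is needed in the address calculation.
#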
# As was shown by Dean Gaudet <dean@arctic.org>, the above note turned
# out to be void. The performance improvement with off-by-2 shifts was
# observed on an intermediate implementation, which was spilling yet
# another register to the stack... The final offset*4 code below runs
# just a tad faster on P4, but exhibits up to 10% improvement on other
# cores.
#
# The second version is a "monolithic" replacement for aes_core.c,
# which in addition to AES_[de|en]crypt implements
# AES_set_[de|en]cryption_key. This made it possible to implement a
# little-endian variant of the algorithm without modifying the base C
# code. The motivating factor for the undertaken effort was that in the
# tight IA-32 register window the little-endian flavor appeared able to
# achieve slightly higher Instruction Level Parallelism, and it indeed
# resulted in up to 15% better performance on most recent