Speed of Instructions: Nasm, Java, C++ and Python comparison
--D. Thiebaut (talk) 07:02, 21 September 2014 (EDT)
The purpose of this page is to
- get an appreciation for the speed of a processor when it is running simple instructions
- see how Java, C++, and Nasm compare for simple arithmetic code.
- see that the code generated by an optimizing C++ compiler can be quite hard to beat!
- see how incorrect results can be generated without noticing.
- see that Python may be slow, but it knows a thing or two about integers.
Hardware
The computer on which the programs below are run (beowulf2) has the following characteristics (taken from Intel's Web site):
Processor Number: E5-2650
# of Cores: 8
# of Threads: 16
Clock Speed: 2 GHz
Max Turbo Frequency: 2.8 GHz
Intel® Smart Cache: 20 MB
Intel® QPI Speed: 8 GT/s
# of QPI Links: 2
Instruction Set: 64-bit
Instruction Set Extensions: AVX
Embedded Options Available: No
Lithography: 32 nm
Scalability: 2S Only
Max TDP: 95 W
VID Voltage Range: 0.60V-1.35V
Recommended Customer Price: BOX: $1112.00, TRAY: $1107.00
Note that information about your Linux machine can be obtained directly from the command line:
less /proc/cpuinfo
which, on beowulf2 returns:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping        : 7
microcode       : 0x70b
cpu MHz         : 2000.000
cache size      : 20480 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt aes xsave avx hypervisor lahf_lm ida arat xsaveopt pln pts dtherm
bogomips        : 4000.00
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
Pseudo Code
int a = 0;
int b = 2;
int c = 3;
long noIterations = 100000000l;   // the last character is 'ell', not 1. It indicates a long constant.

for ( int i=0; i<noIterations; i++ )
    a += b*2 + c - i;
Java Code
Source
/**
* SpeedTest.java
* @author D. Thiebaut
*/
class SpeedTest {
public static void main( String[] args ) {
int a = 0;
int b = 2;
int c = 3;
long noIterations = 100000000l;
long start = System.currentTimeMillis(), end;
for ( int i=0; i<noIterations; i++ )
a += b*2 + c - i;
end = System.currentTimeMillis();
System.out.println( "a = " + a );
System.out.println( noIterations + " iterations, "
+ (end-start) + " milliseconds ==> "
+ (end-start)*1000000/noIterations + " ns/iteration" );
}
}
Timed Execution
Compiling and running the program, we get:
javac SpeedTest.java
time java SpeedTest

a = -187459712
100000000 iterations, 104 milliseconds ==> 1 ns/iteration

real    0m0.188s
user    0m0.172s
sys     0m0.012s
It takes almost 0.2 seconds to load the Java Virtual Machine and run our program. The program times itself at 104 ms to perform the loop (it does not measure the time taken by the print statements). This works out to approximately 1 ns (one billionth of a second) per iteration.
C++
Source Code
// SpeedTest.c++
// D. Thiebaut
#include <iostream>
#include <time.h>
using namespace std;
int main( void ) {
int a = 0;
int b = 2;
int c = 3;
long noIterations = 100000000l;
clock_t start = clock(), end;
for ( int i=0; i<noIterations; i++ )
a += b*2 + c - i;
end = clock();
cout << "a = " << a << endl;
cout << noIterations << " iterations, "
<< ((float)(end-start))/CLOCKS_PER_SEC*1000
<< " milliseconds "
<< (float)(end-start)/CLOCKS_PER_SEC*1000000000/noIterations
<< " ns/iteration" << endl;
}
Timed Execution
g++ SpeedTest.c++
time ./a.out

a = -187459712
100000000 iterations, 370 milliseconds 3.7 ns/iteration

real    0m0.375s
user    0m0.372s
sys     0m0.000s
It takes almost 0.4 seconds for the C++ program to run, at approximately 3.7 ns per iteration.
Could it be that C++ is slower than Java?
One thing we didn't do is to optimize our C++ program. Let's do that now:
g++ -O2 SpeedTest.c++
time ./a.out

a = -187459712
100000000 iterations, 70 milliseconds 0.7 ns/iteration

real    0m0.083s
user    0m0.076s
sys     0m0.004s
Much better! The code now takes 0.7 ns per iteration, about 30% faster than the Java code.
The G++ Compiler can generate the Assembly Version of the Program
If we add a -S switch on the command line for g++, we get the assembly version of the C++ program:
g++ -S SpeedTest.c++
which results in the file SpeedTest.s appearing in our directory. Its contents are shown below. The arithmetic is performed in the loop body at label .L3 (with the loop test at .L2), and the two calls to the system clock() function are flagged with comments:
.file "SpeedTest.c++"
.local _ZStL8__ioinit
.comm _ZStL8__ioinit,1,1
.section .rodata
.LC0:
.string "a = "
.LC4:
.string " iterations, "
.LC5:
.string " milliseconds "
.LC6:
.string " ns/iteration"
.text
.globl main
.type main, @function
main:
.LFB971:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $64, %rsp
movl $0, -40(%rbp)
movl $2, -32(%rbp)
movl $3, -28(%rbp)
movq $100000000, -24(%rbp)
call clock ;;<--- calling the system clock function!
movq %rax, -16(%rbp)
movl $0, -36(%rbp)
jmp .L2
.L3:
movl -32(%rbp), %eax
leal (%rax,%rax), %edx
movl -28(%rbp), %eax
addl %edx, %eax
subl -36(%rbp), %eax
addl %eax, -40(%rbp)
addl $1, -36(%rbp)
.L2:
movl -36(%rbp), %eax
cltq
cmpq -24(%rbp), %rax
setl %al
testb %al, %al
jne .L3
call clock ;;<--- calling the system clock function again!
movq %rax, -8(%rbp)
movl $.LC0, %esi
movl $_ZSt4cout, %edi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl -40(%rbp), %edx
movl %edx, %esi
movq %rax, %rdi
call _ZNSolsEi
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
call _ZNSolsEPFRSoS_E
movq -16(%rbp), %rax
movq -8(%rbp), %rdx
movq %rdx, %rcx
subq %rax, %rcx
movq %rcx, %rax
cvtsi2ssq %rax, %xmm0
movss .LC1(%rip), %xmm1
divss %xmm1, %xmm0
movss .LC2(%rip), %xmm1
mulss %xmm1, %xmm0
cvtsi2ssq -24(%rbp), %xmm1
movaps %xmm0, %xmm2
divss %xmm1, %xmm2
movss %xmm2, -52(%rbp)
movq -16(%rbp), %rax
movq -8(%rbp), %rdx
movq %rdx, %rcx
subq %rax, %rcx
movq %rcx, %rax
cvtsi2ssq %rax, %xmm0
movss .LC1(%rip), %xmm1
divss %xmm1, %xmm0
movss .LC3(%rip), %xmm1
movaps %xmm0, %xmm2
mulss %xmm1, %xmm2
movss %xmm2, -56(%rbp)
movq -24(%rbp), %rax
movq %rax, %rsi
movl $_ZSt4cout, %edi
call _ZNSolsEl
movl $.LC4, %esi
movq %rax, %rdi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movss -56(%rbp), %xmm0
movq %rax, %rdi
call _ZNSolsEf
movl $.LC5, %esi
movq %rax, %rdi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movss -52(%rbp), %xmm0
movq %rax, %rdi
call _ZNSolsEf
movl $.LC6, %esi
movq %rax, %rdi
call _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
call _ZNSolsEPFRSoS_E
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE971:
.size main, .-main
.type _Z41__static_initialization_and_destruction_0ii, @function
_Z41__static_initialization_and_destruction_0ii:
.LFB982:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
cmpl $1, -4(%rbp)
jne .L5
cmpl $65535, -8(%rbp)
jne .L5
movl $_ZStL8__ioinit, %edi
call _ZNSt8ios_base4InitC1Ev
movl $__dso_handle, %edx
movl $_ZStL8__ioinit, %esi
movl $_ZNSt8ios_base4InitD1Ev, %edi
call __cxa_atexit
.L5:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE982:
.size _Z41__static_initialization_and_destruction_0ii, .-_Z41__static_initialization_and_destruction_0ii
.type _GLOBAL__sub_I_main, @function
_GLOBAL__sub_I_main:
.LFB983:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $65535, %esi
movl $1, %edi
call _Z41__static_initialization_and_destruction_0ii
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE983:
.size _GLOBAL__sub_I_main, .-_GLOBAL__sub_I_main
.section .init_array,"aw"
.align 8
.quad _GLOBAL__sub_I_main
.section .rodata
.align 4
.LC1:
.long 1232348160
.align 4
.LC2:
.long 1315859240
.align 4
.LC3:
.long 1148846080
.hidden __dso_handle
.ident "GCC: (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3"
.section .note.GNU-stack,"",@progbits
Nasm (Non-Optimized Code)
We translate the arithmetic and loop operations into 32-bit assembly for Intel processors.
Source
;;; SpeedTest1.asm
;;; D. Thiebaut
;;;
;;; Computes
;;;
;;; int a=0, b=2, c=3
;;; for ( int i=0; i<noIterations; i++ )
;;; a += b*2 + c - i;
;;;
;;; Requires the 231Lib.asm library to compile.
;;; To assemble, link, and run:
;;;
;;; nasm -f elf SpeedTest1.asm
;;; nasm -f elf 231Lib.asm
;;; ld -melf_i386 -o SpeedTest1 SpeedTest1.o 231Lib.o
;;; time ./SpeedTest1
;;;
extern _printDec
extern _println
section .data
a dd 0
b dd 2
c dd 3
iter dd 100000000
i dd 0
section .text
global _start
_start:
mov dword[i], 0
for: mov eax, dword[b]
add eax, eax
add eax, dword[c]
sub eax, dword[i]
add dword[a], eax
inc dword[i]
mov eax, dword[iter]
cmp dword[i], eax
jl for
endfor:
mov eax, dword[a]
call _printDec
call _println
;;; exit
mov eax, 1
mov ebx, 0
int 0x80
Timed Execution
time ./SpeedTest1

4107507584

real    0m0.338s
user    0m0.336s
sys     0m0.000s
First, the number output is not the same as the number output by the C++ and Java programs. This is because the _printDec function prints the number as an unsigned number, while C++ and Java interpret the same 32-bit pattern as the signed value -187459712. If we add 2^32 (4,294,967,296) to -187459712, we get 4107507584, so the result is in fact correct.
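As a quick check of this claim, here is a minimal C++ sketch (not one of the original programs; the file name is made up) that reinterprets the signed result as an unsigned 32-bit value:

// UnsignedCheck.c++  -- illustrative sketch only
// Reinterprets the 32-bit signed result as an unsigned value,
// which is how _printDec displays the register contents.
#include <cstdint>
#include <iostream>

int main( void ) {
    int32_t  signedResult   = -187459712;                          // value printed by C++ and Java
    uint32_t unsignedResult = static_cast<uint32_t>(signedResult); // same bit pattern, read as unsigned
    std::cout << unsignedResult << std::endl;                      // prints 4107507584
    return 0;
}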
Second, we see that the execution takes about 0.3 seconds, roughly the same as the non-optimized C++ program, but slower than the optimized version.
Can we do better? The next section shows a faster version of the assembly program.
Nasm, Optimized
Source
;;; SpeedTest2.asm
;;; D. Thiebaut
;;;
;;; Computes an optimized version of this code:
;;;
;;; int a=0, b=2, c=3
;;; for ( int i=0; i<noIterations; i++ )
;;; a += b*2 + c - i;
;;;
;;; Requires the 231Lib.asm library to compile.
;;; To assemble, link, and run:
;;;
;;; nasm -f elf SpeedTest2.asm
;;; nasm -f elf 231Lib.asm
;;; ld -melf_i386 -o SpeedTest2 SpeedTest2.o 231Lib.o
;;; time ./SpeedTest2
;;;
extern _printDec
extern _println
section .data
a dd 0
b dd 2
c dd 3
iter dd 100000000
i dd 0
section .text
global _start
_start:
mov edx, 0 ; use edx as i
mov ebx, dword[b] ; use ebx as 2*b + c
add ebx, ebx
add ebx, dword[c]
mov eax, 0 ; use eax as a
mov ecx, dword[iter]; ecx is loop counter
for: add eax, ebx ; a += b*2 + c
sub eax, edx ; a += -i
inc edx ; i++
loop for
endfor:
mov dword[a], eax
call _printDec
call _println
;;; exit
mov eax, 1
mov ebx, 0
int 0x80
Timed Execution
time ./SpeedTest2

4107507584

real    0m0.294s
user    0m0.292s
sys     0m0.000s
Better! But still not as good as the optimized C++ code. It is possible that the C++ optimizer unrolled the loop a few times and shaved additional cycles by simplifying the computation, in the spirit of the sketch below.
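To illustrate the kind of transformation an optimizer can apply, here is a hedged C++ sketch of a manually unrolled version of the loop. This is only an illustration of the idea, not the code g++ actually generates; the file name and the 4x unrolling factor are made up:

// Unrolled.c++  -- illustrative sketch of 4x loop unrolling, not g++'s actual output
#include <iostream>
using namespace std;

int main( void ) {
    int  a = 0;
    int  b = 2, c = 3;
    long noIterations = 100000000l;   // a multiple of 4, so no cleanup loop is needed
    int  k = b*2 + c;                 // loop-invariant part, hoisted out of the loop

    for ( long i = 0; i < noIterations; i += 4 )
        // four iterations folded into one addition:
        // (k-i) + (k-i-1) + (k-i-2) + (k-i-3) = 4*k - 4*i - 6
        a += 4*k - 4*(int)i - 6;      // the int additions wrap around, as in the original

    cout << "a = " << a << endl;      // still prints -187459712
    return 0;
}

The result is identical to the original loop because the wrap-around additions produce the same 32-bit total regardless of how the terms are grouped.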
Moral of the story: It is not easy to do better than optimized C++ code!
Nasm: An Additional Optimization
Prof. Bill Payne, of Columbus State Community College, suggests in an email an additional level of improvement by replacing the loop instruction with
dec ecx
jnz for
Payne reports an improvement in execution time, making the assembly program slightly faster than the C++ example.
Python 3
Source
# SpeedTest.py
# D. Thiebaut
# implements a version of this code:
#
# int a = 0;
# int b = 2;
# int c = 3;
# long noIterations = 100000000l;
# clock_t start = clock(), end;
#
# for ( int i=0; i<noIterations; i++ )
# a += b*2 + c - i;
import datetime as dt
a = 0
b = 2
c = 3
noIterations = 100000000
start = dt.datetime.now()
for i in range( noIterations ):
a += b * 2 + c - i
end = dt.datetime.now()
print( a )
microsecs = (end-start).seconds * 1000000 + (end-start).microseconds
print()
print( noIterations, " iterations, ",
microsecs/1000, " milliseconds ",
microsecs * 1000/noIterations,
" ns/iteration", sep="" )
Timed Execution
-4999999250000000

100000000 iterations, 24851.039 milliseconds 248.51039 ns/iteration

real    0m24.886s
user    0m24.852s
sys     0m0.000s
Two observations:
- Optimized C++ is about 300 times faster than Python (0.7 ns versus roughly 248.5 ns per iteration)!
- The number output by Python is different from the one output by Java and C++.
What we discover is that the number output by Python is actually the correct one. C++ and Java output the wrong number because we had not chosen a large enough type to store our result: -4,999,999,250,000,000 does not fit in a 32-bit int, so the repeated additions silently wrapped around. We should have picked long instead.
Other moral of the story: Python may be slow, but part of that slowness pays for run-time checking (its integers grow to arbitrary precision as needed), which prevents us from making blatant mistakes!
Final Comments
C++: ints vs longs
- Using the right size variable is important, not only for correctness of operation, but also for speed.
- If we switch the ints of the C++ program to longs (a sketch of the modified source appears after this list) and run the program again, we get:
g++ -O2 SpeedTest2.c++
time ./a.out

a = -4999999250000000
100000000 iterations, 0 milliseconds 0 ns/iteration

real    0m0.003s
user    0m0.000s
sys     0m0.000s
- Note that the value of a that is output has changed; it is now the same as the value computed by the Python program.
- The execution is now much faster. We are operating with 64-bit longs, which is the natural word size of the system we are using (Linux, Intel Xeon(R) CPU E5-2650, 2 GHz, 8 cores), and the 0-millisecond timing suggests the optimizer was also able to simplify the loop dramatically.
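The modified source is not listed on this page; the following is a minimal sketch of what the long-based version (called SpeedTest2.c++ in the session above) presumably looks like, with every int replaced by a long:

// SpeedTest2.c++  -- sketch of the long-based variant (the original listing is not shown on this page)
// Same loop as SpeedTest.c++, but with 64-bit longs so the sum no longer overflows.
#include <iostream>
#include <time.h>
using namespace std;

int main( void ) {
    long a = 0;
    long b = 2;
    long c = 3;
    long noIterations = 100000000l;
    clock_t start = clock(), end;

    for ( long i=0; i<noIterations; i++ )
        a += b*2 + c - i;

    end = clock();
    cout << "a = " << a << endl;      // now prints -4999999250000000
    cout << noIterations << " iterations, "
         << ((float)(end-start))/CLOCKS_PER_SEC*1000
         << " milliseconds "
         << (float)(end-start)/CLOCKS_PER_SEC*1000000000/noIterations
         << " ns/iteration" << endl;
    return 0;
}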
Java: ints vs longs
- We observe the same difference when we switch all the ints for longs in the java program:
javac SpeedTest.java
time java SpeedTest

a = -4999999250000000
100000000 iterations, 89 milliseconds ==> 0 ns/iteration

real    0m0.172s
user    0m0.156s
sys     0m0.012s
- The time decrease is noticeable, but not as dramatic as for the C++ program. Java always treats an int as a 32-bit value and a long as a 64-bit value (for platform independence), so switching the variables to long changes the arithmetic from 32 to 64 bits. However, a large part of the measured time (loading and running the virtual machine) is unaffected by this change, and hence the improvement is smaller for Java.
Summary
The timings measured above, expressed per iteration of the loop:

| Version | Approximate time per iteration |
|---|---|
| Python 3 | 248.5 ns |
| C++, no optimization (int) | 3.7 ns |
| Nasm, memory-based loop | 3.4 ns |
| Nasm, register-based loop | 2.9 ns |
| Java (int) | 1 ns |
| C++, -O2 (int) | 0.7 ns |
| C++, -O2 (long) | ~0 ns (the optimizer appears to remove most of the work) |
Additional Information
Visit this page for additional information about the speed of a Java program solving the N-Queens problem on various hardware.