Speed of Instructions: Nasm, Java, C++ and Python comparison


--D. Thiebaut (talk) 07:02, 21 September 2014 (EDT)



The purpose of this page is to

  1. get an appreciation for the speed of a processor when it is running simple instructions
  2. see how Java, C++, and Nasm compare for simple arithmetic code.
  3. see how optimized C++ generates code, the speed of which can be quite hard to beat!
  4. see how incorrect results can be generated without noticing.
  5. see that Python may be slow, but it knows a thing or two about integers.




Hardware


The computer on which the programs below are run (beowulf2) has the following characteristics (taken from Intel's Web site):

Processor Number:	E5-2650
# of Cores:	8
# of Threads:	16
Clock Speed:	2 GHz
Max Turbo Frequency:	2.8 GHz
Intel® Smart Cache:	20 MB
Intel® QPI Speed:	8 GT/s
# of QPI Links:	2
Instruction Set:	64-bit
Instruction Set Extensions:	AVX
Embedded Options Available: 	No
Lithography:	32 nm
Scalability:	2S Only
Max TDP:	95 W
VID Voltage Range:	0.60V-1.35V
Recommended Customer Price:	BOX : $1112.00
TRAY: $1107.00


Note that information about your Linux machine can be obtained directly from the command line:

 less /proc/cpuinfo

which, on beowulf2, returns:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping	: 7
microcode	: 0x70b
cpu MHz		: 2000.000
cache size	: 20480 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 13 
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt aes xsave avx hypervisor lahf_lm ida arat xsaveopt pln pts dtherm
bogomips	: 4000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

Pseudo Code


 int a = 0;
 int b = 2;
 int c = 3;
 long  noIterations = 100000000l;  // the last character is 'ell', not 1.  It indicates a long constant.
 
 for ( int i=0; i<noIterations; i++ )
      a += b*2 + c - i;


Java Code


Source


/**
 * SpeedTest.java
 * @author D. Thiebaut
 */
class SpeedTest {

    public static void main( String[] args ) {
        int a = 0;
        int b = 2;
        int c = 3;
        long noIterations = 100000000l;
        long start = System.currentTimeMillis(), end;

        for ( int i=0; i<noIterations; i++ )
            a += b*2 + c - i;
        
        end = System.currentTimeMillis();
        System.out.println( "a = " + a );
        System.out.println( noIterations + " iterations, "
                            + (end-start) + " milliseconds ==> " 
                            + (end-start)*1000000/noIterations + " ns/iteration" );
    }
}


Timed Execution


Compiling and running the program, we get:

javac SpeedTest.java
time java SpeedTest
a = -187459712
100000000 iterations, 104 milliseconds ==> 1 ns/iteration

real	0m0.188s
user	0m0.172s
sys	0m0.012s

It takes almost 0.2 seconds to load the Java Virtual Machine and run our program. The program times itself at 104 ms for the loop alone (it does not measure the time taken by the print statements), which works out to approximately 1 ns (one billionth of a second) per iteration.


C++


Source Code


// SpeedTest.c++
// D. Thiebaut
#include <iostream>
#include <time.h>
using namespace std;

int main( void ) {
    int a = 0;
    int b = 2;
    int c = 3;
    long noIterations = 100000000l;
    clock_t start = clock(), end;

    for ( int i=0; i<noIterations; i++ )
        a += b*2 + c - i;

    end = clock();

    cout << "a = " << a << endl;

    cout <<  noIterations <<  " iterations, "
         << ((float)(end-start))/CLOCKS_PER_SEC*1000
         <<  " milliseconds "
         <<  (float)(end-start)/CLOCKS_PER_SEC*1000000000/noIterations 
         << " ns/iteration" << endl;
   
}


Timed Execution


g++ SpeedTest.c++ 
time ./a.out
a = -187459712
100000000 iterations, 370 milliseconds 3.7 ns/iteration

real	0m0.375s
user	0m0.372s
sys	0m0.000s


It takes almost 0.4 seconds for the C++ program to run, at approximately 3.7 ns per iteration.

Could it be that C++ is slower than Java?

One thing we didn't do is to optimize our C++ program. Let's do that now:

g++ -O2 SpeedTest.c++
time ./a.out
a = -187459712
100000000 iterations, 70 milliseconds 0.7 ns/iteration

real	0m0.083s
user	0m0.076s
sys	0m0.004s

Much better! The code now takes 0.7 ns per iteration, about 30% faster than the Java code.


The G++ Compiler can generate the Assembly Version of the Program


If we add a -S switch on the command line for g++, we get the assembly version of the C++ program:

 g++ -S SpeedTest.c++ 

which results in SpeedTest.s appearing in our directory. Its contents are shown below. The part of the code that does the loop arithmetic is the block at label .L3, and the two calls to the system clock() function are flagged with comments:

	.file	"SpeedTest.c++"
	.local	_ZStL8__ioinit
	.comm	_ZStL8__ioinit,1,1
	.section	.rodata
.LC0:
	.string	"a = "
.LC4:
	.string	" iterations, "
.LC5:
	.string	" milliseconds "
.LC6:
	.string	" ns/iteration"
	.text
	.globl	main
	.type	main, @function
main:
.LFB971:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$64, %rsp
	movl	$0, -40(%rbp)
	movl	$2, -32(%rbp)
	movl	$3, -28(%rbp)
	movq	$100000000, -24(%rbp)
	call	clock                                ;;<---  calling the system clock function!
	movq	%rax, -16(%rbp)
	movl	$0, -36(%rbp)
	jmp	.L2
.L3:
	movl	-32(%rbp), %eax
	leal	(%rax,%rax), %edx
	movl	-28(%rbp), %eax
	addl	%edx, %eax
	subl	-36(%rbp), %eax
	addl	%eax, -40(%rbp)
	addl	$1, -36(%rbp)
.L2:
	movl	-36(%rbp), %eax
	cltq
	cmpq	-24(%rbp), %rax
	setl	%al
	testb	%al, %al
	jne	.L3
	call	clock                                ;;<---  calling the system clock function again!
	movq	%rax, -8(%rbp)
	movl	$.LC0, %esi
	movl	$_ZSt4cout, %edi
	call	_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
	movl	-40(%rbp), %edx
	movl	%edx, %esi
	movq	%rax, %rdi
	call	_ZNSolsEi
	movl	$_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
	movq	%rax, %rdi
	call	_ZNSolsEPFRSoS_E
	movq	-16(%rbp), %rax
	movq	-8(%rbp), %rdx
	movq	%rdx, %rcx
	subq	%rax, %rcx
	movq	%rcx, %rax
	cvtsi2ssq	%rax, %xmm0
	movss	.LC1(%rip), %xmm1
	divss	%xmm1, %xmm0
	movss	.LC2(%rip), %xmm1
	mulss	%xmm1, %xmm0
	cvtsi2ssq	-24(%rbp), %xmm1
	movaps	%xmm0, %xmm2
	divss	%xmm1, %xmm2
	movss	%xmm2, -52(%rbp)
	movq	-16(%rbp), %rax
	movq	-8(%rbp), %rdx
	movq	%rdx, %rcx
	subq	%rax, %rcx
	movq	%rcx, %rax
	cvtsi2ssq	%rax, %xmm0
	movss	.LC1(%rip), %xmm1
	divss	%xmm1, %xmm0
	movss	.LC3(%rip), %xmm1
	movaps	%xmm0, %xmm2
	mulss	%xmm1, %xmm2
	movss	%xmm2, -56(%rbp)
	movq	-24(%rbp), %rax
	movq	%rax, %rsi
	movl	$_ZSt4cout, %edi
	call	_ZNSolsEl
	movl	$.LC4, %esi
	movq	%rax, %rdi
	call	_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
	movss	-56(%rbp), %xmm0
	movq	%rax, %rdi
	call	_ZNSolsEf
	movl	$.LC5, %esi
	movq	%rax, %rdi
	call	_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
	movss	-52(%rbp), %xmm0
	movq	%rax, %rdi
	call	_ZNSolsEf
	movl	$.LC6, %esi
	movq	%rax, %rdi
	call	_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
	movl	$_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
	movq	%rax, %rdi
	call	_ZNSolsEPFRSoS_E
	movl	$0, %eax
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE971:
	.size	main, .-main
	.type	_Z41__static_initialization_and_destruction_0ii, @function
_Z41__static_initialization_and_destruction_0ii:
.LFB982:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$16, %rsp
	movl	%edi, -4(%rbp)
	movl	%esi, -8(%rbp)
	cmpl	$1, -4(%rbp)
	jne	.L5
	cmpl	$65535, -8(%rbp)
	jne	.L5
	movl	$_ZStL8__ioinit, %edi
	call	_ZNSt8ios_base4InitC1Ev
	movl	$__dso_handle, %edx
	movl	$_ZStL8__ioinit, %esi
	movl	$_ZNSt8ios_base4InitD1Ev, %edi
	call	__cxa_atexit
.L5:
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE982:
	.size	_Z41__static_initialization_and_destruction_0ii, .-_Z41__static_initialization_and_destruction_0ii
	.type	_GLOBAL__sub_I_main, @function
_GLOBAL__sub_I_main:
.LFB983:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movl	$65535, %esi
	movl	$1, %edi
	call	_Z41__static_initialization_and_destruction_0ii
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE983:
	.size	_GLOBAL__sub_I_main, .-_GLOBAL__sub_I_main
	.section	.init_array,"aw"
	.align 8
	.quad	_GLOBAL__sub_I_main
	.section	.rodata
	.align 4
.LC1:
	.long	1232348160
	.align 4
.LC2:
	.long	1315859240
	.align 4
.LC3:
	.long	1148846080
	.hidden	__dso_handle
	.ident	"GCC: (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3"
	.section	.note.GNU-stack,"",@progbits
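
Although not shown here, the -S switch can be combined with -O2 to see what the optimizer actually generates; this is one way to check the loop transformations discussed in the Nasm sections below:

 g++ -O2 -S SpeedTest.c++

The resulting SpeedTest.s then contains the optimized assembly rather than the straightforward translation listed above.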


Nasm (Non-Optimized Code)


We translate the arithmetic and loop operations into 32-bit assembly for Intel processors.

Source


;;; SpeedTest1.asm
;;; D. Thiebaut
;;;
;;; Computes 
;;;
;;;   int a=0, b=2, c=3
;;;   for ( int i=0; i<noIterations; i++ )
;;;            a += b*2 + c - i;
;;;
;;; Requires the 231Lib.asm library to compile.  
;;; To assemble, link, and run:
;;;
;;; nasm -f elf SpeedTest1.asm
;;; nasm -f elf 231Lib.asm 
;;; ld -melf_i386 -o SpeedTest1 SpeedTest1.o 231Lib.o
;;; time ./SpeedTest1
;;;
extern  _printDec
extern  _println
        
        section .data
        
a       dd      0
b       dd      2
c       dd      3
iter    dd      100000000
i       dd      0

        section .text
        global  _start
_start: 

        mov     dword[i], 0

for:    mov     eax, dword[b]
        add     eax, eax
        add     eax, dword[c]
        sub     eax, dword[i]
        add     dword[a], eax

        inc     dword[i]
        mov     eax, dword[iter]
        cmp     dword[i], eax
        jl      for
endfor:
        mov     eax, dword[a]
        call    _printDec
        call    _println

;;; exit
        mov     eax, 1
        mov     ebx, 0
        int     0x80


Timed Execution


time ./SpeedTest1
4107507584
real	0m0.338s
user	0m0.336s
sys	0m0.000s

First, the number output is not the same as the number output by the C++ and Java programs. This is because the _printDec function prints the value as an unsigned number. The C++ and Java programs output -187459712; if we add 2^32 = 4294967296 to -187459712, we get 4107507584. So the number is correct: it is the same 32-bit result, simply displayed as unsigned.

Second, we see that the execution takes about 0.3 seconds: roughly the same as the non-optimized C++ program, but slower than the optimized version. This is not surprising, since this first version keeps a, b, c, and i in memory and accesses them on every pass through the loop, much like the unoptimized C++ code.

Can we do better? The next section shows a faster version of the assembly program that keeps everything in registers.


Nasm, Optimized


Source


;;; SpeedTest2.asm
;;; D. Thiebaut
;;;
;;; Computes an optimized version of this code:
;;;
;;;   int a=0, b=2, c=3
;;;   for ( int i=0; i<noIterations; i++ )
;;;            a += b*2 + c - i;
;;;
;;; Requires the 231Lib.asm library to compile.  
;;; To assemble, link, and run:
;;;
;;; nasm -f elf SpeedTest2.asm
;;; nasm -f elf 231Lib.asm 
;;; ld -melf_i386 -o SpeedTest2 SpeedTest2.o 231Lib.o
;;; time ./SpeedTest2
;;;
extern  _printDec
extern  _println

        
        section .data
a       dd      0
b       dd      2
c       dd      3
iter    dd      100000000
i       dd      0

        section .text
        global  _start
_start: 

        mov     edx, 0          ; use edx as i
        mov     ebx, dword[b]   ; use ebx as 2*b + c
        add     ebx, ebx
        add     ebx, dword[c]
        mov     eax, 0          ; use eax as a
        mov     ecx, dword[iter]; ecx is loop counter
        
for:    add     eax, ebx        ; a += b*2 + c
        sub     eax, edx        ; a += -i
        inc     edx             ; i++
        loop    for
        
endfor: 
        mov     dword[a], eax
        call    _printDec
        call    _println
        
;;; exit
        mov     eax, 1
        mov     ebx, 0
        int     0x80


Timed Execution


time ./SpeedTest2
4107507584

real	0m0.294s
user	0m0.292s
sys	0m0.000s

Better! But not as good as the optimized C++ code. It is possible that the C++ optimizer unrolled the loop a few times and shaved additional cycles by simplifying the computation.
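
To picture what unrolling means, here is a rough sketch written back in C++ source form. This is only an illustration of the idea, and not the code that g++ -O2 actually emits:

// unrolledSketch.c++ -- hypothetical, hand-written illustration of 4-way loop unrolling.
// This is NOT the code produced by g++ -O2; it only shows the general idea.
#include <iostream>
using namespace std;

int main() {
    int  a = 0, b = 2, c = 3;
    long noIterations = 100000000l;           // conveniently a multiple of 4
    int  k = b*2 + c;                         // loop-invariant value, hoisted out of the loop

    for ( long i = 0; i < noIterations; i += 4 )
        a += 4*k - (int)( i + (i+1) + (i+2) + (i+3) );   // four iterations folded into one

    cout << "a = " << a << endl;              // same wrapped result as before: -187459712
    return 0;
}

Each pass now does the work of four iterations, so the loop-control overhead (increment, compare, branch) is paid four times less often.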

Moral of the story: It is not easy to do better than optimized C++ code!



Nasm: An Additional Optimization


Prof. Bill Payne, of Columbus State Community College, suggests in an email a further improvement: replacing the loop instruction with

           dec   ecx
           jnz   for


Payne reports an improvement in execution time, making the assembly program slightly faster than the C++ example. This makes sense: on modern Intel processors the loop instruction is comparatively slow, while the dec/jnz pair is handled as simple, fast operations.

Python 3


Source


# SpeedTest.py
# D. Thiebaut
# implements a version of this code:
#
#    int a = 0;                                                                          
#    int b = 2;                                                                          
#    int c = 3;                                                                          
#    long noIterations = 100000000l;                                                     
#    clock_t start = clock(), end;                                                       
#                                                                                        
#    for ( int i=0; i<noIterations; i++ )                                                
#        a += b*2 + c - i;                                                               

import datetime as dt
a = 0
b = 2
c = 3
noIterations = 100000000
start = dt.datetime.now()

for i in range( noIterations ):
    a += b * 2 + c - i
end = dt.datetime.now()

print( a )

microsecs =  (end-start).seconds * 1000000 + (end-start).microseconds
print()
print(  noIterations, " iterations, ",
         microsecs/1000, " milliseconds ",
         microsecs * 1000/noIterations,
         " ns/iteration", sep="" )


Timed Execution


-4999999250000000

100000000 iterations, 24851.039 milliseconds 248.51039 ns/iteration

real	0m24.886s
user	0m24.852s
sys	0m0.000s


Two observations:

  1. Optimized C++ is roughly 350 times faster than Python (about 0.7 ns vs. 248.5 ns per iteration)!
  2. The number output by Python is different from the one output by Java and C++.


What we discover is that the number output by Python is actually the correct one. C++ and Java output the wrong number because we had not chosen a large enough integer type to store our values; we should have picked long instead.
Other moral of the story: Python may be slow, but part of that slowness pays for work done behind the scenes: its integers grow to arbitrary precision as needed, so they never silently overflow, and that prevents us from making blatant mistakes!
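
As a quick check (a sketch added here, not one of the original programs), notice that the loop has a closed form: after n iterations the result should be n*(b*2 + c) - n*(n-1)/2. For n = 100000000, b = 2, and c = 3 this gives 700000000 - 4999999950000000 = -4999999250000000, which does not fit in a 32-bit int. The short C++ program below, assuming a 64-bit long, evaluates the formula and shows that truncating the true value to 32 bits reproduces exactly the numbers printed earlier:

// checkSum.c++ -- sketch verifying the closed-form sum and the 32-bit wrap-around.
// Assumes long is 64 bits wide, as it is on the 64-bit Linux system used above.
#include <iostream>
#include <cstdint>
using namespace std;

int main() {
    long n = 100000000l, b = 2, c = 3;
    long trueSum = n*(b*2 + c) - n*(n-1)/2;   // sum of (b*2 + c - i) for i = 0 .. n-1
    cout << trueSum << endl;                  // -4999999250000000, Python's (correct) answer
    cout << (int32_t)trueSum << endl;         // -187459712, the overflowed int result
    cout << (uint32_t)trueSum << endl;        // 4107507584, what _printDec displays
    return 0;
}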

Final Comments


C++: ints vs longs


  • Using a variable of the right size is important, not only for correctness of operation, but also for speed.
  • If we switch the ints of the C++ program to longs (a minimal sketch of this variant appears at the end of this subsection) and run the program again, we get:


g++ -O2 SpeedTest2.c++
time ./a.out
a = -4999999250000000
100000000 iterations, 0 milliseconds 0 ns/iteration

real	0m0.003s
user	0m0.000s
sys	0m0.000s


  • Note that the value of a output has changed. It is the same as the Python program.
  • The execution is now much faster because we are operating with 64-bit longs, which is the natural word length of the system we are using (Linux, Intel Xeon(R) CPU E5-2650 @ 2.00GHz, 8 cores).
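
For reference, here is a minimal sketch of the long-based variant (an assumption: the actual SpeedTest2.c++ timed above is not listed on this page, and the timing statements are omitted here):

// SpeedTest2 sketch -- int variables replaced by long; timing and printing as in SpeedTest.c++.
#include <iostream>
using namespace std;

int main() {
    long a = 0, b = 2, c = 3;
    long noIterations = 100000000l;

    for ( long i = 0; i < noIterations; i++ )
        a += b*2 + c - i;

    cout << "a = " << a << endl;     // -4999999250000000, matching the Python result
    return 0;
}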


Java: ints vs Longs


  • We observe the same difference when we switch all the ints for longs in the java program:


javac SpeedTest.java 
time java SpeedTest 
a = -4999999250000000
100000000 iterations, 89 milliseconds ==> 0 ns/iteration

real	0m0.172s
user	0m0.156s
sys	0m0.012s


  • The time decrease is noticeable, but not as dramatic as for the C++ program. Java always treats an int as a 32-bit value and a long as a 64-bit value (for platform independence), so switching the variables to long does move the arithmetic to the processor's natural 64-bit word size. However, a large part of the measured time (loading and running the virtual machine) is unaffected by this change, and hence the improvement is smaller for Java.



Summary


  • Python is like the Google self-driving car: slow, but it does a lot of work by itself.
  • C++ and Java are like a Maserati: fast, but you have to tightly control what it does!


Additional Information


Visit this page for additional information about the speed of a Java program solving the N-Queens problem on various hardware.