On Thu, Jun 7, 2012 at 5:14 PM, Pádraig Brady
<address@hidden> wrote:
On 03/03/2010 01:06 AM, Pádraig Brady wrote:
> On 02/03/10 18:20, Chen Guo wrote:
>> This is exactly what that guy Shaun Jackman was talking about earlier.
>> I'm actually really surprised this is faster, if I can dig up his e-mail I'll
>> forward him this, I remember him saying something about experimenting
>> with exactly this.
>
> I missed that thread but yes he pretty much had the
> same idea as I stumbled on when trying to perturb
> the posix_fadvise() testing by changing the buffer size.
> http://lists.gnu.org/archive/html/bug-coreutils/2010-02/msg00151.html
> Spooky :)
>
> Shaun, you can use `taskset` to set process affinity BTW.
>
>> Can you profile the difference in the number of I/O system calls?
>
> $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort sort.t/sort.1.test > /dev/null
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 70.70 0.283077 21775 13 read
> 28.97 0.115983 19331 6 munmap
> 0.32 0.001268 0 21609 write
> 0.01 0.000054 8 7 open
> 0.00 0.000000 0 9 close
> 0.00 0.000000 0 1 execve
> 0.00 0.000000 0 1 1 access
> 0.00 0.000000 0 3 brk
> 0.00 0.000000 0 1 1 ioctl
> 0.00 0.000000 0 1 uname
> 0.00 0.000000 0 5 mprotect
> 0.00 0.000000 0 25 rt_sigaction
> 0.00 0.000000 0 1 rt_sigprocmask
> 0.00 0.000000 0 4 getrlimit
> 0.00 0.000000 0 16 mmap2
> 0.00 0.000000 0 9 fstat64
> 0.00 0.000000 0 2 1 futex
> 0.00 0.000000 0 1 set_thread_area
> 0.00 0.000000 0 1 set_tid_address
> 0.00 0.000000 0 1 fadvise64_64
> 0.00 0.000000 0 1 set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 0.400382 21717 3 total
> Command being timed: "strace -c ./src/sort sort.t/sort.1.test"
> User time (seconds): 26.91
> System time (seconds): 2.01
> Percent of CPU this job got: 90%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:32.02
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 0
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 3
> Minor (reclaiming a frame) page faults: 181060
> Voluntary context switches: 87362
> Involuntary context switches: 2526
> Swaps: 0
> File system inputs: 173504
> File system outputs: 0
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> $ TMPDIR=/ram LANG=C /usr/bin/time -v strace -c ./src/sort -S1M sort.t/sort.1.test > /dev/null
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 38.95 0.035011 1 60991 read
> 33.47 0.030081 90 334 unlink
> 24.17 0.021721 0 81864 write
> 2.07 0.001862 2 1006 munmap
> 0.75 0.000670 1 673 open
> 0.23 0.000209 0 1016 mmap2
> 0.19 0.000167 0 675 fstat64
> 0.09 0.000085 0 675 close
> 0.07 0.000062 0 334 fcntl64
> 0.02 0.000018 0 1337 rt_sigprocmask
> 0.00 0.000000 0 1 execve
> 0.00 0.000000 0 1 1 access
> 0.00 0.000000 0 3 brk
> 0.00 0.000000 0 1 1 ioctl
> 0.00 0.000000 0 1 gettimeofday
> 0.00 0.000000 0 1 uname
> 0.00 0.000000 0 5 mprotect
> 0.00 0.000000 0 334 _llseek
> 0.00 0.000000 0 25 rt_sigaction
> 0.00 0.000000 0 1 getrlimit
> 0.00 0.000000 0 2 1 futex
> 0.00 0.000000 0 1 set_thread_area
> 0.00 0.000000 0 1 set_tid_address
> 0.00 0.000000 0 335 fadvise64_64
> 0.00 0.000000 0 1 set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00 0.089886 149618 3 total
> Command being timed: "strace -c ./src/sort -S1M sort.t/sort.1.test"
> User time (seconds): 21.76
> System time (seconds): 4.51
> Percent of CPU this job got: 98%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:26.79
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 0
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 3
> Minor (reclaiming a frame) page faults: 23038
> Voluntary context switches: 598317
> Involuntary context switches: 2316
> Swaps: 0
> File system inputs: 173504
> File system outputs: 0
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
Hmm I think the results above are invalid
due to MALLOC_PERTURB_ biasing against large buffers.
In the test below the extra external file process dominates
any gains from CPU cache usage:
$ # Test setup
$ cache_size() { lscpu | grep cache | tail -n1 | tr -s ' ' | cut -d' ' -f3; }
$ export TMPDIR=/dev/shm # RAM
$ export OMP_NUM_THREADS=1 # limit to a single CPU to simplify
$ unset MALLOC_PERTURB_ # negatively impacts performance tests
$ shuf -i 1-10000000 > sort.test # 78MB
$ LANG=C /usr/bin/time -v strace -c sort -S$(cache_size) sort.test > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
66.33 0.034878 185 189 unlink
17.89 0.009408 0 45766 read
8.73 0.004590 0 57711 write
4.53 0.002381 4 571 munmap
1.28 0.000671 2 382 open
0.46 0.000242 1 384 fstat
0.35 0.000183 0 580 mmap
0.19 0.000100 0 384 close
0.14 0.000074 0 757 rt_sigprocmask
0.11 0.000058 0 189 fcntl
0.00 0.000000 0 189 lseek
0.00 0.000000 0 5 mprotect
0.00 0.000000 0 4 brk
0.00 0.000000 0 25 rt_sigaction
0.00 0.000000 0 1 1 ioctl
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 190 fadvise64
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.052585 107333 2 total
Command being timed: "strace -c sort --parallel=1 -T/dev/shm -S3072K sort.test"
User time (seconds): 25.88
System time (seconds): 5.81
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.59
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 13900
Voluntary context switches: 429178
Involuntary context switches: 4882
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ LANG=C /usr/bin/time -v strace -c sort -S200M sort.test > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
80.68 0.348335 565 617 read
11.58 0.049994 3846 13 munmap
5.33 0.022995 7665 3 unlink
2.38 0.010281 0 38521 write
0.01 0.000052 52 1 1 access
0.01 0.000042 8 5 mprotect
0.01 0.000032 1 25 rt_sigaction
0.00 0.000000 0 10 open
0.00 0.000000 0 12 close
0.00 0.000000 0 12 fstat
0.00 0.000000 0 3 lseek
0.00 0.000000 0 22 mmap
0.00 0.000000 0 4 brk
0.00 0.000000 0 13 rt_sigprocmask
0.00 0.000000 0 1 1 ioctl
0.00 0.000000 0 1 execve
0.00 0.000000 0 3 fcntl
0.00 0.000000 0 1 getrlimit
0.00 0.000000 0 1 arch_prctl
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 4 fadvise64
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.431731 39274 2 total
Command being timed: "strace -c sort -S200M sort.test"
User time (seconds): 19.82
System time (seconds): 4.33
Percent of CPU this job got: 94%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:25.57
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 205532
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 355
Minor (reclaiming a frame) page faults: 98057
Voluntary context switches: 159217
Involuntary context switches: 4681
Swaps: 0
File system inputs: 140344
File system outputs: 154168
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
I notice also that performance gets worse for the above input
for the small buffer size when I enable multithreading on my 4 core i3.
User time (seconds): 35.40
System time (seconds): 9.20
Percent of CPU this job got: 120%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:36.98
cheers,
Pádraig.