On Sun, 26 Jan 2025 09:46:34 +0200, Eli Zaretskii wrote:
> > From: "Yue Yi"
> > Cc: "emacs-devel" ,
> > "Cecilio Pardo"
> > Date: Sun, 26 Jan 2025 12:30:01 +0800
> >
> > > > /* MSVC runtime library has limit of 64 descriptors by default */
> > > > -#define FD_SETSIZE 64
> > > > +#define FD_SETSIZE (MAXIMUM_WAIT_OBJECTS * MAXIMUM_WAIT_OBJECTS / 2)
> > >
> > > What is the reason for dividing by 2? A comment should explain that.
> >
> > As we know, WaitForMultipleObjects supports a maximum of 64 objects. In
> > my implementation, I plan to use an array of length 64 to hold
> > objects. Some of these objects might be thread handles waiting on
> > multiple other objects. Thus, an array of length 64 can represent up to
> > 64 * 64 = 4096 objects.
> >
> > Based on the implementation of Emacs's inter-process communication in
> > Windows, we need to wait for both the subprocess itself and a
> > pipe. Therefore, this implementation can wait for a maximum of
> > 4096/2=2048 subprocesses. However, to remain consistent with the number
> > of waits supported by `select', I have limited the maximum FD_SIZE to
> > 2048. As a result, the maximum number of subprocesses is reduced to
> > 1024. 2048 comes from (MAXIMUM_WAIT_OBJECTS * MAXIMUM_WAIT_OBJECTS / 2)
>
> I'm not sure I understand the "to remain consistent with the number of
> waits supported by `select'" part. Do you mean 'select' as we use it
> in process.c? Or do you mean something else? What is that
> limitation?
>
> Of course, all of the above should be in the comments to explain the
> code.
Thank you for your feedback, and I apologize for the lack of clarity in
my previous explanation. Let me clarify the rationale behind the
FD_SETSIZE value.
On Linux, FD_SETSIZE typically defaults to 1024, which allows managing
at most 1024 subprocesses via the 'pty method. By setting
FD_SETSIZE = 2048 in my code, the effective maximum subprocess capacity
on Windows becomes 2048 / 2 = 1024, aligning with Linux's default scale.
My original intent was to approximate cross-platform behavioral
consistency, but this may be unnecessary; I could instead set it to the
maximum number of objects my implementation can wait for: 64 * 63.
> > > Also, the above assumes that the values returned by
> > > WaitForMultipleObjects never exceed 0x3fff. But note that WAIT_FAILED
> > > violates this assumption, so I don't think we can safely use this
> > > trick. We need some safer solution.
> >
> > The value of WAIT_FAILED is 0xFFFFFFFF. In my code, I paid special
> > attention to handling this error case. Of course, I will write comments
> > to provide clarification.
>
> I've seen that, but having to test explicitly for WAIT_FAILED before
> checking the bit you added is error-prone. So please try to find a
> safer mechanism to report the wait results. Since this is our code,
> we don't need to stick too closely to the way MS designed the return
> values of WaitForMultipleObjects.
I think I now understand what you mean.
> > > Finally, is it known how long it takes on a modern Windows system to
> > > start a thread? This could be important for assessing the impact of
> > > this design on the performance of sub-processes on MS-Windows.
> > >
> > > Thanks again for working on this.
> >
> > Generally speaking, CreateThread is considered a time-consuming
> > operation. I'm not entirely sure about the exact time it takes, but
> > I'll gather some information. It might be necessary to use a thread
> > pool mechanism or something similar to reduce the overhead of thread
> > creation.
> > [...]
> > I previously implemented a very basic thread pool mechanism. It's a
> > handmade implementation, just sufficient for the current
> > scenario. However, I felt it was too complex, so I decided to stick
> > with creating a new thread each time a call is made.
> >
> > If you think the thread pool implementation is better, I can provide
> > both a simple version and a thread pool version of the patch, as they
> > differ only in a small number of places. Then, we can decide which
> > implementation to use.
>
> Let's discuss this aspect when we know how much time would thread
> creation take.
After some simple investigation, I believe the overhead of creating
new threads for each call to sys_select might be unacceptable.
The link https://stackoverflow.com/a/18275460 provides a good piece of
code for testing the time taken for thread creation. The test results
given in the link show that the time from thread creation to actual
execution ranges from tens to hundreds of microseconds. I slightly
modified the code and tested the time taken for creating and executing
64 threads. Below are my test results:
total: 34.581500 ms, average: 0.540336 ms
max : 1.429200 ms, min : 0.039000 ms
I can modify `sys_select' to observe its call time, but I opted for a
simpler approach: using `accept-process-output':
(benchmark-run 1000000 (accept-process-output))
=> (1.4010608 4 0.2107215)
The above results indicate that the call time of `sys_select' should be
less than 2 microseconds. However, when I applied my patch and created
more than 32 child processes, this value ballooned to an astonishing 160
microseconds.
I'm planning to continue my thread pool implementation. WDYT?