Hi Eli and Cecilio,
Thanks for your replies. I will add sufficiently detailed comments in the
next version.
> > /* MSVC runtime library has limit of 64 descriptors by default */
> > -#define FD_SETSIZE 64
> > +#define FD_SETSIZE (MAXIMUM_WAIT_OBJECTS * MAXIMUM_WAIT_OBJECTS / 2)
>
> What is the reason for dividing by 2? A comment should explain that.
As we know, WaitForMultipleObjects supports a maximum of 64 objects. In
my implementation, I use an array of length 64 to hold objects, and
some of those objects can be handles of helper threads that each wait
on up to 64 further objects. An array of length 64 can therefore
represent up to 64 * 64 = 4096 objects.
Because of the way Emacs implements inter-process communication on
Windows, each subprocess requires waiting on two handles: the
subprocess itself and its pipe. In principle the scheme could thus
wait for 4096 / 2 = 2048 subprocesses. However, to remain consistent
with the number of descriptors supported by `select', I limited
FD_SETSIZE to 2048 descriptors, which is where
(MAXIMUM_WAIT_OBJECTS * MAXIMUM_WAIT_OBJECTS / 2) comes from. With two
handles per subprocess, that reduces the maximum number of
subprocesses to 1024.
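Concretely, the comment I intend to add next to the macro would read
roughly like this (the wording is still a draft):

  /* A single WaitForMultipleObjects call accepts at most
     MAXIMUM_WAIT_OBJECTS (64) handles.  Each of those 64 slots may
     hold the handle of a helper thread which itself waits on up to 64
     objects, so the scheme can cover 64 * 64 = 4096 handles.  Every
     subprocess needs two handles (the process itself and its pipe),
     so we halve that and allow up to 2048 descriptors, i.e. at most
     1024 subprocesses.  */
  #define FD_SETSIZE (MAXIMUM_WAIT_OBJECTS * MAXIMUM_WAIT_OBJECTS / 2)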
> > +#define WAIT_TIMEOUT_2 (WAIT_TIMEOUT | 0x4000)
> > +#define WAIT_ABANDONED_0_2 (WAIT_ABANDONED_0 | 0x4000)
> > +#define WAIT_GROUP_SIZE (MAXIMUM_WAIT_OBJECTS - 1)
> > +#define MAXIMUM_WAIT_OBJECTS_2 FD_SETSIZE
>
> I'd prefer to define our own macros whose names do not resemble the
> ones MS uses. This is to avoid confusion, so people don't try to look
> up MS documentation or Windows API headers for these symbols.
That's true.
> Also, the above assumes that the values returned by
> WaitForMultipleObjects never exceed 0x3fff. But note that WAIT_FAILED
> violates this assumption, so I don't think we can safely use this
> trick. We need some safer solution.
The value of WAIT_FAILED is 0xFFFFFFFF, and my code checks for and
handles that error case specially. I will add comments to make this
clear.
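Roughly, the idea is to check for the error value before any result
tagging happens (a simplified sketch, not the literal code from my
patch):

  DWORD result = WaitForMultipleObjects (nCount, lpHandles, FALSE,
                                         dwMilliseconds);
  /* WAIT_FAILED is 0xFFFFFFFF, so it must never be run through the
     0x4000 tagging below; return it to the caller unchanged.  */
  if (result == WAIT_FAILED)
    return WAIT_FAILED;
  if (result == WAIT_TIMEOUT)
    return WAIT_TIMEOUT_2;
  if (result >= WAIT_ABANDONED_0 && result < WAIT_ABANDONED_0 + nCount)
    return result + (WAIT_ABANDONED_0_2 - WAIT_ABANDONED_0);
  /* Otherwise it is WAIT_OBJECT_0 .. WAIT_OBJECT_0 + nCount - 1.  */
  return result;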
> > +typedef struct {
> > + HANDLE* lpHandles;
> > + int nCount;
> > + BOOL bWaitAll;
> > + DWORD dwMilliseconds;
> > + int nIndex;
> > +} WaitForMultipleObjectsParams;
>
> Please add comments describing each member of the struct.
>
> The bWaitAll member seems to be unnecessary: we always set it to FALSE
> when calling WaitForMultipleObjects end MsgWaitForMultipleObjects.
Yes, I can remove it.
> Also, our conventions are not to use MS-style camel-style symbol names
> for symbols we introduce. So I would suggest to rename the above to
> something like wait_objects_params. And similarly for other
> structures and functions in the patch.
>
> > +typedef struct
> > +{
> > + HANDLE *hObjects;
> > + WaitForMultipleObjectsParams *pParams;
> > + HANDLE *pHandles;
> > + HANDLE hEndEvent;
> > + int nObject;
> > + int nThread;
> > + int nHandle;
> > +} WaitForMultipleObjectsInfo;
>
> Same here: please add comments describing the members.
Ok.
> > +static void
> > +InitializeWaitForMultipleObjectsInfo (WaitForMultipleObjectsInfo *p,
> > + DWORD nCount,
> > + CONST HANDLE *lpHandles,
> > + BOOL bWaitAll,
> > + DWORD dwMilliseconds)
> > +{
> > + p->nThread = 1 + (nCount - 1 - MAXIMUM_WAIT_OBJECTS / 2)
> > + / WAIT_GROUP_SIZE;
>
> Please add here comments explaining the various constants you use,
> like 1, 2, etc.
> > + p->hObjects = xmalloc (sizeof (HANDLE) * p->nObject);
> > + if (p->nObject != p->nThread)
> > + memcpy (p->hObjects + p->nThread,
> > + lpHandles + p->nThread * WAIT_GROUP_SIZE,
> > + sizeof (HANDLE) * (nCount - p->nThread * WAIT_GROUP_SIZE));
> > + p->pParams = xmalloc (sizeof (WaitForMultipleObjectsParams) * p->nThread);
> > + p->pHandles = xmalloc (sizeof (HANDLE) * p->nThread *
> > MAXIMUM_WAIT_OBJECTS);
>
> Since the maximum size of these is known in advance, and fixed by the
> implementation, I think it would be better to avoid the constant
> allocation and deallocation of these arrays, and instead have static
> arrays of their maximum size. In normal Emacs usage, sys_select is
> called quite often, and each call will need to xmalloc/xfree these
> variables. Their maximum size seems to take less than 10KB (am I
> right?), which is a small price to pay for avoiding constant heap
> allocations.
Yes, you're right that static arrays are preferable here. The maximum
size of p->pHandles is slightly larger than you estimated, though:
8 * 32 * 64 = 16384 bytes, i.e. 16KB.
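Something along these lines should work (a sketch only; the type and
array names follow your suggested renaming and are not final):

  /* Parameters handed to one helper thread; bWaitAll is dropped.  */
  typedef struct {
    HANDLE *handles;       /* this thread's group of handles to wait on */
    int count;             /* number of handles in the group */
    DWORD timeout_msec;    /* timeout for WaitForMultipleObjects */
    int index;             /* raw wait result, filled in by the thread */
  } wait_objects_params;

  /* At most 32 helper threads are ever needed (see the FD_SETSIZE
     computation), so the arrays can have a fixed size and the
     per-call xmalloc/xfree disappears.  The largest one is
     32 * 64 * sizeof (HANDLE) = 16KB on 64-bit builds.  */
  enum { MAX_WAIT_THREADS = MAXIMUM_WAIT_OBJECTS / 2 };
  static HANDLE wait_thread_objects[MAXIMUM_WAIT_OBJECTS];
  static wait_objects_params wait_thread_params[MAX_WAIT_THREADS];
  static HANDLE wait_group_handles[MAX_WAIT_THREADS * MAXIMUM_WAIT_OBJECTS];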
> > + p->hEndEvent = CreateEvent (NULL, TRUE, FALSE, NULL);
>
> I wonder if it would be better to give the event a non-NULL name which
> includes the number of processes. This could help in debugging, I
> think.
>
> > + p->hObjects[i] = CreateThread (NULL, 0, WaitForMultipleObjectsWrapped,
> > + p->pParams + i, 0, NULL);
>
> The 2nd and 5th arguments zero might cause trouble, at least in the
> 32-bit builds of Emacs. See the comments around line 1100 of
> w32proc.c, which explain why and how to request smaller stack size for
> the helper threads.
Thanks, I'll fix these problems.
> > +static void
> > +CleanupWaitForMultipleObjectsInfo (WaitForMultipleObjectsInfo *p)
> > +{
> > + SetEvent (p->hEndEvent);
> > + WaitForMultipleObjects(p->nThread, p->hObjects, TRUE, INFINITE);
> Instead of waiting forever for the threads to exit, wouldn't it be
> better to forcibly terminate them? When the cleanup function is
> called, we know that at least one thread exited, so we could terminate
> the rest. Or at least do that after waiting some reasonable amount of
> time, like 100 msec, for them to exit due to SetEvent. I'm worried
> that we might become stuck in this WaitForMultipleObjects forever if
> some thread fails to exit for some reason.
I can definitely do that, and will make the necessary modifications.
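For instance (a sketch using the renamed identifiers, and assuming the
arrays become static so no xfree is needed anymore):

  static void
  cleanup_wait_objects_info (wait_objects_info *p)
  {
    /* Ask the helper threads to exit, then give them a short grace
       period instead of waiting forever.  */
    SetEvent (p->end_event);
    DWORD r = WaitForMultipleObjects (p->n_threads, p->thread_handles,
                                      TRUE, 100);
    if (r == WAIT_TIMEOUT || r == WAIT_FAILED)
      for (int i = 0; i < p->n_threads; i++)
        /* Last resort: forcibly terminate a thread that did not exit.  */
        TerminateThread (p->thread_handles[i], 0);
    /* Close the thread handles before the arrays are reused.  */
    for (int i = 0; i < p->n_threads; i++)
      CloseHandle (p->thread_handles[i]);
    CloseHandle (p->end_event);
  }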
> > +static DWORD
> > +WaitForMultipleObjectsThreaded (DWORD nCount,
> > + HANDLE *lpHandles,
> > + BOOL bWaitAll,
> > + DWORD dwMilliseconds)
> > +{
> > + if (nCount <= MAXIMUM_WAIT_OBJECTS)
> > + {
> > + DWORD result = WaitForMultipleObjects (nCount, lpHandles, bWaitAll,
> > + dwMilliseconds);
> > + if (result == WAIT_TIMEOUT)
> > + result = WAIT_TIMEOUT_2;
> > + else if (result >= WAIT_ABANDONED_0
> > + && result < WAIT_ABANDONED_0 + nCount)
> > + result += WAIT_ABANDONED_0_2 - WAIT_ABANDONED_0;
> > + return result;
> > + }
> > + WaitForMultipleObjectsInfo info;
> > + InitializeWaitForMultipleObjectsInfo (&info, nCount, lpHandles,
> > + bWaitAll, dwMilliseconds);
> > + DWORD result = WaitForMultipleObjects (info.nObject, info.hObjects,
> > + bWaitAll, dwMilliseconds);
> > + result = ExtractWaitResult (&info, result);
> > + CleanupWaitForMultipleObjectsInfo (&info);
> > + return result;
> > +}
>
> IIUC, this will cause us starting 2 threads when we need to start 65
> processes. Wouldn't it be better to use just one thread, and wait for
> the other processes as we do now using the rest of available slots in
> the wait_hnd[] array?
In fact, my approach already matches your description:

  p->nThread = 1 + (nCount - 1 - MAXIMUM_WAIT_OBJECTS / 2)
               / WAIT_GROUP_SIZE;

In InitializeWaitForMultipleObjectsInfo, when we have 65 objects,
1 + (65 - 1 - 32) / 63 equals 1: one thread will manage 63 objects,
and the remaining two objects go directly into wait_hnd[]. Only when
we have 63 + 33 = 96 objects do we start a second thread. In other
words, I use at most 32 vacant positions in wait_hnd[] for objects
waited on directly; 32 is the number of slots left over when all 32
helper threads are in use.
We could use the vacant positions in wait_hnd[] even more
aggressively, but that would require more complex calculations.
Of course, it's best if I include this information in the comments so
that it's clear when you first look at the code.
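To make this concrete, here is how the number of helper threads grows
(values computed from the formula above, with WAIT_GROUP_SIZE = 63):

  nCount <= 64             0 helper threads (plain WaitForMultipleObjects)
  65 <= nCount <= 95       1 helper thread
  96 <= nCount <= 158      2 helper threads
  ...
  1986 <= nCount <= 2048   32 helper threads

At the FD_SETSIZE limit of 2048 handles, 32 * 63 = 2016 handles go
through helper threads and the remaining 32 occupy wait_hnd[]
directly.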
> Finally, is it known how long it takes on a modern Windows system to
> start a thread? This could be important for assessing the impact of
> this design on the performance of sub-processes on MS-Windows.
>
> Thanks again for working on this.
Generally speaking, CreateThread is considered a time-consuming
operation. I'm not entirely sure about the exact time it takes, but
I'll gather some information. It might be necessary to use a thread
pool mechanism or something similar to reduce the overhead of thread
creation.
On Sat, 25 Jan 2025 22:12:59 +0100, Cecilio Pardo wrote:
> > +static void
> > +CleanupWaitForMultipleObjectsInfo (WaitForMultipleObjectsInfo *p)
> > +{
> > + SetEvent (p->hEndEvent);
> > + WaitForMultipleObjects(p->nThread, p->hObjects, TRUE, INFINITE);
> > +
> > + xfree (p->hObjects);
> > + xfree (p->pHandles);
> > + xfree (p->pParams);
> > + CloseHandle (p->hEndEvent);
> > + for (int i = 0; i < p->nThread; i++)
> > + CloseHandle (p->hObjects[i]);
> > + return;
> > +}
>
> I agree with using static data, but in case you keep the allocations,
> close the handles in hObjects before freeing hObjects.
Thanks.
> > IIUC, this will cause us starting 2 threads when we need to start 65
> > processes. Wouldn't it be better to use just one thread, and wait for
> > the other processes as we do now using the rest of available slots in
> > the wait_hnd[] array?
> >
> > Finally, is it known how long it takes on a modern Windows system to
> > start a thread? This could be important for assessing the impact of
> > this design on the performance of sub-processes on MS-Windows.
>
> Have you considered using the thread pool (QueueUserWorkItem) instead
> of manually creating threads?. This would make the code somewhat more
> complex. Also, threads from the pool use the default stack size, and
> that seems to be a problem here.
I previously implemented a very basic thread pool mechanism. It's a
handmade implementation, just sufficient for this scenario. However, I
felt it made the code more complex, so I decided to stick with
creating a new thread each time a call is made.
If you think the thread pool implementation is better, I can provide
both a simple version and a thread pool version of the patch, as they
differ only in a small number of places. Then we can decide which
implementation to use.
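For reference, the handmade pool I experimented with had roughly this
shape (a simplified sketch, not the actual code; the names are
illustrative, and the workers would be created once with the reduced
stack size discussed above):

  #define POOL_SIZE (MAXIMUM_WAIT_OBJECTS / 2)  /* at most 32 workers */

  static HANDLE pool_threads[POOL_SIZE];      /* created once, then reused */
  static HANDLE pool_work_ready[POOL_SIZE];   /* auto-reset: params filled in */
  static HANDLE pool_work_done[POOL_SIZE];    /* auto-reset: result available */
  static wait_objects_params pool_params[POOL_SIZE];

  static DWORD WINAPI
  pool_worker (LPVOID arg)
  {
    int slot = (int) (intptr_t) arg;  /* slot index passed at creation */
    for (;;)
      {
        /* Sleep until sys_select hands this worker a group of handles.  */
        WaitForSingleObject (pool_work_ready[slot], INFINITE);
        wait_objects_params *p = &pool_params[slot];
        p->index = (int) WaitForMultipleObjects (p->count, p->handles,
                                                 FALSE, p->timeout_msec);
        SetEvent (pool_work_done[slot]);
      }
  }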
Regards.