[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Bug-gnubg] redesign gnubg to master/slave
From: |
Olivier Baur |
Subject: |
Re: [Bug-gnubg] redesign gnubg to master/slave |
Date: |
Tue, 23 Dec 2003 00:38:46 +0100 |
Hi everybody!
I'm finally back from Africa -- back to the gnubg world!
Le 21 déc. 03, à 20:54, Joern Thyssen a écrit :
One of the recent topics has been multi-threading. Instead of
multi-threading Jim suggested splitting gnubg into a GUI (master) and
an
engine (slave) which communicates over sockets.
Besides avoiding the problem of making the engine of gnubg thread-safe
it also follows the Unix design principles
(http://www.catb.org/~esr/writings/taoup/html), i.e.,
http://www.catb.org/~esr/writings/taoup/html/ch11s06.html.
As Holger pointed out, I have written a thread-safe version of gnubg;
more precisely, a version of gnubg that can compute rollouts in
multiple threads, taking advantage of multi-processor hosts. It can
also communicate with other gnubg's on a TCP/IP network, extending the
notion of multi-threaded "gnubg computing units" to "remote gnubg
computing units".
The main parts of the work here:
1/ make gnubg thread-safe (at least in its rollout part and GUI part)
-- that is , mainly hunting down "bad" behaved globals and other static
variables that would mix up between different thread contexts;
2/ rewrite the rollout function so it now divides its rollout job in
smaller 1-game rollout tasks, with are then handled and dispatched by a
"task engine", which collects results back to the rollout function for
result accumulation; the bad news is this is several-month old, and I
think some new features have been added to rollouts, like rollout
interrupting/resuming.
3/ write TCP/IP exchange mechanism (and network slave/master
notification and automatic joining)
4/ write a GTK GUI (and get a running thread-safe GTK version of gnubg
For those who will have a look at cvs branch-multi, I have set up my
implementation notes at the following address (plain ascii text file):
http://mapage.noos.fr/gnubgosx/processingunits.txt (different than the
user-oriented HTML version that Holger already mentioned:
http://mapage.noos.fr/gnubgosx/procunits.html)
It describes more technically the concepts I used, how I made gnubg
thread-safe, and how I implemented multi-threading and task handling.
Le 22 déc. 03, à 19:42, Jim Segrave a écrit :
A single instance of gnubg should be capable of doing all of play,
analyse and rollout without having to have any external
processes. This is conceptually easiest for users - there's one
program and it does everything they expect.
I also agree with this
There should be a stripped down version of gnubg, call it
gnubg-analyser, which should be able to do the following:
1) Given a position, a set of dice rolls and the evaluation rules,
select the best move, if any exists, for each roll - sort of a
FindnSaveBestMove() over a more limited set.
Or
2) given a position, a rollout setting, and a trial number, rollout a
single game and return the results - more or less
BasicCubefulRollout()
Or
3) given a position, a dice roll and a move or just a cube decision,
and an analysis context, do an AnalyseMove()
This is what I called "slave processing units" ; processing units are
threads of gnubg (typically, one per processor, on one or several
hosts) ; and there's the master gnubg thread, that drives the GUI (GTK)
and uses the available slave processing units, either local -- running
on the host, one per proc --, or remote -- running on remote hosts, one
per processor on remote host --) . The version of gnubg I wrote starts
up in master mode (the usual gnubg), but can be switched to "slave"
mode, ready to join a master. This can be done either in TTY or GTK.
Please note there are also some non-trivial data to transmit along,
like random number generator info (like seed roots or RNG state) --
which game will be rolled out with wich dice is of utmost importance to
variance reduction.
gnubg itself should allow the user to identify client gnubg-analysers
which it can use to help with evaluations. These would be
interconnected with TCP/IP sockets. They could be on localhost or on
remote machines. At startup, gnubg would attempt to open connections
to the gnubg-analysers so that it would know how many were available
at any given time. An analyser would only server a single gnubg, once
it accepts a connection from that gnubg, any other gnubg will be
unable to connect. Users with multiple machines who wanted to have two
gnubg's going and 4 analysers would have to choose which analysers
would be available to which machines.
I've also written this.
Slaves can notifiy their presence either to all available master hosts
on the local TCP/IP network, or to a single, user-chosen master. When a
master spots a self-notifying slave, it automatically adds it as
co-operating gnubg to its list of processing units.
A master can also be set to connect to an arbitrary list of gnubg
slaves, for exemple by feeding it with a startup file containing gnubg
"pu" instructions (type "help pu" , "help get pu", "help set pu" in
gnubg to get the details, or some older mail I posted about it on the
list some months ago; I need to find the references back)
During startup, the analysers should inform gnubg about their
approximate computing power, this allows the master to make some
guesses about load balancing between machines. Say someone has a
couple of old P-300s on a home network and a 2.4G main machine. You
would not want to try to split the work 33%-33%-33%, you'd probably
want to do something like 10% - 10% - 80% when doing analysis or
rollouts.
I didn't go to these ends, but for now you can define, on the master,
different queue sizes for each remote slave; the queue size basically
relates to the number of tasks that are sent at the same time to a
remote host; so if you're using 2 remote hosts, one being twice as fast
as the other one, you would set the queue size of the faster to 2 and
the slower to 1, or maybe 8 and 4 if they happen to compute so fast
that network overhead becomes too important as compared to real
computations) ; basically, it's hard to precompute how you are going to
split the job: maybe your 10%-10%-80% ratio is not correct; maybe one
host will crash and you'll have to split the remaining; maybe some new
hosts might join in the middle of a long rollout; I think the best
policy is "serve a soon as finished": feed a slave as soon as it has
returned the results of all the tasks it was given; work on the fly.
The "ratio" issue only pertains when there are only a few tasks
remaining to compute in the whole job, but then it gets a very hard
work to make a good job splitting decision: who do you give these
remaining tasks to? do you give them all to process A who is ready to
compute? or do you keep some for process B, which runs very fast, and
"should" be ready soon? I chose the simple solution (feed slave as soon
as available, with some refinment in the "end of job near" case), with
some "simple" queue parameter (which the user can adjust according to
remote host processing speed --increase queue size if host is fast--
and network overhead --don't use queue size so small that network
communication gets noticeable--)
The connections should have keep-alives as part of their protocol, so
that the loss of one of the gnubg-analysers would be noticed before
the main gnubg has waited too long for answers that won't be
forthcoming.
I still have to add better "timeout" handling.
For now, gnubg masters can safely handle slave disappearing from the
network (network problem or slave crash/killed/returned to master mode)
I can see a few different modes of working:
During ordinary play, if gnubg is set to some fairly demanding
standard - I often play against gnubg on 2 ply supremo settings, then
individual moves could be shared among analysers. I'm not sure how
cube decisions would be partitioned, but given a dice roll, I'd think
you could do something like:
gnubg builds a list of legal moves. It then divides that list up and
lets the gnubg-analysers pick the best moves out of a subset of the
legal ones, while gnubg does some for itself. When it completes it's
selection, it then waits for the others to complete, gathers the
results and makes the final decision.
For 0 ply (and quite possibly 1 play), there's
probably no point in using any remote analysis. For 2 ply you assume
that each legal move will take the same amount of time to analyse, so
in the situation postulated before of 2 slow analysers, given 12 legal
moves, gnubg would do 10 and pass 1 to each to the other machines.
I didn't write anything for n-ply evaluation yet (only rollouts, see
below)
I agree with you that it's no use doing anything with 0-ply and 1-ply
evals.
I don't think it's a good idea to split the work before starting to
compute. Rather, divide the work (eg, quickly said, divide a 2-ply eval
into, say, 18 1-ply evals if there are 18 legal moves, and have these
18 tasks given to available threads/processes (processing units in my
terminlogy) -- you can't know beforehand the time that n-ply
evaluations will take (it depends on number of available moves in
deeper plies), so it's hard to fairly dispatch the tasks. One main
thread should do the "job management" (this I called the "task engine"
in my implementation, run by the master thread: send tasks, collect
results), and one more "computing thread" per processor crunches
numbers; plus processes running on remote hosts and communicating with
master thread (actually with other threads that run locally on the
master host and handle the individual TCP connections and exchange
protocol)
The main disadvantage to this is
that caching is largely defeated, since the forward evaluations of the
selected move are likely to have been made on a remote machine. Here
gnubg must decide how much work to hand to each analyser and how much
to keep for itself.
I'm not sure cache impact would be so great; I think that cache can be
mostly useful as you "travel down" the tree of possibilities (plies),
since positions might reappear on deeper layers of the tree; I'd think
there is less to share among different (parallel) parts of the tree
when the first node is already different. I'm not sure if I what I mean
is clear...
A different option for play would be to let gnubg handle all aspects
of play. But it could run a background task, particularly when it's
waiting for user input, where it passes any unanalysed moves to
gnubg-analysers and installes their results in the moverecords. If a
user then goes back in the game list and changes a play, this is no
major issue. If one of the moves which has now been undone, as it
were, has been passed to a gnubg-analyser, then when the answer is
returned, gnubg checks and finds that the moverecord is now gone and
simply discards the answer. The idea is that gnubg will have a match
analysed not long after it has been completed - it may even be
possible for the user to see analysis of earlier moves while the game
is still in progress.
Yes, this would be a particularily nice feature to implemented, which
would benefit all users, not only those who have multi-processor
machines or several computers available on a TCP/IP network.
When anlaysing a match or game, the same procedure as the above
paragraph would be followed - gnubg would build the moverecords, then
begin handing them out to the remote analysers. The only difference
would be that the user would have given the analyse game/session/match
command, so gnubg itself would know that it could also do some of the
analysis itself - it would include itself in the list of available
analysers.
Yes, agreed.
As soon as an analyser completes a move, it is given the next
unprocessed one. In this case there's no need to guess at which
machine is the most powerful, they will each work to full capacity as
long as there are uncompleted moves.
Agreed too. This meets the "feed slave as soon as ready" strategy I
propose for position evals and that I implemented in rollouts.
But I think cache will be important here; that is, if you have already
analysed play N in the match with one processing unit, then you'd
better give it play N+1 to analyse, because there surely will be plenty
of useful data in the cache! I think we'll need some task dispatch
algorithm that strive to give processing units as many consequent plays
to evaluate as possible, without precomputing the job splitting.
Interesting problem... :-)
Rollouts are the most interesting problem, because we want to be able
to interrupt and resume rollouts. Gnubg begins by deciding it will do
the first trial, then passing the next n trials to the
gnubg-analysers. During it's rollout, it checks for analysers having
completed. If they have, it stores the results of that one rollout and
issues the next trial to the analyser. When gnubg completes a rollout,
it then takes the results of that rollout and any following ones in
the series (not including gaps) and builds the cumulative result.
Agreed, this is also my "feed as soon as ready" philosophy.
here's a picture or a rollout where gnubg is faster than analyser 1
and much faster than analyser 2. Trial 1 would be the first rollout a
single gnubg would do, trial 1 would be the second, etc. So, for a
rollout of a single move, these would be the same as the games rolled
out. When rolling out two moves, or a cube decisions, trial 1 is move
1, game 1, trial 2 is move 2, game 1, trial 3 is move 1, game 2, etc.
Time
0 1 2 3 4 5 6
0.........0.........0.........0.........0.........0.........0.....
gnubg | trial 1 | trial 4 | trial 7 | trial 9 | trial 11 |
anal1 | trial 2 | trial 5 | trial 8 | trial 12 |
anal2 | trial 3 | trial 6 | trial 10 |
Time action
0 initiate the rollouts
11 gnubg completes trial 1. It can put this into the cumulative
results and begin trial 4
15 anal1 completes a result. This is stored and anal1 begins
trial 5
19 anal2 completes trial 3. This is stored and anal2 begins trial
6
23 gnubg completes trial 4. It can put trial 2, 3, and 4 into the
cumulative results and begin trial 7
31 anal1 completes trial 5. This is stored and anal1 begins trial
8
34 gnubg complets trial 7. Trial 5 can be added to the cumulative
results, but trial 7 must be stored. gnubg begins trial 9
38 anal2 completes trial 6, this is stored and anal2 begins trial
10
45 gnubg completes trial 9. Trial 6 and 7 are added to the
cumulative results, gnubg stores trial 9 and begins trial 11
47 anal1 completes trial 8. This is stored and anal1 begins trial
12
56 gnubg completes trial 11. Trials 8 and 9 are added to the
cumulative
results.
...
If the rollout were interrupted between 45 and 47, then we'd only have
usable results for trials 1..7, if it were between 48 and 56, then
we'd have trials 8 and 9 as well. In general, this would keep all the
processes working more or less flat out (it should even be possible to
add or subtract servers during a rollout).
That's what my "task engine" is responsible for: assigning tasks (and
reassigning failed tasks ) and collecting results.
The potential for people with a home network or friends who are
willing to lend their machines for overnight rollouts or whatever
Right. This I implemented too: slaves joining or quitting a long
rollout without disturbing the whole job -- it basically comes free
with the "serve as soon as..." strategy :-)
One thing: this can also be done on the Internet (why limit to a home
network?)
One can even imagine "gnubg slave farms" to compute big rollouts or
maybe help in training new neural nets (I don't know if training
algorithms can be easily split down into subtasks to dispatch among
cooperating slaves, like the SETI screensaver)
Other thoughts:
To deal with firewall issues, it should be possible to specify the
ports to be used for both outbound connections from gnubg and inbound
to the analyser.
It would be a good idea to link to libwrap
The analyser should be allowed to work either from stdin/stdout (so it
can be invoked from inetd or with a command line to listen on a
specific port.
All the exchanges should be done in ASCII, numerical values can be
exchanged in %.18g or similar format.
It's probably easiest, though not the most efficient, to exchange the
full analysis or rollout contexts with each service request and to
pass them back again as part of the results.
Agreed to. I've also implemented the above, to some degree. Details can
be found in the txt document I pointed at the beginning of this mail.
Bottom line
I'm not sure it is necessary to go through a full rewrite of gnubg in
master/slave at the GUI / engine level, as I've already done it at
master thread / slave thread level, with only the master thread having
access to GUI, and a single gnubg process being able to be switched
to/from master/slave state.
Please have a look at branch-multi, and try and compile it on your
platforms. I know that it works on Macs (multi-threading and remote
processing), on PC's (multithreading only, still got problems with
winsock for remote processing), and I think it should work with most
POSIX-compliant flavours of UNIX/Linux.
Best
-- Olivier