
[gnuastro-devel] [task #15047] Server-client Gnuastro operation


From: Mohammad Akhlaghi
Subject: [gnuastro-devel] [task #15047] Server-client Gnuastro operation
Date: Tue, 18 Sep 2018 10:14:20 -0400 (EDT)
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0

URL:
  <https://savannah.gnu.org/task/?15047>

                 Summary: Server-client Gnuastro operation
                 Project: GNU Astronomy Utilities
            Submitted by: makhlaghi
            Submitted on: Tue 18 Sep 2018 04:14:19 PM CEST
         Should Start On: Tue 18 Sep 2018 12:00:00 AM CEST
   Should be Finished on: Tue 18 Sep 2018 12:00:00 AM CEST
                Category: All Gnuastro
                Priority: 5 - Normal
              Item Group: None
                  Status: Need Info
                 Privacy: Public
        Percent Complete: 0%
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
                  Effort: 0.00

    _______________________________________________________

Details:

As we process larger and larger datasets (commonly referred to as "big
data"), it is becoming impossible/impractical to download the entire raw
dataset for local processing. This is especially important for the upcoming
LSST <https://en.wikipedia.org/wiki/Large_Synoptic_Survey_Telescope> project,
which will be producing roughly 15TB of data per night
<https://www.lsst.org/about/dm/technology>.

One possible solution that came up recently in my discussion with
Mohammad-reza Khellat was the low-level use of network protocols within
Gnuastro. In summary, within each program, the heavy-duty parts of the
processing (those requiring the full raw input dataset) would be done on the
data-center server, and the higher-level parts would be done on the client.

The scenario that we have discussed so far looks something like this (and will
certainly evolve):

* Both the client (user's computer) and server (for example LSST data center)
have the same version of Gnuastro installed.

* The client-side program (for example NoiseChisel on the user's computer)
connects to the server-side program (NoiseChisel on the server), gets all the
necessary low-level metadata of the input dataset (for example the numeric
data type and image size: only a few bytes), and defines/manages the
necessary high-level steps.

* As in Gnuastro's current multi-threaded operation (where the work is
distributed over many threads), the client-side program instructs/manages the
server-side to use the server's CPU and RAM to process the low-level data
into the higher-level products.

* The output can then be stored in either of two ways: 1) on the server (for
even higher-level processing), or 2) on the client. We can use the SSH-style
"server:file" notation to let the programs know where the output should be
stored. In the latter case, during the processing, the necessary patches of
the output will be transferred to the client (as each is processed by a
thread on the server, not all at once, thus greatly improving speed and
redundancy) and the final output file is written on the user's computer.
** In the case of programs like NoiseChisel and Segment (the programs on the
boundary between low-level images and high-level catalogs), the output is a
labeled/integer-valued image which can be highly compressed: for example, in
a test I just did, NoiseChisel's raw output (with --rawoutput and
--oneelempertile) on a 2.9GB image (28362 x 25297 pixels), containing a huge
galactic cirrus structure, is 19.8MB when compressed with Gzip's --best
option, or 14.7MB with Lzip's --best option (compression ratios of ~150 and
~200). This can (potentially!) allow these labeled images to be the point
where it is possible to continue the processing (or archive the results)
locally, while the actual telescope images stay on the server.
** Higher-level programs like Segment or MakeCatalog can also avoid having to
read the full image into the server's RAM (to consume less of its precious
resources). They can load only the parts of the input image that are needed
at each moment/CPU-thread (over each detection or clump).

We will be looking into existing network protocols to find the best one for
this job, or will possibly define a new protocol that is tailored to
efficient operations like the scenario mentioned above.

This can be done in parallel with task #14779 (Enable usage in HTCondor at
configure time), and the two may not be totally independent.

This is mainly a brainstorm for now; we will be looking into the details of
an implementation. So please leave your thoughts or comments (the more
critical, the better) on this issue on GNU Savannah (you just need to create
a Savannah ID to post comments).




    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/task/?15047>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



