coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: uniq --accumulate


From: Peng Yu
Subject: Re: uniq --accumulate
Date: Tue, 7 Feb 2012 10:43:19 -0600

On Tue, Feb 7, 2012 at 10:21 AM, Assaf Gordon <address@hidden> wrote:
> Pádraig Brady wrote, On 02/07/2012 11:00 AM:
>> On 02/07/2012 03:56 PM, Peng Yu wrote:
>>>
>>> Suppose that I have a table of the following, where the last column is
>>> a number. I'd like to accumulate the number of rows that are the same
>>> for all the remaining columns.
>>>
>>
>> Thanks for the suggestion,
>> but this is too specialized for coreutils I think.
>
> Slightly off-topic for coreutils,
>
> but a package called "BEDTools" ( http://code.google.com/p/bedtools/ ) 
> provides a program called "groupBy", which does exactly that, and more.
>
> Akin to SQL's "group by" command, the program can group a text file by a 
> specified column, and perform operations (count,sum,mean,median,etc.) on 
> another column.

OK. This is what I'm looking for. The latest groupBy is from filo. I
think that the field separator is not documented. My guess is that it
is '\t'. But users can not change it. Would you please document it and
add an option so that users can change it? Thanks!

~$ groupBy -h

Program: groupBy (v1.1.0)
Authors: Aaron Quinlan (address@hidden)
         Assaf Gordon
Summary: Summarizes a dataset column based upon
         common column groupings. Akin to the SQL "group by" command.

Usage:   groupBy -i [FILE] -g [group_column(s)] -c [op_column(s)] -o [ops]
         cat [FILE] | groupBy -g [group_column(s)] -c [op_column(s)] -o [ops]

Options:
        -i              Input file. Assumes "stdin" if omitted.

        -g -grp         Specify the columns (1-based) for the grouping.
                        The columns must be comma separated.
                        - Default: 1,2,3

        -c -opCols      Specify the column (1-based) that should be summarized.
                        - Required.

        -o -ops         Specify the operation that should be applied to opCol.
                        Valid operations:
                            sum, count, min, max,
                            mean, median, mode, antimode,
                            stdev, sstdev (sample standard dev.),
                            collapse (i.e., print a comma separated list 
(duplicates allowed)),
                            distinct (i.e., print a comma separated list (NO 
duplicates allowed)),
                            concat   (i.e., merge values into a single, 
non-delimited string),
                            freqdesc (i.e., print desc. list of values:freq)
                            freqasc (i.e., print asc. list of values:freq)
                        - Default: sum

        -full           Print all columns from input file.
                        Default: print only grouped columns.

        -inheader       Input file has a header line - the first line will be 
ignored.

        -outheader      Print header line in the output, detailing the column 
names.
                        If the input file has headers (-inheader), the output 
file
                        will use the input's column names.
                        If the input file has no headers, the output file
                        will use "col_1", "col_2", etc. as the column names.

        -header         same as '-inheader -outheader'

        -ignorecase     Group values regardless of upper/lower case.

Examples:
        $ cat ex1.out
        chr1 10  20  A   chr1    15  25  B.1 1000    ATAT
        chr1 10  20  A   chr1    25  35  B.2 10000   CGCG

        $ groupBy -i ex1.out -g 1,2,3,4 -c 9 -o sum
        chr1 10  20  A   11000

        $ groupBy -i ex1.out -grp 1,2,3,4 -opCols 9,9 -ops sum,max
        chr1 10  20  A   11000   10000

        $ groupBy -i ex1.out -g 1,2,3,4 -c 8,9 -o collapse,mean
        chr1 10  20  A   B.1,B.2,    5500

        $ cat ex1.out | groupBy -g 1,2,3,4 -c 8,9 -o collapse,mean
        chr1 10  20  A   B.1,B.2,    5500

        $ cat ex1.out | groupBy -g 1,2,3,4 -c 10 -o concat
        chr1 10  20  A   ATATCGCG

Notes:
        (1)  The input file/stream should be sorted/grouped by the -grp. columns
        (2)  If -i is unspecified, input is assumed to come from stdin.


-- 
Regards,
Peng



reply via email to

[Prev in Thread] Current Thread [Next in Thread]