daggr

aggregate data from text-based streams

npm install daggr

daggr: filter and aggregate numeric data in plaintext or JSON form

This tool is still in development. Arguments and syntax may change!

Synopsis

daggr reads records on stdin and filters, transforms, and aggregates them based on the command-line flags. It processes both text and JSON data. It's inspired by both awk(1) and dtrace(1M).

With no arguments, it filters nothing and performs no transformations, so it acts exactly like cat(1):

$ daggr < sample-data/requests.txt
PUT  /dap/public/kartlytics/videos/2012-06-19-00.mov      201 34
PUT  /dap/public/kartlytics/videos/2012-06-19-00.mov.json 202 15
HEAD /dap/public/kartlytics/videos/2012-06-19-02.mov      200 19
GET  /dap/public/kartlytics/videos/2012-06-19-02.mov.json 200 12
PUT  /dap/public/kartlytics/videos/2012-06-19-03.mov      201 35
PUT  /dap/public/kartlytics/videos/2012-06-19-03.mov.json 202 20
GET  /dap/public/kartlytics/videos/2012-06-19-04.mov      200 16
GET  /dap/public/kartlytics/videos/2012-06-19-04.mov.json 200 16
GET  /dap/public/kartlytics/videos/2012-06-19-05.mov      200 16
GET  /dap/public/kartlytics/videos/2012-06-19-05.mov.json 200 15
GET  /dap/public/kartlytics/videos/2012-06-19-06.mov      200 15
GET  /dap/public/kartlytics/videos/2012-06-19-06.mov.json 200 17
GET  /dap/public/kartlytics/videos/2012-06-19-07.mov      200 10
GET  /dap/public/kartlytics/videos/2012-06-19-07.mov.json 200 18
GET  /dap/public/kartlytics/videos/2012-06-19-08.mov      200 8
GET  /dap/public/kartlytics/videos/2012-06-19-08.mov.json 200 8
GET  /dap/public/kartlytics/videos/2012-06-19-09.mov      200 8
GET  /dap/public/kartlytics/videos/2012-06-19-09.mov.json 200 22
GET  /dap/public/kartlytics/videos/2012-06-19-10.mov      200 7
GET  /dap/public/kartlytics/videos/2012-06-19-10.mov.json 200 7
GET  /dap/public/kartlytics/videos/2012-06-28-00.mov      200 16
GET  /dap/public/kartlytics/videos/2012-06-28-00.mov.json 200 18
GET  /dap/public/kartlytics/videos/2012-06-29-00.mov      200 8
GET  /dap/public/kartlytics/videos/2012-06-29-00.mov.json 200 24
PUT  /dap/public/kartlytics/videos/2012-06-29-01.mov      204 40
PUT  /dap/public/kartlytics/videos/2012-06-29-01.mov.json 204 34

Filtering

You can keep just the PUTs by supplying a filter with "-f":

$ daggr -f '$1 == "PUT"' < sample-data/requests.txt 
PUT  /dap/public/kartlytics/videos/2012-06-19-00.mov      201 34
PUT  /dap/public/kartlytics/videos/2012-06-19-00.mov.json 202 15
PUT  /dap/public/kartlytics/videos/2012-06-19-03.mov      201 35
PUT  /dap/public/kartlytics/videos/2012-06-19-03.mov.json 202 20
PUT  /dap/public/kartlytics/videos/2012-06-29-01.mov      204 40
PUT  /dap/public/kartlytics/videos/2012-06-29-01.mov.json 204 34
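
For readers curious how a filter such as `$1 == "PUT"` might be applied to a line of text, here is a minimal JavaScript sketch. It is not daggr's actual implementation: the helper name `makeTextFilter` and the ten-field limit are assumptions made purely for illustration.

```javascript
// Hypothetical helper (not part of daggr): compile a filter expression into a
// predicate over one line of whitespace-separated text.
function makeTextFilter(expr, maxFields = 10) {
  // $0, $1, ..., $maxFields are valid JavaScript identifiers, so they can be
  // used directly as parameter names.
  const names = [];
  for (let i = 0; i <= maxFields; i++) names.push('$' + i);
  const fn = new Function(...names, 'return (' + expr + ');');

  return function (line) {
    const fields = line.trim().split(/\s+/);
    // $0 is the whole line; $1 and up are the fields (undefined when missing).
    const args = [line];
    for (let i = 1; i <= maxFields; i++) args.push(fields[i - 1]);
    return Boolean(fn(...args));
  };
}

const isPut = makeTextFilter('$1 == "PUT"');
const lines = [
  'PUT  /dap/public/kartlytics/videos/2012-06-19-00.mov      201 34',
  'HEAD /dap/public/kartlytics/videos/2012-06-19-02.mov      200 19',
];
console.log(lines.filter(isPut)); // keeps only the PUT line
```

Fields arrive as strings, so a numeric comparison such as `$4 > 20` relies on JavaScript's usual string-to-number coercion in this sketch.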

Selecting fields

You could select just the second field with:

$ daggr -f '$1 == "PUT"' -o 2 < sample-data/requests.txt
/dap/public/kartlytics/videos/2012-06-19-00.mov
/dap/public/kartlytics/videos/2012-06-19-00.mov.json
/dap/public/kartlytics/videos/2012-06-19-03.mov
/dap/public/kartlytics/videos/2012-06-19-03.mov.json
/dap/public/kartlytics/videos/2012-06-29-01.mov
/dap/public/kartlytics/videos/2012-06-29-01.mov.json

Aggregations

So far, this is just another way to do what 'awk' already does. But daggr also supports DTrace-like aggregating actions. The simplest is "count", which behaves much like "wc -l":

$ daggr count < sample-data/requests.txt 
26

Of course, this can be combined with filtering:

$ daggr -f '$1 == "PUT"' count < sample-data/requests.txt 
6

Grouping results by some other field

Instead of filtering, you could break out the count by method (field 1):

$ daggr -k 1 count < sample-data/requests.txt
PUT 6
HEAD 1
GET 19

You can also aggregate by multiple fields:

$ daggr -k1 -k3 count < sample-data/requests.txt 
PUT,201 2
PUT,202 2
PUT,204 2
HEAD,200 1
GET,200 19
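
The grouping shown above can be sketched in a few lines of JavaScript. This is an illustration of the idea rather than daggr's own code: `countBy` is a hypothetical name, and the sample lines use shortened, made-up paths.

```javascript
// Hypothetical sketch (not daggr's code): count lines grouped by one or more
// 1-based field numbers, joining multi-field keys with a comma.
function countBy(lines, keyFields) {
  const counts = new Map(); // a Map keeps keys in first-seen order
  for (const line of lines) {
    if (line.trim() === '') continue; // skip blank rows
    const fields = line.trim().split(/\s+/);
    const key = keyFields.map((n) => fields[n - 1]).join(',');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Made-up sample lines in the same "method path status latency" layout.
const sample = [
  'PUT  /a.mov      201 34',
  'PUT  /a.mov.json 202 15',
  'HEAD /b.mov      200 19',
  'GET  /b.mov.json 200 12',
  'PUT  /c.mov      201 35',
];
for (const [key, n] of countBy(sample, [1, 3])) {
  console.log(key, n);
}
```

Keeping one counter per distinct key means the input only has to be read once, regardless of how many groups appear.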

Other types of aggregations

Instead of counting lines, you could average the numbers in column 4 (which represent latencies in this dataset):

$ daggr -f '$1 == "PUT"' -v 4 avg < sample-data/requests.txt 
29.666666666666668

Of course, you can break that out by column 1, too:

$ daggr -k 1 -v 4 avg < sample-data/requests.txt 
PUT 29.666666666666668
HEAD 19
GET 13.736842105263158

Another useful aggregating action is "quantize", which generates a power-of-two histogram of a numeric quantity. This example prints out a histogram of the value of field 4 for each value of field 1:

$ daggr -k 1 -v 4 quantize < sample-data/requests.txt
PUT 
           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |@@@@@@@                                  1
              16 |@@@@@@@                                  1
              32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              4
              64 |                                         0

HEAD 
           value  ------------- Distribution ------------- count
               8 |                                         0
              16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
              32 |                                         0

GET 
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |@@@@                                     2
               8 |@@@@@@@@@@@@@@@@@                        8
              16 |@@@@@@@@@@@@@@@@@@@                      9
              32 |                                         0
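
To make the bucketing concrete, here is a rough JavaScript sketch of a power-of-two histogram. It is not daggr's implementation: the names `bucketOf`, `quantize`, and `bar` are hypothetical, the handling of values below 1 is an assumption, and the 40-character bar width is inferred from the sample output above.

```javascript
// Hypothetical sketch (not daggr's code) of power-of-two bucketing.
// Each value >= 1 lands in the bucket labelled with the largest power of two
// that does not exceed it, e.g. 34 -> 32 and 15 -> 8.
function bucketOf(value) {
  if (value < 1) return 0; // assumption: everything below 1 shares bucket 0
  let bucket = 1;
  while (bucket * 2 <= value) bucket *= 2;
  return bucket;
}

function quantize(values) {
  const counts = new Map();
  for (const v of values) {
    const bucket = bucketOf(v);
    counts.set(bucket, (counts.get(bucket) || 0) + 1);
  }
  // Sort buckets numerically so the histogram reads top to bottom.
  return new Map([...counts].sort((x, y) => x[0] - y[0]));
}

// Draw a bar whose length is the bucket's share of all values, out of `width`.
function bar(count, total, width = 40) {
  return '@'.repeat(Math.round((count / total) * width));
}

// Latencies of the six PUT requests in sample-data/requests.txt.
const putLatencies = [34, 15, 35, 20, 40, 34];
const hist = quantize(putLatencies);
for (const [bucket, count] of hist) {
  const line = bar(count, putLatencies.length).padEnd(41);
  console.log(String(bucket).padStart(16) + ' |' + line + count);
}
```

With the PUT latencies, the values 15, 20, and the four values between 32 and 63 fall into buckets 8, 16, and 32 respectively, matching the counts in the PUT histogram above.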

JSON data

The above examples use text for simplicity, but you can do all the same things with newline-separated JSON records by specifying the "-j" option. With "-j", each line is parsed as JSON, and its fields become accessible as variables for use with -k, -v, and -o. For example, consider HTTP access records like the following (similar to those emitted by restify using bunyan). The record is pretty-printed here for readability; in the actual input, each record occupies a single line:

{
  "req": {
    "method": "PUT",
    "headers": {
      "accept": "application/json",
      "content-length": "29",
      "content-type": "application/json",
      "date": "Sun, 03 Nov 2013 20:09:44 GMT",
      "expect": "100-continue",
      "x-request-id": "897a3f08-b885-4499-bff6-d53a78e483b1",
      "user-agent": "restify/2.6.0 (ia32-sunos; v8/3.11.10.26; OpenSSL/0.9.8w) node/0.8.26",
      "accept-version": "~1.0",
      "host": "localhost",
      "connection": "keep-alive"
    },
    "httpVersion": "1.1",
    "caller": {
      "login": "dap"
    },
    "request-uri": "/dap/public/kartlytics/videos/2012-06-19-07.mov.json"
  },
  "res": {
    "statusCode": 204,
    "headers": {
      "date": "Sun, 03 Nov 2013 20:09:44 GMT",
      "x-response-time": 34
    }
  }
}

Here's an example that prints out the value of "res.statusCode" for the records with "req.method" == "PUT":

$ daggr -j -f 'req.method == "PUT"' -o res.statusCode < requests.json
204
204
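
A dotted field name like "res.statusCode" can be resolved by walking the parsed record one property at a time. The JavaScript below is a sketch of that idea under stated assumptions, not daggr's actual lookup code; `getField` is a hypothetical name and the record is a trimmed-down version of the example above.

```javascript
// Hypothetical sketch (not daggr's code): look up a dotted field name such as
// "res.statusCode" in a parsed JSON record.
function getField(record, path) {
  let value = record;
  for (const part of path.split('.')) {
    // Stop early if an intermediate value is missing or not an object.
    if (value === null || typeof value !== 'object') return undefined;
    value = value[part];
  }
  return value;
}

const record = JSON.parse(
  '{"req":{"method":"PUT"},' +
  '"res":{"statusCode":204,"headers":{"x-response-time":34}}}'
);
console.log(getField(record, 'res.statusCode'));
console.log(getField(record, 'res.headers.x-response-time'));
```

Because each path segment is used as a plain property key, hyphenated names such as "x-response-time" work in this sketch, while a property whose name itself contains a dot would not.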

Here's an example that takes a bunch of such records and produces histograms of "res.headers['x-response-time']" for each value of "req.method":

$ daggr -j -k req.method -v res.headers.x-response-time quantize < requests.json
POST 
           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |@@@@@@@@@@                               1
              16 |@@@@@@@@@@                               1
              32 |@@@@@@@@@@@@@@@@@@@@                     2
              64 |                                         0

HEAD 
           value  ------------- Distribution ------------- count
               8 |                                         0
              16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
              32 |                                         0

GET 
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |@@@@                                     2
               8 |@@@@@@@@@@@@@@@@@                        8
              16 |@@@@@@@@@@@@@@@@@@@                      9
              32 |                                         0

PUT 
           value  ------------- Distribution ------------- count
              16 |                                         0
              32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
              64 |                                         0

Example one-liners

Text-based examples

Sum a list of numbers in column 1:

$ daggr sum

Print only lines with column 1 greater than zero:

$ daggr -f '$1 > 0'

or:

$ daggr -f '$1 > 0' print

Print column 3 from lines with column 1 greater than zero:

$ daggr -f '$1 > 0' -o 3

Sum the positive numbers in column 1:

$ daggr -f '$1 > 0' sum

Print column 2 for each line where column 1 is not "HOST":

$ daggr -f '$1 != "HOST"' -o 2

Print the mean of the numbers in column 3 from lines where column 1 is not "HOST":

$ daggr -f '$1 != "HOST"' -v 3 avg

Sum the numbers in column 3 from lines where column 1 is not "HOST", and print the results grouped by the value of column 2:

$ daggr -f '$1 != "HOST"' -k 2 -v 3 sum

Generate a power-of-two histogram for values of column 2 where column 1 is greater than zero:

$ daggr -f '$1 > 0' -v 2 quantize 

JSON examples

Most of the above can be translated for JSON data as well.

Print objects where "ms" property is greater than 10:

$ daggr -j -f 'ms > 10'

Print the "url" property of objects where "ms" is greater than 10:

$ daggr -j -f 'ms > 10' -o url

Print the sum of the "rqs" property for objects where "ms" is greater than 10:

$ daggr -j -f 'ms > 10' -v rqs sum

Generate a power-of-two histogram for values of "ms" where "ms" is greater than 10, and group the histograms by "req.url":

$ daggr -j -f 'ms > 10' -k req.url -v ms quantize

Details

Synopsis:

daggr [-j] [-k FIELD...] [-f FILTER...] [-o FIELD...] [-v FIELD] [ACTION]

FIELD is a JavaScript-style property name -- not an arbitrary JavaScript expression.

FILTER is a JavaScript expression invoked in the context of each record to decide whether to keep that record or discard it. In plaintext mode (the default), $0 denotes the complete line, and $1, $2, $3, and so on denote the first, second, third whitespace-separated fields. In JSON mode, "this" denotes the whole record, and global variables are provided for each of the top-level properties of the record.
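
One way to picture the JSON-mode rules is the following JavaScript sketch, which binds "this" to the record and passes each top-level property in as a named variable. This is only an illustration under assumptions, not how daggr is necessarily implemented; `matchesJson` and the sample record are hypothetical.

```javascript
// Hypothetical sketch (not daggr's code): evaluate a filter expression against
// a parsed JSON record, exposing top-level properties as variables.
function matchesJson(expr, rec) {
  // Only keys that look like identifiers can become parameter names.
  // (Keys that are reserved words would need extra handling.)
  const keys = Object.keys(rec).filter((k) => /^[A-Za-z_$][\w$]*$/.test(k));
  const fn = new Function(...keys, 'return (' + expr + ');');
  // call() binds "this" to the whole record.
  return Boolean(fn.call(rec, ...keys.map((k) => rec[k])));
}

// Made-up record for illustration.
const entry = { ms: 12, url: '/a', req: { method: 'PUT' } };
console.log(matchesJson('ms > 10', entry));
console.log(matchesJson('req.method == "PUT"', entry));
console.log(matchesJson('this.ms < 5', entry));
```

Nested properties such as `req.method` need no special support here, because `req` is itself a top-level variable and ordinary JavaScript property access does the rest.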

Blank rows are ignored. With "-j", rows that don't begin with "{" are ignored.

ACTION is one of:

* avg: given numeric inputs, average the values
* count: given arbitrary inputs, count the number of inputs
* max: given numeric inputs, compute the maximum value
* min: given numeric inputs, compute the minimum value
* sum: given numeric inputs, sum the values
* quantize: given numeric inputs, produce a power-of-two histogram
  describing the distribution of values.
* print: print each matching record (or the fields selected with "-o"); as
  the one-liners above show, this is equivalent to giving no ACTION.

If you don't specify a field with "-v", the first field is used.
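
As a rough illustration of the numeric actions and the default value field, the JavaScript sketch below applies each action to one column of whitespace-separated text. It is not daggr's code: the `actions` table, the `aggregate` function, and the sample rows are all hypothetical.

```javascript
// Hypothetical sketch (not daggr's code): the numeric actions as simple
// functions over an array of numbers.
const actions = {
  count: (xs) => xs.length,
  sum: (xs) => xs.reduce((a, b) => a + b, 0),
  avg: (xs) => xs.reduce((a, b) => a + b, 0) / xs.length,
  min: (xs) => Math.min(...xs),
  max: (xs) => Math.max(...xs),
};

// Pick the value field (1-based, defaulting to field 1) and apply an action.
function aggregate(lines, action, valueField = 1) {
  const values = lines
    .filter((line) => line.trim() !== '') // skip blank rows
    .map((line) => Number(line.trim().split(/\s+/)[valueField - 1]));
  return actions[action](values);
}

// Made-up rows in the same "method path status latency" layout.
const rows = ['PUT /a 201 34', 'PUT /b 202 15', 'PUT /c 201 35'];
console.log(aggregate(rows, 'sum', 4));
console.log(aggregate(rows, 'max', 4));
console.log(aggregate(['3', '4', '5'], 'sum')); // no field given: column 1
```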

The "-k" and "-o" options specify one or more fields to group by and output, respectively. Without "-j", the value is a comma-separated list of field numbers, where the first field is 1 (e.g., "-k 1,3"). With "-j", the value is a comma-separated list of JSON fields. Either option may also be repeated (e.g., "-k1 -k3"). The "-v" option specifies which field contains the value to process; it takes the same kind of value but supports only a single field.
