dpm2

Like npm but for data packages!

npm install dpm2

Usage:

CLI

$ dpm2 --help
Usage: dpm2 <command> [options] where command is:
  - init [globs (*.csv, ...)] [urls] [-d, --defaults]
  - cat       <datapackage name>[@<version>]
  - get       <datapackage name>[@<version>] [-f, --force] [-c, --cache]
  - clone     <datapackage name>[@<version>] [-f, --force]
  - install   <datapackage name 1>[@<version>] <datapackage name 2>[@<version>] ... [-c, --cache] [-s, --save] [-f, --force]
  - publish
  - unpublish <datapackage name>[@<version>]
  - adduser
  - owner <subcommand> where subcommand is:
    - ls  <datapackage name>
    - add <user> <datapackage name>
    - rm  <user> <datapackage name>[@<version>]
  - search [search terms]
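
Most commands above take an argument of the form <datapackage name>[@<version>]. As an illustration only, the sketch below splits such a string into its two parts. The helper name parseSpec is hypothetical, and this is not necessarily how dpm2 itself parses its arguments.

```javascript
// Hypothetical helper (not part of dpm2): split "<name>[@<version>]"
// into its name and optional version.
function parseSpec(spec) {
  const at = spec.lastIndexOf('@');
  // No "@", or a leading "@": treat the whole string as the name.
  if (at <= 0) {
    return { name: spec, version: undefined };
  }
  return { name: spec.slice(0, at), version: spec.slice(at + 1) };
}

console.log(parseSpec('mydpkg@0.0.0'));
console.log(parseSpec('mydpkg'));
```

When the version is missing, the sketch leaves it undefined rather than guessing which version the registry would serve.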

Publishing and getting data packages

Given a data package:

$ cat package.json

{
  "name": "mydpkg",
  "description": "my datapackage",
  "version": "0.0.0",
  "keywords": ["test", "datapackage"],

  "resources": [
    {
      "name": "inline",
      "schema": { "fields": [ {"name": "a", "type": "string"}, {"name": "b", "type": "integer"}, {"name": "c", "type": "number"} ] },
      "data": [ {"a": "a", "b": 1, "c": 1.2}, {"a": "x", "b": 2, "c": 2.3}, {"a": "y", "b": 3, "c": 3.4} ]
    },
    {
      "name": "csv1",
      "format": "csv",
      "schema": { "fields": [ {"name": "a", "type": "integer"}, {"name": "b", "type": "integer"} ] },
      "path": "x1.csv"
    },
    {
      "name": "csv2",
      "format": "csv",
      "schema": { "fields": [ {"name": "c", "type": "integer"}, {"name": "d", "type": "integer"} ] },
      "path": "x2.csv"
    }
  ]
}

stored on the disk as

$ tree
.
├── package.json
├── scripts
│   └── test.r
├── x1.csv
└── x2.csv

we can:

$ dpm2 publish
dpm2 http PUT https://registry.standardanalytics.io/mydpkg/0.0.0
dpm2 http 201 https://registry.standardanalytics.io/mydpkg/0.0.0
+ mydpkg@0.0.0

and reclone it:

$ dpm2 clone mydpkg
dpm2 http GET https://registry.standardanalytics.io/mydpkg?clone=true
dpm2 http 200 https://registry.standardanalytics.io/mydpkg?clone=true
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/debug
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/debug
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/csv2
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/csv2
.
└─┬ mydpkg
  ├── package.json
  ├─┬ scripts
  │ └── test.r
  ├── x1.csv
  └── x2.csv

To save space, or because you only need a single resource, you can instead get just a package.json in which all the resource data have been replaced by URLs.

$ dpm2 get mydpkg
dpm2 http GET https://registry.standardanalytics.io/mydpkg
dpm2 http 200 https://registry.standardanalytics.io/mydpkg
.
└─┬ mydpkg
  └── package.json

For instance (using jsontool):

$ cat mydpkg/package.json | json resources | json -c 'this.name === "csv1"' | json 0.url

returns:

https://registry.standardanalytics.io/mydpkg/0.0.0/csv1
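
The same lookup can be done from Node.js without jsontool. The following is a minimal sketch: the helper name resourceUrl is hypothetical, and the pkg object is a trimmed-down stand-in for the package.json written by dpm2 get, keeping only the fields the lookup needs.

```javascript
// Hypothetical helper: return the `url` of the resource called `name`,
// or undefined when the package has no such resource.
function resourceUrl(pkg, name) {
  const resources = Array.isArray(pkg.resources) ? pkg.resources : [];
  const match = resources.find(function (r) { return r.name === name; });
  return match ? match.url : undefined;
}

// Stand-in for the content of mydpkg/package.json after `dpm2 get mydpkg`.
const pkg = {
  name: 'mydpkg',
  version: '0.0.0',
  resources: [
    { name: 'csv1', format: 'csv', url: 'https://registry.standardanalytics.io/mydpkg/0.0.0/csv1' },
    { name: 'csv2', format: 'csv', url: 'https://registry.standardanalytics.io/mydpkg/0.0.0/csv2' }
  ]
};

console.log(resourceUrl(pkg, 'csv1'));
```

In a real project you would load the object with require('./mydpkg/package.json') instead of writing it inline.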

Note that for resources using the require property (as opposed to data, path or url), the metadata of the resource (schema, format, ...) has already been retrieved.

Then you can consume the resources you want with the module data-streams.

Conversely, you can cache all the resource data (including data from external URLs) in a standard directory structure; this works for every data package stored on the registry.

$ dpm2 get mydpkg --cache
dpm2 http GET https://registry.standardanalytics.io/mydpkg
dpm2 http 200 https://registry.standardanalytics.io/mydpkg
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/inline
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/csv2
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/inline
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/csv2
.
└─┬ mydpkg
  ├── package.json
  └─┬ data
    ├── inline.json
    ├── csv1.csv
    └── csv2.csv

Each resource in package.json now has a path property. For instance:

$ cat mydpkg/package.json | json resources | json -c 'this.name === "csv1"' | json 0.path

returns:

data/csv1.csv
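
To read a cached file from Node.js, the path property has to be joined with the directory of the data package. The sketch below assumes that path is relative to the directory containing package.json, which is what the tree above suggests. The helper name cachedResourcePath is hypothetical and the pkg object is a trimmed-down stand-in for the cached package.json.

```javascript
const path = require('path');

// Hypothetical helper: absolute location of a cached resource, assuming
// `path` is relative to the directory that holds package.json.
function cachedResourcePath(pkgDir, pkg, name) {
  const match = (pkg.resources || []).find(function (r) { return r.name === name; });
  if (!match || typeof match.path !== 'string') {
    return undefined; // unknown resource, or resource that was not cached
  }
  return path.resolve(pkgDir, match.path);
}

// Stand-in for mydpkg/package.json after `dpm2 get mydpkg --cache`.
const pkg = {
  name: 'mydpkg',
  resources: [
    { name: 'inline', path: 'data/inline.json' },
    { name: 'csv1', path: 'data/csv1.csv' },
    { name: 'csv2', path: 'data/csv2.csv' }
  ]
};

console.log(cachedResourcePath('mydpkg', pkg, 'csv1'));
```

The returned path can then be passed to fs.createReadStream or any CSV parser of your choice.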

Installing data packages as dependencies of your project

Given a package.json with

{
  "name": "test",
  "version": "0.0.0",
  "dataDependencies": {
    "mydpkg": "0.0.0"
  }
}

one can run

$ dpm2 install
dpm2 http GET https://registry.standardanalytics.io/versions/mydpkg
dpm2 http 200 https://registry.standardanalytics.io/versions/mydpkg
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0
.
├── data_modules
└─┬ mydpkg
  └── package.json

Combined with the --cache option, you get:

$ dpm2 install --cache
dpm2 http GET https://registry.standardanalytics.io/versions/mydpkg
dpm2 http 200 https://registry.standardanalytics.io/versions/mydpkg
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/inline
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/csv2
dpm2 http GET https://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/inline
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm2 http 200 https://registry.standardanalytics.io/mydpkg/0.0.0/csv2
.
├── data_modules
└─┬ mydpkg
  ├── package.json
  └─┬ data
    ├── inline.json
    ├── csv1.csv
    └── csv2.csv
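
If you need the list of data dependencies in your own tooling, the dataDependencies map can be turned into name@version strings, the form used by the CLI arguments. This is a minimal sketch: the helper name dependencySpecs is hypothetical, and it assumes every value is an exact version as in the example above.

```javascript
// Hypothetical helper: turn the dataDependencies map of a package.json
// object into sorted "name@version" strings.
function dependencySpecs(pkg) {
  const deps = pkg.dataDependencies || {};
  return Object.keys(deps).sort().map(function (name) {
    return name + '@' + deps[name];
  });
}

// Same content as the package.json shown above.
const project = {
  name: 'test',
  version: '0.0.0',
  dataDependencies: {
    mydpkg: '0.0.0'
  }
};

console.log(dependencySpecs(project)); // [ 'mydpkg@0.0.0' ]
```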

dpm2 aims to bring all the goodness of the npm workflow to your data needs. Run dpm2 --help to see the available options.

Using dpm2 programmatically

You can also use dpm2 programmatically.

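var Dpm = require('dpm2');
var dpm = new Dpm(conf); // conf is your dpm2 configuration object

// Register the listener before installing so that no log event is missed.
dpm.on('log', console.log); // if you like stuff on stdout

dpm.install(['mydpkg@0.0.0', 'mydata@1.0.0'], {cache: true}, function(err, dpkgs){
  if (err) throw err;
  // done!
});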

See bin/dpm2 for examples
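
If you prefer promises to callbacks, the install call can be wrapped. The sketch below assumes only what the snippet above shows, namely that install takes an array of specs, an options object and a Node-style callback. The name installAsync is hypothetical, and fakeDpm is a stand-in used so the example runs without the registry; it is not the real dpm2 implementation and says nothing about the actual shape of dpkgs.

```javascript
// Hypothetical wrapper: expose a callback-style
// install(specs, options, callback) method as a Promise.
function installAsync(dpm, specs, options) {
  return new Promise(function (resolve, reject) {
    dpm.install(specs, options, function (err, dpkgs) {
      if (err) {
        reject(err);
      } else {
        resolve(dpkgs);
      }
    });
  });
}

// Stand-in object so the sketch is runnable offline (NOT the real dpm2).
const fakeDpm = {
  install: function (specs, options, callback) {
    const dpkgs = specs.map(function (spec) {
      const parts = spec.split('@');
      return { name: parts[0], version: parts[1] };
    });
    setImmediate(callback, null, dpkgs); // call back asynchronously
  }
};

installAsync(fakeDpm, ['mydpkg@0.0.0'], { cache: true }).then(function (dpkgs) {
  console.log(dpkgs.length + ' data package(s) installed');
});
```

With the real module you would pass the dpm instance created above instead of fakeDpm.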

Using dpm2 with npm

dpm2 uses the dataDependencies property of package.json and stores the dependencies in a data_modules/ directory, so it can safely be used as an npm post-install script without conflicting with npm's own dependencies.
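
One way to wire this up is to set the npm postinstall script of your project to dpm2 install. The sketch below builds such a package.json object. The helper name withDataInstall is hypothetical, and it assumes the dpm2 executable is on the PATH when npm runs scripts (for example because dpm2 is listed in your dependencies).

```javascript
// Hypothetical helper: return a copy of a package.json object whose npm
// `postinstall` script runs `dpm2 install`. Any existing postinstall
// script is overwritten; other scripts are kept.
function withDataInstall(pkg) {
  const scripts = Object.assign({}, pkg.scripts, { postinstall: 'dpm2 install' });
  return Object.assign({}, pkg, { scripts: scripts });
}

const project = {
  name: 'test',
  version: '0.0.0',
  dataDependencies: {
    mydpkg: '0.0.0'
  }
};

// Print the resulting package.json content.
console.log(JSON.stringify(withDataInstall(project), null, 2));
```

After writing the result to package.json, running npm install would also fetch the data dependencies into data_modules/.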

Registry

By default, dpm2 uses our CouchDB-powered data registry hosted on Cloudant.

Why is it called dpm2 and not simply dpm?

There is already a project called dpm under development, but it leverages npm and the npm registry.

License

MIT
