r-open-data-500

A data dictionary of the Open Data 500 data

The Open Data 500 is "the first comprehensive study of U.S. companies using open government data to develop new products and services." You may have read about it in Forbes, Information Week, or FedScoop. This is data about open data, so I naturally wanted it. In this article, I explain a bit about how the study is being conducted and how the data are presented for download.

Data collection methods

The Open Data 500 looks a bit scary and complicated when you read its website and the articles about it, but the Open Data 500 is pretty much just a straightforward questionnaire survey.

The population of interest

The Open Data 500 Team is interested in companies that meet their eligibility criteria, which they describe as follows.

  1. Be U.S.-based – which can include international companies with a major presence in the U.S.
  2. Earn revenue from its products and services. In addition to for-profit companies, nonprofits may qualify if they support themselves primarily through sales of products and services rather than philanthropy.
  3. Use open government data as a critical resource for its business. (While most Open Data 500 companies will work with federal data, the study will also include some that use city or state data if their work can scale regionally or nationally)

The Open Data 500 site does not explain how they operationalize these criteria, but I presume that it works as follows.

  1. The questionnaire must indicate that a company has a major presence in the U.S.
  2. The answer to "Which of the following are significant sources of revenue for your company?" must include something other than "Philanthropy".
  3. The answer to "Which of the following are critical sources of data for your company?" must include "Federal Open Data", or the questionnaire must indicate elsewhere that the company is using state or city data from a range of states or cities.
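
Here is a minimal sketch of how those three presumed rules could be applied to questionnaire responses. The rules themselves are only my guess at the Team's process, the responses are made up, and the `usPresence` and `multiStateData` columns are hypothetical names that do not appear in the released files. It also assumes that `criticalDataTypes` is a comma-and-space delimited list, like `revenueSource`.

```r
# Made-up questionnaire responses. usPresence and multiStateData are
# hypothetical columns invented for this illustration.
responses <- data.frame(
  CompanyName = c("A", "B", "C", "D"),
  usPresence = c(TRUE, TRUE, FALSE, TRUE),
  revenueSource = c("Subscriptions, Advertising", "Philanthropy",
                    "Consulting", "Philanthropy, Software licensing"),
  criticalDataTypes = c("Federal Open Data", "Federal Open Data",
                        "Federal Open Data", "State Open Data"),
  multiStateData = c(FALSE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Split a comma-and-space delimited cell into its items.
split.items <- function(cell) strsplit(cell, ", ", fixed = TRUE)[[1]]

# Apply the three presumed criteria to a one-row data frame.
is.eligible <- function(row) {
  revenue <- split.items(row$revenueSource)
  data.types <- split.items(row$criticalDataTypes)
  row$usPresence &&
    length(setdiff(revenue, "Philanthropy")) > 0 &&
    ("Federal Open Data" %in% data.types || row$multiStateData)
}

eligible <- sapply(seq_len(nrow(responses)),
                   function(i) is.eligible(responses[i, ]))
responses$CompanyName[eligible]
```

In this toy data, company B fails because philanthropy is its only revenue source, and company C fails because it has no major U.S. presence.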

They are "vetting" companies, but this sounds like just a conversation with the submitters to check that they understood the questionnaire properly.

Sampling strategy

The Open Data 500 uses a sampling strategy that they term "comprehensive call". That sounds really complicated, but it's actually quite simple.

The questionnaire results can be considered a convenience sample of companies that meet the Open Data 500 eligibility criteria.

The Open Data 500 Team distributed links to the questionnaire in these places.

  • Recommendations from government and non-governmental organizations studying this field.
  • List of companies using datasets from Data.gov, the federal hub for government data
  • Online Open Data Userbase created by Socrata
  • Directory of open data companies developed by Deloitte
  • Mass email to 3100 people in the GovLab network
  • Mass email to 2200 people on contact list for OpenDataNow.com, website of Open Data 500 project director Joel Gurin
  • Companies identified in research for upcoming book, Open Data Now
  • Response to coverage of the Open Data 500 in Information Week and FedScoop
  • Outreach through Twitter
  • Outreach at "Data Transparency 2013" conference (September 2013, Washington, DC)
  • Blog posts on TheGovLab.org and OpenDataNow.com

And that's it!

I think their use of comprehensive call is why the Open Data 500 Team considers the Open Data 500 to be a comprehensive study.

Case study

The Open Data 500 has released data in the "download", "preview", and "full list" sections. These data are the straightforward questionnaire results (with one quirk) for the companies that met their eligibility criteria. Thus, the Open Data 500 is a case study of companies that meet the Open Data 500 eligibility criteria.

Questionnaire and non-questionnaire responses

The Open Data 500 Team solicited responses to this questionnaire through [comprehensive call](#sampling-strategy), but they also looked for companies that did not respond to the questionnaire; the Team effectively filled out parts of this questionnaire for companies that did not respond.

Just a questionnaire

This is the main thing that confused me about the study, so I'm going to explain this part in a bit more depth in case anyone else was also confused.

The data collection for the Open Data 500 is just the questionnaire. Really. For example, the Team isn't looking at companies' products to see which open data are used, and it isn't looking at annual reports to determine whether the company does a lot of business in the United States. This is confusing because the data releases do not look like the questionnaire, but I'm pretty sure that the data come rather directly from the questionnaire.

Pre-launch

According to the "About" page on the Open Data 500 site, the Open Data 500 "will identify, describe, and analyze companies that use open government data in their businesses."

As I understand it, the analysis component of this has yet to occur. At the moment, the Open Data 500 is in "pre-launch". This means that they have begun to collect data but that they haven't done any sort of analysis on it. (This part isn't explained on the website, but a member of the Open Data 500 Team explained this to me in person.)

Preview of 50 companies

The Open Data 500 Team released an "in-depth view" of "50 of the first to complete [the] survey". They also released a "full list" of "500 candidate companies". It took me a while to understand the difference between these companies.

I'm pretty sure that these 50 are just some of the early submitters; they have not been vetted or ranked in any way. But if that's all it is, why don't they put all of the companies in there?

I think that preview companies are simply all of the companies that have submitted the questionnaire; the non-preview companies are companies for which the Open Data 500 team effectively filled out the questionnaire.

xpathApply(candidates.html[[3]], 'contains(@class, "preview-company")')

500-ness

I'm still unsure as to what the "500" in the title means.

Fortune 500?

Many people have suggested that the name is an allusion to the Fortune 500, but I don't think that's it. The Fortune 500 is a list of "the top 500 U.S. closely held and public corporations as ranked by their gross revenue after adjustments". That is, it's the 500 biggest U.S. companies for a particular definition of "big".

The Fortune 500 and the Open Data 500 are both about U.S. companies, but the similarities stop there; as explained on the "About" page, the Open Data 500 is explicitly not a ranking and not about company size.

Number of responses?

The website says that the Open Data 500 is a list of 500 companies, so it might be that the "500" refers to the number of companies responding to the questionnaire, but this was a bit odd to me because they had chosen the name before they first sent out the questionnaire.

One member of the Team told me that this was just a big number as a challenge to themselves. Another told me that they expected, based on Joel Gurin's network, that there were about 500 companies that would respond.

Data files

The Open Data 500 Team has released six main data files.

There are also individual pages about each company, but I'm pretty sure that the only extra information on those is comments submitted through the comment forms.

Preview50_Companies.csv

Preview50_Companies.csv is a denormalized CSV file with `r ncol(preview.csv)` columns and `r nrow(preview.csv)` rows. Each row corresponds to a dataset within a company, and each column corresponds to a question from the [questionnaire](http://www.opendata500.com/submitCompany/). `r length(unique(preview.csv$CompanyName))` different companies are represented in this dataset.

The column names in this file correspond quite closely to the name attributes in the HTML form source code for the questionnaire.

These columns describe the companies, and they are identical across different rows about the same company.

| Code in the file | Question from the questionnaire |
| --- | --- |
| CompanyName | Name of your company |
| URL | Company URL |
| city | In which city is this company located? |
| STATE | State [1] |
| abbrev | State [1] |
| zipCode | Zip Code |
| ceoFirstName | First Name of CEO |
| ceoLastName | Last Name of CEO |
| companyPreviousName | ??? |
| yearFounded | Founding Year |
| FTE | Number of FTE's [2] |
| companyType | Type of Company (`r pretty.levels(preview.csv$companyType)`) [3] |
| companyCategory | What category best describes your company? (`r pretty.levels(preview.csv$companyCategory)`) [1] |
| companyFunction | Which best describes the function of your company? (`r pretty.levels(preview.csv$companyFunction)`) [3] |
| sectors | What category best describes your company? (`r pretty.levels(preview.csv$sectors)`) [1,3] |
| revenueSource | Which of the following are significant sources of revenue for your company? [4] |
| descriptionLong | Please give us a short public statement describing your company’s mission and work. You can take this material from your website or other publications if you choose to. |
| descriptionShort | As a summary, please provide a one sentence description of your company. |
| socialImpact | Besides revenue generation, how do you measure the impact your company has for society and the public good? |
| financialInfo | Please include any financial or operational information that will help us understand your company. We are interested in specific information like past and projected annual revenues, total outside investment dollars to date, and significant investors or partners. |
| criticalDataTypes | Which of the following are critical sources of data for your company? By “critical,” we mean that your company would have to shut down a line of business, shut down completely, or replace the data in some way if the data were no longer available. |

The file does not include the following questions from the first page of the questionnaire.

| Code from the web form | Question from the questionnaire |
| --- | --- |
| firstName | First Name [5] |
| lastName | Last Name |
| title | Title |
| email | Email |
| phone | Phone |
| contacted | Please check here if you would be willing to be contacted for further information about your company. |
| datasetWishList | What datasets (if any) are not currently available that would be useful for your company to have as government open data? |
| companyRec | What other companies, either in your sector or other sectors, would you recommend we contact regarding their use of government open data? |
| conferenceRec | What conferences or events do you think would be helpful to us in surveying the field of open data companies? |

The following columns come from the "New Dataset" page of the questionnaire.

| Code in the file | Question from the questionnaire |
| --- | --- |
| datasetName | Name of Dataset |
| datasetURL | URL of Dataset |
| agencyOrDatasetSource | Agency or Source |

The file does not include the following columns from the "New Dataset" page.

| Code from the web form | Question from the questionnaire |
| --- | --- |
| typeOfDataset | Type of Dataset (Federal Open Data, State Open Data, City/Local Open Data, Other) |
| rating | On a scale of 1 to 4, how would you rate the usefulness of this dataset? (1- poor, 4- excellent) Your answer can reflect your experience with data quality, format of the data, or other factors. |
| reason | Why did you give it this rating? |

Finally, there is also a DATASETS column, which is the number of datasets submitted for the particular company.
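
Because the file is denormalized, one way to get back to one row per company is to keep only the company-level columns and drop the duplicates. The sketch below does that on a made-up stand-in for the file (the rows are invented; only the column names come from the real file) and compares DATASETS with the number of rows per company. How companies with zero datasets are represented is not covered here.

```r
# Made-up stand-in for Preview50_Companies.csv: one row per dataset,
# with the company-level columns repeated on every row.
preview.sketch <- read.csv(text = c(
  "CompanyName,city,DATASETS,datasetName",
  "Acme,Boston,2,Census blocks",
  "Acme,Boston,2,Weather stations",
  "Widget Co,Austin,1,Patent grants"
), stringsAsFactors = FALSE)

# Collapse to one row per company by dropping duplicate rows.
companies <- unique(preview.sketch[c("CompanyName", "city", "DATASETS")])

# Count the rows for each company and compare with the DATASETS column.
rows.per.company <- table(preview.sketch$CompanyName)
companies$rows <- as.integer(rows.per.company[companies$CompanyName])
companies$consistent <- companies$DATASETS == companies$rows
companies
```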

You can think of this file as a CSV version of OD500_Companies.json.

Notes:

  1. In some cases, answers to one question are presented redundantly across multiple columns.
  2. "FTE" probably stands for "full-time equivalent employees".
  3. The questionnaire has different categories from the levels reported in this file.
  4. This cell contains a comma-and-space (, ) delimited list of items, and I haven't picked apart the lists to find all of the possible values in the list.
  5. This is from the "Personal Information" section, which presumably describes the person who is filling out the questionnaire.
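
For note 4, the lists can be picked apart with `strsplit`. This is a minimal sketch on invented cell values; the real `revenueSource` values may differ, and it assumes no item contains a comma-and-space itself.

```r
# Invented cells shaped like the revenueSource column.
revenue.source <- c("Subscriptions, Advertising",
                    "Consulting",
                    "Advertising, Consulting")

# Split each cell on the comma-and-space delimiter.
items <- strsplit(revenue.source, ", ", fixed = TRUE)

# All distinct values, and how many cells mention each one.
all.values <- sort(unique(unlist(items)))
value.counts <- table(unlist(items))
all.values
value.counts
```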

500_Companies.csv

500_Companies.csv is a CSV file with `r ncol(candidates.csv)` columns and `r nrow(candidates.csv)` rows. Each row corresponds to a unique company, and each column corresponds to a question from the questionnaire.

| Code in the file | Question from the questionnaire |
| --- | --- |
| CompanyName | Name of your company |
| URL | Company URL |
| city | In which city is this company located? |
| STATE | State |
| abbrev | State |
| zipCode | Zip Code |
| companyCategory | What category best describes your company? (`r pretty.levels(preview.csv$companyCategory)`) |
| descriptionShort | As a summary, please provide a one sentence description of your company. |

This file provides no data about datasets used by the companies.

OD500_Companies.json

OD500_Companies.json is a JSON file with an array of associative arrays (that is, a list of mappings). It has `r length(preview.json)` rows (associative arrays) and `r unique(sapply(preview.json, length))` columns (items per associative array). Each row corresponds to a unique company, and each column corresponds to a questionnaire question.

| Code in the file | Question from the questionnaire |
| --- | --- |
| companyName | Name of your company |
| url | Company URL |
| city | In which city is this company located? |
| state | State [1] |
| zipCode | Zip Code |
| ceoFirstName | First Name of CEO |
| ceoLastName | Last Name of CEO |
| previousName | ??? |
| yearFounded | Founding Year |
| fte | Number of FTE's [2] |
| companyType | Type of Company (`r pretty.levels(preview.json, 'companyType')`) [3] |
| companyCategory | What category best describes your company? (`r pretty.levels(preview.json, 'companyCategory')`) [1] |
| companyFunction | Which best describes the function of your company? (`r pretty.levels(preview.json, 'companyFunction')`) [3] |
| sector | What category best describes your company? (`r pretty.levels(preview.json, 'sectors')`) [1,3] |
| revenueSource | Which of the following are significant sources of revenue for your company? |
| descriptionLong | Please give us a short public statement describing your company’s mission and work. You can take this material from your website or other publications if you choose to. |
| descriptionShort | As a summary, please provide a one sentence description of your company. |
| socialImpact | Besides revenue generation, how do you measure the impact your company has for society and the public good? |
| soccialInfo | Please include any financial or operational information that will help us understand your company. We are interested in specific information like past and projected annual revenues, total outside investment dollars to date, and significant investors or partners. |
| criticalDataTypes | Which of the following are critical sources of data for your company? By “critical,” we mean that your company would have to shut down a line of business, shut down completely, or replace the data in some way if the data were no longer available. |

The file does not include the following questions from the first page of the questionnaire.

| Code from the web form | Question from the questionnaire |
| --- | --- |
| firstName | First Name [4] |
| lastName | Last Name |
| title | Title |
| email | Email |
| phone | Phone |
| contacted | Please check here if you would be willing to be contacted for further information about your company. |
| datasetWishList | What datasets (if any) are not currently available that would be useful for your company to have as government open data? |
| companyRec | What other companies, either in your sector or other sectors, would you recommend we contact regarding their use of government open data? |
| conferenceRec | What conferences or events do you think would be helpful to us in surveying the field of open data companies? |

In addition to the 20 columns I describe above, there are two columns for identifiers. One is the companyId column, which is the unique identifier for the particular company. Within the questionnaire, this shows up inside the URL for the "New Dataset" page.

http://www.opendata500.com/addData/$companyId/

The other is the datasets column, which lists identification codes for datasets (like `r preview.json[[1]]$datasets[1]`) and references the datasetID column in OD500_Datasets.json.

You can think of this file as a JSON version of Preview50_Companies.csv.

Notes:

  1. In some cases, answers to one question are presented redundantly across multiple columns.
  2. "FTE" probably stands for "full-time equivalent employees".
  3. The questionnaire has different categories from the levels reported in this file.
  4. This is from the "Personal Information" section, which presumably describes the person who is filling out the questionnaire.

OD500_Datasets.json

OD500_Datasets.json is a JSON file with an array of associative arrays (that is, a list of mappings). It has `r length(preview.json)` rows (associative arrays) and `r unique(sapply(preview.json, length))` columns (items per associative array). Each row corresponds to a dataset. Three of the columns correspond directly to questionnaire questions.

The following columns come from the "New Dataset" page of the questionnaire.

| Code in the file | Question from the questionnaire |
| --- | --- |
| datasetName | Name of Dataset |
| datasetURL | URL of Dataset |
| source | Agency or Source |

The file does not include the following columns from the "New Dataset" page.

| Code from the web form | Question from the questionnaire |
| --- | --- |
| typeOfDataset | Type of Dataset (Federal Open Data, State Open Data, City/Local Open Data, Other) |
| rating | On a scale of 1 to 4, how would you rate the usefulness of this dataset? (1- poor, 4- excellent) Your answer can reflect your experience with data quality, format of the data, or other factors. |
| reason | Why did you give it this rating? |

The file also contains two identifier columns. One is the datasetID column, which serves as a primary key for this table. The other is the usedByCompany column, which references the companyId in OD500_Companies.json.
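
The two JSON files therefore behave like two tables linked by keys. The following sketch joins made-up records shaped like the parsed files; the field names come from the files described above, but the identifier values and their types are invented, and `to.data.frame` is a helper written just for this illustration.

```r
# Made-up records shaped like the parsed JSON (lists of named lists).
companies.json <- list(
  list(companyId = "c1", companyName = "Acme", datasets = c("d1", "d2")),
  list(companyId = "c2", companyName = "Widget Co", datasets = c("d3"))
)
datasets.json <- list(
  list(datasetID = "d1", datasetName = "Census blocks", usedByCompany = "c1"),
  list(datasetID = "d2", datasetName = "Weather stations", usedByCompany = "c1"),
  list(datasetID = "d3", datasetName = "Patent grants", usedByCompany = "c2")
)

# Turn the chosen scalar fields of each record into one data frame row.
to.data.frame <- function(records, fields) {
  do.call(rbind, lapply(records, function(r) {
    as.data.frame(r[fields], stringsAsFactors = FALSE)
  }))
}

companies <- to.data.frame(companies.json, c("companyId", "companyName"))
datasets <- to.data.frame(datasets.json,
                          c("datasetID", "datasetName", "usedByCompany"))

# Join each dataset to its company through the foreign key.
joined <- merge(datasets, companies,
                by.x = "usedByCompany", by.y = "companyId")

# Check that the datasets lists and the datasetID column agree.
forward.ids <- unlist(lapply(companies.json, function(r) r$datasets))
ids.agree <- setequal(forward.ids, datasets$datasetID)
joined
```

The result has one row per dataset with the company name attached, which is roughly the shape of Preview50_Companies.csv.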

preview (HTML)

preview is an HTML page containing a non-standard representation of a data table about companies.

The companies are represented as nodes with the following XPath.

preview.xpath

The file contains `r length(preview.html)` companies and about 11 fields (depending on your definition of a field). The fields are approximately a subset of the fields for Preview50_Companies.csv.

I don't feel like writing out selectors for every field within each company node, but you can figure it out by looking at the code for the first company.

preview.html[[1]]

I do want to point out the dataset nodes in particular. Each company node lists zero or more datasets, each with a URL and a title. Here is how you query them.

library(XML)
library(knitr)
# Select the dataset links once, then pull the URL and the title from each.
links.xpath <- 'div[@class="m-list-company-full"]/div[@class="m-full datasets"]/ul/li/a'
df <- data.frame(
  urls = xpathSApply(preview.html[[1]], links.xpath, xmlGetAttr, 'href'),
  titles = xpathSApply(preview.html[[1]], links.xpath, xmlValue),
  stringsAsFactors = FALSE
)
kable(df)

candidates (HTML)

candidates is another HTML page containing a non-standard representation of a data table about companies. You can select the companies with the following XPath.

candidates.xpath

This file contains `r length(candidates.html)` companies. To give you a feel for the schema, the first company is represented like this.

candidates.html[[1]]

I'd say that this file contains seven fields. Four of them are direct questionnaire questions.

| XPath within the company node | Questionnaire question or meaning |
| --- | --- |
| `a/h3/strong/text()` | Name of your company |
| `p[@class="m-homepage-list-location"]/text()` | In which city is this company located? |
| `em/text()` | Which best describes the function of your company? |
| `p[@class="m-homepage-list-desc"]/text()` | As a summary, please provide a one sentence description of your company. |

Three of them are not.

| XPath within the company node | Meaning |
| --- | --- |
| `r preview.company.xpath` | Is the company part of the "Preview" companies? |
| `r survey.company.xpath` | Did the company submit the questionnaire? |
| `a/@href` | Link to a page on the Open Data 500 site with more information from the questionnaire about the company |

Loading into R
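
As a minimal sketch, the CSV and JSON downloads could be loaded along these lines. The function names and file paths are placeholders, and the choice of the jsonlite package for JSON is an assumption; the article does not say which parser produced objects like `preview.json`.

```r
# Read one of the CSV downloads without mangling the column names.
load.companies.csv <- function(path) {
  read.csv(path, stringsAsFactors = FALSE, check.names = FALSE)
}

# Read one of the JSON downloads as a list of named lists.
# jsonlite is an assumed choice of parser, not the article's own.
load.companies.json <- function(path) {
  if (!requireNamespace("jsonlite", quietly = TRUE)) {
    stop("the jsonlite package is needed to parse JSON")
  }
  jsonlite::fromJSON(path, simplifyVector = FALSE)
}

# Demonstration on a tiny temporary file with made-up contents.
demo.path <- tempfile(fileext = ".csv")
writeLines(c("CompanyName,city", "Acme,Boston"), demo.path)
demo <- load.companies.csv(demo.path)
demo

# With the real downloads saved locally, usage might look like this:
# preview.csv <- load.companies.csv("Preview50_Companies.csv")
# candidates.csv <- load.companies.csv("500_Companies.csv")
# preview.json <- load.companies.json("OD500_Companies.json")
```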

Install

npm i r-open-data-500

Version

0.0.1

License

ISC

Collaborators

  • tlevine