v1.0

Step 2: Creating a Data Cut

Many datasets on the PHS Data Portal are too large to analyze as they are. You can make your analytical dataset a manageable size by reducing the number of rows, reducing the number of variables, or both. These techniques help you create an initial cohort for your analysis.


Prerequisite

You must first have access to a dataset (see the ☞ instructions) before you are able to create a data cut.


Creating a Project

The first step in creating a data cut is to create a Project (☞ learn more).

A Project is a data query toolkit on the PHS Data Portal. To start a project, go to a dataset page and click the "Add to project" button. This button is only enabled for public datasets and for restricted datasets for which you have completed all access requirements.

**Figure 1:** An animated demonstration on creating a new project

A dialog box will appear when you click the "Add to project" button:

  • If you want to create a new project, click the "New" button, fill out the new project form and finish by clicking the "Save" button.
  • If you want to add the dataset to an existing project, select the project and click the "Add to selected" button.

Either action brings you to the project page (☞ learn more), where you can start building your data cut.

Cutting the Data

On the project page, click the diamond-shaped icon under the rectangular dataset icon (look for the dataset name to identify it). Two new icons will appear, representing the Transform and Table nodes.

**Figure 2:** Data transformation nodes

Click the Transform node (e.g., Transform 1) to open the data transformation dialog panel (☞ learn more). This is the panel where you formulate the query that cuts your dataset. Click the "Run" button (at the top right) to execute the query and create your data cut.

**Figure 3:** Data transformation dialog

We will briefly walk you through three data cutting operations using the tool:

1) CUTTING THE DATA BY COLUMN

Use the top-most part of the data transformation dialog panel to choose which columns (or variables) to keep in the output dataset. Use the two arrow buttons to move the selected variable(s) between the Discard and Keep boxes. For example, I am keeping the age, year, cdc_dead and cdc_pop variables and discarding the others for my "National Population and Death" data cut (see the illustration below):

**Figure 4:** Data projection
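The same column selection can be sketched in code for readers who think in scripts. This is only an illustration of the idea, not how the portal runs it: the sample values and the extra `gender` and `state` columns are made up, and `keep_columns` is a hypothetical helper.

```python
# Hypothetical sample rows; all values are invented for illustration.
rows = [
    {"age": 59, "year": 2001, "gender": "M", "state": "CA",
     "cdc_dead": 120, "cdc_pop": 15000},
    {"age": 60, "year": 2001, "gender": "F", "state": "NY",
     "cdc_dead": 95, "cdc_pop": 14000},
]

# Variables moved to the "Keep" box; everything else is discarded.
keep = ["age", "year", "cdc_dead", "cdc_pop"]


def keep_columns(rows, keep):
    """Return new rows that contain only the columns listed in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


data_cut = keep_columns(rows, keep)
print(data_cut[0])
```

The number of rows is unchanged; only the width of each row shrinks.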

2) CUTTING THE DATA BY ROW

Use the "Build" section and formulate the filter criteria under the "Filter rows" header to reduce the number of rows. A filter criterion generally has three components: a variable name, a filter operator (e.g., equal, greater than, less than, like, etc.) and a filter value. For example, I want to focus only on the population numbers of 59-year-old males from 2001 to 2010 for my data cut (see the illustration below).

**Figure 5:** Filtering rows based on column criteria
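As a rough sketch, the example filter combines three criteria, each made of a variable, an operator and a value. The code below assumes that gender is stored as "M"/"F" and that the year range is inclusive; both are assumptions for illustration, as are the sample values.

```python
# Hypothetical sample rows; all values are invented for illustration.
rows = [
    {"age": 59, "gender": "M", "year": 2000, "cdc_pop": 14800},
    {"age": 59, "gender": "M", "year": 2001, "cdc_pop": 15000},
    {"age": 59, "gender": "F", "year": 2005, "cdc_pop": 15500},
    {"age": 60, "gender": "M", "year": 2005, "cdc_pop": 14900},
    {"age": 59, "gender": "M", "year": 2010, "cdc_pop": 16000},
]


def matches(row):
    """True when the row satisfies every filter criterion."""
    return (
        row["age"] == 59                   # variable: age, operator: equal
        and row["gender"] == "M"           # assumed coding for males
        and 2001 <= row["year"] <= 2010    # inclusive year range
    )


filtered = [row for row in rows if matches(row)]
print([row["year"] for row in filtered])
```

Only rows that pass all criteria remain; the columns are left untouched.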

3) LINKING TWO DATASETS OR MORE

Another way to create a data cut is to use the join table operation. By joining multiple tables (or datasets), we connect the data points that we care about and can discard those we don't want.

Use the "Build" section and formulate the join criteria under the "Join table" header to merge the datasets. A join criterion generally has three components: a counterpart dataset, a join operator (e.g., inner join, left join, right join, etc.) and the variables to match from both datasets. For example, we want to join the "National Population and Death" dataset with the "World Population and Death" dataset by matching the gender, year and age variables, and we are only interested in records that match (i.e., records that exist in both datasets), which calls for an inner join.
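To make the matching behavior concrete, here is a minimal sketch of an inner join on gender, year and age. The `world_pop` column name, the sample values and the `inner_join` helper are hypothetical; the sketch shows what an inner join keeps, not how the portal implements it.

```python
# Hypothetical sample rows; all values are invented for illustration.
national = [
    {"gender": "M", "year": 2001, "age": 59, "cdc_pop": 15000},
    {"gender": "M", "year": 2002, "age": 59, "cdc_pop": 15200},
]
world = [
    {"gender": "M", "year": 2001, "age": 59, "world_pop": 900000},
    {"gender": "F", "year": 2001, "age": 59, "world_pop": 950000},
]

keys = ("gender", "year", "age")  # variables matched in both datasets


def inner_join(left, right, keys):
    """Keep only row pairs whose key values appear in both datasets."""
    # Index the right-hand rows by their key values for fast lookup.
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined = []
    for lrow in left:
        key = tuple(lrow[k] for k in keys)
        for rrow in index.get(key, []):
            # Non-key column names are distinct here, so a plain merge is safe.
            joined.append({**lrow, **rrow})
    return joined


joined = inner_join(national, world, keys)
print(joined)
```

The 2002 national row and the female world row have no counterpart, so they are dropped. A left join would instead keep every national row and leave the world columns empty where no match exists.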

Demo: Create a Cohort

The video below demonstrates how to combine these data cutting operations to create a specific data cohort.