v1.0
Need Help?

FAQ: Data Access

Q: How do I get access to the PHS datasets?

Please ☞ follow the instructions in this documentation to learn more.


Q: How do I log in using the Duo Authenticator?

Here are the steps:

  • Enter your Stanford credentials (i.e., SUNet ID and password) on the Stanford Login page and click the "Login" button.
  • Select the device that has the Duo Mobile app and click the "Send Me a Push" button.
  • Check your device (e.g., smartphone or tablet) and approve the login request via the Duo Mobile app.

Q: How do I establish a secure VPN connection?

Here are the steps:

  • Launch the Cisco AnyConnect VPN client app.
  • Enter the Stanford VPN address: su-vpn.stanford.edu
  • Enter your Stanford credentials (i.e., SUNet ID and password).
  • Type "1" to push the 2-step authentication request to your default Duo device. If you have multiple devices, type the option number of the device you want to use.
  • Check your device (e.g., smartphone or tablet) and approve the login request via the Duo Mobile app.

Q: I can't use the Cardinal Key

Alternatively, you can authenticate your login with Duo Mobile two-factor authentication (see ☞ the installation instructions).


Q: What does '1% sample dataset' mean?

It means you are getting only one hundredth of the total records in a dataset. For example, the Optum OMOP Cost table has 3 billion records. One percent of that is 30 million records, and that is the number of records you get in a 1% sample. Our system selects the sampled records at random.
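
The arithmetic can be checked with a short Python sketch. The function names are hypothetical, and the sampling method shown (simple random sampling without replacement) is an assumption made for illustration; the answer above only states that records are picked at random.

```python
import random


def sample_size(total_records, percent=1):
    """Number of records in a sample of the given whole-number percentage."""
    # Integer arithmetic avoids floating-point rounding on very large tables.
    return total_records * percent // 100


def draw_sample(records, percent=1, seed=None):
    """Draw a simple random sample without replacement (assumed method)."""
    rng = random.Random(seed)
    return rng.sample(records, sample_size(len(records), percent))


# The Optum OMOP Cost example: 1% of 3 billion records.
print(sample_size(3_000_000_000))  # 30000000

# A toy table of 1,000 record IDs yields a 1% sample of 10 records.
toy_table = list(range(1000))
print(len(draw_sample(toy_table, seed=0)))  # 10
```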


Q: Table with a cell size less than 10 cannot be downloaded. What does this mean?

The term cell size refers to the count statistic in your aggregated table.

If you have a table that contains a frequency count and the value is less than 10, you are not permitted to download or publish such a table due to privacy and security concerns.

Note that the restriction applies only to patient counts. It does not apply to counts of non-human subjects. For example, the number of cancer patients or the number of children under 5 years old will be evaluated, but the number of common diseases or the number of sick cows will not be evaluated.

When you encounter cell sizes smaller than 10, please consider broadening your selection criteria or collapsing subgroups in line with your research question. For example, in a study about a rare disease you may decide to show results for all patients together rather than separate results for men and women. If the cell is important, you may consider reporting it as "<11" instead of the exact count. We strongly discourage publication of cells smaller than 20 and will review these on a case-by-case basis.
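
As a rough illustration, the following Python sketch shows one way to check an aggregated table before requesting a download. It is not the check our system runs. The function names, the table layout (a dictionary keyed by group and subgroup), and the example counts are all hypothetical, and the treatment of zero counts is an assumption.

```python
THRESHOLD = 10  # cells with a patient count below this cannot be downloaded


def small_cells(table):
    """Return the keys of all cells whose patient count is below the threshold."""
    # Assumption: a count of zero is flagged too; the policy text does not say.
    return [key for key, count in table.items() if count < THRESHOLD]


def mask_small_cells(table, label="<11"):
    """Replace each small count with a label instead of the exact value."""
    return {key: (label if count < THRESHOLD else count)
            for key, count in table.items()}


def collapse_subgroups(table):
    """Sum counts over the subgroup (second key element), e.g. men and women."""
    totals = {}
    for (group, _subgroup), count in table.items():
        totals[group] = totals.get(group, 0) + count
    return totals


# Hypothetical patient counts for a rare disease, split by sex.
counts = {("rare disease", "men"): 7, ("rare disease", "women"): 14}

print(small_cells(counts))         # [('rare disease', 'men')]
print(mask_small_cells(counts))    # the count for men becomes '<11'
print(collapse_subgroups(counts))  # {'rare disease': 21}
```

In this example, collapsing the two subgroups gives a single cell of 21, which clears both the download threshold of 10 and the publication guideline of 20.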


Q: What is the cost to access/use PHS data?

This depends on several factors:

  • Which dataset(s) you intend to use (you can explore our datasets on the PHS Data Portal).

    • For some datasets, the data owners may charge a fee for 'reuse' of PHS copies of the data. This is still usually substantially cheaper (by tens or hundreds of thousands of dollars) than purchasing, storing, and managing a separate copy of the data for yourself or your lab. Datasets with reuse fees include Medicaid, Medicare, and CO APCD.
    • For some datasets, PHS collects cost-recovery fees to pay for continued access to the data, updates to the data (additional years of data), storage costs, and staff time for maintaining the data. This includes datasets such as MarketScan (starting in 2026).
  • Which computational environment(s) you intend to use.

  • The nature of the analysis you wish to conduct.

    • More computationally intensive analyses (such as those using AI or advanced machine learning techniques) may require additional computational resources, which can cost money. The cost of those resources depends on (1) which computational environment(s) you intend to use, (2) what types of computational resources you need, and (3) how much of them you need.
  • What institution(s) you are affiliated with.

    • Some of our data contracts only cover Stanford students, staff members, faculty members, Fellows, and HR-recognized visiting scholars. For those datasets, the data owners may charge an additional fee to put a third-party agreement (a.k.a. a Data Rider) in place for non-Stanford individuals who wish to use the Stanford copy of the data. (Data Riders are still usually substantially cheaper, by tens or hundreds of thousands of dollars, than purchasing, storing, and managing a separate copy of the data for yourself or your lab.)
    • Please note: Nearly all of our datasets can only be used by individuals from academic or government institutions, and are not available for use by private individuals or individuals associated with for-profit organizations.