Summary


A well-designed database can significantly reduce the amount of time spent on "data cleaning" prior to data analysis. A database (the data tables created by Microsoft ACCESS or the open-source PostgreSQL program) is helpful if the data contain repeated assessments of many variables. This page summarizes how to get a database software program to work with statisitcal software packages like SPSS, Stata, and R.

Database Design


This web page summarizes a few things for a beginner to keep in mind during database design. A carefully designed database can be seamlessly imported into a statistical package for data analysis, making data analysis much less time consuming. Here are a few suggestions on how to name the variables, define their data types, and fill in their values.

Microsoft Excel is usually not a good choice for data management, although it gets used a lot. It is useful if you are meticulous about data entry. I can think of several problems. For example, it sometimes changes TEXT data into numbers and vise versa. Sometimes empty rows and columns are saved and imported into a statistical package. These empty rows and columns cause the statistical packages to abort. Sometimes missing values are imported into a statistical package as a text string like "#NULL!" instead of an empty string or a special character representing missing data like the period '.' (both SPSS and Stata recognize the period for missing data). Dates are not usually recorded in a way that is compatible with date data in statistical packages. There are many other problems, which may take a long time to fix. There are so many times that I spent more time finding these needles in a hay stack of data than on data analysis. So try to use ACCESS, SPSS, or SAS to store your data, even though MS Excel seems like a very convenient tool for someone who is relatively inexperienced with database management.

Variable Names

Generally, variable names should not contain symbols like "#", "@", "-", "'". Sometimes people use "id#" for a variable on id numbers. That can cause confusion. Variable names like "IES-R" are bad because the minus sign means subtraction in statistical packages. Also, it often helps (but is not required) to abbreviate the variable names to within 8 to 10 characters. Abbreviated variable names can be viewed more clearly in the point-and-click data analysis method in SPSS. We are used to long file names (with symbols) in a word-processor program that we do not realize these restrictions until later. Then a lot of time is wasted "cleaning" the data. Here is a list of common things to consider in designing a database (and in variable naming in general):

The last point is important because some statistical packages (typically those on a computer running UNIX) are case-sensitive so that "distr6m" is different from "Dist6M". It is usually a good idea to stay with all lower-case letters. Also, here are some bad variable names: "id#", "ses-1", "Depressed@t1", and "VAR 2".

Data Types

Numbers (NUMERIC in SPSS) are the most commonly used data type, like 3, 2, -1, 2.2, and 9999. STRING variables are not used as often (like 'Private' and 'Public' schools) because of various reasons. One is that while a blank entry of a NUMERIC variable is considered a missing value, a blank STRING with nothing in it is just blank, not missing. DATES (like 2001-02-22 or 02/22/2002) can get complicated depending on what software packages are processing them. One of the worst headaches is to use spreadsheet programs to manage dates like '11/01/01'. When the spreadsheet data are imported into a statistical software package, it is possible for the statistical package to mistaken November 01, 2001 for January 11, 1901. The international standard notation DATE notation is YYYY-MM-DD (e.g., see http://www.cl.cam.ac.uk/~mgk25/iso-time.html). So the fourth day of July in the year 2007 is written in the standard notation as 2007-07-04. But a spreadsheet created by a user in the United States typically follows a notation of "07/04/2007". You can minimize problems by forcing the spreadsheet program to adopt the international standard notation. A database program (e.g., Postgresql and mySQL) is a better choice because it follows the international standard so few conversions are needed.

Other things to keep in mind are:

[continue revision]


Quick-Links