Summary


A well-designed database can significantly reduce the amount of time spent on "data cleaning" prior to data analysis. This web page provides a few recommendations on good data management practices.

Don't Use a Spreadsheet Program


Do not use Excel for data management.

Why?

Reason 1, date-time variables are often messed up in Excel, either erroneous coded as a text string (so that later you have to manually convert it into a date-time variable) or worse, mistaken as the wrong date (e.g., '11/01/01' taken to represent November 01, 2001 rather than January 11, 1911).

The international standard DATE notation is YYYY-MM-DD (e.g., see this page by Markus Kuhn. The fourth day of August in the year 2007 is written in the standard notation as 2007-08-04. But a spreadsheet created by a user in the United States typically follows a notation of "08/04/07". A computer program expecting the international standard may convert a text string "08/04/07" into a date-time variable of April 07, 2008.

You can of course use the "Format Cells" option in Excel to set the variable to the desired Date format. But this becomes inefficient manual work if there are more than a handful of date-time variables. A better solution is to define them as date-time variables using a database program such as Postgresql, mySQL, and Microsoft ACCESS.

Reason 2, if you have a typo in a numeric variable (e.g., "9 .", a number 9, two spaces, and a period), then Excel or a statistical package can automatically convert the whole column into a TEXT column. That column then becomes a TEXT variable when imported into a statistical package.

Suppose your dataset only contains 25 variables. If in Excel you accidentally entered a blank cell in column 46, then you might wind up getting a dataset with 46 variables, not 25, and with the intervening columns all imported and automatically named "VAR26", "VAR27", etc. That's reason 3.

A related problem occurs when you add analyses to the same data spreadsheet. These analyses may be imported as raw data values. Or sometimes a user adds a plot to the right of the data columns. That plot has to be manually removed before the data columns can be imported into another program.

Reason 4. Sometimes missing values are imported into a statistical package as a text string like "#NULL!" instead of an empty string or a special character representing missing data like the period '.' (both SPSS and Stata recognize the period for missing data).

There are many other reasons. When I receive data prepared in Excel, I often spend up to 80-90% of the time fixing these problems. The actual statistical analysis often takes not much time. So please use a database to manage your data, even though MS Excel seems like a very convenient tool.

Avoid Symbols in Variable Names

Variable names should never contain symbols like "#", "@", "-" (minus), or "'" (apostrophe). For example, you should not use "id#" for a variable on id numbers. Variable names like "IES-R" (note the hyphen) are bad because the minus sign means subtraction in statistical packages. Some statistical packages simply can not work with these symbols when they appear in variable names.

Do Use Short Variable Names

It used to be the case that some statistical packages only work with variable names no longer than 8 characters of length because the operating system (MS-DOS, for example) does not allow it.

Although this restriction no longer applies in the statistical packages I know, it is still a good idea to abbreviate variable names to within 8 to 10 characters. Abbreviated variable names can be viewed more clearly. They also make it easier to write a statistical program.

The problem is that we get used to long file names such as "Jane Doe Dissertation V 225 11/2/13_almost final.docx" that we habitually type long variables names without giving it much thought. Abbreviated names require careful thinking. That probably is why they are not vary popular. Sadly, computer programs shape our behaviors. It is too easy to type long names without thinking about ways to simplify them.

If You Can, Use a Database

A database using PostgreSQL, MySQL, or Microsoft ACCESS is harder than Excel. But the investment is well worth it. A carefully designed database can be seamlessly imported into a statistical package for data analysis, making data analysis much less time consuming.

Variable Names

The last point is important because some statistical packages (typically those on a computer running UNIX) are case-sensitive so that "distr6m" is different from "Dist6M". It is usually a good idea to stay with all lower-case letters.

Data Types

Numbers (NUMERIC in SPSS) are the most commonly used data type, like 3, 2, -1, 2.2, and 9999. STRING variables are not used as often (like 'Private' and 'Public' schools) because they can be coded numerically as 1 for private school and 2 for public school. It makes sense to use numbers to code STRING variables. For example, a statistical package may consider a blank entry of a NUMERIC variable as missing by default, it may treat an empty STRING as just a blank string, not missing.

Other things to keep in mind are:


Quick-Links