Stata User Manual in K300 1.0 documentation

Optinal K300 Project: Regarding Panel Study of Income Dynamics (PSID)

«  Section 4: Linear Regression   ::   Contents

Optinal K300 Project: Regarding Panel Study of Income Dynamics (PSID)

Section 1: Questions

  • Data file:

    • The data file k300_psid.dta can be downloanded here. This is the file you are going to use in your project. You need to refer to Section 2: Detailed Information about PSID Data for the detailed information regarding this data set.
    • The original Panel Study of Income Dynamics (PSID) data gathers information about a very large sample of people, individuals are interviewed annually over thier life time. The PSID collects data on a wide array of social, demographic, health, economic, geospatial and psychological data. The data set we use here is just a small subset of it.
  • Format Requirement:

    You need to submit TWO files to be able to receive full credits, one is a text file and the other is a do-file.

    • Text File: In this file, you need to clearly label the question numbers, paste all the relevant figures and test outputs you obtained from STATA.
    • Do-file: You need to provide me with your detailed do file. You need to make sure I can run it and get the figures and results containing in your text file. Significant amount of credits would be deducted if the do-file is not working.
  • Questions:

    • Part I: Exploratory Data Analysis

      1. You’re interested in finding out how the distribution of people’s wage income (log wage) evolves over time. Try to replicate the figures provided below. On the left, various percentiles of individuals’ log wage incomes are plotted against time. Note that different percentiles are normalized to be 1 at the beginning in the year 1971. For the figure on the right, standard deviation of people’s wage income is plotted against time.

        Briefly describe the trend you’ve discovered in these plots.

        _images/section_project_1.png _images/section_project_2.png
      2. Pick one year (I chose the year 1980), and just use the data in that year to conduct your analysis. Within that year, identify the factors which may affect people’s income. Use mean plot to help you visualize the main effects and interaction effect.

        For example, I identified educational attainment, region, industry can affect people’s income. So I plotted the main effect regarding factor “educational attainment” (the figure on the left), and the interaction effect of factor “region” and factor “industry” in the figure on the right.

        Following my example, you need to first list all the factors that may affect people’s income, and then generate the mean plots regarding all the main effects and interaction effects. To get full credit, you need to list the factors and generate the relevant plots as much as possible.

        _images/section_project_3.png _images/section_project_4.png
    • Part II: Inferential Statistics

      1. (Continuation of Question 2), now use the ANOVA tests we’ve learned to conduct statistical tests to check significance of relevant main effects and interaction effects.
      2. (Appreciate the equivalence between ANOVA and linear regression) Conduct the same analysis as you did in question 3 using linear regression technique. You need to make sure you get the same results as in question 3.
    • Part III: Challenging Question

      1. The figure on the right in question 1, we plotted standard deviation of log wage over time. It seems there is a significant increase in the standard deviation after 1980’s. Could you (i). provide your reasoning why this change is observed; (ii) provide rigorous statistical test to check whether the change in standard deviation is significant.

Section 2: Detailed Information about PSID Data

PSID data can be found in the website provided by University of Michigan. Detailed documents can be found in the provided webpage.

Section 2.2: More on Variables

x11101ll

This is personal identification number, same number denotes the data is from the same person.

year

The year when the data was collected, it takes value from 1971 to 2011.

age

The age of individual, age at the time of the interview.

educ

The number of grades of school that the head accomplished. Note that the way this variable is coded is not consistent through the years:

  • Year 1968 ~ 1974 and 1985 ~ 1990

    Value/Range Code Value/Range Text
    0 cannot read or write
    1 0-5 grades
    2 6-8 grades
    3 9-11 grades
    4 12 grades, high school
    5 12 grades plus non-academic training
    6 college, no degree
    7 college degree, no advanced degree mentioned
    8 college, advanced or professional degree
    9 NA
  • Year 1975 ~ 1984 and 1991 ~ 2011

    Value/Range Code Value/Range Text
    0 none
    1 one
    2 two
    3 three
    4 four
    5 five
    6 six
    7 seven
    8 eight
    9 nine
    10 ten
    11 eleven
    12 twelve; GED
    13 first year of college
    14 second year of college, with or without Associate’s degree
    15 third year of college
    16 fourth year of college; college graduate
    17 at least some postgraduate work
    99 NA

indwgt

Individual weight. If you are not familiar with weight in the survey data, you can ignore this variable.

empstatus

The employment status of individual: working now, unemployed, retired or other. Note that the way this variable is coded is also not consistent throughout the years:

  • Year 1968 ~ 1975

    Value/Range Code Value/Range Text
    1 Working now, or only temporarily laid off
    2 Looking for work, unemployed
    3 Retired, permanently disabled
    4 housewife
    5 student
    6 Other
  • Year 1976 ~ 1996

    Value/Range Code Value/Range Text
    1 Working now
    2 Only temporarily laid off, sick leave or maternity leave
    3 Looking for work, unemployed
    4 Retired
    5 permanently disabled; temporarily disable (from 1996 on)
    6 housewife
    7 student
    8 Other; “workfare”; in prison or jail
    9 NA; refused (from 1996 on)
  • Year 1997 ~ 2001

    Value/Range Code Value/Range Text
    1 Working now
    2 Only temporarily laid off, sick leave or maternity leave
    3 Looking for work, unemployed
    4 Retired
    5 permanently disabled; temporarily disable (from 1996 on)
    6 Keeping house
    7 student
    8 Other; “workfare”; in prison or jail
    98 DK
    99 NA; refused
  • Year 2003

    Value/Range Code Value/Range Text
    0 Wild code
    1 Working now
    2 Only temporarily laid off, sick leave or maternity leave
    3 Looking for work, unemployed
    4 Retired
    5 permanently disabled; temporarily disable (from 1996 on)
    6 Keeping house
    7 student
    8 Other; “workfare”; in prison or jail
    22 Wild code
    99 NA; refused

industry

The main occupation of the individual. The way how this variable is coded is as follows:

  • Year 1968 ~ 2001

    Value/Range Code Value/Range Text
    0 Inap.: not eligible for retroactive coding
    17-28 Agriculture, Forestry, and Fisheries
    47-57 Mining
    67-77 Construction
    107-398 Manufacturing
    407-479 Transportation, Communications, and Other Public Utilities
    507-698 Wholesale and Retail Trade
    707-718 Finance, Insurance, and Real Estate
    727-759 Business and Repair Services
    769-798 Personal Services
    807-809 Entertainment and Recreation Service
    828-897 Professional and Related Services
    907-937 Public Administration
    999 NA
  • Year 2003 ~ 2011

    Value/Range Code Value/Range Text
    0 Inap.: did not work for money or has not worked for money
    17-29 Agriculture, Forestry, and Fisheries
    37-49 Mining
    57-69 Utilities
    77 Construction
    107-399 Manufacturing
    407-459 Wholesale Trade
    467-579 Retail Trade
    607-639 Transportation and Warehousing
    647-679 Information
    687-699 Finance and Insurance
    707-719 Real Estate and Rental and Leasing
    727-749 Professional, Scientific, and Technical Services
    757-779 Management, Administrative and Support, and Waste Management Services
    786-789 Educational Services
    797-847 Health Care and Social Assistance
    848 Wild code
    856-859 Arts, Entertainment, and Recreation
    866-869 Accommodations and Food Services
    877-929 Other Services (Except Public Administration)
    937-987 Public Administration and Active Duty Military
    999 NA

region

Region is where some individual lived at the time of interview. In the data provided to you, I’ve just kept 4 possible values where 1 means northeast, 2 means north central, 3 means south and 4 means west.

  • Year 1968 ~ 1969

    Value/Range Code Value/Range Text
    1 Northeast (Connecticut, Maine, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, Vermont)
    2 North Central (Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin)
    3 South (Alabama, Arkansas, Delaware, Florida, Georgia, Kentucky, Louisiana, Maryland, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee, Texas, Virginia, Washington DC, West Virginia)
    4 West (Arizona, California, Colorado, Idaho, Montana, Nevada, New Mexico, Oregon, Utah, Washington, Wyoming)
  • Year 1970 ~ 2011

    Value/Range Code Value/Range Text
    0 Wild code (from 1999 onward)
    1 Northeast
    2 North Central
    3 South
    4 West
    5 Alaska, Hawaii
    6 Foreign country
    9 NA (from 1972 onward)

homeowner

Whether an individual owns his home or pay rent or other.

  • Year 1968 ~ 1993

    Value/Range Code Value/Range Text
    1 Owns home (or trailer, fully or jointly)
    5 Rents (or shares rent)
    8 Neither (owns nor rents)
  • Year 1994 ~ 2011

    Value/Range Code Value/Range Text
    0 Inap.
    1 Owns or is buying home, either fully or jointly; mobile home owners who rent lots are included here
    5 Pays rent
    8 Neither owns nor rents
    9 DK; NA; refused (Wild code in year 2007)

familysize

Number of people (children plus adults) in this FAMILY UNIT.

CPIURS

CPI-URS is the annual average Consumper Price Index research series (CPI-U-RS) where the price level in Dec. 1977 is set to be 100.

real_wagehead

Amount of individual’s wage and salary income, which is normalized using CPI-U-RS. So the dollar value is in terms of Dec. 1977 dollar.

real_housevalue

The value of house if any in terms of Dec. 1977 dollar.

real_faminc

This variable represents the summation of the following variables: taxable income of Head and Wife; total transfer of Head and Wife; taxable prorated income of others; total prorated transfers of others. The value is in terms of Dec. 1977 dollar.

educgrp

This variable is generated from variable educ. It can take six values where 1 means elementary school, 2 means some high school, 3 means high school graduate, 4 means some college, 5 means college graduate, and 6 means graduate school.

birth

The year when some individual was born.

industry_grp

This variable is generated from variable industry. It takes five values: 1 denotes agriculture, 2 denotes mining, 3 denotes constructing, 4 denotes manufacturing, and 5 denotes others.

log_real_wagehead

This is just the log value of the variable real_wagehead.

age2

This is the age square of some individual.

age3

This is just age cube of some individual.

«  Section 4: Linear Regression   ::   Contents