Data cleaning & validation process followed at GetCounted

When we are commissioned for data cleaning, validation or analysis, we follow the steps below to clean the data. We recommend that clients follow the same practices if they handle questionnaire programming and data cleaning at their end.

Steps that should be followed in data cleaning during administration of the survey:

1. There is a fundamental difference between online self-administered surveys and face-to-face offline surveys. In an online survey, the logic built into the questionnaire should ensure that clean data is captured right from the beginning (much cleaner than in face-to-face surveys). There is no ‘manual’ data punching process involved; data is captured automatically in the pre-designed back-end layout, following the pre-decided variable and value labeling.

  1. Coding & labeling are therefore automated
  2. Cleaning needs that arise from data punching/digitization do not occur

2. As a process, once the first 50-100 responses have been collected, the data should be downloaded and checked for possible programming errors and the data-capture errors that result from them. Full-fledged fielding of the sample should begin only after this check.

  1. Data going into the wrong fields is caught early, so the need for cleaning later is avoided

3. At GetCounted, our process ensures that the same person cannot participate again in the same survey (de-duping). Each panel member logs in with a unique ID, which we use together with session ID and IP tracking so that the same survey is never shown twice to one person. We also take steps to ensure that a panel member is not asked to fill in more than one survey within a fixed period of time.

4. Upon completion of the sample collection, the following cleaning should be undertaken:

  1. Validation that the de-duping process worked
  2. Duplication check - using unique IP, same session ID and survey start time
  3. Straight-liner check - by checking survey filling time (end time - start time) against the average, running descriptive statistics and measures of central tendency, and plotting the outliers on a histogram and scatter plot
  4. Junk response check across all open-ended questions, by comparing responses with dictionaries built over the years and manually checking the exceptions
  5. We don't believe in response cleaning; the final cleaning action should always be deleting the case/respondent.
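
The checks in step 4 can be sketched in Python as follows. This is a minimal illustration, not GetCounted's actual implementation: the field names (id, ip, session, start, end, open_end), the sample rows, the 30%-of-median time cutoff and the tiny word list are all assumptions made for the example. A real dictionary would be far larger, and flagged open-ended answers would still be reviewed manually before a case is deleted.

```python
from datetime import datetime
from statistics import median

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical export, one dict per respondent (field names are assumptions).
responses = [
    {"id": "R1", "ip": "10.0.0.1", "session": "s1", "start": "2024-01-05 10:00:00",
     "end": "2024-01-05 10:12:00", "open_end": "the service was good and fast"},
    {"id": "R2", "ip": "10.0.0.2", "session": "s2", "start": "2024-01-05 10:05:00",
     "end": "2024-01-05 10:15:00", "open_end": "good value for the price"},
    {"id": "R3", "ip": "10.0.0.1", "session": "s1", "start": "2024-01-05 10:00:00",
     "end": "2024-01-05 10:11:00", "open_end": "the service was good"},
    {"id": "R4", "ip": "10.0.0.4", "session": "s4", "start": "2024-01-05 11:00:00",
     "end": "2024-01-05 11:01:00", "open_end": "good service"},
    {"id": "R5", "ip": "10.0.0.5", "session": "s5", "start": "2024-01-05 11:30:00",
     "end": "2024-01-05 11:43:00", "open_end": "asdf qwer zxcv"},
]

# Illustrative stand-in for a dictionary built up over the years.
DICTIONARY = {"the", "service", "was", "good", "and", "fast", "value", "for", "price"}


def duration_minutes(row):
    """Survey filling time: end time minus start time, in minutes."""
    start = datetime.strptime(row["start"], FMT)
    end = datetime.strptime(row["end"], FMT)
    return (end - start).total_seconds() / 60


def flag_duplicates(rows):
    """IDs of cases that repeat an earlier case's IP, session ID and start time."""
    seen, flagged = set(), set()
    for row in rows:
        key = (row["ip"], row["session"], row["start"])
        if key in seen:
            flagged.add(row["id"])
        seen.add(key)
    return flagged


def flag_fast_completes(rows, fraction=0.3):
    """IDs of cases completed in under `fraction` of the median time (assumed cutoff)."""
    durations = {row["id"]: duration_minutes(row) for row in rows}
    cutoff = fraction * median(durations.values())
    return {rid for rid, minutes in durations.items() if minutes < cutoff}


def flag_junk(rows, dictionary, min_known=0.5):
    """IDs of cases whose open-ended answer has too few dictionary words."""
    flagged = set()
    for row in rows:
        words = [w.strip(".,!?").lower() for w in row["open_end"].split()]
        known = sum(w in dictionary for w in words)
        if not words or known / len(words) < min_known:
            flagged.add(row["id"])
    return flagged


# The cleaning action is deletion of the whole case, never editing a response.
to_delete = (flag_duplicates(responses)
             | flag_fast_completes(responses)
             | flag_junk(responses, DICTIONARY))
clean = [row for row in responses if row["id"] not in to_delete]
print(sorted(to_delete))  # ['R3', 'R4', 'R5']
```

Each check returns a set of respondent IDs instead of modifying the data, so the flags can be reviewed separately before the flagged cases are dropped in one step.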

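For the pilot check described in step 2 (downloading the first 50-100 responses and looking for programming errors), a minimal Python sketch might compare every captured value against the list of valid codes for its variable. The codebook, variable names and sample rows below are hypothetical, and skip patterns are not modelled, so a value that is legitimately missing because of routing would also be reported here.

```python
# Hypothetical codebook: variable name -> set of valid codes (an assumption).
CODEBOOK = {
    "gender": {1, 2, 3},
    "age_group": {1, 2, 3, 4, 5},
    "satisfaction": {1, 2, 3, 4, 5},
}


def check_pilot(rows, codebook):
    """Return (respondent id, variable, value) for every missing or out-of-range value."""
    problems = []
    for row in rows:
        for var, valid_codes in codebook.items():
            value = row.get(var)  # None when the variable was not captured
            if value not in valid_codes:
                problems.append((row["id"], var, value))
    return problems


# Illustrative pilot download (made-up rows).
pilot = [
    {"id": "P1", "gender": 1, "age_group": 3, "satisfaction": 4},
    {"id": "P2", "gender": 2, "age_group": 7, "satisfaction": 5},  # 7 is not a valid code
    {"id": "P3", "gender": 2, "age_group": 2},                     # satisfaction missing
]

problems = check_pilot(pilot, CODEBOOK)
for respondent_id, var, value in problems:
    print(f"{respondent_id}: {var} = {value!r}")
```
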
We would also like to share a few points that showcase the efficacy of online research:

  • Voluntary participation in a self-administered questionnaire leads to better quality of response
  • Using a ‘panel’ also allows the target audience to be pre-selected, rather than inviting people at random to participate in a voluntary survey
  • The authenticity of the respondent is always better controlled and identifiable in a panel where the respondent has an ongoing relationship with the panel.
  • Data is captured online, which removes the possibility of data punching errors and makes automatic validation, logic (skip patterns) and cleaning checks possible
  • Automation minimizes human error to a very large extent and, in particular, takes care of non-sampling errors (we haven’t seen error rates above 2-5% at variable level)
  • Ultimately, this saves post-processing time and enhances quality