How Correct is My Data: Cardinal Rules for Data Collection
Written: May 2, 2007
A marketing manager who has recently joined a pharmaceutical company is looking at a report on his desk about the performance of the sales representatives: their individual targets and their actual achievements. He needs to know how correct the presented data is and how much confidence he should place in it. What specific questions should he ask to determine the level of confidence he can have in the figures?

This paper presents a tool consisting of four rules for determining the correctness of data. It also gives two case studies that show how this tool can be used to improve the correctness of data and remove errors. The rules are also useful for identifying redundant steps that can be eliminated during Business Process Reengineering. During the computerization process, the rules help in identifying the step that needs to be computerized first. The framework has been developed on the basis of experience from several computerization efforts across a wide range of companies.

Rules for Measuring Data Correctness

There are four cardinal rules for ensuring the correctness of data during collection. We define these rules in terms of four distances. The basic idea is that reducing each distance improves the correctness of the captured data. Each rule refers to reducing the distance between the point of origin of the data and its collection point, as shown in Table 1.
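
Table 1: The four distances

Distance | Origin of data | What the distance measures
Place-distance | Where it is generated | How many places it has been through before being entered
Time-distance | When it is generated | How long ago the data was generated
Person-distance | Who generates it | How many intermediaries handled it before it was entered
Document-distance | What generates it | How many documental transformations it has gone through before being entered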


Place-distance indicates the distance between the place of origin of the data and the place of its collection or entry into the system. It refers to the number of places in which the data was physically handled before it was finally entered into the system. Each intervening physical location and each intervening storage medium introduces degradation and a potential loss of correctness. Thus, the more places in which the data is physically handled, the greater its degradation.

Ideally, the place where the data is generated should also be the place where it is entered, to keep errors to a minimum. Consider a case where a client of a firm fills in a paper form. The filled forms are then batched together and sent to the computer department, where data-entry personnel type them in. In this case, errors are introduced at each stage. When the client fills in the form by hand, he may misunderstand a question or enter wrong information in a particular field, and this error is then carried over to the batching station. Hopefully, someone will detect it during batching and scrutiny. However, during batching some forms may be misplaced or inadvertently put in the wrong batch, and some forms may simply not arrive in time to become part of the proper batch. The data-entry operator may not be able to read the handwriting on the form, or may introduce new typographical errors during entry. Thus, each stage introduces an additional set of errors.

It is for this reason that web interfaces, where the user himself sits down and places an online request, result in the fewest errors.


Time-distance indicates the distance between the point in time at which the data originates and the point in time of its entry into the system. It refers to the number of days or months that may have passed between the generation of the data and its final entry. Time delays make the data inaccurate. Apart from the wear and tear that a manual form may suffer while the paper is stored, the data itself tends to lose its currency as time passes. The longer a document lies without being entered into the system, the greater the chance that the information becomes obsolete, misplaced, or misinterpreted. The time context of some information may also change: what was present or future when recorded may have become past, and what was future may have become present.

A lack of currency is in itself an inaccuracy. For example, the price of a product may be superseded, a discount offer may expire, the person who generated the data may become inaccessible, or earlier data may be entered after later data, making the ordering illogical.

Ideally, time-distance should be zero: the data should be entered when it is generated. For example, when Walls salespersons make deliveries to retailers, they take the inventory and enter it there and then on their handheld computers, which send it to the central computers for processing.


Person-distance indicates the distance, in terms of the number of intermediaries, between the person originating the data and the person collecting or entering it into the system. It indicates the number of persons who have handled the data on its way to its eventual entry into the system. Each person who handles the data may introduce errors, misplacement, or delays that degrade the correctness or validity of the data.

Ideally, data is best entered by the person who generates it, that is, when person-distance is zero. Thus, online web reservation of tickets is often the most error-free transaction. Similarly, an ATM transaction, where the person entering the data is the same person who is withdrawing (or depositing) the money, is often among the most reliable, provided the transaction completes without technical problems. Likewise, the chances of an incorrect order are lower on the web, where the customer places the order himself, than over the telephone, where the customer talks to a call-center agent who then enters the information. The chances of error increase when the person entering the data is different from the person generating it.


Document-distance indicates the distance in terms of the number of transformations the original document has gone through before it was ready for entry into the system. Typically, if the original data is obtained on a manual form, the form may be used directly for data entry. It is also possible that the software receiving the data is not capable of accepting the point data, in which case the data obtained from the form has to be aggregated. For example, in a system for recording the visit data of a pharmaceutical sales executive, the visit forms may be entered directly, or the data from the forms may first be copied into manual registers or reports, from which it is either entered directly or passed through one or more summarizations and aggregations before finally being entered. Each documental transformation is a potential source of error or inaccuracy.

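$$\text{Correctness} = f(D_{\text{Place}}, D_{\text{Time}}, D_{\text{Person}}, D_{\text{Doc}})$$

Correctness improves as each of the four distances is reduced. Data should be entered where it is generated, when it is generated, without transformation, and by the person who generates it.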
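
The article does not specify the function f, only that a smaller distance on any axis means better correctness. As a minimal sketch under that assumption, two capture routes can be compared axis by axis without inventing weights. The class name, field names, units, and sample values below are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class CaptureDistances:
    """The four distances for one data-capture route (hypothetical units)."""
    place: int      # physical locations the data passed through before entry
    time_days: int  # days between generation and entry
    person: int     # intermediaries between the originator and the entry
    doc: int        # document transformations before entry

    def is_ideal(self) -> bool:
        # Ideal capture: entered where, when, and by whom the data is
        # generated, with no transformation, so every distance is zero.
        return all(d == 0 for d in astuple(self))


def no_worse_than(a: CaptureDistances, b: CaptureDistances) -> bool:
    """True if route `a` is at most as distant as route `b` on every axis."""
    return all(x <= y for x, y in zip(astuple(a), astuple(b)))


# Illustrative values only; the article gives no numeric distances.
paper_form = CaptureDistances(place=2, time_days=7, person=2, doc=1)
web_form = CaptureDistances(place=0, time_days=0, person=0, doc=0)

print(web_form.is_ideal())                  # True
print(no_worse_than(web_form, paper_form))  # True
print(no_worse_than(paper_form, web_form))  # False
```

The comparison is deliberately a partial order: when one route is shorter on some distances and longer on others, neither is declared better, because the article gives no way to trade one distance against another.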

Why Integrate: Objectives

  • Reduction of errors
  • Lower costs
  • Faster processes
  • Ease of Submission and filing
  • Reduced need for human interaction
  • Elimination of middle-men
  • Faster reconciliation
  • Better transparency
  • Greater accountability

Case Study: NADRA

  • Initial fiasco
  • Data entry from registration forms filled in 1999-2000
  • Time distance: 1-2 yrs; rain, storage deterioration
  • Doc distance: Scanned copies of forms, folds
  • Person distance: Data-entry operators at hourly rate
  • Place distance: Centralized outsourced data entry centers
  • Stories: Female/male swap, father/son swap, same age of all villagers

Recovery Phase

  • Time distance: 0: Form entered when received
  • Doc distance: 0: Printout of the form is the document
  • Person distance: 0: Presence of applicant, picture, thumb print
  • Place distance: 0: Entered where presented

Coupling of Inter-organizational Systems

  • Loosely coupled
    • Filing by paper
  • Coupling
    • Sending of returns through email via an excel file
  • Medium Coupling
    • Logging on to the ST system via web and entering the data
  • Tightly coupled
    • Generate a computerized report in a standard format
    • XML document: Computer readable standardized format
    • Generated report automatically uploaded in ST system
    • Specialized computer to computer connection
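
To illustrate the tightly coupled case, the following sketch generates a return as an XML document straight from sales records and reads it back without any retyping. The element and attribute names (`SalesTaxReturn`, `Invoice`, `Amount`, `Tax`) and the sample values are hypothetical; the article does not describe the actual format used by the ST system.

```python
import xml.etree.ElementTree as ET


def build_return_xml(taxpayer_id: str, period: str, invoices: list) -> str:
    """Serialize sales invoices into one XML document (hypothetical schema)."""
    root = ET.Element("SalesTaxReturn", taxpayer=taxpayer_id, period=period)
    for inv in invoices:
        item = ET.SubElement(root, "Invoice", number=inv["number"])
        ET.SubElement(item, "Amount").text = f'{inv["amount"]:.2f}'
        ET.SubElement(item, "Tax").text = f'{inv["tax"]:.2f}'
    return ET.tostring(root, encoding="unicode")


def parse_return_xml(xml_text: str) -> list:
    """Read the invoices back on the receiving side, with no manual re-input."""
    root = ET.fromstring(xml_text)
    return [
        {
            "number": inv.get("number"),
            "amount": float(inv.findtext("Amount")),
            "tax": float(inv.findtext("Tax")),
        }
        for inv in root.findall("Invoice")
    ]


# Illustrative data only.
invoices = [{"number": "INV-1", "amount": 1000.0, "tax": 150.0}]
xml_text = build_return_xml("TP-001", "2007-04", invoices)
print(parse_return_xml(xml_text) == invoices)  # True
```

Because the same structured document is produced by one machine and consumed by the other, no person has to read and re-enter the figures between the taxpayer's records and the receiving system.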

Measures of coupling

  • Human involvement vs machine-to-machine
  • Number of document/format transformations
    • Database generates a report
    • Transferred to Excel, attached in email
    • Send – Receive, Detach
    • Read and re-input
  • Maturity of database systems of the taxpayers
    • Greater the maturity, greater the potential of coupling

Enabling a tighter coupling

  • Carrot and stick
  • Incentives
    • Free or discounted provision of sales systems
    • Tax credits for filing in particular formats and media
    • Lesser auditing
    • Early adopters facilitation
  • Enabling systems
    • Development and free provision of sales systems
    • Support business processes
    • Help business in recording sales, purchases automatically
    • Purchase invoices, sales invoices, other sales documents
    • Generates reports automatically in desired formats
  • Incentives to software product suppliers
    • To provide free sales tax modules integrated with their products

