Technology Assisted Research Methodologies:
A Decision Model to Assess Quality

Michael S. Gendron
Marianne J. D’Onofrio

Management Information Systems
Central Connecticut State University

860-832-3293
gendronm@ccsu.edu
http://wwwsb.ccsu.edu/faculty/gendronm

http://wwwsb.ccus.edu/dataquality

 


Introduction

Organizations face new and more complex ways of collecting data. The Internet provides many of these innovations, but it also increases the number of variables that must be considered when choosing how to collect data. One such consideration is collecting data at an appropriate level of quality. This paper introduces the Technology Assisted Research Methodology Data Collection Quality Model (TARM-DQM) for assessing the best technology for Internet-based data collection. TARM (Technology Assisted Research Methodologies) is the term we use to describe technologies that are employed to collect and process survey data (D'Onofrio & Gendron, 2001).

We believe that the selection of the best Internet TARM implementation can be formalized to enhance decision-making. This paper presents a preliminary model that aids in understanding the multi-attribute nature of data quality that accompanies the selection of an Internet TARM.

Two basic types of sampling are used when Internet TARM is employed to collect data: 1) traditional random sampling, and 2) ad hoc sampling. We also see at least two types of Internet data collection activities: 1) email surveys, and 2) web surveys.

Sampling Methodologies

Traditional sampling methodology has evolved over many years, starting in the late nineteenth century; the basic statistical techniques for probability sampling, however, were first proposed by J. Neyman (Neyman, 1934). Random sampling is employed to ensure the validity, reliability, generalizability and representativeness of the data collected. In short, it is used to make sure that the data collected describes the population well enough to be useful.

Ad hoc sampling draws on subjects as they occur naturally, with few or no statistical methods used to ensure the validity, reliability, generalizability and representativeness of the data collected. In other words, data is collected from subjects who happen to be available, without regard for whether the sample is representative. This is very easy to do in Internet TARM implementations. One need only envision a survey placed on a high-volume web site: all visitors are asked to complete the survey, but no attempt is made to select respondents, hence the name “ad hoc.”
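The contrast between the two sampling approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sampling frame of subject IDs, and the list of visitors are all hypothetical and are not part of the TARM-DQM model.

```python
import random


def random_sample(population, n, seed=None):
    """Probability sample: each member of the frame is equally likely to be chosen."""
    rng = random.Random(seed)  # a seed is used only to make the sketch repeatable
    return rng.sample(population, n)


def ad_hoc_sample(visitors, n):
    """Ad hoc sample: take the first n subjects who happen to arrive, with no selection step."""
    return visitors[:n]


# Hypothetical sampling frame of 1,000 subject IDs.
population = list(range(1000))

# Hypothetical arrival order of web site visitors (a subset of the frame).
visitors = [17, 17 + 1, 250, 251, 252, 900]

selected = random_sample(population, 50, seed=1)
convenient = ad_hoc_sample(visitors, 3)
```

In the first function the researcher controls who is invited; in the second, the respondents select themselves, which is why representativeness cannot be assumed.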

Data Collection Activities

While there are many types of data collection activities that can occur over the Internet, we choose to emphasize two.  This paper discusses survey data collected by email and by the web.

In an email survey, subjects are sent the survey and asked to return it by email. In a web-based survey, subjects are contacted and asked to visit a website, where they complete the survey.

The TARM-DQM Model

The proposed model uses multi-attribute utility theory (Drummond, Stottard, & Torrance, 1994; Haddawy & Haddawy, 1997) to determine the type of data collection that has the highest data and information quality and to remove some of the subjectivity from the decision process. This theory poses two problems: the representation problem and the construction problem (Martin, 2000). The representation problem involves determining the attributes that represent the decision maker’s preferences so that they can be described by a function. The construction problem involves maximizing that function and estimating its parameters. Our goals are to further validate the data quality attributes, thus solving the representation problem, and to create an orthogonal set of scales that solves the construction problem while maintaining the clear relative importance of attributes.

Table 1 - TARM-DQM Decision Choice Model

                   EMAIL SURVEY     WEB SURVEY
RANDOM SAMPLE      Q                Q
AD HOC SAMPLE      Q                Q

Q represents the output of a utility function to calculate data quality.

We propose that, by creating a function (Q), the data collection option with the highest relative quality can be determined. Based on work that defines data quality attributes (Wang & Strong, 1996) and our own research, we propose that these decisions can be made with a finite set of attributes and parameters. For each of these attributes an importance rating (I) and a level (L) would be derived, as shown in Table 2. Earlier work suggests that the importance weightings (I) are organization specific (Gendron & D'Onofrio, 2000). Further research is needed to validate the metrics required to ensure appropriate levels (L) for each cell of the model (i.e., the levels (L) are specific to the decision choices explicated in Table 1).

Table 2 - Data Quality Attributes and Parameters

Attribute (Definition) | Importance | Level
Access Security (data cannot be accessed by competitors, data are of a proprietary nature, access to data can be restricted, secure) | I1 | L1
Accessibility (accessible, retrievable, speed of access, available, up-to-date) | I2 | L2
Accuracy (data are certified error-free, error-free, accurate, correct, flawless, reliable, errors can be easily identified, the integrity of the data, precise) | I3 | L3
Appropriate Amount of Data (the amount of data is appropriate to the task at hand) | I4 | L4
Believability (believable) | I5 | L5
Completeness (the breadth, depth and scope of information contained in the data) | I6 | L6
Concise (well-presented, concise, compactly represented, well-organized, aesthetically pleasing, form of presentation, well formatted, format of the data) | I7 | L7
Cost Effectiveness (cost of data accuracy, cost of data collection, cost effective) | I8 | L8
Ease of Operation (easily joined, easily changed, easily updated, easily downloaded/uploaded, data can be used for multiple purposes, manipulatable, easily aggregated, easily reproduced, data can be easily integrated, easily-customized) | I9 | L9
Ease of Understanding (easily understood, clear, readable) | I10 | L10
Flexibility (adaptable, flexible, extendable, expandable) | I11 | L11
Interpretability (interpretable) | I12 | L12
Objectivity (unbiased, objective) | I13 | L13
Relevancy (applicable, relevant, interesting, usable) | I14 | L14
Representational Consistency (data are continuously represented in the same format, consistently represented, consistently formatted, data are compatible with previous data) | I15 | L15
Reputation (the reputation of the data source, the reputation of the data) | I16 | L16
Timeliness (age of data) | I17 | L17
Trace-ability (well-documented, easily traced, verifiable) | I18 | L18
Value-added (data give you competitive advantage, data add value to your operations) | I19 | L19
Variety of Data & Data Sources (you have a variety of data and data sources) | I20 | L20

Using a multi-attribute utility model these parameters would give us the formula:
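In the simplest additive form of a multi-attribute utility model, assumed here for illustration since other functional forms (for example, normalized or multiplicative ones) are possible, the quality score for a cell c of Table 1 is the importance-weighted sum of the attribute levels:

```latex
Q_c = \sum_{i=1}^{20} I_i \, L_{i,c}
```

Here I_i is the importance rating of attribute i in Table 2 and L_{i,c} is the level of attribute i under decision choice c. The index c is introduced only to make explicit that the levels are specific to each cell, while the importance ratings are shared across cells.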

Importance (I) rating scales have been developed; they are seven-point Likert scales anchored at 1 (Unimportant) and 7 (Extremely Important). Our current research suggests that respondents should also be given a "not applicable" choice as an alternative to the importance rating scale, because they seem reluctant to rate any data quality attribute as unimportant. The scales for level (L) are under development and could be either ordinal or nominal with mapped responses.
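One way such responses might be handled is sketched below. It assumes, purely for illustration, that a "not applicable" answer is encoded as `None` and that an organization's importance rating for an attribute is the mean of the applicable responses; neither choice is prescribed by the model, and all names are hypothetical.

```python
from statistics import mean

NOT_APPLICABLE = None  # hypothetical encoding of the "not applicable" choice
SCALE_MIN, SCALE_MAX = 1, 7  # anchors of the importance scale


def validate_rating(rating):
    """Return the rating if it is N/A or an integer on the 1-7 scale; otherwise raise."""
    if rating is NOT_APPLICABLE:
        return rating
    if isinstance(rating, int) and SCALE_MIN <= rating <= SCALE_MAX:
        return rating
    raise ValueError(f"rating must be {SCALE_MIN}-{SCALE_MAX} or not applicable: {rating!r}")


def mean_importance(ratings):
    """Average the applicable ratings; return None if every response was N/A."""
    applicable = [r for r in map(validate_rating, ratings) if r is not NOT_APPLICABLE]
    return mean(applicable) if applicable else None


# Hypothetical responses from four managers for a single attribute.
responses = [7, 6, NOT_APPLICABLE, 5]
importance = mean_importance(responses)
```

Keeping "not applicable" separate from the numeric scale means it does not pull the average toward the low end, which a forced rating of 1 would do.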

Implementation

An organization presented with choices for data collection needs to decide which choice is best. In our experience this is often an unstructured decision (Simon, 1977). This model attempts to remove some of the subjectivity and thus enhance decision-making.

Use of the proposed model would proceed as follows:

  1. Data is collected from managers about the importance of each attribute for the data to be collected (i.e., the 20 Importance (I) parameters in Table 2).
  2. Data is collected from the organization’s technical department about the level of each attribute as it relates to the data collection decision choices (i.e., the 20 Level (L) parameters in Table 2; see Table 1 for decision choices).
  3. The function Q is computed for each cell.
  4. The relative differences among the Q values are used to indicate the relative quality of the choices.
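The four steps above can be sketched as follows. The sketch assumes an additive utility function (the sum of importance times level over the attributes) and a numeric 1-5 level scale; both are assumptions, since the level scales are still under development. Only three of the twenty attributes are shown, and every number is invented for illustration rather than drawn from collected data.

```python
def quality_score(importance, level):
    """Assumed additive utility: sum of importance * level over all attributes."""
    if importance.keys() != level.keys():
        raise ValueError("importance and level must cover the same attributes")
    return sum(importance[a] * level[a] for a in importance)


def rank_choices(importance, levels_by_choice):
    """Compute Q for every cell and return (choice, Q) pairs, highest Q first."""
    scores = {
        choice: quality_score(importance, level)
        for choice, level in levels_by_choice.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Step 1: importance ratings from managers (hypothetical values, 1-7 scale).
importance = {"Accuracy": 7, "Timeliness": 4, "Cost Effectiveness": 5}

# Step 2: levels from the technical department for each cell of Table 1
# (hypothetical values on an assumed 1-5 scale).
levels_by_choice = {
    ("random sample", "email survey"): {"Accuracy": 4, "Timeliness": 3, "Cost Effectiveness": 3},
    ("random sample", "web survey"): {"Accuracy": 4, "Timeliness": 4, "Cost Effectiveness": 2},
    ("ad hoc sample", "email survey"): {"Accuracy": 2, "Timeliness": 3, "Cost Effectiveness": 4},
    ("ad hoc sample", "web survey"): {"Accuracy": 2, "Timeliness": 5, "Cost Effectiveness": 5},
}

# Steps 3 and 4: compute Q per cell and compare the cells.
ranking = rank_choices(importance, levels_by_choice)
```

Because the importance ratings are organization specific, the same levels can produce a different ranking for a different organization; only the relative differences among the Q values are meaningful.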

To date the following has been done:

  1. The parameters that define data quality have been confirmed (Gendron & D'Onofrio, 2000; Wang & Strong, 1996).
  2. Importance (I) ratings have been analyzed. Those analyses suggest that importance ratings are organization specific.
  3. A set of scales for the variable level (L) has been designed and is being confirmed.

Conclusion

This paper presents a multi-attribute decision model to assist in the selection of the best data collection alternative.  This model is designed to provide managers with a tool to aid in their decision-making.

References

D'Onofrio, M. J., & Gendron, M. S. (2001). Technology Assisted Research Methodologies: A historical perspective of technology-based data collection methods. Paper presented at the Internet Global Summit: A Net Odyssey - Mobility and the Internet - The 11th Annual Internet Society Conference, Stockholm, Sweden.

Drummond, M., Stottard, G., & Torrance, G. (1994). Methods of Economic Evaluation of Healthcare Programmes. New York: Oxford Press.

Gendron, M. S., & D'Onofrio, M. J. (2000). Data Quality in the Healthcare Industry:  An Exploratory Study. Paper presented at the Systemics Cybernetics and Informatics, Orlando, FL.

Haddawy, V. H., & Haddawy, P. (1997). Problem-Focused Incremental Elicitation of Multi-Attribute Utility Models. Paper presented at the Thirteenth Conference on Uncertainty in Artificial Intelligence, Brown University, Providence, Rhode Island, USA.

Martin, M. (2000, May 10). Multi-criteria decision aid, [available: http://www.agena.co.uk/mcda_article/mcda_intro3.html]. Agena [2001, 04/06].

Neyman, J. (1934). On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97, 558-606.

Simon, H. (1977). The New Science of Management Decision. Englewood Cliffs, New Jersey: Prentice-Hall.

Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy:  what data quality means to data consumers. Journal of Management Information Systems, 12(4), 5-34.