919.685.9300 FAX 919.685.9310 admin@niss.org
Digital Government II
Data Confidentiality, Data Quality and Data Integration for Federal Databases:
Foundations to Software Prototypes
Proposal Summary: This is a proposal for a large-scale, cross-disciplinary, high-impact research program to create abstractions, theory, implementable methodology and software prototypes to meet three central, interacting, data-driven challenges facing Federal statistical agencies: data confidentiality (DC), data quality (DQ) and data integration (DI).
The project addresses fundamental research questions in multiple disciplines: computer science, to formulate abstractions and design algorithms that accommodate interactions among DC, DQ and DI; the statistical sciences, to provide decision-theoretic formulations that account for both the risk and the utility of disseminating information, and of the consequences of DC, DQ and DI for inference; and software and systems engineering, to build prototype systems that operate at realistic scales, in order to evaluate and refine new theory and methodology. Complementing these are domain knowledge, to link uses of information to requirements for DC and DQ; and visualization, to support understanding of abstractions, algorithms, and system operation.
The project will be carried out by statistical and computer scientists from the National Institute of Statistical Sciences, Carnegie Mellon University, the University of Maryland College Park, the Institute for Social Research at the University of Michigan, Purdue University, Southern Methodist University and the Los Alamos National Laboratory.
As partners in the project, five leading Federal statistical agencies (the Bureau of Labor Statistics, the Bureau of Transportation Statistics, the Census Bureau, the National Agricultural Statistics Service and the National Center for Education Statistics) will ensure that the research is relevant, timely and applicable. The partners will provide essential access to data and participation of personnel in development and evaluation of methods and software systems.
Software Engineering
Collaborative Research:
Acquiring Accurate Dynamic Field Data Using Lightweight Instrumentation
Proposal Summary: Dynamic analyses, such as profiling and testing, are a key part of state-of-the-art analysis and validation approaches to software quality assurance. The effectiveness of dynamic analyses depends, in part, on the degree to which the sample input data that are exercised reflect the way programs are actually used in the field. Acquiring accurate data is hard, expensive, and rarely done.
Software developers use a variety of program analysis and testing techniques to increase their confidence that a program has particular properties, such as functional correctness, liveness, and scalability. State-of-the-art analysis is powerful and effective, but limited in two ways. First, resources (particularly time, money, and machine cycles) are finite, albeit extensive. Second, it is challenging for developers and testers to fully predict the actual environment and the usage patterns for programs in the field, yet the effectiveness of their analysis activities often depends on the accuracy of that prediction.
In this proposal, we address these two limitations with an approach that augments conventional analysis techniques by instrumenting fielded instances of programs, e.g., during beta test. Our approach increases confidence in particular program properties by combining information acquired from the collection of fielded instances with earlier local analysis and testing information. Each fielded instance will be instrumented in a lightweight way, perturbing it only slightly, and different instances will in general be instrumented to gather different information. Information gathered from the distributed instances will allow us to perform more effective analyses when combined with information computed during the earlier conventional analysis and testing phase.
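The idea of giving each fielded instance a small, different slice of the instrumentation and then pooling the results with in-house data can be illustrated with a minimal Python sketch. It is not the project's tooling: the probe names, the round-robin assignment policy, and the functions `assign_probes`, `run_instance` and `merge_reports` are all hypothetical, chosen only to make the data flow concrete.

```python
from collections import Counter
from typing import List, Set


def assign_probes(probes: List[str], n_instances: int, per_instance: int) -> List[Set[str]]:
    """Give each fielded instance a small subset of probes (round-robin).

    Assumption: a simple round-robin policy; a real system might weight
    probes by how little is already known about them.
    """
    assignments: List[Set[str]] = [set() for _ in range(n_instances)]
    for k in range(n_instances * per_instance):
        assignments[k // per_instance].add(probes[k % len(probes)])
    return assignments


def run_instance(enabled: Set[str], events: List[str]) -> Counter:
    """Simulate one fielded run: only events at enabled probes are recorded,
    so the instance is perturbed only by its own small probe subset."""
    return Counter(e for e in events if e in enabled)


def merge_reports(lab: Counter, field_reports: List[Counter]) -> Counter:
    """Combine earlier in-house (lab) counts with the reports sent back
    by the fielded instances."""
    merged = Counter(lab)
    for report in field_reports:
        merged.update(report)
    return merged
```

For example, with six hypothetical probes `b0` to `b5`, three instances and two probes per instance, each instance records only a third of the probe set, yet the merged report covers every probe that was exercised by some user.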
Preliminary experimental results show our approach (1) uses resources more effectively by parallelizing key aspects of the analysis process and (2) builds better predictive models that capture actual environments and usage patterns by getting early feedback from executions of the program from users as part of the analysis process. As a result, we are confident that distributed dynamic analyses using lightweight instrumentation will succeed.
To demonstrate the feasibility of our techniques and tools in practice, we will focus on two large-scale, production-quality, performance-intensive infrastructure software projects: ACE and TAO, which are widely used, open-source middleware. We will use ACE+TAO to demonstrate empirically that our technologies and processes can enable real-world developers and users to tailor their QA tools, techniques and processes to improve such areas as fault detection, performance evaluation, memory footprint minimization, and power reduction. We have chosen to focus on the ACE+TAO projects because (1) we control their development process and source code, (2) they are production-quality software that embodies many characteristics of performance-intensive infrastructure software, and (3) they exemplify key trends in software R&D.
Web Data
Bayesian Models Linking Web Site Structure
Proposal Summary: This is a proposal to create a set of four increasingly complex, but scalable, Bayesian models that relate the usage (specifically, user page transitions) of a Web site to its structure, and to apply, validate and refine the models using real data from four qualitatively different Web sites—an E-commerce site, a site operated by a large financial institution, a content site and an information site.
The Bayesian models share the one essential characteristic that makes them scalable: the destinations from a given page (whether it is static or generated dynamically) are classes of pages that mirror the tree structure of the site, rather than individual pages. Examples are the parent, children and siblings of a page. Scalability results from replacing the full [page × page] transition matrix by the much “narrower” [page × destination classes] matrix. All four models assume Dirichlet prior distributions for transitions from each page. The first three employ very aggregated classes of transitions, and differ according to whether the transition distributions and the priors are the same for all pages. The fourth model disaggregates the “child” and “sibling” destinations. Calculation of posterior distributions varies in difficulty: some are available in closed form, while others require intensive MCMC computation.
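To make the closed-form case concrete, here is a minimal Python sketch of the standard Dirichlet-multinomial update for a single page, together with the size comparison between the two matrices. It illustrates the general conjugate calculation, not any of the four proposed models specifically; the class list `DEST_CLASSES` (in particular the catch-all `other`), the symmetric prior, and the function names are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical destination classes; the source names parent, children,
# siblings and special destinations such as the home page. "other" is
# an assumed catch-all for everything else.
DEST_CLASSES = ["parent", "child", "sibling", "home", "other"]


def posterior_mean(prior_alpha: List[float], counts: List[int]) -> List[float]:
    """Dirichlet(alpha) prior plus multinomial transition counts gives a
    Dirichlet(alpha + counts) posterior; return its mean vector."""
    post = [a + n for a, n in zip(prior_alpha, counts)]
    total = sum(post)
    return [p / total for p in post]


def matrix_cells(n_pages: int, n_classes: int) -> Tuple[int, int]:
    """Cell counts of the full [page x page] matrix and of the narrower
    [page x destination classes] matrix."""
    return n_pages * n_pages, n_pages * n_classes
```

With a symmetric prior of 1 per class and five observed transitions from a page, all to children, `posterior_mean([1, 1, 1, 1, 1], [0, 5, 0, 0, 0])` puts 6/10 of the posterior mass on `child` and 1/10 on each other class. For a site of 10,000 pages, `matrix_cells(10_000, len(DEST_CLASSES))` compares 100,000,000 cells with 50,000.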
Applications include relating user behavior to site structure (for example, pages with frequent transitions to destinations other than the parent, siblings, children and special destinations such as the home page are dissonant with respect to the Web site structure); comparison of site usage at different times, or for different classes of users; segmentation of sessions; quantification of inter-relationships among pages (which also may not respect the site structure); simulation (for example, to evaluate hardware or application server capabilities); and prediction of user behavior, including forecasts, for example, of the economic impact of promotional campaigns. The ultimate impact is more efficient Web sites that serve users more effectively.
The Bayesian framework allows these applications to be addressed quantitatively, using formal hypothesis tests and predictive distributions. In addition, rigorous model assessment will provide insight into what level of aggregation is appropriate to which analyses of Web data.
Social Networks
Dynamics for Social Network Processes:
Comparing Statistical Models with Intelligent Agents
Proposal Summary: The goal of this research is to reconcile two methods for modeling change in social networks over time—p* models and intelligent agent models. The latter family has received much attention from social scientists but little from mathematicians and almost none from statisticians, and so constitutes a promising and important opportunity for collaboration.
We will contrast the properties of these two approaches, exploring in particular what kinds of qualitative behavior in social networks are captured usefully and interpretably by each. The primary tools will be latent variable representations and dynamical systems analysis. We seek not to declare that one class of models is “better” than the other, but to construct a framework that yields insight into their strengths and limitations in multiple settings. The impact on the social sciences will be dramatic: researchers will be able to choose in a principled manner a model whose dynamic properties are most appropriate to each application.
To evaluate and refine our framework, and to assess its scalability, we will use as a testbed a very complex simulation model that describes the formation of terrorist networks.
Specific components of the research include time-evolving networks; comparison of the evolution of multiple models, which presents challenging “calibration” problems because the models' fundamental formulations do not map readily onto one another; a unified dynamics for social networks and intelligent agents; sensitivity analyses; selection of comparison and validation criteria; and visualization.
The project will have two phases, each lasting one year. The first will be construction of the framework, and the second its implementation and evaluation on the testbed, with results fed back to improve the framework.
The project team spans the social, mathematical, statistical and computer sciences, and is drawn from NISS, the lead institution on the proposal, Carnegie Mellon University, Duke University, North Carolina State University and the University of North Carolina at Chapel Hill. The participants are experienced, influential researchers, each with an established record of cross-disciplinary collaboration. Many subsets of the team have worked with one another previously.
Chemical Informatics
Web-Enabled Virtual Screening
Proposal Summary: The long-term objective of this project is to develop computational algorithms and software to gain theoretical and empirical insights into the use of chemical diversity for determining quantitative structure-activity relationships (QSARs). In addition to addressing scientific and technical goals with respect to QSAR modeling, planning-period tasks will include specific activities to bring together the researchers and to facilitate inter-disciplinary communication. Specific Aim 1 is to develop and enhance collaborations among three broad disciplines: statistics, computer science, and chemistry. This will be accomplished primarily through several intense workshops per year and regularly scheduled status meetings. Specific Aim 2 is to initiate a benchmarking study to compare structural descriptors, modeling strategies, and methods of model assessment. Through a web server, results will be posted from analyzing several datasets using many QSAR modeling techniques, a variety of molecular descriptors, and a number of assessment criteria. Specific Aim 3 is to design and beta-test web accessibility for modeling software. PowerMV, a cheminformatics software tool created at the National Institute of Statistical Sciences, will be upgraded and made web-accessible. Specific Aim 4 is to develop a broad view of cheminformatics tools based on the singular value decomposition and other similar decompositions, where computations take advantage of the high degree of sparseness often exhibited by high-throughput screening (HTS) data sets. The significance of these specific aims, in support of the long-term objective, is to reduce resource requirements for, and thus streamline, the process of drug discovery.
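As a rough illustration of how a decomposition can exploit sparseness, the Python sketch below estimates the largest singular value of a matrix by power iteration while touching only its nonzero entries. This is a generic textbook technique, not PowerMV's method or the algorithm proposed under Specific Aim 4; the dictionary-based `Sparse` representation and all function names are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical sparse format: only nonzero entries are stored,
# keyed by (row, column).
Sparse = Dict[Tuple[int, int], float]


def matvec(a: Sparse, x: List[float], n_rows: int) -> List[float]:
    """Compute A x, visiting only the stored nonzero entries."""
    y = [0.0] * n_rows
    for (i, j), v in a.items():
        y[i] += v * x[j]
    return y


def rmatvec(a: Sparse, y: List[float], n_cols: int) -> List[float]:
    """Compute A^T y, visiting only the stored nonzero entries."""
    x = [0.0] * n_cols
    for (i, j), v in a.items():
        x[j] += v * y[i]
    return x


def top_singular_value(a: Sparse, n_rows: int, n_cols: int, iters: int = 200) -> float:
    """Power iteration on A^T A; each pass costs time proportional to the
    number of nonzeros rather than n_rows * n_cols."""
    x = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-length start vector
    sigma = 0.0
    for _ in range(iters):
        y = matvec(a, x, n_rows)
        sigma = math.sqrt(sum(t * t for t in y))  # ||A x|| for unit x
        if sigma == 0.0:
            return 0.0  # x lies in the null space (e.g. an all-zero matrix)
        z = rmatvec(a, y, n_cols)
        norm = math.sqrt(sum(t * t for t in z))
        x = [t / norm for t in z]
    return sigma
```

The fixed iteration count and all-ones start vector keep the sketch short; a production implementation would use a convergence test and a randomized start, or an established sparse SVD routine.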
Education Statistics
Education Statistics Services Institute-Statistics
Further Information
NAEP Education Statistics Services Institute
Entire site © 2000-2003, National Institute of Statistical Sciences. All Rights Reserved.
This page updated on May 1, 2006 3:23 PM