Core Area: Information Signatures
We create mathematical signatures from text, multimedia, and sensor data, discovering new ways of summarizing key features in large, heterogeneous data sets. We focus on performance and scalability, creating computationally efficient representations of complex data sets.
Research Topics and Products
IN-SPIRE™ provides tools for exploring textual data, including Boolean and “topical” queries, term gisting, and time/trend analysis tools. This suite of tools allows the user to rapidly discover hidden information relationships by reading only pertinent documents. IN-SPIRE™ has been used to explore technical and patent literature, marketing and business documents, web data, accident and safety reports, newswire feeds and message traffic, and more. It has applications in many areas, including information analysis, strategic planning, and medical research.
Many analytical tools are provided to work in concert with the visualizations, allowing users to investigate the document groupings, query the document contents, investigate time-based trends, and much more.
More information can be found at the IN-SPIRE web site.
Keyword Extraction and Themes with RAKE and CAST
The goal of this research is to provide more descriptive cues that give users better insight into the features of a text collection, allowing them to explore with greater precision and to identify or evaluate more specific relationships. To accomplish this, we created the Computation and Analysis of Significant Themes (CAST) algorithm, which computes a set of themes for a collection of documents based on automatically extracted keyword information.
The Rapid Automatic Keyword Extraction (RAKE) algorithm provides this keyword information. RAKE automatically extracts single- and multi-word keywords from individual documents and passes a set of high-value keywords to CAST, which clusters them into themes. Each computed theme comprises a set of highly associated keywords and a set of documents that are highly associated with those keywords.
Whereas many text analysis methods focus on what distinguishes documents, RAKE and CAST focus on what describes documents, ideally characterizing what each document is essentially about. Keywords provide an advantage over other types of signatures as they are readily accessible to a user and can be easily applied to search other information spaces. The value of any particular keyword can be readily evaluated by a user for their particular interests and applied in or adapted to multiple contexts.
RAKE and CAST are implemented in several projects, including IN-SPIRE.
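The core RAKE scoring idea can be sketched in a few lines. This is a from-scratch illustration of the published algorithm, not the IN-SPIRE implementation; the tiny stop list and sample text are invented for the example, and a real deployment would also split candidates at punctuation and use a corpus-appropriate stop list.

```python
import re

# Sketch of RAKE-style scoring: candidate phrases are maximal runs of
# non-stop words; each word is scored by degree/frequency over the
# candidates, and a phrase scores the sum of its word scores.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "is",
              "are", "to", "on", "from"}  # illustrative stop list only

def candidate_phrases(text):
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOP_WORDS:
            if current:
                phrases.append(tuple(current))
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    phrases = candidate_phrases(text)
    freq, degree = {}, {}
    for phrase in phrases:
        for w in phrase:
            freq[w] = freq.get(w, 0) + 1
            # degree counts co-occurrence within candidate phrases
            degree[w] = degree.get(w, 0) + len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

scores = rake_scores(
    "the rapid extraction of keywords from individual documents "
    "and the analysis of individual documents"
)
```

Because degree rewards words that co-occur with others, multi-word candidates such as "individual documents" outscore isolated common words, which is what makes the extracted keywords readable descriptors of content.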
The Starlight Information Visualization System™ graphically depicts information, dramatically accelerating and improving human ability to derive meaningful knowledge from increasingly large and complex information resources. It is simultaneously a powerful information analysis tool and a platform for conducting advanced visualization research.
Starlight is explicitly designed to manipulate the types of relationships humans need to understand in order to solve complex, multifaceted, real-world problems. Graphical representations enable the underlying relationships to be visually interpreted. Viewers can interactively move among multiple representations of the same information in order to uncover correlations that may span multiple relationship types.
For more information, see the Starlight web site.
Edge Computing is pushing the frontier of computing applications, data, and services away from centralized nodes to the logical extremes of a network. It enables analytics and knowledge generation to occur at the source of the data. This approach requires leveraging resources that may not be continuously connected to a network, such as laptops, smartphones, tablets, and sensors.
An Edge Computing reference platform named Kaval, running on the Android operating system, has been developed at PNNL to help provide Edge Computing solutions. Kaval is a common platform for clients who need the ability to collect and analyze data on a mobile device while potentially sharing that data with other devices in the field.
Download Kaval flier (PDF)
Scalable Reasoning System
The Scalable Reasoning System (SRS) is an analytic framework for developing web-based applications. Drawing on a growing library of visual and analytic components, developers can create custom applications for any domain, from any data source.
SRS combines the simplicity and accessibility of web-based solutions with the power of an extensible, adaptable back-end analytics platform. SRS applications have been deployed to:
- Analyze unstructured text
- Explore hierarchical taxonomies (like Visual Patent Search)
- Support real-time analysis of trends and patterns in streaming social media data (SRS Social Media Analysis Flier and Video)
- Organize and provide visual search and navigation of large document repositories
Download Flier (PDF)
Canopy is a suite of visual analytic tools designed to support deep investigation of large multimedia collections. Canopy combines the understanding of data represented in multiple formats (video, image, and text) and presents that information to users through new visual representations. Users can explore relationships between documents and among subcomponents of documents. Canopy incorporates cutting-edge extraction techniques, state-of-the-art content analysis algorithms, and novel interactive information visualizations so that analysts can comprehend and articulate the big picture. Canopy helps reduce analysts’ workload and, ultimately, the effort of identifying critical intelligence for decision makers.
- Aids and expedites triage of multimedia data. For example, Canopy can help with analytic problems such as, “I have data from a large collection of files; help me investigate this collection to determine the most relevant files without my having to watch every movie, view every image, and read all the text.”
- Bootstraps the analysis process by providing visual clues to potential data relationships and highlights connections, giving the user an understanding of all the data and additional context of its structure. This facilitates discovering previously unknown content and/or unexpected or non-obvious relationships.
- Provides insight into multimedia content similarities and relationships by discovering and visualizing the relationships in an interactive and dynamic user interface.
- Provides true multimedia analysis, not just stovepipe analysis of an individual type of data augmented with metadata. For example, a Word document with components such as embedded images and text is evaluated as a cohesive information item, where the association of these various document elements is preserved.
Download Flier (PDF)
Lighthouse is a content analysis system that facilitates the analysis, synthesis, and retrieval of multimedia content. It accomplishes this through an ensemble of state-of-the-art characterization and decision support processes that provide search, classification, summarization, and temporal analysis. The system gives our customers the ability to find image and video duplicates, including both exact matches and fuzzy or partial matches, and includes classifiers for object recognition and face detection. Lighthouse provides temporal analysis of videos, including shot boundary detection, video summaries, and event detection, and it summarizes the image and video content of a collection by clustering similar content. Lighthouse can also be used to identify relationships in collections of data. For example, a user may suspect that a video has been created from a portion of another video; to learn more, the user can explore the similarity measures provided by Lighthouse to trace the repurposed video back to its video of origin. Lighthouse matches industry standards, as demonstrated through an extensive set of benchmarks on collections commonly used within the image processing community.
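As a generic illustration of the fuzzy duplicate-matching idea (Lighthouse's actual characterization algorithms are not detailed here, so the average-hash signature below is an assumed stand-in, not its implementation), an image can be reduced to a bit signature, and near-duplicates are pairs whose signatures differ in few bits:

```python
# Perceptual "average hash" sketch: threshold each pixel against the mean
# brightness to get a bit signature; compare signatures by Hamming distance.

def average_hash(pixels):
    """pixels: 2D list of grayscale values (assumed already downsampled)."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    # 1 where the pixel is brighter than the mean, 0 otherwise
    return tuple(1 if v > mean else 0 for v in flat)

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

def near_duplicate(h1, h2, max_bits=2):
    # exact duplicates have distance 0; "fuzzy" matches a small distance
    return hamming(h1, h2) <= max_bits

# toy 2x2 "images" invented for the example
img = [[10, 200], [220, 30]]
tweaked = [[12, 198], [215, 35]]    # same image with slight noise
different = [[200, 10], [30, 220]]  # inverted layout
```

Because small brightness perturbations rarely flip a pixel across the mean, the noisy copy hashes identically while the rearranged image lands far away, which is the property that makes signature comparison robust to re-encoding.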
Graph analytics is the study and analysis of data that can be transformed into a graph representation consisting of nodes and links. Scientists at the Pacific Northwest National Laboratory (PNNL) have been actively involved in graph analytics R&D in social network, cyber communication, electric power grid, critical infrastructure, bio-informatics, and earth sciences applications. The mission of graph analytics research at PNNL goes beyond research for its own sake: its essential and enduring purpose is to produce pragmatic, working solutions to real-life challenges.
We have developed a series of cutting-edge graph analytics technologies to explore and analyze graphs with different sizes and complexities. For the exploration of small world graphs such as a social network, we developed the concept of a graph signature that extracts the local features of a graph node or set of nodes and used it to supplement the exploration of a complicated graph filled with hidden features. For larger graphs with about one million nodes, we further developed the concept of a multi-resolution, middle-out, cross-zooming technique that allows users to interactively explore their graphs on a common desktop computer. Currently, we are developing the concept of an extreme-scale graph analytics pipeline designed to handle graphs with hundreds of millions of nodes and tens of billions of links. Much of our work developed at PNNL has been loosely integrated into a graph analytics library known as Have Green.
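The graph-signature concept can be sketched as a small vector of local features computed around a node. The specific features below (degree, triangle count, mean neighbor degree) are illustrative assumptions, not PNNL's actual signature definition:

```python
# Sketch of a "graph signature": summarize a node by features of its
# local neighborhood, so similar local structures yield similar vectors.

def neighbors(edges, node):
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

def graph_signature(edges, node):
    nbrs = neighbors(edges, node)
    degree = len(nbrs)
    # triangles: pairs of this node's neighbors that are themselves connected
    triangles = sum(
        1 for a in nbrs for b in nbrs
        if a < b and ((a, b) in edges or (b, a) in edges)
    )
    avg_nbr_degree = (
        sum(len(neighbors(edges, n)) for n in nbrs) / degree if degree else 0.0
    )
    return (degree, triangles, avg_nbr_degree)

# toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3
edges = {(1, 2), (2, 3), (1, 3), (3, 4)}
sig = graph_signature(edges, 3)
```

Comparing such vectors lets an analyst find nodes whose neighborhoods "look alike" without inspecting the whole graph, which is the sense in which signatures supplement exploration of a complicated graph.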
Download Flier (PDF)
To help analysts detect and assess potentially malicious events in large amounts of streaming computer network traffic, PNNL researchers have developed a new behavioral model-based anomaly detection technique. The Correlation Layers for Information Query and Exploration (CLIQUE) system builds models of expected behavior for user-defined host groups on a network and compares these models against a specified time window of data, which may be streaming in real time or loaded for exploratory analysis, to generate early indicators of 'non-normal' network activity.
CLIQUE's visual interface allows analysts to view detailed displays of network activity and spot the machines, buildings, facilities, or other sources of traffic behaving anomalously. Users can navigate through their data temporally, viewing time periods as short as a few minutes or as long as several days. CLIQUE models and visualizations are designed to scale to immense data volumes, operating on datasets comprising billions of transactions per day, helping to meet the data-intensive cyber security challenge.
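A heavily simplified sketch of the behavioral-modeling idea: learn a baseline of per-interval traffic counts for a host group, then flag time windows that deviate strongly from it. CLIQUE's real models and streaming pipeline are far more sophisticated; the z-score test and the counts below are invented for illustration.

```python
import statistics

# Baseline behavioral model for one host group: mean and standard deviation
# of flow counts per time interval, learned from normal-operation history.

def build_model(history):
    """history: flow counts per interval observed during normal operation."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomalous(model, observed, threshold=3.0):
    # flag an interval whose count sits more than `threshold` standard
    # deviations from the learned baseline
    mean, stdev = model
    return abs(observed - mean) / stdev > threshold

baseline = [100, 104, 98, 101, 97, 103, 99, 102]  # hypothetical counts
model = build_model(baseline)
```

In a streaming setting the same comparison runs continuously as each new interval closes, which is what turns the model into an early indicator rather than an after-the-fact report.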
Download Flier (PDF)
Related Papers
Interactive Visual Comparison of Multimedia Data through Type-specific Views
Burtner ER, SJ Bohn, and DA Payne. 2013. "Interactive Visual Comparison of Multimedia Data through Type-specific Views." In Proceedings of the SPIE: Visualization and Data Analysis, Paper No. 86540M. SPIE, Bellingham, WA. doi:10.1117/12.2004735
Analysts who work with collections of multimedia to perform information foraging understand how difficult it is to connect information across diverse sets of mixed media. The wealth of information from blogs, social media, and news sites often can provide actionable intelligence; however, many of the tools used on these sources of content are not capable of multimedia analysis because they only analyze a single media type. As such, analysts are taxed to keep a mental model of the relationships among each of the media types when generating the broader content picture. To address this need, we have developed Canopy, a novel visual analytic tool for analyzing multimedia. Canopy provides insight into the multimedia data relationships by exploiting the linkages found in text, images, and video co-occurring in the same document and across the collection. Canopy connects derived and explicit linkages and relationships through multiple connected visualizations to aid analysts in quickly summarizing, searching, and browsing collected information to explore relationships and align content. In this paper, we will discuss the features and capabilities of the Canopy system and walk through a scenario illustrating how this system might be used in an operational environment.
Coherent Image Layout using an Adaptive Visual Vocabulary
Dillard SE, MJ Henry, SJ Bohn, and LJ Gosink. 2012. "Coherent Image Layout using an Adaptive Visual Vocabulary." In IS&T/SPIE Electronic Imaging, Proc. SPIE 8661, Image Processing: Machine Vision Applications VI, Paper No. 86610Q. PNNL-SA-92482, Pacific Northwest National Laboratory, Richland, WA. doi:10.1117/12.2004733
When querying a huge image database containing millions of images, the result of the query may still contain many thousands of images that need to be presented to the user. We consider the problem of arranging such a large set of images into a visually coherent layout, one that places similar images next to each other. Image similarity is determined using a bag-of-features model, and the layout is constructed from a hierarchical clustering of the image set by mapping an in-order traversal of the hierarchy tree into a space-filling curve. This layout method provides strong locality guarantees so we are able to quantitatively evaluate performance using standard image retrieval benchmarks. Performance of the bag-of-features method is best when the vocabulary is learned on the image set being clustered. Because learning a large, discriminative vocabulary is a computationally demanding task, we present a novel method for efficiently adapting a generic visual vocabulary to a particular dataset. We evaluate our clustering and vocabulary adaptation methods on a variety of image datasets and show that adapting a generic vocabulary to a particular set of images improves performance on both hierarchical clustering and image retrieval tasks.
Speech Information Retrieval: a Review
Hafen RP, and MJ Henry. 2012. "Speech Information Retrieval: a Review." Multimedia Systems 18(6):499-518. doi:10.1007/s00530-012-0266-0
Speech is an information-rich component of multimedia. Information can be extracted from a speech signal in a number of different ways, and thus there are several well-established speech signal analysis research fields. These fields include speech recognition, speaker recognition, event detection, and fingerprinting. The information that can be extracted from tools and methods developed in these fields can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major speech analysis fields. The goal is to introduce enough background for someone new in the field to quickly gain high-level understanding and to provide direction for further study.
Extreme Scale Visual Analytics
Wong PC, HW Shen, and V Pascucci. 2012. "Extreme Scale Visual Analytics." IEEE Computer Graphics and Applications 32(4):23-25. doi:10.1109/MCG.2012.73
Extreme-scale visual analytics (VA) is about applying VA to extreme-scale data. The articles in this special issue examine advances related to extreme-scale VA problems, their analytical and computational challenges, and their real-world applications.
In Silico Identification Software (ISIS): A Machine Learning Approach to Tandem Mass Spectral Identification of Lipids
Kangas LJ, TO Metz, G Isaac, BT Schrom, B Ginovska-Pangovska, L Wang, L Tan, RR Lewis, and JH Miller. 2012. "In Silico Identification Software (ISIS): A Machine Learning Approach to Tandem Mass Spectral Identification of Lipids." Bioinformatics 28(13):1705-1713. doi:10.1093/bioinformatics/bts194
MOTIVATION: Liquid chromatography-mass spectrometry-based metabolomics has gained importance in the life sciences, yet it is not supported by software tools for high throughput identification of metabolites based on their fragmentation spectra. An algorithm (ISIS: in silico identification software) and its implementation are presented and show great promise in generating in silico spectra of lipids for the purpose of structural identification. Instead of using chemical reaction rate equations or rules-based fragmentation libraries, the algorithm uses machine learning to find accurate bond cleavage rates in a mass spectrometer employing collision-induced dissociation tandem mass spectrometry. RESULTS: A preliminary test of the algorithm with 45 lipids from a subset of lipid classes shows both high sensitivity and specificity.
A Space-Filling Visualization Technique for Multivariate Small World Graphs
Wong PC, HP Foote, PS Mackey, G Chin, Jr, Z Huang, and JJ Thomas. 2012. "A Space-Filling Visualization Technique for Multivariate Small World Graphs." IEEE Transactions on Visualization and Computer Graphics 18(5):797-809. doi:10.1109/TVCG.2011.99
We introduce an information visualization technique, known as GreenCurve, for large multivariate sparse graphs that exhibit small-world properties. Our fractal-based design approach uses spatial cues to approximate the node connections and thus eliminates the links between the nodes in the visualization. The paper describes a robust algorithm to order the neighboring nodes of a large sparse graph by solving the Fiedler vector of its graph Laplacian, and then fold the graph nodes into a space-filling fractal curve based on the Fiedler vector. The result is a highly compact visualization that gives a succinct overview of the graph with guaranteed visibility of every graph node. GreenCurve is designed with the power grid infrastructure in mind. It is intended for use in conjunction with other visualization techniques to support electric power grid operations. The research and development of GreenCurve was conducted in collaboration with domain experts who understand the challenges and possibilities intrinsic to the power grid infrastructure. The paper reports a case study on applying GreenCurve to a power grid problem and presents a usability study to evaluate the design claims that we set forth.
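The node-ordering step described in the abstract can be sketched as follows: approximate the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue) by power iteration on a shifted Laplacian, then sort nodes by its entries. This is a from-scratch illustration of the idea, not the GreenCurve code, and the subsequent fold into a space-filling curve is omitted.

```python
import random

# Order the nodes of a sparse undirected graph by the Fiedler vector of its
# Laplacian L = D - A. Power iteration on M = shift*I - L converges to the
# Fiedler direction once the constant vector (L's trivial eigenvector) is
# projected out at each step.

def fiedler_order(nodes, edges, iters=2000):
    n = len(nodes)
    index = {v: i for i, v in enumerate(nodes)}
    degree = [0] * n
    adj = [[] for _ in range(n)]
    for a, b in edges:
        i, j = index[a], index[b]
        adj[i].append(j)
        adj[j].append(i)
        degree[i] += 1
        degree[j] += 1
    shift = 2 * max(degree) + 1  # exceeds L's largest eigenvalue
    x = [random.random() for _ in range(n)]
    for _ in range(iters):
        # project out the constant vector
        mean = sum(x) / n
        x = [v - mean for v in x]
        # multiply by M = shift*I - D + A
        y = [(shift - degree[i]) * x[i] + sum(x[j] for j in adj[i])
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return sorted(nodes, key=lambda v: x[index[v]])

random.seed(0)
nodes = [3, 0, 4, 1, 2]              # a path 0-1-2-3-4, listed shuffled
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
order = fiedler_order(nodes, edges)  # recovers the path order (or its reverse)
```

Sorting by the Fiedler vector places strongly connected nodes near each other in one dimension, which is exactly the locality property the space-filling fold then preserves in two dimensions.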
A Highly Parallel Implementation of K-Means for Multithreaded Architecture
Mackey PS, JT Feo, PC Wong, and Y Chen. 2011. "A Highly Parallel Implementation of K-Means for Multithreaded Architecture." In 19th High Performance Computing Symposium (HPC 2011): SCS Spring Simulation Multiconference (SpringSim 2011), April 3-7, 2011, Boston, MA. ACM, New York, NY.
We present a parallel implementation of the popular k-means clustering algorithm for massively multithreaded computer systems, as well as a parallelized version of the KKZ seed selection algorithm. We demonstrate that as system size increases, sequential seed selection can become a bottleneck. We also present an early attempt at parallelizing k-means that highlights critical performance issues when programming massively multithreaded systems. For our case studies, we used data collected from electric power simulations and run on the Cray XMT.
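KKZ seeding can be sketched in a few lines. This serial version (an illustration, not the paper's parallel implementation) also shows why seeding can bottleneck a parallel pipeline: each new seed depends on all previously chosen ones, so the outer loop cannot be parallelized, only the inner distance scans.

```python
# KKZ seed selection for k-means: the first center is the point of maximum
# norm; each further center is the point farthest from its nearest
# already-chosen center.

def kkz_seeds(points, k):
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    # first seed: the point with the largest norm
    seeds = [max(points, key=lambda p: sum(a * a for a in p))]
    while len(seeds) < k:
        # next seed: the point farthest from its nearest existing seed
        seeds.append(max(points, key=lambda p: min(dist2(p, s) for s in seeds)))
    return seeds

# two tight clusters plus an outlying point, invented for the example
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.0, 5.0)]
seeds = kkz_seeds(points, 3)
```

Unlike random seeding, KKZ deterministically spreads the initial centers across the data, which typically reduces the number of k-means iterations needed afterward.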
Graph Analytics—Lessons Learned and Challenges Ahead
Wong PC, C Chen, C Gorg, B Shneiderman, J Stasko, and JJ Thomas. 2011. "Graph Analytics—Lessons Learned and Challenges Ahead." IEEE Computer Graphics and Applications 31(5):18-29. doi:10.1109/MCG.2011.72
Graph analytics is one of the most influential and important R&D topics in the visual analytics community. Researchers with diverse backgrounds from information visualization, human-computer interaction, computer graphics, graph drawing, and data mining have pursued graph analytics from scientific, technical, and social approaches. These studies have addressed both distinct and common challenges. Past successes and mistakes can provide valuable lessons for revising the research agenda. In this article, six researchers from four academic and research institutes identify graph analytics' fundamental challenges and present both insightful lessons learned from their experience and good practices in graph analytics research. The goal is to critically assess those lessons and shed light on how they can stimulate research and draw attention to grand challenges for graph analytics. The article also establishes principles that could lead to measurable standards and criteria for research.
Automatic Keyword Extraction from Individual Documents
Rose SJ, DW Engel, NO Cramer, and WE Cowley. 2010. Automatic Keyword Extraction from Individual Documents. Chapter 1 in Text Mining: Application and Theory, vol. 1, ed. MW Berry and J Kogan, pp. 3-20. John Wiley & Sons, Chichester, United Kingdom.
This paper introduces a novel and domain-independent method for automatically extracting keywords, as sequences of one or more words, from individual documents. We describe the method's configuration parameters and algorithm, and present an evaluation on a benchmark corpus of technical abstracts. We also present a method for generating lists of stop words for specific corpora and domains, and evaluate its ability to improve keyword extraction on the benchmark corpus. Finally, we apply our method of automatic keyword extraction to a corpus of news articles and define metrics for characterizing the exclusivity, essentiality, and generality of extracted keywords within a corpus.
Events and Trends in Text Streams
Engel DW, PD Whitney, and NO Cramer. 2010. Events and Trends in Text Streams. Chapter 9 in Text Mining: Application and Theory, vol. 1, ed. MW Berry and J Kogan, pp. 3-20. John Wiley & Sons, Chichester, United Kingdom.
Text streams--collections of documents or messages that are generated and observed over time--are ubiquitous. Our research and development are targeted at developing algorithms to find and characterize changes in topic within text streams. To date, this research has demonstrated the ability to detect and describe 1) short duration, atypical events and 2) the emergence of longer-term shifts in topical content. This technology has been applied to predefined temporally ordered document collections but is also suitable for application to near-real-time textual data streams.
Real-Time Visualization of Network Behaviors for Situational Awareness
Best DM, SJ Bohn, DV Love, AS Wynne, and WA Pike. 2010. Real-Time Visualization of Network Behaviors for Situational Awareness. In Proceedings of the Seventh International Symposium on Visualization for Cyber Security, pp. 79-90. ACM, New York, NY.
Plentiful, complex, and dynamic data make understanding the state of an enterprise network difficult. Although visualization can help analysts understand baseline behaviors in network traffic and identify off-normal events, visual analysis systems often do not scale well to operational data volumes (in the hundreds of millions to billions of transactions per day) nor to analysis of emergent trends in real-time data. We present a system that combines multiple, complementary visualization techniques coupled with in-stream analytics, behavioral modeling of network actors, and a high-throughput processing platform called MeDICi. This system provides situational understanding of real-time network activity to help analysts take proactive response steps. We have developed these techniques using requirements gathered from the government users for which the tools are being developed. By linking multiple visualization tools to a streaming analytic pipeline, and designing each tool to support a particular kind of analysis (from high-level awareness to detailed investigation), analysts can understand the behavior of a network across multiple levels of abstraction.
Multimedia Analysis + Visual Analytics = Multimedia Analytics
Chinchor N, JJ Thomas, PC Wong, M Christel, and MW Ribarsky. 2010. Multimedia Analysis plus Visual Analytics = Multimedia Analytics. IEEE Computer Graphics and Applications 30(5):52-60.
Multimedia analysis has focused on images, video, and to some extent audio and has made progress in single channels excluding text. Visual analytics has focused on the user interaction with data during the analytic process plus the fundamental mathematics and has continued to treat text as did its precursor, information visualization. The general problem we address in this tutorial is the combining of multimedia analysis and visual analytics to deal with multimedia information gathered from different sources, with different goals or objectives, and containing all media types and combinations in common usage.
High-Throughput Real-Time Network Flow Visualization
Best DM, DV Love, WA Pike, and SJ Bohn. 2010. High-Throughput Real-Time Network Flow Visualization. FloCon2010, New Orleans, LA.
This presentation and demonstration will introduce two interactive, high-throughput visual analysis tools, Traffic Circle and CLIQUE, and will discuss the analytic requirements of the U.S. government cyber security capabilities for which the tools were developed and are being deployed. Both tools take a time-based approach to visual analysis, with Traffic Circle displaying raw data and CLIQUE computing real-time behavioral models. Performance benchmarks will also be discussed; the tools are currently capable of ingesting and presenting data volumes on the order of hundreds of millions of flow records at once.
A Novel Application of Parallel Betweenness Centrality to Power Grid Contingency Analysis
Jin S, Z Huang, Y Chen, D Chavarria-Miranda, JT Feo, and PC Wong. 2010. A Novel Application of Parallel Betweenness Centrality to Power Grid Contingency Analysis. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010), pp. 1-7. Institute of Electrical and Electronics Engineers, Piscataway, NJ.
In Energy Management Systems, contingency analysis is commonly performed for identifying and mitigating potentially harmful power grid component failures. The exponentially increasing combinatorial number of failure modes imposes a significant computational burden for massive contingency analysis. It is critical to select a limited set of high-impact contingency cases within the constraint of computing power and time requirements to make it possible for real-time power system vulnerability assessment. In this paper, we present a novel application of parallel betweenness centrality to power grid contingency selection. We cross-validate the proposed method using the model and data of the western US power grid, and implement it on a Cray XMT system - a massively multithreaded architecture - leveraging its advantages for parallel execution of irregular algorithms, such as graph analysis. We achieve a speedup of 55 times (on 64 processors) compared against the single-processor version of the same code running on the Cray XMT. We also compare an OpenMP-based version of the same code running on an HP Superdome shared-memory machine. The performance of the Cray XMT code shows better scalability and resource utilization, and shorter execution time for large-scale power grids. This proposed approach has been evaluated in PNNL's Electricity Infrastructure Operations Center (EIOC). It is expected to provide a quick and efficient solution to massive contingency selection problems to help power grid operators to identify and mitigate potential widespread cascading power grid failures in real time.
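Betweenness centrality itself can be computed with Brandes' textbook algorithm, sketched below for unweighted graphs (a serial illustration, not the paper's Cray XMT implementation). Components with high betweenness lie on many shortest paths, which is what makes them candidate high-impact contingencies.

```python
from collections import deque

# Brandes' algorithm: one BFS per source counts shortest paths (sigma), then
# a reverse sweep accumulates each node's pair dependencies (delta).

def betweenness(graph):
    """graph: dict mapping each node to a list of neighbors (undirected)."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        dist = {s: 0}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        preds = {v: [] for v in graph}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate dependencies from the BFS frontier inward
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # each undirected pair was counted from both endpoints
    return {v: c / 2 for v, c in bc.items()}

# a "bowtie": node 2 bridges two triangles, so it dominates shortest paths
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
bc = betweenness(graph)
```

In the contingency-selection setting the same ranking, computed in parallel, picks out the grid components whose failure would disrupt the most shortest paths.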
A Multi-Level Middle-Out Cross-Zooming Approach for Large Graph Analytics
Wong PC, PS Mackey, KA Cook, RM Rohrer, HP Foote, and MA Whiting. 2009. A Multi-Level Middle-Out Cross-Zooming Approach for Large Graph Analytics. In IEEE Symposium on Visual Analytics Science and Technology (VAST 2009), ed. J Stasko and JJ van Wijk, pp. 147-154. IEEE, Piscataway, NJ.
This paper presents a working graph analytics model that embraces the strengths of the traditional top-down and bottom-up approaches with a resilient crossover concept to exploit the vast middle-ground information overlooked by the two extreme analytical approaches. Our graph analytics model is developed in collaboration with researchers and users, who carefully studied the functional requirements that reflect the critical thinking and interaction pattern of a real-life intelligence analyst. To evaluate the model, we implement a system prototype, known as GreenHornet, which allows our analysts to test the theory in practice, identify the technological and usage-related gaps in the model, and then adapt the new technology in their work space. The paper describes the implementation of GreenHornet and compares its strengths and weaknesses against the other prevailing models and tools.
A Novel Visualization Technique for Electric Power Grid Analytics
Wong PC, K Schneider, PS Mackey, HP Foote, G Chin, Jr, R Guttromson, and JJ Thomas. 2009. "A Novel Visualization Technique for Electric Power Grid Analytics." IEEE Transactions on Visualization and Computer Graphics 15(3):410-423.
The application of information visualization holds tremendous promise for the electric power industry, but its potential has so far not been sufficiently exploited by the visualization community. Prior work on visualizing electric power systems has been limited to depicting raw or processed information on top of a geographic layout. Little effort has been devoted to visualizing the physics of the power grids, which ultimately determines the condition and stability of the electricity infrastructure. Based on this assessment, we developed a novel visualization system prototype, GreenGrid, to explore the planning and monitoring of the North American Electricity Infrastructure. The paper discusses the rationale underlying the GreenGrid design, describes its implementation and performance details, and assesses its strengths and weaknesses against the current geographic-based power grid visualization. We also present a case study using GreenGrid to analyze the information collected moments before the last major electric blackout in the Western United States and Canada, and a usability study to evaluate the practical significance of our design in simulated real-life situations. Our result indicates that many of the disturbance characteristics can be readily identified with the proper form of visualization.
Describing Story Evolution from Dynamic Information Streams
Rose SJ, RS Butner, WE Cowley, ML Gregory, and J Walker. 2009. Describing Story Evolution from Dynamic Information Streams. In IEEE Symposium on Visual Analytics Science and Technology (IEEE VAST 2009), Oct. 12-13, 2009, Atlantic City, NJ, pp. 99-106. IEEE, Piscataway, NJ.
Sources of streaming information, such as news syndicates, publish information continuously. Information portals and news aggregators list the latest information from around the world, enabling information consumers to easily identify events in the past 24 hours. The volume and velocity of these streams cause information from prior days to quickly vanish despite its utility in providing an informative context for interpreting new information. Few capabilities exist to support an individual attempting to identify or understand trends and changes in streaming information over time. The burden of retaining prior information and integrating it with the new is left to the skills, determination, and discipline of each individual. In this paper we present a visual analytics system for linking essential content from information streams over time into dynamic stories that develop and change over multiple days. We describe particular challenges to the analysis of streaming information and explore visual representations for showing story change and evolution over time.
Two-stage Framework for Visualization of Clustered High Dimensional Data
Choo J, SJ Bohn, and H Park. 2009. Two-stage Framework for Visualization of Clustered High Dimensional Data. In IEEE Symposium on Visual Analytics Science and Technology (IEEE VAST). Pacific Northwest National Laboratory, Richland, WA. [Unpublished]
In this paper, we discuss 2D visualization methods for high-dimensional data that are clustered and whose associated label information is available. We propose a two-stage framework for visualization of such data based on dimension reduction methods. In the first stage, we obtain the reduced dimensional data by a supervised dimension reduction method such as linear discriminant analysis that preserves the original cluster structure in terms of its criterion. The resulting optimal reduced dimension depends on the optimization criteria and is often larger than 2. In the second stage, in order to further reduce the dimension to 2 for visualization purposes, we apply another dimension reduction method such as principal component analysis that minimizes the distortion in the lower dimensional representation of the data obtained in the first stage. Using this framework, we propose several two-stage methods, and present their theoretical characteristics as well as experimental comparisons on both artificial and real-world text data sets.
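As a minimal sketch of this two-stage idea (our own illustrative NumPy code, not the authors' implementation), the first stage below is a scatter-matrix linear discriminant analysis and the second stage a plain PCA down to 2D:

```python
import numpy as np

def lda_reduce(X, y, dim):
    """Stage 1 (supervised): LDA via between/within-class scatter matrices."""
    n_feat = X.shape[1]
    Sw = np.zeros((n_feat, n_feat))   # within-class scatter
    Sb = np.zeros((n_feat, n_feat))   # between-class scatter
    mean = X.mean(axis=0)
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # Directions maximizing between-class vs. within-class scatter.
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    top = np.argsort(vals.real)[::-1][:dim]
    return (X @ vecs[:, top]).real

def pca_reduce(X, dim=2):
    """Stage 2 (unsupervised): PCA minimizes distortion in the final 2-D view."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

# Toy data: four 10-D Gaussian clusters; LDA yields k-1 = 3 dims, PCA yields 2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3.0 * k, scale=1.0, size=(30, 10)) for k in range(4)])
y = np.repeat(np.arange(4), 30)
coords_2d = pca_reduce(lda_reduce(X, y, dim=3), dim=2)
```

Because LDA can separate k clusters in at most k-1 dimensions, the intermediate dimension here is 3, which is exactly the "often larger than 2" situation the second stage resolves.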
A Dynamic Multiscale Magnifying Tool for Exploring Large Sparse Graphs
Wong PC, HP Foote, PS Mackey, G Chin, Jr, HJ Sofia, and JJ Thomas. 2008. "A Dynamic Multiscale Magnifying Tool for Exploring Large Sparse Graphs." Information Visualization 7:105-117.
We present an information visualization tool, known as GreenMax, to visually explore large small-world graphs with up to a million graph nodes on a desktop computer. A major motivation for scanning a small-world graph in such a dynamic fashion is the demanding goal of identifying not just the well-known features but also the unknown–known and unknown–unknown features of the graph. GreenMax uses a highly effective multilevel graph drawing approach to pre-process a large graph by generating a hierarchy of increasingly coarse layouts that later support the dynamic zooming of the graph. This paper describes the graph visualization challenges, elaborates our solution, and evaluates the contributions of GreenMax in the larger context of visual analytics on large small-world graphs. We report the results of two case studies using GreenMax; the results support our claim that GreenMax can be used to locate unexpected features or structures within a graph.
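One standard way to build such a hierarchy of increasingly coarse layouts is repeated edge matching: greedily match adjacent node pairs, contract each pair into one coarse node, and recurse. The sketch below is our own simplified illustration of that pattern, not GreenMax's implementation:

```python
import random

def coarsen(nodes, edges, seed=0):
    """One coarsening pass: contract a greedy maximal edge matching."""
    rng = random.Random(seed)
    order = list(edges)
    rng.shuffle(order)
    coarse_of = {}
    next_id = 0
    for u, v in order:               # greedily match disjoint endpoint pairs
        if u not in coarse_of and v not in coarse_of:
            coarse_of[u] = coarse_of[v] = next_id
            next_id += 1
    for n in nodes:                  # unmatched nodes survive on their own
        if n not in coarse_of:
            coarse_of[n] = next_id
            next_id += 1
    coarse_edges = {(min(coarse_of[u], coarse_of[v]), max(coarse_of[u], coarse_of[v]))
                    for u, v in edges if coarse_of[u] != coarse_of[v]}
    return list(range(next_id)), coarse_edges, coarse_of

def build_hierarchy(nodes, edges, stop=4):
    """Generate increasingly coarse graphs until the coarsest one is tiny."""
    levels = [(nodes, set(edges))]
    while len(levels[-1][0]) > stop:
        n, e, _ = coarsen(levels[-1][0], levels[-1][1])
        if len(n) == len(levels[-1][0]):   # no progress; stop
            break
        levels.append((n, e))
    return levels

# Toy ring graph with 16 nodes.
nodes = list(range(16))
edges = [(i, (i + 1) % 16) for i in range(16)]
hierarchy = build_hierarchy(nodes, edges)
```

A layout engine would then draw the coarsest level first and refine node positions level by level, which is what makes dynamic zooming over very large graphs tractable.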
BioGraphE: High-performance bionetwork analysis using the Biological Graph Environment
Chin G, Jr, D Chavarría-Miranda, GC Nakamura, and HJ Sofia. 2008. "BioGraphE: High-performance bionetwork analysis using the Biological Graph Environment." BMC Bioinformatics.
We introduce a computational framework for graph analysis called the Biological Graph Environment (BioGraphE), which provides a general, scalable integration platform for connecting graph problems in biology to optimized computational solvers and high-performance systems. This framework enables biology researchers and computational scientists to identify and deploy network analysis applications and to easily connect them to efficient and powerful computational software and hardware that are specifically designed and tuned to solve complex graph problems. In our particular application of BioGraphE to support network analysis in genome biology, we investigate the use of a Boolean satisfiability solver known as Survey Propagation as a core computational solver executing on standard high-performance parallel systems, as well as multithreaded architectures.
Scalable Visual Analytics of Massive Textual Datasets
Krishnan M, SJ Bohn, WE Cowley, VL Crow, and J Nieplocha. 2007. Scalable Visual Analytics of Massive Textual Datasets. In IEEE International Parallel & Distributed Processing Symposium. Long Beach, CA, March 26-30, 2007.
This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools aid information analysts in interacting with and understanding large textual information content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte data sets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.
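The core pattern behind such a parallel text engine is scatter/gather: partition the documents, compute partial term statistics in parallel, and merge the results. The following is our own minimal single-machine sketch of that pattern; the paper's engine targets cluster architectures and is far more sophisticated:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def term_counts(docs):
    """Map step: tokenize one partition of documents into term counts."""
    c = Counter()
    for doc in docs:
        c.update(doc.lower().split())
    return c

def parallel_counts(docs, workers=4):
    """Scatter documents across workers, then reduce the partial counts."""
    chunks = [docs[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(term_counts, chunks):
            total.update(partial)
    return total

# Hypothetical mini-corpus for illustration.
corpus = ["gene expression analysis", "visual analytics of text",
          "text analysis at scale", "gene regulation"]
counts = parallel_counts(corpus)
```

Because the reduce step is a simple associative merge, the same structure scales out to cluster nodes, which is what yields the near-linear scaling reported in the paper.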
Visual Analysis of Weblog Content
Gregory ML, DA Payne, D McColgin, NO Cramer, and DV Love. 2006. "Visual Analysis of Weblog Content." In International Conference on Weblogs and Social Media '07. pp. 227-230. Boulder, March 26-28, 2007.
In recent years, one of the major advances of the World Wide Web has been social media, and one of its fastest growing aspects is the blogosphere. Blogs make content creation easy and are highly accessible through web pages and syndication. With their growing influence, a need has arisen to be able to monitor the opinions and insight revealed within their content. This paper describes a technical approach for analyzing the content of blog data using a visual analytic tool, IN-SPIRE, developed by Pacific Northwest National Laboratory. We will describe both how an analyst can explore blog data with IN-SPIRE and how the tool could be modified in the future to handle the specific nuances of analyzing blog data.
Diverse Information Integration and Visualization
Havre SL, A Shah, C Posse, and BM Webb-Robertson. 2006. "Diverse Information Integration and Visualization." In Visualization and Data Analysis 2006 (EI10). SPIE, the International Society for Optical Engineering, San Jose, CA.
This paper presents and explores a technique for visually integrating and exploring diverse information. Society produces, collects, and processes ever larger and more diverse data, including semi- and unstructured text as well as transaction, communication, and scientific data. It is no longer sufficient to analyze one type of data or information in isolation. Users need to explore their data/information in the context of related information to discover often hidden, but meaningful, complex relationships. Our approach visualizes multiple like entities across multiple dimensions, where each dimension is a partitioning of the entities. The partitioning may be based on inherent or assigned attributes of the entities (or entity data) such as meta-data or prior knowledge captured in annotations. The partitioning may also be derived from entity data. For example, clustering, or unsupervised classification, can be applied to arrays of multidimensional entity data to partition the entities into groups of similar entities, or clusters. The same entities may be clustered on data from different experiment types or processing approaches. This reduction of diverse data/information on an entity to a series of partitions, or discrete (and unit-less) categories, allows the user to view the entities across a variety of data without concern for data types and units. Parallel coordinates visualize entity data across multiple dimensions of typically continuous attributes. We adapt parallel coordinates for dimensions with discrete attributes (partitions) to allow the comparison of entity partition patterns for identifying trends and outlier entities. We illustrate this approach through a prototype, Juxter (short for Juxtaposer).
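The underlying data model can be sketched in a few lines (our own illustration, with hypothetical data): each dimension is a partition of the same entities, either assigned from metadata or derived by clustering, and an entity's "pattern" is its tuple of category memberships across dimensions; rare patterns flag candidate outliers.

```python
import numpy as np
from collections import Counter

def kmeans_partition(X, k, iters=20, seed=0):
    """Derive a partition of entities from their data via plain k-means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# 40 entities measured in 5-D; two well-separated groups for illustration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(5, 1, (20, 5))])
assigned = np.repeat([0, 1], 20)          # assigned partition (e.g. metadata)
derived = kmeans_partition(X, k=2)        # derived partition (clustering)

# Pattern per entity = tuple of unit-less categories across dimensions.
patterns = list(zip(assigned.tolist(), derived.tolist()))
freq = Counter(patterns)
outliers = [i for i, p in enumerate(patterns) if freq[p] <= 2]
```

Each pattern tuple corresponds to one polyline through the discrete parallel-coordinate axes; the visualization then makes common patterns (thick bundles) and rare ones (lone lines) directly visible.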
Generating Graphs for Visual Analytics through Interactive Sketching
Wong PC, HP Foote, PS Mackey, KA Perrine, and G Chin, Jr. 2006. "Generating Graphs for Visual Analytics through Interactive Sketching." IEEE Transactions on Visualization and Computer Graphics 12(6).
We introduce an interactive graph generator, GreenSketch, designed to facilitate the creation of descriptive graphs required for different visual analytics tasks. The human-centric design approach of GreenSketch enables users to master the creation process without specific training or prior knowledge of graph model theory. The customized user interface encourages users to gain insight into the connection between the compact matrix representation and the topology of a graph layout when they sketch their graphs. Both the human-enforced and the machine-generated randomness supported by GreenSketch provide the flexibility needed to address the uncertainty factor in many analytical tasks. This paper describes over two dozen examples that cover a wide variety of graph creations from a single line of nodes to a real-life small-world network that describes a snapshot of telephone connections. While the discussion focuses mainly on the design of GreenSketch, we include a case study that applies the technology in a visual analytics environment and a usability study that evaluates the strengths and weaknesses of our design approach.
Graph Signatures for Visual Analytics
Wong PC, HP Foote, G Chin, Jr, PS Mackey, and KA Perrine. 2006. "Graph Signatures for Visual Analytics." IEEE Transactions on Visualization and Computer Graphics 12(6).
We present a visual analytics technique to explore graphs using the concept of a data signature. A data signature, in our context, is a multidimensional vector that captures the local topology information surrounding each graph node. Signature vectors extracted from a graph are projected onto a low-dimensional scatterplot through the use of scaling. The resultant scatterplot, which reflects the similarities of the vectors, allows analysts to examine the graph structures and their corresponding real-life interpretations through repeated use of brushing and linking between the two visualizations. The interpretation of the graph structures is based on the outcomes of multiple participatory analysis sessions with intelligence analysts conducted by the authors at the Pacific Northwest National Laboratory. The paper first uses three public domain datasets with either well-known or obvious features to explain the rationale of our design and illustrate its results. More advanced examples are then used in a customized usability study to evaluate the effectiveness and efficiency of our approach. The study results reveal not only the limitations and weaknesses of the traditional approach based solely on graph visualization but also the advantages and strengths of our signature-guided approach presented in the paper.
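As an illustration of the signature idea (our own sketch; the paper's signature features and scaling method may differ), the code below builds a small local-topology vector per node and projects the vectors to a 2-D scatterplot with classical multidimensional scaling:

```python
import numpy as np

def node_signatures(adj):
    """Signature per node: degree, triangle count, and mean neighbor
    degree -- a small stand-in for richer local topology features."""
    deg = adj.sum(axis=1)
    tri = np.diag(adj @ adj @ adj) / 2.0   # triangles through each node
    nbr = np.array([deg[adj[i] > 0].mean() if deg[i] else 0.0
                    for i in range(len(adj))])
    return np.column_stack([deg, tri, nbr])

def classical_mds(F, dim=2):
    """Project signature vectors to 2-D via classical MDS on
    pairwise Euclidean distances (double centering + eigendecomposition)."""
    D2 = ((F[:, None] - F[None, :]) ** 2).sum(-1)
    J = np.eye(len(F)) - np.ones((len(F), len(F))) / len(F)
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))

# Toy graph: a 4-clique attached to a 4-node path by one shared node.
n = 8
adj = np.zeros((n, n))
for i in range(4):
    for j in range(i + 1, 4):
        adj[i, j] = adj[j, i] = 1          # clique over nodes 0-3
for i in range(3, 7):
    adj[i, i + 1] = adj[i + 1, i] = 1      # path over nodes 3-7
xy = classical_mds(node_signatures(adj))
```

Nodes with similar local topology (e.g., the interior clique nodes) land at nearly identical scatterplot positions, which is the property that lets brushing a scatterplot region highlight structurally similar regions of the graph.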
Have Green - A Visual Analytics Framework for Large Semantic Graphs
Wong PC, G Chin, Jr, HP Foote, PS Mackey, and JJ Thomas. 2006. "Have Green - A Visual Analytics Framework for Large Semantic Graphs." In IEEE Symposium on Visual Analytics Science and Technology, pp 67-74. Baltimore, Maryland, October 31-November 2, 2006.
A semantic graph is a network of heterogeneous nodes and links annotated with a domain ontology. In intelligence analysis, investigators use semantic graphs to organize concepts and relationships as graph nodes and links in hopes of discovering key trends, patterns, and insights. However, as new information continues to arrive from a multitude of sources, the size and complexity of the semantic graphs will soon overwhelm an investigator's cognitive capacity to carry out significant analyses. We introduce a powerful visual analytics framework designed to enhance investigators' natural analytical capabilities to comprehend and analyze large semantic graphs. The paper describes the overall framework design, presents major development accomplishments to date, and discusses future directions of a new visual analytics system known as Have Green.