
Social Media Data Set
Overview
Focus: EsgynDB All-In-One SQL accessing structured, semi-structured, and unstructured data using standard EsgynDB functionality.
Inspired by how airlines use social media to improve customer service, this Test Drive scenario allows you to examine whether it makes sense to add social-media commentary data to your application. First, you’ll run queries that help you understand the data and then you’ll run queries that joins sample commentary data with the flights and weather data thereby allowing you to see, for example, how travelers interact with the carriers when delays occur.
This part of the Test Drive focuses on how EsgynDB allows you to query structured, semi-structured, and unstructured data to answer business questions. You’ll use sample user-defined functions (UDFs) to search unstructured text data joining the results with structured data stored in SQL tables for basic sentiment analysis. Also, you’ll join the results of text search with semi-structured data stored in JSON format and structured data stored in SQL tables to understand how airline carriers and airports interact with their customers using social media applications.
About the Data Set
The overall Test Drive data model is expanded with the Social Media Data Set, which is a structured CSV file that contains values containing unstructured data that mimics what social-media comments data might look like.
The data consists of sample comment text for February 2016 for airports and airline carriers but without links to web pages, images, and the permanent location of the comment. The user names are made up based on the airline carrier’s two-letter code (for example, @aa for American Airlines) or airport’s three-letter code (for example, @sfo for San Francisco International Airport.) user IDs, links, etc. are generated, not real. Similarly, user names are generated (@user1, @user2, etc.).
The comments data is located in a schema named socialmedia, which hosts a comments fact table plus handles and sentiments dimension tables. The handles dimension table helps you find comments for airports, carriers, and news stations. The sentiments dimension table contains positive and negative opinion keywords (a sentiment lexicon) downloaded from the Computer Science department of the University of Illinois1.
The sentiment analysis demonstrated herein is fairly basic; that is, it relies on keywords only rather than context such as “not good” versus “good.” Refer to Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005 for a more detailed discussion on identifying positive and negative opinions, emotions, and evaluations.

Querying Structured and Non-Structured Data
The following queries help you understand the content of Social-Media Data Set:
File | Questions | Query Description |
---|---|---|
query200.sql | What are the top-20 comment generators? Provide count, sample comment text, mentions, and hashtags. | Basic count query. |
query201.sql | What comments generated by an airline or airport on 2016-02-23 interacted directly with a customer? | Basic join query with text filtering on the comment text. Makes use of that mentions (@username) in comments are extracted to a separate column. |
query202.sql | What comments generated by an airline or airport for the month of February 2016 interacted directly with a customer and contained the phrase “thank you”? | Simple text search added to the text-filtering technique shown in query201.sql. |
query203.sql | What users had the most resends in the last 14 days in February 2016 and first 14 days in March 2016? What was the total amount of resends for each user? | Shows usage of date intervals and aggregation functions. |
query210.sql | Which airline had the most cancelled flights and what did they say to their customers? | Uses OLAP RANK() function to count and rank cancelled flights. Uses multi-column IN predicate. Join with handles dimension table to find social-media user name for the airline in order to extract comments. |
Performing Basic Sentiment Analysis
The following queries use SQL text search (like) and search UDFs (based on Apache Lucene) to identify and rank comments with positive and negative opinion keywords obtained from the sentiments dimension table, which contains the following columns:
Field | Description | Data Type |
---|---|---|
keyword | Word that describes an opinion. For example, “sorry” or “great”. May be misspelled on purpose. | varchar(32) |
sentiment | Whether keyword is “positive” or “negative”. | char(8) |
Lucene Search UDFs
The Lucene search UDFs use a Lucene index to search the comment column in the socialmedia.comments fact table. This index has been prebuilt for you. The searches generate a match score, which is sorted in top-n order with the highest score sorted first. In addition, the key columns in the Lucene index (comment_time and username) are also returned.
The following UDFs allow you to search the comments:
UDF | Parameters | Example |
---|---|---|
search_using_table |
| select * from udf ( search_using_table ( TABLE( select keyword from socialmedia.sentiments where sentiment = 'positive' ) , '/opt/trafodion/commentsindex' , 140 , 10 , 'COMMENT:KEYWORD' ) ) ; |
search_single_column |
| select * from udf ( search_using_table '/opt/trafodion/commentsindex' , 140 , 10 , 'COMMENT' , '+sorry +delay' ) ) ; |
Search Queries
The following queries show how to search unstructured data and how to use search capabilities to perform basic sentiment analysis:
File | Questions | Query Description |
---|---|---|
query220.sql | What top-5 airports have the most positive and negative comments? Provide a summary count of the positive and negative comments. | A basic sentiment analysis is performed by scanning the mentions for the airport’s username and by scanning the comment text for keywords that have been assigned positive/negative sentiments. Joins are used to allow use of the LIKE predicate to scan column text using values retrieved from dimension tables. |
query221.sql | What top airlines have the highest positive:negative comment ratio? | A basic sentiment analysis is performed by scanning the mentions for the airport’s username and by scanning the comment text for keywords that have been assigned positive/negative sentiments. Joins are used to allow use of the LIKE predicate to scan column text using values retrieved from dimension tables. |
query230.sql | What are the top-five most positive comments directed toward airlines in the month of February, 2016? | Uses a sample Lucene search UDFs, which uses a Lucene index to search the comment column in the socialmedia.comments fact table. A score value is returned, which indicates best match. The first parameter to the UDF is a table value. In this case, a result set with positive keywords. The result set is joined with socialmedia.comments and socialmedia.handles to find comments directed to airlines. |
query231.sql | What are the top-five most comments directed toward airlines with the words “sorry” and “delay” in the month of February, 2016? | Uses a sample Lucene search UDFs, which uses a Lucene index to search the comment column in the socialmedia.comments fact table. A score value is returned, which indicates best match. The search is based on hard-coded key words. The result set is joined with socialmedia.comments and socialmedia.handles to find comments directed to airlines. |
query250.sql | What airport(s) have the most negative comments and for what date(s)? What are the most delayed flights for those dates and what was the fastest 2-minutes wind speed measured? | The query used a Lucene search UDF to identify the comments and dates. The result set is joined with flight information and then with weather information obtained from semi-structured JSON data. |
1 Notes
- The appearance of an opinion word in a sentence does not necessarily mean that the sentence expresses a positive or negative opinion. See Bing Liu. “Sentiment Analysis and Subjectivity.” An chapter in Handbook of Natural Language Processing, Second Edition. (Editors: N. Indurkhya and F. J. Damerau), 2010.
- You will notice many misspelled words in the sentiments table. They are not mistakes. They are included as these misspelled words appear frequently in social media content.
COMMENTS Table