Introduction

It is estimated that 80% of data held by firms is unstructured (Balducci & Marinova 2018). This has created a need to process this data in order to gain a competitive advantage. Whilst this presents many challenges because AI has difficulty classifying the data, many advancements have helped to overcome this issue. As a result, firms are able to use their unstructured data to improve their products and services, for example by improving their marketing strategy and by helping to identify patients who are likely to have a particular medical condition.

What is unstructured data?

Balducci & Marinova (2018) define unstructured data “as a single data unit in which the information offers a relatively concurrent representation of its multifaceted nature without predefined organization or numeric values. Table 1 clarifies what constitutes UD (unstructured data) according to three characteristics (non-numeric, multifaceted, concurrent representation) that differentiate highly UD from highly SD (structured data), as well as specifying how UD might be employed to develop theory and obtain novel conceptual and managerial insights beyond what can be gleaned from SD”.

However, not all definitions are as specific. Jacob Isaksen, as quoted by Pritchard (2019), defines unstructured data as “data held outside data structures like tables and rows without predictable content patterns, such as documents, emails, photos or free text”. This is largely consistent with Müller et al. (2016), who describe unstructured data as “often textual and … messy. Unstructured data comprises documents, emails, instant messages or user posts and comments on social media, and presents a challenge to data miners; analysing unstructured data is more complex, more ambivalent and more time consuming”.

Although Balducci & Marinova (2018) and Müller et al. (2016) imply that unstructured data is typically not held in relational databases (such as Microsoft SQL Server or Oracle Database), Isaksen’s definition specifically excludes data held in them because they have “data structures like tables and rows” (Pritchard 2019). What Isaksen may have missed is that, in some cases, unstructured data can be held in tables and rows. Müller et al. (2016) use the example of email, which could easily be stored in a relational database with pre-defined fields (columns) along the lines of Address Sent From, Address Sent To, Date Sent, Subject and Body. Email is often still considered to be largely unstructured, however, because most of the information that is likely to be useful is stored in a single field (the body).
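The email example can be illustrated with a minimal sketch, assuming a hypothetical table whose column names follow the fields listed above and an invented sample message. It uses Python’s built-in sqlite3 module purely for convenience and is not taken from any of the cited studies.

```python
import sqlite3

# Hypothetical schema: the column names mirror the fields listed in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE email (
        sent_from TEXT,
        sent_to   TEXT,
        date_sent TEXT,
        subject   TEXT,
        body      TEXT
    )
    """
)
# Invented sample row for illustration only.
conn.execute(
    "INSERT INTO email VALUES (?, ?, ?, ?, ?)",
    (
        "customer@example.com",
        "support@example.com",
        "2019-10-16",
        "Broken handset",
        "Hi, the screen on my phone cracked after two days and nobody has replied.",
    ),
)

# Structured fields can be filtered exactly.
rows = conn.execute(
    "SELECT subject FROM email WHERE date_sent = ?", ("2019-10-16",)
).fetchall()

# The body can only be searched as raw text; its meaning is not in the schema.
matches = conn.execute(
    "SELECT COUNT(*) FROM email WHERE body LIKE ?", ("%cracked%",)
).fetchone()[0]
```

The date can be queried precisely, whereas the complaint itself can only be found by a text search over the body column, which is why the record as a whole is still treated as largely unstructured.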

A characteristic only brought up by Balducci & Marinova (2018) is that data does not have to be clearly “unstructured” or “structured”. Throughout their research paper, they often classify data as “highly structured” or “highly unstructured”, implying that it can fall in between, whilst Isaksen, as quoted by Pritchard (2019), and Müller et al. (2016) make no mention that this could be the case.

How is unstructured data being used?

Robinson, Goh & Zhang (2012) use product reviews for mobile phones in China to conduct their research on data and opinion mining. The purpose of their study is to “effectively extract accurate, reliable, influential and useful information from the raw opinion data collected from informal product review”. They conduct interviews to help determine the effectiveness of a review in each of the relevant categories. They state that data mining has many applications in a business setting because it gives the organisation a better awareness of its internal and external business environment, allowing it to gain a competitive advantage “through the superior awareness data mining grants” and “the development of business strategies and improvement of internal and external processes”. Factors used to determine how positive or negative a review was included explicit statements (such as star ratings), the presence of very negative descriptions of features, the semantic orientation of words, the length of the feature descriptions, the formality of the language and comparisons to other products.

This research is extended by Sperková, Vencovsky & Bruckner (2015), who aimed to explore how to measure the quality of service that customers receive from a company. Unlike Robinson, Goh & Zhang (2012), they aimed to automate this process. They state that “tremendous strides were made in recent years to automate the analysis of unstructured text data” including “focusing on individual metrics as word length, the presence of keywords, or the overall semantic orientation of terms within the data”.

One of the main issues prior studies have encountered is that the results often go unused because the algorithm does not adequately assess the context and therefore produces inaccurate results (Sperková, Vencovsky & Bruckner 2015). Sperková, Vencovsky & Bruckner (2015) aimed to solve this issue by implementing various artificial intelligence (AI) enhancements. For example, computer programs have historically had trouble associating an appraisal word (e.g. good, bad) with the appropriate subject (e.g. product, quality of service) when more than one subject is being discussed. Sperková, Vencovsky & Bruckner (2015) reduce the significance of this problem by measuring “the distance between the appraisal word and subject in the content and put together always the closest words” and by increasing the volume of data to “reduce its significance”.
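The closest-word idea can be sketched as follows. This is an illustrative reading of the quoted description, not the authors’ implementation; the function name, the word lists and the sample sentence are all assumptions, and distance is simply measured as the number of tokens between the two words.

```python
def pair_appraisals(tokens, appraisal_words, subjects):
    """Pair each appraisal word with the closest subject by token distance."""
    subject_positions = [i for i, t in enumerate(tokens) if t in subjects]
    pairs = []
    for i, token in enumerate(tokens):
        if token in appraisal_words and subject_positions:
            # Pick the subject whose position is nearest to the appraisal word.
            nearest = min(subject_positions, key=lambda j: abs(i - j))
            pairs.append((token, tokens[nearest]))
    return pairs


# Hypothetical sentence with two subjects and two appraisal words.
tokens = "the product is good but the service was bad".split()
pairs = pair_appraisals(tokens, {"good", "bad"}, {"product", "service"})
```

Here “good” is paired with “product” and “bad” with “service”. A sentence in which the appraisal word sits closer to the wrong subject would still be paired incorrectly, which is consistent with the authors relying on a larger volume of data to reduce the impact of such errors.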

Another issue Sperková, Vencovsky & Bruckner (2015) attempted to solve is that “some appraisal words are neutral if they stay alone, but they get sentiment in the context of the sentence. For example (the) adjective “slow” is carrying neutral sentiment, but if we put it to the context when “some service transfer is slow”, then it is getting negative sentiment. This is the reason why is necessary to create appraisal words repository where those words should be define(d) in relation to service purpose. The last thing which should be treated is the negation of the adjectives (phrases like “not good”), where the negation word turns the sentiment of the appraisal word”.
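A minimal sketch of such a repository is given below. The entries, the numeric scores and the list of negation words are hypothetical and chosen only to mirror the “slow” and “not good” examples in the quotation; they are not taken from the study.

```python
# Hypothetical appraisal-word repository: sentiment can depend on the subject.
# A key of (word, None) holds the general sentiment of a word.
REPOSITORY = {
    ("slow", "transfer"): -1,
    ("good", None): 1,
    ("bad", None): -1,
}
NEGATIONS = {"not", "never", "no"}


def score(appraisal, subject=None, previous_word=None):
    """Return -1, 0 or 1 for an appraisal word in its context."""
    # Prefer a subject-specific entry, then a general one, else neutral (0).
    general = REPOSITORY.get((appraisal, None), 0)
    value = REPOSITORY.get((appraisal, subject), general)
    # A negation word directly before the appraisal word flips its sentiment.
    if previous_word in NEGATIONS:
        value = -value
    return value


alone = score("slow")
in_context = score("slow", subject="transfer")
negated = score("good", previous_word="not")
```

On its own “slow” scores as neutral, in the context of a service transfer it scores as negative, and “not good” scores as negative because the negation flips the positive score of “good”.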

They also aim to make the extracted data more useful by looking beyond the overall sentiment of the reviews, which has little value to the organisation. Instead, they focus on gathering specific information about the brand experience, pre-purchase experience, post-purchase experience, past experience and customer needs, and compare this to the designed attributes of the service so that the areas needing improvement can be identified.

Madhusudhanan et al. (2018) agree that unstructured data can be challenging to classify “due to their variability and missing of labels”. They also use an advanced AI system to help overcome this issue. Their CUIL machine learning algorithm focuses on processing images and uses uCLUST and ELM++ to do so. To process the data, CUIL first provides uCLUST with clustering attributes and image metadata in JSON format. This format was chosen because it supports unstructured data, unlike CSV or XLS. uCLUST then “segregates the unlabelled images into clusters by clustering metadata based upon the given attributes” and assigns each image a label. The labelled data is then passed to ELM++, which creates a model that is used for classifying the unlabelled data. This model was found to have an accuracy of 94% when classifying images.
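The first stage of this pipeline can be loosely illustrated as below. This sketch does not reproduce uCLUST or ELM++; it simply groups images whose chosen metadata attributes match exactly, which is a far simpler stand-in for real clustering, and it omits the classification stage entirely. The metadata records, attribute names and label format are all assumptions.

```python
import json
from collections import defaultdict

# Hypothetical image metadata, as it might be supplied in JSON.
metadata_json = """
[
  {"file": "img1.jpg", "width": 640, "height": 480, "format": "jpg"},
  {"file": "img2.jpg", "width": 640, "height": 480, "format": "jpg"},
  {"file": "img3.png", "width": 1920, "height": 1080, "format": "png"}
]
"""


def cluster_by_attributes(records, attributes):
    """Group records whose chosen attributes match exactly (stand-in for clustering)."""
    clusters = defaultdict(list)
    for record in records:
        key = tuple(record[a] for a in attributes)
        clusters[key].append(record["file"])
    # Give each group a label that could later be used as a training target.
    return {f"cluster_{n}": files for n, files in enumerate(clusters.values())}


records = json.loads(metadata_json)
labels = cluster_by_attributes(records, ["width", "height", "format"])
```

The two images that share the same dimensions and format end up under one label and the third under another, giving previously unlabelled images a label that a classifier could then be trained on.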

What is unstructured data being used for?

Kharrazi et al. (2018) explored the value of unstructured electronic health data (primarily free-text doctors’ notes) using a natural language processing algorithm to identify patients with a geriatric syndrome. They were able to conclude that “geriatric syndromes are likely to be missed if unstructured data are not analysed”, emphasising the importance of this data.

To conduct the analysis, Kharrazi et al. (2018) first identified a “list of geriatric syndromes and identified specific codes and phrases corresponding to each syndrome”. These were “falls, walking difficulty, dementia, vision impairment, absence of fecal control, severe lack of urinary control, malnutrition, weight loss, pressure ulcer, and lack of social support”. Secondly, they “developed a set of natural language processing (NLP) algorithms to extract information and identify cases in the unstructured EHR data” and, lastly, these were compared with the population’s calculated rates.
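The phrase-matching step can be sketched as follows. The phrase lists and the sample note are invented for illustration and are not the codes and phrases used in the study, and simple whole-phrase matching with regular expressions is assumed in place of the authors’ NLP algorithms.

```python
import re

# Hypothetical phrase lists; the study's actual codes and phrases are not shown here.
SYNDROME_PHRASES = {
    "falls": ["fell", "fall at home", "found on floor"],
    "weight loss": ["weight loss", "lost weight"],
    "dementia": ["dementia", "memory loss"],
}


def find_syndromes(note):
    """Return the sorted syndromes whose phrases appear in a free-text note."""
    text = note.lower()
    found = set()
    for syndrome, phrases in SYNDROME_PHRASES.items():
        for phrase in phrases:
            # Match the phrase as whole words only.
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                found.add(syndrome)
                break
    return sorted(found)


note = "Patient fell twice last month and has lost weight since the last visit."
flagged = find_syndromes(note)
```

For this note the sketch flags falls and weight loss. Matching of this kind would also flag a note such as “no memory loss reported”, which is one reason why false positive rates need to be checked.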

To identify the false positive rates, a sample of 100 people was tested, and the rates ranged between 1% and 15% depending on the geriatric syndrome (Kharrazi et al. 2018). This approach could help identify patients who are likely to have a geriatric syndrome so that they can be tested further.

Another use for unstructured data is to measure service quality. Korfiatis et al. (2019) contend that service quality is “not accurately measured” using structured data (such as numerical scales). In their study, they use online reviews from Trip Advisor to help understand customer satisfaction (and thus service quality).

Korfiatis et al. (2019) collected all 557,208 airline passenger reviews from Trip Advisor, which included both structured and unstructured data. The structured elements included an overall score for the experience, as well as scores for Seat Comfort, Customer Service, Cleanliness, Food and Beverage, Legroom, In-Flight Entertainment, Value for Money and Check-in and Boarding (Korfiatis et al. 2019).

To process the unstructured element of the reviews, Korfiatis et al. (2019) used a three-step approach. Firstly, they cleaned the text by removing content that was irrelevant or harder to process (all non-English reviews, numbers and punctuation marks, words less than 3 characters in length, etc.). This left 184,502 reviews. Secondly, they ran the reviews through a natural language processing algorithm to identify the key topics to focus on when analysing the reviews. Lastly, they estimated “how the topics change for different review ratings and additional controls” (Korfiatis et al. 2019).
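The first, text-cleaning step can be sketched roughly as below. The function name and sample review are assumptions; the sketch covers only the removal of numbers, punctuation and short words, and leaves out language detection, which would be needed to drop non-English reviews.

```python
import re


def clean_review(text, min_length=3):
    """Lowercase, strip digits and punctuation, and drop words shorter than min_length."""
    text = text.lower()
    # Anything that is not a letter a-z or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if len(word) >= min_length]


# Hypothetical review text for illustration only.
cleaned = clean_review("Flight BA123 was 2 hours late, but the crew were great!")
```

Note that this simple pattern also discards accented letters, so it is only suitable once the non-English reviews have already been removed.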

Using the natural language processing algorithm, Korfiatis et al. (2019) were able to find and recommend the areas that are most likely to lead to a positive review and/or are less likely to lead to a negative review. It was found that “customer service is the critical factor that is highly connected with increased satisfaction”, which led to a 4% increase in the per unit score. Bad experiences relating to delays and/or refunds and cancellations were identified as factors likely to lead to a decrease in the per unit score, at approximately -3% each (Korfiatis et al. 2019).

Ransbotham (2016) explores how Equifax, a data solutions (credit reporting) provider, has started “to incorporate unstructured data from sources such as social media to better round out individual profiles”. By expanding beyond traditional data, such as a person’s payment history and whether they have filed for bankruptcy, Equifax is able to provide a more comprehensive risk assessment of a person applying for credit. This helps the lender decide what products to offer and on what terms. This improves the market for everyone: it allows people with a ‘good’ credit history to access credit on better terms, it helps lenders manage risk, and it helps ensure that people who would be unable to meet the terms are not granted credit, reducing the likelihood of them experiencing financial hardship.

Bratus et al. (2011) explore how to improve the efficiency and effectiveness of a business by processing unstructured technician repair notes “from General Motors’ archives of solved vehicle repair problems, with the goal to develop a robust and dynamic reasoning system to be used as a repair adviser by service technicians”. In the past, service technicians repairing products would need to search the archives for similar issues, which was often a long process. The algorithm speeds this up by indexing the unstructured data, primarily by finding and matching part names, making it easier for repair technicians to find similar problems that have occurred in the past.
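The general idea of indexing notes by part name can be sketched as below. This is not the ontology-guided search that Bratus et al. (2011) describe; it is a plain substring match against an assumed list of part names, and the repair notes are invented for illustration.

```python
from collections import defaultdict

# Hypothetical part vocabulary and repair notes for illustration only.
PART_NAMES = ["fuel pump", "alternator", "brake pad"]
notes = {
    1: "Customer reports stalling. Replaced fuel pump and relay.",
    2: "Battery not charging, alternator replaced.",
    3: "Engine cuts out on hills; fuel pump pressure low.",
}


def build_index(notes, part_names):
    """Map each part name to the ids of the notes that mention it."""
    index = defaultdict(list)
    for note_id, text in notes.items():
        lowered = text.lower()
        for part in part_names:
            if part in lowered:
                index[part].append(note_id)
    return dict(index)


index = build_index(notes, PART_NAMES)
# Look up earlier repairs that involved the same part.
similar = index.get("fuel pump", [])
```

Because the index is built once in advance, a technician looking for earlier fuel pump problems can retrieve notes 1 and 3 directly rather than reading through the whole archive.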

All of these articles involve partially structuring the data before using it, rather than developing an algorithm that scans the database of unstructured data on demand. This is likely because processing in advance is more efficient, given the level of processing required to analyse the data. It does come at a cost, however: there is a delay between the unstructured data being created and it being processed, which makes the data unavailable or harder to access during this time.

Conclusion

This report has explored the different ways of defining unstructured data, how unstructured data is turned into useful information and some of the ways this information is used. Typically, unstructured data is processed using one or more natural language processing algorithms that convert it into useful information, either as a one-off or at pre-defined intervals.

In the future, this processing is likely either to be done in real time, so that the data is almost instantly available for use, or to be run on demand, which would reduce the storage needed to hold redundant data, since the data is likely being held in its original unstructured form in addition to its processed form. It is also important to recognise that the processing of unstructured data is still a new technology and is likely to see many advances in the future. Whilst overall this is likely to have a positive effect, it will likely also increase privacy concerns because it could lead to organisations collecting more data about their users and building more complete profiles. Many of these concerns are likely to be mitigated through legislation such as the GDPR in Europe; however, legislation is often reactive and thus will only solve a problem after it occurs.

References

Balducci, B & Marinova, D 2018, ‘Unstructured data in marketing’, Journal of the Academy of Marketing Science, vol. 46, no. 4, pp. 557–590.
Bratus, S, Rumshisky, A, Khrabrov, A, Magar, R & Thompson, P 2011, ‘Domain-specific entity extraction from noisy, unstructured data using ontology-guided search’, International Journal on Document Analysis and Recognition, vol. 14, no. 2, pp. 201–211.
Kharrazi, H, Anzaldi, LJ, Hernandez, L, Davison, A, Boyd, CM, Leff, B, Kimura, J & Weiner, JP 2018, ‘The Value of Unstructured Electronic Health Record Data in Geriatric Syndrome Case Identification’, Journal of the American Geriatrics Society, vol. 66, no. 8, pp. 1499–1507.
Korfiatis, N, Stamolampros, P, Kourouthanassis, P & Sagiadinos, V 2019, ‘Measuring service quality from unstructured data: A topic modeling application on airline passengers’ online reviews’, Expert Systems with Applications, vol. 116, pp. 472–486, viewed 16 October 2019.
Madhusudhanan, S, Jaganathan, S & L S, J 2018, ‘Incremental Learning for Classification of Unstructured Data Using Extreme Learning Machine’, Algorithms, vol. 11, no. 10, p. 158.
Müller, O, Debortoli, S, Junglas, I & Vom Brocke, J 2016, ‘Using Text Analytics to Derive Customer Service Management Benefits from Unstructured Data’, MIS Quarterly Executive, vol. 15, no. 4, pp. 243–258.
Pritchard, S 2019, ‘Unstructured Data: Obstacles and solutions’, Computer Weekly, pp. 26–29.
Ransbotham, S 2016, ‘Using Unstructured Data to Tidy Up Credit Reporting’, MIT Sloan Management Review, vol. 57.
Robinson, R, Goh, T & Zhang, R 2012, ‘Textual factors in online product reviews: a foundation for a more influential approach to opinion mining’, Electronic Commerce Research, vol. 12, no. 3, pp. 301–330.
Sperková, L, Vencovsky, F & Bruckner, T 2015, ‘How to Measure Quality of Service Using Unstructured Data Analysis: A General Method Design’, Journal of Systems Integration, vol. 6, no. 4, pp. 3–16.

Unstructured Data: what is it, how is it being used, what is it being used for?