Library Data & Text Mining Friday, 10 January 2014

The following survey of resources has been compiled by CAUL's Quality & Assessment Advisory Committee (CQAAC):

Okerson (2013, p. 1) offers two sets of definitions that distinguish text mining from data mining:

Bernie Reilly, CRL's President, speaks of [text mining] as "automated processing of large amounts of structured digital textual content, for purposes of information retrieval, extraction, interpretation, and analysis." He distinguishes it from data mining, which extracts and analyses data, rather than text, from chosen sources…
In his 2013 report for the Publishing Research Consortium, independent consultant Jonathan Clark also separates the two types of activity. …He describes text mining as a sophisticated, smart type of indexing, which aims "to extract the meaning of a passage of text and to store it as a database of facts about the content and not simply as a list of words." He defines data mining as "an analytical process that looks for trends and patterns in data sets that reveal new insights: implicit, previously unknown, and potentially useful."
As with other areas in higher education, big data and analytics underpin data- and evidence-based library planning and resource allocation. Libraries produce big (use) data and provide access to big datasets. Okerson views the position of libraries as a great opportunity to support research and development in this arena.
Compiled by Vien Nguyen and Ben Mcrae (Victoria), and augmented by material provided by Graham Stone (Huddersfield), the list below brings together resources describing the emerging data and text mining methodologies, projects and initiatives in academic libraries. The information sources in this area fall into two categories: supporting researchers’ licensed trawling of datasets (to extract meaning, facts and opinions from text), and serving as a decision-making tool for library management (to extract trends and patterns from data).
The context remains, however, the articulation of library value or return on investment to stakeholders, and as such this list might be considered a subset of the list above.


Curtin University
A project conducted at Curtin University in WA on students enrolled in 2010 aimed to find the link between academic library use and student retention. Unlike the projects mentioned earlier, this was an in-house project. The findings indicate a correlation between library usage and students continuing their enrolment. However, the result cannot be generalised to other institutions. The importance of this study lies in its methodology, which could be replicated elsewhere.
Haddow, Gaby 2013, ‘Academic library use and student retention: A quantitative analysis’, Library and Information Science Research, vol. 35, no. 2, pp. 127-136.
Haddow, Gaby & Joseph, Jayanthi 2010, ‘Loans, Logins, and Lasting the Course: Academic Library Use and Student Retention’, Australian Academic & Research Libraries, vol. 41, no. 4, pp. 233-244.
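As a rough illustration of the kind of comparison such a study involves, the sketch below groups students into usage bands and reports the share still enrolled in each band. It is not the method used in the Curtin papers: the records, the field layout and the band cut-offs are all invented for this example, and a real analysis would need appropriate statistical testing.

```python
from collections import defaultdict

# Hypothetical records: (student_id, library_logins, still_enrolled).
# These values are invented for illustration and are not Curtin data.
records = [
    ("s01", 0, False),
    ("s02", 0, True),
    ("s03", 3, True),
    ("s04", 12, True),
    ("s05", 0, False),
    ("s06", 7, True),
    ("s07", 1, False),
    ("s08", 25, True),
]


def usage_band(logins):
    """Assign a student to a coarse usage band (the cut-offs are arbitrary)."""
    if logins == 0:
        return "none"
    if logins < 5:
        return "low"
    return "high"


def retention_by_band(rows):
    """Return {band: share of students still enrolled} for each usage band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [retained, total]
    for _student_id, logins, retained in rows:
        band = usage_band(logins)
        counts[band][0] += int(retained)
        counts[band][1] += 1
    return {band: kept / total for band, (kept, total) in counts.items()}


rates = retention_by_band(records)
for band in ("none", "low", "high"):
    print(f"{band:>4}: {rates[band]:.2f}")
```

A difference between bands in a table like this shows an association only; it does not show that library use causes students to stay enrolled.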
INFOMINE involves the creation of a search tool or database, a virtual library of internet resources, customised for a particular user community, for example: faculty, students, and research staff at the university level. Both utilise data mining technology to create a database or data warehouse with relevant content for researchers.
Intota Assessment
Intota Assessment is a library collection analytics service that offers book and serials analysis and consolidated usage across all formats, giving new views and metrics of a library’s collection in order to improve collection management. It is being used in the ARL LibValue project.
Library Cube at University of Wollongong
Using IBM Business Analytics/Business Intelligence systems, the University of Wollongong has been able to build and manage an institutional data warehouse, and to develop new analytical applications to monitor and enhance different areas of the university’s operations. In particular, this enabled the building of a cube for the library that links usage of library resources to student demographic data and academic performance (the “Library Cube”).
Jantti, Margie & Cox, Brian 2013, ‘Measuring the Value of Library Resources and Student Academic Performance through Relational Datasets’, Evidence Based Library and Information Practice, vol. 8, no. 2, pp. 163-171.
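The idea of linking usage to demographic and performance data can be sketched in a few lines. This is only a toy illustration and does not reflect the Wollongong implementation or the IBM tooling: the student table, the usage events, the field names and the choice of "faculty" as the demographic dimension are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical extracts; field names and values are invented for illustration.
students = {
    "s01": {"faculty": "Arts", "mark": 58},
    "s02": {"faculty": "Arts", "mark": 71},
    "s03": {"faculty": "Science", "mark": 64},
    "s04": {"faculty": "Science", "mark": 80},
}
usage_events = [
    ("s02", "ebook"), ("s02", "loan"), ("s02", "database"),
    ("s03", "loan"),
    ("s04", "database"), ("s04", "ebook"),
]


def build_cube(students, usage_events):
    """Average marks per (faculty, used_library) cell of a tiny 'cube'."""
    # Count usage events per student, joining on the shared student ID.
    use_counts = {}
    for student_id, _resource in usage_events:
        use_counts[student_id] = use_counts.get(student_id, 0) + 1
    # Place each student's mark in the cell for their faculty and usage flag.
    cells = {}
    for student_id, info in students.items():
        key = (info["faculty"], use_counts.get(student_id, 0) > 0)
        cells.setdefault(key, []).append(info["mark"])
    return {key: mean(marks) for key, marks in cells.items()}


cube = build_cube(students, usage_events)
for (faculty, used), average in sorted(cube.items()):
    label = "used library" if used else "no recorded use"
    print(faculty, label, average)
```

The join key here is a student identifier shared by both extracts. In practice, privacy and data governance requirements would shape how such identifiers are handled.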
MINES for Libraries
Measuring the Impact of Networked Electronic Services (MINES) provides data to measure the impact of digital content and thus enables librarians and administrators to assess the value of digital resources and services. The data is collected through a short web-based survey administered at the point of use of an e-journal, database article, or digital collection or service. The MINES for Libraries website describes in detail what MINES for Libraries is, its aims and origins, and how the survey is conducted. It also lists relevant publications and presentations.
MINES has been used by 50 North American libraries since 2003 with 100,000 users surveyed. Among these are the University of Connecticut Libraries. Information on their implementation of MINES can be obtained from the IFLA 2005 presentation:
Franklin, Brinley 2005, ‘Successful Web Survey Methodologies for Measuring the Impact of Networked Electronic Services (MINES for Libraries)’, UConn Libraries Presentations Paper 17.
The Ontario Council of University Libraries (OCUL), a consortium of Ontario's 21 university libraries, also implemented MINES. The survey is run via Scholars Portal. The publications below provide details:
Davidson, Catherine et al 2011, ‘Measuring use of licensed electronic resources: results of the second iteration of the MINES for Libraries survey on Scholars Portal and other resources for the Ontario Council of University Libraries (OCUL)’, 9th Northumbria International Conference on Performance Measurement in Libraries and Information Services, August 23-25, 2011, University of York, York, England.
Kyrillidou, Martha et al 2011, Measuring the Impact of Networked Electronic Resources and the Ontario Council of University Libraries’ Scholars Portal: Final Report, Association of Research Libraries, Washington, DC.
The MOSAIC (Making Our Scholarly Activity Information Count) project ran from March to December 2009, funded by JISC. MOSAIC aimed to:
investigate the possibilities for exploiting the user activity and resource use data that might currently or potentially be made available through Higher Education systems to benefit libraries, national services and their users (Kay et al. 2010, p. 3).
A number of UK universities were involved, including Dundee, Falmouth, Huddersfield, Lincoln, Sheffield, Sussex, Swansea, Warwick and Wolverhampton. The data they supplied (including circulation data) was used to help develop a recommender service and other user services. The resulting dataset was made freely available for re-use under an Open Data licence (Sero Consulting).
MOSAIC Project Wiki:
More details of the project can be found in the report:
Kay, David et al 2010, The JISC MOSAIC Project: Making Our Scholarly Activity Information Count.
WEKA data mining software
Recognised as a landmark in data mining software in both academic and business circles, the WEKA workbench had been downloaded more than 1.4 million times since being placed on SourceForge in 2000.
Hall, Mark et al 2009, ‘The WEKA data mining software: an update’, ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18.
The WEKA workbench aims to provide a “comprehensive collection of machine learning algorithms and data pre-processing tools to researchers and practitioners” (Hall et al. 2009, p. 10). The software lets users quickly try and compare different methods on data sets. It allows the data mining process to be built from the ground up using the different algorithms and tools in the kit.
WorldCat Collection Analysis Tool
OCLC Videos 2013, Leveraging Worldcat: Data mining the largest library database in the world, YouTube.
This tool compares library holdings, mainly through list-checking, using unique identifiers. Bowker Book Analysis System and Ulrich's Serials Analysis System run similar comparisons.
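A list-checking comparison of this kind reduces to set operations on normalised identifiers. The sketch below is a generic illustration and not how WorldCat or the Bowker and Ulrich's products work internally. The function names are hypothetical, and the ISBN-like strings are placeholders rather than real titles.

```python
def normalise_isbn(raw):
    """Strip hyphens and spaces and upper-case any trailing check character."""
    return raw.replace("-", "").replace(" ", "").upper()


def compare_holdings(ours, theirs):
    """Split two holdings lists into shared and unique identifiers."""
    a = {normalise_isbn(item) for item in ours}
    b = {normalise_isbn(item) for item in theirs}
    return {"shared": a & b, "only_ours": a - b, "only_theirs": b - a}


# Placeholder identifiers, invented for this example.
library_a = ["978-0-00-000001-1", "9780000000028", "0-00-000003-x"]
library_b = ["9780000000011", "978-0-00-000004-2"]

result = compare_holdings(library_a, library_b)
for group in ("shared", "only_ours", "only_theirs"):
    print(group, sorted(result[group]))
```

Normalising before comparing matters because the same identifier is often recorded with different punctuation in different catalogues. Matching ISBN-10 against ISBN-13 forms of the same title would need an extra conversion step that is not shown here.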


Britton, Scott 2013, Mining Library and university data to understand user populations and behaviour, University of Miami Libraries.
At the University of Miami Libraries, collection development and acquisitions decisions were informed by mining collection use by type of user, together with circulation and catalogue data.
Cox, Brian & Jantti, Margie H 2012, 'Capturing business intelligence required for targeted marketing, demonstrating value, and driving process improvement', Library and Information Science Research, vol. 34, no. 4, pp. 308-316.
Finnell, Joshua & Fontane, Walt 2010, 'Reference Question Data Mining', Reference & User Services Quarterly, vol. 49, no. 3, p. 278.
Fox, Robert 2010, 'Mining the digital library', OCLC Systems & Services: International digital library perspectives, vol. 26, no. 4, pp. 232-238.
Genoni, Paul & Wright, Janetta 2011, "Australia's national research collection: overlap, uniqueness, and distribution", Australian Academic & Research Libraries, vol. 42, no. 3, pp. 162-178.
This article examines the use of data mining in collection development and maintenance. The information is collected through WorldCat and analysed to find unique items.
Laitinen, Markku & Saarti, Jarmo 2012, 'A model for a library-management toolbox: data warehousing as a tool for filtering and analyzing statistical information from multiple sources', Library Management, vol. 33, no. 4/5, pp. 253-260.
Marostica, Matt 2013 (25 June), 'Enigma statistical data mining tool now available to Stanford community', Stanford Libraries Blog.
Enigma makes public data accessible (Enigma in motion 2013) by tapping into resources that are otherwise hidden from users of Google and Yahoo. It mines large public datasets, exposing billions of public records across previously siloed sources.
Matthews, Joseph R 2012, 'Assessing library contributions to university outcomes: the need for individual student level data', Library Management, vol. 33, no. 6/7, pp. 389-402.
Matthews advocates an approach that collaborates with the university’s assessment efforts and allows the library to determine the correlation levels between use of a library collection or service and a desired university outcome.
Mbabu, Loyd Gitari et al 2013, ‘Patterns of Undergraduates' Use of Scholarly Databases in a Large Research University’, The Journal of Academic Librarianship, vol. 39, iss. 2, pp. 189-193.
Authentication data was utilized to explore undergraduate usage of subscription electronic databases. 
McDonald, Diane & Kelly, Ursula 2012, Value and benefits of text mining, JISC, UK.
This JISC report explores the costs, benefits, barriers and risks associated with text mining within UK further and higher education (UKFHE) research, using the approach to welfare economics laid out in the UK Treasury best practice guidelines for evaluation.
Meletiou, Aristeidis & Katsirikou, Anthi 2009, 'Methodology of analysis and interrelation of data about quality indexes of library services by using data- and knowledge- mining techniques', Library Management, vol. 30, no. 3, pp. 138-147.
Meletiou and Katsirikou describe a data mining methodology focused on libraries.
Memmott, Sara & deVries, Susann 2010, ‘Tracking the Elusive Student: Opportunities for Connection and Assessment’, Journal of Library Administration, vol. 50, iss. 7-8.
This case study describes how Google Analytics was used to assess the use of the library web site and online instructional materials by Extended Programs students and explains how the data collected was used to identify further enhancements to the information provided to Extended Programs students.
Mento, Barbara & Rapple, Brendan 2003, Data Mining and Data Warehousing (SPEC Kit 274), Association of Research Libraries, Washington, DC.
This early survey explores how large quantities of data are being collected and mined in academic and research institutions.
Mimno, David 2012, The library as dataset: text mining at million book scale, YouTube.
This describes a project from Princeton University involving the digitisation and analysis of a huge collection of monographs. This collection is the dataset to which text and data mining is applied, looking at individual words whose meaning is shaped by context.
Mounce, Ross 2013, ‘The (young) researcher POV text & data mining’, 2nd Meeting of the Licences for Europe Stakeholder Dialogue, March 8, 2013, Brussels.
Nackerud, Shane et al 2013, 'Analyzing demographics: assessing library use across the institution', portal: Libraries and the Academy, vol. 13, no. 2, pp. 131-145.
Nicholson, Scott et al 2003, 'The bibliomining process: data warehousing and data mining for libraries', Proceedings of the American Society for Information Science and Technology, vol. 40, no. 1, pp. 478-479.
Nicholson explains the bibliomining process, both as a decision-making tool and as a means to justify and defend the existence of library services.
Nicholson, Scott 2006, 'The basis for bibliomining: Frameworks for bringing together usage-based data mining and bibliometrics through data warehousing in digital library services', Information Processing and Management, vol. 42, no. 3, pp. 785-804.
Nicholson, Scott 2006, 'Approaching librarianship from the data: using bibliomining for evidence-based librarianship', Library Hi Tech, vol. 24, no. 3, pp. 369-375.
Okerson, Ann 2013, ‘Text and data mining: a librarian overview’, paper presented at IFLA World Library and Information Congress, 17-23 August 2013, Singapore.
Librarians should develop the expertise to support their users by making data resources available to them on favourable terms and supporting their mining efforts.
Pan, Denise et al 2013, ‘More Than a Number: unexpected benefits of return on investment analysis’, Journal of Academic Librarianship, vol. 39, no. 6, pp. 566.
This case study from the University of Colorado library describes using data-mining software to assist with collection evaluation and maintenance across its three main libraries. The study showed that, through careful analysis of the data over time, the library could save money by acquiring only the minimum number of required items and placing them where they were needed at the time of need. The article demonstrates very clearly how data mining can be used by academic libraries to improve collection development practices.
Showers, Ben & Stone, Graham 2013, ‘Safety in Numbers: Developing a Shared Analytics Service for Academic Libraries’, In: 10th Northumbria International Conference on Performance Measurement in Libraries and Information Services, 22-25 July 2013, Royal York Hotel, York.
Showers is also editing a book for Facet on library metrics, which will include work from Wollongong and Huddersfield, among others. There are also plans to discuss library metrics standards with the three projects in 2014, and the projects are in touch with Megan Oakleaf and Carol Tenopir about their individual work on value and impact.
Smit, Eefke & Van Der Graaf, Maurits 2012, ‘Journal article mining: the scholarly publishers' perspective’, Learned Publishing, vol. 25, no. 1, pp. 35-46.
Soria, Krista M 2013, ‘Factors predicting the importance of libraries and research activities for undergraduates’, Journal of Academic Librarianship, vol. 39, no. 6, pp. 464-470.
Soria, Krista et al 2013, 'Library Use and Undergraduate Student Outcomes: New Evidence for Students' Retention and Academic Success', portal: Libraries and the Academy, vol. 13, no. 2, pp. 147-164.
Speirs, Martha A 2013, ‘Data mining for scholarly journals: challenges and solutions for libraries’, paper presented at: IFLA World Library and Information Congress, 17-23 August 2013, Singapore.
Speirs discusses the explosion of information and how advances in information retrieval and new tools (such as Deep Web Technologies and World Wide) will allow research hidden in the deep web, or written in another language, to become accessible.
Stone, Graham et al 2012, 'Library Impact Data Project: hit, miss or maybe', In: Proving value in challenging times: proceedings of the 9th Northumbria international conference on performance measurement in libraries and information services, 22-26 August 2011, University of York, UK, pp. 385-390.
Stone, Graham & Collins, Ellen 2013, 'Library usage and demographic characteristics of undergraduate students in a UK university', Performance Measurement and Metrics, vol. 14, no. 1, pp. 25-35.
Another paper is currently being written at Huddersfield, and a final paper on retention data is tentatively planned for 2014.
Uppal, Veepu & Chindwani, Gunjan 2013, 'An Empirical Study of Application of Data Mining Techniques in Library System', International Journal of Computer Applications, vol. 74, no. 11, pp. 42-46.
White, Sue & Stone, Graham 2010, ‘Maximising use of library resources at the University of Huddersfield’, Serials, vol. 23, no. 2, pp. 83-90.
Wu, C-H 2003, 'Data mining applied to material acquisition budget allocation for libraries: design and development', Expert Systems with Applications, vol. 25, no. 3, pp. 401-411.
In Wu’s paper, a data mining based model is explained utilising circulation data to make decisions about the materials acquisition budget.
Yu, Ping 2011, 'Data Mining in Library Reader Management', 2011 International Conference on Network Computing and Information Security (NCIS) 14-15 May 2011, Guilin, Guangxi, China.


UK's expert on digital technologies for education and research provides relevant and useful advice, digital content, and network and IT services that support research and the development of new technologies and ways of working. See Data & Text Mining.
Publishing Research Consortium is a (relatively new) group of associations and publishers that supports global research into scholarly communication in order to enable evidence-based discussion. It publishes a guide, Text Mining and Scholarly Publishing (Jonathan Clark).