Notes from the second day of Code4libBC 2025.
Transforming Unstructured Data in a Knowledge Base: Exploring the Potential of RAG and LLMs at SFU Library
Ian Song, SFU
- potential of the application of LLMs in the library
- data challenge: growth has outpaced traditional systems; traditional search is inadequate and struggles with contextual queries
- 80% of all data is in unstructured formats (text, audio, video, images)
- unstructured assets include
- audio (over 2000 digitized cassettes),
- e-texts (PDF and coded e-text for print disabled),
- internal documents: policy, instructional, administrative
- full text: traditional CMS/databases, no deep contextual analysis
- new opportunities: advances in AI, specifically LLMs, can process and understand human language at scale
- limitations:
- knowledge cutoff: trained on historical data
- hallucinations/inaccuracies: plausible sounding false info
- privacy/bias: inherited from training datasets
- RAG:
- indexing (creating the KB): break docs into chunks, convert them to numerical representations (embeddings), and store them in a vector db
- retrieval/generation: queries are matched (using semantic search) against the vector db to find relevant chunks, which are fed into the LLM to generate a grounded response
- semantic search/reasoning: factual, auditable, and contextually rich
- GPT4All open source model
- privacy/security: full control of data locally, no info leaves the network
- cost-effective deployment: can run RAG on local machine (CPU, or modest GPU)
- flexibility/control: choose from many open source LLM options (llama, mistral, etc.), fine-grained control over dev pathway (GUI/CLI) and indexing framework (direct or langchain/llamaindex), controllable chunk size
- optimizing: pre-processing diverse data
- challenge 1: digital audio files, transcribe to generate structured format, enhancement: add metadata to improve retrieval
- challenge 2: complex PDF, OCR, metadata enhancement
- theses: if each file contains full text, would be ideal
- future trend: advanced RAG architectures with knowledge graphs, multimodal, alternative technologies (such as context window expansion)
- RAG addresses key LLM flaws
- local deployment is critical
- data quality is paramount: pre-processing key to high-performance RAG
- next steps: pilot on smaller defined transcribed audio, develop user interface, explore integration with internal knowledge graph
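The indexing and retrieval steps described in this talk can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions, not the SFU implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector db, and the prompt is built but never sent to an LLM. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=50):
    """Split a document into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy 'embedding': word counts (assumption: a real system uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs, chunk_size=50):
    """Indexing: chunk every doc and store (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for doc in docs for chunk in chunk_text(doc, chunk_size)]


def retrieve(index, query, k=2):
    """Retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Generation input: ground the question in the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "Interlibrary loan requests are processed within three business days.",
        "The cassette collection was digitized and transcribed for search.",
    ]
    index = build_index(docs)
    question = "How were the cassette tapes digitized?"
    print(build_prompt(question, retrieve(index, question, k=1)))
```

Chunk size is a parameter here because the talk called it out as one of the controllable knobs; smaller chunks give more precise matches but less context per chunk.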
Consent Not Required: (AI) Technology as Connection
Coco Chen, SFU & Rebecca Ardron, Alexander College
- lack of consent, understanding, awareness of data collection: epistemic injustice
- teaching kids in the era of AI: decline in intergenerational bonding, inherited skills, increase in transactional interactions
- products that are marketed to sell, but are also collecting data
- reduced boundaries: preferred AI in search due to positive bias, human relationship communication, unblurring images, distancing from connections
Building AI Literacy in the Public Library
Jaclyn Fong, West Vancouver Memorial Library
- research AI programs offered in public libraries across BC and Canada
- 2024-Oct co-op student developed a 2-part AI course: part 1 more intro/lecture style, part 2 more hands-on
- 2025-Jan ran first time
- 2025-Summer co-op student added class about writing prompts to make 3-part course
- 2025-Fall ran two more times, developing a class on AI privacy
- Understanding, Exploring, Talking (Prompt Creation for Beginners), adding privacy
- full attendance: more than 90 participants
- also Tech Talks with guest speakers
- what’s next: recommended AI resource page on website, more AI-themed tech talks
Defending Library Services Against AI Scrapers
Scott Leslie, BC Libraries Co-op
- unwanted traffic has always been an issue: web scrapers, malicious
- robots.txt was introduced, which worked fairly well for years
- now have AI trainers that are ignoring robots.txt
- used to be infrequent enough that could do IP banning
- needed something more automated
- landed on CrowdSec
- crowd-sourced IP ranges known to be harvesters or bad actors; also learns from your log files
- can pay to get better curated list
- not the only approach: Cloudflare (may still allow AI through if paid), geo-location/blanket range IP blocking
- limit ports
- consideration: can the list be challenged/adjusted
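The two ideas in this talk, matching clients against shared lists of bad IP ranges and learning from your own log files, can be illustrated with the Python standard library. This is a rough sketch only and is not how CrowdSec works internally: the blocked range, the log format (client IP as the first field), and the threshold are all assumptions made for the example.

```python
import ipaddress
from collections import Counter

# Hypothetical shared blocklist; 203.0.113.0/24 is a documentation-only range.
BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]


def client_ip(log_line):
    """Return the first space-separated field (assumed to be the client IP)."""
    return log_line.split(" ", 1)[0]


def is_blocked(ip, ranges=BLOCKED_RANGES):
    """True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)


def noisy_ips(log_lines, threshold):
    """'Learn' from logs: IPs with at least `threshold` requests."""
    counts = Counter(client_ip(line) for line in log_lines if line.strip())
    return sorted(ip for ip, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    log = [
        '198.51.100.9 - - "GET /catalogue?page=1"',
        '198.51.100.9 - - "GET /catalogue?page=2"',
        '198.51.100.9 - - "GET /catalogue?page=3"',
        '192.0.2.4 - - "GET /"',
    ]
    print(is_blocked("203.0.113.7"))
    print(noisy_ips(log, threshold=3))
```

A plain request count like this would also flag a busy legitimate proxy, which is one reason the talk's question about whether a list can be challenged or adjusted matters.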
Break
Time to look for a snack.
Working with APIs: Flows and Runner in Postman
Olga Kalachinskaya, Douglas College
- automatically running multiple API calls
- options: programming languages, MS Power Automate (works with Excel), API testing/automation tools (like Postman, Insomnia, etc.)
- using it with Folio
- use case 1: due to a bug, Course Listing records in Reserves were not auto deleted when a course was deleted; internal records that become noise in the system
- use case 2: tech services were preparing to load new Authority records from Backstage and, prior to that, needed to delete records from Folio
- options:
- ask vendor to do it (free, easy, quick)
- use APIs to do it myself
- decided to use Postman, with free version
- flows provide drag-and-drop interface for building API workflows to chain multiple requests
- runner to create collections of API requests and execute them in sequence/parallel
- can import CSV or JSON files
- logging
- limit access token permissions
- resources: Folio (tickets, community), Postman docs
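For comparison with the "programming languages" option, here is a minimal Python sketch of what the Postman runner does with an imported CSV: one request per row, run in sequence, with logging. Everything specific is an assumption: the host and record path are placeholders rather than real Folio endpoints, the `id` column name is hypothetical, and the tenant/token headers are assumed to be what the target system expects. The default is a dry run that sends nothing.

```python
import csv
import io
import urllib.request

BASE_URL = "https://folio.example.org"   # hypothetical host
RECORD_PATH = "/hypothetical/records"    # placeholder, not a real endpoint


def read_ids(csv_text, column="id"):
    """Read non-empty record IDs from one CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row.get(column) or "").strip() for row in reader
            if (row.get(column) or "").strip()]


def build_delete_requests(ids, token, tenant):
    """Build one DELETE request per ID (nothing is sent here)."""
    headers = {"x-okapi-tenant": tenant, "x-okapi-token": token}  # assumed headers
    return [
        urllib.request.Request(f"{BASE_URL}{RECORD_PATH}/{rid}",
                               headers=headers, method="DELETE")
        for rid in ids
    ]


def run(requests, send=False):
    """Run requests in sequence and return a log; dry run unless send=True."""
    log = []
    for req in requests:
        if send:
            with urllib.request.urlopen(req) as resp:
                status = resp.status
        else:
            status = "dry-run"
        log.append((req.get_method(), req.full_url, status))
    return log


if __name__ == "__main__":
    ids = read_ids("id\nabc\ndef\n")
    for entry in run(build_delete_requests(ids, token="TOKEN", tenant="TENANT")):
        print(entry)
```

Keeping the dry run as the default matches the talk's caution about bulk deletes: review the log first, and use a token with limited permissions when sending for real.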
Bulk DOI generation in DSpace with the Super-Duper-App!
Daniel Sifton, VIU
- hosted DSpace VIURR for IR
- about 18% of records (around 5500) did not have a DOI
- option 1: can manually edit the records
- option 2: extract dspace metadata through csv/api, create payload.json, POST, insert DOI in dspace metadata through csv/api
- searched for scripts online: 3+ and 4+ python scripts, php web app
- need mechanism: export metadata from dspace > transform to datacite metadata > generate bulk DOI > merge DOI export with dspace metadata > import back to dspace
- csv merger (select data and merge to new datacite DOI import) > datacite-bulk-doi-creator (from CSV file) > csv merger (merge source field from datacite DOI creator to dspace export/import file)
- used Flet: Python GUI framework, to get a web app
- process now through super-duper-app
- mapping DC to datacite, had to account for variation such as uri, uri[], uri[en]
- needs improvement: field name logic, secure pasting, source URL match points, year/unknown, duplicate downloads
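The column variation and the final merge step can be sketched with the standard `csv` module. This is not the code behind the super-duper-app; it is a minimal illustration that assumes the DOI file has `source` and `doi` columns and that records are matched on the URI value. The column names and the example DOIs are hypothetical.

```python
import csv
import io
import re


def base_field(column):
    """Strip a trailing language qualifier: 'uri[]' and 'uri[en]' become 'uri'."""
    return re.sub(r"\[[^\]]*\]$", "", column.strip())


def field_values(row, field):
    """Collect non-empty values from every column variant of one field."""
    return [v.strip() for col, v in row.items()
            if base_field(col) == field and v and v.strip()]


def merge_dois(dspace_rows, doi_rows, match_field="uri", doi_column="doi"):
    """Add a DOI column to each DSpace row by matching on the source URL."""
    by_source = {r["source"].strip(): r[doi_column].strip() for r in doi_rows}
    merged = []
    for row in dspace_rows:
        doi = next((by_source[u] for u in field_values(row, match_field)
                    if u in by_source), "")
        merged.append({**row, doi_column: doi})
    return merged


if __name__ == "__main__":
    dspace_csv = ("id,uri,uri[en]\n"
                  "1,https://example.org/h/1,\n"
                  "2,,https://example.org/h/2\n"
                  "3,,\n")
    doi_csv = ("source,doi\n"
               "https://example.org/h/1,10.1234/a\n"
               "https://example.org/h/2,10.1234/b\n")
    rows = merge_dois(list(csv.DictReader(io.StringIO(dspace_csv))),
                      list(csv.DictReader(io.StringIO(doi_csv))))
    for row in rows:
        print(row["id"], row["doi"])
```

Rows with no matching source URL keep an empty DOI, which makes the unmatched records easy to spot before importing back.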
Building an assignment planner on Playlab
Joyce Wong, Langara College
- typical assignment planner: formulaic, dated, no customizations, one way output; no context, just gives a list of tasks
- Playlab: non-profit AI platform for educators and students, build AI apps but can select to avoid “answer generation”
- POP
- Persona: learning strategist
- Objective: create a schedule
- Parameters: offer Langara library and writing support, positive tone, no answer generation
- app takes type of assignment, start date, due date
- if less than 5 days, prompts to
- will ask if research required, other specific requirements
- similar to reference interview with student
- will ask if certain dates can’t work on assignment
- revises schedule based on answers
- can have a conversation
- workflow can include specific steps
- guidelines and guardrails
- can choose different AI models, variability (20%)
- build tools that students can actually use
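The scheduling part (start date, due date, and dates the student can't work) can be shown as a small deterministic sketch. This is not how the Playlab app works, since that is driven by an AI model and a conversation; the task list and weights below are invented for illustration, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical tasks with relative weights (share of the available days).
TASKS = [
    ("Understand the assignment", 1),
    ("Research", 3),
    ("Draft", 3),
    ("Revise and proofread", 2),
]


def working_days(start, due, blocked=()):
    """Days from start up to (not including) the due date, minus blocked dates."""
    days = []
    day = start
    while day < due:
        if day not in blocked:
            days.append(day)
        day += timedelta(days=1)
    return days


def plan(start, due, blocked=(), tasks=TASKS):
    """Return (task, finish-by date) pairs spread over the available days."""
    days = working_days(start, due, blocked)
    if not days:
        raise ValueError("no working days between start and due date")
    total = sum(weight for _, weight in tasks)
    schedule, done = [], 0
    for name, weight in tasks:
        done += weight
        # Ceiling of done/total of the available days, as a 0-based index.
        idx = -((-done * len(days)) // total) - 1
        schedule.append((name, days[idx]))
    return schedule


if __name__ == "__main__":
    blocked = {date(2025, 11, 8), date(2025, 11, 9)}
    for task, finish_by in plan(date(2025, 11, 3), date(2025, 11, 12), blocked):
        print(finish_by.isoformat(), task)
```

Re-running `plan` with a changed `blocked` set is the fixed-formula version of "revises schedule based on answers"; the conversational app adds the context a formula like this lacks.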
End
That’s all the talks. See you next time!