Back in 2020, infoscope was approached by political scientists at Laval University to help with the decision-maker component of Projet Quorum (www.projetquorum.com), an initiative for scientific outreach and cybersecurity awareness led by the Leadership Chair in the Teaching of Digital Social Sciences (CLESSN) and the Center for Public Policy Analysis (CAPP). Our client wanted all of the Hansard transcripts and press conference transcripts from the National Assembly website, starting from the beginning of the pandemic, to be readily available in a single, coherent, easy-to-use database. This meant collecting, structuring and making use of text and video data from various sources. The initial contract quickly developed into a valuable and enriching long-term collaboration, for both our client and our new company. Let us tell you how!
At first, to ensure continuous updates to the website, the researchers on this project needed to be able to analyze the “COVID mass” press conferences held religiously, almost daily, during the pandemic. Their aim? To track, day by day, any evolution in the broader political discourse on the matter. This meant the text data needed to be available in the data infrastructure as quickly as possible after a press conference was held, so they could update the website that same evening. The project mobilized web scraping, automated audio transcription (using existing AI models which, back then, failed to recognize the Québécois accent), and the structuring of data from various sources into one coherent dataset.
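To give a sense of what that collect-and-structure step can look like under the hood, here is a minimal, hypothetical sketch in Python. Every URL, CSS selector and function name below is an illustrative assumption, not the actual Projet Quorum code; in particular, `transcribe_press_conference` merely stands in for whichever speech-to-text model gets plugged in.

```python
import requests
from bs4 import BeautifulSoup

# Illustrative placeholder, not the real National Assembly listing page.
INDEX_URL = "https://example.org/press-conferences"

def fetch_transcript_links(index_url):
    """Scrape the listing page and collect links to individual transcripts."""
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # The selector is an assumption: each source site needs its own parsing
    # rules, and those rules break whenever the site layout changes.
    return [a["href"] for a in soup.select("a.transcript-link")]

def fetch_transcript_text(url):
    """Download one transcript page and reduce it to plain text."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return soup.get_text(separator="\n", strip=True)

def transcribe_press_conference(audio_path):
    """Stand-in for the speech-to-text step. In 2020 this relied on existing
    ASR models that struggled with the Québécois accent, so outputs had to
    be reviewed before entering the dataset."""
    raise NotImplementedError("plug in a speech-to-text model here")

if __name__ == "__main__":
    for link in fetch_transcript_links(INDEX_URL):
        document = {"source": link, "text": fetch_transcript_text(link)}
        print(document["source"], "->", len(document["text"]), "characters")
```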
At that time, LLM (Large Language Model) text analysis technologies, such as OpenAI's GPT, weren't yet available. This meant we had to build the text analysis models completely from scratch. At some point, we honestly wondered how we were ever going to get there: the size of the task at hand, the varying times at which new transcripts and videos became available for extraction, the diversity of data sources, and the constant evolution of technologies forcing frequent updates to our algorithms were all challenges our small team had to face. We definitely had our work cut out for us. But through collaboration, determination, countless hours of work and exchanges with the research team, we managed to put in place an effective system that answered their need for a “continuous” data influx.
As the project progressed, the researchers brought forward new needs, such as adding tweets from elected officials' Twitter accounts, but they also faced unforeseen challenges. We realized they were so overwhelmed with manual coding tasks to continuously feed the website that they barely had any time left to devote to their research! To give the team its time back, we developed a fully automated data pipeline, like a digital conveyor belt: it moves raw data from extraction at the source, then cleans, transforms and loads it into a structured data warehouse tailored to their specific academic needs. This allowed the team to focus on what they were really there for: pursuing innovative political research, producing scientific articles (some of which we contributed to directly!) and disseminating knowledge!
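As a rough illustration of that conveyor belt, here is a minimal extract-clean-transform-load sketch, again with assumed names and record shapes throughout, and with SQLite standing in for the actual data warehouse:

```python
import sqlite3
from datetime import datetime, timezone

def clean(record):
    """Clean step: normalize whitespace and drop empty documents."""
    text = " ".join(record.get("text", "").split())
    return {**record, "text": text} if text else None

def transform(record):
    """Transform step: add the metadata researchers filter and query on."""
    return {
        **record,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "n_words": len(record["text"].split()),
    }

def run_pipeline(raw_records, db_path="warehouse.db"):
    """Move raw records (scraped pages, transcripts, tweets) into the warehouse.
    A scheduler (cron, for instance) can re-run this after every press conference."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(source TEXT PRIMARY KEY, text TEXT, loaded_at TEXT, n_words INTEGER)"
    )
    for raw in raw_records:                      # extract
        record = clean(raw)                      # clean
        if record is None:
            continue
        record = transform(record)               # transform
        conn.execute(                            # load (idempotent upsert)
            "INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)",
            (record["source"], record["text"],
             record["loaded_at"], record["n_words"]),
        )
    conn.commit()
    conn.close()

run_pipeline([{"source": "pc-2020-03-13",
               "text": "Point de presse du 13 mars 2020..."}])
```

The upsert keyed on the source makes each run idempotent: the pipeline can be re-run safely whenever a source publishes late or corrects a transcript, without duplicating documents in the warehouse.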
In the last steps of automation, alongside the client's team of researchers, we developed an automated textual analysis model that met the scientific constraints of explainability and reproducibility, allowing us to automatically identify: 1- what a body of text is about (the topic); 2- how strongly it focuses on that topic; and 3- how it treats the topic (more positive or negative connotation). All of this contributed to more precise analyses and made data visualization possible.

This partnership continued over the following years, and the data warehouse was enriched with new data sources relating to various bodies of government. Today, were we to map out what the collaboration looks like, it'd look something like this:
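And zooming in on the analysis model itself: the actual dictionaries and rules were built with the research team and are far richer, but a minimal dictionary-based sketch (with made-up lexicons) shows why this kind of approach stays explainable and reproducible: every score traces back to counted words.

```python
import re
from collections import Counter

# Made-up illustrative lexicons; the real ones were developed with the researchers.
TOPIC_LEXICONS = {
    "public_health": {"vaccine", "hospital", "cases", "masks", "outbreak"},
    "economy": {"jobs", "subsidies", "businesses", "deficit", "reopening"},
}
POSITIVE = {"progress", "improve", "hope", "success", "support"}
NEGATIVE = {"crisis", "deaths", "failure", "risk", "shortage"}

def tokenize(text):
    # Keep accented characters so French text tokenizes correctly.
    return re.findall(r"[a-zàâçéèêëîïôöûùüÿœ']+", text.lower())

def analyze(text):
    tokens = tokenize(text)
    counts = Counter(tokens)
    # 1- Topic: the lexicon with the most hits in the text.
    hits = {t: sum(counts[w] for w in ws) for t, ws in TOPIC_LEXICONS.items()}
    topic = max(hits, key=hits.get)
    # 2- Focus: share of all tokens that belong to the winning topic's lexicon.
    focus = hits[topic] / max(len(tokens), 1)
    # 3- Tone: positive minus negative hits, normalized to [-1, 1].
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    tone = (pos - neg) / max(pos + neg, 1)
    return {"topic": topic, "focus": round(focus, 3), "tone": round(tone, 3)}

print(analyze("The vaccine rollout shows progress, but hospital cases remain a risk."))
# {'topic': 'public_health', 'focus': 0.273, 'tone': 0.0}
```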
This partnership opened our eyes to how valuable a close-knit relationship with our clients and their teams can be, whether for long-term projects or sporadic collaborations. Building these relationships not only lets us understand what our clients want to achieve with their data, but also shows us the issues they grapple with in their specific context, so we can optimize the solutions we offer according to their needs. We surpassed ourselves in this collaboration, because it's stimulating for us to get involved in our clients' projects and contribute to their wins.
We stepped out of our comfort zone and persevered, even when we felt the task at hand might be too large for a small-scale business to take on. We witnessed the power of teamwork and solid partnerships, and learned the power of words. This allowed us to grow, all while developing skills complementary to our field of expertise AND making our clients' lives easier. That's what we call a win-win. If you also have textual data at your disposal but aren't quite sure what to do with it, we'll happily take on more projects like this one!