Troels Kaldau
Software Developer with a focus on end-to-end mobile applications
Project for LittleGiants
TL;DR
In this project, I built a prototype to extract package information from screenshots of SMS messages using OCR and regex. I refined the regex patterns by building a test suite with generated test data and iterated through multiple rounds of testing and adjustments to accommodate different message formats, data formats, and text wrapping properties.
Selvhent is a company that provides a service for collecting packages at package shops. Their system primarily consists of an employee application for managing the package inventory, with a storefront mode for customers to collect their packages.
Selvhent Package Collection
To increase the value of their service, Selvhent wanted to develop a client application where users could receive information about packages ready for collection. Fully automating this would have required collaboration with the major package delivery services, so to simplify the initial implementation they opted for a prototype solution.
Prototype solution
The solution was to extract package information from SMS messages sent by package delivery services. The client application would parse the SMS messages and extract structured data. This data included package location and pickup deadlines, enabling the client to send reminders and display the package shop's location.
I was tasked with implementing the prototype, which involved fetching SMS messages, parsing them, and extracting package information. Users would screenshot their SMS messages, and the application would fetch these images from the gallery. The images were processed using Google Machine Learning Kit's OCR feature, and the extracted text was parsed with regex patterns.
The approach had clear disadvantages: it required gallery access, relied on users taking screenshots manually, and depended on inflexible regex patterns. I highlighted these limitations, but was asked to go ahead and implement the prototype as a proof of concept.
The first step was to fetch SMS screenshots from the gallery and process them. I created a background service that processed images in reverse chronological order and saved the ID of the last processed image, so that no image was processed twice. Each image was passed to an OCR service, and the extracted text was parsed with regex patterns. The parsed package data was compared with existing records in the database to eliminate duplicates. Address information was normalized using the Dawa API, new package details were saved in the database, and users were notified of newly detected packages.
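As an illustration of the checkpointing idea, here is a minimal sketch in Python (chosen for brevity, not the language of the app). The `GalleryImage` type, the `process_new_images` function, and the assumption that gallery IDs increase over time are all hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class GalleryImage:
    image_id: int  # assumption: newer images always have larger IDs
    path: str


def process_new_images(
    images_newest_first: Iterable[GalleryImage],
    last_processed_id: Optional[int],
    handle: Callable[[GalleryImage], None],
) -> Optional[int]:
    """Handle every image newer than the checkpoint and return the new checkpoint."""
    newest_seen = last_processed_id
    for image in images_newest_first:
        if last_processed_id is not None and image.image_id <= last_processed_id:
            break  # everything from here on was handled in an earlier run
        handle(image)  # e.g. OCR, parsing, duplicate check, saving
        if newest_seen is None or image.image_id > newest_seen:
            newest_seen = image.image_id
    return newest_seen
```

The returned value would be persisted between runs, so a second run over the same gallery stops at the first image it has already seen.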
Example of messages with different formats
The main challenge was developing a service to extract structured package data from the messages. The system needed to support six different package delivery services, each with its own unique data and message formats. To create the regex patterns required for parsing these messages, I first compiled a list of sample messages from each service. This provided a comprehensive overview of the various formats in use.
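To make the idea of one pattern per delivery service concrete, the following Python sketch tries a set of patterns in turn and returns the first match. The carrier names, message formats, and field names are invented for illustration and do not reflect any real delivery service or the project's actual patterns.

```python
import re
from typing import Optional

# Hypothetical carriers and formats, invented for illustration only.
CARRIER_PATTERNS = {
    "carrier_a": re.compile(
        r"Package (?P<package_number>[A-Z]{2}\d{9}) is ready at (?P<address>.+?)\. "
        r"Collect by (?P<deadline>\d{2}-\d{2}-\d{4})"
    ),
    "carrier_b": re.compile(
        r"Parcel no\. (?P<package_number>\d{14}) can be picked up until "
        r"(?P<deadline>\d{2}/\d{2}/\d{4}) at (?P<address>.+?)$",
        re.MULTILINE,
    ),
}


def parse_message(text: str) -> Optional[dict]:
    """Return the fields from the first carrier pattern that matches, or None."""
    for carrier, pattern in CARRIER_PATTERNS.items():
        match = pattern.search(text)
        if match:
            return {"carrier": carrier, **match.groupdict()}
    return None
```

Named groups keep the output structure identical across carriers even when the fields appear in a different order in the message.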
However, these message formats were not static, and the extracted data could appear in a variety of formats. For instance, package numbers varied depending on the service provider, and addresses could differ significantly in structure and length. To ensure the reliability of the regex patterns, I developed a robust test suite.
All collected messages were stored in a JSON file alongside their correctly structured outputs. This allowed the regex patterns to be tested against real-world data. By iteratively refining the patterns and testing them against the dataset, I progressively improved their accuracy and reliability.
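A fixture-driven test of this kind could look roughly like the Python sketch below. The message text, the pattern, and the JSON layout (`message` and `expected` keys) are assumptions made up for this example; the JSON is embedded as a string only to keep the sketch self-contained.

```python
import json
import re

# Hypothetical pattern for one made-up message format.
PATTERN = re.compile(
    r"package\s+(?P<package_number>[A-Z]{2}\d{9})\s+is ready at\s+(?P<shop>.+?),\s+"
    r"(?P<address>.+?)\.\s+Collect by\s+(?P<deadline>\d{2}-\d{2}-\d{4})",
    re.IGNORECASE | re.DOTALL,
)

# In practice this would be loaded from a file; "expected" is null when
# the message should not produce a package.
FIXTURES = json.loads("""
[
  {
    "message": "Your package AB123456789 is ready at Kiosk Nord, Eksempelvej 12, 8000 Aarhus C. Collect by 14-03-2024.",
    "expected": {
      "package_number": "AB123456789",
      "shop": "Kiosk Nord",
      "address": "Eksempelvej 12, 8000 Aarhus C",
      "deadline": "14-03-2024"
    }
  },
  {
    "message": "Reminder: your parcel is waiting.",
    "expected": null
  }
]
""")


def parse(message):
    """Return the named groups as a dict, or None if the pattern does not match."""
    match = PATTERN.search(message)
    return match.groupdict() if match else None


def run_fixtures(fixtures):
    """Return (message, expected, actual) for every fixture that fails."""
    failures = []
    for case in fixtures:
        actual = parse(case["message"])
        if actual != case["expected"]:
            failures.append((case["message"], case["expected"], actual))
    return failures
```

Returning the list of failures, rather than stopping at the first one, makes it easier to see how a single pattern change affects the whole dataset.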
The next step was ensuring the regex patterns could handle all potential data formats. Through research, I identified the possible formats for package numbers and addresses used by each delivery service. Using this information, I created a test data generator that injected variations into the identified message formats, such as different lengths and structures. The generator produced thousands of test cases, which were then processed to uncover edge cases. These cases revealed issues that required adjustments to the patterns, which were then retested.
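The sketch below shows one way such a generator might be structured in Python. The message template, the value pools, and both patterns are hypothetical. The deliberately naive pattern assumes addresses never contain a period, which the generated Danish-style floor abbreviations ("2. th", "st. tv") expose as an edge case.

```python
import itertools
import random
import re
import string

# Naive: assumes an address never contains a period.
NAIVE = re.compile(
    r"Package (?P<package_number>[A-Z0-9]+) is ready at (?P<address>[^.]+)\. "
    r"Collect by (?P<deadline>\d{2}-\d{2}-\d{4})"
)
# Refined: lets the address run lazily up to the fixed closing phrase.
REFINED = re.compile(
    r"Package (?P<package_number>[A-Z0-9]+) is ready at (?P<address>.+?)\. "
    r"Collect by (?P<deadline>\d{2}-\d{2}-\d{4})"
)

# Hypothetical value pools; real ones would come from researching each carrier.
LENGTHS = [8, 13, 20]
ADDRESSES = [
    "Eksempelvej 1, 8000 Aarhus C",
    "Eksempelvej 12B, 2. th, 2100 København Ø",
    "Lang Eksempelgade 104, st. tv, 9000 Aalborg",
]
DEADLINES = ["01-01-2025", "14-03-2024"]


def generate_cases(seed):
    """Yield (message, expected) pairs for every combination of the pools."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    alphabet = string.ascii_uppercase + string.digits
    for length, address, deadline in itertools.product(LENGTHS, ADDRESSES, DEADLINES):
        number = "".join(rng.choice(alphabet) for _ in range(length))
        message = f"Package {number} is ready at {address}. Collect by {deadline}."
        expected = {"package_number": number, "address": address, "deadline": deadline}
        yield message, expected


def find_failures(pattern, cases):
    """Return (message, expected, actual) for each case the pattern gets wrong."""
    failures = []
    for message, expected in cases:
        match = pattern.search(message)
        actual = match.groupdict() if match else None
        if actual != expected:
            failures.append((message, expected, actual))
    return failures
```

Because each generated message carries its own expected output, the same failure report can be rerun after every pattern change.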
The complexity of the regex patterns made it difficult to predict how changes would affect the results. Each modification introduced new challenges, necessitating further testing and adjustments. This iterative process resembled a form of manual machine learning, where each refinement was an educated guess tested against extensive datasets. After numerous iterations, we achieved a level of confidence in the patterns that my supervisor deemed sufficient for the prototype.
Example of message format changing with screen width
Another significant challenge arose because the OCR service processed each line of a screenshot independently, and the line breaks in the messages varied with the device's screen size and text size settings. Initially, I explored the possibility of creating regex patterns that did not rely on line breaks, but this approach proved infeasible. To address the issue, I developed a new test generator that simulated text wrapping at different widths, allowing the regex patterns to be tested against various line break scenarios. As with the generated data formats, this added a new layer of complexity and required further iterations of adjustment and testing. Ultimately, the regex patterns reached a level of confidence that was deemed sufficient for implementation in the prototype.
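A wrapping simulator along these lines can be sketched in Python with the standard `textwrap` module. This is an approximation under stated assumptions: real SMS apps wrap by pixel width rather than character count, and the message and the deliberately brittle pattern (literal spaces, no newline matching) are hypothetical.

```python
import re
import textwrap

# Deliberately brittle toy pattern: literal spaces, and "." does not match newlines.
PATTERN = re.compile(
    r"Package (?P<package_number>[A-Z0-9]+) is ready at (?P<address>.+?)\. "
    r"Collect by (?P<deadline>\d{2}-\d{2}-\d{4})"
)

MESSAGE = (
    "Package AB123456789 is ready at Kiosk Nord, Eksempelvej 12, "
    "8000 Aarhus C. Collect by 14-03-2024."
)


def wrap_variants(message, widths):
    """Return {width: message re-wrapped at that width}, one line per '\\n'."""
    variants = {}
    for width in widths:
        lines = textwrap.wrap(
            message,
            width=width,
            break_long_words=False,  # keep package numbers intact
            break_on_hyphens=False,  # keep dates such as 14-03-2024 intact
        )
        variants[width] = "\n".join(lines)
    return variants


def surviving_widths(pattern, message, widths):
    """Return the widths at which the pattern still matches the wrapped text."""
    return [
        width
        for width, wrapped in wrap_variants(message, widths).items()
        if pattern.search(wrapped)
    ]
```

Running `surviving_widths` over a range of widths gives a quick picture of which line break positions a given pattern tolerates and which it does not.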
This task was initially a test of patience, but as I delved deeper into it, I became increasingly absorbed in the challenge. Despite the inherent inflexibility of using screenshot parsing and regex patterns, and the countless hours spent fine-tuning the regex, I found the process of gradually improving the solution to be surprisingly rewarding. The task significantly enhanced my reasoning skills and improved my proficiency with regex patterns and testing methodologies.
In addition to this feature, I implemented several other functionalities within the Selvhent system, including an internal database, geolocation notifications, and more. You can read about my work on Bluetooth device integration in the article linked below.
Also Read:
Position
Software Developer
Developed full-stack systems for startups, ranging from social apps to drowning prevention systems
Project
Bluetooth Scan and Print
Built an app extension to scan barcodes with Bluetooth scanners and print labels using portable printers