OpenAI Data Mining
The Problem:
While developing my program to mine data from antique store websites, I found it time-consuming to categorize and format every piece of data correctly. Each website has its own unique HTML and CSS structure, making the initial setup for data mining a grueling process. If I could reduce the time it takes to initiate the data mining process, I could spend less time locating the correct elements and more time analyzing the data.
The Plan:
The plan I came up with depended on the complexity of the site I was trying to mine. Simple, mom-and-pop websites were relatively easy to handle, as were sites with smaller inventories. With these smaller sites, I could often get OpenAI to help me find the right elements to extract the consistent data I needed. However, this approach was hit or miss, and at times I still had to figure out the elements myself.
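For those smaller sites, the help amounted to handing the model a page snippet and asking for a selector. A minimal sketch of the idea, assuming the official openai Python client; the model name and prompt wording are illustrative, not my production code:

```python
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def suggest_selector(html_snippet: str) -> str:
    """Ask the model which CSS selector matches the product names."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "You analyze HTML. Reply with a single CSS selector only."},
            {"role": "user",
             "content": f"Which CSS selector matches the product names here?\n\n{html_snippet}"},
        ],
    )
    return response.choices[0].message.content.strip()
```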
For larger companies like Dell or Microsoft, the elements and selectors were much more dynamic, making consistent data mining a real challenge. To handle these larger sites, I would:
Create the basic Python framework.
Set up a JSON file for each site I wanted to mine.
Manually populate the JSON with the elements OpenAI struggled to identify (e.g., page scrolling and clicking); a sketch of such a config follows this list.
Use the OpenAI API to send a basic prompt for each product link.
Take the SQL-ready data from the OpenAI response and store it in a PostgreSQL database (see the second sketch after this list).
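A hand-written per-site config along the lines of steps 2 and 3 might look like this; the field names are illustrative, not my exact schema:

```json
{
  "base_url": "https://www.example.com/shop?page={page}",
  "product_link_selector": "a.product-card__link",
  "requires_scroll": true,
  "load_more_selector": "button.load-more",
  "max_pages": 40
}
```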
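Steps 4 and 5 could be wired together roughly as below. This is a sketch, not the repository's code: the model name, prompt, table, and connection details are all assumptions, and it inserts with a parameterized query instead of executing SQL text returned by the model, which is safer if the model's output is ever malformed.

```python
import json

import psycopg2  # assumes a running PostgreSQL instance
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Extract the product name, price, and description from this HTML. "
          "Reply with a JSON object with the keys name, price, and description.")

def extract_product(url: str) -> dict:
    """Fetch one product page and have the model pull out the fields."""
    html = requests.get(url, timeout=30).text
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # illustrative model name
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{html[:20000]}"}],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)

def store_product(conn, url: str, fields: dict) -> None:
    """Insert one product row; the table and columns are illustrative."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO products (url, name, price, description) "
            "VALUES (%s, %s, %s, %s)",
            (url, fields.get("name"), fields.get("price"), fields.get("description")),
        )
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect(dbname="antiques", user="postgres")  # illustrative DSN
    for link in ["https://www.example.com/product/123"]:  # links come from Part One
        store_product(conn, link, extract_product(link))
```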
The Execution:
The execution of the project deviated slightly from my initial plan. The code is now divided into two parts, and the two GitHub links on the right correspond to those parts.
Part One focuses on iterating through each page of a given site to retrieve a list of product links (sketched after this list).
Part Two handles sending that list of product links to OpenAI to extract the desired data points.
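A stripped-down sketch of Part One's loop, assuming the per-site JSON config shown earlier and the requests and BeautifulSoup libraries (the repository may structure this differently):

```python
import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def collect_product_links(config_path: str) -> list[str]:
    """Walk each catalog page and collect product URLs using the site's config."""
    with open(config_path) as f:
        cfg = json.load(f)  # the hand-written per-site JSON config

    links: list[str] = []
    for page in range(1, cfg["max_pages"] + 1):
        url = cfg["base_url"].format(page=page)
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        anchors = soup.select(cfg["product_link_selector"])
        if not anchors:  # ran past the last page of results
            break
        links.extend(urljoin(url, a.get("href")) for a in anchors if a.get("href"))
    return links

if __name__ == "__main__":
    print(len(collect_product_links("dell.json")))  # illustrative config file name
```

For the dynamic sites the plan mentions (page scrolling and clicking), a browser driver such as Selenium would have to replace the plain requests call here.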