Updates July 28
PGrid
- Fixed some inconsistencies between the Postgres integration test setup and the ClickHouse one. The global Prisma client can now still be used, which was causing an issue before.
- Working on a new generic item processing pipeline that any scraping function can use to write snapshots and create or update generic items. I initially tried to build a unified version, but its structure got complicated, so I created a new version in a new file with a simpler structure, with the help of AI.
- Working with AI to create tests for this new generic item processing pipeline, testing small parts at a time; found and fixed some bugs along the way.
- Migrated pgrid BAML to use the external AI browser server for its Perplexity searches.
- Big eBay refactor to be efficient for unclassified items and to handle variant listing pages on eBay. Generic eBay now works as follows:
  - For a category of items:
    - Run eBay searches for all items in that category
    - From the search results, create generic items for listings not in our database and price snapshots for everything
    - If a new generic item was added, queue a single-page crawl for it
    - Queue single-page crawls for all classified items in that category
  - Single-page crawls:
    - If a variant page is found, queue single-page crawls for all variants and ignore the base page
    - Otherwise just save the price snapshot data
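The flow above can be sketched roughly as below. All type and function names here are hypothetical placeholders for illustration, not the real PGrid code:

```typescript
// Illustrative sketch of the generic eBay flow; names and shapes are assumed.

interface SearchResult { listingId: string; title: string; price: number | null }
interface GenericItem { id: string; title: string; classified: boolean }
interface PageResult { listingId: string; variantIds: string[]; price: number | null }

type CrawlQueue = string[]; // listing IDs queued for single-page crawls

function processCategory(
  results: SearchResult[],
  known: Map<string, GenericItem>, // items already in the database, by listingId
): { newItems: GenericItem[]; snapshots: SearchResult[]; queue: CrawlQueue } {
  const newItems: GenericItem[] = [];
  const queue: CrawlQueue = [];

  for (const r of results) {
    const existing = known.get(r.listingId);
    if (!existing) {
      // Unknown listing: create a generic item and queue a single-page crawl.
      newItems.push({ id: r.listingId, title: r.title, classified: false });
      queue.push(r.listingId);
    } else if (existing.classified) {
      // Classified items in the category also get single-page crawls.
      queue.push(r.listingId);
    }
  }

  // Price snapshots are written for everything in the search results.
  return { newItems, snapshots: results, queue };
}

function handleSinglePage(page: PageResult): { queue: CrawlQueue; snapshot: PageResult | null } {
  if (page.variantIds.length > 0) {
    // Variant page: crawl each variant and ignore the base page itself.
    return { queue: page.variantIds, snapshot: null };
  }
  // Plain listing: just save its price snapshot data.
  return { queue: [], snapshot: page };
}
```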
- Added a new pipeline, usable by any scraper, that handles creating generic items, updating item details such as titles and fail dates, adding price snapshots, and handling currency. Includes unit tests.
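A minimal sketch of what that shared pipeline step might look like. The field names, the AUD normalisation, and the fixed rate table are all assumptions for illustration, not the real implementation:

```typescript
// Hypothetical sketch of the shared item-processing step: create-or-update
// the item, then write a currency-normalised price snapshot.

interface ScrapedListing {
  title: string;
  price: number;
  currency: string; // e.g. "AUD", "USD"
  seenAt: Date;
}

interface Item {
  title: string;
  lastSeenAt: Date; // used to derive fail dates when a listing disappears
}

interface Snapshot { priceAud: number; at: Date }

// Hypothetical fixed conversion table; a real pipeline would use live rates.
const TO_AUD: Record<string, number> = { AUD: 1, USD: 1.5 };

function processListing(
  existing: Item | undefined,
  scraped: ScrapedListing,
): { item: Item; snapshot: Snapshot } {
  // Create the item if it's new, otherwise refresh details like the title.
  const item: Item = existing
    ? { ...existing, title: scraped.title, lastSeenAt: scraped.seenAt }
    : { title: scraped.title, lastSeenAt: scraped.seenAt };

  // Normalise currency before writing the price snapshot.
  const rate = TO_AUD[scraped.currency] ?? 1;
  const snapshot: Snapshot = { priceAud: scraped.price * rate, at: scraped.seenAt };

  return { item, snapshot };
}
```

Keeping this as one pure function makes the small unit tests mentioned above straightforward to write.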
- Added a geoblock check for Amazon. There has been a recent issue where Amazon assumes my proxy is not in Australia and shows everything as out of stock. For GPUs I added a check: if every result for a given GPU shows no price, skip saving that data. Still need to find a proper fix.
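The guard is essentially "save the batch only if at least one listing has a price". A sketch under that assumption, with hypothetical names:

```typescript
// Hypothetical guard: if every search result for a GPU comes back without a
// price, treat the batch as geoblocked noise and skip saving it.

interface GpuResult { title: string; price: number | null }

function shouldSaveBatch(results: GpuResult[]): boolean {
  if (results.length === 0) return false; // nothing to save anyway
  // A batch where no listing has a price looks like the geoblock symptom.
  return results.some((r) => r.price !== null);
}
```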
- Started work on direct classification for SSDs, which would run as soon as we scrape new SSD items. It will consist of small, efficient AI tasks: first ignoring items that aren't SSDs, then classifying the brand against existing brands in the database, with the option to queue adding a new brand to the database.
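The two-stage shape described above could look like the sketch below. The AI calls are stubbed here with trivial heuristics so the control flow is runnable; all names are hypothetical:

```typescript
// Sketch of the planned two-stage SSD classification. classifyIsSsd and
// classifyBrand stand in for small AI tasks; the heuristics are placeholders.

type BrandResult =
  | { kind: "known"; brand: string }
  | { kind: "queue-new-brand"; brand: string };

function classifyIsSsd(title: string): boolean {
  // Stand-in for the cheap "is this an SSD at all?" check.
  return /\bssd\b/i.test(title);
}

function classifyBrand(title: string, knownBrands: string[]): BrandResult {
  // Stand-in for brand classification against brands already in the database.
  const hit = knownBrands.find((b) => title.toLowerCase().includes(b.toLowerCase()));
  if (hit) return { kind: "known", brand: hit };
  // Unknown brand: queue it to be added to the database.
  const firstWord = title.split(/\s+/)[0];
  return { kind: "queue-new-brand", brand: firstWord };
}

function classifySsdTitle(title: string, knownBrands: string[]): BrandResult | null {
  if (!classifyIsSsd(title)) return null; // ignore non-SSD items early
  return classifyBrand(title, knownBrands);
}
```

Splitting the work into small tasks like this keeps each AI call cheap and easy to evaluate in isolation.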
- Added a script that lets Claude Code review the outputs of AI model classification. Used this to review how well the AI classifies whether a given title is or isn't an SSD.
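One simple way such a review script could expose outputs is as JSONL, one `(title, verdict)` pair per line, so Claude Code can scan the file directly. This structure is only an illustration, not the actual script:

```typescript
// Sketch: dump classification outputs as JSONL for review; field names assumed.

interface ClassificationOutput { title: string; isSsd: boolean }

function toReviewLines(outputs: ClassificationOutput[]): string {
  return outputs
    .map((o) => JSON.stringify({ title: o.title, model_says_ssd: o.isSsd }))
    .join("\n");
}
```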
- Updated the AI browser server I created, which lets me run Perplexity queries through both an OpenAI-spec API and MCP. It can now be dockerised and run on a remote host. It can also handle Cloudflare checks and do a partially automated login to Perplexity; the user just has to paste in the login code from their email.
- Since the Perplexity scraper still needs updates to select which model to use and to handle more Cloudflare checks during searches, I'm still running the AI browser locally for now.
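Because the server exposes an OpenAI-spec API, a client call is just a standard chat-completions request. A sketch, where the base URL, route, and model identifier are assumptions rather than documented values:

```typescript
// Hypothetical client for the AI browser server's OpenAI-compatible endpoint.

function buildChatRequest(query: string) {
  return {
    model: "perplexity", // assumed model identifier
    messages: [{ role: "user" as const, content: query }],
  };
}

async function perplexitySearch(baseUrl: string, query: string): Promise<string> {
  // POST to the assumed OpenAI-compatible chat completions route.
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildChatRequest(query)),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

Because the API follows the OpenAI spec, existing OpenAI client libraries pointed at the server's base URL should also work.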