Cars Arbitrage
Context
This is one of those ideas from “clever” and most probably meaningless projects.
Recently my friend started his journey with buying cars cheap and selling them at more expensive price. The business old as the world. When I asked him - how do you know which car to buy he said: “I don’t know - I just hope I buy the one that somebody will like”. So it was 100% intuition/luck based. Even with this approach - he was able to succeed with few cars already. However, he was not able to sell the most recent one for few months already.
I thought that there may be a cool problem to tinker with:
- Monitor car listings for some longer period of time (one month at least)
- Find which cars sell fast
- Find cars that seem to be underpriced
With such data much better bets could be made and it could lower the risk of not selling car significantly, right?
Another ideas around this project:
- Monitor multiple car listings in few countries - it’s commonly known that people buy cars in Germany and then sell them in Poland - probably there is much more patterns like that

- Have a way to evaluate a car with given attributes - there is many cars not listed online - they just have a sheet with a phone number on them. Probably each of them is an opportunity to make an arbitrage.

Monitor car listings
So I went to one of polish car listings and to my surprise, I found out that the whole data I needed was accessible in nice json in html!
Monitoring was as simple as:
- Do the simple fetch
- Parse html: const $ = cheerio.load(html);
- Get number of pages: $(‘button[title=“Go to next Page”]‘).closest(“li”).prev().prev().text()
- Then iterate from 1 to N adding page parameter to URL
- For each page repeat 1 and 2, take from JSON what’s needed:
- listingId
- name
- url
- fuelType
- mileage
- brand
- year
- price
- gearbox
- createdAt
- description
- location
- sellerId
- engineCapacity
- enginePower
- model
- version
- image
Quite a lot of information!
Then I just put it into DB with additional field when I saw this state of listing
- I wrapped it into temporal.io primitives so it’s easy to run it regularly and don’t worry about failures on the way
- Rented machine on Hetzner
- I manually picked 15 most popular brands and scheduled temporal.io workflow for each of brand
I didn’t expect to go so far with simple fetch - I thought that some bot protection will catch this but it just let me scrap
const response = await fetch(url, {
headers: {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36",
},
});
After ~30 days I had 5’178’789 entries in DB. Count of unique listings gathered: 208’512.
brand | count
---------------+-------
bmw | 32731
audi | 32593
volkswagen | 31950
ford | 30653
opel | 27953
mercedes-benz | 26116
toyota | 22222
peugeot | 17924
skoda | 17814
renault | 17801
volvo | 15003
nissan | 9996
mazda | 7540
mitsubishi | 3465
lexus | 3059
That’s a good start! Let’s see what can we learn from this data!
Which listings are gone and why?
Since I know when listing was created and I know when it disappeared (it stopped appearing in my automatic searches). Figuring the why may be bit harder, but let’s explore what options we have.
Let’s find all listings that disappeared first:
CREATE TABLE disappeared_listings AS
WITH latest_date AS (
SELECT DATE(MAX(seen_at)) AS max_date
FROM cars_stats
),
seen_today AS (
SELECT DISTINCT content->>'listingId' AS listing_id
FROM cars_stats, latest_date
WHERE DATE(seen_at) = latest_date.max_date
),
all_listings AS (
SELECT
content->>'listingId' AS listing_id,
content->>'url' AS url,
MIN(content->>'createdAt') AS created_at,
MAX(seen_at) AS last_seen_at
FROM cars_stats
GROUP BY content->>'listingId', content->>'url'
)
SELECT
a.listing_id,
a.url,
a.created_at,
a.last_seen_at
FROM all_listings a
LEFT JOIN seen_today t ON a.listing_id = t.listing_id
WHERE t.listing_id IS NULL
ORDER BY a.last_seen_at DESC;
149’685 cars disappeared since I started observing. Probably it is a bit too little to draw some worthy conclusions, but let’s see anyway what we have there so far :)
Sanity check: when I go to url it indeed shows that listing expired. There was one case which it wasn’t true. I think that 2 scenarios are possible why that happened:
Scenario 1: race condition
- When I start scraping page 1, car is on page 2
- Some car expires on page 1
- Desired car jumps back to page 1
- Workflow for page 2 fires
Scenario 2: offer bump
- I see at 2:00 AM that listing expired
- Owner refreshes the listing later on
For now I’m just skipping this fact and treating it as an edge case, but for sure we could find ways around this:
- do the request for each url and make sure listing is expired
- require that car was not seen for last 3 days or more
OK, so how do I say which of these 60k cars were sold and what just expired? Let’s take a look for how long listings were being seen.
SELECT diff, COUNT(1) FROM (
SELECT
EXTRACT(DAY FROM last_seen_at::timestamp - created_at::timestamp) diff
FROM disappeared_listings
)
GROUP BY diff
ORDER BY diff DESC;
diff | count
------+-------
75 | 1
74 | 1
73 | 1
72 | 1
71 | 3
70 | 2
69 | 2
68 | 3
67 | 3
66 | 1
65 | 1
64 | 2
63 | 5
62 | 4
61 | 13
60 | 229
59 | 455
58 | 48
57 | 69
56 | 65
55 | 74
54 | 94
53 | 82
52 | 105
51 | 159
50 | 161
49 | 200
48 | 204
47 | 213
46 | 212
45 | 278
44 | 1225
43 | 305
42 | 339
41 | 366
40 | 329
39 | 303
38 | 336
37 | 349
36 | 376
35 | 372
34 | 420
33 | 452
32 | 438
31 | 552
30 | 900
29 | 77542
28 | 4833
27 | 2675
26 | 1556
25 | 1484
24 | 1527
23 | 1597
22 | 1573
21 | 1693
20 | 1745
19 | 1639
19 | 1639
18 | 1558
17 | 1601
16 | 1736
15 | 1765
14 | 3936
13 | 1996
12 | 1844
11 | 1786
10 | 1869
9 | 1912
8 | 2006
7 | 2170
6 | 2244
5 | 2214
4 | 2291
3 | 2391
2 | 2594
1 | 2896
0 | 3259
Knowing the fact, that on this platform your listing expires after 15 or 30 days (depends on how much you pay) - we can easily find these that expired!
SELECT COUNT(1) FROM disappeared_listings
WHERE EXTRACT(DAY FROM last_seen_at::timestamp - created_at::timestamp) % 15 = 14;
count
-------
83'159
(1 row)
So we can assume that listing that were seen for any multiple of 15 is an expired listing. Probably it’s not 100% true - sometimes one could sell the car and didn’t do anything about the listing - then it just expired.
CREATE TABLE sold_cars AS
SELECT * FROM disappeared_listings
WHERE EXTRACT(DAY FROM last_seen_at::timestamp - created_at::timestamp) % 15 != 14
It’s potentially 66’526 sold cars observed in last 30 days. Now I just visit every url and see if indeed it’s expired. If not - I remove it from this table.
After this I have 64’368 listings that seems to be sold out.
What types of cars sell fast
So, if I was about buying cars cheap and selling them at more expensive rate, I would like to know which cars sell well.
As the results I want to have a “car type” -> “average days to sell”. Maybe seeing p99 would be valuable as well (meaning: 99% of cars of this type sell before this amount of days).
By car type I mean:
- brand
- model
- fuel type
- year
- mileage
So, with little bit of SQL gymnastics we can come to conclusions that I have the biggest chance to sell the car if it’s Hybrid Toyota Corolla that is hybrid, from 2023 year, it has between 10k-50k mileage. There was 77 such cars sold in last 30 days. On average such car waited for buyer 18 days, median is 20 days and 99% of these cars were sold below 48 days.
What types of cars don’t sell well
CREATE TABLE not_sold_cars AS
SELECT * FROM disappeared_listings
WHERE EXTRACT(DAY FROM last_seen_at::timestamp - created_at::timestamp) % 15 = 14;
So we can assume 83’159 listings that expired.
Not suprisingly - most of them have 200k+ mileage and are diesel. Hardest car to sell is BMW serie 3 that is diesel and has 200k+ mileage. There were 94 such listing that expired.
Find cars that seems to be underpriced
First, I need to be able to find out similiar cars. Again, no magic here, we can achieve it with simple SQL. Just search for exact brand, fuel type and ranges of year (i.e. +/- 1 year) and mileage (allow some %).
Having that we can calculate median price for given group of attributes. Then we just decide if the car is below median and how much it is below median.
Also, having previous data, we can sort out such opportunities by “how fast it could be sold”.
So now, seeing the car on the street with phone number - we can just call and find out if potentially we can buy this car below the median.
Monitor multiple car listings in other countries
The system would be better if we had even more data. For example coming from marketplace from other countries. I tried with one German platform, but their Akamai bot protection was surprisingly effective!
What I tried:
- set up crawlee
- it worked well locally but got rejected after few requests
- when running on Hetzner I was rejected immediately
- set up a NordVPN proxy with gluetun
- after few requests routed via proxy it got blocked again
Probably to go further I would need to have multiple proxies (preferably residential ones), rotate requests between them and introduce more human-like behaviours. However, for now it didn’t sound like a problem I want to play with. Another option would be to explore other platforms that care less about bots.
So now what?
Now I talk with my friend if it would be useful for him. Probably make some calls and find out if cars in my city can be bought at prices lower than the median observed on the platform. If I have some strong signals that it’s valuable idea to push forward - then I can build on what I have there :)