← back
RSS

Cars Arbitrage

9 min read ·

Context

This is one of those ideas from “clever” and most probably meaningless projects.

Recently my friend started his journey with buying cars cheap and selling them at more expensive price. The business old as the world. When I asked him - how do you know which car to buy he said: “I don’t know - I just hope I buy the one that somebody will like”. So it was 100% intuition/luck based. Even with this approach - he was able to succeed with few cars already. However, he was not able to sell the most recent one for few months already.

I thought that there may be a cool problem to tinker with:

With such data much better bets could be made and it could lower the risk of not selling car significantly, right?

Another ideas around this project:

Monitor car listings

So I went to one of polish car listings and to my surprise, I found out that the whole data I needed was accessible in nice json in html!

Monitoring was as simple as:

  1. Do the simple fetch
  2. Parse html: const $ = cheerio.load(html);
  3. Get number of pages: $(‘button[title=“Go to next Page”]‘).closest(“li”).prev().prev().text()
  4. Then iterate from 1 to N adding page parameter to URL
  5. For each page repeat 1 and 2, take from JSON what’s needed:

Quite a lot of information!

Then I just put it into DB with additional field when I saw this state of listing

  1. I wrapped it into temporal.io primitives so it’s easy to run it regularly and don’t worry about failures on the way
  2. Rented machine on Hetzner
  3. I manually picked 15 most popular brands and scheduled temporal.io workflow for each of brand

I didn’t expect to go so far with simple fetch - I thought that some bot protection will catch this but it just let me scrap

  const response = await fetch(url, {
    headers: {
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/148.0.0.0 Safari/537.36",
    },
  });

After ~30 days I had 5’178’789 entries in DB. Count of unique listings gathered: 208’512.

   brand       | count
---------------+-------
 bmw           | 32731
 audi          | 32593
 volkswagen    | 31950
 ford          | 30653
 opel          | 27953
 mercedes-benz | 26116
 toyota        | 22222
 peugeot       | 17924
 skoda         | 17814
 renault       | 17801
 volvo         | 15003
 nissan        |  9996
 mazda         |  7540
 mitsubishi    |  3465
 lexus         |  3059

That’s a good start! Let’s see what can we learn from this data!

Which listings are gone and why?

Since I know when listing was created and I know when it disappeared (it stopped appearing in my automatic searches). Figuring the why may be bit harder, but let’s explore what options we have.

Let’s find all listings that disappeared first:

CREATE TABLE disappeared_listings AS
 WITH latest_date AS (
   SELECT DATE(MAX(seen_at)) AS max_date
   FROM cars_stats
 ),
 seen_today AS (
   SELECT DISTINCT content->>'listingId' AS listing_id
   FROM cars_stats, latest_date
   WHERE DATE(seen_at) = latest_date.max_date
 ),
 all_listings AS (
   SELECT
     content->>'listingId' AS listing_id,
     content->>'url' AS url,
     MIN(content->>'createdAt') AS created_at,
     MAX(seen_at) AS last_seen_at
   FROM cars_stats
   GROUP BY content->>'listingId', content->>'url'
 )
 SELECT
   a.listing_id,
   a.url,
   a.created_at,
   a.last_seen_at
 FROM all_listings a
 LEFT JOIN seen_today t ON a.listing_id = t.listing_id
 WHERE t.listing_id IS NULL
 ORDER BY a.last_seen_at DESC;

149’685 cars disappeared since I started observing. Probably it is a bit too little to draw some worthy conclusions, but let’s see anyway what we have there so far :)

Sanity check: when I go to url it indeed shows that listing expired. There was one case which it wasn’t true. I think that 2 scenarios are possible why that happened:

Scenario 1: race condition

Scenario 2: offer bump

For now I’m just skipping this fact and treating it as an edge case, but for sure we could find ways around this:

OK, so how do I say which of these 60k cars were sold and what just expired? Let’s take a look for how long listings were being seen.

SELECT diff, COUNT(1) FROM (
  SELECT
    EXTRACT(DAY FROM last_seen_at::timestamp - created_at::timestamp) diff
  FROM disappeared_listings
)
GROUP BY diff
ORDER BY diff DESC;
 diff | count
------+-------
   75 |     1
   74 |     1
   73 |     1
   72 |     1
   71 |     3
   70 |     2
   69 |     2
   68 |     3
   67 |     3
   66 |     1
   65 |     1
   64 |     2
   63 |     5
   62 |     4
   61 |    13
   60 |   229
   59 |   455
   58 |    48
   57 |    69
   56 |    65
   55 |    74
   54 |    94
   53 |    82
   52 |   105
   51 |   159
   50 |   161
   49 |   200
   48 |   204
   47 |   213
   46 |   212
   45 |   278
   44 |  1225
   43 |   305
   42 |   339
   41 |   366
   40 |   329
   39 |   303
   38 |   336
   37 |   349
   36 |   376
   35 |   372
   34 |   420
   33 |   452
   32 |   438
   31 |   552
   30 |   900
   29 | 77542
   28 |  4833
   27 |  2675
   26 |  1556
   25 |  1484
   24 |  1527
   23 |  1597
   22 |  1573
   21 |  1693
   20 |  1745
   19 |  1639
   19 |  1639
   18 |  1558
   17 |  1601
   16 |  1736
   15 |  1765
   14 |  3936
   13 |  1996
   12 |  1844
   11 |  1786
   10 |  1869
    9 |  1912
    8 |  2006
    7 |  2170
    6 |  2244
    5 |  2214
    4 |  2291
    3 |  2391
    2 |  2594
    1 |  2896
    0 |  3259

Knowing the fact, that on this platform your listing expires after 15 or 30 days (depends on how much you pay) - we can easily find these that expired!

SELECT COUNT(1) FROM disappeared_listings
WHERE EXTRACT(DAY FROM last_seen_at::timestamp - created_at::timestamp) % 15 = 14;

 count
-------
 83'159
(1 row)

So we can assume that listing that were seen for any multiple of 15 is an expired listing. Probably it’s not 100% true - sometimes one could sell the car and didn’t do anything about the listing - then it just expired.

CREATE TABLE sold_cars AS
SELECT * FROM disappeared_listings
WHERE EXTRACT(DAY FROM last_seen_at::timestamp - created_at::timestamp) % 15 != 14

It’s potentially 66’526 sold cars observed in last 30 days. Now I just visit every url and see if indeed it’s expired. If not - I remove it from this table.

After this I have 64’368 listings that seems to be sold out.

What types of cars sell fast

So, if I was about buying cars cheap and selling them at more expensive rate, I would like to know which cars sell well.

As the results I want to have a “car type” -> “average days to sell”. Maybe seeing p99 would be valuable as well (meaning: 99% of cars of this type sell before this amount of days).

By car type I mean:

So, with little bit of SQL gymnastics we can come to conclusions that I have the biggest chance to sell the car if it’s Hybrid Toyota Corolla that is hybrid, from 2023 year, it has between 10k-50k mileage. There was 77 such cars sold in last 30 days. On average such car waited for buyer 18 days, median is 20 days and 99% of these cars were sold below 48 days.

What types of cars don’t sell well

  CREATE TABLE not_sold_cars AS
  SELECT * FROM disappeared_listings
  WHERE EXTRACT(DAY FROM last_seen_at::timestamp - created_at::timestamp) % 15 = 14;

So we can assume 83’159 listings that expired.

Not suprisingly - most of them have 200k+ mileage and are diesel. Hardest car to sell is BMW serie 3 that is diesel and has 200k+ mileage. There were 94 such listing that expired.

Find cars that seems to be underpriced

First, I need to be able to find out similiar cars. Again, no magic here, we can achieve it with simple SQL. Just search for exact brand, fuel type and ranges of year (i.e. +/- 1 year) and mileage (allow some %).

Having that we can calculate median price for given group of attributes. Then we just decide if the car is below median and how much it is below median.

Also, having previous data, we can sort out such opportunities by “how fast it could be sold”.

So now, seeing the car on the street with phone number - we can just call and find out if potentially we can buy this car below the median.

Monitor multiple car listings in other countries

The system would be better if we had even more data. For example coming from marketplace from other countries. I tried with one German platform, but their Akamai bot protection was surprisingly effective!

What I tried:

Probably to go further I would need to have multiple proxies (preferably residential ones), rotate requests between them and introduce more human-like behaviours. However, for now it didn’t sound like a problem I want to play with. Another option would be to explore other platforms that care less about bots.

So now what?

Now I talk with my friend if it would be useful for him. Probably make some calls and find out if cars in my city can be bought at prices lower than the median observed on the platform. If I have some strong signals that it’s valuable idea to push forward - then I can build on what I have there :)