I have a table called PendingExpense
with a few simple columns but a lot of rows. I'm working on some queries for paginated GET responses, but running into some confusion working with the queries and the MySQL optimizer seemingly making a senseless decision to do a full index scan for the ORDER BY
clause before filtering from the WHERE
clause.
This is on MySQL version 8.0.23.
PendingExpense
DDL (note, a companyId and loginCredentialId is how I specify a user in my schema):
create table PendingExpense
(
ID bigint auto_increment primary key,
LOGINCREDENTIALID int null,
COMPANYID int null,
DATE datetime null,
-- ... other rows that don't pertain, e.g. amount, status, type, state, country, merchant
)
create index IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC
on PendingExpense (COMPANYID, LOGINCREDENTIALID);
create index IN_PendingExpense_LOGINCREDENTIALID_ASC
on PendingExpense (LOGINCREDENTIALID);
create index IN_PendingExpense_Date
on PendingExpense (DATE);
Then here are the two queries I'm comparing, they are identical other than the index hint. I'm including the execution plans for both immediately below:
Query 1 (no hints):
explain analyze select id from PendingExpense
where COMPANYID = 1641 and LOGINCREDENTIALID = 2451
order by date DESC, id DESC
limit 101; -- takes 5.5 seconds
-> Limit: 101 row(s) (cost=2356102.00 rows=101) (actual time=2292.676..4474.843 rows=101 loops=1)
-> Filter: ((PendingExpense.LOGINCREDENTIALID = 2451) and (PendingExpense.COMPANYID = 1641)) (cost=2356102.00 rows=105) (actual time=2292.675..4474.818 rows=101 loops=1)
-> Index scan on PendingExpense using IN_PendintExpense_Date (reverse) (cost=2356102.00 rows=5660) (actual time=0.088..4371.774 rows=1491859 loops=1)
Query 2 (index hint):
explain analyze select id from PendingExpense use index (IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC)
where COMPANYID = 1641 and LOGINCREDENTIALID = 2451
order by date desc, id desc
limit 101; -- .184 seconds
-> Limit: 101 row(s) (cost=9722.30 rows=101) (actual time=38.255..38.267 rows=101 loops=1)
-> Sort: PendingExpense.`DATE` DESC, PendingExpense.ID DESC, limit input to 101 row(s) per chunk (cost=9722.30 rows=27778) (actual time=38.254..38.259 rows=101 loops=1)
-> Index lookup on PendingExpense using IN_PendingExpense_COMPANYID_ASC_LOGINCREDENTIALID_ASC (COMPANYID=1641, LOGINCREDENTIALID=2451) (actual time=0.046..35.410 rows=14170 loops=1)
Essentially, I'm confused why MySql chooses to do the full index scan first before filtering on companyId / loginCredentialId when the index already exists for those two, causing significant inefficiencies. I'd much prefer to not have to specify index hints in my code/queries for cleanliness. I was under the impression MySQL generally chooses to run the where clause filtering first, especially if an index already exists for it.
Any help / hints / insight would be appreciated here. Thanks!
