Written by Marvin Hsu
on October 11, 2023

如何在EFCore中使用Posrgres的Jsonb(2)

@Alex Shuper

雖然在設計階段應該避免讓Jsonb的資料成為查詢條件，但開發的時候仍然難以避免這樣的需求。Postgres針對Jsonb的查詢提供了不少高效的API，如果想要搭配EFCore使用就會需要預先定義型別或者搭配JsonDocument使用。

準備資料

先在本機以docker準備資料庫

docker run --name my-postgres -e POSTGRES_PASSWORD=postgres -p 5432:5432 -d postgres

接下來我們要先準備幾個類別

// 我們來準備一些動物的資料
public class Animal : IDisposable
{
    public Guid Id { get; set; }
    public JsonDocument JsonbData { get; set; }

    public void Dispose() => JsonbData.Dispose();
}

// 每一種動物都有不同的資料
public record Dog(string Name, bool IsHealthy);
public record Cat(string Name, Guid CategoryId);
public record Bird(string Name);

// 另外貓的種類另外拉出來放到一個資料表中
public class CatCategory
{
    public Guid Id { get; set; }
    public string CategoryName { get; set; }
}

可以用下面的C#程式碼來準備測試用的資料，記得更換自己的連線字串:
https://gist.github.com/marvin-hsu/a7ea9bd6d514ccddb094144b298e39b3

Jsonb欄位作為Join條件

在預先準備的資料中，由於我們把貓的種類紀錄到JsonbData中，假設我想要找到前十隻波斯貓的話，查詢語法就會變成:

(await (from animal in context.Animal
            join category in context.CatCategory
                on animal.JsonbData.RootElement.GetProperty(nameof(Cat.CategoryId)).GetGuid() equals category.Id
            where category.CategoryName == "Persian"
            select animal)
        .Take(10)
        .ToListAsync())
    .Select(x=> x.JsonbData.Deserialize<Cat>().Name)
    .ToList()
    .ForEach(Console.WriteLine);

神奇吧!Jsonb中的欄位竟然可以跟一般的欄位Join，我們來看一下轉譯出來的語法:

SELECT animal.Id AS animal_id, animal.JsonbData AS animal_data
FROM Animal AS animal
INNER JOIN CatCategory AS category ON CAST(animal.JsonbData->>'CategoryId' AS uuid) = category.Id
WHERE category.CategoryName = 'Persian'
LIMIT 10

再來就是執行計畫:

Limit  (cost=23.44..73.87 rows=10 width=55) (actual time=0.039..0.076 rows=10 loops=1)
  ->  Hash Join  (cost=23.44..2378.79 rows=467 width=55) (actual time=0.038..0.075 rows=10 loops=1)
"        Hash Cond: (((animal.""JsonbData"" ->> 'CategoryId'::text))::uuid = category.""Id"")"
"        ->  Seq Scan on ""Animal"" animal  (cost=0.00..2054.00 rows=100000 width=55) (actual time=0.006..0.014 rows=146 loops=1)"
        ->  Hash  (cost=23.38..23.38 rows=5 width=16) (actual time=0.007..0.008 rows=1 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
"              ->  Seq Scan on ""CatCategory"" category  (cost=0.00..23.38 rows=5 width=16) (actual time=0.005..0.006 rows=1 loops=1)"
"                    Filter: (""CategoryName"" = 'Persian'::text)"
                    Rows Removed by Filter: 3
Planning Time: 0.107 ms
Execution Time: 0.093 ms

資料很簡單，但這邊可以看到因為是由貓的類型回來尋找動物的名稱，因此不會走到索引。

使用Where查詢

先來個簡單的查詢，我想要找到第一支健康的狗:

await context.Animal
    .FirstOrDefaultAsync(x => x.JsonbData
        .RootElement
        .GetProperty(nameof(Dog.IsHealthy))
        .GetBoolean());

// SELECT a."Id", a."JsonbData"
// FROM "Animal" AS a
// WHERE CAST(a."JsonbData"->>'IsHealthy' AS boolean)
// LIMIT 1

現在來介紹一下npgsql提供的幾個函數:

EF.Functions	SQL
`EF.Functions.JsonContains(customer, @"{""Name"": ""Joe"", ""Age"": 25}")`	`customer @> '{"Name": "Joe", "Age": 25}'`
`EF.Functions.JsonContained(@"{""Name"": ""Joe"", ""Age"": 25}", customer)`	`'{"Name": "Joe", "Age": 25}' <@ customer`
`EF.Functions.JsonExists(customer, "Age")`	`customer ? 'Age'`
`EF.Functions.JsonExistsAny(customer, "Age", "Address")`	`customer ? \\| ARRAY['Age','Address']`
`EF.Functions.JsonExistsAll(customer, "Age", "Address")`	`customer ?& ARRAY['Age','Address']`
`EF.Functions.JsonTypeof(customer.GetProperty("Age")) == "number"`	`jsonb_typeof(customer->'Age') = 'number'`

這幾個函數的回傳值都是bool，JsonContains與JsonContained都是以Json的型態去找察相關的物件。Exist系列則是查詢Json結構中的key。最後一個JsonTypeof則是以value的型別為條件。
現在使用EF.Function來查詢前十隻波斯貓，這次直接用CategoryId來查詢

await context.Animal
    .Where(x => EF.Functions.JsonContains(x.JsonbData, """{"CategoryId": "b9daf2a3-d022-4414-b83b-2719afb98a85"}"""))
    .Take(10).ToListAsync();

// SELECT a."Id", a."JsonbData"
// FROM "Animal" AS a
// WHERE a."JsonbData" @> '{"CategoryId": "b9daf2a3-d022-4414-b83b-2719afb98a85"}'
// LIMIT @__p_1

這兩個查詢比較一下可以發現直接使用GetProperty轉譯出來的語法會多一次轉型，而使用EF.Function則是直接比對資料的型態。在平常的查詢中，如果遇到需要轉換欄位型別的狀況，往往容易造成索引失效，在Jsonb的查詢中也是盡量避免，通常->>之類的語法應該在已經查詢好資料要整理成所需的格式時使用，使用@>比對結構通常會比較高效，如果是像Join這種必須取出欄位的情況下則是需要確保查詢上都是以jsonb這方為約束另一張表索引的條件。

關於幾個EF.Function可以參考這裡

Jsonb索引

在Postgres中其實可以對Jsonb下索引來增進查詢效能，我們來稍微是一下Btree跟Gin

Btree
這是針對Jsonb中的特定欄位來下索引，我們來針對Name這個欄位增加索引:

CREATE INDEX animal_name_btree ON "Animal" USING BTREE (("JsonbData"->>'Name'));

接下來比較兩種查詢方式:

explain analyse
select * from "Animal"
where "JsonbData" ->> 'Name' = 'de7e7';

-- "Bitmap Heap Scan on ""Animal""  (cost=12.17..886.51 rows=500 width=55) (actual time=0.023..0.024 rows=1 loops=1)"
-- "  Recheck Cond: ((""JsonbData"" ->> 'Name'::text) = 'de7e7'::text)"
--   Heap Blocks: exact=1
--   ->  Bitmap Index Scan on animal_name_btree  (cost=0.00..12.04 rows=500 width=0) (actual time=0.017..0.018 rows=1 loops=1)
-- "        Index Cond: ((""JsonbData"" ->> 'Name'::text) = 'de7e7'::text)"
-- Planning Time: 0.064 ms
-- Execution Time: 0.041 ms

explain analyse
select * from "Animal"
where "JsonbData" @> '{"Name":"de7e7"}';

-- "Seq Scan on ""Animal""  (cost=0.00..2304.00 rows=10 width=55) (actual time=0.011..20.234 rows=1 loops=1)"
-- "  Filter: (""JsonbData"" @> '{""Name"": ""de7e7""}'::jsonb)"
--   Rows Removed by Filter: 99999
-- Planning Time: 0.053 ms
-- Execution Time: 20.247 ms

使用btree可以改善->>的查詢效能，但是有欄位要求彈性並沒有那麼大，從執行計畫中可以看到@>不會走btree index

Gin
Gin index則是對整個欄位做索引，比較適合範圍的查詢

CREATE INDEX animal_name_gin ON "Animal" USING gin ("JsonbData");

explain analyse
select * from "Animal"
where "JsonbData" ->> 'Name' = 'de7e7';

-- "Seq Scan on ""Animal""  (cost=0.00..2554.00 rows=500 width=55) (actual time=0.012..16.196 rows=1 loops=1)"
-- "  Filter: ((""JsonbData"" ->> 'Name'::text) = 'de7e7'::text)"
--   Rows Removed by Filter: 99999
-- Planning Time: 0.110 ms
-- Execution Time: 16.211 ms

explain analyse
select * from "Animal"
where "JsonbData" @> '{"Name":"de7e7"}';

-- "Bitmap Heap Scan on ""Animal""  (cost=28.08..65.28 rows=10 width=55) (actual time=0.043..0.043 rows=1 loops=1)"
-- "  Recheck Cond: (""JsonbData"" @> '{""Name"": ""de7e7""}'::jsonb)"
--   Heap Blocks: exact=1
--   ->  Bitmap Index Scan on animal_name_btree  (cost=0.00..28.07 rows=10 width=0) (actual time=0.038..0.038 rows=1 loops=1)
-- "        Index Cond: (""JsonbData"" @> '{""Name"": ""de7e7""}'::jsonb)"
-- Planning Time: 0.114 ms
-- Execution Time: 0.057 ms

這邊可以看到這次就換@>走到索引了，‵Gin‵比較適合範圍的查詢，不需要指定結構

接下來看一下這兩個索引的大小

SELECT
    indexname AS index_name,
    pg_size_pretty(pg_total_relation_size(indexname::regclass)) AS total_size
FROM pg_indexes
WHERE tablename = 'Animal' AND indexname in ('animal_name_btree','animal_name_gin');

+-----------------+----------+
|index_name       |total_size|
+-----------------+----------+
|animal_name_btree|2184 kB   |
|animal_name_gin  |5608 kB   |
+-----------------+----------+

理所當然的，Gin的大小比較大

因為這次使用的測試資料比較簡單所以看不太出差異，但在專案中測試的時候，沒有下index的查詢大概需要花上12s，而經過btree或是gin優化後查詢都有顯著的提昇(1~2ms左右)，不過Gin的大小真的會大非常多!

Top