从头实现数据库-10

Date Unknown

二级索引

在本章中,我们将向数据库添加额外的索引(也称为二级索引)。查询将不再局限于主键。
 

索引定义

在表定义中添加了索引和索引前缀字段。与表本身一样,每个索引在KV存储中都分配了一个键前缀。
// table definition type TableDef struct { // user defined Name string Types []uint32 // column types Cols []string // column names PKeys int // the first `PKeys` columns are the primary key Indexes [][]string // auto-assigned B-tree key prefixes for different tables/indexes Prefix uint32 IndexPrefixes []uint32 }
要通过索引找到行,索引必须包含主键的副本。我们将通过将主键列附加到索引中来实现这一点;这也使索引键唯一,这是B树查找代码的假设。
func checkIndexKeys(tdef *TableDef, index []string) ([]string, error) { icols := map[string]bool{} for _, c := range index { // check the index columns // omitted... icols[c] = true } // add the primary key to the index for _, c := range tdef.Cols[:tdef.PKeys] { if !icols[c] { index = append(index, c) } } assert(len(index) < len(tdef.Cols)) return index, nil } func colIndex(tdef *TableDef, col string) int { for i, c := range tdef.Cols { if c == col { return i } } return -1 }
在创建新表之前,会检查索引并附加主键。
func tableDefCheck(tdef *TableDef) error { // verify the table definition // omitted... // verify the indexes for i, index := range tdef.Indexes { index, err := checkIndexKeys(tdef, index) if err != nil { return err } tdef.Indexes[i] = index } return nil }
创建新表时分配多个键前缀。
 
// create a new table func (db *DB) TableNew(tdef *TableDef) error { if err := tableDefCheck(tdef); err != nil { return err } // check the existing table // omited... // allocate new prefixes tdef.Prefix = /* omited... */ for i := range tdef.Indexes { prefix := tdef.Prefix + 1 + uint32(i) tdef.IndexPrefixes = append(tdef.IndexPrefixes, prefix) } // update the next prefix ntree := 1 + uint32(len(tdef.Indexes)) binary.LittleEndian.PutUint32(meta.Get("val").Str, tdef.Prefix+ntree) _, err = dbUpdate(db, TDEF_META, *meta, 0) if err != nil { return err } // store the definition // omited... }
 

维护索引

在更新行后,我们需要从索引中删除旧行。B树接口被修改为返回更新前的值。
 
type InsertReq struct { tree *BTree // out Added bool // added a new key Updated bool // added a new key or an old key was changed Old []byte // the value before the update // in Key []byte Val []byte Mode int } type DeleteReq struct { tree *BTree // in Key []byte // out Old []byte } func (tree *BTree) InsertEx(req *InsertReq) func (tree *BTree) DeleteEx(req *DeleteReq)
下面是添加或删除索引中记录的函数。在这里,我们遇到了一个问题:更新具有二级索引的表涉及KV 存储中的多个键,这应该以原子方式进行。我们将在以后的章节中解决这个问题。
 
 
const ( INDEX_ADD = 1 INDEX_DEL = 2 ) // maintain indexes after a record is added or removed func indexOp(db *DB, tdef *TableDef, rec Record, op int) { key := make([]byte, 0, 256) irec := make([]Value, len(tdef.Cols)) for i, index := range tdef.Indexes { // the indexed key for j, c := range index { irec[j] = *rec.Get(c) } // update the KV store key = encodeKey(key[:0], tdef.IndexPrefixes[i], irec[:len(index)]) done, err := false, error(nil) switch op { case INDEX_ADD: done, err = db.kv.Update(&InsertReq{Key: key}) case INDEX_DEL: done, err = db.kv.Del(&DeleteReq{Key: key}) default: panic("what?") } assert(err == nil) // XXX: will fix this in later chapters assert(done) } }
更新或删除行后维护索引:
// add a row to the table func dbUpdate(db *DB, tdef *TableDef, rec Record, mode int) (bool, error) { // omitted... req := InsertReq{Key: key, Val: val, Mode: mode} added, err := db.kv.Update(&req) if err != nil || !req.Updated || len(tdef.Indexes) == 0 { return added, err } // maintain indexes if req.Updated && !req.Added { decodeValues(req.Old, values[tdef.PKeys:]) // get the old row indexOp(db, tdef, Record{tdef.Cols, values}, INDEX_DEL) } if req.Updated { indexOp(db, tdef, rec, INDEX_ADD) } return added, nil }
// delete a record by its primary key func dbDelete(db *DB, tdef *TableDef, rec Record) (bool, error) { // omitted... deleted, err := db.kv.Del(&req) if err != nil || !deleted || len(tdef.Indexes) == 0 { return deleted, err } // maintain indexes if deleted { // likewise... } return true, nil }

使用二级索引

选择索引

我们还将使用索引的前缀执行范围查询。例如,我们可以在包含前缀【a】的索引【a, b, c】上执行x < a AND a < y。选择索引只是通过输入前缀匹配列。主键在二级索引之前考虑。
func findIndex(tdef *TableDef, keys []string) (int, error) { pk := tdef.Cols[:tdef.PKeys] if isPrefix(pk, keys) { // use the primary key. // also works for full table scans without a key. return -1, nil } // find a suitable index winner := -2 for i, index := range tdef.Indexes { if !isPrefix(index, keys) { continue } if winner == -2 || len(index) < len(tdef.Indexes[winner]) { winner = i } } if winner == -2 { return -2, fmt.Errorf("no index found") } return winner, nil } func isPrefix(long []string, short []string) bool { if len(long) < len(short) { return false } for i, c := range short { if long[i] != c { return false } } return true }

编码索引前缀

如果输入键使用索引的前缀而不是完整的索引,我们可能需要编码额外的列。例如,对于带有索引【a,b】的查询v1 < a,我们不能使用【v1】 < key作为底层B树查询,因为任何键【v1,v2】都满足【v1】 < 【v1,v2】,而违反v1 < a。
相反,在这种情况下,我们可以使用【v1,MAX】 < key,其中MAX是列b的最大可能值。以下是编码带有额外列的部分查询键的函数。
// The range key can be a prefix of the index key, // we may have to encode missing columns to make the comparison work. func encodeKeyPartial( out []byte, prefix uint32, values []Value, tdef *TableDef, keys []string, cmp int, ) []byte { out = encodeKey(out, prefix, values) // Encode the missing columns as either minimum or maximum values, // depending on the comparison operator. // 1. The empty string is lower than all possible value encodings, // thus we don't need to add anything for CMP_LT and CMP_GE. // 2. The maximum encodings are all 0xff bytes. max := cmp == CMP_GT || cmp == CMP_LE loop: for i := len(values); max && i < len(keys); i++ { switch tdef.Types[colIndex(tdef, keys[i])] { case TYPE_BYTES: out = append(out, 0xff) break loop // stops here since no string encoding starts with 0xff case TYPE_INT64: out = append(out, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff) default: panic("what?") } } return out }
对于 int64 类型,最大值编码为所有 0xff 字节。问题是字符串没有最大值。我们可以做的就是使用"\xff" 作为 "伪最大字符串值" 的编码,并将正常字符串编码更改为不以 "\xff" 开头。
如果一个字符串的第一字节是"\xff"或"\xfe",那么它就会被"\xfe"字节转义。因此,所有字符串编码都小于"\xff"。
// 1. strings are encoded as null-terminated strings, // escape the null byte so that strings contain no null byte. // 2. "\xff" represents the highest order in key comparisons, // also escape the first byte if it's 0xff. func escapeString(in []byte) []byte { // omitted... pos := 0 if len(in) > 0 && in[0] >= 0xfe { out[0] = 0xfe out[1] = in[0] pos += 2 in = in[1:] } // omitted... return out }

通过索引检索行

索引键包含所有主键列,因此我们可以找到完整的行。Scanner类型现在知道选择了哪个索引。
// the iterator for range queries type Scanner struct { // omitted... db *DB tdef *TableDef indexNo int // -1: use the primary key; >= 0: use an index iter *BIter // the underlying B-tree iterator keyEnd []byte // the encoded Key2 }
// fetch the current row func (sc *Scanner) Deref(rec *Record) { assert(sc.Valid()) tdef := sc.tdef rec.Cols = tdef.Cols rec.Vals = rec.Vals[:0] key, val := sc.iter.Deref() if sc.indexNo < 0 { // primary key, decode the KV pair // omitted... } else { // secondary index // The "value" part of the KV store is not used by indexes assert(len(val) == 0) // decode the primary key first index := tdef.Indexes[sc.indexNo] ival := make([]Value, len(index)) for i, c := range index { ival[i].Type = tdef.Types[colIndex(tdef, c)] } decodeValues(key[4:], ival) icol := Record{index, ival} // fetch the row by the primary key rec.Cols = tdef.Cols[:tdef.PKeys] for _, c := range rec.Cols { rec.Vals = append(rec.Vals, *icol.Get(c)) } // TODO: skip this if the index contains all the columns ok, err := dbGet(sc.db, tdef, rec) assert(ok && err == nil) } }

将所有部件组合在一起

修改后的 dbScan 函数使用二级索引。现在还可以通过索引前缀进行范围查询。它还可以扫描整个表,而不需要任何键(如果没有列,则选择主键)。
func dbScan(db *DB, tdef *TableDef, req *Scanner) error { // sanity checks // omitted... // select an index indexNo, err := findIndex(tdef, req.Key1.Cols) if err != nil { return err } index, prefix := tdef.Cols[:tdef.PKeys], tdef.Prefix if indexNo >= 0 { index, prefix = tdef.Indexes[indexNo], tdef.IndexPrefixes[indexNo] } req.db = db req.tdef = tdef req.indexNo = indexNo // seek to the start key keyStart := encodeKeyPartial( nil, prefix, req.Key1.Vals, tdef, index, req.Cmp1) req.keyEnd = encodeKeyPartial( nil, prefix, req.Key2.Vals, tdef, index, req.Cmp2) req.iter = db.kv.tree.Seek(keyStart, req.Cmp1) return nil }
我们实现了关系数据库的一些主要功能:表、范围查询和二级索引。我们可以开始向数据库添加更多功能和查询语言。然而,还有一些主要方面仍然缺失:事务和并发性,这些将在以后的章节中进行探索。

Copyright © 2025 later

logo