433: fix(filter): Fix two bugs. r=Kerollmops a=irevoire
- Stop lowercasing the field when looking in the field id map
- When a field id does not exist it means there is currently zero
documents containing this field thus we return an empty RoaringBitmap
instead of throwing an internal error
Will fix https://github.com/meilisearch/MeiliSearch/issues/2082 once meilisearch is released
Co-authored-by: Tamo <tamo@meilisearch.com>
426: Fix search highlight for non-unicode chars r=ManyTheFish a=Samyak2
# Pull Request
## What does this PR do?
Fixes https://github.com/meilisearch/MeiliSearch/issues/1480
<!-- Please link the issue you're trying to fix with this PR, if none then please create an issue first. -->
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
## Changes
The `matching_bytes` function takes a `&Token` now and:
- gets the number of bytes to highlight (unchanged).
- uses `Token.num_graphemes_from_bytes` to get the number of grapheme clusters to highlight.
In essence, the `matching_bytes` function now returns the number of matching grapheme clusters instead of bytes.
Added proper highlighting in the HTTP UI:
- requires dependency on `unicode-segmentation` to extract grapheme clusters from tokens
- `<mark>` tag is put around only the matched part
- before this change, the entire word was highlighted even if only a part of it matched
## Questions
Since `matching_bytes` does not return number of bytes but grapheme clusters, should it be renamed to something like `matching_chars` or `matching_graphemes`? Will this break the API?
Thank you very much `@ManyTheFish` for helping 😄
Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com>
- Stop lowercasing the field when looking in the field id map
- When a field id does not exist it means there is currently zero
documents containing this field thus we returns an empty RoaringBitmap
instead of throwing an internal error
432: Fuzzer r=Kerollmops a=irevoire
Provide a first way of fuzzing the indexing part of milli.
It depends on [cargo-fuzz](https://rust-fuzz.github.io/book/cargo-fuzz.html)
Co-authored-by: Tamo <tamo@meilisearch.com>
The `matching_bytes` function takes a `&Token` now and:
- gets the number of bytes to highlight (unchanged).
- uses `Token.num_graphemes_from_bytes` to get the number of grapheme
clusters to highlight.
In essence, the `matching_bytes` function returns the number of matching
grapheme clusters instead of bytes. Should this function be renamed
then?
Added proper highlighting in the HTTP UI:
- requires dependency on `unicode-segmentation` to extract grapheme
clusters from tokens
- `<mark>` tag is put around only the matched part
- before this change, the entire word was highlighted even if only a
part of it matched
2076: fix(dump): Fix the import of dump from the v24 and before r=ManyTheFish a=irevoire
Same as https://github.com/meilisearch/MeiliSearch/pull/2073 but on main this time
Co-authored-by: Irevoire <tamo@meilisearch.com>
2066: bug(http): fix task duration r=MarinPostma a=MarinPostma
`@gmourier` found that the duration in the task view was not computed correctly, this pr fixes it.
`@curquiza,` I let you decide if we need to make a hotfix out of this or wait for the next release. This is not breaking.
Co-authored-by: mpostma <postma.marin@protonmail.com>
2057: fix(dump): Uncompress the dump IN the data.ms r=irevoire a=irevoire
When loading a dump with docker, we had two problems.
After creating a tempdirectory, uncompressing and re-indexing the dump:
1. We try to `move` the new “data.ms” onto the currently present
one. The problem is that if the `data.ms` is a mount point because
that's what peoples do with docker usually. We can't override
a mount point, and thus we were throwing an error.
2. The tempdir is created in `/tmp`, which is usually quite small AND may not
be on the same partition as the `data.ms`. This means when we tried to move
the dump over the `data.ms`, it was also failing because we can't move data
between two partitions.
------------------
1 was fixed by deleting the *content* of the `data.ms` and moving the *content*
of the tempdir *inside* the `data.ms`. If someone tries to create volumes inside
the `data.ms` that's his problem, not ours.
2 was fixed by creating the tempdir *inside* of the `data.ms`. If a user mounted
its `data.ms` on a large partition, there is no reason he could not load a big
dump because his `/tmp` was too small. This solves the issue; now the dump is
extracted and indexed on the same partition the `data.ms` will lay.
fix#1833
Co-authored-by: Tamo <tamo@meilisearch.com>
424: Store the geopoint in three dimensions r=Kerollmops a=irevoire
Related to this issue: https://github.com/meilisearch/MeiliSearch/issues/1872
Fix the whole computation of distance for any “geo” operations (sort or filter). Now when you sort points they are returned to you in the right order.
And when you filter on a specific radius you only get points included in the radius.
This PR changes the way we store the geo points in the RTree.
Instead of considering the latitude and longitude as orthogonal coordinates, we convert them to real orthogonal coordinates projected on a sphere with a radius of 1.
This is the conversion formulae.
![image](https://user-images.githubusercontent.com/7032172/145990456-eefe840a-384f-4486-848b-81d0036814ec.png)
Which, in rust, translate to this function:
```rust
pub fn lat_lng_to_xyz(coord: &[f64; 2]) -> [f64; 3] {
let [lat, lng] = coord.map(|f| f.to_radians());
let x = lat.cos() * lng.cos();
let y = lat.cos() * lng.sin();
let z = lat.sin();
[x, y, z]
}
```
Storing the points on a sphere is easier / faster to compute than storing the point on an approximation of the real earth shape.
But when we need to compute the distance between two points we still need to use the haversine distance which works with latitude and longitude.
So, to do the fewest search-time computation possible I'm now associating every point with its `DocId` and its lat/lng.
Co-authored-by: Tamo <tamo@meilisearch.com>
2060: chore(all) set rust edition to 2021 r=MarinPostma a=MarinPostma
set the rust edition for the project to 2021
this make the MSRV to v1.56
#2058
Co-authored-by: Marin Postma <postma.marin@protonmail.com>
When loading a dump with docker, we had two problems.
After creating a tempdirectory, uncompressing and re-indexing the dump:
1. We try to `move` the new “data.ms” onto the currently present
one. The problem is that if the `data.ms` is a mount point because
that's what peoples do with docker usually. We can't override
a mount point, and thus we were throwing an error.
2. The tempdir is created in `/tmp`, which is usually quite small AND may not
be on the same partition as the `data.ms`. This means when we tried to move
the dump over the `data.ms`, it was also failing because we can't move data
between two partitions.
==============
1 was fixed by deleting the *content* of the `data.ms` and moving the *content*
of the tempdir *inside* the `data.ms`. If someone tries to create volumes inside
the `data.ms` that's his problem, not ours.
2 was fixed by creating the tempdir *inside* of the `data.ms`. If a user mounted
its `data.ms` on a large partition, there is no reason he could not load a big
dump because his `/tmp` was too small. This solves the issue; now the dump is
extracted and indexed on the same partition the `data.ms` will lay.
fix#1833
2059: change indexed doc count on error r=irevoire a=MarinPostma
change `indexed_documents` and `deleted_documents` to return 0 instead of null when empty when the task has failed.
close#2053
Co-authored-by: Marin Postma <postma.marin@protonmail.com>
2056: Allow any header for CORS r=curquiza a=curquiza
Bug fix: trigger a CORS error when trying to send the `User-Agent` header via the browser
`@bidoubiwa` thanks for the bug report!
Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
429: Benchmark CIs: not use a default label to call the GH runner r=irevoire a=curquiza
Since we now have multiple self-hosted github runners, we need to differentiate them calling them in the CI. The `self-hosted` label is the default one, so we need to use the unique and appropriate one for the benchmark machine
<img width="925" alt="Capture d’écran 2022-01-04 à 15 42 18" src="https://user-images.githubusercontent.com/20380692/148079840-49cd7878-5912-46ff-8ab8-bf646777f782.png">
Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>