# FAQ: Decentralized Proximity Tracing
This FAQ attempts to answer frequently asked questions about the DP-3T project, the problems it tries to address, and its design choices. It is by no means complete. We'll be updating this FAQ as we go; for now we are focussing on answering the technical questions first. Feedback is very welcome.

* [Protocol Questions](#protocol-questions)
* [P1: Why don't infected users upload the ephemeral Bluetooth identifiers (EphIDs) they have observed to the backend server, so that other apps can download them and check for contacts locally?](#p1-why-dont-infected-users-upload-the-ephemeral-bluetooth-identifiers-ephids-they-have-observed-to-the-backend-server-so-that-other-apps-can-download-them-and-check-for-contacts-locally)
* [P2: Why don't infected users upload the ephemeral Bluetooth identifiers (EphIDs) they have observed to the backend server, so that other apps can ask the server if there is a match with their own EphIDs?](#p2-why-dont-infected-users-upload-the-ephemeral-bluetooth-identifiers-ephids-they-have-observed-to-the-backend-server-so-that-other-apps-can-ask-the-server-if-there-is-a-match-with-their-own-ephids)
* [P3: Why not use multi-party computation or custom privacy\-preserving protocols (PSI, PIR, etc\.) instead to query the server for the observed ephemeral Bluetooth identifiers?](#p3-why-not-use-multi-party-computation-or-custom-privacy-preserving-protocols-psi-pir-etc-instead-to-query-the-server-for-the-observed-ephemeral-bluetooth-identifiers)
* [P4: Why is the system not using public key cryptography when broadcasting identifiers?](#p4-why-is-the-system-not-using-public-key-cryptography-when-broadcasting-identifiers)
* [P5: Why not use mixnets or other anonymous communication systems to query the server?](#p5-why-not-use-mixnets-or-other-anonymous-communication-systems-to-query-the-server)
* [P6: Why do infected people upload a seed (which enables recreating EphIDs) instead of their individual EphIDs?](#p6-why-do-infected-people-upload-a-seed-which-enables-recreating-ephids-instead-of-their-individual-ephids-)
* [P7: Why do you call your design "decentralized" while having a backend?](#p7-why-do-you-call-your-design-decentralized-while-having-a-backend)
## Protocol Questions
Questions regarding the underlying protocol and mitigations for known vulnerabilities.

### P1: Why don't infected users upload the ephemeral Bluetooth identifiers (`EphIDs`) they have observed to the backend server, so that other apps can download them and check for contacts locally?

**Short answer:** The bandwidth cost of downloading all observed Bluetooth
identifiers from all infected individuals is high. Furthermore, it facilitates
attacks that insert or remove contact events. Finally, it reveals interactions
between pseudonymous users to the backend server, without providing extra
privacy in comparison with publishing the infected users' seeds.

**Long answer:** It is possible to build a privacy-friendly contact tracing
system by letting diagnosed patients upload the list of observed ephemeral
Bluetooth identifiers (EphIDs). All other smartphones would then download this list,
and check if any of the identifiers they generated was seen by (and therefore in
close physical proximity to) an infected patient.

This option, however, is very costly. In Europe there are more than 30,000
new patients a day. The number of observed EphIDs is also high. We expect people to
be in close physical proximity with many people. For instance, spending 24 hours
at home with your partner will already yield 96 recorded EphIDs (assuming they
change every 15 minutes). So let's say an infected person uploads 5,000 unique
contact events covering 21 days. We then need to transfer 150 million records
(30,000 × 5,000) per day. Even using efficient representations (e.g., a cuckoo
filter) this would take at least 600 MB to be downloaded by every app, every day.
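
The figures in this estimate can be checked with a few lines of arithmetic. The sketch below only restates the numbers given above; it assumes decimal megabytes, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the download size discussed above.
NEW_PATIENTS_PER_DAY = 30_000          # new patients per day (from the text)
CONTACTS_PER_PATIENT = 5_000           # unique observed EphIDs uploaded per patient
DAILY_DOWNLOAD_BYTES = 600 * 10**6     # quoted lower bound, assuming 1 MB = 10^6 bytes
ROTATION_MINUTES = 15                  # EphID rotation interval

# Records every app would have to download each day.
records_per_day = NEW_PATIENTS_PER_DAY * CONTACTS_PER_PATIENT

# Storage budget per record that the 600 MB figure implies.
bits_per_record = DAILY_DOWNLOAD_BYTES * 8 / records_per_day

# EphIDs a single nearby device emits in 24 hours.
ephids_per_day = 24 * 60 // ROTATION_MINUTES

print(records_per_day)   # 150000000
print(bits_per_record)   # 32.0
print(ephids_per_day)    # 96
```

In other words, the 600 MB lower bound corresponds to roughly 4 bytes per record after compression.
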

Sending observed contacts also increases the likelihood that a tech-savvy user
creates fake contact events, which in turn can lead to unnecessary anxiety. To
fake at-risk contacts, an infected user simply inserts additional EphIDs from
other users into their local storage. In DP-3T, in which an infected user shares
their own EphIDs, the barrier to faking contact events is much higher. The infected
user would have to actively broadcast their own EphIDs via another device to fake
contacts with other users.

### P2: Why don't infected users upload the ephemeral Bluetooth identifiers (`EphIDs`) they have observed to the backend server, so that other apps can ask the server if there is a match with their own `EphIDs`?

**Short answer:** This results in a high load on the server and either reveals
privacy-sensitive information to the server, or requires anonymous
communication.

**Long answer:** In this solution, rather than apps downloading a list of all
EphIDs observed by infected patients, they would instead query the backend
server with their own EphIDs to ask if any of them has been in contact with an
infected patient. The consequence is a significant increase in bandwidth usage.
In particular, each app must query daily all the EphIDs that it broadcast
in the last 21 days (as newly diagnosed patients might have seen these in the
past), which is estimated at approximately 2,000 EphIDs per user per day.

For privacy reasons, it is essential that the server cannot link all EphIDs of a
single user. Therefore, users must query their EphIDs separately and via an
anonymous communication network so that their identifiers remain unlinkable. For
50 million users, the server must therefore be able to process more than a
million lookup queries per second.
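
As a rough sanity check of the query load, the sketch below recomputes both numbers from the parameters stated above (15-minute rotation, 21 days, 50 million users). It assumes queries are spread evenly over the day, which is an optimistic simplification.

```python
# Rough server load if every app looks up each of its own EphIDs once per day.
ROTATION_MINUTES = 15
RETENTION_DAYS = 21
USERS = 50_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# EphIDs one user has broadcast over the last 21 days.
ephids_per_user = (24 * 60 // ROTATION_MINUTES) * RETENTION_DAYS

# Individual (unlinkable) lookups the server must answer per second,
# assuming an even spread over the day.
queries_per_second = USERS * ephids_per_user // SECONDS_PER_DAY

print(ephids_per_user)      # 2016
print(queries_per_second)   # 1166666
```
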

### P3: Why not use multi-party computation or custom privacy-preserving protocols (PSI, PIR, etc.) instead to query the server for the observed ephemeral Bluetooth identifiers?

We all love privacy-preserving cryptography. However, the scale at which this
system must operate is significant: a server set size of 150 million entries of
16 bytes each (corresponding to 30k new infections a day and 5,000 distinct
recorded EphIDs per infection), a client set of 2,000 items, and 50 million daily queries
(>500 queries per second).
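
To put these parameters in concrete terms, the short calculation below derives the raw size of the server set and the average query rate. It uses only the numbers quoted above and assumes queries arrive evenly over the day.

```python
# Scale that a PSI/PIR-style protocol would have to handle.
SERVER_ENTRIES = 30_000 * 5_000    # new infections per day x recorded EphIDs each
ENTRY_BYTES = 16                   # size of one entry
CLIENT_SET_SIZE = 2_000            # EphIDs each client wants to test
DAILY_QUERIES = 50_000_000         # one protocol run per user per day

server_set_bytes = SERVER_ENTRIES * ENTRY_BYTES
average_queries_per_second = DAILY_QUERIES / (24 * 60 * 60)

print(SERVER_ENTRIES)                      # 150000000
print(server_set_bytes)                    # 2400000000
print(round(average_queries_per_second))   # 579
```

Each of those roughly 579 protocol runs per second would involve a 2.4 GB server set and a 2,000-item client set.
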

It might be possible to design and deploy special-purpose cryptographic
techniques that scale to this level, and we are aware of research prototypes that
might be able to fulfil the requirements and for which code might be available.
However, a significant investment of time and engineering effort would still be
needed to take such prototypes and develop them to the point where they could be
deployed in a mobile application.
### P4: Why is the system not using public key cryptography when broadcasting identifiers?
First, in DP-3T any device must communicate with all of its neighbours, meaning that
authentication is impossible. Thus, a malicious party can inject their own
traffic and hence participate in any exchange.

Secondly, any application of public key cryptography would require a connection
between devices or multiple broadcasts (each broadcast is limited to only 31
bytes and the smallest public keys are around 32 bytes). In a crowded
environment there is substantial message loss from interference between
messages. It is unlikely that performing N^2 connections or exchanges between N
apps would function effectively, in contrast to N broadcasts in the current
protocol.
### P5: Why not use mixnets or other anonymous communication systems to query the server?
Our design uses a small number of dummy messages to provide traffic analysis protection for uploads to the backend and epidemiologists with respect to network adversaries. The use of a mixnet, Tor or another anonymous communication system would in addition conceal the IP addresses of users submitting reports from the backend.

We considered using an anonymous communication system. However, we decided against doing so for the following reasons:

1. Relying on any form of anonymous communication system increases the
   complexity of the system, both in terms of integrating anonymous
   communication into the app and in terms of the server infrastructure needed to
   support tens of millions of apps. (Even Tor, the most widely deployed
   anonymous communication network, would struggle under this load.)
2. All anonymous communication systems must trade off anonymity, latency, and
   bandwidth overhead. It is not clear what the right trade-off is in this
   scenario.
3. We need to take the security properties of the anonymous communication
   system into account in our analysis. E.g., should we protect against a
   global passive adversary or not? How well does the system protect against
   intersection attacks?

In future versions of the app, if an appropriate anonymous communication network appears, we may include the option of submitting data anonymously to the backend.

### P6: Why do infected people upload a seed (which enables recreating `EphIDs`) instead of their individual EphIDs ?
This is a choice that is made purely for performance reasons. It is much more
efficient to send a single 32-byte seed than to send all EphIDs generated during
the infectious period (e.g., 21 days). We are aware that this makes the EphIDs
of infected patients linkable during the infectious period.

For comparison, sending 21 days of EphIDs rotated every 15 minutes requires
sending 32 kB per infected patient. Even when compressing these EphIDs in a
cuckoo filter, we'd need around 8 kB per infected patient. So smartphones would need
to download at least 2 orders of magnitude more (e.g., for 30k infected a day:
from around 1 MB to 230 MB per day).
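
The comparison can be reproduced as follows. This is a sketch under stated assumptions: the 16-byte EphID size is inferred from the 32 kB figure (and matches the entry size used in P3), and "around 8 kB" is taken as 8,000 bytes.

```python
# Daily download size: one seed per patient vs. all of that patient's EphIDs.
PATIENTS_PER_DAY = 30_000
SEED_BYTES = 32
EPHID_BYTES = 16                 # assumption: inferred from the 32 kB figure
EPHIDS_PER_PATIENT = 21 * 96     # 21 days, one EphID every 15 minutes
FILTER_BYTES_PER_PATIENT = 8_000 # "around 8 kB" in a cuckoo filter

raw_bytes_per_patient = EPHIDS_PER_PATIENT * EPHID_BYTES
seed_download = PATIENTS_PER_DAY * SEED_BYTES
filter_download = PATIENTS_PER_DAY * FILTER_BYTES_PER_PATIENT
ratio = filter_download // seed_download

print(raw_bytes_per_patient)             # 32256
print(seed_download)                     # 960000
print(filter_download)                   # 240000000
print(round(filter_download / 2**20))    # 229
print(ratio)                             # 250
```

The seed-based download is just under 1 MB, while the cuckoo-filter variant is 240 MB in decimal units, or about 229 MiB, in line with the roughly 230 MB quoted. The ratio of 250 is the "at least 2 orders of magnitude" mentioned above.
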

We are working on alternatives and have explored: sending all generated EphIDs in a
cuckoo filter, using a hierarchical structure (keys per day, keys per 4 hours,
etc.) so that users can do a more granular release, and smaller-region versions
of those (e.g., work per state rather than country to lower communication cost).
All of these have downsides, either in computation, bandwidth or
interoperability/leakage cost. We're trying to find a good middle ground. If you
have an idea that we did not yet list, please do reach out to us!

### P7: Why do you call your design "decentralized" while having a backend?
We call our design decentralized because there is no central point of trust for
security and privacy. All critical operations (creating EphIDs and matching
observations) are done locally on each phone. The backend server is only needed
to ensure availability. However, it does not maintain any secrets. Attackers do
not gain anything by compromising the backend. All privacy-sensitive information
is decentralized, and stored on individuals' phones.
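
To illustrate what "locally" means here, the sketch below shows a phone regenerating EphIDs from published seeds and comparing them against its own observations. The derivation (HMAC-SHA256 over a counter, truncated to 16 bytes) is a hypothetical stand-in chosen for brevity, not the derivation specified by DP-3T, and the function names `derive_ephids` and `find_matches` are illustrative.

```python
import hashlib
import hmac

EPHID_BYTES = 16       # assumption: EphID size, as in P3
EPHIDS_PER_DAY = 96    # one EphID every 15 minutes

def derive_ephids(seed: bytes, count: int) -> list[bytes]:
    """Hypothetical derivation: HMAC-SHA256(seed, counter), truncated."""
    return [
        hmac.new(seed, i.to_bytes(4, "big"), hashlib.sha256).digest()[:EPHID_BYTES]
        for i in range(count)
    ]

def find_matches(published_seeds: list[bytes], observed: set[bytes],
                 count: int) -> set[bytes]:
    """Runs on the phone: expand each published seed and intersect the
    result with the locally stored observations."""
    matches: set[bytes] = set()
    for seed in published_seeds:
        matches.update(e for e in derive_ephids(seed, count) if e in observed)
    return matches

# Example: the phone heard one EphID from an infected user and one unrelated value.
infected_seed = bytes(range(32))
other_seed = bytes(range(32, 64))
heard = derive_ephids(infected_seed, EPHIDS_PER_DAY)[10]
observed = {heard, b"\x00" * EPHID_BYTES}

matches = find_matches([infected_seed], observed, EPHIDS_PER_DAY)
no_matches = find_matches([other_seed], observed, EPHIDS_PER_DAY)
```

In this sketch the backend only ever distributes `published_seeds`; the set `observed` never leaves the phone, and the match result is computed without contacting the server.
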