Safe storage of data on ledgers

One of the topics under research in the PRIViLEDGE project is the safe storage of sensitive data on ledgers. This post will give some insights from the conducted research. 

The first section explains why it is generally considered good practice to avoid storing bulk data on ledgers. The second examines the types of data stored on ledgers today and the ways larger data can be kept off-ledger.

Avoid Storing (Large) Data on Ledgers

One of the main reasons for this advice is to keep the ledger small. The smaller the ledger, the more network nodes are able to store it in full and vouch for its validity. This matters because these nodes, called full nodes, are responsible for the validity of the ledger, so their number should be large enough to avoid centralization. To give a sense of scale, at the time of writing the Bitcoin blockchain is about 252 GB and the Ethereum blockchain is about 13 TB.

Apart from the size of the whole blockchain, the size of blocks may be a limiting factor. For example, the block size for Bitcoin is 1 MB, which is enough to store information about transactions but not enough for bigger files.

Another reason to avoid storing data on the blockchain itself is to enhance the privacy and security of the stored data. Even if the data are not stored on the ledger in plaintext but in encrypted or hashed form, adversaries may find ways to violate privacy. For example, storing the hash value of a university diploma on a ledger can be dangerous: an adversary could take the document template used by the university, guess the plaintext, compute its hash value, and then check whether it is identical to the one stored on the ledger.
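The guessing attack described above can be illustrated with a minimal Python sketch. The diploma template, the names, and the candidate lists are all hypothetical and chosen only for illustration; the sketch also assumes the ledger stores a plain, unsalted SHA-256 hash of the document text.

```python
import hashlib
from itertools import product


def diploma_text(name, degree, year):
    # Hypothetical template: everything except the three fields is fixed
    # and assumed to be publicly known.
    return (
        "University of Example\n"
        f"This certifies that {name} was awarded the degree of {degree} in {year}."
    )


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The value that would be visible to everyone on the ledger.
on_ledger = sha256_hex(diploma_text("Alice Example", "BSc Computer Science", 2019))


def guess_diploma(target_hash, names, degrees, years):
    # Fill the template with every candidate combination and compare hashes.
    for name, degree, year in product(names, degrees, years):
        if sha256_hex(diploma_text(name, degree, year)) == target_hash:
            return name, degree, year
    return None


match = guess_diploma(
    on_ledger,
    names=["Bob Example", "Alice Example"],
    degrees=["BSc Physics", "BSc Computer Science"],
    years=range(2015, 2022),
)
print(match)
```

Because the template fixes most of the text, the adversary only has to enumerate the few variable fields, which is a far smaller search space than the hash output itself.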

Ways to Store Data Off-Ledger

Today, most systems store transactions, commitments, and metadata on ledgers. For larger files, and for the reasons outlined above, the most common strategy is to store on the ledger only a small amount of data that links to the actual data, which reside elsewhere, i.e., in an external database that may be centralized or decentralized.

A simplified scenario that combines a ledger with an external database is shown in the following picture. When a decentralized application (Dapp) is searching for a resource (1), it interacts with the ledger and gets the link to the corresponding data (2). After obtaining the link, the Dapp uses it to reach the place where the requested resource is stored (3). Finally, the data are returned to the Dapp (4).
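The four steps can be sketched in Python as follows. This is a toy model under stated assumptions: the ledger and the external database are plain in-memory dictionaries, the names `publish` and `fetch` are hypothetical, and storing a SHA-256 digest next to the link is one possible form of commitment rather than something the scenario above prescribes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    link: str    # where the actual data live
    sha256: str  # digest of the data, used here as a simple commitment


ledger = {}       # resource id -> LedgerEntry (stands in for the ledger)
external_db = {}  # link -> bytes (stands in for the external database)


def publish(resource_id, link, data):
    # Bulk data go off-ledger; only the link and a digest go on the ledger.
    external_db[link] = data
    ledger[resource_id] = LedgerEntry(link, hashlib.sha256(data).hexdigest())


def fetch(resource_id):
    entry = ledger[resource_id]     # (1)-(2): ask the ledger, get the link
    data = external_db[entry.link]  # (3): follow the link to the storage
    if hashlib.sha256(data).hexdigest() != entry.sha256:
        raise ValueError("off-ledger data do not match the ledger entry")
    return data                     # (4): data returned to the Dapp


publish("report-1", "db://files/report-1", b"large file contents")
print(fetch("report-1"))
```

Keeping the digest on the ledger lets the Dapp detect whether the off-ledger copy was changed after publication, while the ledger entry itself stays small regardless of the file size.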


Centralized databases offer benefits such as high speed, low latency, high availability, and high throughput, but they also have disadvantages, such as exposure to security breaches and privacy concerns. Moreover, users have no control over their personal data when they are stored centrally in a database operated by others.

One way to overcome the drawbacks of centralized databases and achieve enhanced privacy for the stored data is to use a decentralized database.

A well-known project whose primary goal is to let users own their data and decide where they are stored and who has access to them is Solid. In Solid, users’ data are separated from the applications that use them and do not have to be saved on servers run by each application. Users choose a place to store their data, which is called a Solid Pod. A Pod is a server that runs Solid software; it can be hosted either by the users themselves or by a provider of their choice. Every user has a unique identifier, called a WebID, and users can define in their Pods which WebIDs have access to specific files. A WebID can be used to form a URL that denotes a user and points to the place where the user’s data can be retrieved. In this way, users can store all their data on a server of their choice and control which applications can read them.

A simplified scenario that combines a ledger with Solid storage could use the ledger as an index, as shown in the following figure. When a decentralized application (Dapp) is searching for a resource (1), it could interact with the ledger and find the URL where more information about the resource can be found (2). After obtaining the URL, the Dapp could perform an HTTP GET request to that URL to retrieve the data (3). If the Dapp has been granted read access to the requested file, the data will be returned (4).
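The access check in step (4) can be modelled with a small in-memory simulation. This sketch does not use the Solid protocol or any Solid library: the `Pod` class, the `PermissionDenied` exception, the URLs, and the WebIDs are hypothetical, and the HTTP GET request and the authentication of the Dapp are replaced by a plain method call.

```python
class PermissionDenied(Exception):
    pass


class Pod:
    """Toy stand-in for a Solid Pod: files plus a per-file set of reader WebIDs."""

    def __init__(self):
        self._files = {}
        self._readers = {}

    def put(self, url, data, readers):
        self._files[url] = data
        self._readers[url] = set(readers)

    def get(self, url, requester_webid):
        # Stands in for an authenticated HTTP GET on the resource URL.
        if requester_webid not in self._readers.get(url, set()):
            raise PermissionDenied(f"{requester_webid} may not read {url}")
        return self._files[url]


# The ledger only maps a resource id to the URL of the file in the Pod.
ledger_index = {"diploma-42": "https://alice.pod.example/diplomas/42"}

pod = Pod()
pod.put(
    "https://alice.pod.example/diplomas/42",
    b"diploma contents",
    readers={"https://verifier.example/profile#app"},
)


def dapp_lookup(resource_id, dapp_webid):
    url = ledger_index[resource_id]  # (1)-(2): find the URL on the ledger
    return pod.get(url, dapp_webid)  # (3)-(4): request it from the Pod


print(dapp_lookup("diploma-42", "https://verifier.example/profile#app"))
```

In this model the ledger reveals only where a resource lives; whether a given application can actually read it is decided in the Pod, under the control of the data owner.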



Another type of decentralized database uses end-to-end encryption and sharding to preserve the privacy of users’ data. More specifically, to store data, the data are first divided into chunks, the chunks are then encrypted, and finally they are distributed among many different nodes; users may choose the nodes on which their files are stored. To retrieve data, users must use their private keys for decryption. Distributed tables with references are used to track the pieces of the data for retrieval.
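The chunk, encrypt, distribute, and retrieve steps can be sketched as below. Several simplifications are assumptions of this sketch and not a description of any of the systems mentioned in this post: the nodes are in-memory dictionaries, a single symmetric key replaces the users' private keys, the reference table is a local list, and `keystream_xor` is a toy cipher built from SHA-256 that is not suitable for real use (a real system would use a vetted authenticated cipher).

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so that the example produces several chunks


def keystream_xor(key, nonce, data):
    # Toy stream cipher, for illustration only: XOR the data with a keystream
    # derived from SHA-256. Applying it twice with the same key and nonce
    # restores the original bytes.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = key + nonce.to_bytes(8, "big") + counter.to_bytes(8, "big")
        stream.extend(hashlib.sha256(block).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def store(data, key, nodes):
    # Split into chunks, encrypt each one, spread them over the nodes, and
    # record where every piece went in a reference table.
    table = []
    node_names = sorted(nodes)
    for offset in range(0, len(data), CHUNK_SIZE):
        index = offset // CHUNK_SIZE
        encrypted = keystream_xor(key, index, data[offset:offset + CHUNK_SIZE])
        chunk_id = hashlib.sha256(encrypted).hexdigest()
        node = node_names[index % len(node_names)]
        nodes[node][chunk_id] = encrypted
        table.append((node, chunk_id))
    return table


def retrieve(table, key, nodes):
    # Follow the reference table, decrypt each chunk, and reassemble the data.
    parts = []
    for index, (node, chunk_id) in enumerate(table):
        parts.append(keystream_xor(key, index, nodes[node][chunk_id]))
    return b"".join(parts)


nodes = {"node-a": {}, "node-b": {}, "node-c": {}}
secret = b"sensitive record"
table = store(secret, b"demo key", nodes)
print(retrieve(table, b"demo key", nodes))
```

Each node holds only encrypted fragments, so a single node sees neither the whole file nor its plaintext; the reference table and the key are what make reassembly possible.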

Examples of such decentralized storage systems today are IPFS, Swarm, Storj, MaidSafe (Safe Network), and Sia. Further reading for those interested in related projects is provided in the links:


Written by Athina Styliani Kleinaki, GRNET.