The Evolution of File Handling - What I Have Learned and When to Use S3
Storing files and images has come a long way. I went through every level of storing and serving files in applications myself, and here's what I have learned.
It's important to ask yourself: "How are the users of my application going to interact with files?" That tells you which kind of file handling serves you best. We will talk about specific cases later; first, let's dive into the various methods you can use to handle and serve files to your users, and their respective upsides and downsides.
1. The Database
When a developer is starting out, their first instinct is: "Everything that needs to be persisted has to go into a database." That is technically correct most of the time, and in this case it actually works. You can convert your files to base64, or store them directly as binary in a database table if your database supports binary columns. Afterwards you can retrieve those images and serve them to your users.
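Here's a minimal sketch of that approach using SQLite; the avatars table, its columns, and the function names are made up for illustration:

```python
import sqlite3

# Minimal sketch: storing and retrieving a file as a BLOB in SQLite.
# The table and column names are illustrative, not from a real schema.
conn = sqlite3.connect("app.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS avatars (user_id INTEGER PRIMARY KEY, image BLOB)"
)

def save_avatar(user_id: int, path: str) -> None:
    with open(path, "rb") as f:
        data = f.read()  # raw binary; no base64 needed for a BLOB column
    conn.execute(
        "INSERT OR REPLACE INTO avatars (user_id, image) VALUES (?, ?)",
        (user_id, data),
    )
    conn.commit()

def load_avatar(user_id: int) -> bytes | None:
    row = conn.execute(
        "SELECT image FROM avatars WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None
```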
The benefit of this method is that you don't have to care much about file handling, and your files are automatically included when you back up your database (which I hope you are doing regularly, by the way). It's also more portable, since you only have to care about your database instead of a database plus a separate file storage.
This, however, is also one of the downsides. Imagine you are doing a backup once every 24 hours. Once you have 10 GB of files hosted inside your database, your backups take longer, and your backup server hits its limits faster, or you have to pay for more storage. Have you ever tried restoring a very large, old database backup to production? It sucks. Additionally, performance lags behind, since you have to pull large amounts of data into your backend. Are your files larger than 256 KB? Then file storage is more performant. Jim Gray wrote about this exact issue in "To BLOB or Not To BLOB".
If you are only saving users' profile pictures, just use the database; it's the most convenient option in this case.
2. The File System
Okay, so let's say you have a system where users can upload files in their dashboards. For this case it's not wise to use a database, as we learned above, since we are dealing with many large files. Instead, you could just use your server's file system, because that's where files are stored, right? We all learned in our first years how to interact with the file system from code, so we build an endpoint in our service and two functions to save and retrieve files.
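A rough sketch of what those two functions could look like, assuming a fixed upload folder (the path and names here are hypothetical):

```python
import os
import uuid
from pathlib import Path

# Sketch: save/retrieve files on the local file system.
# STORAGE_DIR is an assumed location; adjust it to your setup.
STORAGE_DIR = Path("/var/app/uploads")
STORAGE_DIR.mkdir(parents=True, exist_ok=True)

def save_file(original_name: str, data: bytes) -> str:
    # Generate a unique name so the database only has to store this pointer.
    stored_name = f"{uuid.uuid4().hex}_{os.path.basename(original_name)}"
    (STORAGE_DIR / stored_name).write_bytes(data)
    return stored_name

def retrieve_file(stored_name: str) -> bytes:
    # basename() guards against path traversal like "../../etc/passwd".
    return (STORAGE_DIR / os.path.basename(stored_name)).read_bytes()
```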
The upside of this method is that we keep a clean database containing just pointers to the file names, and we have solved all of the database's downsides.
The downsides, however, are a bit more serious. Let's start with backups: you would have to write some kind of script that packs all files in the storage folder into an archive, like a zip, and moves it somewhere safe. Now imagine you want to scale your backend. Usually you are running your service in containers, and for every container to have access to the folder you store your images in, you would have to mount or link it into each of them. You see where this is going; it takes a fair bit of work to make it all run smoothly.
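For reference, the backup part of that chore can be a script as small as this sketch (both paths are assumptions):

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Sketch: pack the upload folder into a timestamped zip archive.
# Both paths are placeholder values for illustration.
UPLOADS = "/var/app/uploads"
BACKUP_DIR = Path("/mnt/backups")
BACKUP_DIR.mkdir(parents=True, exist_ok=True)

stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
archive = shutil.make_archive(str(BACKUP_DIR / f"uploads-{stamp}"), "zip", UPLOADS)
print(f"Backup written to {archive}")  # then ship this file somewhere safe
```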
3. The Encapsulated SFTP User
To solve the issues mentioned above, we are simply going to use a separate machine to store and retrieve files from. We will also go as far as using SFTP instead of FTP, and a dedicated user who has read and write access only to the file storage folder on that machine, to improve our security. We then write two functions in our backend code that open an SFTP connection to store and retrieve files using that user's credentials.
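With a library like Paramiko, the two functions might look like this sketch; the host, credentials, and remote folder are placeholder values:

```python
import paramiko

# Sketch: store/retrieve files over SFTP with a restricted user.
# Host, credentials, and remote folder are placeholders.
HOST, PORT = "files.example.com", 22
USER, PASSWORD = "filestore", "secret"
REMOTE_DIR = "/srv/filestorage"

def _sftp():
    transport = paramiko.Transport((HOST, PORT))
    transport.connect(username=USER, password=PASSWORD)
    return transport, paramiko.SFTPClient.from_transport(transport)

def store_file(local_path: str, name: str) -> None:
    transport, sftp = _sftp()
    try:
        sftp.put(local_path, f"{REMOTE_DIR}/{name}")
    finally:
        sftp.close()
        transport.close()

def retrieve_file(name: str, local_path: str) -> None:
    transport, sftp = _sftp()
    try:
        sftp.get(f"{REMOTE_DIR}/{name}", local_path)
    finally:
        sftp.close()
        transport.close()
```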
The upsides are that backups become easier, since we have a dedicated machine for our files, and, more importantly, we can now scale without any weird issues. We can also run an nginx on a separate folder, if we want to, to serve public files directly via a simple GET.
The big downside, however, is that it's just a great hassle to set up. No one wants to go through all of the steps needed to build the solution above. I did, though, because I started from the very beginning and worked my way through all of these methods back in the day. Scaling and providing fast streams for different regions are also going to be a challenge, though one you will only face once you are playing in the big leagues.
4. S3 & MinIO
Now we are going to play with the big boys. The industry standard here is an object store like S3 or MinIO, alongside other high-performance file storage services. Usually you just deploy the container and it's ready to go. In MinIO, for example, you can create a bucket with a custom policy that makes the files inside the bucket publicly accessible while keeping the bucket listing private. Put a web server in front of it and voilĂ , you have a simple public image file server.
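Here's a sketch of that bucket setup with the MinIO Python client; the endpoint, credentials, and bucket name are placeholders. The trick is a policy that allows s3:GetObject on objects but deliberately leaves out s3:ListBucket, so files are fetchable by URL while the index stays hidden:

```python
import json
from minio import Minio

# Sketch: a bucket whose objects are publicly readable, but whose
# contents cannot be listed. Endpoint and credentials are placeholders.
client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

bucket = "public-images"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Grant anonymous GetObject, but no ListBucket: individual files are
# reachable by URL, the directory index stays hidden.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ["*"]},
        "Action": ["s3:GetObject"],
        "Resource": [f"arn:aws:s3:::{bucket}/*"],
    }],
}
client.set_bucket_policy(bucket, json.dumps(policy))

# Upload a file; it's now served at http://localhost:9000/public-images/logo.png
client.fput_object(bucket, "logo.png", "logo.png")
```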
There are plenty of benefits: easy scalability, lots of integrations for your backend, a UI dashboard for easy configuration and for browsing files directly, and simpler backups. You can self-host the container or use a managed offering such as Amazon S3 that does the scaling for you. Overall it's easier to handle and solves all of the issues we ran into above.
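Against Amazon S3 itself the backend integration looks almost the same, e.g. with boto3 (the bucket name is a placeholder; credentials come from your AWS configuration):

```python
import boto3

# Sketch: the same store/retrieve pair against Amazon S3 via boto3.
# The bucket name is a placeholder.
s3 = boto3.client("s3")
BUCKET = "my-app-files"

def store_file(local_path: str, key: str) -> None:
    s3.upload_file(local_path, BUCKET, key)

def retrieve_file(key: str, local_path: str) -> None:
    s3.download_file(BUCKET, key, local_path)
```

Notice how much shorter this is than the SFTP version: connection handling, retries, and scaling are the service's problem now, not yours.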
Thank you!
You did it: you have learned how the industry handles files and how we handled them back in the day. I hope you enjoyed reading! Want to read more like this? Just subscribe with your email and you will be notified. I don't write much, so it won't spam you, I promise!