The capacity to store digital data has nearly doubled every 40 months since the 1980s. According to the International Data Corporation, there will be 163 zettabytes of data by 2025.
Big data is a field that deals with such massive amounts of data. It involves analyzing or systematically extracting information from datasets that are too complex or large to be handled by conventional data-processing application software.
With databases growing steadily in activities such as e-learning, online shopping, video content, and social networking, managing vast amounts of information has become a major challenge. The challenges include transferring and storing data, querying, visualization, and updating information.
There have been several different forms of big data repositories, usually developed by tech giants for specific requirements. If you are working on a project that requires a giant database for machine learning or testing purposes, we’ve got you covered.
We’ve gathered both open-source and private databases, each designed for a particular environment. They are all true workhorses of the big data field, and each of them makes it feasible to mine and learn more about big data.
13. FlockDB
Type: Graph database
FlockDB is an open-source distributed database for storing adjacency lists. It is designed to support high rates of add/update/delete operations and to perform complex set arithmetic queries.
Unlike other graph databases, FlockDB tries to solve fewer problems. It does not attempt multi-hop traversals (graph walking). Instead, it targets wide but shallow network graphs, which makes it a good fit for online, low-latency web applications.
Initially, Twitter used it to store relationships between users (following and favorites) and secondary indices. In 2010, the Twitter FlockDB had more than 13 billion edges and sustained peak traffic of 100,000 reads/second and 20,000 writes/second.
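To make the adjacency-list idea concrete, here is a minimal in-memory sketch in Python. It is not FlockDB's API: the `AdjacencyStore` class, its method names, and the sample users are all hypothetical, and the real system is distributed and persistent. The sketch only shows how storing each edge in both directions keeps follower lookups and set arithmetic cheap.

```python
from collections import defaultdict


class AdjacencyStore:
    """Toy in-memory edge store (illustrative only, not FlockDB's API)."""

    def __init__(self):
        # Keep forward and backward lists so both directions are cheap to read.
        self._forward = defaultdict(set)   # source -> destinations
        self._backward = defaultdict(set)  # destination -> sources

    def add_edge(self, source, destination):
        self._forward[source].add(destination)
        self._backward[destination].add(source)

    def remove_edge(self, source, destination):
        self._forward[source].discard(destination)
        self._backward[destination].discard(source)

    def following(self, user):
        return set(self._forward[user])

    def followers(self, user):
        return set(self._backward[user])


store = AdjacencyStore()
store.add_edge("alice", "bob")
store.add_edge("alice", "carol")
store.add_edge("dave", "bob")
store.add_edge("dave", "carol")
store.add_edge("bob", "alice")

# Set arithmetic: accounts that both alice and dave follow.
common = store.following("alice") & store.following("dave")
# Set arithmetic: accounts alice follows that do not follow her back.
not_mutual = store.following("alice") - store.followers("alice")
```

Every query here touches only the lists of the users named in it, which is the single-hop style of lookup this kind of store is built for.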
12. Apache CouchDB
Type: Document-oriented NoSQL database
CouchDB is an open-source database that embraces the web. It allows you to store data with JSON documents and access them with your web browser through HTTP. It features ACID semantics with eventual consistency and distributed architecture with replication.
Each database contains a group of independent documents, and every single document maintains its own data and self-contained schema. CouchDB also uses multi-version concurrency control (MVCC), so it doesn’t lock database files during writes.
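As a rough illustration of storing a JSON document over HTTP, the Python sketch below builds a `PUT /{database}/{document id}` request without sending it. The helper name `build_put_document`, the database name, and the document contents are made up for the example, and the URL assumes a local server on CouchDB's default port 5984.

```python
import json
import urllib.parse
import urllib.request


def build_put_document(base_url, database, doc_id, document, rev=None):
    """Build (but do not send) a PUT request that stores one JSON document."""
    database_part = urllib.parse.quote(database, safe="")
    doc_part = urllib.parse.quote(doc_id, safe="")
    url = f"{base_url.rstrip('/')}/{database_part}/{doc_part}"
    if rev is not None:
        # An update names the revision it is based on, so a stale writer
        # gets a conflict instead of silently overwriting newer data.
        url += "?" + urllib.parse.urlencode({"rev": rev})
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical database and document, assuming a server on localhost.
request = build_put_document(
    "http://localhost:5984",
    "articles",
    "big-data-intro",
    {"title": "Big data", "tags": ["nosql", "storage"]},
)
# urllib.request.urlopen(request) would send it to a running server.
```

Passing `rev` when updating is where multi-version concurrency control becomes visible to the client: writers do not wait on locks, and a write based on an outdated revision is simply rejected.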
11. OrientDB
Type: NoSQL, multi-model database
Written in Java, OrientDB is an open-source, multi-model database supporting document, graph, key/value, and object models. Its developers describe it as the first NoSQL DBMS to bring together the flexibility of documents and the power of graphs in one high-performance, scalable database.
The database has been designed to be very fast. It can store up to 220,000 records per second on commodity hardware. Users can traverse thousands of records in a few milliseconds.
It scales out well on multiple machines: it can hold up to 2^78 records, for a maximum capacity of over 19 quadrillion terabytes of data, on a single server or multiple nodes. OrientDB supports schema-full, schema-mixed, and schema-less modes, and has a strong security profiling system based on users and roles.
10. Wikipedia and Stack Overflow Data
If your project needs a massive amount of textual content, Wikipedia offers free databases containing all available articles. You can use these databases in your personal projects and run SQL queries from the browser using Quarry.
All text content is dual-licensed under the GNU Free Documentation License and the Creative Commons Attribution-ShareAlike License. Audio, videos, and pictures are available under different terms.
The anonymized dump of all user-contributed content on the Stack Exchange Network is also available on the Internet Archive. Every site has a separate archive containing zipped XML files. Each archive includes Users, Posts, Comments, PostLinks, Votes, and PostHistory.
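Because these XML files can be very large, it helps to stream them rather than load them whole. The Python sketch below does that with the standard library. The sample data, the function name, and the assumptions that each record is a `row` element with `Id`, `PostTypeId`, and `Title` attributes and that `PostTypeId="1"` marks a question are for illustration only; check them against the schema notes that ship with the dump.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a Posts.xml file; real dumps are far larger.
SAMPLE_POSTS = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="12" Title="What is big data?" />
  <row Id="2" PostTypeId="2" Score="7" />
  <row Id="3" PostTypeId="1" Score="3" Title="HBase or Cassandra?" />
</posts>"""


def iter_question_titles(xml_file):
    """Yield (id, title) for question rows, reading one element at a time."""
    for _event, element in ET.iterparse(xml_file, events=("end",)):
        if element.tag == "row" and element.get("PostTypeId") == "1":
            yield int(element.get("Id")), element.get("Title")
        # Drop the attributes of the element we just handled to save memory.
        element.clear()


titles = list(iter_question_titles(io.BytesIO(SAMPLE_POSTS)))
```

For a real archive, you would pass an opened `Posts.xml` file instead of the in-memory sample.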
9. Yahoo Webscope Program
Type: Different projects have different forms of datasets
The Yahoo Webscope Program is an initiative for academic researchers. It contains various scientifically useful datasets for non-commercial use by faculty members, research employees, and students from an accredited university.
The program provides datasets in six different categories:
- Computing systems data
- Language data
- Competition data
- Graph and social data
- Rating and classification data
- Advertising and market data
Each database has been reviewed to comply with Yahoo’s privacy and data protection standards.
8. Neo4j
Type: Graph database
Neo4j is a native graph database, created from scratch to leverage both data and data relationships. Unlike conventional databases that put data in rows and columns, Neo4j has a flexible structure established by stored relationships between data records.
Everything is stored in the form of a node, edge, or attribute. Individual nodes and edges can have any number of attributes. Both nodes and edges can be labeled to narrow searches. Connections between data are stored rather than computed at query time.
While Neo4j is implemented in Java, it can be accessed from programs written in other languages through Cypher, a declarative graph query language. Cypher queries are often simpler and easier to write than large SQL JOINs. Since this database does not have any tables, there is no need to worry about JOINs.
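The idea of stored relationships can be sketched in a few lines of Python. This is a toy property graph, not Neo4j's storage engine or driver API: the `Node` and `Relationship` classes, the `KNOWS` relationship type, and the sample people are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str
    properties: dict = field(default_factory=dict)
    # Relationships live on the node, so a traversal follows references
    # instead of searching for matching keys.
    outgoing: list = field(default_factory=list)


@dataclass
class Relationship:
    rel_type: str
    target: Node
    properties: dict = field(default_factory=dict)


def connect(source, rel_type, target, **properties):
    source.outgoing.append(Relationship(rel_type, target, properties))


def neighbors(node, rel_type):
    return [rel.target for rel in node.outgoing if rel.rel_type == rel_type]


alice = Node(1, "Person", {"name": "Alice"})
bob = Node(2, "Person", {"name": "Bob"})
carol = Node(3, "Person", {"name": "Carol"})
connect(alice, "KNOWS", bob, since=2015)
connect(bob, "KNOWS", carol, since=2019)

# Friends of friends: two hops along stored relationships, no join involved.
friends_of_friends = [
    fof.properties["name"]
    for friend in neighbors(alice, "KNOWS")
    for fof in neighbors(friend, "KNOWS")
]
```

In a relational schema the same question would typically need a self-join on a friendship table; here each hop is just a walk over the relationships already attached to a node.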
7. Riak
Type: NoSQL key-value data store
Riak is an open-source, distributed database that offers scalability, operational simplicity, fault tolerance, and high availability. It is available for enterprise, cloud, web, and mobile platforms.
The database has two versions: Riak KV (a distributed NoSQL database) and Riak TS (optimized for IoT and time series data). Both integrate with various big data technologies such as Redis Caching, Apache Solr, Apache Spark, and Apache Mesos.
If you need private cloud storage for reasonably large files that can be called directly through a browser, or if you want to test your database via a simple REST interface, you can try Riak.
It is currently being used (or has been used) by many organizations, including The Weather Channel, Comcast, Best Buy, GitHub, AT&T, and the UK National Health Service.
6. RethinkDB
Type: Document-oriented database
RethinkDB is a free and open-source database for the realtime web. It stores JSON documents with dynamic schemas and continuously pushes updated query results to web applications in realtime, significantly reducing the effort and time required to develop scalable realtime apps.
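The push model is easier to picture with a small example. The Python sketch below is an in-memory imitation of the idea, not the database's client API: the `LiveTable` class, its methods, and the `old_val`/`new_val` keys in each notification are chosen purely for illustration.

```python
class LiveTable:
    """Toy table that pushes matching changes to subscribers."""

    def __init__(self):
        self._rows = {}
        self._subscribers = []

    def changes(self, predicate, callback):
        """Register a callback for writes that affect rows matching predicate."""
        self._subscribers.append((predicate, callback))

    def upsert(self, key, row):
        old = self._rows.get(key)
        self._rows[key] = row
        for predicate, callback in self._subscribers:
            # Notify if the row matched the query before or after the write.
            if (old is not None and predicate(old)) or predicate(row):
                callback({"old_val": old, "new_val": row})


received = []
scores = LiveTable()
scores.changes(lambda row: row["score"] >= 100, received.append)

scores.upsert("ann", {"player": "ann", "score": 40})   # below threshold, no push
scores.upsert("ann", {"player": "ann", "score": 120})  # now matches, pushed
scores.upsert("bo", {"player": "bo", "score": 150})    # new matching row, pushed
```

The application never polls: it registers interest once and reacts to whatever arrives, which is the pattern that makes realtime dashboards and feeds simpler to build.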
The database is easy to set up and learn. It provides a flexible query language, monitoring APIs, and intuitive operations. While it is available in the Amazon AWS and Compose.io cloud, users can also deploy it in their own infrastructures with constraints.
It has been used by hundreds of consulting studios, technology startups, and some Fortune 500 companies. Workshape.io and Platzi, for example, use RethinkDB to power realtime analytics; Narrative Clip uses it to power cloud infrastructure for linked devices; Mediafly uses it to power reactive mobile and web apps.
5. ArangoDB
ArangoDB query uses three database features: multi-model, joins, and transactions.
Type: Multi-model database system
ArangoDB is an open-source, scalable database that natively supports document, graph, and search. All supported data models and access patterns can be merged in queries, enabling maximal flexibility.
Although it is a NoSQL database system, the ArangoDB Query Language (AQL) is similar in many ways to SQL. It supports CRUD (create, read, update, delete) operations for both documents (nodes) and edges, as well as geospatial queries.
Other features of ArangoDB include multi-thread capability, command-line and web interface tools for interaction with the server, and different storage engines for handling large (bigger than RAM) datasets.
4. Apache HBase
Type: Non-relational distributed database
HBase is written in Java and modeled after Google’s Bigtable. Developed as a part of the Apache Hadoop project, it is designed to provide quick random access to large amounts of structured data (billions of rows by millions of columns).
It runs on top of the Hadoop Distributed File System (HDFS) and allows the storage of large quantities of sparse data in a fault-tolerant way. This type of database is suitable for heavy applications that require faster read/write operations with high throughput and low input/output latency.
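To see why sparse data is cheap in this kind of store, consider a toy wide-column table in Python. It is a sketch of the data model only, not the HBase client API: the `SparseTable` class, the `family:qualifier` column strings, and the sample rows are hypothetical.

```python
import time
from collections import defaultdict


class SparseTable:
    """Toy wide-column table: only cells that were written take up space."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        ts = time.time_ns() if timestamp is None else timestamp
        self._rows[row_key][column].append((ts, value))

    def get(self, row_key, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def stored_cells(self):
        return sum(len(columns) for columns in self._rows.values())


table = SparseTable()
table.put("user#1", "profile:name", "Ada", timestamp=1)
table.put("user#1", "profile:name", "Ada L.", timestamp=2)
table.put("user#2", "metrics:logins", 17, timestamp=1)

latest_name = table.get("user#1", "profile:name")
missing = table.get("user#2", "profile:name")
```

Two rows and two distinct columns would mean four cells in a fixed-schema table, but only the two cells that were actually written are stored here, and each cell keeps its versions by timestamp.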
Through the companion Apache Phoenix project, it also gains a SQL layer and a Java Database Connectivity (JDBC) driver, which can be integrated with a broad range of business intelligence and analytics programs.
Facebook, Alibaba Group, Netflix, Spotify, Imgur, Adobe, Airbnb, Yahoo, Pinterest, and Xiaomi are some of the major companies that are using or have used HBase.
3. Apache Cassandra
Type: NoSQL database
Initially developed by Facebook (in 2008), Apache Cassandra is an open-source, wide-column store, distributed NoSQL database. It is designed to handle vast amounts of data across several commodity servers and deliver continuous availability (zero downtime) with no single point of failure.
It offers automatic data replication to multiple nodes for fault-tolerance. It replaces failed nodes with almost no downtime and offers low latency operations for all users.
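A simplified way to picture replication across nodes is a hash ring: every node and every key gets a position on the ring, and a key is copied to the next few nodes clockwise from its position. The Python sketch below follows that idea under loud assumptions. The `Ring` class, the node names, and the use of MD5 are illustrative only, and Cassandra's own partitioner and replication strategies differ in detail.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a ring position (MD5 is used purely for illustration)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds node count")
        self.replication_factor = replication_factor
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [position for position, _node in self._ring]

    def replicas(self, key):
        """Walk clockwise from the key's token and take the next nodes."""
        start = bisect.bisect_right(self._tokens, token(key))
        return [
            self._ring[(start + offset) % len(self._ring)][1]
            for offset in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], replication_factor=3)
owners = ring.replicas("user:42")

# If one replica fails, the remaining copies can still serve the key.
failed = owners[0]
survivors = [node for node in owners if node != failed]
```

Because every node can compute the same placement from the key alone, no central coordinator is needed, which is what removes the single point of failure.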
The database can be monitored and managed through Java Management Extensions (JMX). The nodetool utility, for example, can efficiently manage a Cassandra cluster; administrators can use it to decommission nodes, add nodes to the ring, and drain nodes.
Cassandra is used by more than 30% of the Fortune 100. The following are examples of some large production deployments:
- Apple uses 100,000 Cassandra nodes storing over 10 petabytes of data.
- Netflix uses 2,500 nodes with over 420 terabytes of data and 1 trillion requests per day.
- SoundCloud uses Cassandra to store its users’ dashboards.
- Netflix uses it as a back-end database for their streaming services.
Overall, Cassandra has traditionally been known as an advanced and efficient database that stands up to the most demanding use cases.
2. Oracle NoSQL Database
Type: Distributed key-value database
Oracle NoSQL Database offers transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring. It does not follow the relational SQL model, which depends on tables and predefined data schemas.
In 2018, the company launched a managed cloud service named Oracle Autonomous NoSQL Database Cloud for advanced applications that demand flexible data models, low latency responses, and elastic scaling for dynamic workloads.
It can be integrated with various Oracle products, including Enterprise Manager, Berkeley DB, Fusion Middleware, and Communications Elastic Charging Engine. The database supports Python, Java, Node.js, C, C#, and REST APIs, so that users can focus on developing applications without worrying about back-end software and hardware infrastructure.
Some typical use cases of this database are mobile applications, online advertising, social networks, online gaming, 360-degree customer view, and anomaly detection. According to the company, the cloud service delivers response times of less than 10 milliseconds.
1. MongoDB
Type: Document-oriented database
MongoDB is an open-source NoSQL database written in C++. It stores data in JSON-like documents, a model that is more flexible and expressive than the conventional row/column model.
What are the advantages of rich JSON documents?
- They support arrays and nested objects as values.
- They enable flexible and dynamic schemas.
- Since queries are themselves JSON, they can be easily composed. There is no need to concatenate strings to dynamically create SQL queries.
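The last point is easy to show in code. In the Python sketch below, a filter is assembled as an ordinary dictionary from optional criteria. `$gte` and `$in` are standard MongoDB query operators, while the helper name `build_filter` and the field names (`score`, `tags`, `address.city`) are made up for the example.

```python
def build_filter(min_score=None, tags=None, city=None):
    """Compose a MongoDB-style filter from optional criteria."""
    query = {}
    if min_score is not None:
        query["score"] = {"$gte": min_score}
    if tags:
        # Match documents whose tags array contains any of these values.
        query["tags"] = {"$in": list(tags)}
    if city is not None:
        # Dot notation reaches into a nested object.
        query["address.city"] = city
    return query


# Hypothetical search: score of at least 80, located in Oslo.
query = build_filter(min_score=80, city="Oslo")
```

With a driver such as PyMongo, a dictionary like this would be passed to `collection.find(query)`. Because the filter is plain data, adding or dropping a condition is a dictionary update rather than string surgery.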
This database has official drivers for all popular programming languages and development environments. It also has several community-supported (unofficial) drivers for lesser-known frameworks.
MongoDB is developed for modern applications and the cloud era. It is available as a fully managed service on Google Cloud, Azure, and AWS.
The database is known to be used by many major companies, including IBM, Cisco, HSBC, Uber, Bosch, eBay, Coinbase, and Codecademy.