Intel, one of the world’s leading technology companies, recently made a significant cost-saving move by evaluating their database workloads for better efficiency. Databases have been crucial in managing data for decades, but it’s easy to take them for granted and miss opportunities to optimize their use, especially when it comes to cost. To address this, Intel decided to assess their current massively parallel processing (MPP) relational database management system (RDBMS) and explore alternative solutions.
To begin the evaluation process, Intel’s IT department needed to gain a comprehensive understanding of their database workloads and establish a benchmark that represented those workloads accurately. They had a general idea of the volume of manufacturing data stored in the MPP RDBMS and the number of engineers who queried the data. However, they required more detailed information.
The team at Intel posed vital questions to guide their evaluation process:
1. What are the types of jobs included in the overall database workload?
2. What do the queries look like?
3. How many concurrent users are there for each query type?
An analogy helps frame these questions. Imagine opening a beauty salon in your town. To ensure its success, you need to estimate the number of people who will visit during peak hours, so you can set up the appropriate number of stations. The services you offer, the speed at which your beauticians work, and the number of beauticians available all factor into how many customers you can serve. The workload in this scenario depends on customer preferences and fluctuates over time. Understanding these dynamics is crucial to avoid overcrowding and customer dissatisfaction.
Similarly, Intel’s database workload consists of different types of interactions between the database and engineers (consumption) and systems that transfer data (ingestion). Ingestion involves processes like extraction-transformation-loading (ETL), critical path ETL, bulk loads, and various insert/update/delete requests within the database. Consumption includes running reports and queries, some of which are batch jobs, while others are ad hoc.
To accurately characterize their workload, Intel employed machine learning techniques such as k-means clustering and Classification and Regression Trees (CARTs). These methods help identify patterns and similarities within the data.
Referring back to the beauty salon analogy, this is like using k-means clustering and CART to analyze customers and sort them into groups with similar needs, such as those seeking only hair services, both hair and nail services, or only nail services.
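To make the CART half of that pairing more concrete, the sketch below shows the core step of a classification tree: choosing the threshold on one feature that minimizes the weighted Gini impurity of the two resulting groups. The CPU figures, the "light" and "heavy" labels, and the function names are hypothetical and only illustrate the technique. The source does not describe how Intel configured its trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical CPU seconds per request, with a hypothetical class for each.
cpu_seconds = [1.0, 2.0, 3.0, 40.0, 55.0, 60.0]
job_class = ["light", "light", "light", "heavy", "heavy", "heavy"]

threshold, impurity = best_split(cpu_seconds, job_class)
print(threshold, impurity)
```

A full tree repeats this search recursively on each side of the split. In practice a library implementation such as scikit-learn's DecisionTreeClassifier would normally replace this hand-written version.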
In Intel’s database workload, k-means clustering and CART analysis revealed that ETL requests could be clustered into seven groups based on factors like CPU time, highest thread I/O, and running time. Similarly, SQL requests could be grouped into six clusters based on CPU time.
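As a minimal sketch of this kind of grouping, the code below runs k-means on requests described by CPU time, I/O, and running time. Everything specific here is an assumption made for illustration: the sample values, the choice of three clusters, the log scaling of the metrics, and the deterministic farthest-point initialization. None of it is taken from Intel's analysis.

```python
import math

# Hypothetical request rows: (cpu_seconds, io_megabytes, runtime_seconds).
requests = [
    (1.0, 10.0, 2.0), (2.0, 12.0, 3.0), (1.5, 9.0, 2.5),
    (50.0, 400.0, 60.0), (55.0, 420.0, 70.0), (48.0, 390.0, 65.0),
    (900.0, 5000.0, 1200.0), (950.0, 5200.0, 1100.0), (880.0, 4900.0, 1300.0),
]

def log_scale(rows):
    """Assumed preprocessing: log10 keeps large raw values from dominating."""
    return [tuple(math.log10(v) for v in row) for row in rows]

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with farthest-point initialization."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster in place
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, labels

centroids, labels = kmeans(log_scale(requests), k=3)
for row, label in zip(requests, labels):
    print(label, row)
```

On real logs, the number of clusters would be chosen by inspecting the data rather than fixed in advance, and a library routine such as scikit-learn's KMeans would be the usual choice.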
Once the groupings were established, the next step involved characterizing various peak periods, analogous to a salon’s regular, pre-Valentine’s Day, and post-Valentine’s Day traffic. By analyzing historical database logs, Intel generated counts of requests for each group during different hours of the day. K-means clustering was then used again, this time to group one-hour slots with similar request counts. Finally, sample workloads were created by selecting, from each group of slots, the one-hour slot with the highest overall CPU utilization.
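The three steps in that paragraph (count requests per group per hour, cluster the hours, pick the busiest hour from each cluster) can be sketched as follows. The log layout, the group names, the two-cluster choice, and the use of summed CPU seconds as a stand-in for CPU utilization are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical summary: (hour_slot, request_group, n_requests, cpu_seconds_each).
spec = [
    (2, "etl_bulk", 8, 30.0), (2, "sql_light", 1, 1.0),
    (3, "etl_bulk", 9, 40.0), (3, "sql_light", 2, 1.0),
    (10, "etl_bulk", 1, 30.0), (10, "sql_light", 20, 2.0),
    (14, "etl_bulk", 2, 30.0), (14, "sql_light", 22, 3.0),
]
# Expand into one log record per request: (hour_slot, request_group, cpu_seconds).
log = [(slot, group, cpu) for slot, group, n, cpu in spec for _ in range(n)]
groups = ["etl_bulk", "sql_light"]

def count_matrix(log, groups):
    """Map each hour slot to its vector of request counts, one per group."""
    counts = defaultdict(Counter)
    for slot, group, _cpu in log:
        counts[slot][group] += 1
    return {slot: tuple(c[g] for g in groups) for slot, c in sorted(counts.items())}

def kmeans_labels(points, k, iters=50):
    """Compact Lloyd's algorithm with farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels

def representative_slots(log, groups, k):
    """Pick, from each cluster of similar hours, the slot with the most CPU seconds."""
    vectors = count_matrix(log, groups)
    slots = list(vectors)
    labels = kmeans_labels([vectors[s] for s in slots], k)
    cpu = defaultdict(float)
    for slot, _group, seconds in log:
        cpu[slot] += seconds
    best = {}
    for slot, label in zip(slots, labels):
        if label not in best or cpu[slot] > cpu[best[label]]:
            best[label] = slot
    return sorted(best.values())

print(representative_slots(log, groups, k=2))
```

With this toy data, the ETL-heavy night hours (2 and 3) and the query-heavy day hours (10 and 14) form two clusters, and hours 3 and 14 are selected as the sample workloads because they carry the most CPU seconds within their clusters.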
What made this process particularly valuable was that it rested on data rather than guesswork. Unlike the salon owner in the analogy, who has to rely on personal conjecture, Intel characterized its workload from historical logs and machine learning analyses, which gave the benchmark a solid footing. This allowed Intel to evaluate the cost and performance of their existing MPP RDBMS against several alternative solutions, leading to cost savings and improved efficiency.
For a detailed account of how Intel created a custom benchmark and ran it in multiple proofs of concept, readers can refer to the IT@Intel white paper titled “Minimizing Manufacturing Data Management Costs.”
Intel’s cost-saving move serves as a reminder to businesses to regularly evaluate and optimize their database workloads. By combining machine learning techniques with detailed workload analysis, companies can identify opportunities for cost reduction and enhanced efficiency. As technology continues to evolve, grounding these decisions in data helps organizations stay ahead in today’s competitive landscape.