Data Glossary

In a data-driven world, it's hard not to get lost in big-data jargon. The Data Glossary is a free resource for people new to big data, data engineering, machine learning, and all things data-related. This continuously updated list of data-related terms is here to demystify commonly used terminology and help you better understand the data world.


Lost in the data jargon? We've put together a hand-picked list of the most popular data terms in the data engineering and ML field. Each term is carefully defined and explained.




A

ACID Transactions

When it comes to databases and data storage systems, a transaction refers to an action that is treated as a cohesive unit of work. It either successfully completes as a whole or fails entirely, ensuring that the storage system remains in a consistent state. An illustrative instance of a transaction is the process of withdrawing money from your bank account. Either the funds are deducted from your account, or they remain untouched; there is no middle ground.


A.C.I.D. properties: Atomicity, Consistency, Isolation, and Durability

ACID is an acronym representing the four fundamental properties that define a transaction: Atomicity, Consistency, Isolation, and Durability. When a database operation exhibits these ACID properties, it is referred to as an ACID transaction, and data storage systems that implement such operations are known as transactional systems. ACID transactions provide assurance that every read, write, or modification of a table adheres to the following principles:

  • Atomicity: Each statement within a transaction, whether it involves reading, writing, updating, or deleting data, is treated as a unified unit. It is either executed entirely or not at all. This property safeguards against data loss or corruption, such as when a streaming data source fails abruptly during processing.

  • Consistency: Transactions ensure that modifications to tables occur in predefined and predictable manners. By enforcing consistency, the integrity of the table remains intact, preventing unintended consequences or errors from compromising the data.

  • Isolation: When multiple users concurrently read from and write to the same table, transaction isolation guarantees that their actions do not interfere with or impact one another. Each transaction is isolated, allowing them to proceed as if they were occurring sequentially, even though they are executed simultaneously.

  • Durability: The durability property ensures that changes made by successfully executed transactions persist, even in the event of system failures or crashes. The data changes are reliably saved and can be recovered to maintain data integrity and availability.

Why are ACID transactions a good thing to have?

ACID transactions guarantee the utmost data reliability and integrity by preventing data inconsistencies resulting from incomplete operations. Without ACID transactions, if you were in the midst of writing data to a database table and experienced an unexpected power outage, it is possible that only a portion of your data would be saved, leaving the database in an inconsistent state. Recovering from such a situation would be arduous and time-consuming. ACID transactions ensure that your data never ends up in such an inconsistent state, providing a robust mechanism for maintaining data integrity and minimizing potential data loss.
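As a rough sketch of atomicity in practice, the example below uses Python's built-in sqlite3 module to move money between two accounts inside a single transaction: either both updates are committed, or the whole operation is rolled back. The accounts table and its contents are invented for the illustration.

```python
import sqlite3

# In-memory database with a hypothetical "accounts" table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move `amount` between two accounts as one atomic unit of work."""
    # The connection context manager commits if the block succeeds and rolls
    # back if any statement raises, so a half-finished transfer (debited but
    # never credited) is never left behind.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, sender),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, receiver),
        )

transfer(conn, "alice", "bob", 30)
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]
```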

Anomaly Detection

Anomaly detection involves the identification of uncommon events or observations that deviate statistically from the majority of the data. These "anomalous" occurrences often indicate issues such as credit card fraud, malfunctioning machinery, or cyber-attacks. In the realm of finance, where there are thousands or millions of transactions to monitor, anomaly detection can play a crucial role in pinpointing errors, facilitating root cause analysis, and promptly alerting the right people to address the problem. By detecting outliers and notifying relevant stakeholders, anomaly detection supports the goals of chaos engineering in monitoring and managing potential disruptions. Increasingly, machine learning and AI techniques are being employed for anomaly detection purposes, particularly in the realms of fraud detection and Anti-Money Laundering (AML).
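As a minimal sketch of the statistical idea, the example below flags values that lie far from the mean of a baseline sample using a z-score; the amounts and threshold are invented for the illustration, and production systems typically rely on more robust techniques such as isolation forests or autoencoders.

```python
import statistics

def zscore_anomalies(baseline, new_values, threshold=3.0):
    """Flag new observations that deviate strongly from the baseline distribution."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) / stdev > threshold]

# Hypothetical card-transaction amounts: a baseline of typical spend,
# then a new batch containing one suspiciously large payment.
baseline = [42.0, 35.5, 51.2, 48.9, 39.7, 44.3, 50.1, 37.8, 46.6, 41.2]
new_batch = [43.5, 49.0, 612.0, 38.2]
print(zscore_anomalies(baseline, new_batch))  # [612.0]
```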

Algorithm

What is an algorithm?

An algorithm is a systematic set of instructions or a defined procedure designed to solve a problem or accomplish a specific computation. It serves as a precise guide that outlines step-by-step actions to be executed, whether in hardware or software environments. Algorithms find widespread applications across various domains of information technology. In the fields of mathematics and computer science, algorithms are often referred to as concise procedures that address recurring problems. They also serve as blueprints for data processing operations and play a pivotal role in automated systems. From simple tasks like sorting sets of numbers to complex endeavours such as personalized content recommendations on social media platforms, algorithms start with an initial input and detailed instructions for a particular computation, ultimately producing an output as the result of the process.

How does an algorithm work?

Algorithms can be expressed using different formats such as natural languages, programming languages, pseudocode, flowcharts, and control tables. While natural language expressions are less common due to their ambiguity, programming languages are commonly utilized to convey algorithms that are executed by computers.

Algorithms involve an initial input along with a series of instructions. The input represents the initial data necessary for making decisions and can be in the form of numbers or words. The input data is processed through a set of instructions, encompassing computations like arithmetic operations and decision-making processes. The output, which typically takes the form of additional data, represents the final step of an algorithm.

For instance, a search algorithm takes a search query as input and follows instructions to search through a database for relevant items matching the query. Automation software serves as another illustration of algorithms, as it adheres to predefined rules to carry out tasks. Automation software comprises multiple algorithms that collectively automate a specific process.
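As a toy sketch of the input, instructions, and output pattern described above, the hypothetical function below takes a search query and a list of records as input, follows a fixed set of steps to compare each record against the query, and returns the matching records as output.

```python
def search(query, records):
    """Linear search: scan every record and keep the ones containing the query."""
    query = query.lower()          # input: the search string
    matches = []
    for record in records:         # instructions: compare each record against the query
        if query in record.lower():
            matches.append(record)
    return matches                 # output: the records that matched

# Hypothetical product catalogue, used purely for illustration.
catalogue = ["Red running shoes", "Blue rain jacket", "Trail running socks"]
print(search("running", catalogue))  # ['Red running shoes', 'Trail running socks']
```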

What are the different types of algorithms?

There are several types of algorithms, each designed to accomplish a different kind of task. Common types include the following:

  • Search engine algorithm - This algorithm takes search strings of keywords and operators as input, searches its associated database for relevant web pages and returns results.
  • Encryption algorithm - This computing algorithm transforms data according to specified actions to protect it. A symmetric key algorithm, such as the Data Encryption Standard, for example, uses the same key to encrypt and decrypt data. As long as the algorithm is sufficiently sophisticated, no one lacking the key can decrypt the data.
  • Greedy algorithm - This algorithm solves optimization problems by choosing the locally optimal solution at each step, in the hope that it will also be optimal at the global level. However, it does not guarantee a globally optimal solution.
  • Recursive algorithm - This algorithm calls itself repeatedly until it solves a problem. Each recursive call operates on a smaller input, working toward a base case that ends the recursion.
  • Backtracking algorithm - This algorithm builds a solution to a given problem incrementally, one piece at a time, abandoning partial solutions as soon as they cannot lead to a valid result.
  • Divide-and-conquer algorithm - This common algorithm works in two parts: it divides a problem into smaller subproblems, then solves those subproblems and combines their results to produce a solution (see the binary-search sketch after this list).
  • Dynamic programming algorithm - This algorithm solves problems by dividing them into subproblems. The results of the subproblems are stored and reused whenever the same subproblem recurs.
  • Brute-force algorithm - This algorithm blindly iterates over all possible solutions to a problem, searching for one or more that satisfy the requirements.
  • Sorting algorithm - Sorting algorithms rearrange the elements of a data structure based on a comparison operator, which is used to decide the new order of the data.
  • Hashing algorithm - This algorithm takes data and converts it into a fixed-length value (a hash) using a hashing function.
  • Randomized algorithm - This algorithm reduces running times and time-based complexities. It uses random elements as part of its logic.
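To make a couple of these types concrete, the sketch below implements binary search, a classic divide-and-conquer algorithm that is also naturally recursive: each call halves the sorted input and recurses into the half that can still contain the target.

```python
def binary_search(sorted_values, target, lo=0, hi=None):
    """Return the index of `target` in `sorted_values`, or -1 if it is absent."""
    if hi is None:
        hi = len(sorted_values) - 1
    if lo > hi:                          # base case: the search range is empty
        return -1
    mid = (lo + hi) // 2
    if sorted_values[mid] == target:
        return mid
    if sorted_values[mid] < target:      # divide: discard the half that cannot hold the target
        return binary_search(sorted_values, target, mid + 1, hi)
    return binary_search(sorted_values, target, lo, mid - 1)

print(binary_search([2, 5, 8, 12, 16, 23, 38], 16))  # 4
print(binary_search([2, 5, 8, 12, 16, 23, 38], 7))   # -1
```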


A

  • Algorithm

    An algorithm is a systematic set of instructions or a defined procedure designed to solve a problem or accomplish a specific computation, whether in hardware or software environments.


  • Association Rules

    Association rules are patterns or relationships discovered in data that indicate the co-occurrence or dependency between different items or events. They are commonly used in market basket analysis to identify frequent itemsets and generate recommendations (see the support/confidence sketch below).

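As a hand-rolled sketch of the support and confidence metrics behind association rules (full Apriori-style mining is usually done with a library), the example below scores a single candidate rule over a few made-up shopping baskets.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for b in baskets if antecedent <= set(b))
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= set(b))
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets, invented for the example.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
]
support, confidence = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, round(confidence, 2))  # 0.5 0.67
```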

B

  • Big Data

    Big data refers to extremely large and complex datasets that cannot be easily managed, processed, or analyzed using traditional data processing techniques.


  • Bias

    Bias in data science refers to the systematic error or deviation in the results or predictions of a model due to incorrect assumptions, flawed methodology, or discriminatory factors.
