Monday, February 2, 2026

The Gravity of Context: From Transformers’ Attention to Oracle 26ai

This post is about AI. We'll start by going back to the early days of Transformers, the 2017 moment, the attention mechanism and all that, and at the end we'll use a metaphorical bridge to connect the Transformer math with Oracle's in-database applied AI. Our metaphorical bridge will be "the internal attention mechanisms". A little bit of Autonomous Database, a little bit about 23ai / 26ai, but we'll get there... Okay. Let's start and we'll see :)


Let's start with a specific point in time that I call the "2017 moment": the release of the Google paper "Attention Is All You Need." Attention mechanisms actually existed as early as 2014 (Bahdanau et al.), but the 2017 moment was the declaration of independence.

*To Ashish Vaswani and his co-authors, the architects of the 2017 moment: thank you for proving that attention is indeed all we need to reshape the world of AI.

While Bahdanau added attention in 2014 as a supporting mechanism for the old RNNs, the 2017 paper made the bold claim that we didn't need the RNNs at all: attention was all we needed.

A side note: the industry is now actually moving toward hybrid architectures, toward "attention where you need it." But I won't get into that here. Let's get straight to the point without causing too much confusion.

The paper "Attention Is All You Need" did exactly what the title said. Before that, we (of course not the actual "we", the AI models) were stuck in the linear age, processed one word at a time, left to right, and it was slow. The memories these models had was short, and subject-specific processing couldn’t be parallelized. But! the 2017 paper showed we could build an entire architecture using only attention, and it fundamentally reset the AI timeline. It provided the ability to seeing sentences as a big, interconnected web of context.

I'll explain this with my favorite example: the classic excuse of young students (not the ones from Turkey, more likely the ones from America... it's an old one, there must be some modern versions by now :)
That is, the sentence "The dog ate my homework".

So, in the old days of RNNs, models tried to carry the entire past context in a single, ever-changing hidden state. However, as the sentence grew, that memory faded, and the model would often lose track of the subject by the time it reached the end. If the model fails to explicitly link the word "my" back to "dog" across that distance, an error becomes almost unavoidable...

I mean, let's imagine we give the model the input "The dog ate my" and ask it to predict the next word... The word "my" by itself doesn't make it easy for the model to choose "homework".

In fact, the model gets lost here. What I want to emphasize is that, in order to guess the correct word ("homework" in this case), the model should attend back to the word "dog". It is like catching the context... Without the context, you generate wrong answers, or you get lost. You need a complete understanding of the sentence and a way to grasp the context from it, even if only indirectly.

Okay, we begin with the Transformer, but what is it? Well... the self-attention mechanism is the key. The Transformer is essentially just an efficient frame built specifically to let the attention mechanism run properly at scale.

With attention, the Transformer becomes a machine that can see every word in a sentence simultaneously. And what's going on in the background is mathematics... Cool math tricks make these things appear the way they do. In the example above, the Transformer creates a direct mathematical link between the word "my" and the word "dog", instantly.

Here, it's mostly vector math. The vectors are being massaged... The vectors corresponding to the words are related and massaged together, and the final vector is no longer the same vector it was when it started; it has become related to the previous vectors, and from this point on the next word is predicted more easily and accurately. That's basically what's happening.

Let's go a little further; let's look under the hood to see how it actually works. The logic of the Transformer is based on the high-speed dance of three vectors: Query (Q), Key (K), and Value (V).

Query: when the Transformer processes the word "my", it uses the Query vector generated for that token. The Query vector asks: "Who in this sentence is relevant to me?" Every Key vector answers: "Here's what I offer." The dot product measures their compatibility..!

Key: every other word ("The", "dog", "ate", ...) has a Key vector. It's like an identity card; it describes what that word offers to the sentence.

Value: every word also carries a Value vector, the actual content it will contribute to the mix once attention decides how relevant it is. We'll see it in action in a moment.

So, we calculate a dot product (a well-known vector operation; note that there are other scoring methods as well) between the Query and each Key (hint: the Query vector of the token "my" is compared against the Key vectors of all tokens), and we run the scores through a softmax function, which squashes the numbers into probabilities. (In the original paper the scores are also divided by √d_k before the softmax, but let's keep it simple here.)

If the word "dog" gets a high score, the model focuses its attention there. But! this does not mean the model looks only at the word "dog". We can say, though, that the Value vector of "dog" contributes the most to the resulting representation.

The dot product measures how aligned two vectors are. If they are close/aligned in vector space, the dot product comes out as a large number, a high score.

Softmax converts these scores to probabilities. I mean, it takes the numbers and squashes them all between 0 and 1, and it makes their sum equal to 1.
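
To make that concrete, here is a minimal Python/NumPy sketch of the scoring step. The vectors are toy numbers I made up purely for illustration; nothing here comes from a real model.

Python (toy sketch):

import numpy as np

# Made-up 4-dimensional vectors, purely for illustration
q_my  = np.array([0.9, 0.1, 0.8, 0.3])   # Query vector of "my"
k_the = np.array([0.1, 0.0, 0.2, 0.1])   # Key vector of "The"
k_dog = np.array([1.0, 0.2, 0.9, 0.1])   # Key vector of "dog"

def softmax(x):
    e = np.exp(x - np.max(x))             # subtract the max for numerical stability
    return e / e.sum()                    # between 0 and 1, summing to 1

scores = np.array([q_my @ k_the, q_my @ k_dog])   # dot products: alignment scores
print(scores)            # "dog" gets the larger score because its Key aligns with the Query
print(softmax(scores))   # the same scores squashed into probabilities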

An example of the scores and the softmax finale:
-----------------------
"My" is the focus, so we calculate dot-product of "my" with other words.

Scores: [The: 1.2, dog: 5, ate : 3.5, my: 2] 

What actually happens is: Query of "my" dot Key of "dog" = 5 (high! because "my" is looking for an owner, and "dog" is something that can be owned). Note that this is just an analogy; Transformers don't actually encode semantic intent like ownership, and the model does not know why the vectors align... :)

After softmax: [The: 0.02, dog: 0.77, ate: 0.17, my: 0.04]

Decision: when processing the word "my", give roughly 77% of the attention to the word "dog" (this is the attention!).
-----------------------
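
By the way, if you plug those made-up scores into an actual softmax, you get exactly the weights above. A tiny NumPy check:

Python (toy sketch):

import numpy as np

scores = np.array([1.2, 5.0, 3.5, 2.0])            # The, dog, ate, my
weights = np.exp(scores) / np.exp(scores).sum()     # softmax
print(dict(zip(["The", "dog", "ate", "my"], weights.round(2).tolist())))
# -> {'The': 0.02, 'dog': 0.77, 'ate': 0.17, 'my': 0.04}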

As for the finalization, we multiply these weights by the Value vectors. So, to get the contextualized vector for "my", we multiply each Value vector by its weight and sum them all up. The result is a new vector, and "dog" contributes the most to it. I mean, "dog" contributes the most to the contextualized representation
(remember: 0.77 after softmax).

So, the word "my" doesn't go alone when it moves to the next layer. 

The model takes the Value vector of the word "dog," multiplies it by 0.77 (meaning it takes the Value vector of "dog" with almost full strength), and then adds the Value vector of "ate," scaled by 0.17. It even looks at the word "The" with its tiny weight of 0.02, and the result is: "my" is no longer just "my"; it's like having my' (my prime) in hand, a dynamic, contextualized vector that has been mathematically pulled toward the dog and influenced by the act of eating.

This allows the model to predict the word "homework" in the next step, knowing that the word "my" is related to "dog" and the act of eating.
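
Putting that last step into code: a toy NumPy sketch of the weighted sum of Value vectors. The Value vectors below are made up; only the weights come from the example above.

Python (toy sketch):

import numpy as np

# Attention weights for "my" after softmax (from the example above)
weights = np.array([0.02, 0.77, 0.17, 0.04])   # The, dog, ate, my

# Made-up 4-dimensional Value vectors for each word
V = np.array([
    [0.1, 0.0, 0.2, 0.1],   # Value of "The"
    [0.9, 0.8, 0.1, 0.7],   # Value of "dog"
    [0.2, 0.1, 0.9, 0.3],   # Value of "ate"
    [0.3, 0.4, 0.2, 0.5],   # Value of "my"
])

# The contextualized vector for "my": each Value scaled by its weight, then summed.
# "dog" dominates the result because its weight (0.77) is the largest.
my_prime = weights @ V
print(my_prime)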

Okay.. This is like massaging, and it happens over and over through multiple layers. Each layer refines the meaning, making the vector more intelligent and context-aware as it moves toward the final prediction. 

By the time the data flows through the Transformer layers, the word "my" has literally been reshaped by the context of "dog". The Transformer is the architecture that allows this massage to happen across billions of words at the same time. It doesn't care about the order of the words initially; it sees the whole web at once, but! it uses smart positional tags (positional encodings) to keep the story straight. And it doesn't do just one massage at a time; it runs multiple versions of this attention (multi-head) in parallel, layers deep, to refine the meaning until the alignment is spot-on.
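
For the curious, here is a very rough NumPy sketch of the multi-head part: splitting the vectors into heads, running scaled dot-product attention per head, and gluing the results back together. Everything is random toy data; real implementations add learned projections, masking, and much more.

Python (toy sketch):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_model))   # pretend these came from learned projections
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

def split_heads(X):
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    return X.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

# Scaled dot-product attention, run independently per head
scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
out = softmax(scores) @ Vh

# Concatenate the heads back into one vector per token
out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
print(out.shape)   # (4, 8) -> one contextualized vector per token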


So far so good. Let's take a look at the training phase a little bit. There we see weights... In the training phase, the model doesn't store the final vectors; it stores weights. When those weights are hit by a word, they produce the vectors. Training isn't just about guessing; it's about adjusting the weights so that the resulting vectors point in the right direction at runtime.

Okay... The weights are important: they indirectly determine the vectors, where they point and how they align. The vectors are there from the beginning, but they are not static.

During training, weights are learned that will produce good vectors. So at runtime, we get the input, we already have the weights, and using these two, the contextual vectors are computed fresh.
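
A toy sketch of "weights stored, vectors computed fresh" (the matrices and the embedding below are random placeholders; in a real model they are learned during training):

Python (toy sketch):

import numpy as np

rng = np.random.default_rng(42)
d_model = 8

# These matrices are the "weights": this is what the model stores after training.
# (Random placeholders here; a real model learns them.)
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

# At runtime, an input embedding "hits" the weights and the Q/K/V vectors are produced fresh.
x_my = rng.normal(size=d_model)            # embedding of the token "my" (also a placeholder)
q, k, v = x_my @ W_Q, x_my @ W_K, x_my @ W_V
print(q.shape, k.shape, v.shape)           # three freshly computed vectors, nothing looked up from storage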

For instance, the word "bank" is transformed into completely different vectors in the following two sentences: "I went to the bank to withdraw money" and "I sat on the river bank".

These new, contextual vectors are calculated at runtime. There's no pre-existing "riverbank" vector sitting in memory; the model constructs that meaning on the fly, using its weights and the surrounding words.

In other words, the word "bank" doesn't have a stored "river" version or a "money" version. The weights know how to construct the right vector based on the context at runtime.
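
Here is a toy sketch of that idea. Everything below is random placeholder data; the only point is that the same stored "bank" embedding and the same stored weights produce different contextual vectors when the neighbors change.

Python (toy sketch):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(7)
d = 8

# One stored embedding and one set of stored weights for "bank": nothing context-specific.
x_bank = rng.normal(size=d)
W_Q, W_K, W_V = [rng.normal(size=(d, d)) for _ in range(3)]

def contextual_bank(neighbor_embeddings):
    # Same "bank" embedding, same weights, different neighbors.
    X = np.vstack([x_bank] + neighbor_embeddings)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q[0] @ K.T / np.sqrt(d))   # attention of "bank" over the sentence
    return weights @ V                           # contextualized vector for "bank"

money_context = [rng.normal(size=d) for _ in range(3)]   # stand-ins for "withdraw", "money", ...
river_context = [rng.normal(size=d) for _ in range(3)]   # stand-ins for "sat", "river", ...
print(np.allclose(contextual_bank(money_context), contextual_bank(river_context)))  # False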

Anyway, during training the model guesses answers... The cross-entropy loss function comes into play here: if the probability the model assigns to the correct answer is low, the loss is high. At this stage, backpropagation kicks in using that loss (the loss tells you how much the model missed the target) and updates the weights so that future vector representations align more closely with the desired outputs. In other words, indirectly, those updates pull the vectors that need to be close together closer together. (This is very high level, but I won't go any deeper here.)
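
A quick numerical illustration of that loss (the probabilities are made up):

Python (toy sketch):

import numpy as np

# Probability the model assigned to the correct next word ("homework")
p_low  = 0.05   # the model barely considered "homework"
p_high = 0.90   # the model was fairly confident about "homework"

# Cross-entropy loss for the correct class is -log(p)
print(-np.log(p_low))    # ~3.00 -> high loss, backpropagation pushes for big weight updates
print(-np.log(p_high))   # ~0.11 -> low loss, only small updates needed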


Now we have arrived at the point where we cross our metaphorical bridge and connect our subject to Oracle.

Oracle didn't just watch the 2017 revolution; they built the new database on top of its results. The ability to perform AI tasks directly within the database was introduced with Oracle 23ai and further improved in 26ai.

In fact, we have conducted RAG (Retrieval-Augmented Generation) studies with 23ai, built a Sales Assistant solution using Oracle's vector search and RAG, and implemented software solutions for speech-to-SQL demos.

*To Douwe Kiela, one of the pioneers of RAG: thank you for giving AI a reliable memory and leading the charge as the ultimate hallucination reducer.

Since Oracle Database 23ai, VECTOR is a native data type. The same kind of math and approach we just explained is now happening directly inside the database kernel.

By performing in-database vector search, we can feed the Transformer's context window with rock-solid facts, effectively becoming hallucination killers through RAG (Retrieval-Augmented Generation). Note that RAG doesn't actually eliminate hallucinations, it reduces them :)

Anyway, by implementing RAG on the vector data that already sits inside the Oracle Database, we basically tell the model: "Do not guess! Use this data!"

This is an important unification of AI (vectors and database-integrated generative AI with RAG), Graph, and native JSON in the context of the converged Oracle Database. With 23ai, and now with the evolved version, 26ai, Oracle goes beyond traditional database logic and can run inference-related AI operations directly within the database.

This means that instead of constantly moving data to external AI environments (or to things like dedicated vector databases), we can now perform core AI operations within Oracle itself. In older versions (like 19c), when this type of processing was required, we needed to export data to an ML environment, run Python, TensorFlow, or OCI Data Science jobs externally, and then push the results back into Oracle. In 26ai (actually starting with 23ai), this changes completely.

Now the database itself understands vector data, offers AI-oriented SQL syntax, and includes built-in packages for inference, embeddings, and vector search.
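
To give a feel for what that looks like from application code, here is a rough sketch using the python-oracledb driver. The connection details, table and column names, and the embed() placeholder are all hypothetical; treat it as a sketch of the idea, not a copy-paste recipe.

Python (sketch, hypothetical objects):

import oracledb

def embed(text):
    # Placeholder embedding: a real setup would call an embedding model here
    # (an in-database ONNX model, an external embedding service, etc.).
    # The dimension must match the VECTOR column in the table.
    return [0.0] * 384

# Hypothetical connection details
conn = oracledb.connect(user="demo", password="demo", dsn="localhost/freepdb1")
cur = conn.cursor()

question = "Which products did we discuss with customer X last quarter?"
query_vec = embed(question)

# Top-5 most similar document chunks, computed inside the database.
# DOC_CHUNKS is a hypothetical table with a VECTOR column named EMBEDDING.
cur.execute("""
    SELECT chunk_text
    FROM doc_chunks
    ORDER BY VECTOR_DISTANCE(embedding, TO_VECTOR(:qv), COSINE)
    FETCH FIRST 5 ROWS ONLY
""", qv=str(query_vec))

context = [row[0] for row in cur.fetchall()]
# These chunks then go into the LLM prompt: "Do not guess! Use this data!"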


In addition to that, Oracle 26ai offers support for open table formats, reinforcing its converged database strategy. Actually, this is not a new thing, but 26ai is the version that matures and standardizes this capability. That is, Oracle can natively integrate with Apache Iceberg and Apache Parquet, allowing you to store or access data in these universal formats. You can create a table (backed by Iceberg behind the scenes) using DDL as part of the Oracle database, and query the Iceberg data directly inside an Oracle database session using SQL. (Interesting, isn't it?)


So, Oracle can seamlessly integrate with Iceberg by querying it directly, without needing to import or move the data. We can even join such an external Iceberg table with a relational Oracle table.

For instance;

Oracle Session / SQL:

SELECT
    c.customer_id,
    c.first_name,
    c.last_name,
    c.email,
    s.segment,
    s.risk_score
FROM sys.customers_iceberg c        -- external table backed by Apache Iceberg
LEFT JOIN sys.customer_segments s   -- regular relational Oracle database table
  ON s.customer_id = c.customer_id
ORDER BY c.customer_id;

This means significantly reduced data movement and less complex ETL processes. The philosophy of "query the data where it lives" is fully applied here, making it a truly AI-native database. It's worth a try... (Read more:
https://www.oracle.com/tr/database/ai-native-database-26ai/)


Let's wrap things up, and before we finish, I'd like to remind you of Kepler.

In general, we must do what Kepler did with the stars! Kepler didn't just look at the stars and say, "If this is here today, it must be there tomorrow." Kepler changed the game by discovering the underlying laws that governed them.

In this case, having AI inside the database may be considered the Kepler moment for enterprise data.

While Transformer models are the engines of generation, Oracle provides the high-speed vector engine for retrieval. The same dot-product and similarity-search operations that allow a Transformer to find the "dog" in a sentence are now happening directly inside the database.

So, it is a Keplerian shift: the database is no longer a passive storage room; it's an active participant that understands the gravitational pull between our data points. We are moving from "What happened?" to "What is the underlying pattern?"

Of course, a new turning point is still needed. Let's see when a turning point like the one in 2017 comes around again.

Therefore, I would like to conclude with these words:

In a broader context, I think we should be waiting for a new fundamental shift, maybe another 2017 Moment, a fundamental shift that will take the AI world beyond the limits of sophisticated pattern matching. We should be looking for an architectural leap where reasoning, causal understanding, and the essence of human experience are no longer just simulated, but are inherently woven into the very fabric of the model.