We propose a distributed representation approach for domain names based on DNS queries. This distributed representation embeds domains into a vector space in a way that reflects data exchange in networks. Since the ground truth of the distributed representation is unknown, we evaluate it indirectly, on the premise that its accuracy is strongly related to the validity of the similarities between domains that it encodes. The results suggest that a concise and versatile representation that accurately captures the interrelations of numerous domain names is feasible.
DNS: Domain Name System
The DNS (Domain Name System) is closely related to the network activities of devices because devices interact directly with the DNS before exchanging data over networks [1]. Administrators must monitor these activities to ensure network security. Recent work has assessed the effectiveness of graph-theoretical techniques in gaining insight into these activities [2,3]. Graph-based approaches represent the relations between domains using vertices and edges so that complex structures can be analyzed numerically. However, their main problem is a strict limit on the number of domains they can handle: the graph scale increases rapidly with the number of unique domains, and the graph structure is ultimately represented by an adjacency matrix whose values become exceedingly sparse and uneven. Our challenge is to establish a concise and versatile representation that accurately captures the interrelations of numerous domain names, and the results contribute to advances in network security.
The remainder of this paper is organized as follows: In Section 2, we propose a distributed representation approach for domain names based on DNS queries. We describe experiments conducted to analyze the effectiveness of our approach in Section 3. Finally, we summarize our conclusions and future work in Section 4.
In this paper, we propose a distributed representation approach for domain names based on DNS queries. This distributed representation enables domains to be embedded into low-dimensional, dense, and continuous vector spaces. Our approach is motivated by the following observation: since several DNS queries accompany each data exchange in a network, the queried domains depend strongly on one another; thus, individual domains are indirectly characterized by those relations. Notably, the concept of distributed representations for domain names is not novel [4]. However, the main focus has traditionally been topical classification, which maps “kyutech.ac.jp” to the “education” category and “bbc.com” to the “news and media” category. In contrast, our concern lies in vector representations of domain names that reflect data exchange in networks, and such a representation is highly suitable as an input for machine learning and deep learning in network security [5].
Figure 1 shows an overview of our approach. This approach mainly consists of a data preprocessing function and a distributed representation training function. The following sections detail the two functions.
The input to our approach is a query log, that is, a record of the queries sent to recursive DNS servers by devices on a network. In the query log, each query has attributes such as a timestamp, source address, queried domain, and record type.
The data preprocessing function divides a query log into query sub-logs. A query sub-log is a set of consecutive queries that have the same source address and are separated by time intervals of Tα seconds or less. In this division, every pair of consecutive queries in a query sub-log must satisfy the following conditions:
$$\forall x_i, x_{i+1} \in X_n : f(x_i, x_{i+1}) \le T_\alpha \ \text{and}\ g(x_i, x_{i+1}).$$
Here, xi and xi+1 indicate the i-th and (i+1)-th consecutive queries in query sub-log Xn; f(xi, xi+1) is the time interval between xi and xi+1; and g(xi, xi+1) is a Boolean that is true if xi and xi+1 have the same source address. The resulting N query sub-logs Xn, where n ∈ {1 … N}, are passed to the subsequent function to train the distributed representation.
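The division into query sub-logs can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the `Query` record, the function name `split_into_sublogs`, and the example addresses and domains are hypothetical. The sketch also assumes one reading of the condition above, namely that queries are first grouped by source address and each group is then cut wherever the gap between consecutive queries exceeds Tα.

```python
from collections import defaultdict
from typing import NamedTuple


class Query(NamedTuple):
    """Hypothetical record holding the query attributes named in the text."""
    timestamp: float  # seconds since an arbitrary epoch
    source: str       # source address of the device
    domain: str       # queried domain
    rtype: str        # record type, e.g. "A"


def split_into_sublogs(queries, t_alpha=15.0):
    """Group queries by source address, then cut each group wherever the
    gap between consecutive queries exceeds t_alpha seconds."""
    by_source = defaultdict(list)
    for q in sorted(queries, key=lambda q: q.timestamp):
        by_source[q.source].append(q)

    sublogs = []
    for source_queries in by_source.values():
        current = [source_queries[0]]
        for prev, nxt in zip(source_queries, source_queries[1:]):
            if nxt.timestamp - prev.timestamp <= t_alpha:
                current.append(nxt)      # f(x_i, x_{i+1}) <= T_alpha holds
            else:
                sublogs.append(current)  # gap too large: close the sub-log
                current = [nxt]
        sublogs.append(current)
    return sublogs


log = [
    Query(0.0, "10.0.0.1", "a.example", "A"),
    Query(1.0, "10.0.0.2", "b.example", "A"),
    Query(5.0, "10.0.0.1", "c.example", "A"),
    Query(30.0, "10.0.0.1", "d.example", "A"),
]
sublogs = split_into_sublogs(log)
print([[q.domain for q in s] for s in sublogs])
```

In this toy log, the first device yields two sub-logs because its last query arrives 25 seconds after the previous one, and the second device yields a single one-query sub-log.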
Distributed representations were originally devised for semantic analysis in natural language processing. Word2Vec, a representative model, learns the relations between words and their surrounding words through a neural network [6]. We modify Word2Vec to shift its focus from words in a sentence to queries in a query sub-log as follows: (1) we replace words with queried domains; and (2) to measure co-occurrences, we adopt the time interval between queries instead of the distance between words.
First, the training function selects pairs of queries and their surrounding queries, where “surrounding” means occurring within Tβ seconds before or after a query. Each query xj in the set of surrounding queries Si paired with query xi in query sub-log Xn must satisfy the following conditions:
$$\forall x_j \in S_i : x_i, x_j \in X_n \ \text{and}\ f(x_i, x_j) \le T_\beta \ \text{and}\ h(x_i, x_j).$$
Here, f(xi, xj) is the time-interval between xi and xj; and h(xi, xj) is true if both queries xi and xj have A-records. Only such queries are considered because they arise from data exchange.
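As an illustration of this selection step, the following Python sketch enumerates the (query, surrounding query) pairs in one sub-log. The function name `select_pairs`, the tuple layout `(timestamp, domain, record_type)`, and the example domains are assumptions made for the sketch, and the quadratic loop is chosen for readability rather than efficiency.

```python
def select_pairs(sublog, t_beta=1.0):
    """Return (domain_i, domain_j) pairs for queries in one sub-log.

    sublog is a list of (timestamp, domain, record_type) tuples.
    A pair is kept when both queries are A-record queries and their
    time interval is at most t_beta seconds.
    """
    # h(x_i, x_j): only A-record queries take part in any pair.
    a_queries = [(t, d) for (t, d, rtype) in sublog if rtype == "A"]

    pairs = []
    for i, (t_i, d_i) in enumerate(a_queries):
        for j, (t_j, d_j) in enumerate(a_queries):
            # f(x_i, x_j) <= T_beta, looking both before and after x_i.
            if i != j and abs(t_i - t_j) <= t_beta:
                pairs.append((d_i, d_j))
    return pairs


sublog = [
    (0.0, "a.example", "A"),
    (0.4, "b.example", "A"),
    (0.6, "c.example", "AAAA"),  # dropped: not an A-record query
    (2.0, "d.example", "A"),     # more than 1 s away from the others
]
print(select_pairs(sublog))
```

Because the window extends before and after each query, every co-occurrence appears twice, once in each direction.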
Next, the pairwise relations between query xi and surrounding queries Si are trained in accordance with the general Word2Vec model. Specifically, the weights in the neural network are optimized to infer the domains for surrounding queries Si from the domain for query xi. By iteratively training the pairwise relations, our approach finally yields a distributed representation of domains.
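To make the training step concrete, the sketch below shows a skip-gram update with negative sampling in plain Python, applied to (domain, surrounding domain) pairs. It is a simplified illustration and not the implementation used in this work: the function name `train_skipgram` and the toy hyperparameters are hypothetical, negatives are drawn uniformly rather than from a frequency-based distribution, and down-sampling and mini-batching are omitted.

```python
import math
import random


def train_skipgram(pairs, dim=8, epochs=50, lr=0.05, negatives=5, seed=0):
    """Train domain vectors so that each domain predicts its surrounding domains."""
    rng = random.Random(seed)
    vocab = sorted({d for pair in pairs for d in pair})
    # Input vectors (the final representation) start small and random;
    # output vectors start at zero.
    w_in = {d: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for d in vocab}
    w_out = {d: [0.0] * dim for d in vocab}

    def sigmoid(x):
        # Clamp the argument to avoid overflow in exp().
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

    for _ in range(epochs):
        for center, context in pairs:
            v = w_in[center]
            grad_v = [0.0] * dim
            # One positive target plus uniformly drawn negative targets.
            targets = [(context, 1.0)]
            targets += [(rng.choice(vocab), 0.0) for _ in range(negatives)]
            for target, label in targets:
                if label == 0.0 and target == context:
                    continue  # skip negatives that collide with the positive
                u = w_out[target]
                score = sigmoid(sum(a * b for a, b in zip(v, u)))
                g = lr * (label - score)  # gradient of the log-likelihood
                for k in range(dim):
                    grad_v[k] += g * u[k]  # uses u[k] before it is updated
                    u[k] += g * v[k]
            for k in range(dim):
                v[k] += grad_v[k]
    return w_in


pairs = [("a.example", "b.example"), ("b.example", "a.example")]
vectors = train_skipgram(pairs, dim=4, epochs=5)
print(sorted(vectors))
```

The returned input vectors play the role of the distributed representation; a full-scale run would more likely rely on an established Word2Vec implementation than on a loop of this kind.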
In our approach, we set the time-interval parameters to Tα = 15.0 seconds and Tβ = 1.0 seconds. For the other parameters of the general Word2Vec model, the number of vector dimensions, number of iterations, initial learning rate, down-sampling rate, batch size, and negative sampling size are set to 200, 500, 0.01, 1e-5, 1024, and 5, respectively. Refer to the literature [6] for details of these parameters.
We collected a dataset from a recursive DNS server in our campus wireless network during a one-month period beginning on 1 March 2020. The dataset comprised a total of 26,266,740 queries, with a total size of approximately 8.5 GB. For the queries with A-records, we aggregated domains that appeared fewer than 10 times as “Other”. The resulting number of unique domains was 36,261. Note that the number of domains was large and the true distributed representation was unknown. Accordingly, on the premise that the accuracy of the distributed representation is strongly related to the validity of the similarities between domains that it encodes, we evaluated the distributed representation indirectly by validating each domain and its similar domains in the dataset.
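The aggregation of low-frequency domains can be sketched as follows. The function name `aggregate_rare_domains` and the example domains are hypothetical; only the threshold of 10 and the “Other” label come from the text.

```python
from collections import Counter


def aggregate_rare_domains(domains, min_count=10, placeholder="Other"):
    """Replace every domain seen fewer than min_count times with a placeholder."""
    counts = Counter(domains)
    return [d if counts[d] >= min_count else placeholder for d in domains]


queried = ["a.example"] * 12 + ["b.example"] * 3
print(Counter(aggregate_rare_domains(queried)))
```

Here “b.example” appears only three times, so all of its occurrences are mapped to “Other”, while “a.example” is kept as is.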
Figure 2 shows the number of domains similar to each domain in the dataset, where the horizontal and vertical axes indicate the number of similar domains and the cumulative rate, respectively. Two domains were regarded as similar if the cosine similarity between their vectors in the distributed representation exceeded 0.7. The results indicate that (A) 90% of domains in the dataset had fewer than 9 similar domains, (B) 1% of them had more than 109 similar domains, and (C) the maximum number of similar domains reached 255.
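The similarity count behind such a figure could be computed along the following lines. This is a small sketch under stated assumptions: the function names, the two-dimensional toy vectors, and the definition of the cumulative rate as the fraction of domains with at most a given number of similar domains are all illustrative choices, and the pairwise loop would need to be replaced by a vectorized computation for tens of thousands of domains.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def count_similar(vectors, threshold=0.7):
    """For each domain, count the other domains whose cosine similarity exceeds threshold."""
    names = list(vectors)
    counts = {name: 0 for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) > threshold:
                counts[a] += 1
                counts[b] += 1
    return counts


def cumulative_rate(counts):
    """Map each observed count c to the fraction of domains with at most c similar domains."""
    total = len(counts)
    values = list(counts.values())
    return {c: sum(1 for v in values if v <= c) / total for c in sorted(set(values))}


toy_vectors = {
    "a.example": [1.0, 0.0],
    "b.example": [0.9, 0.1],
    "c.example": [0.0, 1.0],
}
similar_counts = count_similar(toy_vectors)
print(similar_counts)
print(cumulative_rate(similar_counts))
```

In the toy data, “a.example” and “b.example” point in nearly the same direction and count each other as similar, whereas “c.example” has no similar domain.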
To indirectly evaluate our distributed representation, we assessed the relations between domains whose cosine similarity exceeded 0.7. We found that they could be categorized into the following cases. In the first case, domain di directly co-occurs with domain dj, and the two are strongly dependent in data exchange. In the second case, the domains that co-occur with di are similar to those that co-occur with dj, so di and dj are potential alternates in data exchange. The third case involves indirect similarities. Specifically, since the relation between domains di and dk falls within either of the above two cases and the same is true for domains dj and dk, domain di is indirectly similar to domain dj. The relations between domains in the first, second, and third cases account for 39%, 29%, and 25% of the total, respectively. Thus, our approach accurately embeds domains into a vector space while maintaining their relations. The remaining 7% are embedding errors caused by the low frequency of co-occurrence between domains. In conclusion, the results suggest that a concise and versatile representation that accurately captures the interrelations of numerous domain names is feasible.
We proposed a distributed representation approach for domain names and indirectly confirmed the accuracy of our distributed representation. The evaluation results indicated the feasibility of a concise and versatile representation that accurately captures the interrelations of numerous domain names. Our distributed representation is highly suitable as input for machine learning and deep learning models in the field of network security. Consequently, it can be applied to novel security systems based on such models, including the visualization of domain interrelations, detection of malware infections, inference of unknown domains, and enhancement of threat intelligence. Moreover, in security-specialized LLMs (Large Language Models), this representation is expected to facilitate the automation of tasks traditionally performed by network operators, such as the analysis of security logs and security reports, as it constitutes a fundamental technology for the semantic understanding of domain names. In the future, we plan to analyze the results and the influential parameters in greater depth.
This work was supported by JSPS KAKENHI Grant Number JP24K14932.