In cyber threat intelligence (CTI), we’re not just analysts; we’re digital detectives, piecing together clues from a vast and chaotic sea of data. The challenge is separating the signal from the noise to find the golden nuggets that will help protect your business. That’s where data collection methods come in!
Data collection is a fundamental pillar of any CTI strategy, and this guide serves as your treasure map to the most effective data collection methods.
We’ll explore core collection strategies, the tools you need, and practical advice to turn raw data into your most powerful weapon. Let’s arm ourselves with the knowledge to hunt smarter, not harder!
CTI Data Sources Recap
Before we talk about how to collect data, let’s quickly recap where it comes from.
CTI data sources are incredibly diverse, ranging from highly technical malware samples to human conversations in clandestine forums. A successful CTI program draws from a wide array of these sources to build a layered, comprehensive view of the threat landscape.

We can broadly categorize them into:
Internal vs. External
- Internal data, generated from within your own network, is your ground truth. It includes security logs from firewalls and EDR systems, vulnerability scan results, and past incident reports. This information is high-fidelity and directly relevant to your organization.
- External data provides a view of the global threat landscape, including adversary reports, industry breach trends, and malware analysis from peers within your industry or the broader security community.
Open vs. Closed
- Open-Source Intelligence (OSINT) encompasses a universe of freely available information, ranging from security blogs and news articles to public government reports. Its main advantage is accessibility.
- Closed sources require special access. This could mean proprietary data from commercial vendors that offer curated, low-noise threat feeds, or private communities like Information Sharing and Analysis Centers (ISACs), where membership is required to share and receive sensitive intelligence.
Technical vs. Human
- Technical intelligence consists of machine-readable data like Indicators of Compromise (IOCs), malware signatures, and adversary Tactics, Techniques, and Procedures (TTPs). This is the data that fuels our automated security tools.
- Human Intelligence (HUMINT) is information gathered from people. It provides the invaluable context that machines can’t—like understanding an adversary’s motivation from their chatter on a hidden forum or learning about a planned attack before it happens.
Understanding these foundational intelligence collection sources is the first step. Now, let’s explore the methods we use to tap into them.
What are Data Collection Methods?
If data sources are the what, then collection methods are the how. This is the engine room of the threat intelligence lifecycle, the phase where we roll up our sleeves and actively gather the raw information needed to fulfill our intelligence requirements.
It’s not a passive act of just looking at data; it’s the deliberate and systematic process of extracting it.
Think of it this way: a notorious ransomware group’s leak site is a source. Your choice of data collection method determines the kind of intelligence you’ll get, how much, and how fast. You could manually monitor discussions to understand the actors’ internal dynamics and TTPs – a slow but deep approach.
Alternatively, you could use automated web scraping to quickly pull down victim names and IOCs—a fast, scalable method that misses the nuance. A high-risk, high-reward method might even involve using a persona to engage with the actors directly.
A robust CTI program skillfully applies a variety of collection methods, tailoring them to its specific needs and understanding the trade-offs of each.
Let’s examine the primary data collection methods used for CTI.
Core Collection Methods for CTI
Think of these collection methods as different tools in your analytical toolkit, each with its own specific purpose and strength.
Just as a master craftsperson wouldn’t use a sledgehammer for delicate work, a savvy CTI analyst must know when to deploy large-scale automated collection and when to apply the precision of manual, human-driven investigation.

The following techniques form the foundation of a modern collection strategy, blending broad-net automation with deep-dive analysis to ensure no threat goes unnoticed.
Manual Collection
This is the classic, hands-on approach where nothing beats the human brain. Manual collection involves an analyst directly gathering and interpreting information. The approach has its roots in HUMINT but, in the cyber arena, predominantly falls under OSINT.
Manual collection is how we understand the subtleties that automated tools miss: the sarcasm in a forum post, the shifting alliances between threat groups, or the desperation in a hacker’s plea for help with their code.
Examples of manual collection in action include:
- Deep Reading: Immersing yourself in technical reports from security vendors or individual researchers. A single, well-written report on a new malware variant can yield not just IOCs, but also novel TTPs you can then hunt for in your own environment. Translating those TTPs into detection logic often requires the work of a human analyst.
- Community Engagement: Actively participating in trusted sharing communities (e.g., ISACs) like the NCSC’s Cyber Security Information Sharing Partnership (CiSP). Here, a casual mention of a suspicious IP from a trusted peer can be the early warning you need to avert a major incident.
- Observational Analysis: Patiently monitoring discussions on social media, security forums, and private messaging groups to understand adversary chatter, identify emerging threats, and gauge sentiment.
While time-intensive, this method is invaluable for gathering rich, qualitative intelligence and spotting threats at their inception, long before they become structured, widely distributed IOCs.
It is where most analysts start when fulfilling new intelligence requirements or ad-hoc Request for Information (RFI) tasks.
Google Fu
“Google Fu” is the art of wielding a search engine with the precision of a martial artist. It transforms a simple search bar into a powerful reconnaissance tool.
It is a specialized skill that falls under the broader category of manual collection and helps turn tedious Google searches into streamlined hunts for nuggets of information.
Google Fu is about knowing the syntax to ask the internet exactly the right question. For example, instead of a generic search, an analyst might use a query like `"index of" "passwords.txt" "last modified"` to find misconfigured servers that are publicly exposing sensitive files.

By mastering Boolean operators (`AND`, `OR`, `NOT`/`-`), exact phrase matching (`""`), and advanced operators like `site:`, `filetype:`, and `inurl:`, you can uncover specific malware analysis reports, leaked credentials, and technical discussions hidden deep within the web’s noise.
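To make these operators concrete, here are a few illustrative queries (the domains and keywords are placeholders to adapt to your own requirements):

```
site:pastebin.com "examplecorp.com"
filetype:pdf "threat report" "Emotet"
inurl:admin intitle:"login" site:examplecorp.com
"index of" "backup" -site:github.com
```

The first hunts for paste-site mentions of your domain, the second for vendor reporting on a specific malware family, the third for exposed admin panels, and the fourth for open directory listings while excluding a noisy source.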
This is a foundational OSINT skill because it provides speed and accuracy at the very start of an investigation, allowing you to validate a hypothesis or find initial leads quickly.
Third-Party Bulk Collection
Why spend a fortune building your own collection capability (be it manual analysts or an automated sensor network) when you can effectively rent one? That’s the principle behind third-party bulk collection.
This method involves accessing large, curated datasets from CTI vendors who have already done the heavy lifting of collecting, processing, and storing vast amounts of data. This provides you with a “needle-rich haystack”—a dataset with a significantly higher probability of containing relevant threat information.
The key benefit here is access to historical data for retroactive hunting over a massive dataset without the headache of maintaining that dataset.
For instance, if a C2 server is identified today, you can query your vendor’s passive DNS data to see if any of your systems communicated with it within the last six, twelve, or even eighteen months. This is invaluable for uncovering long-dormant breaches.

Popular CTI vendors with bulk collections you can tap into include:
- Shodan
- FOFA
- urlscan.io
- VirusTotal
- MalwareBazaar
- AlienVault OTX
- Commercial CTI vendors, such as Recorded Future, Anomali, and Group-IB
Many CTI vendors sell their products on the quality of their data. They gather vast amounts of data from their own security tools, the dark web, and other CTI vendors, and then incorporate it into their offerings. When picking a CTI product, the data it gives you access to often matters more than the product itself.
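To give a sense of how little code it takes to tap one of these pre-built datasets, here’s a minimal sketch using Shodan’s official Python library (the API key is a placeholder, and filtered searches require an API plan that supports them):

```python
import shodan  # pip install shodan

api = shodan.Shodan("YOUR_API_KEY")  # placeholder key

# Query Shodan's pre-collected internet scan data instead of scanning yourself.
results = api.search('product:"MySQL" country:GB')
print(f"Total matching hosts: {results['total']}")

for match in results["matches"][:5]:
    print(match["ip_str"], match.get("org"), match["location"].get("country_name"))
```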
API Access
Application Programming Interfaces (APIs) are the bridges that allow your security tools to talk to each other, forming the backbone of modern security automation.
Many security services and platforms, such as VirusTotal, Shodan, and MISP, offer APIs that enable you to query their data programmatically. This method moves you from manual, one-off checks to automated, at-scale enrichment.
Imagine a workflow: a SIEM alert for a suspicious file hash triggers a script. This script utilizes the VirusTotal API to instantly retrieve the malware family, detection ratios, and community comments associated with that hash. It then adds this rich context directly to the alert ticket, transforming a cryptic hash into an actionable intelligence packet for the incident response team.
You can either reactively collect threat intelligence through an API to be used in the moment, as shown in the example above, or proactively collect data and store it, as a CTI vendor would do.
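As a minimal sketch of that reactive workflow, the function below queries VirusTotal’s public v3 API for context on a file hash (the API key is a placeholder, and the fields pulled out are only a small sample of what the API returns):

```python
import requests

VT_API_KEY = "YOUR_API_KEY"  # placeholder; use your own key

def enrich_hash(file_hash: str) -> dict:
    """Pull basic context for a file hash from VirusTotal's v3 API."""
    url = f"https://www.virustotal.com/api/v3/files/{file_hash}"
    response = requests.get(url, headers={"x-apikey": VT_API_KEY}, timeout=30)
    response.raise_for_status()
    attrs = response.json()["data"]["attributes"]
    stats = attrs["last_analysis_stats"]
    return {
        "suggested_label": attrs.get("popular_threat_classification", {}).get("suggested_threat_label"),
        "detections": stats["malicious"],
        "engines": sum(stats.values()),
    }

# Example: enrich a hash from a SIEM alert before attaching it to the ticket.
print(enrich_hash("44d88612fea8a8f36de82e1278abb02f"))  # EICAR test file MD5
```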
This data collection method offers fast retrieval of technical data, but several practical questions constrain it:
- Do they offer an API to begin with?
- Do they have limits on the amount of data you can retrieve through the API?
- Do you have the necessary infrastructure to support data collection using an API or the technical skills to implement a solution?
- Is it cost-effective to store all the data you collect?
- Can you efficiently search through the collected data?
These factors often lead CTI teams to make use of third-party providers (i.e., bulk collection) to avoid the headache of storage and search, or to turn to web scraping to bypass API limitations.
Web Scraping
When a data source is valuable but doesn’t offer a convenient API, web scraping becomes your go-to collection method.
This is an automated technique for extracting specific information directly from a website’s HTML. You can build scrapers to systematically pull down IOCs from security blogs, monitor paste sites for mentions of your company’s domains, or track the prices of new zero-day exploits on dark web marketplaces.
The goal is not just to obtain the data, but to receive it in a structured format (such as a CSV or JSON file) that you can then analyze or integrate into other tools.
While powerful, scraping requires a careful approach. It can be technically fragile (a website redesign can break your scraper) and legally and ethically complex. Always respect a site’s robots.txt file and terms of service, and be aware that aggressive scraping can alert your target or even result in your IP address being blocked.
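As an illustration, here’s a minimal scraping sketch that pulls IPv4-style indicators out of a hypothetical blog post; a production scraper would add handling for defanged indicators (hxxp, [.]), error handling, and rate limiting:

```python
import re

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical target: a security blog post that lists IOCs in its body text.
URL = "https://example.com/threat-report"

html = requests.get(URL, timeout=30, headers={"User-Agent": "CTI-Research"}).text
soup = BeautifulSoup(html, "html.parser")

# Flatten the page to visible text and extract anything shaped like an IPv4 address.
text = soup.get_text(" ", strip=True)
iocs = sorted(set(re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", text)))

for ioc in iocs:
    print(ioc)
```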
In recent years, no-code automation tools like Octoparse have lowered the barrier to entry for many CTI teams. They allow you to build scrapers without extensive coding knowledge.
The collection methods you choose will depend heavily on your intelligence requirements and various other factors: your budget, the vendors or sharing groups you have access to, your team’s skill set and size, and your time constraints, among others.
With these factors in mind, let’s explore some tools and technologies you can use to implement these collection methods in the real world.
Collection Tools and Technologies
To effectively execute the methods we’ve discussed, you need the right tools in your digital arsenal. These technologies are force multipliers that enable a small team to have a significant impact, automating mundane tasks so analysts can focus on the complex ones.

Threat Intelligence Platforms (TIPs)
A TIP like MISP, OpenCTI, or Recorded Future is the central nervous system of your CTI function. It serves as a single source of truth, automating the ingestion of data from dozens of feeds via TAXII, email, or direct API connections. More importantly, it helps you correlate disparate data points.
A TIP can automatically link a malicious IP address from one feed to a malware hash from another, and then connect both to a campaign report from a third source, revealing the bigger picture of an attack.
They help you tie together API access, bulk collection, and even manual collection in one platform.
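For example, if you run the open-source MISP platform, its official PyMISP library lets you query that consolidated data programmatically. A minimal sketch, assuming a MISP instance at a placeholder URL:

```python
from pymisp import PyMISP  # pip install pymisp

# Placeholders: point these at your own MISP instance and auth key.
misp = PyMISP("https://misp.example.org", "YOUR_AUTH_KEY", ssl=True)

# Find every attribute, across all ingested feeds, matching a suspicious IP.
results = misp.search(controller="attributes", value="203.0.113.7")
for attribute in results.get("Attribute", []):
    print(attribute["event_id"], attribute["type"], attribute["value"])
```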
OSINT Tools
Beyond search engines, a suite of specialized OSINT tools is essential.
- Tools like Maltego excel at automated link analysis, visually mapping the relationships between domains, IPs, email addresses, and threat actors, turning a flat list of data into an intuitive graph.
- Services like Netlas.io act as a search engine for online assets, allowing you to find internet-connected devices, from exposed databases to vulnerable industrial control systems.
- Frameworks like Recon-ng provide a powerful, modular platform for automating common reconnaissance workflows, including Google Fu-style searches.
If your team is performing manual collection (e.g., using Google Fu) or wants to automate API queries, finding an appropriate OSINT tool will save you countless hours!
Analysis Tools
For the day-to-day work of wrangling data, nothing beats the “Cyber Swiss Army Knife,” CyberChef. This tool is indispensable for rapid, on-the-fly analysis.
It allows you to decode data snippets, deobfuscate malicious scripts, and convert data between formats with a simple drag-and-drop interface. Instead of writing a one-off script, you can build a complex “recipe” in seconds to peel back layers of encoding, making it a perfect tool for rapid prototyping and analysis.
Although this is not a tool for data collection, it helps you parse, format, and wrangle the data you collect so it can be analyzed effectively.
Another excellent tool for wrangling data is the Data Wrangler extension for Visual Studio Code (VSCode). It is integrated into VSCode and helps you view, analyze, and clean your data through a user-friendly interface with statistics and visualizations.
Web Scraping Tools
To perform web scraping, you can use a range of tools. No-code platforms like Octoparse offer a visual interface, allowing you to click and select the data you want to extract without writing any code.

For more control and customization, coding libraries are the way to go.
In Python, libraries like BeautifulSoup are excellent for parsing HTML and XML from a web page, while Scrapy is a more comprehensive framework for building powerful web crawlers (or “spiders”) that can navigate and scrape multiple pages on a site. For dynamic, JavaScript-heavy websites, browser automation tools like Selenium or Playwright can control a real web browser to extract data that simpler tools can’t see.
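As a quick taste of the browser-automation approach, here’s a minimal Playwright sketch that renders a hypothetical JavaScript-heavy page before extracting its HTML (run `pip install playwright` and then `playwright install chromium` first):

```python
from playwright.sync_api import sync_playwright

# Hypothetical JavaScript-heavy target that requests/BeautifulSoup can't render.
URL = "https://example.com/dynamic-threat-feed"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    html = page.content()  # the fully rendered DOM, after JavaScript has run
    browser.close()

print(f"Retrieved {len(html)} bytes of rendered HTML")
```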
Modern web scraping is not an easy task; many sites deploy anti-bot protections, so a framework or no-code tool with built-in functionality to work around them is often essential.
Coding and API Interaction Tools
The ability to write simple scripts is a superpower for CTI analysts. Python is the dominant language in the field for a reason.
With the requests library, you can easily send HTTP requests to interact with almost any API. Combined with a data analysis library like pandas, you can build powerful automated workflows.
For example, a Python script could read a list of suspicious domains from a CSV file, use an API to get WHOIS and passive DNS data for each, and then save the enriched data to a new spreadsheet, all in a matter of seconds.
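Here’s a minimal sketch of that exact workflow. The enrichment endpoint, its response fields, and the API key are hypothetical stand-ins for whichever WHOIS/passive DNS provider you use:

```python
import pandas as pd
import requests

API_URL = "https://api.example-vendor.com/v1/domain/{domain}"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"  # placeholder

# Read the watchlist: a CSV with a "domain" column.
domains = pd.read_csv("suspicious_domains.csv")["domain"]

records = []
for domain in domains:
    resp = requests.get(
        API_URL.format(domain=domain),
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    records.append({
        "domain": domain,
        "registrar": data.get("registrar"),    # hypothetical response field
        "first_seen": data.get("first_seen"),  # hypothetical response field
    })

# Save the enriched results to a new spreadsheet.
pd.DataFrame(records).to_csv("enriched_domains.csv", index=False)
```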
Tools like Jupyter Notebooks provide an interactive environment perfect for developing and documenting these scripts.
If you want to use API access as a collection method, you’d better know how to code (or vibe code with AI)!
Secure Browsing Environments
When your manual collection takes you into risky corners of the internet like the dark web, operational security (OPSEC) is paramount. Using a secure browsing environment is non-negotiable.
An analyst should never use their corporate machine to visit these sites. Instead, use a dedicated virtual machine running a privacy-focused OS like Tails or Whonix.
These systems route all traffic through the Tor network, masking your actual IP, and are “amnesiac,” meaning they leave no trace of your activity on the host machine after you shut them down. This protects both you and your organization from malware and from tipping off the adversaries you’re investigating.
You can even go a step further and build your own malware analysis environment for ultimate control over your data.
Now that you know the primary data collection methods and the tools to implement them, it’s time for some practical advice on collecting threat intelligence in the real world.
Practical Advice on Collection Methods
Having the right methods and tools is one thing, but wielding them with strategy is what separates a novice from an expert analyst. A truly effective collection operation is deliberate, disciplined, and defensible.
Here are the core principles that should guide your efforts.
- Start with a Plan: Never collect for the sake of collecting. An aimless collection strategy will drown you in data and yield very little actionable intelligence. Before considering a source, start with your intelligence requirements. Use a Collection Management Framework (CMF) to formally map your specific questions to potential sources and methods. This ensures that every collection effort is targeted and directly supports the needs of a decision-maker.
- Validate Your Sources: Not all data is created equal. The internet is awash with old, inaccurate, and outright false information. Acting on bad intelligence is often worse than having no intelligence at all; it can lead to wasted analyst hours, business disruptions from blocking legitimate traffic, and a loss of credibility. Therefore, rigorously assess every source for its reliability and accuracy.
- Prioritize and Automate: An analyst’s most valuable resource is their cognitive bandwidth. Don’t waste it on tasks a machine can do. The goal of automation is to handle high-volume, low-complexity tasks, freeing up your human experts for high-complexity analysis where they truly excel. Automate the collection of technical data, such as IOCs, from trusted feeds directly into your TIP or SIEM.
- Stay Legal and Ethical: A single misstep here can jeopardize an entire intelligence program. Ensure your collection methods comply with all relevant jurisdictions, particularly those related to privacy laws such as the GDPR. When in doubt, always consult your legal and compliance teams—they are your partners in building a defensible and sustainable collection program.
Conclusion
Effective data collection is a multi-faceted discipline that combines technical know-how with a curious and analytical mindset.
It’s not about a single tool or method, but about building a diverse and resilient collection strategy based on your intelligence requirements. By combining manual analysis with intelligent automation and selecting the most suitable collection methods for specific sources, you can create a comprehensive intelligence picture.
Remember, the goal is not just to gather data, but to gather the right data that empowers you to make timely and informed decisions to defend your organization. Now go forth and master the hunt!
Frequently Asked Questions
How is Data Collected in Cyber Threat Intelligence?
Data is collected through a blend of automated and manual methods. Key collection methods include Manual Collection (reading reports and monitoring forums), Open-Source Intelligence (OSINT) techniques such as “Google Fu,” accessing Third-Party Bulk Data from vendors, programmatic API Access to security tools, and automated Web Scraping from sites that don’t have APIs. The most effective approach utilizes multiple methods to cover various sources.
What Are The Sources of CTI Data?
CTI data comes from a wide variety of sources, which can be categorized as:
- Technical vs. Human: Machine-readable data (e.g., IOCs, TTPs) vs. intelligence gathered from people (e.g., forum chatter).
- Internal vs. External: Data from your own network (e.g., firewall logs) vs. data from the outside world (e.g., vendor reports).
- Open vs. Closed: Publicly available OSINT (e.g., blogs, social media) vs. private data requiring payment or membership (e.g., commercial feeds, ISACs).
What Are the Three Types of CTI?
Cyber threat intelligence is typically categorized by its audience and purpose:
- Strategic CTI: High-level intelligence for leadership (e.g., CISOs, boards) about the threat landscape, trends, and business risk to inform long-term security strategy.
- Tactical CTI: Detailed information about adversary Tactics, Techniques, and Procedures (TTPs) used by security teams to improve defensive controls and detection mechanisms.
- Operational CTI: Highly technical, time-sensitive information about specific, ongoing, or impending attacks, including Indicators of Compromise (IOCs), used by front-line defenders like SOC analysts and incident responders.
What is the CTI Process?
The CTI process follows a six-step cycle known as the “intelligence lifecycle”:
- Planning & Direction: Defining the goals and questions that need to be answered.
- Collection: Gathering the raw data to address those requirements (the focus of this article).
- Processing: Converting the raw data into a structured, usable format.
- Analysis: Making sense of the processed data to create finished intelligence.
- Dissemination: Delivering the finished intelligence to the people who need it.
- Feedback: Gathering input from stakeholders to improve the entire cycle.