Automating LinkedIn Job Searches with Puppeteer and RxJS

SHEMANTI PAL
Jun 9, 2025
14 min read

Tutorial on how to scrape job offers from LinkedIn using Puppeteer and RxJS

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1742474806170/99ad1b63-4151-4ee0-99f7-676dfcbfbd6a.png align="center")

Web scraping may seem like a simple task, but there are many challenges to overcome. In this blog, we will dive into how to scrape LinkedIn to extract job listings. To do this, we will use Puppeteer and RxJS. The goal is to achieve web scraping in a declarative, modular, and scalable manner.

What is Web Scraping?

Web scraping is an automated method of extracting valuable data from websites. It allows users to retrieve specific information—such as text, images, links, and structured content—without manually copying and pasting. This technique is widely used for various purposes, including market research, data analysis, job listings aggregation, and competitive intelligence.

By leveraging web scraping tools, developers can efficiently collect, process, and utilize web data, transforming unstructured online information into structured insights.

Puppeteer: A Powerful Web Scraping Tool

Puppeteer is a Node.js library that provides programmatic control over Chrome or Chromium, either headless or with a full UI. It allows developers to automate tasks such as navigating web pages, interacting with elements, and extracting data, making it an excellent choice for web scraping.

One of Puppeteer's biggest advantages is its ability to handle dynamic content. Unlike traditional scraping techniques that rely solely on fetching raw HTML, Puppeteer can execute JavaScript, ensuring that all elements—including those loaded asynchronously—are properly rendered before extraction. This makes it particularly useful for scraping websites with complex structures or content hidden behind interactive elements.
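As a quick illustration (the URL and the .job-card selector below are placeholders, not part of this project), Puppeteer lets us load a page, wait until the dynamically rendered content is actually present, and only then extract it:

ts
import puppeteer from 'puppeteer';

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    // Wait until network activity settles, so asynchronously loaded content is rendered
    await page.goto('https://example.com/jobs', { waitUntil: 'networkidle0' });
    // Extra safety: wait for a specific element produced by client-side JavaScript
    await page.waitForSelector('.job-card', { timeout: 5000 });
    const titles = await page.evaluate(() =>
        Array.from(document.querySelectorAll('.job-card h3')).map(el => el.textContent?.trim())
    );
    console.log(titles);
    await browser.close();
})();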

Understanding RxJS

RxJS is a powerful JavaScript library designed for reactive programming, making it easier to handle asynchronous data streams efficiently. In this project, we leverage RxJS due to its numerous advantages:

  • Streamlined Asynchronous Workflow – Enables a declarative approach to managing async operations.

  • Robust Error Handling – Provides built-in mechanisms to catch and handle errors gracefully.

  • Effortless Retry Logic – Allows automatic retries when scraping issues arise.

  • Flexible and Scalable Code – Simplifies adaptation as project complexity grows.

  • Extensive Operator Support – Offers a rich set of functions to process and manipulate data efficiently.
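To make these advantages concrete, here is a minimal, hypothetical sketch (not taken from the project, and assuming RxJS 6+ and Node 18+ for the global fetch) that wraps a promise-based request in a lazy Observable with retries and a graceful fallback:

ts
import { defer, from, of } from 'rxjs';
import { catchError, retry } from 'rxjs/operators';

// Hypothetical scrape step: fetch a page's HTML as a Promise
const fetchPage = (url: string): Promise<string> => fetch(url).then(res => res.text());

const page$ = defer(() => from(fetchPage('https://example.com'))).pipe(
    retry(3),           // automatic retry logic, up to 3 attempts
    catchError(err => {
        console.error('Scraping failed:', err);
        return of('');  // graceful fallback so the stream still completes
    })
);

page$.subscribe(html => console.log(`Fetched ${html.length} characters`));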

1. Puppeteer initialization

The code snippet below initializes a Puppeteer browser instance in a non-headless mode and subsequently creates a new web page. This represents the most fundamental and straightforward initialization process for Puppeteer:

src/index.ts

ts
import puppeteer from 'puppeteer';

(async () => {
    console.log('Launching Chrome...');
    const browser = await puppeteer.launch({
        headless: false,
        // devtools: true,
        // slowMo: 250, // slow down puppeteer script so that it's easier to follow visually
        args: [
            '--disable-gpu',
            '--disable-dev-shm-usage',
            '--disable-setuid-sandbox',
            '--no-first-run',
            '--no-sandbox',
            '--no-zygote',
            '--single-process',
        ],
    });

    const page = await browser.newPage();

    /**
     * 1. Go to linkedin jobs url
     * 2. Get the jobs
     * 3. Repeat step 1 with other search parameters
     */
})();

2. Accessing LinkedIn Job Listings and Extracting Data

This is the core section of our blog, where we delve into the process of navigating LinkedIn’s job listings, parsing the HTML content, and extracting job details in a structured JSON format. Our approach ensures that we retrieve relevant job information efficiently while handling potential roadblocks during the scraping process.

2.1. Construct the URL for navigating to LinkedIn job offers page

To access LinkedIn's job listings, we need to construct a URL using the function urlQueryPage:

src/linkedin.ts

ts
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
    `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`

In this case, I have already conducted the necessary research to identify a suitable URL for scraping. Our goal is to find a URL that can be dynamically parameterized based on our desired search criteria.

For this example, the key search parameters will include:

  • searchText – The job title or keyword.

  • pageNumber – The pagination index to navigate through job listings.

  • locationText (optional) – A specific location filter to refine search results.

By structuring the URL accordingly, we can efficiently retrieve job listings that match our specified criteria.

Example URLs:

  1. https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=0

  2. https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=React&location=Barcelona&start=0

  3. https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&start=0
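For instance, calling urlQueryPage directly (a hypothetical usage example, with the page number set to 2) produces a URL like the second example above, but pointing at the third page of results:

ts
const url = urlQueryPage({ searchText: 'React', locationText: 'Barcelona', pageNumber: 2 });
// => https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=React&start=50&location=Barcelona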

2.2. Navigate to the URL and extract the job offers

With our target URL identified, we can proceed with the two primary actions required:

  1. Navigating to the Job Listings URL: This step involves directing our web scraping tool to the URL where the job listings are hosted.

  2. Extracting the job offers data and converting it to JSON: Once we're on the job listings page, we'll use web scraping techniques to extract the job data and return it in JSON format.

src/linkedin.ts

ts
export interface ScraperSearchParams {
    searchText: string;
    locationText: string;
    pageNumber: number;
}

/** main function */
export function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
    return defer(() => fromPromise(navigateToJobsPage(page, searchParams)))
        .pipe(switchMap(() => getJobsFromLinkedinPage(page)));
}

/* Utility functions */
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
    `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`

function navigateToJobsPage(page: Page, searchParams: ScraperSearchParams): Promise<Response | null> {
    return page.goto(urlQueryPage(searchParams), { waitUntil: 'networkidle0' });
}

export const stacks = ['angularjs', 'kubernetes', 'javascript', 'jenkins', 'html', /* ... */];

export function getJobsFromLinkedinPage(page: Page): Observable<JobInterface[]> {
    return defer(() => fromPromise(page.evaluate((pageEvalData) => {
        const collection: HTMLCollection = document.body.children;
        const results: JobInterface[] = [];
        for (let i = 0; i < collection.length; i++) {
            try {
                const item = collection.item(i)!;
                const title = item.getElementsByClassName('base-search-card__title')[0].textContent!.trim();
                const imgSrc = item.getElementsByTagName('img')[0].getAttribute('data-delayed-url') || '';
                const remoteOk: boolean = !!title.match(/remote|No office location/gi);

                const url = (
                    (item.getElementsByClassName('base-card__full-link')[0] as HTMLLinkElement)
                    || (item.getElementsByClassName('base-search-card--link')[0] as HTMLLinkElement)
                ).href;

                const companyNameAndLinkContainer = item.getElementsByClassName('base-search-card__subtitle')[0];
                const companyUrl: string | undefined = companyNameAndLinkContainer?.getElementsByTagName('a')[0]?.href;
                const companyName = companyNameAndLinkContainer.textContent!.trim();
                const companyLocation = item.getElementsByClassName('job-search-card__location')[0].textContent!.trim();

                const toDate = (dateString: string) => {
                    const [year, month, day] = dateString.split('-');
                    return new Date(parseFloat(year), parseFloat(month) - 1, parseFloat(day));
                }

                const dateTime = (
                    item.getElementsByClassName('job-search-card__listdate')[0]
                    || item.getElementsByClassName('job-search-card__listdate--new')[0] // less than a day. TODO: Improve precision on this case.
                ).getAttribute('datetime');
                const postedDate = toDate(dateTime as string).toISOString();

                /**
                 * Calculate minimum and maximum salary
                 *
                 * Salary HTML example to parse:
                 * <span class="job-result-card__salary-info">$65,000.00 - $90,000.00</span>
                 */
                let currency: SalaryCurrency = '';
                let salaryMin = -1;
                let salaryMax = -1;

                const salaryCurrencyMap: any = {
                    ['€']: 'EUR',
                    ['$']: 'USD',
                    ['£']: 'GBP',
                };

                const salaryInfoElem = item.getElementsByClassName('job-search-card__salary-info')[0];
                if (salaryInfoElem) {
                    const salaryInfo: string = salaryInfoElem.textContent!.trim();
                    if (salaryInfo.startsWith('€') || salaryInfo.startsWith('$') || salaryInfo.startsWith('£')) {
                        const coinSymbol = salaryInfo.charAt(0);
                        currency = salaryCurrencyMap[coinSymbol] || coinSymbol;
                    }

                    const matches = salaryInfo.match(/([0-9]|,|\.)+/g);
                    if (matches && matches[0]) {
                        // values are in USA format, so we need to remove ALL the commas
                        salaryMin = parseFloat(matches[0].replace(/,/g, ''));
                    }
                    if (matches && matches[1]) {
                        // values are in USA format, so we need to remove ALL the commas
                        salaryMax = parseFloat(matches[1].replace(/,/g, ''));
                    }
                }

                // Calculate tags
                let stackRequired: string[] = [];
                title.split(' ').concat(url.split('-')).forEach(word => {
                    if (!!word) {
                        const wordLowerCase = word.toLowerCase();
                        if (pageEvalData.stacks.includes(wordLowerCase)) {
                            stackRequired.push(wordLowerCase);
                        }
                    }
                });
                // Define uniq here. Remember that page.evaluate executes inside the browser, so we cannot easily import functions from other contexts.
                const uniq = (_array) => _array.filter((item, pos) => _array.indexOf(item) == pos);
                stackRequired = uniq(stackRequired);

                const result: JobInterface = {
                    id: item!.children[0].getAttribute('data-entity-urn') as string,
                    city: companyLocation,
                    url: url,
                    companyUrl: companyUrl || '',
                    img: imgSrc,
                    date: new Date().toISOString(),
                    postedDate: postedDate,
                    title: title,
                    company: companyName,
                    location: companyLocation,
                    salaryCurrency: currency,
                    salaryMax: salaryMax,
                    salaryMin: salaryMin,
                    countryCode: '',
                    countryText: '',
                    descriptionHtml: '',
                    remoteOk: remoteOk,
                    stackRequired: stackRequired
                };
                console.log('result', result);

                results.push(result);
            } catch (e) {
                console.error(`Something went wrong retrieving linkedin page item: ${i} on url: ${window.location}`, e.stack);
            }
        }
        return results;
    }, {stacks})) as Observable<JobInterface[]>);
}

The code above extracts the information for every job on the page. It isn't the prettiest code, but it gets the job done: parsing this kind of HTML inevitably involves many fallbacks and checks.

In a standard programming context, breaking code into smaller, isolated functions improves readability and maintainability. However, when working with page.evaluate in Puppeteer, we face certain limitations. Since this code executes within the Puppeteer (Chrome) instance rather than our Node.js environment, all logic must be self-contained within the page.evaluate call.

The only exception is simple variables (such as stacks in our case), which can be passed as arguments to page.evaluate. However, these variables must not contain functions or complex objects that cannot be serialized, as Puppeteer does not support passing non-serializable data between Node.js and the browser context.
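As a small sketch of that boundary (assuming an open Puppeteer page inside an async function; the keywords array is purely illustrative), plain objects and arrays cross over fine, while functions or class instances would fail to serialize:

ts
const keywords = ['angular', 'react']; // a plain array: serializable, so it can be passed in

const matchingWords = await page.evaluate((data) => {
    // This callback runs inside the browser; `data` was cloned, not shared by reference.
    return document.title
        .toLowerCase()
        .split(' ')
        .filter(word => data.keywords.includes(word));
}, { keywords });

// Passing a function or a class instance as the argument would throw a serialization error.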

In this case, the most challenging part of scraping is extracting salary information, as it requires converting a text format like "$65,000.00 - $90,000.00" into separate salaryMin and salaryMax values.
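To see that conversion in isolation, here is a hypothetical standalone version of the same parsing logic (in the real scraper it lives inline inside page.evaluate):

ts
function parseSalary(salaryInfo: string): { currency: string; salaryMin: number; salaryMax: number } {
    const salaryCurrencyMap: Record<string, string> = { '€': 'EUR', '$': 'USD', '£': 'GBP' };
    const currency = salaryCurrencyMap[salaryInfo.charAt(0)] || '';
    // Values use US number formatting, so strip the thousands separators before parsing
    const matches = salaryInfo.match(/([0-9]|,|\.)+/g) || [];
    const salaryMin = matches[0] ? parseFloat(matches[0].replace(/,/g, '')) : -1;
    const salaryMax = matches[1] ? parseFloat(matches[1].replace(/,/g, '')) : -1;
    return { currency, salaryMin, salaryMax };
}

// parseSalary('$65,000.00 - $90,000.00') => { currency: 'USD', salaryMin: 65000, salaryMax: 90000 }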

To handle potential issues gracefully, we have encapsulated the entire code within a try/catch block. While we currently log errors to the console, it is highly recommended to implement a mechanism for storing error logs on disk. This is especially important because websites frequently update their structure, requiring regular adjustments to the HTML parsing logic.

Finally, we consistently use the defer and fromPromise operators to convert Promises into Observables, ensuring a reactive and efficient data flow throughout the scraping process.

typescript
defer(() => fromPromise(myPromise()));

This approach is a recommended best practice that works reliably in all scenarios. Promises are eager, whereas Observables are lazy and only start doing work when someone subscribes to them. The defer operator allows us to make a Promise lazy. See this link for more information about it.
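A small sketch (assuming RxJS 6+, where fromPromise is available simply as from) makes the eager/lazy difference visible:

ts
import { defer, from } from 'rxjs';

const myPromise = () => {
    console.log('work starts now'); // executes as soon as the function is called
    return Promise.resolve(42);
};

const eager = myPromise();                    // work already started, even with no consumer

const lazy$ = defer(() => from(myPromise())); // nothing happens yet
lazy$.subscribe(value => console.log(value)); // work only starts on subscription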

3. Add an asynchronous loop to iterate through all pages

In the previous step, we learned how to obtain all job offers data from a LinkedIn page. Now, we want to use that code as many times as possible to gather as much data as we can. To achieve this, we first need to iterate through all available pages:

src/linkedin.ts

ts
function getJobsFromAllPages(page: Page, initSearchParams: ScraperSearchParams): Observable<ScraperResult> {
    const getJobs$ = (searchParams: ScraperSearchParams) => goToLinkedinJobsPageAndExtractJobs(page, searchParams).pipe(
        map((jobs): ScraperResult => ({jobs, searchParams} as ScraperResult)),
        catchError(error => {
            console.error(error);
            return of({jobs: [], searchParams: searchParams});
        })
    );

    return getJobs$(initSearchParams).pipe(
        expand(({jobs, searchParams}) => {
            console.log(`Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`);
            if (jobs.length === 0) {
                return EMPTY;
            } else {
                return getJobs$({...searchParams, pageNumber: searchParams.pageNumber + 1});
            }
        })
    );
}

The code above increments the page number until we reach a page where there are no jobs. To perform this loop in RxJS, we use the operator expand, which recursively projects each source value to an Observable that is merged into the output Observable. Its functionality is well explained here.

In RxJS, we cannot write a plain for loop as we would with async/await. Instead, we need another technique such as the expand operator or a recursive call. While this might initially appear to be a limitation, in an asynchronous context it turns out to be an advantage in many situations.

So, what would the equivalent code using Promises look like? Here's an example:

typescript
export async function getJobsFromAllPages(
    page: Page,
    searchParams: ScraperSearchParams
): Promise<ScraperResult> {
    const results: ScraperResult = { jobs: [], searchParams };

    try {
        while (true) {
            const jobs = await getJobsFromLinkedinPage(page, searchParams);
            console.log(
                `Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`
            );

            results.jobs.push(...jobs);

            if (jobs.length === 0) {
                break;
            }

            searchParams.pageNumber++;
        }
    } catch (error) {
        console.error('Error:', error);
        results.jobs = []; // Clear the jobs in case of an error.
    }

    return results;
}

This code is nearly equivalent to the Observable-based version, with one critical difference: it only resolves once all pages have been processed. In contrast, the Observable implementation emits after each page. Creating a stream is crucial here because we want to handle the jobs as soon as they become available.

Of course, we could add our own handling logic right after the line:

typescript
const jobs = await getJobsFromLinkedinPage(page, searchParams);

/* Handle the jobs here */

...but this would unnecessarily couple our scraping code with the part that handles the jobs data. Handling the jobs data may involve some transformations, API calls, and finally, saving the data into a database.

In this example, we clearly see one of the many benefits Observables offer over Promises.
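For example, keeping the stream means we can persist each page of results as soon as it arrives. In the sketch below, saveJobsToDatabase is a hypothetical function standing in for whatever transformation, API call, or storage step comes next:

ts
getJobsFromAllPages(page, { searchText: 'Angular', locationText: 'Barcelona', pageNumber: 0 })
    .pipe(
        concatMap(({ jobs, searchParams }) =>
            // Persist each page of jobs as soon as it is emitted
            defer(() => fromPromise(saveJobsToDatabase(jobs))).pipe(
                tap(() => console.log(`Saved ${jobs.length} jobs from page ${searchParams.pageNumber}`))
            )
        )
    )
    .subscribe({ complete: () => console.log('All pages processed') });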

4. Implementing an Asynchronous Loop for Multiple Search Parameters

Now that we've established how to iterate through multiple pages for a given search query, it's time to take the next step: expanding our search across multiple search parameters.

To achieve this, we'll introduce an additional asynchronous loop that cycles through various search criteria, ensuring comprehensive data extraction.

The first step is defining a structured data format to store these search parameters. We'll call this list searchParamsList, which will hold different combinations of keywords, locations, or other relevant filters:

src/data.ts

ts
const searchParamsList: { searchText: string; locationText: string }[] = [
    { searchText: 'Angular', locationText: 'Barcelona' },
    { searchText: 'Angular', locationText: 'Madrid' },
    // ...
    { searchText: 'React', locationText: 'Barcelona' },
    { searchText: 'React', locationText: 'Madrid' },
    // ...
];

To iterate through the searchParamsList array, we essentially need to convert it from an Array to an Observable using the fromArray operator. We then use the concatMap operator to sequentially process each searchText and locationText pair. The power of RxJS here is that, if we ever want to switch from sequential to parallel processing, we just need to swap concatMap for mergeMap (a sketch of that variant follows the code below). In this case it is not recommended, because we would exceed LinkedIn's rate limits, but it's worth considering in other scenarios.

src/linkedin.ts

ts
/**
 * Creates a new page and scrapes LinkedIn job offers data for each pair of searchText and locationText, recursively retrieving data until there are no more pages.
 * @param browser A Puppeteer instance
 * @returns An Observable that emits scraped job offers data as ScraperResult
 */
export function getJobsFromLinkedin(browser: Browser): Observable<ScraperResult> {
    // Create a new page
    const createPage = defer(() => fromPromise(browser.newPage()));

    // Iterate through search parameters and scrape jobs
    const scrapeJobs = (page: Page): Observable<ScraperResult> =>
        fromArray(searchParamsList).pipe(
            concatMap(({ searchText, locationText }) =>
                getJobsFromAllPages(page, { searchText, locationText, pageNumber: 0 })
            )
        );

    // Compose sequentially previous steps
    return createPage.pipe(switchMap(page => scrapeJobs(page)));
}

This code will loop through different search parameters, retrieving job listings for each combination of technology and location efficiently.
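As a hypothetical variant (not part of the article's code), switching to mergeMap with a concurrency limit would process several searches in parallel. Note that each concurrent search needs its own page, since a single page can only load one URL at a time:

ts
const scrapeJobsInParallel = (browser: Browser): Observable<ScraperResult> =>
    fromArray(searchParamsList).pipe(
        mergeMap(
            ({ searchText, locationText }) =>
                // Open a dedicated page for each concurrent search
                defer(() => fromPromise(browser.newPage())).pipe(
                    switchMap(page => getJobsFromAllPages(page, { searchText, locationText, pageNumber: 0 }))
                ),
            2 // at most two searches in flight at once
        )
    );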

🎉 Congratulations! You now have the skills to scrape LinkedIn job postings! 🎉

However, like many other platforms, LinkedIn employs anti-scraping measures to prevent automated data extraction. Let’s explore how to handle these challenges 👇

Common Errors When Scraping LinkedIn

Running the code as it is will quickly lead to various errors, making it challenging to scrape a substantial amount of data. The two most common issues are:

1. 429 Status Code (Too Many Requests)

This error occurs when we send too many requests in a short period. To avoid being blocked, we need to slow down the request rate and introduce random delays until the error subsides.
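One simple mitigation (a hypothetical helper, not part of the article's code) is to insert a random pause before each navigation, for example between 2 and 5 seconds:

ts
import { timer } from 'rxjs';

// Emits once after a random delay between minMs and maxMs
const randomDelay = (minMs = 2000, maxMs = 5000) =>
    timer(minMs + Math.floor(Math.random() * (maxMs - minMs)));

// Usage idea: delay each emission before requesting the next page
// .pipe(concatMap(result => randomDelay().pipe(map(() => result))))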

2. LinkedIn Authwall

Occasionally, instead of the job listings page, LinkedIn may redirect us to an authentication wall. When this happens, the best approach is to pause requests for a while before trying again.

Handling 429 Errors & LinkedIn Authwall

To tackle these issues, we rework goToLinkedinJobsPageAndExtractJobs so that navigation, error detection, and retries are handled by small helper functions, while getJobsFromLinkedinPage keeps the HTML scraping logic. The updated code structure looks like this:

src/linkedin.ts

ts
const AUTHWALL_PATH = 'linkedin.com/authwall';
const STATUS_TOO_MANY_REQUESTS = 429;
const JOB_SEARCH_SELECTOR = '.job-search-card';

function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
    return defer(() => fromPromise(page.setExtraHTTPHeaders({'accept-language': 'en-US,en;q=0.9'})))
        .pipe(
            switchMap(() => navigateToLinkedinJobsPage(page, searchParams)),
            tap(response => checkResponseStatus(response)),
            switchMap(() => throwErrorIfAuthwall(page)),
            switchMap(() => waitForJobSearchCard(page)),
            switchMap(() => getJobsFromLinkedinPage(page)),
            retryWhen(retryStrategyByCondition({
                maxRetryAttempts: 4,
                retryConditionFn: error => error.retry === true
            })),
            map(jobs => Array.isArray(jobs) ? jobs : []),
            take(1)
        );
}

/**
 * Navigate to the LinkedIn search page, using the provided search parameters.
 */
function navigateToLinkedinJobsPage(page: Page, searchParams: ScraperSearchParams) {
    return defer(() => fromPromise(page.goto(urlQueryPage(searchParams), {waitUntil: 'networkidle0'})));
}

/**
 * Check the HTTP response status and throw an error if too many requests have been made.
 */
function checkResponseStatus(response: any) {
    const status = response?.status();
    if (status === STATUS_TOO_MANY_REQUESTS) {
        throw {message: 'Status 429 (Too many requests)', retry: true, status: STATUS_TOO_MANY_REQUESTS};
    }
}

/**
 * Check if the current page is an authwall and throw an error if it is.
 */
function throwErrorIfAuthwall(page: Page) {
    return getPageLocationOperator(page).pipe(tap(locationHref => {
        if (locationHref.includes(AUTHWALL_PATH)) {
            console.error('Authwall error');
            throw {message: `Linkedin authwall! locationHref: ${locationHref}`, retry: true};
        }
    }));
}

/**
 * Wait for the job search card to be visible on the page, and handle timeouts or authwalls.
 */
function waitForJobSearchCard(page: Page) {
    return defer(() => fromPromise(page.waitForSelector(JOB_SEARCH_SELECTOR, {visible: true, timeout: 5000}))).pipe(
        catchError(error => throwErrorIfAuthwall(page).pipe(tap(() => {throw error})))
    );
}

In this code, we address the previously mentioned errors: the 429 response and the authwall. Overcoming these errors is essential for successful web scraping on LinkedIn.

To handle the errors, the code employs a custom retry strategy implemented by the retryStrategyByCondition function:

src/scraper.utils.ts

ts
export const retryStrategyByCondition = ({maxRetryAttempts = 3, scalingDuration = 1000, retryConditionFn = (error) => true}: {
    maxRetryAttempts?: number,
    scalingDuration?: number,
    retryConditionFn?: (error) => boolean
} = {}) => (attempts: Observable<any>) => {
    return attempts.pipe(
        mergeMap((error, i) => {
            const retryAttempt = i + 1;
            if (
                retryAttempt > maxRetryAttempts ||
                !retryConditionFn(error)
            ) {
                return throwError(error);
            }
            console.log(
                `Attempt ${retryAttempt}: retrying in ${retryAttempt * scalingDuration}ms`
            );
            // retry after 1s, 2s, etc...
            return timer(retryAttempt * scalingDuration);
        }),
        finalize(() => console.log('retryStrategyOnlySpecificErrors - finalized'))
    );
};

This strategy essentially increases the wait time between retries after each failure. That way, we give LinkedIn enough time before it allows us to make requests again.

⚠️ Important Note: LinkedIn has strict anti-scraping measures, and excessive requests from a single IP address can lead to IP blacklisting. Simply increasing wait times between requests may not be a foolproof solution. To minimize the risk of detection and reduce errors, it's highly advisable to rotate IP addresses periodically. This can be achieved by using proxy services or VPNs, ensuring a more sustainable and uninterrupted scraping process.
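As a rough sketch of that idea (the proxy address below is a placeholder, and authentication is only needed if your provider requires it), Chrome's --proxy-server argument routes all of the browser's traffic through a proxy:

ts
// Inside an async function:
const browser = await puppeteer.launch({
    headless: false,
    args: [
        '--no-sandbox',
        '--proxy-server=http://my-rotating-proxy.example.com:8080', // placeholder address
    ],
});

const page = await browser.newPage();
// Only needed when the proxy requires credentials
await page.authenticate({ username: 'proxy-user', password: 'proxy-pass' });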

Final Words

Web scraping can sometimes violate a website's terms of service, so it's crucial to review and respect the robots.txt file and Terms of Service before scraping any site. In this case, the provided code is intended strictly for educational and hobby purposes. LinkedIn specifically prohibits any data extraction from its website; you can read more here.

I encourage using web scraping as a learning tool, but always be mindful of ethical practices. Avoid excessive requests, respect the website's resources, and use the extracted data responsibly.

You can find the complete, updated code in this repository; don't hesitate to give it a star if it helped! 🙏⭐
